It's been a while since we last wrote about some of the things we've done to scale our Elixir systems to meet the needs of our community. Since then, not only has the overall size of Discord's user base grown massively, but our largest communities have individually grown even faster. With that growth, those servers started slowing down and creeping ever closer to their throughput limits. As that’s happened, we’ve continued to find many improvements to keep making them faster and pushing the limits out further. In this post we’ll talk about some of the ways we’ve scaled individual Discord servers from tens of thousands of concurrent users to approaching two million concurrent users in the past few years.

Elixir as a fanout system

Whenever something happens on Discord, such as a message being sent, or someone joining a voice channel, we need to update the UI in the client of everyone who’s online that is in the same server (also known as a “guild”). We use a single Elixir process per guild as a central routing point for everything happening on that server, and another process (a “session”) for each connected user’s client. The guild process keeps track of the sessions for users who are members of that guild, and is responsible for fanning out actions to those sessions. Once the sessions receive those updates, they’ll forward them over a websocket connection to the client.
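As a minimal sketch (not Discord's actual implementation), a guild process along these lines might track connected session pids and fan events out to them:

```elixir
# Illustrative only: a guild process that registers session pids and
# fans out every event to all of them.
defmodule Guild do
  use GenServer

  def start_link(guild_id) do
    GenServer.start_link(__MODULE__, guild_id)
  end

  @impl true
  def init(guild_id), do: {:ok, %{guild_id: guild_id, sessions: MapSet.new()}}

  @impl true
  def handle_call({:connect, session_pid}, _from, state) do
    {:reply, :ok, %{state | sessions: MapSet.put(state.sessions, session_pid)}}
  end

  @impl true
  def handle_cast({:dispatch, event}, state) do
    # Fan the event out to every connected session; each session then
    # forwards it over its websocket to the client.
    Enum.each(state.sessions, &send(&1, {:event, event}))
    {:noreply, state}
  end
end
```

In the real system each session process owns the websocket and forwards the `{:event, event}` messages it receives to the client.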

Some actions go to everyone in the server, while others require permission checks, which means knowing each user's roles as well as information about the roles and channels on that server. Why not use multiple guild processes per server? Because some operations need a full view of the server: showing the list of people in a voice channel, or maintaining channel member lists, for instance.

How a message flows from the sender (left) through Discord’s systems to other users and bots in the same server. This post focuses on the Guild Elixir process in the middle. Dotted lines indicate that some users in the server may not have permissions to see the message, so should not receive it.

The nature of fanning out messages means the odds are against us: the amount of activity in a guild is proportional to the number of people in that server. On top of that, the amount of work needed to fan out a single message is proportional to the number of people online in that server too. That means that the amount of work needed to handle a Discord server grows quadratically with the size of the server: If a server has 1000 people online, and they all were to say “I love jello” once, that’s 1 million notifications. The same thing with 10,000 people is 100 million notifications. With 100,000 people that’s 10 billion notifications to deliver!
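The arithmetic is easy to sanity-check: with n people online each sending one message, the server delivers n × n notifications.

```elixir
# Total notifications when n online users each send one message
# that fans out to all n of them.
notifications = fn n -> n * n end

IO.puts(notifications.(1_000))     # 1000000
IO.puts(notifications.(100_000))   # 10000000000
```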

In addition to overall throughput concerns, some operations end up getting slower the larger a server is. Ensuring that we can keep almost all operations quick is important to the server feeling like it's responsive: when a message is sent, others should see it right away; when someone joins a voice channel, they should be able to start participating right away. Taking multiple seconds to process an expensive operation hurts that experience.

Knowing these challenges were standing in our way, how did we get to supporting Midjourney, which has over 10 million members, of whom over 1 million are online (and thus have a session connected to the guild) at any given time? First, it was a matter of being able to understand how our systems were performing. Once we had that data, it was a matter of finding opportunities to improve both throughput and responsiveness.

Understanding how our systems perform

Before we could make our systems better, we needed a good sense of what they were spending their time and RAM on.

Wall time analysis

The simplest way of understanding what an Elixir process is doing is to take a stack trace with Process.info(pid, :current_stacktrace). If a process is stuck waiting on something, this is a great way to find out what. If it's running code, the picture is less clear-cut, but still useful. Grabbing a thousand stack traces, sleeping 100 milliseconds between each one, takes a couple of minutes and gives you a pretty good idea of what that process is doing. This can be done for any process in an Elixir system to get insights into it, and requires very little effort.
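A rough sketch of that sampling loop (the module name is ours, not a real library) might look like:

```elixir
# Wall-time sampling sketch: grab the target process's current stacktrace
# every `interval_ms`, tally the top frame, and report the hottest frames.
defmodule StackSampler do
  def sample(pid, n \\ 1_000, interval_ms \\ 100) do
    Enum.reduce(1..n, %{}, fn _, counts ->
      Process.sleep(interval_ms)

      case Process.info(pid, :current_stacktrace) do
        {:current_stacktrace, [{mod, fun, arity, _loc} | _rest]} ->
          Map.update(counts, {mod, fun, arity}, 1, &(&1 + 1))

        _dead_or_unavailable ->
          counts
      end
    end)
    # Most frequently sampled frames first.
    |> Enum.sort_by(fn {_mfa, count} -> -count end)
  end
end
```

Frames that dominate the sample are where the process spends its wall time, whether running or blocked.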

However, sometimes richer information is needed. For this, we instrumented the event processing loop of the guild processes to record how many of each type of message we receive, as well as the max/min/average/total time to process them. This meant we could ignore all of the operations which accounted for < 1% of the total time unless they were extremely bursty. In addition to ruling out cheap operations, it helped highlight the operations which are most expensive. From there it was a matter of looking at code and stack traces to understand why those operations were slow, and what could be done about it. In the graph below, the impact of a change targeting how we handle presence updates can be clearly seen, giving us confidence that our theories were correct and helping guide us toward the next area of concern.

A line chart labeled "Guild % of time busy by message type."
Breaking out time spent by type of operation allows seeing the impact of specific changes more clearly. Here is a significant performance win on processing of presence updates (light blue line), preceded a couple of days earlier by a smaller performance win on processing normal messages (yellow line).

Process heap memory analysis

In addition to understanding where guilds are spending their time, understanding how they're using memory is important, too. For us, memory not only affects the type of machines we should use, but also the performance of the garbage collector. For this, :erts_debug.size can help. However, that function is extremely slow for large objects. To make its cost reasonable, we wrote a helper library which samples large (non-struct) maps and lists to produce estimated memory usage rather than walking every single element.
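A sketch of the sampling idea for maps, assuming Erlang's :erts_debug.size/1 (which returns a size in words): measure a random sample of entries and extrapolate instead of walking everything.

```elixir
# Sampled size estimation sketch. For large maps, measure `@sample` random
# entries with :erts_debug.size/1 and scale up; small terms are measured
# exactly. Results are converted from words to bytes.
defmodule SizeEstimator do
  @sample 100
  @word_size :erlang.system_info(:wordsize)

  def estimate_bytes(map) when is_map(map) and map_size(map) > @sample do
    sampled_words =
      map
      |> Enum.take_random(@sample)
      |> Enum.map(fn {k, v} -> :erts_debug.size({k, v}) end)
      |> Enum.sum()

    div(sampled_words * map_size(map), @sample) * @word_size
  end

  def estimate_bytes(term), do: :erts_debug.size(term) * @word_size
end
```

The estimate is approximate (shared sub-terms are counted repeatedly, as with :erts_debug.size itself), but it is orders of magnitude cheaper on multi-gigabyte member maps.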

This was helpful not only in understanding GC performance, but in finding which fields were worth focusing on for optimizing, and which were ultimately irrelevant.

Once we had a good sense of where guild processes were spending their time, we could start coming up with strategies for keeping the guild process away from being 100% busy. In some cases it was just a matter of replacing an inefficient implementation with a more efficient one. However, that could only get us so far. We needed more radical changes.

Passive sessions - avoiding unnecessary work

One of the best ways of pushing a throughput bottleneck out is to just do less work. One way we did this was by considering the needs of our client application. In our original topology, every user receives every action which they're allowed to see in all of their guilds. However, some users are in many guilds, and may not even click in to check out what's happening in some of those. What if we didn't send them everything until they clicked in? We wouldn't need to bother checking their permissions for every message and would send their client much less data as a result. We called this a "passive" connection, and kept it in a separate list from the "active" connections that should receive all of the data. This ended up working even better than we expected: around 90% of user-guild connections in large servers were passive, making the fanout work 90% less expensive! This got us some breathing room, but naturally, as communities kept growing, it wouldn't be enough (because fanout work grows quadratically, a 10x reduction in work only buys us a ~3x win in maximum community size).
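A simplified sketch of the active/passive split (names are illustrative): only active sessions receive full fanout, and a passive session is promoted when its user opens the guild.

```elixir
# Sessions are tracked in two sets. New connections start passive; a
# session becomes active when the user clicks into the guild.
defmodule Guild.Sessions do
  defstruct active: MapSet.new(), passive: MapSet.new()

  def connect(s, pid), do: %{s | passive: MapSet.put(s.passive, pid)}

  # Called when the user opens the guild in their client.
  def activate(s, pid) do
    %{s | passive: MapSet.delete(s.passive, pid), active: MapSet.put(s.active, pid)}
  end

  def fanout(s, event) do
    # Permission checks and dispatch happen only for active sessions —
    # roughly 10% of connections in large servers.
    Enum.each(s.active, &send(&1, {:event, event}))
  end
end
```

When a session flips from passive to active, the client fetches the current guild state once and then receives updates as usual.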

Relays - splitting fanout across multiple machines

One of the standard techniques for extending a single-core throughput limit is to split the work across multiple threads (or processes, to use the Elixir term). This idea led us to build a system called "relays" which sits between guilds and the users' sessions. By splitting the work of handling those sessions across multiple relays instead of it all being done by a single process, we could allow a single guild to use more resources to serve a large community. While some work would still need to be done in the main guild process, this got us to handling communities in the hundreds of thousands of members.

A chart representing a Guild transferring to three separate relay sessions. One of the three Relay sessions splits off into two sessions, which move to Mobile and Desktop. The second Relay session moves to a mobile device. The third Relay session moves to a robot.
Relays maintain connections to the sessions instead of the guild, and are responsible for doing fanout with permission checks. We handle up to 15,000 connected sessions per relay.

Implementing this required identifying which operations were critical to do in the relays, which ones had to be done in the guild, and which ones could be done by either system. Once we had an idea of what was needed, we could begin refactoring to pull out logic that could be shared between the two systems. For example, most of the logic for how to do fanout was refactored into a library used by both guilds and relays. Some logic could not be shared like this, and different solutions were needed: voice state management, for instance, was implemented by having the relay proxy all of the relevant messages to the guild with minimal changes.
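The overall shape looks something like this (a hedged sketch, not the production code): the guild sends each event once per relay, and each relay does permission checks and fanout for its own slice of up to ~15,000 sessions.

```elixir
# Illustrative relay process: the guild casts one {:dispatch, ...} per
# relay instead of one message per session; the relay then fans out to
# the sessions attached to it, applying a permission check per session.
defmodule Relay do
  use GenServer

  @impl true
  def init(_opts), do: {:ok, %{sessions: []}}

  @impl true
  def handle_cast({:attach, session_pid}, state) do
    {:noreply, %{state | sessions: [session_pid | state.sessions]}}
  end

  @impl true
  def handle_cast({:dispatch, event, can_see?}, state) do
    # `can_see?` stands in for the shared permission-check library.
    for pid <- state.sessions, can_see?.(pid), do: send(pid, {:event, event})
    {:noreply, state}
  end
end
```

With this split, the guild's per-event cost scales with the number of relays rather than the number of online members.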

One of the more interesting design decisions we made when initially rolling out relays was to include the entire list of members in the state of each relay. This was good from a simplicity perspective: any member information needed was guaranteed to be available. Unfortunately, at Midjourney scale (many millions of members) this design started to make less and less sense: not only did we have dozens of copies of those tens of millions of members sitting in RAM, but creating a new relay would require serializing and sending all of that member information to the new relay, stalling the guild for tens of seconds. To fix that, we added logic to identify the members that the relay actually needed to function, which turned out to be a tiny percentage of the overall membership.

Keeping servers responsive

In addition to ensuring we stayed within throughput limits, we also needed to make sure that servers stayed responsive. Again, looking at timing data was useful: here, focusing more on operations which have high per-call durations rather than high total durations was more impactful.

Worker Processes + ETS

One of the biggest culprits of unresponsiveness was operations which needed to run in the guild and iterate over all of the members. These are fairly rare, but they do happen. For instance, when someone sends an @everyone ping, we need to know everyone in the server who is able to see that message. But doing those checks can take many seconds. How can we handle that? Ideally we'd run that logic while the guild is busy processing other things, but Elixir processes don't share memory nicely. So we needed a different solution.

One of the tools available in Erlang/Elixir for storing data in memory where processes can share it is ETS. This is an in-memory database that supports the ability for multiple Elixir processes to access it safely. While it’s less efficient than accessing data in the process heap, it’s still quite fast. It also has the benefit of reducing the size of the process heap, which reduces the latency of garbage collection.

We decided to create a hybrid structure for holding the list of members: We would store the list of members in ETS, which allows other processes to read it too, but also store a set of recent changes (inserts, updates, deletes) in the process heap. Since most members are not being updated all the time, the set of recent changes is a very small fraction of the overall set of members.
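A sketch of that hybrid layout (simplified; names are ours): the full member list lives in a :protected ETS table that other processes can read, while the guild keeps its recent writes in a small in-heap map.

```elixir
# Hybrid member storage: bulk data in ETS, recent changes in the heap.
defmodule Members do
  defstruct table: nil, pending: %{}

  def new do
    %__MODULE__{table: :ets.new(:members, [:set, :protected, read_concurrency: true])}
  end

  # Writes land in the in-heap pending map first...
  def put(m, id, member), do: %{m | pending: Map.put(m.pending, id, member)}

  # ...and are periodically flushed into ETS in one batch.
  def flush(%{pending: pending} = m) do
    :ets.insert(m.table, Map.to_list(pending))
    %{m | pending: %{}}
  end

  # Reads check recent changes before falling back to ETS.
  def get(m, id) do
    case m.pending do
      %{^id => member} ->
        member

      _ ->
        case :ets.lookup(m.table, id) do
          [{^id, member}] -> member
          [] -> nil
        end
    end
  end
end
```

Because the pending map stays tiny relative to the full member list, the guild's heap (and therefore its GC pauses) stays small.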

With the members in ETS, we can now create worker processes and pass them the ETS table identifier to work on when we have expensive operations. The worker process can handle the expensive part while the guild continues doing other work. See the code snippet for a simplified version of how we do this.
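A simplified version of that pattern (the permission check here is a placeholder): the guild hands the ETS table id to a short-lived worker, which scans the table while the guild keeps draining its mailbox.

```elixir
# Worker sketch: compute the recipients of an @everyone mention off the
# guild process. A :protected ETS table owned by the guild is readable
# by any process, so the worker can fold over it directly.
defmodule Guild.Worker do
  def async_mentions(table, message, reply_to) do
    spawn_link(fn ->
      recipients =
        :ets.foldl(
          fn {id, member}, acc ->
            if can_see?(member, message), do: [id | acc], else: acc
          end,
          [],
          table
        )

      send(reply_to, {:everyone_mention_recipients, message, recipients})
    end)
  end

  # Hypothetical permission check; the real one consults roles and channels.
  defp can_see?(_member, _message), do: true
end
```

The guild handles the `{:everyone_mention_recipients, ...}` reply like any other message, so the expensive scan never blocks its loop.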

One place where we use this is when we need to hand off a guild process from one machine to another (usually for maintenance or deploys). In that process, we need to spawn a new process to handle the guild on the new machine, then copy over the state of the old guild process to the new one, re-connect all connected sessions to the new guild process, and then process the backlog that built up during that operation. By using a worker process, we can send most of the members (which can be multiple GB of data) while the old guild process is still doing work, and thus cut off minutes of stalling every time we do a hand off.

Manifold Offload

Another idea we had for improving responsiveness (and also pushing out the throughput limit) was to extend Manifold (discussed in a previous blog post) to use a separate “sender” process to do the fanout to recipient nodes (instead of having the guild process do it). This not only reduced the amount of work done by the guild process, but it would also insulate it from BEAM's backpressure in case one of the network connections between guilds and relays were to back up temporarily (BEAM is the virtual machine which our Elixir code runs in). In theory this looked like it would be a quick win. Unfortunately, when we tried turning this feature (called Manifold Offload) on, we found that it actually led to a massive performance degradation. How could this be? We're theoretically doing less work but the process is busier?

Looking more closely, we noticed that most of the extra work was related to garbage collection. Understanding how and why required more data. Here, the :erlang.trace function came to the rescue. This allowed us to get data about every time the guild process did a garbage collection, giving us insights into not only exactly how frequently it was happening, but also what triggered it.
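Setting that up is a few lines (sketched here with illustrative names): enable the :garbage_collection trace flag on the process, and the tracer receives a message at the start and end of every GC, carrying heap statistics.

```elixir
# GC tracing sketch: the calling process becomes the tracer for `pid`
# and receives {:trace, pid, :gc_minor_start | :gc_major_start, info}
# messages (plus matching *_end messages) for every collection.
defmodule GcTracer do
  def trace(pid) do
    :erlang.trace(pid, true, [:garbage_collection, {:tracer, self()}])
  end

  def await_gc(timeout \\ 5_000) do
    receive do
      {:trace, pid, :gc_minor_start, info} -> {:minor, pid, info}
      {:trace, pid, :gc_major_start, info} -> {:major, pid, info}
    after
      timeout -> :no_gc
    end
  end
end
```

The info list includes fields like the old heap size and binary virtual heap size, which is what pointed us at the trigger condition below.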

Taking the information from those traces, and looking at BEAM’s garbage collection code, we realized that the trigger condition for the major (full) garbage collection when manifold offload was enabled was the virtual binary heap. The virtual binary heap is a feature which is designed to allow for freeing memory used by strings that are not stored inside the process heap, even when the process would otherwise not need to garbage collect. Unfortunately, our usage pattern meant that we would continually trigger garbage collection to reclaim a couple hundred kilobytes of memory, at the cost of copying a heap which was gigabytes in size, a trade-off which was clearly not worth it. Thankfully BEAM allowed us to tune this behavior using a process flag, min_bin_vheap_size. Once we increased that to a few megabytes, the pathological garbage collection behavior disappeared, and we were then able to turn on manifold offload and observe a significant performance win.
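The fix itself is small. A sketch, with an illustrative 2 MB threshold (the flag is specified in words, not bytes), set when the process is spawned:

```elixir
# min_bin_vheap_size is given in words; convert from a byte budget first.
words = div(2 * 1024 * 1024, :erlang.system_info(:wordsize))

# Raise the virtual binary heap threshold at spawn time...
pid = :erlang.spawn_opt(fn -> Process.sleep(:infinity) end, [{:min_bin_vheap_size, words}])

# ...or from inside an already-running process:
# Process.flag(:min_bin_vheap_size, words)
```

With the threshold raised, off-heap binaries no longer force full collections of a multi-gigabyte heap just to reclaim a few hundred kilobytes.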

Looking forward

The optimizations mentioned above are the tip of the iceberg of the many changes we’ve made to improve performance both for Midjourney and other huge servers. And, with the instrumentation mentioned above, we’re able to stay on top of new bottlenecks that arise as different communities grow and use Discord in new and exciting ways.