Keeping servers responsive
In addition to ensuring we stayed within throughput limits, we also needed to make sure that servers stayed responsive. Again, looking at timing data was useful; here, it was more impactful to focus on operations with high per-call durations rather than high total durations.
Worker Processes + ETS
One of the biggest culprits of unresponsiveness was operations that needed to run in the guild process and iterate over all of the members. These are fairly rare, but they do happen. For instance, when someone does an @everyone ping, we need to know everyone in the server who is able to see that message. But doing those checks can take many seconds. How can we handle that? Ideally we'd run that logic while the guild process keeps handling other work, but Elixir processes don't share memory nicely. So we needed a different solution.
One of the tools available in Erlang/Elixir for storing data in memory where processes can share it is ETS. ETS is an in-memory database that multiple Elixir processes can access safely. While it's less efficient than accessing data in the process heap, it's still quite fast. It also has the benefit of shrinking the process heap, which reduces garbage collection latency.
We decided to create a hybrid structure for holding the list of members: We would store the list of members in ETS, which allows other processes to read it too, but also store a set of recent changes (inserts, updates, deletes) in the process heap. Since most members are not being updated all the time, the set of recent changes is a very small fraction of the overall set of members.
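A minimal sketch of that hybrid structure might look like the following. The module, field, and table names here are illustrative assumptions, not Discord's actual code, and deletes are omitted for brevity:

```elixir
defmodule Guild.Members do
  @moduledoc """
  Illustrative sketch: the full member list lives in a public ETS table so
  other processes can read it, while recent inserts/updates are kept in a
  small map on the guild process heap.
  """

  defstruct [:table, pending: %{}]

  def new do
    # :public lets worker processes read the table; read_concurrency is tuned
    # for many concurrent readers.
    table = :ets.new(:guild_members, [:set, :public, read_concurrency: true])
    %__MODULE__{table: table}
  end

  # Writes land in the small in-heap "pending" map first.
  def update(%__MODULE__{} = members, member_id, member) do
    %{members | pending: Map.put(members.pending, member_id, member)}
  end

  # Reads check the recent-changes map before falling back to ETS, so the
  # guild always sees its own latest writes.
  def get(%__MODULE__{} = members, member_id) do
    case Map.fetch(members.pending, member_id) do
      {:ok, member} ->
        member

      :error ->
        case :ets.lookup(members.table, member_id) do
          [{^member_id, member}] -> member
          [] -> nil
        end
    end
  end

  # Periodically fold the pending changes back into the ETS table.
  def flush(%__MODULE__{} = members) do
    :ets.insert(members.table, Map.to_list(members.pending))
    %{members | pending: %{}}
  end
end
```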
With the members in ETS, we can now create worker processes and pass them the ETS table identifier to work on when we have expensive operations. The worker process handles the expensive part while the guild continues doing other work. The snippet below sketches a simplified version of how this works.
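In this sketch the names and the permission check are illustrative stand-ins, not the real logic; the point is only that the spawned worker iterates the shared ETS table while the guild process keeps draining its own message queue:

```elixir
defmodule Guild.Worker do
  # Spawn a worker that runs an expensive, read-only pass over the member
  # table while the guild process keeps handling its message queue.
  # `table` is the public ETS table holding the members; `reply_to` is the
  # guild process (or any caller) that wants the result.
  def async_collect_mentionable(table, message, reply_to) do
    spawn(fn ->
      result =
        :ets.foldl(
          fn {_member_id, member}, acc ->
            if can_see_message?(member, message), do: [member | acc], else: acc
          end,
          [],
          table
        )

      send(reply_to, {:mentionable_members, result})
    end)
  end

  # Placeholder permission check; the real logic is considerably more involved.
  defp can_see_message?(_member, _message), do: true
end
```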
One place where we use this is when we need to hand off a guild process from one machine to another (usually for maintenance or deploys). During a handoff, we spawn a new process to handle the guild on the new machine, copy the state of the old guild process over to the new one, reconnect all connected sessions to the new guild process, and then process the backlog that built up during that operation. By using a worker process, we can send most of the members (which can be multiple GB of data) while the old guild process is still doing work, cutting minutes of stalling out of every handoff.
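A hedged sketch of that member transfer, with illustrative names and with session reconnection and backlog handling omitted: the transfer is pushed to a spawned worker that walks the ETS table in batches, so the old guild process never blocks on it.

```elixir
defmodule Guild.Handoff do
  @batch_size 10_000

  # Illustrative only: start the guild on the target node, hand over the small
  # in-heap state directly, and let a worker stream the (potentially multi-GB)
  # member table over in batches while the old guild keeps serving requests.
  def hand_off(state, target_node) do
    {:ok, new_guild} = :rpc.call(target_node, Guild, :start, [state.guild_id])

    send(new_guild, {:handoff_state, %{state | member_table: nil}})

    spawn(fn ->
      stream_members(:ets.match_object(state.member_table, :_, @batch_size), new_guild)
    end)

    state
  end

  # Walk the ETS table with match_object continuations so we never build the
  # whole member list in memory or in a single distribution message.
  defp stream_members(:"$end_of_table", new_guild) do
    send(new_guild, :handoff_complete)
  end

  defp stream_members({batch, continuation}, new_guild) do
    send(new_guild, {:handoff_members, batch})
    stream_members(:ets.match_object(continuation), new_guild)
  end
end
```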
Manifold Offload
Another idea we had for improving responsiveness (and also pushing out the throughput limit) was to extend Manifold (discussed in a previous blog post) to use a separate “sender” process to do the fanout to recipient nodes, instead of having the guild process do it. This would not only reduce the amount of work done by the guild process, but also insulate it from BEAM's backpressure if one of the network connections between guilds and relays backed up temporarily (BEAM is the virtual machine our Elixir code runs in). In theory this looked like a quick win. Unfortunately, when we tried turning this feature (called Manifold Offload) on, we found that it actually led to a massive performance degradation. How could this be? We were theoretically doing less work, yet the process was busier.
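The following is not the actual Manifold Offload implementation; it only sketches the idea of a dedicated sender process, so that any suspension caused by a backed-up distribution connection lands on that process instead of on the guild:

```elixir
defmodule Guild.Sender do
  use GenServer

  # A dedicated "sender" process per guild: the guild hands it the message and
  # recipient pids with a cast (which never blocks the guild), and the sender
  # performs the actual fanout.
  def start_link(opts), do: GenServer.start_link(__MODULE__, :ok, opts)

  def fanout(sender, recipients, message) do
    GenServer.cast(sender, {:fanout, recipients, message})
  end

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_cast({:fanout, recipients, message}, state) do
    # The fanout (and any busy_dist_port suspension from a backed-up remote
    # connection) happens here, not in the guild process.
    Manifold.send(recipients, message)
    {:noreply, state}
  end
end
```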
Looking more closely, we noticed that most of the extra work was related to garbage collection. Understanding how and why naturally required more data. Here, the :erlang.trace function came to the rescue. It let us capture data about every garbage collection the guild process performed, giving us insight into not only exactly how frequently it was happening, but also what triggered it.
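A minimal sketch of that tracing (the real analysis aggregated these events rather than printing them; `guild_pid` stands in for the guild process's pid):

```elixir
# Ask BEAM to send us a trace message for every GC the guild process does.
:erlang.trace(guild_pid, true, [:garbage_collection, :timestamp])

# Trace messages then arrive in our mailbox, e.g.
#   {:trace_ts, guild_pid, :gc_major_start, info, timestamp}
#   {:trace_ts, guild_pid, :gc_major_end, info, timestamp}
# where `info` is a keyword list including heap sizes and the binary
# virtual heap size.
receive do
  {:trace_ts, ^guild_pid, event, info, _ts}
  when event in [:gc_major_start, :gc_major_end] ->
    IO.inspect({event, Keyword.take(info, [:heap_size, :bin_vheap_size])})
end
```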
Taking the information from those traces and looking at BEAM’s garbage collection code, we realized that with Manifold Offload enabled, the trigger condition for the major (full) garbage collection was the virtual binary heap. The virtual binary heap is a feature designed to allow freeing memory used by binaries (such as strings) that are not stored inside the process heap, even when the process would otherwise not need to garbage collect. Unfortunately, our usage pattern meant that we would continually trigger garbage collection to reclaim a couple hundred kilobytes of memory at the cost of copying a heap that was gigabytes in size, a trade-off which was clearly not worth it. Thankfully BEAM allowed us to tune this behavior using a process flag, min_bin_vheap_size. Once we increased that to a few megabytes, the pathological garbage collection behavior disappeared, and we were then able to turn on Manifold Offload and observe a significant performance win.
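The tuning itself is a single process flag set from within the guild process; a sketch (the exact value here is illustrative, not our production setting):

```elixir
# min_bin_vheap_size is specified in words (8 bytes each on a 64-bit VM),
# so this raises the virtual binary heap threshold to roughly 4 MB and stops
# small off-heap binary accumulations from forcing full-heap collections.
# The flag can also be set at spawn time via spawn_opt.
:erlang.process_flag(:min_bin_vheap_size, div(4 * 1024 * 1024, 8))
```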