At Discord, we’re always thinking about ways to improve our services and increase performance. After all, the faster our app gets, the sooner you can return to your friends and conversations!
Over the last six months, we embarked on a quest to reduce the amount of bandwidth our clients use, especially on iOS and Android, hoping that decreasing bandwidth usage would lead to a more responsive experience.
Background
When your client connects to Discord, it receives real-time updates about what’s happening through a service that we call the “gateway.” Since late 2017, the client’s gateway connection has been compressed using zlib, making messages anywhere from 2 to 10 times smaller.
Since then, zstandard (originally released in 2015) has gained enough traction to become a viable replacement for zlib. Zstandard offers higher compression ratios, shorter compression times, and support for dictionaries: a way to preemptively exchange information about compressed content, further increasing compression ratios and reducing overall bandwidth usage.
We attempted to use zstandard in the past but, at the time, the benefits weren’t worth the costs: our testing in 2019 was desktop-only, and zstandard used too much RAM. However, a lot can happen in five years! We wanted to give it another try, and the support for dictionaries appealed to us, especially since most of our gateway payloads are small and have a well-defined shape.
We believed the predictability of these payloads would be a perfect application of dictionaries to further reduce bandwidth usage.
Armed with this knowledge, we put on our lab coats, slapped on our goggles, and started experimenting. On paper, we thought zstandard would be better than zlib, but we wanted to validate this theory against our current workload.
We opted to do a “dark launch” of plain zstandard: the plan was to compress a small percentage of production traffic with both zlib and zstandard, collect a bunch of metrics, then discard the zstandard data. This allowed us to quickly compare zstandard’s results against zlib’s. Without this experiment, we would have had to add zstandard support to our clients (desktop, iOS, and Android), which would require about a month’s lead time before we could fully determine the effects of zstandard. We didn’t know how well zstandard would perform and didn’t want to wait a whole month, but a dark launch allowed us to iterate over days as opposed to weeks.
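To make the dark-launch pattern concrete, here’s a minimal sketch of the idea in Python. This is purely illustrative (the gateway itself is written in Elixir, and the function and metric names below are invented): compress each sampled payload with both codecs, record the sizes for comparison, and only ever send the zlib output.

```python
import zlib
import zstandard as zstd  # pip install zstandard

# Illustrative only: compress a payload with both codecs, record metrics for
# later comparison, and discard the zstandard output entirely.
def dark_launch_compress(payload: bytes, metrics: list) -> bytes:
    zlib_bytes = zlib.compress(payload)
    zstd_bytes = zstd.ZstdCompressor().compress(payload)

    metrics.append({
        "uncompressed": len(payload),
        "zlib": len(zlib_bytes),
        "zstd": len(zstd_bytes),  # measured, never sent to the client
    })
    return zlib_bytes  # clients keep receiving zlib-compressed data
```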
Once we got our experiment set up and deployed onto our gateway cluster, we set up a dashboard to see how zstandard performed. We flipped the switch to start sending a teeeeeny bit of traffic through the dark launch code, and the initial results appeared to be… underwhelming: zstandard was performing worse than zlib.
To compare the performance of these two compression algorithms, we used their “compression ratio.” The compression ratio is measured by taking the uncompressed size of the payload and dividing it by the compressed size — a larger number is better.
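As a quick worked example of that calculation (with made-up numbers):

```python
# Compression ratio = uncompressed size / compressed size (higher is better).
# For example, a 1,000-byte payload that compresses to 250 bytes has a ratio of 4.0.
def compression_ratio(uncompressed_size: int, compressed_size: int) -> float:
    return uncompressed_size / compressed_size

print(compression_ratio(1_000, 250))  # 4.0
```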
The images above measure the compression ratio for the various dispatch types (op 0). With zlib, user_guild_settings_update has a compression ratio of 13.95, while with zstandard it has a compression ratio of only 12.26.
The graph below further illustrates that zstandard performed worse than zlib: the average size of a MESSAGE_CREATE payload compressed with zlib was around 250 bytes, while the same payload compressed with zstandard was over 750 bytes!
The same trend was observed for most other dispatches: zstandard was not outperforming zlib like we thought it would. What’s going on here?
Streaming Zstandard
It turns out that one of the key differences between our zlib and zstandard implementations was that zlib was using streaming compression, while zstandard wasn’t.
As mentioned previously, most of our payloads are quite small, only a few hundred bytes at most, which doesn’t give zstandard much historical context to work with when compressing future payloads. With streaming compression, the zlib stream is spun up when the connection is opened and lives until the websocket is closed. Instead of having to start fresh for every websocket message, zlib can draw on its knowledge of previously compressed data to inform its decisions on how to process fresh data. This ultimately leads to smaller payload sizes.
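To illustrate the difference, here’s a small sketch using the python-zstandard bindings (purely illustrative; our gateway does this in Elixir, and the sample payloads below are made up). The one-shot path compresses every message from scratch, while the streaming path keeps a single context alive for the whole connection:

```python
import zstandard as zstd  # pip install zstandard

# A stand-in for a stream of small gateway payloads (hypothetical data).
messages = [b'{"op":0,"t":"MESSAGE_CREATE","d":{"content":"hello world"}}'] * 100

cctx = zstd.ZstdCompressor()

# One-shot: each message is compressed in isolation, with no history to draw on.
one_shot_size = sum(len(cctx.compress(m)) for m in messages)

# Streaming: one context lives for the whole connection, so later messages can
# reference patterns already seen in earlier ones. COMPRESSOBJ_FLUSH_BLOCK makes
# each frame decodable on arrival without resetting the compression history.
stream = cctx.compressobj()
streaming_size = sum(
    len(stream.compress(m) + stream.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK))
    for m in messages
)

print(one_shot_size, streaming_size)  # streaming should come out much smaller
```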
The question then became: “Could we get zstandard to do this?” The answer to that was… “sorta.” Our gateway service is written in Elixir, and while zstandard supports streaming compression, the various zstandard bindings for Elixir/Erlang we looked at didn’t.
We ultimately settled on using ezstd as it had dictionary support (more on that later). While it didn’t support streaming at the time, in the spirit of open source, we forked ezstd to add support for streaming, which we later contributed back upstream.
We then repeated the dark launch experiment, but with zstandard streaming and got the following results:
As the above data illustrates, zstandard streaming increased the compression ratio from 6 to almost 10 and dropped the payload size from 270 bytes to 166.
This trend held true for most of the other dispatches: zstandard streaming significantly outperforms zlib both in time to compress and compression ratio.
Looking once again at MESSAGE_CREATE, the compression time per byte of data is significantly lower for zstandard streaming than for zlib, with zlib taking around 100 microseconds per byte and zstandard taking around 45.
Pushing Further
While our initial experimentation proved that zstandard streaming outperformed zlib streaming, the remaining question we had was: “How far can we push this?” Our initial experiments used the default settings for zstandard and we wanted to know how high we could push our compression ratio by playing around with the compression settings.
So how far did we get?
Tuning
Zstandard is highly configurable and enables us to tweak various compression parameters. We focused our efforts on three parameters that we thought would have the biggest impact on compression: chainlog, hashlog, and windowlog. These parameters offer trade-offs between compression speed, memory usage, and compression ratio. For example, increasing the value of the chainlog generally improves the compression ratio, but at the cost of increasing memory usage and compression time.
We also wanted to ensure that, with the settings we decided on, the compression contexts would still fit in memory on our hosts. While it’s simple to add more hosts to soak up the extra memory usage, extra hosts cost money and, at some point, provide diminishing returns.
We settled on an overall compression level of 6, a chainlog and hashlog of 16, and a windowlog of 18. These numbers are slightly above the default settings that you can see here and would comfortably fit in the memory of a gateway node.
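For illustration, here’s roughly how those settings would look using the python-zstandard bindings. This is a sketch rather than our actual configuration; the gateway sets these parameters through its Elixir bindings.

```python
import zstandard as zstd  # pip install zstandard

# Roughly the settings described above. Each *_log parameter is a power of two,
# so window_log=18 means a 2**18-byte (256 KiB) history window per connection.
params = zstd.ZstdCompressionParameters(
    compression_level=6,
    window_log=18,
    hash_log=16,
    chain_log=16,
)
cctx = zstd.ZstdCompressor(compression_params=params)
```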
Zstandard Dictionaries
Additionally, we wanted to investigate if we could take advantage of zstandard’s dictionary support to compress data even further. By pre-seeding zstandard with some information, it can more efficiently compress the first few kilobytes of data.
However, doing this adds additional complexity as both the compressor (in this case, a gateway node) and the decompressor (a Discord client) need to have the same copy of the dictionary to communicate with each other successfully.
To generate a dictionary to use, we needed data… and a lot of it. Zstandard has a built-in way to generate dictionaries (zstd --train) from a sample of data, so we just had to collect a whooole buncha samples.
Notably, the gateway supports two encoding methods for payloads: JSON and ETF. A JSON dictionary wouldn’t perform as well on ETF (and vice versa), so we had to generate two dictionaries, one for each encoding method.
Since dictionaries contain portions of the training data and we’d have to ship the dictionaries to our clients, we needed to ensure that the samples we would generate the dictionaries from were free of any personally-identifiable user data. We collected data involving 120,000 messages, split them by ETF and JSON encoding, anonymized them, and then generated our dictionaries.
Once our dictionaries were built, we could use the gathered data to quickly evaluate and iterate on their efficacy without needing to redeploy our gateway cluster.
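As a rough sketch of that offline evaluation loop, here’s how it could look with the python-zstandard bindings. The sample payloads, dictionary size, and level below are illustrative; we actually trained with zstd --train on our anonymized production samples.

```python
import zstandard as zstd  # pip install zstandard

# Illustrative stand-ins for anonymized sample payloads; real training needs a
# large and varied sample set for the dictionary to be useful.
samples = [
    b'{"op":0,"t":"TYPING_START","d":{"user_id":"%d","channel_id":"%d"}}' % (i, i * 7)
    for i in range(10_000)
]

dictionary = zstd.train_dictionary(4_096, samples)  # train a 4 KiB dictionary

plain = zstd.ZstdCompressor(level=6)
with_dict = zstd.ZstdCompressor(level=6, dict_data=dictionary)

payload = samples[0]
print(len(plain.compress(payload)), len(with_dict.compress(payload)))
```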
The first payload we tried compressing was “READY.” As one of the first (and largest) payloads sent to the user, READY contains most of the information about the connecting user, such as guild membership, settings, and read states (which channels should be marked as read or unread). We compressed a single READY payload of 2,517,725 bytes down to 306,745 bytes using the default zstandard settings, which established a baseline. Utilizing the dictionary we had just trained, the same payload was compressed down to 306,098 bytes, a savings of only around 600 bytes.
Initially, these results seemed discouraging, but we next tried compressing a smaller payload, called TYPING_START, which is sent to the client so it can show the “XXX is typing…” notification. In this situation, a 636-byte payload compresses down to 466 bytes without the dictionary and 187 bytes with it. We saw much better results with our dictionaries against smaller payloads simply due to how zstandard operates.
Most compression algorithms “learn” from data that has already been compressed, but with small payloads, there isn’t much data to learn from. By preemptively informing zstandard what the payload is going to look like, it can make more informed decisions about how to compress the first few kilobytes of data before its buffers have been fully populated.
Satisfied with these findings, we deployed dictionary support to our gateway cluster and started experimenting with it. Utilizing the dark launch framework, we compared zstandard to zstandard with dictionaries.
Our production testing yielded the following results:
We specifically looked at the READY payload size as it’s one of the first messages sent over the websocket and would be most likely to benefit from a dictionary. As shown in the table above, the compression gains were minimal for READY, so we looked at the results for more dispatch types hoping dictionaries would give more of an edge for smaller payloads.
Unfortunately, the results were a bit mixed. For example, looking at the MESSAGE_CREATE payload size that we’ve been comparing throughout this post, we can see that the dictionary actually made things worse.
Ultimately, we decided not to continue with our dictionary experiments. The slight compression improvement that dictionaries would provide was outweighed by the additional complexity they would add to our gateway service and clients. Data is a big driver of engineering at Discord, and the data speaks for itself: it wasn’t worth investing more effort into.
Buffer Upgrading
Finally, we explored increasing zstandard buffers during off-peak hours. Discord’s traffic follows a diurnal pattern, and the memory we need to handle peak demand is significantly more than what’s needed during the rest of the day.
On the surface, autoscaling our gateway cluster would prevent us from wasting compute resources during off-peak hours. However, because gateway connections are long-lived, traditional autoscaling methods don’t work well for our workload. As such, we have a lot of extra memory and compute during off-peak hours. Having all this extra compute lying around raised the question: could we take advantage of these resources to offer better compression?
To figure this out, we built a feedback loop into the gateway cluster. This loop would run on each gateway node and monitor the memory used by the clients connected to it. It would then determine the percentage of newly connecting clients that should have their zstandard buffer upgraded. An upgraded buffer increases the windowlog, hashlog, and chainlog values by one, and since these parameters are expressed as powers of two, increasing them by one roughly doubles the amount of memory the buffer uses.
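A drastically simplified, hypothetical version of that feedback loop might look something like this (the threshold, parameter values, and function names are invented for illustration, not taken from our actual implementation):

```python
import random

TARGET_MEMORY_FRACTION = 0.80  # illustrative: aim to use at most 80% of a node's memory

def upgrade_fraction(used_bytes: int, total_bytes: int) -> float:
    """Decide what fraction of *new* connections get an upgraded buffer."""
    headroom = TARGET_MEMORY_FRACTION - (used_bytes / total_bytes)
    # No headroom -> upgrade nobody; lots of headroom -> upgrade everybody.
    return max(0.0, min(1.0, headroom / TARGET_MEMORY_FRACTION))

def buffer_params_for_new_connection(used_bytes: int, total_bytes: int) -> dict:
    base = {"window_log": 18, "hash_log": 16, "chain_log": 16}
    if random.random() < upgrade_fraction(used_bytes, total_bytes):
        # Each parameter is a power of two, so +1 roughly doubles buffer memory.
        return {k: v + 1 for k, v in base.items()}
    return base
```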
After deploying and letting the feedback loop run for a bit, the results weren’t as good as we had initially hoped. As illustrated by the graph below, over a 24-hour period, our gateway nodes had a relatively low upgrade ratio (up to 30%), significantly less than the roughly 70% we anticipated.
After doing a bit of digging, we discovered that one of the primary issues that was causing the feedback loop to behave sub-optimally was memory fragmentation: the feedback loop looked at real system memory usage, but BEAM was allocating significantly more memory from the system than was needed to handle the connected clients. This caused the feedback loop to think that it had less memory to work with than was available.
To try and mitigate this, we did a little experimentation to tweak the BEAM allocator settings — more specifically, the driver_alloc allocator, which is responsible for (shockingly) driver data allocations. The bulk of the memory used by a gateway process is the zstandard streaming context, which is implemented in C using a NIF. NIF memory usage is allocated by driver_alloc. Our hypothesis was that if we could tweak the driver_alloc allocator to more effectively allocate or free memory for our zstandard contexts, we’d be able to decrease fragmentation and increase upgrade ratio overall.
However, after messing around with the allocator settings for a little bit, we decided to revert the feedback loop. While we probably would have eventually found the right allocator settings to dial in, the amount of effort needed to tweak the allocators combined with the overall additional complexity that this introduced into the gateway cluster outweighed any gains that we would’ve seen if this was successful.
Implementation and Rollout
While the original plan was to only consider zstandard for mobile users, the bandwidth improvements were significant enough for us to ship to desktop users as well! Since zstandard ships as a C library, it was simply a matter of finding bindings in each target language (Java for Android, Objective-C for iOS, and Rust for desktop) and hooking them into each client. Implementation was straightforward for Android (zstd-jni) and desktop (zstd-safe), as bindings already existed; for iOS, however, we had to write our own bindings.
This was a risky change with the potential to render Discord completely unusable if things went wrong, so the rollout was gated behind an experiment. This experiment served three purposes: allowing a quick rollback if things went wrong, validating the results we saw in the “lab,” and letting us gauge whether the change was negatively affecting any baseline metrics.
Over the course of a few months, we were able to successfully roll out zstandard to all of our users on all platforms.
Another Win: Passive Sessions V2
While this next part isn’t directly related to the zstandard work, the metrics that guided us during the dark launch phase of this project revealed a surprising behavior. Looking at the actual size of the dispatches sent to clients, PASSIVE_UPDATE_V1 stood out: it accounted for over 30% of our gateway traffic, while the actual number of these dispatches sent was comparatively small, around 2%.
We employ passive sessions to avoid sending most of the messages that a server generates to clients that may not even be looking at that server. For example, a Discord server could be very active, sending thousands of messages per minute, but if a user isn’t actually reading those messages, it doesn’t make sense to send them and waste their bandwidth. Once you tab into the server, a passive session is “upgraded” into a normal session and receives the full firehose of dispatches from that guild.
However, passive sessions still need to be periodically sent a limited amount of information, which is the purpose of PASSIVE_UPDATE_V1. Periodically, all passive sessions will receive an update with a list of channels, members, and members in voice so your client can still be kept in sync with the server.
Diving into the actual contents of one of these PASSIVE_UPDATE_V1 dispatches, we found that we would send all of the channels, members, and members in voice, even if only a single element changed. Passive sessions were implemented as a means to scale Discord servers to hundreds of thousands of users, and this approach worked well at the time.
However, as we’ve continued to scale, sending these snapshots of mostly redundant data no longer made sense. To fix this, we introduced a new dispatch that only sends the delta of what’s changed since the last update. This dispatch, aptly named PASSIVE_UPDATE_V2, significantly reduced the bandwidth attributed to these updates from 35% of the gateway’s bandwidth to 5%, equating to a 20% reduction cluster-wide.
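To illustrate the snapshot-versus-delta idea (this is a hypothetical sketch, not the actual PASSIVE_UPDATE_V2 schema or field names):

```python
# Instead of resending the full channel/member lists every time, send only the
# entries that changed since the previous snapshot.
def compute_delta(previous: dict, current: dict) -> dict:
    return {
        "updated": {k: v for k, v in current.items() if previous.get(k) != v},
        "removed": [k for k in previous if k not in current],
    }

last_snapshot = {"channel_1": {"last_message_id": 100}, "channel_2": {"last_message_id": 7}}
new_snapshot  = {"channel_1": {"last_message_id": 101}, "channel_2": {"last_message_id": 7}}

print(compute_delta(last_snapshot, new_snapshot))
# {'updated': {'channel_1': {'last_message_id': 101}}, 'removed': []}
```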
B I G Savings
Through the combined effects of Passive Sessions v2 and zstandard, we were able to reduce the gateway bandwidth used by our clients by almost 40%. That’s a LOT of data!
The chart shows the relative outgoing bandwidth of the gateway cluster from January 1, 2024 to August 12, 2024, with the two demarcations being the zstandard rollout in April, followed by passive sessions v2 in late May.
While the passive sessions optimization was an unintended side-effect of the zstandard experimentation, it shows that with the right instrumentation and by looking at graphs with a critical eye, big savings can be achieved with a reasonable amount of effort.