Engineering & Developers

How Discord Scaled Elixir to 5,000,000 Concurrent Users

Stanislav Vishnevskiy

July 6, 2017

From the beginning, Discord has been an early adopter of Elixir. The Erlang VM was the perfect candidate for the highly concurrent, real-time system we were aiming to build. We developed the original prototype of Discord in Elixir; that became the foundation of our infrastructure today. Elixir’s promise was simple: access the power of the Erlang VM through a much more modern and user-friendly language and toolset.

Fast forward two years, and we are up to nearly five million concurrent users and millions of events per second flowing through the system. While we don’t have any regrets with our choice of infrastructure, we did have to do a lot of research and experimentation to get here. Elixir is a new ecosystem, and the Erlang ecosystem lacks information about using it in production (although Erlang in Anger is awesome). What follows is a set of lessons learned and libraries created throughout our journey of making Elixir work for Discord.

Message Fanout

While Discord is rich with features, most of it boils down to pub/sub. Users connect to a WebSocket and spin up a session process (a GenServer), which then communicates with remote Erlang nodes that contain guild (internal for a “Discord Server”) processes (also GenServers). When anything is published in a guild, it is fanned out to every session connected to it.

When a user comes online, they connect to a guild, and the guild publishes a presence to all other connected sessions. Guilds have a lot of other logic behind the scenes, but here’s a simplified example:

	def handle_call({:publish, message}, _from, %{sessions: sessions}=state) do
	Enum.each(sessions, &send(&1.pid, message))
	{:reply, :ok, state}
	end

view raw publish.ex hosted with ❤ by GitHub

This was a fine approach when we originally built Discord to groups of 25 of less. However, we have been fortunate enough to have “good problems” arise as people started using Discord for large scale groups. Eventually we ended up with many Discord servers like /r/Overwatch with up to 30,000 concurrent users. During peak hours, we began to see these processes fail to keep up with their message queues. At a certain point, we had to manually intervene and turn off features that generated messages to help cope with the load. We had to figure this out before it became a full-time job.

We began by benchmarking hot paths within the guild processes and quickly stumbled onto an obvious culprit. Sending messages between Erlang processes was not as cheap as we expected, and the reduction cost — Erlang unit of work used for process scheduling — was also quite high. We found that the wall clock time of a single send/2 call could range from 30μs to 70us due to Erlang de-scheduling the calling process. This meant that during peak hours, publishing an event from a large guild could take anywhere from 900ms to 2.1s! Erlang processes are effectively single threaded, and the only way to parallelize the work is to shard them. That would have been quite an undertaking, and we knew there had to be a better way.

We knew we had to somehow distribute the work of sending messages. Since spawning processes in Erlang is cheap, our first guess was to just spawn another process to handle each publish. However, each publish could be scheduled at a different time, and Discord clients depend on linearizability of events. That solution also wouldn’t scale well because the guild service was also responsible for an ever-growing amount of work.

Inspired by a blog post about boosting performance of message passing between nodes, Manifold was born. Manifold distributes the work of sending messages to the remote nodes of the PIDs (Erlang process identifier), which guarantees that the sending processes at most only calls send/2 equal to the number of involved remote nodes. Manifold does this by first grouping PIDs by their remote node and then sending to Manifold.Partitioner on each of those nodes. The partitioner then consistently hashes the PIDs using :erlang.phash2/2, groups them by number of cores, and sends them to child workers. Finally, those workers send the messages to the actual processes. This ensures the partitioner does not get overloaded and still provides the linearizability guaranteed by send/2. This solution was effectively a drop-in replacement for send/2:

Manifold.send([self(), self()], :hello)

view raw manifold_send.ex hosted with ❤ by GitHub

An awesome side-effect of Manifold was that we were able to not only distribute the CPU cost of fanning out messages, but also reduce the network traffic between nodes:

Manifold is available on our GitHub, so give it a spin. https://github.com/discordapp/manifold.

Fast Access Shared Data

Discord is a distributed system achieved through consistent hashing. Using this method requires us to create a ring data structure that can be used to lookup the node of a particular entity. We want that to be fast, so we chose the wonderful library by Chris Moos via a Erlang C port (process responsible for interfacing with C code). It worked great for us, but as Discord scaled, we started to notice issues when we had bursts of users reconnecting. The Erlang process responsible for controlling the ring would start to get so busy that it would fail to keep up with requests to the ring, and the whole system would become overloaded. The solution at first seemed obvious: run multiple processes with the ring data to better utilize all the machine’s cores to answer the requests. However, we noticed that this was a hot path. Could we do better?

Let’s break down the cost of this hot path.

A user can be in any number of guilds, but an average user is in 5.
An Erlang VM responsible for sessions can have up to 500,000 live sessions on it.
When a session connects, it has to lookup the remote node for each guild it is interested in.
The cost of communicating with another Erlang process using request/reply is about 12μs.

If the session server were to crash and restart, it would take about 30 seconds just for the cost of lookups on the ring. That does not even account for Erlang de-scheduling the single process involved in the ring for other processes’ work. Could we remove this cost completely?

The first thing people do in Elixir when they want to speed up data access is to introduce ETS. ETS is a fast, mutable dictionary implemented in C; the tradeoff is that data is copied in and out of it. We couldn’t just move our ring into ETS because we were using a C port to control the ring, so we converted the code to pure Elixir. Once that was implemented, we had a process whose job was to own the ring and constantly copy it into ETS so other processes could read directly from ETS. This noticeably improved performance, but ETS reads were about 7μs, and we were still spending 17.5 seconds on looking up values in the ring. The ring data structure is actually fairly large, and copying it in and out of ETS was the majority of the cost. We were disappointed; in any other language we could easily just have a shared value that was safe to read. There had to be a way to do this in Erlang!

After doing some research, we found mochiglobal, a module that exploits a feature of the VM: if Erlang sees a function that always returns the same constant data, it puts that data into a read-only shared heap that processes can access without copying the data. mochiglobal takes advantage of this by creating an Erlang module with one function at runtime and compiling it. Since the data is never copied, the lookup cost decreases to 0.3us, bringing the total time down to 750ms! There’s no such thing as a free lunch though; the cost of building a module with a data structure as large as the ring at runtime can take up to a second. The good news is that we rarely change the ring, so it was a penalty we were willing to take.

We decided to port mochiglobal to Elixir and add some functionality to avoid creating atoms. Our version is called FastGlobal and is available at https://github.com/discordapp/fastglobal.

Limited Concurrency

After solving the performance of the node lookup hot path, we noticed that the processes responsible for handling guild_pid lookup on the guild nodes were getting backed up. The inherent back pressure of the slow node lookup had previously protected these processes. The new problem was that nearly 5,000,000 session processes were trying to stampede ten of these processes (one on each guild node). Making this path faster wouldn’t solve the problem; the underlying issue was that the call of a session process to this guild registry would timeout and leave the request in the queue of the guild registry. It would then retry the request after a backoff, but perpetually pile up requests and get into an unrecoverable state. Sessions would block on these requests until they timed out while receiving messages from other services, causing them to balloon their message queues and eventually OOM the whole Erlang VM resulting in cascading service outages.

We needed to make session processes smarter; ideally, they wouldn’t even try to make these calls to the guild registry if a failure was inevitable. We didn’t want to use a circuit breaker because we didn’t want a burst in timeouts to result in a temporary state where no attempts are made at all. We knew how we would solve this in other languages, but how would we solve it in Elixir?

In most other languages, we could use an atomic counter to track outstanding requests and bail early if the number was too high, effectively implementing a semaphore. The Erlang VM is built around coordinating through communication between processes, but we knew we didn’t want to overload a process responsible for doing this coordination. After some research we stumbled upon :ets.update_counter/4, which performs atomic conditional increment operations on a number inside an ETS key. Since we needed high concurrency, we could also run ETS in write_concurrency mode but still read the value out, since :ets.update_counter/4 returns the result. This gave us the fundamental piece to create our Semaphore library. It is extremely easy to use and performs really well at high throughput:

	semaphore_name = :my_sempahore
	semaphore_max = 10
	case Semaphore.call(semaphore_name, semaphore_max, fn -> :ok end) do
	:ok ->
	IO.puts "success"
	{:error, :max} ->
	IO.puts "too many callers"
	end

view raw semaphore.ex hosted with ❤ by GitHub

This library has proved instrumental in protecting our Elixir infrastructure. A similar situation to the aforementioned cascading outages occurred as recently as last week, but there were no outages this time. Our presence services crashed due to an unrelated issue, but the session services did not even budge, and the presence services were able to rebuild within minutes after restarting:

*Live presences within presence service*

*CPU usage on the session services around the same time period.*

You can find our Semaphore library on GitHub at https://github.com/discordapp/semaphore.

Conclusion

Choosing to use and getting familiar with Erlang and Elixir has proven to be a great experience. If we had to go back and start over, we would definitely choose the same path. We hope that sharing our experiences and tools proves useful to other Elixir and Erlang developers, and we hope to continue sharing as we progress on our journey, solving problems and learning lessons along the way.

We are hiring, so come join us if this type of stuff tickles your fancy.

How Discord Scaled Elixir to 5,000,000 Concurrent Users

Message Fanout

Fast Access Shared Data

Limited Concurrency

Conclusion

related articles

Discord Update: June 30, 2025 Changelog

Get More From Your Boosts With New Server Perks

Gift Nitro and Earn A Flavorful Splash for your Avatar

Discord Social SDK Updates & Integrations

Discord Patch Notes: June 3, 2025

Go Beyond, Plus Ultra! with the My Hero Academia Collection

STAR WARS™ Makes Its Way to Discord

Discord Patch Notes: May 1, 2025

Worthy of a Plaque: Nameplates Land in the Shop

Make More Closet Space! Nitro Members Can Now Keep Avatar Decoration Quest Rewards for Longer

Discord Patch Notes: April 3, 2025

Discord Update: March 25, 2025 Changelog

Revamped Overlay & Refreshed Desktop Give Game Time a Boost

Discord Patch Notes: March 11, 2025

Discord Patch Notes: February 3, 2025

Discord Update: December 19, 2024 Changelog

Gift Ideas for the Dedicated Discord User in Your Life

Discord Patch Notes: December 5, 2024

Discord Update: November 18, 2024 Changelog

Celebrate Arcane’s Second Season with a new Shop Collection

Discord Patch Notes: November 1, 2024

Set Out for a Discord Adventure! Check Out Our Roll20 Adventure & D&D Shop Collection

Discord Patch Notes: October 1, 2024

Discord Update: September 26, 2024 Changelog

Discover More Ways to Play with Apps – Now Anywhere on Discord!

Legacy Shop Favorites Emerge from The Vault for a First Anniversary Encore!

Discord Patch Notes: August 30, 2024

Discord Update: August 28, 2024 Changelog

Queue Up Your Playlists on Discord with the Amazon Music Listening Party Activity!

Discord Patch Notes: August 1, 2024

Now Available: See What’s Happening on Discord, Directly from your Xbox console

Discord Update: July 26, 2024 Changelog

WHO LIVES ON YOUR PROFILE FOR ALL TO SEE? 🎶 SPONGEBOB, IN THE SHOP!

Discord Patch Notes: July 1, 2024

Discord Update: June 20, 2024 Changelog

How to Join Discord Calls Directly From Your PS5® — No Phone Needed!

Feast Your Monit-eyes on Today's Exciting Developer Updates!

Discord Patch Notes: May 2024

Refining Discord’s Mobile Experience With Your Feedback

Discord Update: May 13, 2024 Changelog

Discord Patch Notes: April 2024

Discord Update: April 3, 2024 Changelog

Lock in. Stand out. VALORANT arrives in the Shop.

Discord Update: March 5, 2024 Changelog

Discord Update: December 13, 2023 Changelog

Improving Our Mobile Experience

Discord Update: October 19, 2023 Changelog

Avatar Decorations & Profile Effects: Collect and Keep the Newest Styles

Discord Update: September 13, 2023 Changelog

Now Available: Stream Your Xbox Games Directly to Discord

Discord Update: July 29, 2023 Changelog

Meme Up Some Fun with Remix

Discord Update: June 22, 2023 Changelog

Server Subscriptions Just Got Super Powered: Introducing Media Channels, Tier Templates and more!

Discord Update: May 22, 2023 Changelog

Evolving Usernames on Discord

Discord Update: April 14, 2023 Changelog

Welcome Your New Members Easily with Community Onboarding

Introducing Discord Voice Messages

April Showers Bring Super-Cool Nitro Powers

New to Discord Nitro: Super Reactions Make Your Emoji Burst to Life

Ready Your Airhorns! 🎺 Discord Soundboard is Coming Your Way

Discord Update: March 20, 2023 Changelog

Now in Nitro: Bring Your Vibe to Discord with New Themes

Discord Activities: Play Games and Watch Together

Discord is Your Place for AI with Friends

Now Available: Use Discord Voice Chat on Your PlayStation®5 Console

Discord Update: February 20, 2023 Changelog

Introducing Video, Screen Share, and Text Chat Support for Stage Channels

Discord Update: January 25, 2023 Changelog

Make Your Connection: Connected Accounts Get a Huge Functionality Boost

Announcing Server Subscriptions and the Creator Portal, Now Open to More Communities

Discord Update: November 1, 2022 Changelog

Attention Server Owners: The App Directory is Here!