
From Single-Node to Multi-GPU Clusters: How Discord Made Distributed Compute Easy for ML Engineers

How Discord ML Hit Its Scaling Limit

At Discord, our machine learning systems have evolved from simple classifiers to sophisticated models serving hundreds of millions of users. As our models grew more complex and datasets larger, we increasingly ran into scaling challenges: training jobs that needed multiple GPUs, datasets that wouldn’t fit on single machines, and computational demands that outpaced our infrastructure.

Access to distributed compute was necessary — but not sufficient. We needed distributed ML to be easy. Ray, an open-source distributed computing framework, became our foundation. At Discord, we built a platform around it: custom CLI tooling, orchestration with Dagster + KubeRay, and an observability layer called X-Ray. Our focus was on developer experience, turning distributed ML from something hard to use into a system engineers are excited to work with.

This is how Discord went from no deep learning, to ad-hoc experiments, to a production orchestration platform, and how that work enabled models like Ads Ranking that delivered a +200% improvement on our business metrics.

Ray's Early Adopters at Discord

Our journey with Ray began organically, with individual ML engineers exploring it as a solution to their specific scaling problems. These early adopters ran Ray clusters manually, following the open-source documentation and adapting examples to their needs.

While this got them unblocked, we quickly noticed problems: cluster configuration wasn’t standardized, resource management wasn’t consistent, and there were no options for job scheduling or monitoring. Each team was solving the same infrastructure challenges with their own solutions. It became clear that while Ray solved the distributed computing problem, we needed to build a “Ray platform” within Discord.

From YAML Headaches to One CLI Command

The first step toward building our Ray platform was creating a command-line interface that abstracts away the complexity of configuring Ray clusters.

Instead of maintaining dozens of YAML templates for every possible GPU configuration, we built a single parameterized template that generates the full cluster specification at runtime. Engineers specify what they need — such as GPU type, worker count, or memory — and the CLI handles all the underlying Kubernetes configurations and security settings. Once set up, engineers can submit jobs to their own personalized clusters.
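
To give a sense of what this looks like (the render_cluster_spec helper, flag names, container images, and node labels below are hypothetical, not our actual CLI internals), a parameterized template that expands a few options into a full KubeRay RayCluster manifest might look like this:

```python
# Hypothetical sketch, not Discord's CLI internals: a single parameterized
# template renders a full KubeRay RayCluster manifest from a few options.
import yaml  # pip install pyyaml


def render_cluster_spec(user: str, gpu_type: str, workers: int, memory_gb: int) -> str:
    """Turn a handful of CLI parameters into a complete RayCluster manifest."""
    spec = {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": f"{user}-ray", "labels": {"owner": user}},
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {"dashboard-host": "0.0.0.0"},
                "template": {"spec": {"containers": [
                    {"name": "ray-head", "image": "rayproject/ray:2.9.0"},
                ]}},
            },
            "workerGroupSpecs": [{
                "groupName": "gpu-workers",
                "replicas": workers,
                "rayStartParams": {},
                "template": {"spec": {
                    # Assumes GKE-style accelerator labels; swap for your scheduler's convention.
                    "nodeSelector": {"cloud.google.com/gke-accelerator": gpu_type},
                    "containers": [{
                        "name": "ray-worker",
                        "image": "rayproject/ray:2.9.0-gpu",
                        "resources": {"limits": {"nvidia.com/gpu": 1, "memory": f"{memory_gb}Gi"}},
                    }],
                }},
            }],
        },
    }
    return yaml.safe_dump(spec, sort_keys=False)


if __name__ == "__main__":
    # Something like `ray-cli create --gpu a100 --workers 4 --memory 64` could reduce to:
    print(render_cluster_spec(user="nelly", gpu_type="nvidia-tesla-a100", workers=4, memory_gb=64))
```

The generated YAML is never edited by hand: engineers only supply the parameters the template exposes, and the CLI layers on the Kubernetes and security settings before applying it.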

This solved our immediate problems: configurations became consistent across teams, resource requests matched hardware capabilities, and engineers could spin up multi-GPU clusters with a single command instead of debugging YAML files. The CLI handled the full lifecycle, from creation to deletion.

Just as importantly, this focus on usability set the tone for everything that followed. Our platform succeeded not only because Ray was technically powerful, but because we made it ergonomic for engineers to use.

Orchestration: From Ad-hoc to Automated

The next big step towards productionized Ray was moving from one-off, ad-hoc training to scheduling jobs in our orchestration system. This would allow engineers to retrain on defined schedules depending on the use case. 

We built our orchestration system around a trio of Dagster + KubeRay + Ray:

  • Dagster defines workflows, configs, and dependencies. Engineers launch their jobs on a schedule or through the Dagster Launchpad UI by filling in structured configs (like model_name, dataset window, or GPU pool). Defaults are schema-validated, so jobs don’t fail because of missing parameters. Dagster also plays a key role in Discord’s scalable data platform. (A minimal sketch of this flow follows the list below.)
  • KubeRay provisions Ray clusters dynamically on Kubernetes, attaching the correct service account and GPU node pool, using the same behind-the-scenes logic as the CLI.
  • Ray executes the distributed workload once the cluster is live. This includes training, evaluation, or batch inference.
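
To make the handoff concrete, here is a minimal sketch of the Dagster side, assuming Dagster’s Pythonic config API and the Kubernetes Python client. The names (TrainConfig, submit_ray_training, the namespace, images, and schedule) are hypothetical rather than our production pipeline code: an op with a schema-validated config builds a RayJob custom resource for the KubeRay operator, and a schedule retrains it daily.

```python
# Minimal sketch of the Dagster -> KubeRay handoff; names and images are
# hypothetical, not Discord's production pipeline code.
from dagster import Config, Definitions, ScheduleDefinition, job, op
from kubernetes import client as k8s_client, config as k8s_config


class TrainConfig(Config):
    # Schema-validated defaults, so a launch can't silently omit a parameter.
    model_name: str = "ads_ranking"
    dataset_window_days: int = 7
    gpu_pool: str = "a100-pool"
    num_workers: int = 4


@op
def submit_ray_training(config: TrainConfig) -> None:
    """Create a RayJob custom resource; the KubeRay operator provisions the cluster."""
    k8s_config.load_incluster_config()  # or load_kube_config() when running locally
    rayjob = {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        # A real pipeline would add a unique per-run suffix to avoid name collisions.
        "metadata": {"name": f"{config.model_name}-train"},
        "spec": {
            "entrypoint": (
                f"python train.py --model {config.model_name} "
                f"--window-days {config.dataset_window_days}"
            ),
            "shutdownAfterJobFinishes": True,  # KubeRay tears the cluster down when done
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [
                        {"name": "ray-head", "image": "rayproject/ray:2.9.0"},
                    ]}},
                },
                "workerGroupSpecs": [{
                    "groupName": config.gpu_pool,
                    "replicas": config.num_workers,
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [
                        {"name": "ray-worker", "image": "rayproject/ray:2.9.0-gpu",
                         "resources": {"limits": {"nvidia.com/gpu": 1}}},
                    ]}},
                }],
            },
        },
    }
    k8s_client.CustomObjectsApi().create_namespaced_custom_object(
        group="ray.io", version="v1", namespace="ml-training", plural="rayjobs", body=rayjob,
    )


@job
def ads_ranking_training():
    submit_ray_training()


defs = Definitions(
    jobs=[ads_ranking_training],
    schedules=[ScheduleDefinition(job=ads_ranking_training, cron_schedule="0 6 * * *")],
)
```

Submitting a RayJob instead of creating a cluster by hand is what lets KubeRay own the cluster lifecycle, including tearing it down once the job finishes.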

How it works

  1. An engineer launches or schedules a pipeline in Dagster.
  2. Dagster submits the job spec to the Ray Job Operator.
  3. KubeRay spins up a Ray cluster in the right namespace, with the right node pool.
  4. Ray distributes the training workload across GPUs (a generic sketch of this step follows the list).
  5. Logs and metrics stream back to Dagster and monitoring systems.
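
For step 4, the workload inside the cluster is ordinary Ray Train code. The sketch below is a generic data-parallel loop with a placeholder model and data, not our Ads Ranking training code, but it shows the shape of the thing: one per-worker function plus a ScalingConfig that requests GPU workers.

```python
# Generic data-parallel Ray Train loop, not the actual Ads Ranking model:
# the same function runs on every GPU worker, and Ray sets up the process
# group, device placement, and distributed wrapping.
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model


def train_loop_per_worker(cfg: dict) -> None:
    model = prepare_model(torch.nn.Linear(128, 1))  # wrapped in DDP on this worker's GPU
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg["lr"])
    device = get_device()
    for _ in range(cfg["epochs"]):
        # Placeholder batch; a real job would shard a dataset across workers.
        features = torch.randn(1024, 128, device=device)
        labels = torch.randn(1024, 1, device=device)
        loss = torch.nn.functional.mse_loss(model(features), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),  # four GPU workers
)

if __name__ == "__main__":
    trainer.fit()
```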

This design gives us:

  • Predictability: jobs run with versioned configs, not hand-written YAMLs.
  • Reproducibility: infra and resources are tied directly to pipeline definitions.
  • Visibility: engineers can see logs and cluster state in Dagster without SSH’ing into pods.

In practice, this means the ad relevance model — one of our most GPU-intensive models — now trains daily without engineers needing to touch cluster configs or debug why jobs aren’t starting.

Observability

As adoption increased, we identified a need for better observability across our Ray infrastructure. 

To support this, we built X-Ray, a centralized web UI that gives ML engineers a single place to observe all Ray cluster operations. It shows active clusters, ownership, machine types, and status in real time, and makes it simple to open cluster dashboards or start interactive notebooks for experimentation.

The X-Ray internal observability platform for Ray
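
Under the hood, most of what a view like this needs already lives in the KubeRay custom resources. The sketch below is not X-Ray’s implementation; it’s just one way to list every RayCluster with its owner and status using the Kubernetes Python client (the owner label is an assumed tagging convention):

```python
# Not X-Ray's implementation; one way a dashboard could list every Ray cluster
# by reading the KubeRay custom resources. The "owner" label is an assumption
# about how clusters are tagged at creation time.
from kubernetes import client, config


def list_ray_clusters(namespace: str = "ml-training") -> list[dict]:
    config.load_kube_config()  # or load_incluster_config() when deployed
    resources = client.CustomObjectsApi().list_namespaced_custom_object(
        group="ray.io", version="v1", namespace=namespace, plural="rayclusters",
    )
    return [
        {
            "name": item["metadata"]["name"],
            "owner": item["metadata"].get("labels", {}).get("owner", "unknown"),
            "state": item.get("status", {}).get("state", "unknown"),
            "ray_version": item["spec"].get("rayVersion", "unknown"),
        }
        for item in resources["items"]
    ]


if __name__ == "__main__":
    for c in list_ray_clusters():
        print(f'{c["name"]:<32} {c["owner"]:<16} {c["state"]:<10} {c["ray_version"]}')
```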

Proof in Production: Ads Ranking ML

The clearest example of impact came from our Ads Ranking model, which is how Discord determines which Quest a particular user is most likely to be interested in and excited to see. This was the first time we shipped large-scale deep learning to production at Discord.

Before Ray, the team was limited to XGBoost. Scaling to neural networks using the infrastructure powering XGBoost models was technically possible, but the pieces weren’t there to ensure a good experience for our MLEs. At the time, we had no sharding, no multi-GPU support, and no way to retrain at the scale and cadence the system needed.

With Ray, Ads Ranking shifted to sharded neural networks trained on multi-GPU clusters. The results were immediate:

  • Doubled the number of players joining Quests
  • Coverage expanded from ~40% to nearly 100% of ads traffic

Ads Ranking is now a deep learning pipeline that runs in production, retrains daily, and continues to ship new versions — with percentage lifts that are hard to believe until you see them.

And the system just works. One of our ML engineers built and launched a full testing framework in a day using the open-source docs from Ray and Dagster. No handholding needed, no custom tools, no waiting around — that’s what good infra unlocks.

This model shifted what's possible at Discord. It showed that with the right infrastructure, deep learning at scale doesn’t have to be painful. It can be fast, reliable, and something ML engineers actually want to use.

Ray as Discord's ML Foundation

Discord’s Ray platform has been a huge success. What started as scattered experimentation is now the backbone of how we run machine learning. From ad-hoc tooling to production-level orchestration support to ergonomic observability systems, engineers now have the tools they need to move quickly, monitor jobs, and build confidently on top of distributed compute.

Models that used to be blocked by infrastructure constraints now run every day. Teams across Ads, Safety, Shop, and beyond are testing ideas faster, shipping faster, and hitting real results in production — in some cases seeing massive step-change improvements in performance that would’ve been impossible on our old setup.

We’re continuing to invest in the platform by tuning performance, improving developer experience, and pushing toward the best possible version of this system for the people who rely on it. 

There’s more on the ho-Ray-zon.
