
From Single-Node to Multi-GPU Clusters: How Discord Made Distributed Compute Easy for ML Engineers

How Discord ML Hit Its Scaling Limit

At Discord, our machine learning systems have evolved from simple classifiers to sophisticated models serving hundreds of millions of users. As our models grew more complex and datasets larger, we increasingly ran into scaling challenges: training jobs that needed multiple GPUs, datasets that wouldn’t fit on single machines, and computational demands that outpaced our infrastructure.

Access to distributed compute was necessary — but not sufficient. We needed distributed ML to be easy. Ray, an open-source distributed computing framework, became our foundation. At Discord, we built a platform around it: custom CLI tooling, orchestration with Dagster + KubeRay, and an observability layer called X-Ray. Our focus was on developer experience, turning distributed ML from something hard to use into a system engineers are excited to work with.

This is how Discord went from no deep learning, to ad-hoc experiments, to a production orchestration platform, and how that work enabled models like Ads Ranking that delivered a +200% improvement on our business metrics.

Ray's Early Adopters at Discord

Our journey with Ray began organically, with individual ML engineers exploring it as a solution to their specific scaling problems. These early adopters ran Ray clusters manually, following the open-source documentation and adapting examples to their needs.

While this got them unblocked, we quickly noticed problems: cluster configuration wasn’t standardized, resource management wasn’t consistent, and there were no options for job scheduling or monitoring. Each team was solving the same infrastructure challenges with their own solutions. It became clear that while Ray solved the distributed computing problem, we needed to build a “Ray platform” within Discord.

From YAML Headaches to One CLI Command

The first step toward building our Ray platform was creating a command-line interface that abstracts away the complexity of configuring Ray clusters.

Instead of maintaining dozens of YAML templates for every possible GPU configuration, we built a single parameterized template that generates the full cluster specification at runtime. Engineers specify what they need — such as GPU type, worker count, or memory — and the CLI handles all the underlying Kubernetes configurations and security settings. Once set up, engineers can submit jobs to their own personalized clusters.
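
To give a sense of what this looks like (the render_cluster_spec helper, flag names, container images, and node labels below are hypothetical, not our actual CLI internals), a parameterized template that expands a few options into a full KubeRay RayCluster manifest might look like this:

```python
# Hypothetical sketch, not Discord's CLI internals: a single parameterized
# template renders a full KubeRay RayCluster manifest from a few options.
import yaml  # pip install pyyaml


def render_cluster_spec(user: str, gpu_type: str, workers: int, memory_gb: int) -> str:
    """Turn a handful of CLI parameters into a complete RayCluster manifest."""
    spec = {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": f"{user}-ray", "labels": {"owner": user}},
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {"dashboard-host": "0.0.0.0"},
                "template": {"spec": {"containers": [
                    {"name": "ray-head", "image": "rayproject/ray:2.9.0"},
                ]}},
            },
            "workerGroupSpecs": [{
                "groupName": "gpu-workers",
                "replicas": workers,
                "rayStartParams": {},
                "template": {"spec": {
                    # Assumes GKE-style accelerator labels; swap for your scheduler's convention.
                    "nodeSelector": {"cloud.google.com/gke-accelerator": gpu_type},
                    "containers": [{
                        "name": "ray-worker",
                        "image": "rayproject/ray:2.9.0-gpu",
                        "resources": {"limits": {"nvidia.com/gpu": 1, "memory": f"{memory_gb}Gi"}},
                    }],
                }},
            }],
        },
    }
    return yaml.safe_dump(spec, sort_keys=False)


if __name__ == "__main__":
    # Something like `ray-cli create --gpu a100 --workers 4 --memory 64` could reduce to:
    print(render_cluster_spec(user="nelly", gpu_type="nvidia-tesla-a100", workers=4, memory_gb=64))
```

The generated YAML is never edited by hand: engineers only supply the parameters the template exposes, and the CLI layers on the Kubernetes and security settings before applying it.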

This solved our immediate problems: configurations became consistent across teams, resource requests matched hardware capabilities, and engineers could spin up multi-GPU clusters with a single command instead of debugging YAML files. The CLI handled the full lifecycle, from creation to deletion.

Just as importantly, this focus on usability set the tone for everything that followed. Our platform succeeded not only because Ray was technically powerful, but because we made it ergonomic for engineers to use.

Orchestration: From Ad-hoc to Automated

The next big step towards productionized Ray was moving from one-off, ad-hoc training to scheduling jobs in our orchestration system. This would allow engineers to retrain on defined schedules depending on the use case. 

We built our orchestration system around a trio of Dagster + KubeRay + Ray:

  • Dagster defines workflows, configs, and dependencies. Engineers launch their jobs on a schedule or through the Dagster Launchpad UI by filling in structured configs (like model_name, dataset window, or GPU pool). Defaults are schema-validated, so jobs don’t fail because of missing parameters. Dagster also plays a key role in Discord’s scalable data platform. (A minimal sketch of this flow follows the list below.)
  • KubeRay provisions Ray clusters dynamically on Kubernetes, attaching the correct service account and GPU node pool, using the same behind-the-scenes logic as the CLI.
  • Ray executes the distributed workload once the cluster is live. This includes training, evaluation, or batch inference.
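
To make the handoff concrete, here is a minimal sketch of the Dagster side, assuming Dagster’s Pythonic config API and the Kubernetes Python client. The names (TrainConfig, submit_ray_training, the namespace, images, and schedule) are hypothetical rather than our production pipeline code: an op with a schema-validated config builds a RayJob custom resource for the KubeRay operator, and a schedule retrains it daily.

```python
# Minimal sketch of the Dagster -> KubeRay handoff; names and images are
# hypothetical, not Discord's production pipeline code.
from dagster import Config, Definitions, ScheduleDefinition, job, op
from kubernetes import client as k8s_client, config as k8s_config


class TrainConfig(Config):
    # Schema-validated defaults, so a launch can't silently omit a parameter.
    model_name: str = "ads_ranking"
    dataset_window_days: int = 7
    gpu_pool: str = "a100-pool"
    num_workers: int = 4


@op
def submit_ray_training(config: TrainConfig) -> None:
    """Create a RayJob custom resource; the KubeRay operator provisions the cluster."""
    k8s_config.load_incluster_config()  # or load_kube_config() when running locally
    rayjob = {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        # A real pipeline would add a unique per-run suffix to avoid name collisions.
        "metadata": {"name": f"{config.model_name}-train"},
        "spec": {
            "entrypoint": (
                f"python train.py --model {config.model_name} "
                f"--window-days {config.dataset_window_days}"
            ),
            "shutdownAfterJobFinishes": True,  # KubeRay tears the cluster down when done
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [
                        {"name": "ray-head", "image": "rayproject/ray:2.9.0"},
                    ]}},
                },
                "workerGroupSpecs": [{
                    "groupName": config.gpu_pool,
                    "replicas": config.num_workers,
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [
                        {"name": "ray-worker", "image": "rayproject/ray:2.9.0-gpu",
                         "resources": {"limits": {"nvidia.com/gpu": 1}}},
                    ]}},
                }],
            },
        },
    }
    k8s_client.CustomObjectsApi().create_namespaced_custom_object(
        group="ray.io", version="v1", namespace="ml-training", plural="rayjobs", body=rayjob,
    )


@job
def ads_ranking_training():
    submit_ray_training()


defs = Definitions(
    jobs=[ads_ranking_training],
    schedules=[ScheduleDefinition(job=ads_ranking_training, cron_schedule="0 6 * * *")],
)
```

Submitting a RayJob instead of creating a cluster by hand is what lets KubeRay own the cluster lifecycle, including tearing it down once the job finishes.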

How it works

  1. An engineer launches or schedules a pipeline in Dagster.
  2. Dagster submits the job spec to the Ray Job Operator.
  3. KubeRay spins up a Ray cluster in the right namespace, with the right node pool.
  4. Ray distributes the training workload across GPUs (a generic sketch of this step follows the list).
  5. Logs and metrics stream back to Dagster and monitoring systems.
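
For step 4, the workload inside the cluster is ordinary Ray Train code. The sketch below is a generic data-parallel loop with a placeholder model and data, not our Ads Ranking training code, but it shows the shape of the thing: one per-worker function plus a ScalingConfig that requests GPU workers.

```python
# Generic data-parallel Ray Train loop, not the actual Ads Ranking model:
# the same function runs on every GPU worker, and Ray sets up the process
# group, device placement, and distributed wrapping.
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model


def train_loop_per_worker(cfg: dict) -> None:
    model = prepare_model(torch.nn.Linear(128, 1))  # wrapped in DDP on this worker's GPU
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg["lr"])
    device = get_device()
    for _ in range(cfg["epochs"]):
        # Placeholder batch; a real job would shard a dataset across workers.
        features = torch.randn(1024, 128, device=device)
        labels = torch.randn(1024, 1, device=device)
        loss = torch.nn.functional.mse_loss(model(features), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),  # four GPU workers
)

if __name__ == "__main__":
    trainer.fit()
```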

This design gives us:

  • Predictability: jobs run with versioned configs, not hand-written YAMLs.
  • Reproducibility: infra and resources are tied directly to pipeline definitions.
  • Visibility: engineers can see logs and cluster state in Dagster without SSH’ing into pods.

In practice, this means the ad relevance model — one of our most GPU-intensive models — now trains daily without engineers needing to touch cluster configs or debug why jobs aren’t starting.

Observability

As adoption increased, we identified a need for better observability across our Ray infrastructure. 

To support this, we built X-Ray, a centralized web UI that gives ML engineers a single place to observe all Ray cluster operations. It shows active clusters, ownership, machine types, and status in real time, and makes it simple to open cluster dashboards or start interactive notebooks for experimentation.

The X-Ray internal observability platform for Ray
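
Under the hood, most of what a view like this needs already lives in the KubeRay custom resources. The sketch below is not X-Ray’s implementation; it’s just one way to list every RayCluster with its owner and status using the Kubernetes Python client (the owner label is an assumed tagging convention):

```python
# Not X-Ray's implementation; one way a dashboard could list every Ray cluster
# by reading the KubeRay custom resources. The "owner" label is an assumption
# about how clusters are tagged at creation time.
from kubernetes import client, config


def list_ray_clusters(namespace: str = "ml-training") -> list[dict]:
    config.load_kube_config()  # or load_incluster_config() when deployed
    resources = client.CustomObjectsApi().list_namespaced_custom_object(
        group="ray.io", version="v1", namespace=namespace, plural="rayclusters",
    )
    return [
        {
            "name": item["metadata"]["name"],
            "owner": item["metadata"].get("labels", {}).get("owner", "unknown"),
            "state": item.get("status", {}).get("state", "unknown"),
            "ray_version": item["spec"].get("rayVersion", "unknown"),
        }
        for item in resources["items"]
    ]


if __name__ == "__main__":
    for c in list_ray_clusters():
        print(f'{c["name"]:<32} {c["owner"]:<16} {c["state"]:<10} {c["ray_version"]}')
```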

Proof in Production: Ads Ranking ML

The clearest example of impact came from our Ads Ranking model, which is how Discord determines which Quest a particular user is most likely to be interested in and excited to see. This was the first time we shipped large-scale deep learning to production at Discord.

Before Ray, the team was limited to XGBoost. Scaling to neural networks using the infrastructure powering XGBoost models was technically possible, but the pieces weren’t there to ensure a good experience for our MLEs. At the time, we had no sharding, no multi-GPU support, and no way to retrain at the scale and cadence the system needed.

With Ray, Ads Ranking shifted to sharded neural networks trained on multi-GPU clusters. The results were immediate:

  • Doubled the number of players joining Quests
  • Coverage expanded from ~40% to nearly 100% of ads traffic

Ads Ranking is now a deep learning pipeline that runs in production, retrains daily, and continues to ship new versions — with percentage lifts that are hard to believe until you see them.

And the system just works. One of our ML engineers built and launched a full testing framework in a day using the open-source docs from Ray and Dagster. No handholding needed, no custom tools, no waiting around — that’s what good infra unlocks.

This model shifted what's possible at Discord. It showed that with the right infrastructure, deep learning at scale doesn’t have to be painful. It can be fast, reliable, and something ML engineers actually want to use.

Ray as Discord's ML Foundation

Discord’s Ray platform has been a huge success. What started as scattered experimentation is now the backbone of how we run machine learning. From ad-hoc tooling to production-level orchestration support to ergonomic observability systems, engineers now have the tools they need to move quickly, monitor jobs, and build confidently on top of distributed compute.

Models that used to be blocked by infrastructure constraints now run every day. Teams across Ads, Safety, Shop, and beyond are testing ideas faster, shipping faster, and hitting real results in production — in some cases seeing massive step-change improvements in performance that would’ve been impossible on our old setup.

We’re continuing to invest in the platform by tuning performance, improving developer experience, and pushing toward the best possible version of this system for the people who rely on it. 

There’s more on the ho-Ray-zon.
