
At Discord, we take pride in making data-driven decisions to deliver a great experience for users around the world. As our platform and user base have grown over the years, so have the demands on our data orchestration system. 

Until recently, we’ve been using Derived, an in-house orchestration system that’s provided the foundation for Discord’s data analytics over the last five years. As our data organization grew, it became apparent that both self-service and top-notch observability would be key for our ability to effectively scale as a team. 

To continue delivering seamless service and insightful data analytics, we embraced an ambitious project: to overhaul our data orchestration infrastructure using modern, open-source tools.

Keep reading to learn about how we embarked on this journey, the candid lessons we learned along the way, and how our new system is powering over 2000 dbt tables today.

Reflecting on Derived

Derived was originally engineered in-house to fulfill our requirements when we had a smaller user base and a more manageable data volume. It played its part well during our earlier days, but our flexibility and observability requirements have substantially increased over time. Similarly, where we previously relied on software engineers to manage the system, we now want to foster greater self-service and a more user-friendly design.

While Derived was instrumental in providing advanced features and setting a foundation of expectations for data transformation systems at Discord, it missed the mark in offering usability and flexibility. We had outgrown our system, which led us to the next iteration of our data transformation journey.

If you’d like to learn about Derived and how it worked, check out our previous blog post here.

Dagster & dbt: a match made in heaven

There’s been a lot of innovation in the data orchestration space since Airflow, an orchestration platform created by Airbnb, was open-sourced back in 2015. 

Today, a Google search for “open source data orchestration tool” will net you things like Argo, Prefect, Dagster, Kestra, and Mage to name a few. On the modeling side, you’ll find tools like dbt, Coalesce, and SQLMesh. The breadth of functionality around dbt made it a straightforward pick for our data modeling tool. However, our team had to spend a bit of extra time to find the right data orchestrator that would help solve the pain points that both our customers and our team were experiencing with Derived. 

There were a few key criteria we felt were imperative during our search:

  1. Declarative automation: we were convinced this would be a necessary component of our self-service model, enabling the kinds of flexibility our users were accustomed to in our old system.
  2. A modern UI that provided a “single pane of glass” for our data engineers and data scientists. In an ideal world, this would allow for total data asset self-service, from observability to operations.
  3. Reliability and scalability: running orchestration workloads on Kubernetes is tried and true, and we felt strongly that any serious contender needed to work with Kubernetes to be considered.
  4. Integration with existing tooling: how quickly and easily could our existing Airflow jobs, CI/CD scaffolds, and data quality solutions be migrated over without too much disruption?

Ultimately, our team landed on a combination of Dagster and dbt.

While Dagster was a newer kid on the block and less battle-proven than Airflow, it hit the mark on our four criteria above: it provided out-of-the-box support for deployment and execution on Kubernetes, had built-in support for declarative automation, and offered a UI that allowed data producers and consumers to quickly understand the state of their data assets. Plus, its Airflow integration and Python APIs meant migrating existing jobs over would be less of a burden.

Although it wasn't part of our initial requirements, we were pleasantly surprised by how straightforward it was to run Dagster locally. Our developers and pilot testers were able to create a mock environment locally that enabled them to get a good sense of how the Dagster API functioned and how our use case would fit into it. 

There is some inherent risk with betting on newer technologies, but Discord is no stranger to moving fast and leveraging the bleeding edge. Dagster’s openness to work with us and build out new functionality to handle our scale gave us the confidence to ultimately move forward and break ground on the new system.

Breaking ground on our new data transformation system

Building out our new system was a journey — one that had a healthy balance of both technical challenges and “aha” moments as things “just worked”. 

Thanks to Dagster’s out-of-the-box support for deploying to Kubernetes, we were able to get things running quickly. Integrating dbt with Dagster using software-defined assets felt natural and quickly became a standard part of our team’s nomenclature. Off the bat, our team made the crucial decision to use dbt mainly for SQL templating, managing data quality tests, configuring asset metadata, and executing queries; Dagster would be the true “brain” behind the orchestration, responsible for making sure everything runs in the correct order.
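As a rough sketch of what that division of labor looks like with dagster-dbt (the project path and names here are illustrative, not our actual configuration):

```python
# Illustrative sketch of loading dbt models as Dagster software-defined assets.
# The project path and asset function name are made up for this example.
from pathlib import Path

from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, DbtProject, dbt_assets

dbt_project = DbtProject(project_dir=Path(__file__).parent / "analytics_dbt")
dbt_project.prepare_if_dev()  # builds the dbt manifest during local development

@dbt_assets(manifest=dbt_project.manifest_path)
def analytics_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # dbt handles SQL templating, tests, and query execution; Dagster decides
    # what to run and when, streaming each model/test result back as events.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[analytics_dbt_assets],
    resources={"dbt": DbtCliResource(project_dir=dbt_project)},
)
```

Each dbt model shows up as its own asset in Dagster, which is what makes the rest of the orchestration and observability story possible.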

We decided to schedule our entire DAG using Dagster's declarative automation mechanism, triggered by scheduled runs that monitored our raw data layer. The declarative nature of this scheduling allows our data producers to easily create dependencies between wildly varying partition definitions (think “hourly → daily”) without having to implement custom logic. 
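In recent Dagster APIs, that hourly-to-daily relationship looks roughly like the sketch below (asset names are illustrative, not our actual tables):

```python
# Illustrative sketch: a daily table depending on an hourly upstream, scheduled
# declaratively instead of with cron. Asset names are made up.
import dagster as dg

hourly = dg.HourlyPartitionsDefinition(start_date="2024-01-01-00:00")
daily = dg.DailyPartitionsDefinition(start_date="2024-01-01")

@dg.asset(partitions_def=hourly)
def staging_hourly_table() -> None:
    ...  # load the latest hour of raw data

@dg.asset(
    partitions_def=daily,
    # Dagster maps each daily partition onto the hourly partitions it covers.
    deps=[dg.AssetDep(staging_hourly_table, partition_mapping=dg.TimeWindowPartitionMapping())],
    # Materialize each day once its upstream hours have landed; no cron needed.
    automation_condition=dg.AutomationCondition.eager(),
)
def int_daily_table() -> None:
    ...  # aggregate the day's 24 hourly partitions
```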

Learn more about declarative automation and how it differs from cron-based scheduling here.

With these decisions in place, we focused on building out a minimum lovable product to enable our data engineering crew to begin modeling our new data architecture in parallel as soon as possible. This allowed for fast feedback on the new mechanisms we were building into the system, including components such as custom partition mappings, dbt test execution, and custom dbt model configurations. 
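As one example of the last of these, custom dbt model configuration can be surfaced to Dagster through a translator; the `meta` keys in this sketch are hypothetical, not the ones we actually use:

```python
# Hedged sketch: mapping custom keys from a dbt model's `meta` block onto
# Dagster asset properties. The "group" and "owner_team" keys are hypothetical.
from typing import Any, Mapping, Optional

from dagster_dbt import DagsterDbtTranslator

class CustomDbtTranslator(DagsterDbtTranslator):
    def get_group_name(self, dbt_resource_props: Mapping[str, Any]) -> Optional[str]:
        # Group assets in the Dagster UI by a team name declared in dbt `meta`.
        meta = dbt_resource_props.get("meta") or {}
        return meta.get("group") or super().get_group_name(dbt_resource_props)

    def get_metadata(self, dbt_resource_props: Mapping[str, Any]) -> Mapping[str, Any]:
        # Attach an owning team to every dbt-backed asset's metadata.
        meta = dbt_resource_props.get("meta") or {}
        return {
            **super().get_metadata(dbt_resource_props),
            "owner_team": meta.get("owner_team", "unknown"),
        }
```

The translator is passed to the `@dbt_assets` decorator so every model picks up the extra configuration automatically.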

It took a couple of tries to get right from both a user interface and functionality perspective, but ultimately we engineered a system that combined the advanced modeling capabilities of dbt with the cutting-edge scheduling and out-of-the-box visibility provided by Dagster.

A data lineage diagram illustrating how multiple daily tables are derived from a single hourly source. The central 'staging_hourly_table' feeds into three 'int_daily_table' nodes, each representing a daily aggregation. This structure highlights how higher-frequency (hourly) data can be scheduled with lower-frequency (daily) tables using declarative automation.
Hourly data maps to daily models with ease, requiring no additional effort from data producers

We eventually realized that we could leverage Dagster Labs' Hybrid Cloud offering, Dagster+, to enable our small and mighty team to move faster. This decision offloaded a lot of the time we spent on correctly configuring infrastructure and debugging behind-the-scenes issues, letting us focus on what mattered most: data orchestration. 

Plus, features included in Dagster+, such as SSO and branch deployments, gave our new system a final layer of polish that enhanced both our productivity and the overall quality of our data workflows.

Lessons learned

While we were getting our new systems established, there were a few issues we identified early on that we knew would have to be solved before we could enable the system in production. For one, dbt did not support parallelism well.

For incremental models, dbt stores data in a temporary table before merging the processed partitions into their production location. This created a race condition when multiple instances of dbt run were initiated for the same model, as each invocation would try to delete the temporary table once it completed. We solved this by adjusting dbt’s logic for storing temporary data, which enabled us to run multiple partitions of the same asset in parallel.

Second, backfilling assets partition-by-partition did not play well with our data warehouse (BigQuery) and led to extremely lengthy backfill times. We worked closely with the Dagster team to push an open-source commit, which resulted in us being able to configure how many partitions could be backfilled at once for each asset.
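In today’s open-source Dagster, that kind of per-asset batching is expressed with a backfill policy; the table name and batch size below are illustrative, not our exact configuration:

```python
# Illustrative sketch: batching backfill partitions per asset so a large
# BigQuery backfill doesn't fan out into thousands of single-partition runs.
import dagster as dg

@dg.asset(
    partitions_def=dg.DailyPartitionsDefinition(start_date="2023-01-01"),
    # Process up to 30 partitions in a single run during backfills.
    backfill_policy=dg.BackfillPolicy.multi_run(max_partitions_per_run=30),
)
def fct_messages_daily() -> None:
    ...  # rebuild the selected date range in one warehouse query
```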

One more challenge we faced stemmed from a constraint that we felt was critical for the high bar of data quality that our data consumers expect: atomicity and data consistency. In essence, the code version for an asset needed to remain consistent across partitions, even while backfilling. While not supported out of the box, Dagster provides a flexible GraphQL interface which, in combination with a series of sensors and jobs, enabled us to bring this functionality to life!
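As a heavily simplified illustration of the GraphQL piece (the endpoint, query shape, and code-version tag here are assumptions for the example, not our actual implementation), a helper might check for in-flight runs launched from a different code version before kicking off more work:

```python
# Hedged illustration: querying Dagster's GraphQL API for in-flight runs and
# comparing a hypothetical "code_version" run tag. Endpoint and tag are made up.
import requests

DAGSTER_GRAPHQL_URL = "https://example.dagster.cloud/prod/graphql"  # hypothetical

RUNS_QUERY = """
query InFlightRuns {
  runsOrError(filter: {statuses: [QUEUED, STARTED]}) {
    ... on Runs {
      results {
        runId
        tags { key value }
      }
    }
  }
}
"""

def conflicting_code_versions(expected_version: str) -> list[str]:
    """Return run IDs whose (hypothetical) code_version tag differs from the
    version we intend to backfill with, so the caller can pause before launching."""
    resp = requests.post(DAGSTER_GRAPHQL_URL, json={"query": RUNS_QUERY}, timeout=30)
    resp.raise_for_status()
    runs = resp.json()["data"]["runsOrError"]["results"]
    conflicts = []
    for run in runs:
        tags = {t["key"]: t["value"] for t in run["tags"]}
        if tags.get("code_version") not in (None, expected_version):
            conflicts.append(run["runId"])
    return conflicts
```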

How users are benefiting from the new system

Once the core pieces were in place, it didn’t take long for our data teams to benefit from the new tools at their disposal. For one, answering the very simple, yet important, question “Why isn’t my data asset updating?” is now a self-serve, at-a-glance feature:

A table titled 'Evaluation metadata' displaying two rows of dependency information. The first row shows 'waiting_on_ancestor_1' corresponding to 'core / dim_users_growth_accounting'. The second row shows 'waiting_on_ancestor_2' corresponding to 'core / dim_users_hfu'. This metadata indicates that an asset's materialization is pending completion of these two ancestor assets in the Dagster pipeline.
The automation tab indicates exactly why or why not a given asset is being queued up for materialization

Empowered by Dagster's asset definition pages, asset owners now seamlessly manage the lifecycles of their data assets, from backfilling to incremental updates. This interface provides comprehensive, real-time insights into every activity related to an asset. One asset owner has been quoted as enjoying the UI so much that they feel confident doing things like “launching backfills from my phone.” (We don’t recommend trying this at home.)

A screenshot of Dagster's data lineage view, showcasing interconnected data assets. The view displays four tables: table_a, table_b, table_c, and table_d, with arrows indicating dependencies. Tables a, b, and c are marked as 'Materialized' with timestamps, while table_d is currently 'Materializing'. The image highlights Dagster's ability to visualize data freshness and dependencies in a data pipeline.
The lineage view allows anyone to quickly determine table landing times and identify blockages that prevent downstream execution.

Data quality is at the heart of this new system — our engineers can now write point-in-time quality checks that can be tuned to “warn,” or even “block,” downstream runs on failure. This allows us to quickly catch, alert on, and fix issues before they impact critical downstream use cases like company dashboards. We alert our table owners by utilizing a notification system called DAN (Data Asset Notifications) that informs users of table failures via a Discord app. (Who uses email nowadays?)
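One way to express such a check in Dagster’s Python API looks roughly like this (the asset and row-count logic are made up; many of our checks also live in dbt as tests):

```python
# Illustrative sketch of a point-in-time quality check; the asset and the
# row-count logic are made up for this example.
import dagster as dg

@dg.asset
def dim_users() -> None:
    ...  # build the users dimension table

@dg.asset_check(asset=dim_users, blocking=True)
def dim_users_not_empty() -> dg.AssetCheckResult:
    row_count = 1_000  # in practice, query the warehouse for the latest partition
    return dg.AssetCheckResult(
        passed=row_count > 0,
        # ERROR on a blocking check halts downstream materializations;
        # WARN would raise an alert without blocking.
        severity=dg.AssetCheckSeverity.ERROR,
        metadata={"row_count": row_count},
    )
```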

On the dbt side, we’ve been able to standardize complex metric calculations using macros, which has played a key role in removing discrepancies across the business and streamlined the way data practitioners are transforming and consuming data. 

To boost developer productivity, we created an internal suite of custom dbt CLI commands. One of these, dubbed autogen-schema, automatically generates the boilerplate dbt YAML files a new model requires, which can be verbose to create from scratch.
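autogen-schema is internal to Discord, but conceptually it does something like this hypothetical sketch:

```python
# Hypothetical sketch of an "autogen-schema"-style helper; this is not our
# internal implementation, just an illustration of the boilerplate it removes.
import sys
from pathlib import Path

import yaml

def autogen_schema(model_sql_path: str) -> Path:
    """Write a minimal <model>.yml stub next to a dbt model's .sql file."""
    sql_path = Path(model_sql_path)
    stub = {
        "version": 2,
        "models": [
            {
                "name": sql_path.stem,
                "description": "TODO: describe this model",
                "columns": [],  # filled in by the author or from the warehouse schema
            }
        ],
    }
    yml_path = sql_path.with_suffix(".yml")
    yml_path.write_text(yaml.safe_dump(stub, sort_keys=False))
    return yml_path

if __name__ == "__main__":
    print(f"wrote {autogen_schema(sys.argv[1])}")
```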

We also implemented a robust CI/CD process to prevent disruptive changes across table logic, macros, dbt tests, and more. Our advanced dbt table configurations and custom materializations are tailored to meet business demands while effortlessly integrating with our Dagster orchestration system and maintaining parity with the previous Derived system. 

Last but not least, we were able to leverage and contribute to the wide range of dbt packages, such as great-expectations and elementary, to quickly enable new features and functionality in our project.

Where we’ve arrived post-Derived

Today, our new system (internally referred to as “Transformation 2.0,” or “T2”) powers over 2000 dbt tables, covered by over 12000 dbt tests. 

On a typical day, we see roughly 4,000 materializations automatically triggered across both hourly and daily assets. Our migration effort has churned through petabytes of data as we’ve moved off of Derived and onto Transformation 2.0.

A timeline visualization from the Dagster UI 'overview' page, showing multiple scheduled runs across a deployment. The chart uses color-coded bars to represent different run statuses and durations. Predominantly green bars indicate successful runs, with occasional red sections for failures. Blue bars represent ongoing or queued runs. The timeline is densely populated, suggesting frequent and regular job executions.
A typical 6-hour overview of our system in its current state

As Dagster continues to evolve, our data producers and consumers have been able to take advantage of new functionality, including column-level lineage and other catalog-like features that enhance data discovery. We have strong conviction that Dagster will be the heart of many of Discord’s future Data Platform developments, and hope to soon open up the platform to be the place for orchestration at Discord. We greatly appreciate our partners over at Dagster, and we’re looking forward to staying on the cutting edge of orchestration with their support!

If you’re interested in working on technologies like those mentioned today (without launching backfills from your phone), be sure to check out our open roles on Discord’s mighty Data Platform team and others at our jobs page.
