How Discord Automates ScyllaDB Clusters at Scale
Peter French · May 8, 2026 · via Discord Blog

[Image: Wumpus climbing a staircase of developer-icon blocks.]

You've been asked to stand up a brand-new database cluster — a full replica of production, running real traffic, so you can validate a new release before it touches actual data. 

You're looking at the next day and a half, and it’s lookin’ stacked: provisioning and configuring dozens of nodes, joining them to the cluster one at a time, validating replication, wiring up dual-write pipelines, and babysitting all of it because any mistake on the ninth step means starting over from scratch. While grinding through the process, you start to daydream: what if this whole ordeal took less than two hours?

We found ourselves in exactly this situation. This is the story of how we got ourselves into this mess and how we made our way out of it.

The Perl Script of Reckoning

The Persistence Infrastructure team at Discord manages all kinds of database clusters, including Elasticsearch, Postgres, and ScyllaDB. Each of these databases has its own challenges, and we operate each at a pretty large scale, so there’s a lot on the plate of our 7-person team! ScyllaDB is the distributed database that stores messages, channels, servers, and most of Discord’s user data, so naturally it’s our service with the largest scope: dozens of clusters, with hundreds of database nodes in total.

That ratio of engineers to database scale sounds somewhat manageable until you consider what managing all that infrastructure actually looks like: it’s rolling restarts after config changes and expanding clusters as traffic grows. It’s upgrading operating systems across hundreds of nodes without taking anything offline, and standing up entirely new clusters to validate new ScyllaDB releases before they touch production. None of these are fire-and-forget when you have siloed tools: they demand careful sequencing, validation, and sustained attention throughout.

For years, we automated these operations the way many teams do when traffic scales dramatically: incrementally, under pressure, and without a long-term strategy for where the tooling was headed. A Python script here, a bash script there… Our tools got the job done, but they were fragile and required significant institutional knowledge to operate safely. These scripts might’ve been considered our toolset’s final form if the operational demand had stayed constant.

Unfortunately (but also fortunately), it did not, so we decided to build something more principled: the Scylla Control Plane, or SCP!

Shadow Clusters: the Final Boss

ScyllaDB upgrades are high-stakes. At Discord's scale, we regularly encounter edge cases that simply don't appear in smaller deployments. These bugs only surface under the kind of load we run, and sometimes only once every node in the cluster has been upgraded. As we’ve operated these clusters over the years, our data layer (in particular, our data services, as mentioned in a past engineering blog post) has unlocked all kinds of powerful tooling.

One such tool is the shadow cluster: a short-lived, full replica cluster that receives the same reads and writes as our production cluster. If the shadow cluster misbehaves under real load, we catch it before it touches production data. This setup has been so valuable in catching issues that we consider it standard practice before making any change to a production cluster that may have big implications (OS, hardware, Scylla version, etc.).

Establishing a new shadow cluster manually is labor intensive, involving provisioning nodes, configuring them, joining them to the cluster, validating replication, establishing dual-write pipelines, and eventually tearing everything down. Repeat all that work for every Scylla cluster we run, and the complexity really starts to compound.

Since we were aiming to upgrade our Scylla version in a safe manner, we badly needed automation that actually worked across all our clusters, so we set out to redesign SCP with all our prior pain and experience behind us.

Lessons from the Wreckage

Before writing a single line of SCP, we aligned on what had gone wrong and what we actually needed.

The old scripts failed in three major ways: 

  • They were unsafe: they were easy to run in the wrong order or against the wrong nodes, and they had no precondition checks.
  • They were unrecoverable: any failure between steps 7 and 12 meant starting over.
  • They were hard to extend: adding a new operation often meant copying and modifying an existing script rather than composing existing pieces. 

For SCP, we had four goals:

  1. An extensible task framework: Adding a new operation should be straightforward — define the task's inputs, implement its logic, and it should work everywhere the framework works. New authors shouldn't need to understand the orchestration internals.
  2. Configurable parallelism: Some operations are safe to run on multiple nodes simultaneously, while others aren't. The framework should make it easy to express constraints like "never run this on nodes in different availability zones at the same time."
  3. Safety by default: Tasks should declare their preconditions. Transient failures should be retried automatically. State should be persisted so that an interrupted job can be resumed without redoing completed work.
  4. Incremental delivery: Ship something usable, run it on real clusters, and adjust based on what we learn.

Turns out, the last goal was the key to getting this off the ground. A framework that no one uses because it's too complex to onboard is worthless! Building SCP incrementally let us catch any usability problems early while they were still easy (and cheap) to fix, pushing us to keep investing in the tool instead of trying to build something huge and complicated right from the get-go.

How SCP Works

SCP is built around a few layered concepts: tasks, workflows, and jobs.

Tasks

A task is a single unit of work; it includes things like "drain this node," "check the repair status," or "run a cleanup." Tasks come in two flavors: node tasks operate on a single node, while cluster tasks coordinate across an entire cluster (which includes running individual node tasks across many nodes in the cluster).

Between tasks, we often need to wait for the cluster to reach a desired state before it's OK to proceed. So we introduced conditions: a special type of task that blocks execution until a criterion is satisfied. A condition verifies that it’s safe to proceed by polling Scylla's API or Prometheus metrics until either the check passes or it times out and surfaces an error.

After restarting a Scylla node, you often need to wait for compactions to settle before considering the node as back to a normal state. If you move too quickly, you’ll risk cascading pressure across the entire cluster. Without an explicit condition check, you'd either hardcode a sleep — too short and you cause problems; too long and a rolling restart across 30 nodes takes all day — or accept that your operation might fail unpredictably. Conditions make the wait explicit, observable, and tunable.
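The polling behavior described above can be sketched in a few lines. This is a minimal, synchronous illustration rather than SCP's actual code: the function name wait_for_condition and the WaitError type are hypothetical, and the success-window parameter is an assumption modeled on the success_window_seconds, poll_interval_seconds, and timeout_seconds fields that appear in the workflow YAML later in this post.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Why a wait ended without success (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum WaitError {
    TimedOut,
}

/// Polls `check` every `poll_interval` until it has returned true
/// continuously for `success_window`, or fails once `timeout` elapses.
pub fn wait_for_condition<F>(
    mut check: F,
    poll_interval: Duration,
    success_window: Duration,
    timeout: Duration,
) -> Result<(), WaitError>
where
    F: FnMut() -> bool,
{
    let started = Instant::now();
    // When the current unbroken run of passing polls began, if any.
    let mut passing_since: Option<Instant> = None;

    loop {
        if check() {
            let since = *passing_since.get_or_insert_with(Instant::now);
            if since.elapsed() >= success_window {
                return Ok(());
            }
        } else {
            // Any failed poll resets the success window.
            passing_since = None;
        }

        if started.elapsed() >= timeout {
            return Err(WaitError::TimedOut);
        }
        sleep(poll_interval);
    }
}
```

In a real condition, the closure would query Scylla's API or a Prometheus metric; here it is left abstract so the timing logic stays visible. All three durations are parameters, so the wait can be tuned without touching the check itself.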

In Rust, tasks are defined using a trait that requires three things: 

  • A name() method, describing what the task is doing. 
  • A preconditions() method that lists conditions that must be true before the task runs.
  • An execute() method that does all the work.

struct Drain;

impl ExecuteNodeTask for Drain {
    fn name(&self) -> String {
        "Drain Scylla node".into()
    }

    fn preconditions(&self) -> Vec<ConditionCheck<NodeCondition>> {
        vec![
            ConditionCheck::new_with_defaults(QuorumSafe.into()),
            ConditionCheck::new_with_defaults(ClusterNormal.into()),
        ]
    }

    async fn execute(&mut self, ctx: &NodeExecutionContext) -> TaskResult<()> {
        ctx.scylla_api().drain().await?;
        info!("Drain completed successfully");
        Ok(())
    }
}

One property we require of all tasks: idempotency. Running a task twice should produce the same result as running it once. This isn't always easy to achieve, but retrying is a key part of how the Scylla Control Plane handles failures, and idempotency is what makes those retries safe.
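One common way to get idempotency is to write a task in terms of the desired end state rather than the action. The sketch below illustrates that idea with a made-up FakeNode type standing in for a real service manager; neither the type nor the stop_scylla function comes from SCP.

```rust
/// Hypothetical stand-in for a node's service manager.
#[derive(Debug, Default, PartialEq)]
pub struct FakeNode {
    pub scylla_running: bool,
    pub stop_commands_sent: u32,
}

/// Idempotent "stop" task: it describes the desired end state
/// ("service is stopped") rather than an action ("send stop").
pub fn stop_scylla(node: &mut FakeNode) {
    if !node.scylla_running {
        // Already in the desired state, so a retry is a no-op.
        return;
    }
    node.stop_commands_sent += 1;
    node.scylla_running = false;
}
```

Calling stop_scylla a second time leaves the node exactly as the first call left it, which is the property a retry loop relies on.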

Workflows

Workflows are defined in YAML and describe a sequence of tasks along with their configuration: how many retries each task gets, whether to abort on the first failure, and how to handle parallelism.

name: Drain and restart each node in the cluster
variables:
  - name: compactions_nominal_timeout_seconds
    type: integer
    description: Seconds to wait for compactions to reach nominal levels
    default: 90
cluster_tasks:
  - task: !node-workflow
      name: Drain and restart each node
      node_tasks:
        - task: !scylla-drain
        - task: !systemd-stop-scylla-server
        - task: !systemd-start-service
            service: scylla-server
        - task: !wait-for-conditions
            conditions:
              - condition: !compactions-nominal
                success_window_seconds: 20
                poll_interval_seconds: 5
                timeout_seconds: +compactions_nominal_timeout_seconds+

YAML was a deliberate choice. We didn't want every workflow change to require a Rust recompile, and we wanted operators to be able to tune parameters (such as retry counts and concurrency limits) without requiring a full binary deploy. 

Template variables let workflows be parameterized at runtime, so you can scope a workflow to specific nodes or availability zones at invocation time without modifying the workflow definition.
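As a rough illustration of runtime parameterization, the sketch below substitutes +name+ placeholders in a string. It assumes the placeholder syntax works the way the YAML example suggests; the render function is hypothetical, it treats every value as a string (the real workflow declares typed variables), and it is not how SCP necessarily implements templating.

```rust
use std::collections::BTreeMap;

/// Replaces every `+name+` placeholder in `template` with its value.
/// Placeholders with no matching variable are left untouched.
pub fn render(template: &str, vars: &BTreeMap<&str, String>) -> String {
    let mut out = template.to_string();
    for (name, value) in vars {
        out = out.replace(&format!("+{name}+"), value);
    }
    out
}
```

A stricter version would probably reject unresolved placeholders instead of passing them through, so a typo in a variable name fails before the job starts.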

Jobs and Orchestration

A job is a single execution of a workflow, bound to a specific cluster. Jobs are the thing you monitor, resume, and refer back to.

Jobs also support targeting, or running a workflow on a subset of a cluster's nodes rather than all of them. You can target an explicit list of nodes, a specific availability zone, or omit targeting to run against all nodes in a cluster.
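The three targeting modes map naturally onto an enum. The following is only a sketch under assumed names (Node, Target, and select_targets are all hypothetical), showing how a job could narrow a cluster's node list before any task runs.

```rust
/// A cluster node with the zone it lives in (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub name: String,
    pub zone: String,
}

/// The three targeting modes: everything, one zone, or an explicit list.
pub enum Target {
    AllNodes,
    Zone(String),
    Nodes(Vec<String>),
}

/// Returns the nodes of `cluster` that a job with `target` should touch.
pub fn select_targets<'a>(cluster: &'a [Node], target: &Target) -> Vec<&'a Node> {
    cluster
        .iter()
        .filter(|node| match target {
            Target::AllNodes => true,
            Target::Zone(zone) => &node.zone == zone,
            Target::Nodes(names) => names.contains(&node.name),
        })
        .collect()
}
```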

Two parameters in the workflow YAML control how jobs run across the nodes in the cluster:

  • concurrency_unit controls how nodes are grouped for parallel execution. Setting it to zonal means nodes are batched by availability zone, and a task won't run on nodes in multiple zones simultaneously. For a cluster with replication across three zones, this prevents a scenario where simultaneous node failures in multiple zones cause quorum loss.
  • concurrency_limit caps how many nodes can be running a task at once, regardless of grouping. A limit of 1 means strictly serial execution within each batch; a limit of 3 allows up to three nodes to proceed in parallel.

Together, these two parameters let you express things like "restart nodes one zone at a time, with at most two nodes restarting concurrently within a zone" without any custom orchestration logic.
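To make the interaction between the two parameters concrete, here is a small planning sketch. It is a simplification with hypothetical names (Node, ConcurrencyUnit, plan_batches): it cuts each group into fixed batches of at most `limit` nodes, whereas a real scheduler might instead keep up to `limit` nodes in flight at all times. The Cluster variant is also an assumption, standing in for "no zone grouping".

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub name: String,
    pub zone: String,
}

/// Hypothetical mirror of the `concurrency_unit` setting.
pub enum ConcurrencyUnit {
    /// No grouping: all targeted nodes form one group.
    Cluster,
    /// Group by availability zone; zones are processed one after another.
    Zonal,
}

/// Returns batches to run in order; nodes within a batch may run in parallel.
pub fn plan_batches(nodes: &[Node], unit: &ConcurrencyUnit, limit: usize) -> Vec<Vec<Node>> {
    assert!(limit > 0, "concurrency_limit must be at least 1");

    // Step 1: split nodes into groups according to the concurrency unit.
    let mut groups: BTreeMap<String, Vec<Node>> = BTreeMap::new();
    for node in nodes {
        let key = match unit {
            ConcurrencyUnit::Cluster => String::new(),
            ConcurrencyUnit::Zonal => node.zone.clone(),
        };
        groups.entry(key).or_default().push(node.clone());
    }

    // Step 2: within each group, cap parallelism at `limit` nodes per batch.
    groups
        .into_values()
        .flat_map(|group| group.chunks(limit).map(|c| c.to_vec()).collect::<Vec<_>>())
        .collect()
}
```

With zonal grouping and a limit of 2, every batch holds at most two nodes from a single zone, which matches the "one zone at a time, at most two nodes concurrently" example above.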

Resumability

Any long-running operation across a large cluster will eventually be interrupted (e.g. a node becomes unreachable, an SSH connection times out, the engineer running the job closes their terminal). Before SCP, this interruption would mean starting over, or worse, manually reconstructing which nodes had already been touched and writing a one-off script to handle the remainder.

SCP tracks the state of every job in its own SQLite database, including which tasks have completed on which nodes, which are in progress, and which have failed. When a job is interrupted and resumed, it’s able to pick up from exactly where it left off. Completed tasks are not re-run, and tasks that were mid-execution when the interruption occurred are attempted again.

While we considered more complex state backends, the operational simplicity of a file-based database that lives alongside the binary won out. There's no external dependency to manage, the job state survives process restarts on its own, and the files themselves are small enough to inspect by hand when something goes wrong. Plus, we can always move to a distributed system down the road if we need it.
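The resume logic itself is small once per-node task status is recorded. The sketch below uses an in-memory HashMap as a stand-in for the SQLite database so that it stays dependency-free; the JobState type, its methods, and the status names are assumptions for illustration, not SCP's schema.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Pending,
    InProgress,
    Completed,
    Failed,
}

/// In-memory stand-in for the persisted job state.
#[derive(Default)]
pub struct JobState {
    // Keyed by (node name, index of the task within the workflow).
    statuses: HashMap<(String, usize), Status>,
}

impl JobState {
    pub fn record(&mut self, node: &str, task: usize, status: Status) {
        self.statuses.insert((node.to_string(), task), status);
    }

    /// Tasks with no recorded status are treated as pending.
    pub fn status(&self, node: &str, task: usize) -> Status {
        self.statuses
            .get(&(node.to_string(), task))
            .copied()
            .unwrap_or(Status::Pending)
    }

    /// Index of the first task that still needs to run on `node`,
    /// or None when every task has completed there.
    pub fn resume_point(&self, node: &str, task_count: usize) -> Option<usize> {
        (0..task_count).find(|&task| self.status(node, task) != Status::Completed)
    }
}
```

A task that was InProgress at the time of the interruption is picked up again by resume_point, which is only safe because tasks are required to be idempotent.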

Error Classification

Not all errors are equal. Ideally, a task that fails due to a transient network timeout should be reattempted, while a task that detects data corruption or an unsafe cluster state should stop immediately and notify a human.

SCP distinguishes between recoverable and unrecoverable errors. Recoverable errors trigger the retry logic configured for that task in the workflow YAML. Unrecoverable errors halt the job immediately and fire a webhook notification to a designated ops channel in Discord, tagging the operator who invoked the job.

Getting this classification right is one of the trickier parts of writing a new task. Your natural instinct might be to mark everything as recoverable and let the auto-retries handle it, but a retry loop on a genuinely broken state can cause real harm. Task authors need to understand exactly what different failure modes mean for their specific operation.
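A minimal sketch of how the two error classes could drive a retry loop follows. The TaskError and Outcome types and the run_with_retries function are hypothetical names for illustration; the real system also notifies a human on an unrecoverable error, which is omitted here.

```rust
#[derive(Debug, PartialEq)]
pub enum TaskError {
    /// Transient; safe to retry (for example, a network timeout).
    Recoverable(String),
    /// Unsafe to continue; halt the job and notify a human.
    Unrecoverable(String),
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Succeeded { attempts: u32 },
    RetriesExhausted { attempts: u32 },
    Halted { reason: String },
}

/// Runs `task`, retrying recoverable failures up to `max_retries` times.
pub fn run_with_retries<F>(mut task: F, max_retries: u32) -> Outcome
where
    F: FnMut() -> Result<(), TaskError>,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        match task() {
            Ok(()) => return Outcome::Succeeded { attempts },
            // Never retry on a state that repetition could make worse.
            Err(TaskError::Unrecoverable(reason)) => return Outcome::Halted { reason },
            Err(TaskError::Recoverable(_)) if attempts <= max_retries => continue,
            Err(TaskError::Recoverable(_)) => return Outcome::RetriesExhausted { attempts },
        }
    }
}
```

The decision of which variant to return stays with the task author, which is where the judgment described above comes in: the loop can only be as safe as the classification it is given.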

Webhook notifications turned out to matter more than we initially expected. Running a two-hour rolling restart across a 30-node cluster while trusting the system to ping you if something goes wrong is a wildly different experience from babysitting a terminal for two hours.

Scylla Control Plane in Action

Now that we’ve covered the core concepts of SCP’s design, what does using it actually look like?

Below is an example of SCP running a one-off task against a single node:

$ scyllactl node-task --node scylla-messages-stg-us-east1-b-1 scylla-drain
2026-01-28 18:29:02.441Z Condition Check (Node is quorum safe): Start
2026-01-28 18:29:02.495Z Condition Check (Cluster is normal): Start
2026-01-28 18:29:02.550Z Condition Check (Node is quorum safe): Condition passed   duration=109ms
2026-01-28 18:29:02.554Z Condition Check (Cluster is normal): Condition passed   duration=59ms
2026-01-28 18:29:02.554Z All conditions passed
2026-01-28 18:29:02.554Z Start
2026-01-28 18:29:04.813Z Drain completed successfully
2026-01-28 18:29:04.814Z Finished   duration=2.259s

Before the drain runs, SCP automatically checks that the node is quorum-safe (i.e. there are enough nodes available to serve accurate requests) and that the cluster is healthy. These checks aren't optional — they're part of the task definition and run every time, regardless of who invokes the operation.
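One way to make checks non-optional is to have the runner, not the caller, evaluate them. The sketch below is a simplified, synchronous analogue of that idea: the Task trait, the Precondition struct, and run_task are hypothetical and deliberately smaller than the async ExecuteNodeTask trait shown earlier.

```rust
/// A named check that must pass before a task may run (hypothetical).
pub struct Precondition {
    pub name: &'static str,
    pub check: Box<dyn Fn() -> bool>,
}

/// Simplified, synchronous stand-in for a task definition.
pub trait Task {
    fn name(&self) -> String;
    fn preconditions(&self) -> Vec<Precondition>;
    fn execute(&mut self) -> Result<(), String>;
}

/// Runs every precondition first and refuses to execute if any fails.
pub fn run_task(task: &mut dyn Task) -> Result<(), String> {
    for pre in task.preconditions() {
        if !(pre.check)() {
            return Err(format!("{}: precondition failed: {}", task.name(), pre.name));
        }
    }
    task.execute()
}
```

Because callers only ever go through run_task, there is no code path that reaches execute without the preconditions having been evaluated first.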

Next, we’ll query for the repair status across a cluster:

$ scyllactl cluster-task --cluster messages-prd get-repair-status
2026-01-29 18:03:05.678Z Start
repair/475ebc46-...: RUNNING (keyspaces: discord.messages; schedule: every 1h)
repair/531c5bd3-...: DONE   (keyspaces: discord, !discord.messages; schedule: once)
2026-01-29 18:03:05.693Z Finished   duration=15ms

And kicking off a full cluster workflow:

$ scyllactl job run add_nodes_to_cluster \
      --cluster=scylla-messages-prd \
      --nodes=scylla-messages-prd-us-east1-b-10,\
        scylla-messages-prd-us-east1-c-10,\
        scylla-messages-prd-us-east1-d-10

Adding Nodes Without Losing Sleep

Adding nodes to a running ScyllaDB cluster is the kind of operation that rewards careful orchestration. But when it’s done right, it runs as beautifully as a real orchestra. 

We kept Scylla’s historical limitation of joining only a single node at a time to avoid overwhelming the cluster. Retaining this limitation requires us to be a bit more particular about what we execute and which nodes we execute it against. Specifically, we join nodes into the cluster one at a time, grouped by availability zone, waiting for each node to finish bootstrapping and reach a healthy state before continuing to the next node.

The add_nodes_to_cluster workflow encodes this logic:

- task: !node-workflow
    name: Join nodes to cluster
    concurrency_unit: zonal
    concurrency_limit: 1
    node_tasks:
      - task: !send-webhook-message
          channels: [infra-doing]
          message: Node is about to join!

      - task: !salt-highstate

      - task: !wait-for-conditions
          conditions:
            - condition: !is-up-normal
                known_node: +known_node+
              success_window_seconds: 60
              poll_interval_seconds: 5
              timeout_seconds: 86400  # 24 hours

In the above example, concurrency_unit: zonal combined with concurrency_limit: 1 means nodes join strictly one at a time, never across zones simultaneously. The is-up-normal check waits up to 24 hours for the node to stabilize, with a 60-second success window ensuring it's continuously healthy, not just healthy for one poll. And since joining nodes can temporarily impact cluster availability, the webhook notifies whoever is on call before each join.

This is exactly the kind of operation that exposes whether a workflow framework is actually useful. The orchestration logic is non-trivial (zone-aware batching, per-step precondition checks, webhook notifications, retries upon failure), but in SCP, that logic lives in the workflow YAML and uses the individual tasks as composable primitives. The engineer running the expansion isn't making decisions at each step; they're watching SCP execute a well-tested workflow and trusting that the system will yell if it hits anything unexpected.

From Dread to Y(E)A(H!)ML

Since shipping SCP, we've automated many of the operations that used to require the most careful hand-holding, such as:

  • Standing up new clusters end-to-end
  • Expanding clusters
  • Rolling Ubuntu upgrades across nodes
  • Rolling restarts after config changes
  • Other common remediations, such as cycling binaries, applying scylla.yaml changes, sending SIGHUP, and running cleanups

The next major investments are in making complex multi-phase operations fully automated. Today, spinning up a shadow cluster still requires several manual steps. We're building toward a single workflow that handles the full lifecycle, including provision, configure, validate, and tear down. Similarly, cluster expansion currently joins nodes in a single group; in the future, we want to join in smaller batches and run repairs between them, so we're never far behind on repair coverage as capacity grows.

Long Live SCP 👽

Today, that daydream of running a 36-hour operation in less than two hours is a reality. Even better, most of those two hours are spent waiting for nodes to bootstrap while the engineer does something else.

That's the shift: not just faster, but substantially different. "Sustained attention for a full day" became "kick it off and check back later." All the operations we used to dread — standing up shadow clusters, expanding production clusters, rolling OS upgrades across hundreds of nodes — are now workflows we trust to run, surface problems on their own, and wait for our input when something needs a human decision.

SCP isn't done: we're still building a foundation for fully automating shadow cluster lifecycles and smarter expansion strategies, but every new workflow we add makes the next operation a little less manual.

If you like building tools that make hard operations a little less monotonous, we're hiring!

Peter French

Senior Software Engineer, Database Infrastructure
