How Discord Automates ScyllaDB Clusters at Scale

Wumpus climbing a staircase of developer-icon blocks.

You've been asked to stand up a brand-new database cluster — a full replica of production, running real traffic, so you can validate a new release before it touches actual data.

You're looking at the next day and a half, and it’s lookin’ stacked: provisioning and configuring dozens of nodes, joining them to the cluster one at a time, validating replication, wiring up dual-write pipelines, and babysitting the whole thing because any mistake on the ninth step means starting the whole thing over from scratch. While grinding through the whole process, you start to daydream: what if this whole ordeal took less than two hours?

We found ourselves in exactly this situation. This is the story of how we got ourselves into this mess and how we made our way out of it.

The Perl Script of Reckoning

The Persistence Infrastructure team at Discord manages all kinds of database clusters, including Elasticsearch, Postgres, and ScyllaDB. Each of these databases has its own challenges, and we operate each at a pretty large scale, so there’s a lot on the plate of our 7-person team! ScyllaDB is the distributed database that stores messages, channels, servers, and most of Discord’s user data, so naturally it’s our service with the largest scope: dozens of clusters, with hundreds of database nodes in total.

That ratio of engineers to database scale sounds somewhat manageable until you consider what managing all that infrastructure actually looks like: it’s rolling restarts after config changes and expanding clusters as traffic grows. It’s upgrading operating systems across hundreds of nodes without taking anything offline, and standing up entirely new clusters to validate new ScyllaDB releases before they touch production. None of these are fire-and-forget when you have siloed tools: they demand careful sequencing, validation, and sustained attention throughout.

For years, we automated these operations the way many teams do when traffic scales dramatically: incrementally, under pressure, and without requiring a long-term strategy for where the tooling was headed. A Python script here, a bash script there… Our tools got the job done, but they were fragile and required significant institutional knowledge to operate safely. These scripts might’ve been considered our toolset’s final form if the operational demand had stayed constant.

Unfortunately (but also fortunately), it did not, so we decided to build something more principled: the Scylla Control Plane, or SCP!

Shadow Clusters: the Final Boss

ScyllaDB upgrades are high-stakes. At Discord's scale, we regularly encounter edge cases that simply don't appear in smaller deployments. They’re the kind of bugs that only surface under the kind of load we run, and sometimes they only show up once every node in the cluster has been upgraded. As we’ve operated these clusters over the years, our data layer (in particular, our data services as mentioned in a past engineering blog) has unlocked all kinds of powerful tooling.

One such tool is our shadow clusters: a short-lived, full replica cluster that receives, reads, and writes the same data as our production traffic. If the shadow cluster misbehaves under real load, we catch it before it touches production data. This setup has been so valuable in catching issues that we consider them standard practice before changing anything about our production cluster that may have big implications (OS, hardware, Scylla version, etc).

Establishing a new shadow cluster manually is labor intensive, involving provisioning nodes, configuring them, joining them to the cluster, validating replication, establishing dual-write pipelines, and eventually tearing everything down. Repeat all that work for every Scylla cluster we run, and the complexity really starts to compound.

Since we were aiming to upgrade our Scylla version in a safe manner, we badly needed automation that actually worked across all our clusters, so we set out to redesign SCP with all our prior pain and experience behind us.

Lessons from the Wreckage

Before writing a single line of SCP, we aligned on what had gone wrong and what we actually needed.

The old scripts failed in three major ways:

They were unsafe: meaning they were easy to run in the wrong order, against the wrong nodes, and with no precondition checks.
They were unrecoverable: any failure between steps 7 to 12 meant starting over.
They were hard to extend: adding a new operation often meant copying and modifying an existing script rather than composing existing pieces.

For SCP, we had four goals:

An extensible task framework: Adding a new operation should be straightforward — define the task's inputs, implement its logic, and it should work everywhere the framework works. New authors shouldn't need to understand the orchestration internals.
Configurable parallelism: Some operations are safe to run on multiple nodes simultaneously, while others aren't. The framework should make it easy to express constraints like "never run this on nodes in different availability zones at the same time."
Safety by default: Tasks should declare their preconditions. Transient failures should be retried automatically. State should be persisted, that way an interrupted job can be resumed without redoing completed work.
Incremental delivery: Ship something usable, run it on real clusters, and adjust based on what we learn.

Turns out, the last goal was the key to getting this off the ground. A framework that no one uses because it's too complex to onboard is worthless! Building SCP incrementally let us catch any usability problems early while they were still easy (and cheap) to fix, pushing us to keep investing in the tool instead of trying to build something huge and complicated right from the get-go.

How SCP Works

SCP is built around a few layered concepts: tasks, workflows, and jobs.

Tasks

A task is a single unit of work; it includes things like "drain this node," "check the repair status," or "run a cleanup." Tasks come in two flavors: node tasks operate on a single node, while cluster tasks coordinate across an entire cluster (which includes running individual node tasks across many nodes in the cluster).

Between tasks, we often need to wait for the cluster to reach a desired state before it's OK to proceed. So, we establish some conditions: a special type of task that blocks execution until a criterion is satisfied. It verifies whether or not it’s safe to proceed by polling Scylla's API or Prometheus metrics until either the check passes, or it times out and surfaces an error.

After restarting a Scylla node, you often need to wait for compactions to settle before considering the node as back to a normal state. If you move too quickly, you’ll risk cascading pressure across the entire cluster. Without an explicit condition check, you'd either hardcode a sleep — too short and you cause problems; too long and a rolling restart across 30 nodes takes all day — or accept that your operation might fail unpredictably. Conditions make the wait explicit, observable, and tunable.

In Rust, tasks are defined using a trait that requires three things:

A name() method, describing what the task is doing.
A preconditions() method that lists conditions that must be true before the task runs.
An execute() method that does all the work.

struct Drain;

impl ExecuteNodeTask for Drain {
    fn name(&self) -> String {
        "Drain Scylla node".into()
    }

    fn preconditions(&self) -> Vec<ConditionCheck<NodeCondition>> {
        vec![
            ConditionCheck::new_with_defaults(QuorumSafe.into()),
            ConditionCheck::new_with_defaults(ClusterNormal.into()),
        ]
    }

    async fn execute(&mut self, ctx: &NodeExecutionContext) -> TaskResult<()> {
        ctx.scylla_api().drain().await?;
        info!("Drain completed successfully");
        Ok(())
    }
}

One property we require of all tasks: idempotency. Running a task twice should produce the same result as running it once. This isn't always easy to achieve, but retrying is a key part of how the Scylla Control Plane handles failures, therefore idempotency is required to make retries safe.

Workflows

Workflows are defined in YAML and describe a sequence of tasks, along with their configuration; it details how many retries each task gets, whether to abort on the first failure, and how to handle parallelism.

name: Drain and restart each node in the cluster
variables:
  - name: compactions_nominal_timeout_seconds
    type: integer
    description: Seconds to wait for compactions to reach nominal levels
    default: 90
cluster_tasks:
  - task: !node-workflow
      name: Drain and restart each node
      node_tasks:
        - task: !scylla-drain
        - task: !systemd-stop-scylla-server
        - task: !systemd-start-service
            service: scylla-server
        - task: !wait-for-conditions
            conditions:
              - condition: !compactions-nominal
                success_window_seconds: 20
                poll_interval_seconds: 5
                timeout_seconds: +compactions_nominal_timeout_seconds+

YAML was a deliberate choice. We didn't want every workflow change to require a Rust recompile, and we wanted operators to be able to tune parameters (such as retry counts and concurrency limits) without requiring a full binary deploy.

Template variables let workflows be parameterized at runtime, so you can scope a workflow to specific nodes or availability zones at invocation time without modifying the workflow definition.

Jobs and Orchestration

A job is a single execution of a workflow, bound to a specific cluster. Jobs are the thing you monitor, resume, and refer back to.

Jobs also support targeting, or running a workflow on a subset of a cluster's nodes rather than all of them. You can target an explicit list of nodes, a specific availability zone, or omit targeting to run against all nodes in a cluster.

Two parameters in the workflow YAML control how jobs run across the nodes in the cluster:

concurrency_unit controls how nodes are grouped for parallel execution. Setting it to zone means nodes are batched by availability zone, and a task won't run on nodes in multiple zones simultaneously. For a cluster with replication across three zones, this prevents a scenario where simultaneous node failures in multiple zones cause quorum loss.
concurrency_limit caps how many nodes can be running a task at once, regardless of grouping. A limit of 1 means strictly serial execution within each batch; a limit of 3 allows up to three nodes to proceed in parallel.

Together, these two parameters let you express things like "restart nodes one zone at a time, with at most two nodes restarting concurrently within a zone" without any custom orchestration logic.

Resumability

Any long-running operation across a large cluster will eventually be interrupted (e.g. a node becomes unreachable, an SSH connection times out, the engineer running the job closes their terminal). Before SCP, this interruption would mean starting over, or worse, manually reconstructing which nodes had already been touched and writing a one-off script to handle the remainder.

SCP tracks the state of every job in its own SQLite database, including which tasks have completed on which nodes, which are in progress, and which have failed. When a job is interrupted and resumed, it’s able to pick up from exactly where it left off. Completed tasks are not re-run, and tasks that were mid-execution when the interruption occurred are attempted again.

While we considered more complex state backends, the operational simplicity of a file-based database that lives alongside the binary won out. There's no external dependency to manage, the job state survives the process and restarts on its own, and the files themselves are small enough to inspect by hand when something goes wrong. Plus, we can always move to a distributed system down the road if we need it.

Error Classification

Not all errors are equal. Ideally, a task that fails due to a transient network timeout should be reattempted, while a task that detects data corruption or an unsafe cluster state should stop immediately and notify a human.

SCP distinguishes between recoverable and unrecoverable errors. Recoverable errors trigger the retry logic configured for that task in the workflow YAML. Unrecoverable errors halt the job immediately and fire a webhook notification to a designated ops channel in Discord, tagging the operator who invoked the job.

Getting this classification right is one of the trickier parts of writing a new task. Your natural instinct might be to mark everything as recoverable and let the auto-retries handle it, but a retry loop on a genuinely broken state can cause real harm. Task authors need to understand exactly what different failure modes mean for their specific operation.

Webhook notifications turned out to matter more than we initially expected. It turns out that running a two-hour rolling restart across a 30-node cluster while trusting the system to ping you if something goes wrong, is a wildly different experience than babysitting a terminal for two hours.

Scylla Control Plane in Action

Now that we’ve covered the core concepts of SCP’s design, what does using it actually look like?

Below is an example of the SDC running a one-off task against a single node:

$ scyllactl node-task --node scylla-messages-stg-us-east1-b-1 scylla-drain
2026-01-28 18:29:02.441Z Condition Check (Node is quorum safe): Start
2026-01-28 18:29:02.495Z Condition Check (Cluster is normal): Start
2026-01-28 18:29:02.550Z Condition Check (Node is quorum safe): Condition passed   duration=109ms
2026-01-28 18:29:02.554Z Condition Check (Cluster is normal): Condition passed   duration=59ms
2026-01-28 18:29:02.554Z All conditions passed
2026-01-28 18:29:02.554Z Start
2026-01-28 18:29:04.813Z Drain completed successfully
2026-01-28 18:29:04.814Z Finished   duration=2.259s

Before the drain runs, SCP automatically checks that the node is quorum-safe (i.e. there are enough nodes available to serve accurate requests) and that the cluster is healthy. These checks aren't optional — they're part of the task definition and run every time, regardless of who invokes the operation.

Next, we’ll query for the repair status across a cluster:

$ scyllactl cluster-task --cluster messages-prd get-repair-status
2026-01-29 18:03:05.678Z Start
repair/475ebc46-...: RUNNING (keyspaces: discord.messages; schedule: every 1h)
repair/531c5bd3-...: DONE   (keyspaces: discord, !discord.messages; schedule: once)
2026-01-29 18:03:05.693Z Finished   duration=15ms

And kicking off a full cluster workflow:

$ scyllactl job run add_nodes_to_cluster \
      --cluster=scylla-messages-prd \
      --nodes=scylla-messages-prd-us-east1-b-10,\
        scylla-messages-prd-us-east1-c-10,\
        scylla-messages-prd-us-east1-d-10

Adding Nodes Without Losing Sleep

Adding nodes to a running ScyllaDB cluster is the kind of operation that rewards careful orchestration. But when it’s done right, it runs as beautifully as a real orchestra.

We kept Scylla’s historical limitation of only joining a single node at once to avoid overwhelming the cluster. Retaining this limitation requires us to be a bit more particular about what we execute, and against which nodes we execute on. Specifically, we join nodes into the cluster one at a time, grouped by availability zone, waiting for each node to finish bootstrapping and reach a healthy state before continuing to the next node.

The add_nodes_to_cluster workflow encodes this logic:

- task: !node-workflow
    name: Join nodes to cluster
    concurrency_unit: zonal
    concurrency_limit: 1
    node_tasks:
      - task: !send-webhook-message
          channels: [infra-doing]
          message: Node is about to join!

      - task: !salt-highstate

      - task: !wait-for-conditions
          conditions:
            - condition: !is-up-normal
                known_node: +known_node+
              success_window_seconds: 60
              poll_interval_seconds: 5
              timeout_seconds: 86400  # 24 hours

In the above example, concurrency_unit: zonal combined with concurrency_limit: 1 means nodes join strictly one at a time, never across zones simultaneously. The is-up-normal check waits up to 24 hours for the node to stabilize, with a 60-second success window ensuring it's continuously healthy, not just healthy for one poll. And since joining nodes can temporarily impact cluster availability, the webhook notifies whoever is on-call before each join..

This is exactly the kind of operation that exposes whether a workflow framework is actually useful. The orchestration logic is non-trivial —zone-aware batching, per-step precondition checks, webhook notifications, retries upon failures — but in SCP, that logic lives in the workflow YAML and uses the individual tasks as composable primitives to execute operations. The engineer running the expansion isn't making decisions at each step; they're watching SCP execute a well-tested workflow and trusting that the system will yell if it hits anything unexpected.

From Dread to Y(E)A(H!)ML

Since shipping SCP, we've automated many of the operations that used to require the most careful hand-holding, such as:

Standing up new clusters end-to-end
Expanding clusters
Rolling Ubuntu upgrades across nodes
Rolling restarts after config changes
Other common remediations, such as cycling binaries, applying scylla.yaml changes, sending SIGHUP, and running cleanups

The next major investments are in making complex multi-phase operations fully automated. Today, spinning up a shadow cluster still requires several manual steps. We're building toward a single workflow that handles the full lifecycle, including provision, configure, validate, and tear down. Similarly, cluster expansion currently joins nodes in a single group; in the future, we want to join in smaller batches and run repairs between them, so we're never far behind on repair coverage as capacity grows.

Long Live SCP 👽

Today, that daydream of running a 36-hour operation in less than two hours is now a reality. And even better, most of those two hours are waiting for nodes to bootstrap while the engineer does something else.

That's the shift: not just faster, but substantially different. "Sustained attention for a full day" became "kick it off and check back later." All the operations we used to dread — standing up shadow clusters, expanding production clusters, rolling OS upgrades across hundreds of nodes — are now workflows we trust to run, surface problems on their own, and wait for our input when something needs a human decision.

SCP isn't done: we're still building a foundation for fully automating shadow cluster lifecycles and smarter expansion strategies, but every new workflow we add makes the next operation a little less manual.

If you like building tools that make hard operations a little less monotonous, we're hiring!

Peter French

Senior Software Engineer, Database Infrastructure

推荐订阅源

Discord Blog