How SCP Works
SCP is built around a few layered concepts: tasks, workflows, and jobs.
Tasks
A task is a single unit of work: things like "drain this node," "check the repair status," or "run a cleanup." Tasks come in two flavors: node tasks operate on a single node, while cluster tasks coordinate across an entire cluster (which can include running individual node tasks across many nodes in the cluster).
Between tasks, we often need to wait for the cluster to reach a desired state before it's OK to proceed. So we introduce conditions: a special type of task that blocks execution until a criterion is satisfied. A condition verifies that it's safe to proceed by polling Scylla's API or Prometheus metrics until either the check passes or it times out and surfaces an error.
After restarting a Scylla node, you often need to wait for compactions to settle before considering the node back to normal. If you move too quickly, you risk cascading pressure across the entire cluster. Without an explicit condition check, you'd either hardcode a sleep (too short and you cause problems; too long and a rolling restart across 30 nodes takes all day) or accept that your operation might fail unpredictably. Conditions make the wait explicit, observable, and tunable.
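At its core, a condition check is a bounded polling loop. Here's a minimal sketch of the idea; the function name, timing parameters, and error handling are illustrative, not SCP's actual API:

```rust
use std::time::{Duration, Instant};

/// Poll `check` until it returns true, or until `timeout` elapses.
/// A real condition would hit Scylla's API or Prometheus inside `check`.
fn wait_for_condition<F>(
    mut check: F,
    poll_interval: Duration,
    timeout: Duration,
) -> Result<(), String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err("condition not satisfied before timeout".into());
        }
        std::thread::sleep(poll_interval);
    }
}

fn main() {
    // Pretend compactions settle after the third poll.
    let mut polls = 0;
    let result = wait_for_condition(
        || {
            polls += 1;
            polls >= 3
        },
        Duration::from_millis(10),
        Duration::from_secs(1),
    );
    assert!(result.is_ok());
    println!("condition satisfied after {polls} polls");
}
```

SCP's real checks also apply a success window (the condition must hold for a stretch of time, not just for one poll), which the sketch omits.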
In Rust, tasks are defined using a trait that requires three things:
- A name() method, describing what the task is doing.
- A preconditions() method that lists conditions that must be true before the task runs.
- An execute() method that does all the work.
```rust
struct Drain;

impl ExecuteNodeTask for Drain {
    fn name(&self) -> String {
        "Drain Scylla node".into()
    }

    fn preconditions(&self) -> Vec<ConditionCheck<NodeCondition>> {
        vec![
            ConditionCheck::new_with_defaults(QuorumSafe.into()),
            ConditionCheck::new_with_defaults(ClusterNormal.into()),
        ]
    }

    async fn execute(&mut self, ctx: &NodeExecutionContext) -> TaskResult<()> {
        ctx.scylla_api().drain().await?;
        info!("Drain completed successfully");
        Ok(())
    }
}
```
One property we require of all tasks: idempotency. Running a task twice should produce the same result as running it once. This isn't always easy to achieve, but retrying is a key part of how the Scylla Control Plane handles failures, and idempotency is what makes retries safe.
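A common way to get idempotency is to check the current state before acting, so a retry that lands on already-completed work becomes a no-op. A toy sketch of the pattern; the ServiceState type and stop_service function are hypothetical, not SCP's code:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ServiceState {
    Running,
    Stopped,
}

/// Idempotent "stop the service" task: if the service is already stopped,
/// a retry simply observes that and succeeds without doing anything.
/// Returns whether any work was actually performed.
fn stop_service(state: &mut ServiceState) -> Result<bool, String> {
    if *state == ServiceState::Stopped {
        return Ok(false); // already done; safe to call again
    }
    *state = ServiceState::Stopped; // a real task would call systemd here
    Ok(true)
}

fn main() {
    let mut state = ServiceState::Running;
    assert_eq!(stop_service(&mut state), Ok(true));  // first run does the work
    assert_eq!(stop_service(&mut state), Ok(false)); // retry is a no-op
    println!("state after retries: {state:?}");
}
```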
Workflows
Workflows are defined in YAML and describe a sequence of tasks along with their configuration: how many retries each task gets, whether to abort on the first failure, and how to handle parallelism.
```yaml
name: Drain and restart each node in the cluster
variables:
  - name: compactions_nominal_timeout_seconds
    type: integer
    description: Seconds to wait for compactions to reach nominal levels
    default: 90
cluster_tasks:
  - task: !node-workflow
    name: Drain and restart each node
    node_tasks:
      - task: !scylla-drain
      - task: !systemd-stop-scylla-server
      - task: !systemd-start-service
        service: scylla-server
      - task: !wait-for-conditions
        conditions:
          - condition: !compactions-nominal
            success_window_seconds: 20
            poll_interval_seconds: 5
            timeout_seconds: +compactions_nominal_timeout_seconds+
```
YAML was a deliberate choice. We didn't want every workflow change to require a Rust recompile, and we wanted operators to be able to tune parameters (such as retry counts and concurrency limits) without requiring a full binary deploy.
Template variables let workflows be parameterized at runtime, so you can scope a workflow to specific nodes or availability zones at invocation time without modifying the workflow definition.
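The +variable+ placeholders in the YAML above suggest a simple textual resolution step at invocation time. A rough sketch of how that substitution might work; this is an illustration, not SCP's actual templating implementation (which would also enforce declared types and defaults, and reject unknown names):

```rust
use std::collections::HashMap;

/// Replace +name+ placeholders in a workflow definition with values
/// supplied when the job is invoked.
fn resolve_placeholders(template: &str, vars: &HashMap<&str, String>) -> String {
    let mut out = template.to_string();
    for (name, value) in vars {
        out = out.replace(&format!("+{name}+"), value);
    }
    out
}

fn main() {
    let mut vars = HashMap::new();
    vars.insert("compactions_nominal_timeout_seconds", "120".to_string());

    let line = "timeout_seconds: +compactions_nominal_timeout_seconds+";
    let resolved = resolve_placeholders(line, &vars);
    assert_eq!(resolved, "timeout_seconds: 120");
    println!("{resolved}");
}
```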
Jobs and Orchestration
A job is a single execution of a workflow, bound to a specific cluster. Jobs are the thing you monitor, resume, and refer back to.
Jobs also support targeting, or running a workflow on a subset of a cluster's nodes rather than all of them. You can target an explicit list of nodes, a specific availability zone, or omit targeting to run against all nodes in a cluster.
Two parameters in the workflow YAML control how jobs run across the nodes in the cluster:
- concurrency_unit controls how nodes are grouped for parallel execution. Setting it to zone means nodes are batched by availability zone, and a task won't run on nodes in multiple zones simultaneously. For a cluster with replication across three zones, this prevents a scenario where simultaneous node failures in multiple zones cause quorum loss.
- concurrency_limit caps how many nodes can be running a task at once, regardless of grouping. A limit of 1 means strictly serial execution within each batch; a limit of 3 allows up to three nodes to proceed in parallel.
Together, these two parameters let you express things like "restart nodes one zone at a time, with at most two nodes restarting concurrently within a zone" without any custom orchestration logic.
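To make the interaction concrete, here's a toy planner that batches nodes by zone and then splits each zone batch into waves of at most `limit` nodes. The names and the two-level grouping are illustrative; SCP's real scheduler is richer than this:

```rust
use std::collections::BTreeMap;

/// Group (node, zone) pairs into batches by zone, then split each batch
/// into waves of at most `limit` nodes. No wave ever mixes zones.
fn plan_waves<'a>(nodes: &[(&'a str, &'a str)], limit: usize) -> Vec<Vec<&'a str>> {
    let mut by_zone: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(node, zone) in nodes {
        by_zone.entry(zone).or_default().push(node);
    }
    let mut waves = Vec::new();
    for (_zone, members) in by_zone {
        for chunk in members.chunks(limit) {
            waves.push(chunk.to_vec());
        }
    }
    waves
}

fn main() {
    let nodes = [
        ("n1", "zone-a"), ("n2", "zone-a"), ("n3", "zone-a"),
        ("n4", "zone-b"), ("n5", "zone-b"),
    ];
    // concurrency_unit = zone, concurrency_limit = 2:
    // waves never mix zones and never exceed two nodes.
    let waves = plan_waves(&nodes, 2);
    assert_eq!(
        waves,
        vec![vec!["n1", "n2"], vec!["n3"], vec!["n4", "n5"]]
    );
    println!("{waves:?}");
}
```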
Resumability
Any long-running operation across a large cluster will eventually be interrupted (e.g. a node becomes unreachable, an SSH connection times out, the engineer running the job closes their terminal). Before SCP, this interruption would mean starting over, or worse, manually reconstructing which nodes had already been touched and writing a one-off script to handle the remainder.
SCP tracks the state of every job in its own SQLite database, including which tasks have completed on which nodes, which are in progress, and which have failed. When a job is interrupted and resumed, it’s able to pick up from exactly where it left off. Completed tasks are not re-run, and tasks that were mid-execution when the interruption occurred are attempted again.
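The resume logic boils down to a filter over persisted per-task state. A toy model of the selection SCP would make against its SQLite job state; the state names and task identifiers are hypothetical:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskState {
    Completed,
    InProgress, // was mid-execution when the job was interrupted
    Failed,
    Pending,
}

/// On resume: skip completed tasks; anything interrupted, failed
/// (within its retry budget), or never started runs again.
fn tasks_to_run(states: &[(&'static str, TaskState)]) -> Vec<&'static str> {
    states
        .iter()
        .filter(|(_, s)| *s != TaskState::Completed)
        .map(|(name, _)| *name)
        .collect()
}

fn main() {
    let job = [
        ("drain/n1", TaskState::Completed),
        ("restart/n1", TaskState::InProgress),
        ("cleanup/n1", TaskState::Failed),
        ("drain/n2", TaskState::Pending),
    ];
    let remaining = tasks_to_run(&job);
    assert_eq!(remaining, vec!["restart/n1", "cleanup/n1", "drain/n2"]);
    println!("resuming with: {remaining:?}");
}
```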
While we considered more complex state backends, the operational simplicity of a file-based database that lives alongside the binary won out. There's no external dependency to manage, job state survives process restarts on its own, and the files themselves are small enough to inspect by hand when something goes wrong. Plus, we can always move to a distributed backend down the road if we need it.
Error Classification
Not all errors are equal. Ideally, a task that fails due to a transient network timeout should be reattempted, while a task that detects data corruption or an unsafe cluster state should stop immediately and notify a human.
SCP distinguishes between recoverable and unrecoverable errors. Recoverable errors trigger the retry logic configured for that task in the workflow YAML. Unrecoverable errors halt the job immediately and fire a webhook notification to a designated ops channel in Discord, tagging the operator who invoked the job.
Getting this classification right is one of the trickier parts of writing a new task. Your natural instinct might be to mark everything as recoverable and let the auto-retries handle it, but a retry loop on a genuinely broken state can cause real harm. Task authors need to understand exactly what different failure modes mean for their specific operation.
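One way to encode the distinction is at the type level, so the retry loop can only ever retry errors explicitly marked recoverable. A sketch under that assumption; the enum and function names are illustrative, not SCP's actual types:

```rust
#[derive(Debug, PartialEq)]
enum TaskError {
    Recoverable(String),   // e.g. a transient network timeout
    Unrecoverable(String), // e.g. an unsafe cluster state: stop and page a human
}

/// Run `task`, retrying up to `max_retries` times, but only for
/// recoverable errors. Unrecoverable errors propagate immediately.
fn run_with_retries<F>(mut task: F, max_retries: u32) -> Result<(), TaskError>
where
    F: FnMut() -> Result<(), TaskError>,
{
    let mut attempts = 0;
    loop {
        match task() {
            Ok(()) => return Ok(()),
            Err(TaskError::Recoverable(_)) if attempts < max_retries => attempts += 1,
            Err(e) => return Err(e), // unrecoverable, or retries exhausted
        }
    }
}

fn main() {
    // Succeeds on the third attempt after two transient failures.
    let mut calls = 0;
    let result = run_with_retries(
        || {
            calls += 1;
            if calls < 3 {
                Err(TaskError::Recoverable("timeout".into()))
            } else {
                Ok(())
            }
        },
        5,
    );
    assert!(result.is_ok());

    // Unrecoverable errors are never retried: exactly one call.
    let mut calls2 = 0;
    let result2 = run_with_retries(
        || {
            calls2 += 1;
            Err(TaskError::Unrecoverable("data corruption".into()))
        },
        5,
    );
    assert!(matches!(result2, Err(TaskError::Unrecoverable(_))));
    assert_eq!(calls2, 1);
}
```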
Webhook notifications turned out to matter more than we initially expected. Running a two-hour rolling restart across a 30-node cluster while trusting the system to ping you if something goes wrong is a wildly different experience from babysitting a terminal for two hours.