Spot instances as GitHub Actions runners

Part 1 was the foundation - Jenkins as Code, ephemeral workers, the whole thing. Part 2 was the painful platform - macOS. This one is sideways: we moved a chunk of the CI workload off Jenkins entirely, onto GitHub Actions, using EC2 spot instances as the runner fleet.

This isn't me saying "Jenkins is dead, use GitHub Actions". The Jenkins setup is still the workhorse for the big builds - macOS, Windows, anything that needs custom orchestration or runs for hours. GitHub Actions runs alongside it for a specific class of workload where it fits better.

This post is about the self-hosted spot runner pattern - how to point GitHub Actions at your own ephemeral EC2 fleet instead of GitHub's managed runners, and the things that bite once you do.

Why bother in the first place

The managed runners GitHub gives you out of the box are fine for small teams. A few reasons push you toward self-hosted:

1. Cost at volume. GitHub bills its managed Linux runners at $0.008/minute, which is about $0.48 per hour. No problem if you run a few builds a day. At our volume (roughly 80,000 runner-hours per month across 29,000 jobs), the same workload on managed runners would land around $38,000 per month. Our actual EC2 spot bill for the runner fleet last month was about $120 (plus a few dollars of Lambda, SQS and EBS overhead from the orchestration). The per-hour gap is what does the heavy lifting - spot sits around $0.001-0.002 per runner-hour against $0.48 for managed.

2. Instance shape. GitHub's managed runners come in fixed sizes. If your build wants 16 vCPUs and 64 GB of RAM, or a specific GPU, or arm64, you're either paying for the largest tier or you're stuck. Self-hosted lets you pick whatever EC2 instance type the build actually needs.

3. Network and dependency access. Builds that need to talk to private resources - internal artifact registries, RDS instances, things behind a VPC - are awkward on managed runners. Self-hosted runners live inside your VPC, so they have direct access to whatever they need without proxies or tunnels.

All three applied for us. The bill is what got us to try it, but instance-shape flexibility and VPC access were nice surprises on top.
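For anyone who wants to plug in their own numbers, the cost comparison in point 1 is plain arithmetic. The sketch below uses only the figures quoted above; the function and constant names are made up for illustration.

```python
# Figures quoted in the post; nothing here is measured independently.
MANAGED_RATE_PER_MINUTE = 0.008   # USD, GitHub-managed Linux runner
RUNNER_HOURS_PER_MONTH = 80_000
SPOT_BILL_PER_MONTH = 120.0       # USD, EC2 spot only, orchestration overhead excluded


def managed_monthly_cost(runner_hours: float, rate_per_minute: float) -> float:
    """Cost of running the same number of runner-hours on managed runners."""
    return runner_hours * 60 * rate_per_minute


managed = managed_monthly_cost(RUNNER_HOURS_PER_MONTH, MANAGED_RATE_PER_MINUTE)
spot_per_hour = SPOT_BILL_PER_MONTH / RUNNER_HOURS_PER_MONTH

print(f"managed: ${managed:,.0f}/month")
print(f"spot:    ${spot_per_hour:.4f}/runner-hour")
```

With the numbers above this works out to $38,400 per month on managed runners and $0.0015 per runner-hour on spot.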

What "self-hosted spot runner" actually means

A self-hosted GitHub Actions runner is a small agent that registers itself with a GitHub repo or org, polls for jobs matching its labels, runs them, and reports results back. The agent doesn't care where it lives - bare-metal, VM, container, anything that can run the runner binary.

The runner can be:

Persistent: registered once, sits there forever, picks up jobs as they come in.
Ephemeral: registered with a single-use token, picks up exactly one job, then de-registers and shuts down.

We picked ephemeral, for the same reasons as in Part 1. A long-lived self-hosted runner means you're babysitting a host, eating build pollution from a shared agent, and carrying a security blast radius that never closes.

So every GitHub Actions job gets its own EC2 spot instance, freshly launched from a Packer-baked AMI. The job runs, the instance is terminated. It's the same one-build-per-worker idea as the Jenkins fleet, just driven from a different control plane.

The architecture, end to end

There's no single "self-hosted spot runner" plugin or service. You stitch the flow yourself, or you grab one of a few open-source modules that stitch it for you. I went with terraform-aws-github-runner - the most beat-on option of the bunch, and it drops into a Terraform-managed AWS account without much ceremony. (Heads-up if you knew this project by its old name: it was originally maintained by Philips Labs at philips-labs/terraform-aws-github-runner and later moved to the dedicated github-aws-runners org. Same project, new home.)

When somebody opens a PR that triggers a workflow, the flow looks like this:

1. GitHub fires a webhook. As soon as a workflow job gets queued, GitHub posts a workflow_job event to a webhook URL.
2. API Gateway + Lambda receives it. The webhook Lambda validates the payload (HMAC signature check), filters for the runner labels you care about, and drops a message onto an SQS queue.
3. A "scale-up" Lambda drains the queue. For each queued job it launches an EC2 spot instance from a specific AMI, with user-data carrying a single-use registration token.
4. Instance comes up. Cloud-init runs, the runner binary registers itself with GitHub using the token, then starts polling for jobs.
5. GitHub assigns the job to it. The runner came up with matching labels, so GitHub schedules that specific job onto it.
6. Job runs. Whatever your workflow says - checkout, build, test, push artifacts.
7. Runner shuts down. We registered it as --ephemeral, so the runner agent exits after one job. A "scale-down" Lambda swings by on a schedule, finds instances whose runners have de-registered, and kills them.
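To make the "HMAC signature check" in the webhook step concrete, here is a minimal Python sketch of how a GitHub webhook signature can be verified. It is not the module's code (the module ships its own Lambda); it only illustrates the check. GitHub sends the signature in the X-Hub-Signature-256 header as "sha256=" followed by the hex HMAC-SHA256 of the raw request body. The secret and payload below are made-up examples.

```python
import hashlib
import hmac


def signature_is_valid(secret: str, body: bytes, signature_header: str) -> bool:
    """Check a raw webhook body against its X-Hub-Signature-256 header value."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many characters matched through timing.
    return hmac.compare_digest(expected, signature_header)


# Hypothetical values, for illustration only.
secret = "example-secret"
body = b'{"action": "queued"}'
good = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

print(signature_is_valid(secret, body, good))         # True
print(signature_is_valid(secret, body + b" ", good))  # False
```

The check has to run over the raw bytes as received. Re-serializing the parsed JSON first can change whitespace and break the comparison.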

The Lambda code, SQS queues and IAM glue all live inside the runner module - you don't write any of it yourself. What you do write is the Terraform configuration declaring which runners exist, what AMI they use, and which instance types are eligible.

[Figure: architecture flow, left to right - GitHub webhook → API Gateway + Lambda → SQS → scale-up Lambda → EC2 spot instance (Packer AMI) → job runs → instance terminates. A dashed loop marks the scale-down Lambda watching and cleaning up.]

Multi-tier runners - one workload size doesn't fit all

The setup earns its keep once you split runners into tiers distinguished by labels. Workflows pick the tier they want by setting runs-on: in the workflow YAML.

I ended up with roughly three tiers:

Default (small) tier: t3.medium / m5.large-class instances. For lightweight jobs - linters, formatters, doc builds, anything that doesn't really stress a CPU. Cheap, spot capacity is rarely a problem at this size, and they spawn quickly.
Large tier: m5.xlarge / c5.xlarge-class instances. For the typical build/test workflow that wants a bit of CPU but isn't hammering it.
Compute-intensive tier: c7a.4xlarge / c8a.8xlarge-class instances. For workflows that benefit from a lot of cores - compile-heavy builds, big test suites, anything that scales with parallelism.

Each tier is its own runner configuration in Terraform - same module, called multiple times, with different labels and instance-type lists. The Terraform looks roughly like this (sanitized):

module "github-runners" {
  source  = "github-aws-runners/github-runner/aws//modules/multi-runner"
  version = "~> 6.0"

  multi_runner_config = {
    "linux-x64-small" = {
      runner_config = {
        runner_extra_labels = "linux,x64,small"
        instance_types      = local.default_instances
        ami_filter          = { name = ["*ci-runner-x64*"] }
        enable_ephemeral_runners = true
        enable_spot_instances    = true
      }
    }

    "linux-x64-compute-intensive" = {
      runner_config = {
        runner_extra_labels = "linux,x64,compute-intensive"
        instance_types      = local.compute_intensive
        ami_filter          = { name = ["*ci-runner-x64*"] }
        enable_ephemeral_runners = true
        enable_spot_instances    = true
      }
    }

    # ...and so on for the other tiers
  }

  # Common stuff: webhook secret, GitHub app credentials, VPC config, etc.
  github_app = { ... }
  webhook_secret = random_id.webhook_secret.hex
  vpc_id  = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
}

And in a workflow YAML, picking the tier looks like this:

jobs:
  build:
    runs-on: [self-hosted, linux, x64, compute-intensive]
    steps:
      - uses: actions/checkout@v4
      - run: make build

The runs-on array gets matched against runner labels - GitHub picks any registered runner that has all of them. And because the Lambda only spawns instances on demand, you don't pay for idle capacity in a tier nobody's using.
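The matching rule is a subset check: a runner qualifies when its label set contains every label the job lists. A small Python sketch of that rule, using the tier labels from the examples above (the function name is made up, and the case-insensitive comparison is an assumption of this sketch):

```python
def runner_matches(runs_on: list[str], runner_labels: set[str]) -> bool:
    """True when the runner carries every label the job asks for."""
    # Assumption: labels are compared case-insensitively.
    wanted = {label.lower() for label in runs_on}
    return wanted <= {label.lower() for label in runner_labels}


compute_runner = {"self-hosted", "linux", "x64", "compute-intensive"}
small_runner = {"self-hosted", "linux", "x64", "small"}
job = ["self-hosted", "linux", "x64", "compute-intensive"]

print(runner_matches(job, compute_runner))                    # True
print(runner_matches(job, small_runner))                      # False
print(runner_matches(["self-hosted", "linux"], small_runner)) # True
```

The last line is the case to watch for: a job that lists only generic labels can land on any tier, so each workflow should name the tier label it actually wants.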

Scheduling - warm pool during hours, lights off at night

Pure on-demand scaling reads well on paper - zero idle runners, you only pay when a job is running, the Lambda spawns instances exactly when GitHub queues something. Two patterns spoil it in practice, and you'll want to layer scheduling on top.

The morning-rush problem: the first PRs of the day all queue around the time people log in. With pure on-demand, every one of those jobs eats the full cold-start latency (~60-120 seconds from queue to running). Multiply by a dozen developers hitting "push" at 9am and you've got a backlog people notice.

The 3am problem: even on spot, runners that exist but aren't being used cost something. Idle time on launched instances waiting for the next job, EBS-backed storage on warm AMIs, plus the always-on cost of the orchestration Lambdas. Outside business hours the queue is mostly empty - no point in keeping capacity hot.

The runner module addresses both with idle pools and scheduled scaling.

What works for us:

During business hours (weekdays 08:00-20:00 in our main timezone): keep a warm pool of N runners per tier. They sit idle, registered with GitHub, ready to grab the first job whose labels match. As soon as one of them claims a job, the scale-up Lambda spawns a replacement, so the pool size stays at N. Cold start from the user's perspective drops from ~90s to ~0s.
Outside business hours: pool size goes to zero. Late-night and weekend jobs still run, they just pay the cold-start tax. Most jobs in those windows are scheduled batch work that doesn't care about an extra minute.

In Terraform, the relevant pieces look roughly like this (sanitized, per-tier):

runner_config = {
  # ... labels, instance_types, ami_filter as before ...

  enable_ephemeral_runners = true
  enable_spot_instances    = true

  # Warm pool - kept at this size while the cron window below matches.
  # idle_config crons describe a window (six fields, seconds first), not a
  # trigger time, so one entry covers weekdays 08:00-19:59 and the pool
  # falls back to zero outside it.
  idle_config = [
    {
      cron      = "* * 8-19 * * 1-5"
      timeZone  = "Europe/Berlin"
      idleCount = 3
    },
  ]
}

A few notes about this pattern:

"Warm pool" runners are still ephemeral, just pre-launched. Each one picks up a single job and dies, same as any other runner. The pool replenishes by launching new instances, never by re-using old ones, so the byte-identical state property from Part 1 still holds.
Pool size is a judgment call. Too small and you hit cold-starts during the rush anyway. Too big and you're burning money on instances that sit unused. We tune it per tier based on the morning queue depth we observe. The compute-intensive tier has a smaller pool than the default tier because those jobs are rarer.
Spot eviction during pool-idle time is a non-event. If AWS reclaims a pool runner before it ever picks up a job, the scale-up Lambda launches a replacement. Pool size is a target, not a list of specific instances.
Holidays we haven't solved properly. The cron-based schedule doesn't know about public holidays, so on a Monday holiday the pool ramps up at 08:00 to serve nobody. The cost is small enough that we've never been motivated to build a calendar-aware scheduler.
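The schedule itself boils down to a small function of local time. The Python sketch below is only a model of the policy described above (weekdays 08:00-20:00, pool of N, otherwise zero), not something the module exposes. The holidays parameter is hypothetical and shows where a calendar-aware check would slot in if we ever built one.

```python
from datetime import date, datetime


def target_pool_size(local_now: datetime, warm_size: int = 3,
                     holidays: frozenset = frozenset()) -> int:
    """Warm-pool target for a timestamp already expressed in the pool's timezone."""
    if local_now.date() in holidays:        # hypothetical holiday calendar
        return 0
    is_weekday = local_now.weekday() < 5    # Monday is 0, Saturday is 5
    in_hours = 8 <= local_now.hour < 20     # 08:00 inclusive to 20:00 exclusive
    return warm_size if is_weekday and in_hours else 0


print(target_pool_size(datetime(2024, 6, 3, 9, 0)))    # Monday morning -> 3
print(target_pool_size(datetime(2024, 6, 3, 20, 0)))   # Monday 20:00   -> 0
print(target_pool_size(datetime(2024, 6, 8, 12, 0)))   # Saturday noon  -> 0
print(target_pool_size(datetime(2024, 6, 3, 9, 0),
                       holidays=frozenset({date(2024, 6, 3)})))  # holiday -> 0
```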

During working hours the dev experience is roughly as snappy as the managed runners. Outside those hours the bill is close to zero.

[Figure: warm-pool schedule over 24 hours - pool size flat at 0 from 00:00-08:00, ramps up to N runners at 08:00, stays at N until 20:00, drops to 0 for the night. Weekends show pool=0 all day. Cold start hides inside the 08:00-20:00 band.]

The AMI is still a Packer image

As in Part 1: whatever the runner needs to do its job (language runtimes, build tools, Docker, cached dependencies) gets baked into a Packer AMI ahead of time. The AMI is versioned, lives in your AWS account, and is referenced by the runner's ami_filter.

For the GitHub Actions case the image is usually lighter than the Jenkins worker images, because GitHub Actions workflows tend to install their own action-specific tooling at runtime via setup-node, setup-python, setup-java, etc. So the base image just needs:

The OS (Ubuntu, usually).
The GitHub Actions runner binary, pre-downloaded.
Docker (for docker build / docker run steps).
AWS CLI (most workflows talk to S3 or ECR).
Basic build dependencies (git, curl, jq, unzip).
Any runtime that's stable enough to bake (Node LTS, Python 3.x).

The image lands in the 5-10 GB range. Small enough that a fresh spot instance pulls it and starts running in well under a minute.

Spot interruptions - and why ephemeral makes them OK

Everyone asks the same question the first time they look at spot for CI: what happens when AWS yanks the instance mid-build?

The build fails. GitHub re-queues it. A fresh instance picks it up. There's no recovery dance to write, because the runner is ephemeral and the workflow should be idempotent anyway. An interruption costs you one wasted partial build, and the retry runs from scratch on a brand-new runner.

Spot interruption notices give you 2 minutes of warning before AWS pulls the plug. The runner can listen for that signal and de-register from GitHub cleanly before shutdown - the module wires this up for you. Without it, GitHub briefly shows a "this runner went offline mid-job" error before the retry. Ugly but not fatal.
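For a feel of what that signal looks like: EC2 publishes the notice through instance metadata at /latest/meta-data/spot/instance-action, which returns 404 until an interruption is scheduled and then a small JSON document with the action and its UTC time. The Python sketch below only parses such a document and works out how much time is left. Fetching it (including the IMDSv2 token) and the de-registration itself are left out, and the sample values are made up.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(instance_action_json: str, now: datetime) -> float:
    """Seconds left before the action named in a spot instance-action document."""
    doc = json.loads(instance_action_json)
    # "time" is UTC in ISO 8601 with a trailing Z, e.g. 2024-06-03T09:02:00Z
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Made-up sample notice and clock reading.
notice = '{"action": "terminate", "time": "2024-06-03T09:02:00Z"}'
now = datetime(2024, 6, 3, 9, 0, 0, tzinfo=timezone.utc)

print(seconds_until_interruption(notice, now))  # 120.0
```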

In practice I see an interruption rate of roughly 1-3% of jobs on the default and large tiers, and a bit higher on the compute-intensive tier (the larger instance types just have less spot capacity available per AZ). For most workloads that's a fine trade for the cost savings. For workflows that can't tolerate retry - final release builds, deploys with side effects - I either run those on on-demand instances (same module, just flip enable_spot_instances = false for that tier) or send them over to the Jenkins fleet where the lifecycle is more controlled.
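A rough way to see why a 1-3% interruption rate is cheap: if every attempt were interrupted independently with probability p (a simplifying assumption, real interruptions cluster by AZ and instance type) and every interrupted job is retried, the number of attempts per job follows a geometric distribution with mean 1 / (1 - p). A short Python sketch:

```python
def expected_attempts(p_interrupt: float) -> float:
    """Mean attempts until one run completes, assuming independent interruptions."""
    if not 0 <= p_interrupt < 1:
        raise ValueError("p_interrupt must be in [0, 1)")
    return 1 / (1 - p_interrupt)


for p in (0.01, 0.03):
    print(f"{p:.0%} interrupted -> {expected_attempts(p):.3f} attempts per job")
```

At 3% that is about 1.03 attempts per job, so retries add roughly 3% extra runs at most, and fewer extra runner-hours than that, since an interrupted attempt is only a partial build.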

Trade-offs vs. Jenkins workers

"Should this workflow run on Jenkins or GitHub Actions?" comes up a lot. How I think about it:

Workload shape	Where it goes
PR-triggered, short, idempotent	GitHub Actions on spot - fast, cheap, no Jenkins overhead.
Long-running build (1h+)	Jenkins. Spot interruption risk is too high for long jobs.
macOS / Windows builds	Jenkins. The macOS and Windows worker setup from Parts 1 and 2 lives there.
Custom orchestration (matrix sharding, dynamic parallelism, gated promotion)	Jenkins. Groovy DSL handles this more flexibly than GitHub Actions matrix.
Deploys / releases with side effects	Jenkins on dedicated workers, or GitHub Actions on on-demand. No spot.
Open-source / contributor-facing repos	GitHub Actions. Don't expose Jenkins to contributors.
Builds that need ephemeral access to a specific cloud service	Whichever is in the right VPC. Usually GitHub Actions for the small stuff.

The two systems don't overlap; they cover different shapes of work. Moving everything to GitHub Actions would have been a mistake. Moving the small, fast, PR-scoped stuff off Jenkins freed up real capacity for the big jobs.

What I'm still figuring out

A few things in this part of the system that aren't fully solved:

Cold-start latency outside the warm pool. Inside business hours the warm pool hides it; outside business hours every job pays the full 60-120 seconds. We're OK with this trade-off but it occasionally bites the people working evenings.
Spot capacity in specific AZs. Occasionally the compute-intensive tier can't get spot capacity in our preferred AZ and the queue backs up. The module supports multi-AZ fallback, which helps, but doesn't eliminate it. For genuinely bursty days we sometimes fall back to on-demand temporarily.
Holiday-aware pool scheduling. The cron schedule doesn't know about public holidays. Low impact, but it's annoying every time you remember it.
AMI sprawl. Every architecture (x64, arm64) and every base-image variant is its own AMI lineage. Keeping them all current is a chore. We build them on a schedule via Packer the same way we do the Jenkins worker images, but the operational overhead is real.
Cost attribution. Spot instances launched by the scale-up Lambda inherit tags from the launch template, but not every downstream resource (EBS volumes, ENIs) picks up the right cost-attribution tags automatically. Whole separate rabbit hole, not opening it here.

Closing thought

The same pattern keeps showing up across all three parts: ephemeral workers, baked images, orchestrated from git, secrets pulled from a vault, identical state every build. Jenkins is one way to wire it up, GitHub Actions on self-hosted spot runners is another, and there's no rule that says you have to pick just one. Pick whichever control plane fits the workload in front of you.

The worker lifecycle is the bit that has to be right. Don't keep workers around between builds. Once that piece is in place, everything else (Jenkins or GitHub Actions, spot or on-demand, Tart or vSphere) is tactical, and you can swap it later without rebuilding the platform.

That's where My CI/CD Odyssey wraps for now. The three posts more or less reduce to the same advice - stop clicking, bake your images, don't reuse workers between builds. If any of it saves you a week of figuring it out yourself, this was worth writing.

Appendix - tools mentioned in this post

terraform-aws-github-runner - the Terraform module that wires up the whole self-hosted spot runner architecture. (Formerly hosted at philips-labs/terraform-aws-github-runner; moved to its own org.)
GitHub Actions self-hosted runners - official docs for the runner agent itself.
HashiCorp Packer - bakes the runner AMIs.
Terraform - the IaC layer that calls the above module.
AWS EC2 spot instances - the cheap, interruptible compute the runners live on.
AWS Lambda and SQS - the queueing and orchestration glue (managed for you by the module above).

This is Part 3 of My CI/CD Odyssey. Thanks for sticking with the whole series. If you're doing self-hosted CI differently - Jenkins, GitHub Actions, or something else - I'd love to hear about it in the comments.
