Spot instances as GitHub Actions runners
Khachatur As · 2026-05-23 · via DEV Community

Part 1 was the foundation - Jenkins as Code, ephemeral workers, the whole thing. Part 2 was the painful platform - macOS. This one is sideways: we moved a chunk of the CI workload off Jenkins entirely, onto GitHub Actions, using EC2 spot instances as the runner fleet.

This isn't me saying "Jenkins is dead, use GitHub Actions". The Jenkins setup is still the workhorse for the big builds - macOS, Windows, anything that needs custom orchestration or runs for hours. GitHub Actions runs alongside it for a specific class of workload where it fits better.

This post is about the self-hosted spot runner pattern - how to point GitHub Actions at your own ephemeral EC2 fleet instead of GitHub's managed runners, and the things that bite once you do.

Figure: same workflow, different runner - GitHub-managed runners vs. spot runners.

Why bother in the first place

The managed runners GitHub gives you out of the box are fine for small teams. A few reasons push you toward self-hosted:

1. Cost at volume. GitHub bills its managed Linux runners at $0.008/minute, which is about $0.48 per hour. No problem if you run a few builds a day. At our volume (roughly 80,000 runner-hours per month across 29,000 jobs), the same workload on managed runners would land around $38,000 per month. Our actual EC2 spot bill for the runner fleet last month was about $120 (plus a few dollars of Lambda, SQS and EBS overhead from the orchestration). The per-hour gap is what does the heavy lifting - spot sits around $0.001-0.002 per runner-hour against $0.48 for managed.

2. Instance shape. GitHub's managed runners come in fixed sizes. If your build wants 16 vCPUs and 64 GB of RAM, or a specific GPU, or arm64, you're either paying for the largest tier or you're stuck. Self-hosted lets you pick whatever EC2 instance type the build actually needs.

3. Network and dependency access. Builds that need to talk to private resources - internal artifact registries, RDS instances, things behind a VPC - are awkward on managed runners. Self-hosted runners live inside your VPC, so they have direct access to whatever they need without proxies or tunnels.

All three applied for us. The bill is what got us to try it, but instance-shape flexibility and VPC access were nice surprises on top.
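The arithmetic behind the cost point is easy to re-run with your own volume. The sketch below only re-derives the figures quoted above; the constant and function names are made up for illustration, and the spot bill excludes the Lambda, SQS and EBS overhead.

```python
# Figures quoted in the post; nothing here is measured independently.
MANAGED_RATE_PER_MIN = 0.008      # USD per minute, GitHub-managed Linux runner
RUNNER_HOURS_PER_MONTH = 80_000   # approximate monthly volume
SPOT_BILL_PER_MONTH = 120.0       # USD, EC2 spot only


def managed_cost(runner_hours: float, rate_per_min: float = MANAGED_RATE_PER_MIN) -> float:
    """Monthly cost if the same runner-hours ran on managed runners."""
    return runner_hours * 60 * rate_per_min


def effective_hourly_rate(bill: float, runner_hours: float) -> float:
    """Average price actually paid per runner-hour."""
    return bill / runner_hours


managed = managed_cost(RUNNER_HOURS_PER_MONTH)
spot_rate = effective_hourly_rate(SPOT_BILL_PER_MONTH, RUNNER_HOURS_PER_MONTH)

print(f"managed runners: ${managed:,.0f} per month")
print(f"spot fleet:      ${spot_rate:.4f} per runner-hour")
```

Swap in your own runner-hours before trusting the ratio; at a few builds a day the gap is too small to justify running the fleet.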

Figure: GitHub Actions and Cost Explorer charts.

What "self-hosted spot runner" actually means

A self-hosted GitHub Actions runner is a small agent that registers itself with a GitHub repo or org, polls for jobs matching its labels, runs them, and reports results back. The agent doesn't care where it lives - bare-metal, VM, container, anything that can run the runner binary.

The runner can be:

  • Persistent: registered once, sits there forever, picks up jobs as they come in.
  • Ephemeral: registered with a single-use token, picks up exactly one job, then de-registers and shuts down.

We picked ephemeral, for the same reasons as in Part 1. A long-lived self-hosted runner means you're babysitting a host, eating build pollution from a shared agent, and carrying a security blast radius that never closes.

So every GitHub Actions job gets its own EC2 spot instance, freshly launched from a Packer-baked AMI. The job runs, the instance is terminated. It's the same one-build-per-worker idea as the Jenkins fleet, just driven from a different control plane.

The architecture, end to end

There's no single "self-hosted spot runner" plugin or service. You stitch the flow yourself, or you grab one of a few open-source modules that stitch it for you. I went with terraform-aws-github-runner - the most beat-on option of the bunch, and it drops into a Terraform-managed AWS account without much ceremony. (Heads-up if you knew this project by its old name: it was originally maintained by Philips Labs at philips-labs/terraform-aws-github-runner and later moved to the dedicated github-aws-runners org. Same project, new home.)

When somebody opens a PR that triggers a workflow, the flow looks like this:

  1. GitHub fires a webhook. As soon as a workflow job gets queued, GitHub posts a workflow_job event to a webhook URL.
  2. API Gateway + Lambda receives it. The webhook Lambda validates the payload (HMAC signature check), filters for the runner labels you care about, and drops a message onto an SQS queue.
  3. A "scale-up" Lambda drains the queue. For each queued job it launches an EC2 spot instance from a specific AMI, with user-data carrying a single-use registration token.
  4. Instance comes up. Cloud-init runs, the runner binary registers itself with GitHub using the token, then starts polling for jobs.
  5. GitHub assigns the job to it. The runner came up with matching labels, so GitHub schedules that specific job onto it.
  6. Job runs. Whatever your workflow says - checkout, build, test, push artifacts.
  7. Runner shuts down. We registered it as --ephemeral, so the runner agent exits after one job. A "scale-down" Lambda swings by on a schedule, finds instances whose runners have de-registered, and kills them.

The Lambda code, SQS queues and IAM glue all live inside the runner module - you don't write any of it yourself. What you do write is the Terraform configuration declaring which runners exist, what AMI they use, and which instance types are eligible.
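To make step 2 concrete, here is a minimal sketch of what an HMAC check on a GitHub webhook looks like. It is not the module's code (the module's Lambdas are not Python); it assumes the standard X-Hub-Signature-256 header, and the function names, secret and payload are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional


def signature_is_valid(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Compare GitHub's X-Hub-Signature-256 header with an HMAC of the raw body."""
    expected = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, header_value)


def queued_job_labels(raw_body: bytes) -> Optional[list]:
    """Return the requested labels for a queued workflow_job event, else None."""
    event = json.loads(raw_body)
    if event.get("action") != "queued":
        return None
    return event["workflow_job"]["labels"]


# Hypothetical secret and payload, for illustration only.
secret = b"not-a-real-secret"
body = json.dumps(
    {"action": "queued", "workflow_job": {"labels": ["self-hosted", "linux", "x64", "small"]}}
).encode()
header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

if signature_is_valid(secret, body, header):
    print(queued_job_labels(body))
```

The signature has to be computed over the raw request bytes; re-serializing the parsed JSON first will usually change the bytes and break the check.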

Figure (architecture): GitHub webhook → API Gateway + Lambda → SQS → scale-up Lambda → EC2 spot instance (Packer AMI) → job runs → instance terminates. The scale-down Lambda watches on a schedule and cleans up.

Multi-tier runners - one workload size doesn't fit all

The setup earns its keep once you split runners into tiers distinguished by labels. Workflows pick the tier they want by setting runs-on: in the workflow YAML.

I ended up with roughly three tiers:

  • Default (small) tier: t3.medium / m5.large-class instances. For lightweight jobs - linters, formatters, doc builds, anything that doesn't really stress a CPU. Cheap, spot capacity is rarely a problem at this size, and they spawn quickly.
  • Large tier: m5.xlarge / c5.xlarge-class instances. For the typical build/test workflow that wants a bit of CPU but isn't hammering it.
  • Compute-intensive tier: c7a.4xlarge / c8a.8xlarge-class instances. For workflows that benefit from a lot of cores - compile-heavy builds, big test suites, anything that scales with parallelism.

Each tier is its own runner configuration in Terraform - same module, called multiple times, with different labels and instance-type lists. The Terraform looks roughly like this (sanitized):

module "github-runners" {
  source  = "github-aws-runners/github-runner/aws//modules/multi-runner"
  version = "~> 6.0"

  multi_runner_config = {
    "linux-x64-small" = {
      runner_config = {
        runner_extra_labels = "linux,x64,small"
        instance_types      = local.default_instances
        ami_filter          = { name = ["*ci-runner-x64*"] }
        enable_ephemeral_runners = true
        enable_spot_instances    = true
      }
    }

    "linux-x64-compute-intensive" = {
      runner_config = {
        runner_extra_labels = "linux,x64,compute-intensive"
        instance_types      = local.compute_intensive
        ami_filter          = { name = ["*ci-runner-x64*"] }
        enable_ephemeral_runners = true
        enable_spot_instances    = true
      }
    }

    # ...and so on for the other tiers
  }

  # Common stuff: webhook secret, GitHub app credentials, VPC config, etc.
  github_app     = { ... }
  webhook_secret = random_id.webhook_secret.hex
  vpc_id         = module.vpc.vpc_id
  subnet_ids     = module.vpc.private_subnets
}


And in a workflow YAML, picking the tier looks like this:

jobs:
  build:
    runs-on: [self-hosted, linux, x64, compute-intensive]
    steps:
      - uses: actions/checkout@v4
      - run: make build


The runs-on array gets matched against runner labels - GitHub picks any registered runner that has all of them. And because the Lambda only spawns instances on demand, you don't pay for idle capacity in a tier nobody's using.
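The "has all of them" rule is a subset check, which a few lines can model. This is a sketch of the matching rule as described here, not GitHub's scheduler; the runner names are hypothetical.

```python
def can_run(runs_on: list, runner_labels: list) -> bool:
    """A runner qualifies only if it carries every label the job asks for."""
    return set(runs_on) <= set(runner_labels)


# Hypothetical registered runners, one per tier.
runners = {
    "small-1": ["self-hosted", "linux", "x64", "small"],
    "compute-1": ["self-hosted", "linux", "x64", "compute-intensive"],
}

specific_job = ["self-hosted", "linux", "x64", "compute-intensive"]
vague_job = ["self-hosted", "linux"]

print([name for name, labels in runners.items() if can_run(specific_job, labels)])
print([name for name, labels in runners.items() if can_run(vague_job, labels)])
```

The second lookup matches both tiers, which is the practical catch: a workflow that lists only generic labels can land on whichever tier happens to have a runner free, so it pays to include the tier label in every runs-on.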

Scheduling - warm pool during hours, lights off at night

Pure on-demand scaling reads well on paper - zero idle runners, you only pay when a job is running, the Lambda spawns instances exactly when GitHub queues something. Two patterns spoil it in practice, and you'll want to layer scheduling on top.

The morning-rush problem: the first PRs of the day all queue around the time people log in. With pure on-demand, every one of those jobs eats the full cold-start latency (~60-120 seconds from queue to running). Multiply by a dozen developers hitting "push" at 9am and you've got a backlog people notice.

The 3am problem: even on spot, runners that exist but aren't being used cost something. Idle time on launched instances waiting for the next job, EBS-backed storage on warm AMIs, plus the always-on cost of the orchestration Lambdas. Outside business hours the queue is mostly empty - no point in keeping capacity hot.

The runner module addresses both with idle pools and scheduled scaling.

What works for us:

  • During business hours (weekdays 08:00-20:00 in our main timezone): keep a warm pool of N runners per tier. They sit idle, registered with GitHub, ready to grab the first job whose labels match. As soon as one of them claims a job, the scale-up Lambda spawns a replacement, so the pool size stays at N. Cold start from the user's perspective drops from ~90s to ~0s.
  • Outside business hours: pool size goes to zero. Late-night and weekend jobs still run, they just pay the cold-start tax. Most jobs in those windows are scheduled batch work that doesn't care about an extra minute.

In Terraform, the relevant pieces look roughly like this (sanitized, per-tier):

runner_config = {
  # ... labels, instance_types, ami_filter as before ...

  enable_ephemeral_runners = true
  enable_spot_instances    = true

  # Warm pool - idle_config entries are cron windows, not start/stop events:
  # idleCount applies while the current time matches the six-field
  # (seconds-first) expression, and falls back to 0 outside every window.
  idle_config = [
    {
      cron      = "* * 8-19 * * 1-5"   # weekdays, 08:00 through 19:59
      timeZone  = "Europe/Berlin"
      idleCount = 3
    },
  ]
}


A few notes about this pattern:

  • "Warm pool" runners are still ephemeral, just pre-launched. Each one picks up a single job and dies, same as any other runner. The pool replenishes by launching new instances, never by re-using old ones, so the byte-identical state property from Part 1 still holds.
  • Pool size is a judgment call. Too small and you hit cold-starts during the rush anyway. Too big and you're burning money on instances that sit unused. We tune it per tier based on the morning queue depth we observe. The compute-intensive tier has a smaller pool than the default tier because those jobs are rarer.
  • Spot eviction during pool-idle time is a non-event. If AWS reclaims a pool runner before it ever picks up a job, the scale-up Lambda launches a replacement. Pool size is a target, not a list of specific instances.
  • Holidays we haven't solved properly. The cron-based schedule doesn't know about public holidays, so on a Monday holiday the pool ramps up at 08:00 to serve nobody. The cost is small enough that we've never been motivated to build a calendar-aware scheduler.

During working hours the dev experience is roughly as snappy as the managed runners. Outside those hours the bill is close to zero.
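The schedule itself reduces to a small function of the local time. The sketch below is not part of the runner module; it models the weekday 08:00-20:00 window described above and adds a hypothetical holidays parameter to show what a calendar-aware version might look like.

```python
from datetime import date, datetime

POOL_SIZE = 3                    # matches idleCount in the config above
OPEN_HOUR, CLOSE_HOUR = 8, 20    # business hours, local time


def target_pool_size(local_now: datetime, holidays: frozenset = frozenset()) -> int:
    """Warm-pool target for a timestamp already expressed in the team's timezone.

    `holidays` is a hypothetical extension: a set of dates on which the
    pool should stay empty even though it is a weekday.
    """
    is_weekday = local_now.weekday() < 5              # Monday=0 ... Friday=4
    in_hours = OPEN_HOUR <= local_now.hour < CLOSE_HOUR
    is_holiday = local_now.date() in holidays
    return POOL_SIZE if is_weekday and in_hours and not is_holiday else 0


monday_morning = datetime(2026, 5, 18, 9, 0)   # a Monday
saturday_noon = datetime(2026, 5, 23, 12, 0)   # a Saturday

print(target_pool_size(monday_morning))
print(target_pool_size(saturday_noon))
print(target_pool_size(monday_morning, holidays=frozenset({date(2026, 5, 18)})))
```

The function takes a local timestamp on purpose, so the timezone conversion stays the caller's problem, the same way the Terraform config carries an explicit timeZone.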

Figure (warm-pool schedule): 24-hour timeline of pool size. Weekdays: 0 from 00:00 to 08:00, N runners from 08:00 to 20:00, 0 overnight. Weekends: 0 all day. The cold start hides inside the 08:00-20:00 band.

The AMI is still a Packer image

As in Part 1: whatever the runner needs to do its job (language runtimes, build tools, Docker, cached dependencies) gets baked into a Packer AMI ahead of time. The AMI is versioned, lives in your AWS account, and is referenced by the runner's ami_filter.
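A name filter like "*ci-runner-x64*" can match several AMI versions at once, so something has to pick one. The sketch below assumes newest-match-wins, which is how I'd expect a filter like this to resolve but is an assumption here, not something confirmed above. The image records are hypothetical and shaped like EC2 describe-images output.

```python
from fnmatch import fnmatchcase
from typing import Optional


def pick_ami(images: list, name_pattern: str) -> Optional[dict]:
    """Return the most recently created image whose Name matches the glob."""
    matches = [img for img in images if fnmatchcase(img["Name"], name_pattern)]
    # ISO-8601 timestamps in the same format sort correctly as plain strings.
    return max(matches, key=lambda img: img["CreationDate"], default=None)


# Hypothetical image records, for illustration only.
images = [
    {"ImageId": "ami-aaa", "Name": "team-ci-runner-x64-v41", "CreationDate": "2026-04-01T03:00:00.000Z"},
    {"ImageId": "ami-bbb", "Name": "team-ci-runner-x64-v42", "CreationDate": "2026-05-01T03:00:00.000Z"},
    {"ImageId": "ami-ccc", "Name": "team-ci-runner-arm64-v42", "CreationDate": "2026-05-02T03:00:00.000Z"},
]

chosen = pick_ami(images, "*ci-runner-x64*")
print(chosen["ImageId"] if chosen else "no match")
```

Under that assumption, a freshly baked image becomes the default for new runners as soon as it exists, so a broken bake needs a quick way to be deregistered or excluded by name.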

For the GitHub Actions case the image is usually lighter than the Jenkins worker images, because GitHub Actions workflows tend to install their own action-specific tooling at runtime via setup-node, setup-python, setup-java, etc. So the base image just needs:

  • The OS (Ubuntu, usually).
  • The GitHub Actions runner binary, pre-downloaded.
  • Docker (for docker build / docker run steps).
  • AWS CLI (most workflows talk to S3 or ECR).
  • Basic build dependencies (git, curl, jq, unzip).
  • Any runtime that's stable enough to bake (Node LTS, Python 3.x).

The image lands in the 5-10 GB range. Small enough that a fresh spot instance pulls it and starts running in well under a minute.

Spot interruptions - and why ephemeral makes them OK

Everyone asks the same question the first time they look at spot for CI: what happens when AWS yanks the instance mid-build?

The build fails. GitHub re-queues it. A fresh instance picks it up. There's no recovery dance to write, because the runner is ephemeral and the workflow should be idempotent anyway. An interruption costs you one wasted partial build, and the retry runs from scratch on a brand-new runner.

Spot interruption notices give you 2 minutes of warning before AWS pulls the plug. The runner can listen for that signal and de-register from GitHub cleanly before shutdown - the module wires this up for you. Without it, GitHub briefly shows a "this runner went offline mid-job" error before the retry. Ugly but not fatal.
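For a sense of what "listening for that signal" involves: the notice appears in the instance metadata service as a small JSON document. The sketch below is not the module's implementation. It polls the spot instance-action path using an IMDSv2 token and works out how much time is left; the function names are made up.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the spot instance-action document, or None if no notice exists.

    Only works on an EC2 instance; elsewhere the request simply fails.
    """
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled yet
            return None
        raise


def seconds_until_interruption(document: str, now: datetime) -> float:
    """Seconds between `now` and the time given in the notice."""
    notice = json.loads(document)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (when - now).total_seconds()


# Example notice in the documented shape, with a made-up timestamp.
sample = '{"action": "terminate", "time": "2026-05-18T09:02:00Z"}'
print(seconds_until_interruption(sample, datetime(2026, 5, 18, 9, 0, tzinfo=timezone.utc)))
```

A watcher would call fetch_instance_action every few seconds and start the de-registration as soon as it returns a document, since two minutes is the entire budget.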

In practice I see an interruption rate of roughly 1-3% of jobs on the default and large tiers, and a bit higher on the compute-intensive tier (the larger instance types just have less spot capacity available per AZ). For most workloads that's a fine trade for the cost savings. For workflows that can't tolerate retry - final release builds, deploys with side effects - I either run those on on-demand instances (same module, just flip enable_spot_instances = false for that tier) or send them over to the Jenkins fleet where the lifecycle is more controlled.
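A rough way to size that trade: if each attempt is interrupted independently with probability p and a retry starts from scratch, the expected number of attempts per job is 1 / (1 - p). The independence assumption is mine and is a simplification; long jobs in a capacity-starved AZ will do worse.

```python
def expected_attempts(p_interrupt: float) -> float:
    """Mean number of runs until one finishes, with independent interruptions."""
    if not 0.0 <= p_interrupt < 1.0:
        raise ValueError("p_interrupt must be in [0, 1)")
    return 1.0 / (1.0 - p_interrupt)


for p in (0.01, 0.03):
    extra = expected_attempts(p) - 1.0
    print(f"p={p:.2f}: {expected_attempts(p):.4f} attempts per job, {extra:.1%} extra runs")
```

At the quoted 1-3% the retry overhead stays in the low single digits of percent, which is why the rate matters much less than whether the job is safe to run twice.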

Trade-offs vs. Jenkins workers

"Should this workflow run on Jenkins or GitHub Actions?" comes up a lot. How I think about it:

| Workload shape | Where it goes |
|---|---|
| PR-triggered, short, idempotent | GitHub Actions on spot - fast, cheap, no Jenkins overhead. |
| Long-running build (1h+) | Jenkins. Spot interruption risk is too high for long jobs. |
| macOS / Windows builds | Jenkins. The macOS and Windows worker setup from Parts 1 and 2 lives there. |
| Custom orchestration (matrix sharding, dynamic parallelism, gated promotion) | Jenkins. Groovy DSL handles this more flexibly than GitHub Actions matrix. |
| Deploys / releases with side effects | Jenkins on dedicated workers, or GitHub Actions on on-demand. No spot. |
| Open-source / contributor-facing repos | GitHub Actions. Don't expose Jenkins to contributors. |
| Builds that need ephemeral access to a specific cloud service | Whichever is in the right VPC. Usually GitHub Actions for the small stuff. |

The two systems don't overlap; they cover different shapes of work. Moving everything to GitHub Actions would have been a mistake. Moving the small, fast, PR-scoped stuff off Jenkins freed up real capacity for the big jobs.

What I'm still figuring out

A few things in this part of the system that aren't fully solved:

  • Cold-start latency outside the warm pool. Inside business hours the warm pool hides it; outside business hours every job pays the full 60-120 seconds. We're OK with this trade-off but it occasionally bites the people working evenings.
  • Spot capacity in specific AZs. Occasionally the compute-intensive tier can't get spot capacity in our preferred AZ and the queue backs up. The module supports multi-AZ fallback, which helps, but doesn't eliminate it. For genuinely bursty days we sometimes fall back to on-demand temporarily.
  • Holiday-aware pool scheduling. The cron schedule doesn't know about public holidays. Low impact, but it's annoying every time you remember it.
  • AMI sprawl. Every architecture (x64, arm64) and every base-image variant is its own AMI lineage. Keeping them all current is a chore. We build them on a schedule via Packer the same way we do the Jenkins worker images, but the operational overhead is real.
  • Cost attribution. Spot instances launched by the scale-up Lambda inherit tags from the launch template, but not every downstream resource (EBS volumes, ENIs) picks up the right cost-attribution tags automatically. Whole separate rabbit hole, not opening it here.

Closing thought

The same pattern keeps showing up across all three parts: ephemeral workers, baked images, orchestrated from git, secrets pulled from a vault, identical state every build. Jenkins is one way to wire it up, GitHub Actions on self-hosted spot runners is another, and there's no rule that says you have to pick just one. Pick whichever control plane fits the workload in front of you.

The worker lifecycle is the bit that has to be right. Don't keep workers around between builds. Once that piece is in place, everything else (Jenkins or GitHub Actions, spot or on-demand, Tart or vSphere) is tactical, and you can swap it later without rebuilding the platform.

That's where My CI/CD Odyssey wraps for now. The three posts more or less reduce to the same advice - stop clicking, bake your images, don't reuse workers between builds. If any of it saves you a week of figuring it out yourself, this was worth writing.


Appendix - tools mentioned in this post


This is Part 3 of My CI/CD Odyssey. Thanks for sticking with the whole series. If you're doing self-hosted CI differently - Jenkins, GitHub Actions, or something else - I'd love to hear about it in the comments.