Blue-Green Deployments Are Invisible. I Made Mine Visible. Here Is How.
Oluwagbade Odimayo · 2026-05-31 · via DEV Community

This is Part 2 of a series. Part 1 covers building the core blue-green deployment pipeline on AWS EKS from scratch. You can read it here: Your Deployments Are Causing Downtime. Mine Do Not. Here Is Why. This post picks up exactly where that one left off.


After Part 1, I had a working blue-green deployment pipeline. Every push to main triggered an automated sequence that built a Docker image, deployed it to the idle environment, verified it was healthy, and switched traffic in under one second. Zero downtime. Proven with a curl loop.

But there was a problem I could not shake.

The only way to see the traffic switch happen was a terminal showing JSON responses. A curl loop is fine for a demo. It is not fine for production. In a real system you need to answer questions like: how many requests per second were hitting blue before the switch? How quickly did green ramp up? Was there any period where request rates dropped? What exactly happened at 21:49:00?

A curl loop cannot answer those questions. A proper observability stack can.

So I went back and added three things: Prometheus metrics exported directly from the application, a Grafana dashboard that tracks request rates per environment in real time, and Terraform to manage all the AWS infrastructure as code. This post covers all three and shows you the graph that proves the traffic switch happened cleanly.


What I Added on Top of Part 1

If you read Part 1, you know the core system: two Kubernetes deployments (blue and green), a Service that routes traffic based on a label selector, NGINX Ingress for the public URL, and a GitHub Actions pipeline that automates the switch.

What I added in Part 2:

  • A /metrics endpoint in the Node.js app using prom-client
  • A Kubernetes ServiceMonitor that tells Prometheus to scrape the app every 15 seconds
  • A Grafana dashboard showing HTTP request rates per environment
  • Terraform configuration that provisions the entire AWS infrastructure in one command

None of these changes how the blue-green strategy works. They make it observable and reproducible.


Adding Metrics to the Application

The first step was making the app expose data that Prometheus could scrape. I added prom-client to the Node.js application and created two metrics: a counter tracking HTTP requests labelled by method, route, status code, color, and version, and a gauge identifying which environment the pod belongs to.

const client = require("prom-client");
const register = new client.Registry();
client.collectDefaultMetrics({ register });

const httpRequests = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests by route, status, color and version",
  labelNames: ["method", "route", "status", "color", "version"],
  registers: [register],
});

const envGauge = new client.Gauge({
  name: "bluegreen_environment_info",
  help: "Current environment color and version",
  labelNames: ["color", "version"],
  registers: [register],
});
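// APP_COLOR and APP_VERSION are assumed to be defined earlier in server.js (for example from environment variables).
envGauge.labels(APP_COLOR, APP_VERSION).set(1);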


Every request to /health or / increments the counter with the correct color label. Blue pods increment color="blue". Green pods increment color="green". When traffic switches, the counter accumulation shifts from one color to the other, and Prometheus captures that shift in its time-series data.
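The post does not show the code that increments the counter. A minimal sketch of how it could be done as Express-style middleware, assuming the counter exposes prom-client's inc(labels) method; makeRequestCounter is a hypothetical helper name, not something taken from the repository:

```javascript
// Sketch only: makeRequestCounter is a hypothetical helper, not code from the repository.
// `counter` is anything with prom-client's Counter#inc(labels) method, such as httpRequests above.
function makeRequestCounter(counter, color, version) {
  return function countRequest(req, res, next) {
    // Wait for "finish" so the final status code is known.
    res.on("finish", () => {
      counter.inc({
        method: req.method,
        // Prefer the matched route pattern so path parameters do not inflate label cardinality.
        route: req.route ? req.route.path : req.path,
        status: String(res.statusCode),
        color,
        version,
      });
    });
    next();
  };
}

// Possible wiring (assumption): app.use(makeRequestCounter(httpRequests, APP_COLOR, APP_VERSION));
```

Passing the counter in as an argument keeps the middleware easy to test with a stub, and the color and version come from the pod's own configuration, so each pod only ever writes its own color label.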

The /metrics endpoint serves the data:

app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});


Testing the Metrics Endpoint Before Deploying Anything

This is where I made a mistake the first time around, and it cost me a full rebuild cycle. I updated the code, rebuilt the Docker image, deployed to the cluster, and then tested the endpoint. It returned 404. The reason was that Kubernetes had cached the old image on the nodes. Even though I pushed a new image to ECR, the pods kept running the cached version because imagePullPolicy was not set to Always.

The lesson: always test the metrics endpoint locally before touching the cluster.

node server.js &
sleep 2
curl http://localhost:3000/metrics | head -10
# Must show: # HELP process_cpu_user_seconds_total ...
kill %1


If that does not work locally, nothing you do on the cluster will fix it. Fix it at the source.

The second lesson: add imagePullPolicy: Always to every Deployment manifest before you ever push an image with a reused tag.

containers:
  - name: bluegreen-app
    image: 677276115158.dkr.ecr.us-east-1.amazonaws.com/bluegreen-app:blue
    imagePullPolicy: Always


Without that line, you will spend time chasing a ghost. The code is right. The image is right. The cluster is lying to you.


Connecting Prometheus to the Application

Prometheus does not automatically discover application pods. You have to tell it what to scrape. On a cluster running the kube-prometheus-stack Helm chart, the way to do that is a ServiceMonitor resource.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: bluegreen-monitor
  namespace: monitoring
  labels:
    release: monitoring
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: bluegreen-app
  endpoints:
    - port: http
      path: /metrics
      interval: 15s


This tells Prometheus: find all Services in the default namespace with the label app: bluegreen-app, then scrape their /metrics endpoint every 15 seconds on the port named http.

The Service also needed a named port to match:

ports:
  - name: http
    port: 80
    targetPort: 3000


Once both are applied, Prometheus starts collecting http_requests_total data from every blue and green pod automatically. No manual configuration needed when pods restart or scale.


The Grafana Dashboard

With metrics flowing into Prometheus, the Grafana dashboard becomes straightforward. The key PromQL queries are:

rate(http_requests_total{color="blue"}[1m])
rate(http_requests_total{color="green"}[1m])


rate() calculates the per-second request rate over a one-minute window. Labelling by color means blue and green show as separate lines on the same graph. During normal operation, only blue has traffic. At the moment of the switch, blue drops and green rises.
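To make the arithmetic behind rate() concrete, here is a simplified model in JavaScript. It is a sketch, not Prometheus's implementation: real rate() also extrapolates to the edges of the window. The function name simpleRate and the sample values are made up for illustration.

```javascript
// Simplified model of PromQL rate(): per-second increase of a counter across a window.
// Real Prometheus also extrapolates to the window edges; this sketch does not.
function simpleRate(samples) {
  // samples: [{ t: seconds, v: counterValue }], oldest first
  if (samples.length < 2) return null; // two samples are the minimum for a rate
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter reset (for example a pod restart), so count from zero again.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : null;
}

// Hypothetical 15-second scrapes of one blue pod inside a one-minute window.
const blueSamples = [
  { t: 0, v: 100 },
  { t: 15, v: 108 },
  { t: 30, v: 117 },
  { t: 45, v: 125 },
];

// 25 requests over 45 seconds, roughly 0.56 requests per second.
console.log(simpleRate(blueSamples));
```

With a 15-second scrape interval, a one-minute window holds about four samples per series, which is why the lines on the graph move smoothly rather than jumping on every request.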

Here is what the graph looked like during a live traffic switch:

[Figure: Grafana showing blue dropping and green rising during the traffic switch]

At 21:49:00 the blue /health line dropped from 0.55 requests per second to 0.25. At the same moment, the green /health line appeared at 0.19 and climbed to match the previous blue rate. The crossover is visible as a V-shape on the graph. Both environments were tracked simultaneously, and the switch was captured in real-time data.

This is what a curl loop cannot show you. The curl loop proves no requests failed. The Grafana graph proves how fast the transition happened, what the request rate was before and after, and that both environments were being scraped correctly throughout.


Terraform: Making the Infrastructure Reproducible

The first time I built this project, I provisioned the AWS infrastructure using eksctl create cluster and a series of aws CLI commands. That works. But it leaves no record of what was created, cannot be version-controlled, and requires you to remember every command in the correct order if you ever need to rebuild.

Terraform solves all three problems by describing infrastructure as code.

resource "aws_eks_cluster" "main" {
  name     = var.cluster_name
  role_arn = aws_iam_role.eks_cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids             = aws_subnet.public[*].id
    endpoint_public_access = true
  }

  depends_on = [
    aws_iam_role_policy_attachment.eks_cluster_policy,
  ]
}


The entire infrastructure (VPC, subnets, internet gateway, route table, EKS cluster, node group, ECR repository, and all IAM roles and policy attachments) is defined across three files: main.tf, variables.tf, and outputs.tf. Rebuilding everything takes one command:

terraform apply


It takes 10 to 15 minutes and prints every output you need at the end, including the kubectl connect command and the ECR registry URL.

cluster_endpoint       = "https://829B30456E2DD983A919AC57A97A18A8.gr7.us-east-1.eks.amazonaws.com"
cluster_name           = "bluegreen-cluster"
cluster_version        = "1.31"
ecr_repository_url     = "677276115158.dkr.ecr.us-east-1.amazonaws.com/bluegreen-app"
kubectl_config_command = "aws eks update-kubeconfig --region us-east-1 --name bluegreen-cluster"


The Tear-Down Problem Terraform Exposed

When I ran terraform destroy the first time, it failed. The VPC could not be deleted because NGINX Ingress had created an AWS Elastic Load Balancer outside of Terraform's knowledge. That load balancer was still attached to the subnets, which blocked subnet deletion, which blocked internet gateway deletion, which blocked VPC deletion.

Terraform had no record of the load balancer because it was created on behalf of the Helm-installed NGINX Ingress, not by Terraform. This is a real-world infrastructure management problem: resources created by one tool can block resources managed by another.

The fix is to clean up the orphaned load balancer manually before running destroy:

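# Lists every Classic ELB in the region; this assumes the NGINX Ingress load balancer is the only one.
# Check the value of LB_NAME before deleting anything.
LB_NAME=$(aws elb describe-load-balancers --region us-east-1 \
  --query "LoadBalancerDescriptions[*].LoadBalancerName" --output text)

aws elb delete-load-balancer --region us-east-1 --load-balancer-name "$LB_NAME"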

# Wait for the load balancer to fully detach
sleep 30

terraform destroy -auto-approve


After that, destroy completes cleanly. This is now documented in the README so anyone who forks the project does not hit the same wall.

What Terraform Tracks

Running terraform state list after a successful apply shows every resource Terraform manages:

aws_ecr_lifecycle_policy.app
aws_ecr_repository.app
aws_eks_cluster.main
aws_eks_node_group.main
aws_iam_role.eks_cluster
aws_iam_role.eks_nodes
aws_iam_role_policy_attachment.eks_cluster_policy
aws_iam_role_policy_attachment.eks_cni_policy
aws_iam_role_policy_attachment.eks_ecr_readonly
aws_iam_role_policy_attachment.eks_worker_node_policy
aws_internet_gateway.main
aws_route_table.public
aws_route_table_association.public[0]
aws_route_table_association.public[1]
aws_subnet.public[0]
aws_subnet.public[1]
aws_vpc.main


17 resources. All defined in code. All version-controlled. All reproducible.


The Challenges That Only Appeared in Part 2

Part 1 had its own challenges: the AWS ELB hostname versus IP address difference, the ECR IAM policy that most tutorials skip, the workflow files that were never actually in the repository. Part 2 introduced new ones.

The Metrics Endpoint Returned 404

After updating the app code and pushing a new image to ECR, the pods were still running the old image without the metrics route. The nodes reused their locally cached copy because the image tag (blue) had not changed and imagePullPolicy defaulted to IfNotPresent.

The fix was imagePullPolicy: Always combined with docker build --no-cache. One ensures the cluster always pulls the latest image. The other ensures Docker does not reuse cached layers that might bake in old code.
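The defaulting behaviour that caused this can be written down as a small decision function. The sketch below models the rule in the Kubernetes documentation for when imagePullPolicy is omitted. The function names and the simplified image-reference parsing are assumptions for illustration, not Kubernetes code.

```javascript
// Sketch of Kubernetes' documented defaulting rule when imagePullPolicy is omitted.
// Simplified parsing: assumes a reference shaped like registry/path/name[:tag][@digest].
function defaultPullPolicy(image) {
  if (image.includes("@")) return "IfNotPresent"; // pinned by digest
  const lastSegment = image.slice(image.lastIndexOf("/") + 1);
  const tag = lastSegment.includes(":") ? lastSegment.split(":")[1] : "";
  return tag === "" || tag === "latest" ? "Always" : "IfNotPresent";
}

// Whether the kubelet contacts the registry when a container starts.
function willContactRegistry(policy, cachedOnNode) {
  if (policy === "Always") return true;
  if (policy === "Never") return false;
  return !cachedOnNode; // IfNotPresent: only pull when the node has no local copy
}

// A reused tag such as :blue defaults to IfNotPresent, so a cached node never sees the new push.
const policy = defaultPullPolicy("registry.example.com/bluegreen-app:blue");
console.log(policy, willContactRegistry(policy, true));
```

The example registry host is a placeholder. The point is that any tag other than latest gets IfNotPresent by default, which is exactly the case of a mutable blue or green tag.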

Grafana Showed No Green Lines After the Switch

After switching traffic to green, the Grafana dashboard showed only the blue lines continuing. Green never appeared. The reason was that the ServiceMonitor selected the Service by its app: bluegreen-app label, but the Service port had no name, so Prometheus could not match the port: http reference in the scrape endpoint configuration.

Adding name: http to the Service port definition and restarting the Prometheus operator resolved it. The ServiceMonitor's port: http reference only works if the Service has a port with that exact name.
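The matching rule is easier to see as code. This is an illustration of the rule only, not the Prometheus Operator's implementation; the flattened object shapes (namespaces, matchLabels, endpoints) and the function name scrapeTargets are assumptions made for the sketch.

```javascript
// Illustration of the matching rule only; this is not the Prometheus Operator's code.
function scrapeTargets(monitor, service) {
  const inNamespace = monitor.namespaces.includes(service.namespace);
  const labelsMatch = Object.entries(monitor.matchLabels).every(
    ([key, value]) => service.labels[key] === value
  );
  if (!inNamespace || !labelsMatch) return [];
  // Endpoints reference Service ports by name, so an unnamed port can never match.
  return monitor.endpoints.flatMap((endpoint) =>
    service.ports
      .filter((port) => port.name === endpoint.port)
      .map((port) => ({ port: port.targetPort, path: endpoint.path }))
  );
}

const monitor = {
  namespaces: ["default"],
  matchLabels: { app: "bluegreen-app" },
  endpoints: [{ port: "http", path: "/metrics" }],
};
const unnamedPortService = {
  namespace: "default",
  labels: { app: "bluegreen-app" },
  ports: [{ port: 80, targetPort: 3000 }],
};
const namedPortService = {
  ...unnamedPortService,
  ports: [{ name: "http", port: 80, targetPort: 3000 }],
};

console.log(scrapeTargets(monitor, unnamedPortService).length); // 0
console.log(scrapeTargets(monitor, namedPortService).length); // 1
```

Nothing reports an error in the unnamed case. The list of targets is simply empty, which is why the failure is silent.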


What the Complete System Looks Like Now

Developer
    |
    | git push to main
    v
GitHub Actions (29 seconds average)
    |
    +-- Configure AWS credentials
    +-- Log in to Amazon ECR
    +-- Connect kubectl to EKS
    +-- Detect idle environment
    +-- Build and push image (--no-cache)
    +-- Deploy to idle environment
    +-- Health check idle pods
    +-- Switch traffic (patch Service selector)
    |
    v
Amazon EKS Cluster (Terraform-provisioned, 17 resources)
    |
    +-- NGINX Ingress (AWS ELB public URL)
    +-- Prometheus (scrapes /metrics every 15s from all pods)
    +-- Grafana (shows request rates per environment in real time)
    |
    +-- Kubernetes Service (selector: blue OR green)
            |
            +-- Blue Deployment  (2 pods, imagePullPolicy: Always)
            +-- Green Deployment (2 pods, imagePullPolicy: Always)


Every component serves a specific purpose. Terraform makes the infrastructure reproducible. Prometheus makes the application observable. Grafana makes the switch moment visible. GitHub Actions makes the whole process automatic.
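The "Detect idle environment" and "Switch traffic" steps in the diagram reduce to very little logic. A hedged sketch follows. It assumes the Service selector carries a color label next to app, which this post does not spell out, and both function names are hypothetical.

```javascript
// Sketch under assumptions: the Service selector uses a `color` label next to `app`.
function idleColor(activeColor) {
  if (activeColor === "blue") return "green";
  if (activeColor === "green") return "blue";
  throw new Error(`unknown color: ${activeColor}`);
}

// Builds the JSON body that a `kubectl patch service ... -p` call could send.
function selectorPatch(targetColor) {
  return JSON.stringify({
    spec: { selector: { app: "bluegreen-app", color: targetColor } },
  });
}

console.log(selectorPatch(idleColor("blue")));
// {"spec":{"selector":{"app":"bluegreen-app","color":"green"}}}
```

Because the switch is a single selector update on one Service, it is atomic from the cluster's point of view, which fits the sub-second cutover described above.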


Key Takeaways from Part 2

Test locally before deploying. If the metrics endpoint returns 404 locally, it will return 404 in the cluster. The cluster is not the place to debug application code.

imagePullPolicy: Always is not optional when you reuse image tags. If you tag your image blue and push a new version, Kubernetes will happily keep serving the old one unless you tell it not to.

Prometheus needs a named port to match a ServiceMonitor. The port: http in the ServiceMonitor references the port name in the Service, not the port number. If the Service port has no name, the scrape silently fails.

Terraform and Helm create resources in different state systems. Terraform cannot destroy resources it does not track, such as the load balancer created for the Helm-installed NGINX Ingress. Clean up those load balancers before running terraform destroy or the network deletion will fail.

The Grafana graph is the real proof. The curl loop shows zero failed requests. The Grafana graph shows what the request rate was, how fast the transition happened, and that both environments were healthy throughout. One proves correctness. The other proves performance.


The Repository

Everything in both parts of this series, the application code, Kubernetes manifests, Terraform configuration, GitHub Actions workflows, ServiceMonitor, and all 23 screenshots from the live deployment, is in the repository:

github.com/gbadedata/zero-downtime-bluegreen-eks


What Comes Next

Two improvements remain on the roadmap.

Canary releases would add a middle step between binary blue-green switching and full automated rollout. Rather than moving 100% of traffic instantly, you shift 5% to green first, monitor error rates for ten minutes, then ramp to 100% if everything looks healthy. Achievable with NGINX Ingress weight annotations.
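As a rough sketch of what the weight annotations could look like, the helper below builds the annotation map for a second, canary Ingress. It assumes blue and green each get their own Service, which is not how the current single-Service setup works, and canaryAnnotations is a hypothetical name.

```javascript
// Sketch: annotations for a separate canary Ingress that points at the green Service.
// Assumes ingress-nginx, and that blue and green are exposed through separate Services.
function canaryAnnotations(weightPercent) {
  if (!Number.isInteger(weightPercent) || weightPercent < 0 || weightPercent > 100) {
    throw new RangeError("weight must be an integer from 0 to 100");
  }
  return {
    "nginx.ingress.kubernetes.io/canary": "true",
    // Annotation values are strings in Kubernetes metadata.
    "nginx.ingress.kubernetes.io/canary-weight": String(weightPercent),
  };
}

// The two-step ramp described above: 5% first, then everything.
for (const weight of [5, 100]) {
  console.log(canaryAnnotations(weight));
}
```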

Automated rollback would close the last manual loop. Right now, if the new version has a bug after the switch, a human has to notice the Grafana graph, decide to roll back, and run the patch command. Automated rollback would watch the error rate in Prometheus for two minutes after every switch and fire the rollback automatically if the rate exceeds a defined threshold. No human required.
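A minimal sketch of the decision logic, assuming the status label on http_requests_total holds the HTTP status code as text. The function names, the 5xx definition of an error, and the example threshold are illustrative choices, not part of the existing pipeline.

```javascript
// PromQL for the share of 5xx responses; label names follow http_requests_total above.
function errorRatioQuery(color, window = "2m") {
  const errors = `sum(rate(http_requests_total{color="${color}",status=~"5.."}[${window}]))`;
  const total = `sum(rate(http_requests_total{color="${color}"}[${window}]))`;
  return `${errors} / ${total}`;
}

// errorRatios: one reading per poll during the watch period after the switch.
// Non-finite readings (no traffic yet, so 0/0) are ignored instead of triggering a rollback.
function shouldRollBack(errorRatios, threshold) {
  return errorRatios.some((ratio) => Number.isFinite(ratio) && ratio > threshold);
}

console.log(errorRatioQuery("green"));
console.log(shouldRollBack([0, 0.01, NaN], 0.05)); // false
console.log(shouldRollBack([0, 0.2], 0.05)); // true
```

Keeping the decision in a pure function means it can be tested without a cluster. Only the polling step would need to talk to Prometheus, and the rollback itself would be the same selector patch the pipeline already runs.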

Both build directly on what already exists. The observability stack from Part 2 is what makes automated rollback possible. You cannot automate a response to something you cannot measure.


Part 1: Your Deployments Are Causing Downtime. Mine Do Not. Here Is Why.

Full source code: github.com/gbadedata/zero-downtime-bluegreen-eks