This is Part 2 of a series. Part 1 covers building the core blue-green deployment pipeline on AWS EKS from scratch. You can read it here: Your Deployments Are Causing Downtime. Mine Do Not. Here Is Why. This post picks up exactly where that one left off.
After Part 1, I had a working blue-green deployment pipeline. Every push to main triggered an automated sequence that built a Docker image, deployed it to the idle environment, verified it was healthy, and switched traffic in under one second. Zero downtime. Proven with a curl loop.
But there was a problem I could not shake.
The only way to see the traffic switch happen was a terminal showing JSON responses. A curl loop is fine for a demo. It is not fine for production. In a real system you need to answer questions like: how many requests per second were hitting blue before the switch? How quickly did green ramp up? Was there any period where request rates dropped? What exactly happened at 21:49:00?
A curl loop cannot answer those questions. A proper observability stack can.
So I went back and added three things: Prometheus metrics exported directly from the application, a Grafana dashboard that tracks request rates per environment in real time, and Terraform to manage all the AWS infrastructure as code. This post covers all three and shows you the graph that proves the traffic switch happened cleanly.
What I Added on Top of Part 1
If you read Part 1, you know the core system: two Kubernetes deployments (blue and green), a Service that routes traffic based on a label selector, NGINX Ingress for the public URL, and a GitHub Actions pipeline that automates the switch.
What I added in Part 2:
- A /metrics endpoint in the Node.js app using prom-client
- A Kubernetes ServiceMonitor that tells Prometheus to scrape the app every 15 seconds
- A Grafana dashboard showing HTTP request rates per environment
- Terraform configuration that provisions the entire AWS infrastructure in one command
None of these changes how the blue-green strategy works. They make it observable and reproducible.
Adding Metrics to the Application
The first step was making the app expose data that Prometheus could scrape. I added prom-client to the Node.js application and created two metrics: a counter tracking HTTP requests labelled by method, route, status code, color, and version, and a gauge identifying which environment the pod belongs to.
const client = require("prom-client");
const register = new client.Registry();
client.collectDefaultMetrics({ register });
const httpRequests = new client.Counter({
name: "http_requests_total",
help: "Total HTTP requests by route, status, color and version",
labelNames: ["method", "route", "status", "color", "version"],
registers: [register],
});
const envGauge = new client.Gauge({
name: "bluegreen_environment_info",
help: "Current environment color and version",
labelNames: ["color", "version"],
registers: [register],
});
// APP_COLOR and APP_VERSION are assumed to be defined earlier in the file.
envGauge.labels(APP_COLOR, APP_VERSION).set(1);
Every request to /health or / increments the counter with the correct color label. Blue pods increment color="blue". Green pods increment color="green". When traffic switches, the counter accumulation shifts from one color to the other, and Prometheus captures that shift in its time-series data.
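The code that performs the increment is not shown above. A minimal sketch of how it could be wired, assuming an Express-style middleware signature: makeRequestCounter is a hypothetical helper name, and it accepts any object exposing prom-client's labels(...).inc() interface, so the httpRequests counter defined earlier could be passed in directly.

```javascript
// Hypothetical helper: builds Express-style middleware that counts finished requests.
// `counter` is anything exposing prom-client's Counter interface: labels(...values).inc().
function makeRequestCounter(counter, color, version) {
  return function countRequest(req, res, next) {
    // Wait for "finish" so the final status code is known.
    res.on("finish", () => {
      // Prefer the matched route pattern over the raw path to keep label cardinality low.
      const route = req.route && req.route.path ? req.route.path : req.path;
      // Label values follow the order of labelNames: method, route, status, color, version.
      counter
        .labels(req.method, route, String(res.statusCode), color, version)
        .inc();
    });
    next();
  };
}

// Assumed usage in server.js, registered before the routes:
//   app.use(makeRequestCounter(httpRequests, APP_COLOR, APP_VERSION));
```

Counting on "finish" rather than at the start of the request means the status label reflects what the client actually received.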
The /metrics endpoint serves the data:
app.get("/metrics", async (req, res) => {
res.set("Content-Type", register.contentType);
res.end(await register.metrics());
});
Testing the Metrics Endpoint Before Deploying Anything
This is where I made a mistake the first time around, and it cost me a full rebuild cycle. I updated the code, rebuilt the Docker image, deployed to the cluster, and then tested the endpoint. It returned 404. The reason was that Kubernetes had cached the old image on the nodes. Even though I pushed a new image to ECR, the pods kept running the cached version because imagePullPolicy was not set to Always.
The lesson: always test the metrics endpoint locally before touching the cluster.
node server.js &
sleep 2
curl http://localhost:3000/metrics | head -10
# Must show: # HELP process_cpu_user_seconds_total ...
kill %1
If that does not work locally, nothing you do on the cluster will fix it. Fix it at the source.
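The curl check above can also be scripted so it fails loudly when an expected metric is missing. This is a sketch under stated assumptions: missingMetrics and the METRICS_URL environment variable are hypothetical names, and the network call relies on the global fetch available in Node 18 and later.

```javascript
// Hypothetical smoke check for a Prometheus text-format response.
// Returns the names from `required` that appear in neither a "# HELP"/"# TYPE"
// line nor a sample line of `body`.
function missingMetrics(body, required) {
  const seen = new Set();
  for (const line of body.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    if (trimmed.startsWith("#")) {
      // Comment lines look like "# HELP <name> <text>" or "# TYPE <name> <type>".
      const parts = trimmed.split(/\s+/);
      if (parts.length >= 3 && (parts[1] === "HELP" || parts[1] === "TYPE")) {
        seen.add(parts[2]);
      }
      continue;
    }
    // Sample lines look like "name{labels} value" or "name value".
    const match = trimmed.match(/^[a-zA-Z_:][a-zA-Z0-9_:]*/);
    if (match) seen.add(match[0]);
  }
  return required.filter((name) => !seen.has(name));
}

// Only runs when METRICS_URL is set, for example:
//   METRICS_URL=http://localhost:3000/metrics node check-metrics.js
if (process.env.METRICS_URL) {
  fetch(process.env.METRICS_URL)
    .then((res) => res.text())
    .then((body) => {
      const missing = missingMetrics(body, [
        "http_requests_total",
        "bluegreen_environment_info",
      ]);
      if (missing.length > 0) {
        console.error("Missing metrics:", missing.join(", "));
        process.exit(1);
      }
      console.log("All expected metrics are exposed.");
    })
    .catch((err) => {
      console.error("Could not reach the metrics endpoint:", err.message);
      process.exit(1);
    });
}
```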
The second lesson: add imagePullPolicy: Always to every Deployment manifest before you ever push an image with a reused tag.
containers:
- name: bluegreen-app
image: 677276115158.dkr.ecr.us-east-1.amazonaws.com/bluegreen-app:blue
imagePullPolicy: Always
Without that line, you will spend time chasing a ghost. The code is right. The image is right. The cluster is lying to you.
Connecting Prometheus to the Application
Prometheus does not automatically discover application pods. You have to tell it what to scrape. On a cluster running the kube-prometheus-stack Helm chart, the way to do that is a ServiceMonitor resource.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: bluegreen-monitor
namespace: monitoring
labels:
release: monitoring
spec:
namespaceSelector:
matchNames:
- default
selector:
matchLabels:
app: bluegreen-app
endpoints:
- port: http
path: /metrics
interval: 15s
This tells Prometheus: find all Services in the default namespace with the label app: bluegreen-app, then scrape their /metrics endpoint every 15 seconds on the port named http.
The Service also needed a named port to match:
ports:
- name: http
port: 80
targetPort: 3000
Once both are applied, Prometheus starts collecting http_requests_total data from every blue and green pod automatically. No manual configuration needed when pods restart or scale.
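Because a mismatch between the two manifests fails silently, it can help to check them against each other before applying. The sketch below assumes both manifests have already been parsed into plain objects; checkServiceMonitor is a hypothetical helper, not part of any Prometheus tooling, and it only covers the three fields used in this post.

```javascript
// Hypothetical pre-apply check: does this ServiceMonitor actually match this Service?
// Both arguments are manifests already parsed into plain objects.
function checkServiceMonitor(monitor, service) {
  const problems = [];

  // 1. Every label in selector.matchLabels must be present on the Service itself.
  const wanted = (monitor.spec.selector && monitor.spec.selector.matchLabels) || {};
  const labels = (service.metadata && service.metadata.labels) || {};
  for (const [key, value] of Object.entries(wanted)) {
    if (labels[key] !== value) {
      problems.push(`Service is missing label ${key}: ${value}`);
    }
  }

  // 2. Every endpoint port must refer to a Service port by name, not by number.
  const portNames = (service.spec.ports || []).map((p) => p.name);
  for (const endpoint of monitor.spec.endpoints || []) {
    if (!portNames.includes(endpoint.port)) {
      problems.push(`Service has no port named "${endpoint.port}"`);
    }
  }

  // 3. If namespaceSelector.matchNames is set, it must list the Service namespace.
  const names =
    monitor.spec.namespaceSelector && monitor.spec.namespaceSelector.matchNames;
  const namespace = (service.metadata && service.metadata.namespace) || "default";
  if (names && !names.includes(namespace)) {
    problems.push(`Namespace "${namespace}" is not selected`);
  }

  return problems;
}
```

An empty array means the selector, port name, and namespace all line up. Anything else names the field to fix.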
The Grafana Dashboard
With metrics flowing into Prometheus, the Grafana dashboard becomes straightforward. The key PromQL queries are:
rate(http_requests_total{color="blue"}[1m])
rate(http_requests_total{color="green"}[1m])
rate() calculates the per-second request rate over a one-minute window. Labelling by color means blue and green show as separate lines on the same graph. During normal operation, only blue has traffic. At the moment of the switch, blue drops and green rises.
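To make the arithmetic behind rate() concrete, here is a simplified stand-in written in JavaScript. It is not Prometheus's implementation: simpleRate is a hypothetical function that skips the extrapolation Prometheus applies at the window edges, and the sample values are made up for illustration.

```javascript
// Simplified stand-in for PromQL rate(): per-second increase of a counter over a window.
// `samples` is an array of { t, v } (t in seconds, v the counter value), oldest first.
// Unlike Prometheus, this does not extrapolate to the window boundaries.
function simpleRate(samples) {
  if (samples.length < 2) return null; // not enough data points
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter reset (for example a pod restart), so count from zero.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return seconds > 0 ? increase / seconds : null;
}

// Five illustrative scrapes, 15 seconds apart: 33 requests over 60 seconds.
const window1m = [
  { t: 0, v: 100 },
  { t: 15, v: 108 },
  { t: 30, v: 116 },
  { t: 45, v: 125 },
  { t: 60, v: 133 },
];
console.log(simpleRate(window1m)); // → 0.55
```

With a 15-second scrape interval, a one-minute window holds only four or five samples, which is why the lines on the graph move in steps rather than smoothly.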
Here is what the graph looked like during a live traffic switch:
At 21:49:00 the blue /health line dropped from 0.55 requests per second to 0.25. At the same moment, the green /health line appeared at 0.19 and climbed to match the previous blue rate. The crossover is visible as a V-shape on the graph. Both environments tracked simultaneously. The switch captured in real-time data.
This is what a curl loop cannot show you. The curl loop proves no requests failed. The Grafana graph proves how fast the transition happened, what the request rate was before and after, and that both environments were being scraped correctly throughout.
Terraform: Making the Infrastructure Reproducible
The first time I built this project, I provisioned the AWS infrastructure using eksctl create cluster and a series of aws CLI commands. That works. But it leaves no record of what was created, cannot be version-controlled, and requires you to remember every command in the correct order if you ever need to rebuild.
Terraform solves all three problems by describing infrastructure as code.
resource "aws_eks_cluster" "main" {
name = var.cluster_name
role_arn = aws_iam_role.eks_cluster.arn
version = var.kubernetes_version
vpc_config {
subnet_ids = aws_subnet.public[*].id
endpoint_public_access = true
}
depends_on = [
aws_iam_role_policy_attachment.eks_cluster_policy,
]
}
The entire infrastructure (VPC, subnets, internet gateway, route table, EKS cluster, node group, ECR repository, and all IAM roles and policy attachments) is defined across three files: main.tf, variables.tf, and outputs.tf. Rebuilding everything takes one command:
terraform apply
It takes 10 to 15 minutes and prints every output you need at the end, including the kubectl connect command and the ECR registry URL.
cluster_endpoint = "https://829B30456E2DD983A919AC57A97A18A8.gr7.us-east-1.eks.amazonaws.com"
cluster_name = "bluegreen-cluster"
cluster_version = "1.31"
ecr_repository_url = "677276115158.dkr.ecr.us-east-1.amazonaws.com/bluegreen-app"
kubectl_config_command = "aws eks update-kubeconfig --region us-east-1 --name bluegreen-cluster"
The Tear-Down Problem Terraform Exposed
When I ran terraform destroy the first time, it failed. The VPC could not be deleted because NGINX Ingress had created an AWS Elastic Load Balancer outside of Terraform's knowledge. That load balancer was still attached to the subnets, which blocked subnet deletion, which blocked internet gateway deletion, which blocked VPC deletion.
Terraform had no record of the load balancer because Helm created it, not Terraform. This is a real-world infrastructure management problem: resources created by one tool can block resources managed by another.
The fix is to clean up the orphaned load balancer manually before running destroy:
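# Assumes the Ingress load balancer is the only Classic ELB in this region;
# with more than one, pick the right name from the output by hand.
LB_NAME=$(aws elb describe-load-balancers --region us-east-1 \
--query "LoadBalancerDescriptions[*].LoadBalancerName" --output text)
aws elb delete-load-balancer --region us-east-1 --load-balancer-name "$LB_NAME"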
# Wait for the load balancer to fully detach
sleep 30
terraform destroy -auto-approve
After that, destroy completes cleanly. This is now documented in the README so anyone who forks the project does not hit the same wall.
What Terraform Tracks
Running terraform state list after a successful apply shows every resource Terraform manages:
aws_ecr_lifecycle_policy.app
aws_ecr_repository.app
aws_eks_cluster.main
aws_eks_node_group.main
aws_iam_role.eks_cluster
aws_iam_role.eks_nodes
aws_iam_role_policy_attachment.eks_cluster_policy
aws_iam_role_policy_attachment.eks_cni_policy
aws_iam_role_policy_attachment.eks_ecr_readonly
aws_iam_role_policy_attachment.eks_worker_node_policy
aws_internet_gateway.main
aws_route_table.public
aws_route_table_association.public[0]
aws_route_table_association.public[1]
aws_subnet.public[0]
aws_subnet.public[1]
aws_vpc.main
17 resources. All defined in code. All version-controlled. All reproducible.
The Challenges That Only Appeared in Part 2
Part 1 had its own challenges: the AWS ELB hostname versus IP address difference, the ECR IAM policy that most tutorials skip, the workflow files that were never actually in the repository. Part 2 introduced new ones.
The Metrics Endpoint Returned 404
After updating the app code and pushing a new image to ECR, the pods were still serving the old image without the metrics route. The cache on the Kubernetes nodes was serving stale containers because the image tag (blue) had not changed, and imagePullPolicy defaulted to IfNotPresent.
The fix was imagePullPolicy: Always combined with docker build --no-cache. One ensures the cluster always pulls the latest image. The other ensures Docker does not reuse cached layers that might bake in old code.
Grafana Showed No Green Lines After the Switch
After switching traffic to green, the Grafana dashboard showed only blue lines continuing. Green never appeared. The reason was that the ServiceMonitor was selecting the Service by its app: bluegreen-app label, but the Service itself did not have a named port, so Prometheus could not match the scrape endpoint configuration.
Adding name: http to the Service port definition and restarting the Prometheus operator resolved it. The ServiceMonitor's port: http reference only works if the Service has a port with that exact name.
What the Complete System Looks Like Now
Developer
|
| git push to main
v
GitHub Actions (29 seconds average)
|
+-- Configure AWS credentials
+-- Log in to Amazon ECR
+-- Connect kubectl to EKS
+-- Detect idle environment
+-- Build and push image (--no-cache)
+-- Deploy to idle environment
+-- Health check idle pods
+-- Switch traffic (patch Service selector)
|
v
Amazon EKS Cluster (Terraform-provisioned, 17 resources)
|
+-- NGINX Ingress (AWS ELB public URL)
+-- Prometheus (scrapes /metrics every 15s from all pods)
+-- Grafana (shows request rates per environment in real time)
|
+-- Kubernetes Service (selector: blue OR green)
|
+-- Blue Deployment (2 pods, imagePullPolicy: Always)
+-- Green Deployment (2 pods, imagePullPolicy: Always)
Every component serves a specific purpose. Terraform makes the infrastructure reproducible. Prometheus makes the application observable. Grafana makes the switch moment visible. GitHub Actions makes the whole process automatic.
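Two of the pipeline steps in the diagram, detecting the idle environment and switching traffic, come down to very little logic. The sketch below shows that logic in isolation. The helper names are hypothetical, and the selector label key color is an assumption; substitute whatever key the Service selector from Part 1 uses.

```javascript
// Hypothetical helpers mirroring two pipeline steps: "Detect idle environment"
// and "Switch traffic". The selector label key "color" is an assumption.
const COLORS = ["blue", "green"];

// Given the color currently receiving traffic, return the idle one.
function idleColor(activeColor) {
  if (!COLORS.includes(activeColor)) {
    throw new Error(`Unknown active color: ${activeColor}`);
  }
  return activeColor === "blue" ? "green" : "blue";
}

// Build the patch body that points the Service selector at the target color.
function selectorPatch(targetColor) {
  if (!COLORS.includes(targetColor)) {
    throw new Error(`Unknown target color: ${targetColor}`);
  }
  return { spec: { selector: { app: "bluegreen-app", color: targetColor } } };
}

// The printed JSON could then be passed to kubectl, for example:
//   kubectl patch service <service-name> -p '<JSON printed below>'
console.log(JSON.stringify(selectorPatch(idleColor("blue"))));
```

Rejecting unknown colors up front matters here: a typo in the selector would leave the Service matching no pods at all, which is exactly the downtime the pipeline exists to prevent.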
Key Takeaways from Part 2
Test locally before deploying. If the metrics endpoint returns 404 locally, it will return 404 in the cluster. The cluster is not the place to debug application code.
imagePullPolicy: Always is not optional when you reuse image tags. If you tag your image blue and push a new version, Kubernetes will happily keep serving the old one unless you tell it not to.
Prometheus needs a named port to match a ServiceMonitor. The port: http in the ServiceMonitor references the port name in the Service, not the port number. If the Service port has no name, the scrape silently fails.
Terraform and Helm create resources in different state systems. Terraform cannot destroy resources Helm created. Clean up Helm-managed load balancers before running terraform destroy or the network deletion will fail.
The Grafana graph is the real proof. The curl loop shows zero failed requests. The Grafana graph shows what the request rate was, how fast the transition happened, and that both environments were healthy throughout. One proves correctness. The other proves performance.
The Repository
Everything in both parts of this series (the application code, Kubernetes manifests, Terraform configuration, GitHub Actions workflows, ServiceMonitor, and all 23 screenshots from the live deployment) is in the repository:
github.com/gbadedata/zero-downtime-bluegreen-eks
What Comes Next
Two improvements remain on the roadmap.
Canary releases would add a middle step between binary blue-green switching and full automated rollout. Rather than moving 100% of traffic instantly, you shift 5% to green first, monitor error rates for ten minutes, then ramp to 100% if everything looks healthy. Achievable with the NGINX Ingress canary and canary-weight annotations.
Automated rollback would close the last manual loop. Right now, if the new version has a bug after the switch, a human has to notice the Grafana graph, decide to roll back, and run the patch command. Automated rollback would watch the error rate in Prometheus for two minutes after every switch and fire the rollback automatically if the rate exceeds a defined threshold. No human required.
Both build directly on what already exists. The observability stack from Part 2 is what makes automated rollback possible. You cannot automate a response to something you cannot measure.
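The rollback decision described above reduces to comparing an error rate against a threshold. A minimal sketch, assuming the per-status request counts for the watch window have already been read from Prometheus. errorRate and shouldRollBack are hypothetical names, and treating only 5xx responses as errors is an assumption.

```javascript
// Hypothetical rollback decision based on request counts observed after a switch.
// `counts` maps an HTTP status code (as a string) to the number of requests
// seen with that status during the watch window.
function errorRate(counts) {
  let total = 0;
  let errors = 0;
  for (const [status, n] of Object.entries(counts)) {
    total += n;
    // Assumption: only server errors (5xx) count against the new version.
    if (Number(status) >= 500) errors += n;
  }
  // No traffic at all is reported as a zero error rate here.
  return total === 0 ? 0 : errors / total;
}

// True when the observed error rate exceeds the allowed threshold (0.02 = 2%).
function shouldRollBack(counts, threshold) {
  return errorRate(counts) > threshold;
}
```

One design choice to revisit: a window with zero requests passes this check, which may or may not be acceptable. A stricter version could treat silence after a switch as a failure in its own right.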
Part 1: Your Deployments Are Causing Downtime. Mine Do Not. Here Is Why.
Full source code: github.com/gbadedata/zero-downtime-bluegreen-eks

























