Zero-Downtime Blue-Green and IP-Based Canary Deployments on ECS Fargate

Most ECS blue-green deployment tutorials eventually lead to the same stack:

AWS CodeDeploy
Deployment groups
AppSpec files
Lifecycle hooks
Weighted traffic shifting
Complex rollback orchestration

And while CodeDeploy works, I kept running into one practical limitation during real deployments:

I couldn’t let my internal team validate a new release on the actual production URL before exposing it to customers.

That became the entire motivation behind this setup.

I didn’t want:

separate staging domains
duplicate ALBs
temporary preview environments
“almost production” testing

I wanted something much simpler:

Internal users should see the new version first
Customers should continue seeing the stable version
Both should use the same production domain
Rollback should be immediate
Deployments should remain fully zero downtime

So I built a Terraform-driven deployment workflow using:

ECS Fargate
Application Load Balancer (ALB)
ALB listener priorities
Source IP routing
Terraform

without using CodeDeploy.

After running this setup in practice, I ended up preferring it for many ECS workloads.

The Core Idea

Both BLUE and GREEN environments run behind the same ALB.

Internal office/VPN IPs get routed to GREEN first.

Everyone else continues hitting BLUE.

That means QA and internal teams can validate the new release directly on the real production infrastructure before public rollout begins.

Same:

domain
SSL certificate
ALB
authentication flow
redirects
networking path

No “staging surprises” later.

A lot of deployment issues only appear on the real production routing path.

Real Example

Internal users open:

https://nginx.jayakrishnayadav.cloud

…and immediately see the GREEN version.

Meanwhile, public users continue seeing BLUE.

No DNS switching.

No duplicate infrastructure.

Just ALB listener routing.

Architecture Overview

The deployment flow looks like this:

                ┌────────────────────┐
                │   Application LB   │
                └─────────┬──────────┘
                          │
         ┌────────────────┴────────────────┐
         │                                 │
 Internal Office/VPN IPs             Public Users
         │                                 │
         ▼                                 ▼
   GREEN Target Group               BLUE Target Group
         │                                 │
    ECS GREEN Tasks                  ECS BLUE Tasks

The canary routing rule gets evaluated first.

If the request source IP matches internal CIDRs, traffic goes to GREEN.

Everything else falls back to BLUE.

Terraform Structure

I kept the Terraform layout modular so it could be reused across multiple services.

.
├── main.tf
├── variables.tf
├── outputs.tf
├── env/
│   ├── backend.hcl
│   └── terraform.tfvars
├── modules/
│   ├── vpc/
│   ├── iam/
│   ├── alb/
│   ├── ecs-cluster/
│   └── ecs-blue-green-service/
└── scripts/
    └── zero-downtime-test.sh

Each ECS service gets:

BLUE ECS service
GREEN ECS service
BLUE target group
GREEN target group
production listener rule
optional canary listener rule

ALB Listener Rule Logic

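The entire deployment behavior depends on ALB listener priorities.

An ALB evaluates listener rules in priority order, lowest number first, and applies the first rule whose conditions all match. The canary rule sits at priority 99, so it is checked before the production rule at priority 100.

If the request's source IP matches the internal CIDRs and the host header matches, traffic gets forwarded to GREEN.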

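resource "aws_lb_listener_rule" "canary" {
  # Created only while canary routing is switched on.
  count = var.activate_canary ? 1 : 0

  # listener_arn is required by this resource; the variable name is an assumption.
  listener_arn = var.listener_arn
  priority     = 99

  # Both conditions must match: internal source IP and the production host.
  condition {
    source_ip {
      values = var.canary_source_ips
    }
  }

  condition {
    host_header {
      values = ["nginx.jayakrishnayadav.cloud"]
    }
  }

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.green.arn
  }
}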

The production rule remains below it:

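resource "aws_lb_listener_rule" "production" {
  # listener_arn is required by this resource; the variable name is an assumption.
  listener_arn = var.listener_arn
  priority     = 100

  condition {
    host_header {
      values = ["nginx.jayakrishnayadav.cloud"]
    }
  }

  action {
    type = "forward"
    # Resolves to the BLUE or GREEN target group depending on the rollout state.
    target_group_arn = local.active_target_group
  }
}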

That’s it.

No weighted routing.

No lifecycle hooks.

Just listener priorities.
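To make the evaluation order concrete, here is a small shell model of the two rules. It is only a sketch: pick_target_group is a hypothetical helper, and it assumes local.active_target_group resolves to GREEN when promote_to_all is true and to BLUE otherwise. That matches the promotion step in this post but is not shown in the snippets above.

```shell
#!/usr/bin/env bash
# Toy model of ALB first-match routing for the two rules above.
# Assumption: local.active_target_group is GREEN only when promote_to_all=true.

# pick_target_group ACTIVATE_CANARY PROMOTE_TO_ALL SOURCE_IS_INTERNAL
# Each argument is the literal string "true" or "false".
pick_target_group() {
  local activate_canary="$1" promote_to_all="$2" internal="$3"

  # Priority 99: exists only while activate_canary is true and matches
  # only requests whose source IP is inside the canary CIDRs.
  if [ "$activate_canary" = true ] && [ "$internal" = true ]; then
    echo "GREEN"
    return
  fi

  # Priority 100: catches every other request for the host.
  if [ "$promote_to_all" = true ]; then
    echo "GREEN"
  else
    echo "BLUE"
  fi
}

pick_target_group true false true    # internal user during canary
pick_target_group true false false   # public user during canary
```

The first call models an office/VPN request while the canary rule exists; the second models a customer request in the same phase.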

Real Deployment Workflow

This wasn’t built as a theoretical architecture exercise.

I tested the rollout flow directly from Terraform while continuously validating traffic behavior against live ECS Fargate services.

Terraform initialization:

terraform init -backend-config=env/backend.hcl

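Deployment apply (note that -lock=false disables Terraform state locking, so drop it when more than one person or pipeline can apply against the same state):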

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

During canary validation, I continuously verified my public IP:

curl ifconfig.me

That mattered because the ALB source-IP rule decides whether traffic reaches BLUE or GREEN.

Once my IP matched the configured canary CIDRs, traffic immediately started routing to GREEN.
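Checking the match by eye gets tedious once there are several CIDRs. The sketch below tests whether an IPv4 address falls inside any CIDR in a list. The helper names and the example addresses (taken from the documentation ranges) are illustrative, not part of the repository, and each CIDR is assumed to carry an explicit prefix length.

```shell
#!/usr/bin/env bash
# Check whether an IPv4 address falls inside any of a list of CIDRs.
# All names and example addresses here are placeholders.

# ip_to_int A.B.C.D -> prints the address as a 32-bit integer
ip_to_int() {
  local IFS=.
  set -- $1          # split the dotted quad into four positional parameters
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# ip_in_cidr IP NETWORK/BITS -> exit status 0 if IP is inside the CIDR
ip_in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}"
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# in_any_cidr IP CIDR... -> exit status 0 if IP matches at least one CIDR
in_any_cidr() {
  local ip="$1" cidr
  shift
  for cidr in "$@"; do
    ip_in_cidr "$ip" "$cidr" && return 0
  done
  return 1
}

if in_any_cidr 203.0.113.7 198.51.100.10/32 203.0.113.0/24; then
  echo "203.0.113.7 would match the canary rule"
fi
```

In practice the first argument would be the output of curl -s ifconfig.me and the remaining arguments the values configured in canary_source_ips.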

Deployment Flow

The nice part about this setup is that everything becomes variable-driven.

Step 1 — Normal Production State

BLUE handles all production traffic.

GREEN remains scaled down.

enable_canary   = false
activate_canary = false
promote_to_all  = false

Apply:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

Result:

BLUE active
GREEN inactive
minimal Fargate cost

Step 2 — Start GREEN Tasks

Now we start the GREEN environment.

enable_canary   = true
activate_canary = false
promote_to_all  = false

Apply again:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

At this stage:

GREEN tasks start
ECS health checks complete
ALB target registration completes
no production traffic reaches GREEN yet

Users never hit containers that are still starting.
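Before moving on to the canary step, it can help to confirm that the GREEN target group really has healthy targets. The following is a rough sketch using the AWS CLI's elbv2 describe-target-health command; the function names are hypothetical and the target group ARN has to be supplied by the caller.

```shell
#!/usr/bin/env bash
# Count healthy targets in a target group before enabling the canary rule.
# The ARN is a caller-supplied placeholder; nothing here comes from the repo.

# healthy_target_count TARGET_GROUP_ARN -> prints the number of healthy targets
healthy_target_count() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'length(TargetHealthDescriptions[?TargetHealth.State==`healthy`])' \
    --output text
}

# green_is_ready TARGET_GROUP_ARN MIN_HEALTHY -> exit status 0 when ready
green_is_ready() {
  local count
  count=$(healthy_target_count "$1") || return 1
  [ "$count" -ge "$2" ]
}
```

A call such as green_is_ready "$GREEN_TG_ARN" 1 could then gate the switch to activate_canary = true, where GREEN_TG_ARN is whatever ARN the GREEN target group has in your account.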

Step 3 — Internal Canary Validation

Now we enable canary routing.

enable_canary   = true
activate_canary = true
promote_to_all  = false

Apply again:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

Now:

internal office/VPN users hit GREEN
public users continue hitting BLUE

This became the most valuable phase of the deployment workflow.

Because now:

QA validates production behavior
developers inspect logs
authentication flows get tested
sessions and redirects get verified

while customers remain completely unaffected.

Internal Canary Routing

While canary routing is enabled, the ALB listener shows both rules: the priority 99 rule matches internal source IPs and forwards them to GREEN, and the priority 100 production rule keeps sending everyone else to BLUE.
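The same view is available from the command line. This is a minimal sketch around the AWS CLI's elbv2 describe-rules command; list_listener_rules is a hypothetical wrapper and the listener ARN is supplied by the caller.

```shell
#!/usr/bin/env bash
# Print each listener rule's priority and the target group it forwards to.
# The listener ARN is a caller-supplied placeholder.

# list_listener_rules LISTENER_ARN -> one "priority<TAB>target group ARN" line per rule
list_listener_rules() {
  aws elbv2 describe-rules \
    --listener-arn "$1" \
    --query 'Rules[].[Priority, Actions[0].TargetGroupArn]' \
    --output text
}
```

During the canary phase the output should contain a line for priority 99 pointing at the GREEN target group and one for priority 100 pointing at BLUE, plus the listener's default rule.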

Step 4 — Promote GREEN to Production

Once validation looks good:

enable_canary   = true
activate_canary = false
promote_to_all  = true

Apply again:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

Now:

production listener switches to GREEN
BLUE scales down
all users see the new version

No downtime occurs.

Traffic simply moves from one target group to another.
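The four steps use only four combinations of the three flags, so a small pre-apply check can catch a mistyped tfvars file. This is a sketch under assumptions: the helper names are made up, the parser expects lines of the form name = value with spaces around the equals sign, and treating every other combination as an error is my choice rather than something the module is known to enforce.

```shell
#!/usr/bin/env bash
# Map the three rollout flags to the step they represent, or reject them.

# tfvar_bool FILE NAME -> prints the value assigned to NAME in FILE
tfvar_bool() {
  awk -v name="$2" '$1 == name && $2 == "=" { print $3; exit }' "$1"
}

# phase_for_flags ENABLE_CANARY ACTIVATE_CANARY PROMOTE_TO_ALL
phase_for_flags() {
  case "$1/$2/$3" in
    false/false/false) echo "step 1: BLUE serves all traffic, GREEN scaled down" ;;
    true/false/false)  echo "step 2: GREEN tasks running, no traffic routed to them" ;;
    true/true/false)   echo "step 3: canary CIDRs reach GREEN, everyone else BLUE" ;;
    true/false/true)   echo "step 4: GREEN serves all traffic, BLUE scales down" ;;
    *)
      echo "not one of the four documented combinations: $1/$2/$3" >&2
      return 1
      ;;
  esac
}

# check_tfvars FILE -> prints the step, or fails on an undocumented combination
check_tfvars() {
  phase_for_flags \
    "$(tfvar_bool "$1" enable_canary)" \
    "$(tfvar_bool "$1" activate_canary)" \
    "$(tfvar_bool "$1" promote_to_all)"
}
```

Running check_tfvars env/terraform.tfvars before terraform apply prints which step the file describes and exits non-zero for anything else, for example activate_canary = true while GREEN is still scaled down.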

Verifying Zero Downtime

I didn’t want to assume the deployment was safe.

I wanted to verify it continuously during rollout.

So I used a simple curl-based validation script that continuously hit both applications while traffic shifted between BLUE and GREEN.

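#!/usr/bin/env bash
# Poll both services 100 times and report the status code and which color answered.
for i in {1..100}
do
  for url in \
    "https://nginx.jayakrishnayadav.cloud/" \
    "https://apache.jayakrishnayadav.cloud/"
  do
    # -k skips certificate verification; -w appends the status code after the body.
    response=$(curl -k -s -w " HTTPSTATUS:%{http_code}" "$url")

    # Split the combined output back into body and status code.
    body=${response% HTTPSTATUS:*}
    status=${response##*HTTPSTATUS:}

    # Each page is expected to contain "BLUE - v..." or "GREEN - v...".
    if [[ $body == *"BLUE - v"* ]]; then
      color="BLUE"
    elif [[ $body == *"GREEN - v"* ]]; then
      color="GREEN"
    else
      color="UNKNOWN"
    fi

    echo "Run: $i | URL: $url | Status: $status | Version: $color"
  done
done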

The output during deployment showed:

HTTP 200 responses throughout deployment
no failed requests
no 503s
clean traffic movement from BLUE to GREEN

That confirmed the deployment was genuinely zero downtime.
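To turn the loop's output into a pass/fail signal instead of reading it line by line, the lines can be piped through a small summary. This is a sketch that assumes the exact "Run: ... | URL: ... | Status: ... | Version: ..." format printed above; summarize_runs is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Summarize the validation loop's output read from stdin.
# Exit status is non-zero if any request returned something other than 200.
summarize_runs() {
  awk -F' [|] ' '
    {
      # Fields: 1 = "Run: N", 2 = "URL: ...", 3 = "Status: ...", 4 = "Version: ..."
      sub(/^Status: /, "", $3)
      sub(/^Version: /, "", $4)
      total++
      if ($3 != "200") failed++
      seen[$4]++
    }
    END {
      printf "total=%d failed=%d blue=%d green=%d unknown=%d\n",
        total, failed, seen["BLUE"], seen["GREEN"], seen["UNKNOWN"]
      exit (failed > 0)
    }'
}
```

Assuming the loop is saved as scripts/zero-downtime-test.sh, something like bash scripts/zero-downtime-test.sh | tee run.log | summarize_runs keeps the raw log and fails when any request did not return 200.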

Production Promotion View

After promotion:

the canary rule disappears
the production listener points directly to GREEN
all traffic reaches the new version
BLUE scales down to zero

Clean and simple.

Rollback

Rollback became extremely simple.

I just reverted the Terraform variables:

enable_canary   = false
activate_canary = false
promote_to_all  = false

Apply Terraform again:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

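The ALB routes traffic back to BLUE as soon as the listener rule is updated. That is immediate while BLUE tasks are still running; if BLUE has already been scaled to zero after promotion, its tasks have to start and pass health checks before they can serve traffic.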

The rollback process stays predictable because traffic switching is entirely controlled through ALB listener rules.

HTTPS Configuration

The ALB uses ACM certificates for HTTPS.

Listeners:

Port 80 → redirect to HTTPS
Port 443 → production traffic
optional internal listener → restricted to internal CIDRs

Example:

test_listener_allowed_cidrs = [
  "160.30.39.198/32"
]

That keeps internal preview traffic private while still using the same production infrastructure.

Cost Optimization

One thing I specifically wanted to avoid was permanently doubling infrastructure cost.

Normal state:

only BLUE tasks run

Deployment window:

BLUE + GREEN both run temporarily

After promotion:

BLUE scales down again

So infrastructure cost only increases briefly during deployments.

Final Thoughts

This project started because I wanted a very practical deployment workflow:

Internal users should validate the new version on the actual production URL before customers ever see it.

Once I implemented that using ALB listener priorities and source IP routing, I realized I no longer really needed CodeDeploy for this workflow.

The end result became:

simpler
easier to operate
easier to roll back
easier to debug
easier to reason about
fully zero downtime

And because everything is Terraform-driven, the deployment process stays reproducible and predictable.

GitHub Repository

Full Terraform implementation:

https://github.com/jayakrishnayadav24/ecs-blue-green-deployment/tree/canary
