ECS Fargate Best Practices: Running a Fleet of 10+ Environments Without the Pain

Originally published at https://fortem.dev/blog/ecs-fargate-best-practices
Seven ECS Fargate best practices for teams running 10+ environments. Fix hidden costs, Terraform state sprawl, Fargate quota sharing, and scheduling before they break your fleet.

Most ECS Fargate best practices guides tell you what to do. This one tells you what breaks between environment 5 and environment 20 — and gives you the exact fix for each. The numbers come from AWS published pricing, service quotas, and patterns we've seen managing fleets at scale. If you're running fewer than 5 environments, most of this won't matter yet. Bookmark it.

TL;DR

Name everything consistently from day one; retrofitting naming across 10+ environments takes weeks.
Fixed overhead is $85–100/mo per environment before a single container runs — at 50 envs that's $4,250–5,000/mo invisible spend.
Schedule dev/staging off-hours first. It cuts compute cost 60–70% and requires zero infrastructure changes.
Set CloudWatch log retention before ingestion hits 15 TB/mo and you get a $7,500 bill.
Isolate Terraform state per environment before the 25 MB threshold makes plans take 30+ minutes.

Start with naming and account structure

At 3 environments you can get away with ad-hoc names. At 10 you can't — because every AWS resource name is simultaneously a billing dimension, an IAM scope, and a CloudWatch filter. Inconsistent names mean you can't attribute cost, can't write scoped IAM policies, and can't build dashboards without a lookup table.

The convention that scales: {region_short}-{account}-{envname}. Applied to every resource from day one. One Terraform local generates every downstream resource name — ECS cluster, task definition, SSM parameter path, IAM role, CloudWatch log group — all from one source.

Ready to use — copy this today

locals {
  env_prefix = "${var.region_short}-${var.account}-${var.envname}"
}

resource "aws_ecs_cluster" "main" {
  name = local.env_prefix  # → "use1-prod-main"
}

resource "aws_ecs_task_definition" "api" {
  family = "${local.env_prefix}-api-td"
  # → "use1-prod-main-api-td"
}

resource "aws_ssm_parameter" "db_host" {
  name = "/${local.env_prefix}/api/DB_HOST"
  # → "/use1-prod-main/api/DB_HOST"
}

resource "aws_iam_role" "task_role" {
  name = "${local.env_prefix}-api-task-role"
  # → "use1-prod-main-api-task-role"
}

resource "aws_cloudwatch_log_group" "api" {
  name = "/ecs/${local.env_prefix}-api"
  retention_in_days = var.log_retention_days
}

Map naming to account structure. The most common pattern that works at 10+ environments: one AWS account for production, one for all non-prod. This separates Fargate vCPU quota pools, hardens IAM boundaries, and makes Cost Explorer attribution clean.

One constraint your naming convention must handle: ALB target group names are capped at 32 characters, and each ALB has a hard limit of 100 target groups. At 20 environments with 6 services each, you're at 120 target groups, which is past the limit. This forces per-environment ALBs sooner than you think, which increases your fixed overhead. A short naming prefix (use1-prod-api, 13 chars) leaves room for the target group suffix.
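One way to enforce the 32-character budget at the source is a validation rule on the prefix itself. This is a minimal sketch, not the article's module: `env_prefix` is modeled here as a hypothetical input variable (the earlier example builds it as a local), and the 20-character cap assumes that no "-<service>-tg" suffix is longer than 12 characters.

```hcl
# Hypothetical input variable. The 20-character cap is an assumption:
# 32 (target group limit) minus 12 (longest expected "-<service>-tg" suffix).
variable "env_prefix" {
  type        = string
  description = "Shared name prefix, for example use1-prod-main"

  validation {
    condition     = length(var.env_prefix) <= 20
    error_message = "The env_prefix value must be 20 characters or fewer so target group names fit within 32 characters."
  }
}

locals {
  # Name for an assumed "api" service: "<env_prefix>-api-tg"
  api_target_group_name = "${var.env_prefix}-api-tg"
}
```

With this in place, an over-long prefix fails at plan time with a message that points at the naming convention rather than at an individual target group.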

For the full naming pattern table, including the 32-character target group constraint and per-resource examples, see the dedicated section on consistent naming conventions for ECS environments.

Know your fixed overhead per environment

When engineers estimate ECS costs, they calculate compute: vCPU hours, memory hours, maybe RDS. What they miss is the fixed overhead that exists before a single container runs.

Every environment needs its own ALB and NAT Gateway. These costs are flat — they don't scale with usage, they don't go away when you stop tasks at night, and they don't appear on the compute line in Cost Explorer.

| Resource | Monthly cost | Notes |
| --- | --- | --- |
| Application Load Balancer | $22/mo | $0.0225/hr base + $0.008/LCU-hr |
| NAT Gateway (2 AZs) | ~$66/mo | $0.045/hr × 2 + $0.045/GB data |
| CloudWatch log basics | $3–15/mo | Depends on log volume + retention |
| SSM, ECR, other | $1–5/mo | Small but additive at scale |
| Total fixed overhead | $85–100/mo | Before first task runs |

At 10 environments, that's $850–1,000/mo invisible spend. At 50 environments, it's $4,250–5,000/mo before a single task runs.

KEY INSIGHT: NAT Gateway is the single most expensive fixed line item in any ECS environment — and the easiest to eliminate for non-prod. Teams that care about NAT cost switch non-prod environments to public subnet placement with strict security group rules and Network ACLs instead of private subnets with a NAT. This is meaningfully cheaper but does reduce your network boundary — regulated environments (PCI, HIPAA) and prod should keep the NAT. Evaluate your compliance posture before cutting this corner.

One more lever: VPC Endpoints. If your containers only need to reach AWS services (S3, ECR, CloudWatch, SSM), an Interface Endpoint costs ~$7.20/mo per endpoint per AZ, roughly 1/5th of one NAT Gateway. ECR API calls and CloudWatch pushes need Interface Endpoints; Gateway Endpoints (S3, DynamoDB) are free, and because ECR serves image layers from S3, the S3 Gateway Endpoint carries most image-pull traffic at no charge. Combined with the public-subnet approach above, this is the cheapest path to eliminating NAT entirely for non-prod. Strategy: use VPC Endpoints for AWS dependencies and public subnets for outbound internet, and you drop NAT from non-prod without sacrificing functionality.
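The endpoint setup described above could look roughly like the following. It is a sketch under stated assumptions: every variable name is hypothetical, and the list of interface services (ECR API, ECR Docker registry, CloudWatch Logs, SSM) is one plausible set for a typical Fargate service, not a definitive list.

```hcl
# Hypothetical inputs; none of these names come from an existing module.
variable "region" {
  type    = string
  default = "us-east-1"
}

variable "vpc_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

variable "route_table_ids" {
  type = list(string)
}

variable "endpoint_security_group_id" {
  type        = string
  description = "Must allow inbound HTTPS (443) from the task security groups"
}

# Gateway endpoint for S3: no hourly charge. ECR image layers are served from S3.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.route_table_ids
}

# Interface endpoints: billed per endpoint per AZ.
resource "aws_vpc_endpoint" "interface" {
  for_each = toset(["ecr.api", "ecr.dkr", "logs", "ssm"])

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.subnet_ids
  security_group_ids  = [var.endpoint_security_group_id]
  private_dns_enabled = true
}
```

Because interface endpoints are billed per AZ, four of them across two AZs comes to roughly 8 × $7.20 ≈ $58/mo, which is close to the NAT Gateway line they replace. Keep only the endpoints a non-prod environment actually needs, or place them in a single AZ, for the saving to be meaningful.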

We broke down the full per-environment cost — including ALB, NAT Gateway, CloudWatch, and data transfer — in our guide to how much an ECS environment actually costs.

Schedule dev/staging before the bill bleeds

Your environments run 168 hours a week. Your team works 40–55. Scheduling alone cuts compute cost by 60–70%. For most teams it's the single largest ECS cost lever available, and it requires zero code changes. The spread: 70% savings on a strict 40-hour Mon–Fri schedule, 60–65% on a 55-hour week. The exact number depends on your team's working hours, but either way it's the fastest path to a lower AWS bill.

The problem: AWS-native scheduling operates at the service level. To schedule one environment with 8 services, you need 16 Auto Scaling actions (stop + start per service). At 10 environments that's 160 actions to create, maintain, and update when schedules change.

| Environments | Services each | Auto Scaling actions | Schedule change cost |
| --- | --- | --- | --- |
| 3 | 8 | 48 | 8 updates |
| 10 | 8 | 160 | 8–16 updates |
| 20 | 10 | 400 | 10–20 updates |

Figure: Monthly AWS Fargate cost for 12 environments. Running 24/7: $1,730/mo. Business-hours schedule (55 hrs/week): $515/mo (−70%).
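To make the stop + start pair per service concrete, here is a sketch of the AWS-native approach for one environment using Application Auto Scaling scheduled actions. The cluster name, service map, time zone, and 08:00–19:00 window (55 hours a week) are illustrative assumptions; the services are assumed to exist already under those names.

```hcl
# Hypothetical inputs for a single environment.
variable "cluster_name" {
  type    = string
  default = "use1-dev-main"
}

variable "services" {
  description = "Service name => task count during working hours"
  type        = map(number)
  default = {
    api    = 2
    worker = 1
  }
}

resource "aws_appautoscaling_target" "svc" {
  for_each = var.services

  service_namespace  = "ecs"
  resource_id        = "service/${var.cluster_name}/${each.key}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 0
  max_capacity       = each.value

  # Scheduled actions rewrite min/max at runtime; ignore that drift.
  lifecycle {
    ignore_changes = [min_capacity, max_capacity]
  }
}

# Stop action: scale each service to zero at 19:00 on weekdays.
resource "aws_appautoscaling_scheduled_action" "stop" {
  for_each = var.services

  name               = "${var.cluster_name}-${each.key}-stop"
  service_namespace  = aws_appautoscaling_target.svc[each.key].service_namespace
  resource_id        = aws_appautoscaling_target.svc[each.key].resource_id
  scalable_dimension = aws_appautoscaling_target.svc[each.key].scalable_dimension
  schedule           = "cron(0 19 ? * MON-FRI *)"
  timezone           = "America/New_York"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# Start action: restore the working-hours task count at 08:00 on weekdays.
resource "aws_appautoscaling_scheduled_action" "start" {
  for_each = var.services

  name               = "${var.cluster_name}-${each.key}-start"
  service_namespace  = aws_appautoscaling_target.svc[each.key].service_namespace
  resource_id        = aws_appautoscaling_target.svc[each.key].resource_id
  scalable_dimension = aws_appautoscaling_target.svc[each.key].scalable_dimension
  schedule           = "cron(0 8 ? * MON-FRI *)"
  timezone           = "America/New_York"

  scalable_target_action {
    min_capacity = each.value
    max_capacity = each.value
  }

  # Create the two actions one after the other to avoid concurrent updates
  # against the same scalable target.
  depends_on = [aws_appautoscaling_scheduled_action.stop]
}
```

Using `for_each` keeps the Terraform short, but AWS still holds two scheduled actions per service, so the action counts in the table above are unchanged. The ECS service resource itself would also need to ignore changes to `desired_count` so that Terraform does not undo the schedule.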

What teams actually do

Teams start with EventBridge + Lambda at 3–5 environments and it works beautifully. By 10 environments they're maintaining a scheduling codebase with a full test suite. By 15–20 environments, the maintenance burden outweighs the savings — and environments quietly drift back to 24/7. The economics of scheduling are sound; the tooling to maintain it at scale is the bottleneck.

For a deep dive on the AWS-native approach and a comparison with environment-level scheduling, read the complete guide to ECS environment scheduling.

Isolate Terraform state before it isolates you

A single Terraform state file containing all environments starts fast. At 25–50 MB, plans take 30+ minutes. At the HCP Terraform hard limit of ~100 MB (from base64 encoding), Terraform stops working entirely.

The blast radius is worse than the speed problem: one module bug in a shared state file can take down every environment in a single apply. A typo in a variable that propagates to 10 environments creates 10 simultaneous incidents.

The fix is per-environment state, applied independently. One folder per environment, each with its own S3 backend. No shared state files, no workspaces, no extra tooling — just directories you can see and reason about:

Folder-per-environment pattern

# Directory structure — one folder per environment, independent state
# terraform/environments/
#   prod/
#     backend.tf        → prod's own S3 backend (separate state file)
#     main.tf           → calls the shared module
#     terraform.tfvars
#   staging/
#     backend.tf        → staging's own S3 backend
#     main.tf
#     terraform.tfvars
#   dev-01/
#     ...

# environments/prod/backend.tf — each environment has its own state
terraform {
  backend "s3" {
    bucket         = "tfstate-org"
    key            = "envs/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

# environments/prod/main.tf — thin, calls the shared module
module "environment" {
  source = "../../modules/ecs-environment"

  env_name   = "prod"
  account_id = "111111111111"
  # Plans run independently, blast radius is one environment
}

Each environments/<name>/ folder is self-contained: its own backend, its own tfvars, its own plan/apply lifecycle. You can see the entire fleet structure by looking at the directory tree — no jumping between files to trace configuration inheritance. Adding an environment means copying one folder and changing three lines. This is the pattern teams converge on after workspaces stop scaling, and it works with vanilla Terraform — no extra tooling required.

How to know when to split — check your state file size:

terraform state pull | wc -c

Under 5 MB — fine. 10–25 MB — start planning the migration. Over 25 MB — plans take 30+ minutes and locking contention becomes noticeable. The 3-minute plan threshold is also a strong signal: if a plan against one environment takes longer than 3 minutes, your state file is too large regardless of its byte count.

Practical guidance: teams managing 10+ environments should move to per-environment state before hitting 25 MB, not after. The migration is mechanical — extract each environment into its own directory, run one init per directory, and verify with a plan. It takes an afternoon and prevents a week of incidents. For the full implementation guide, see managing ECS Fargate with Terraform: what works and what doesn't.
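One possible way to do the extraction without manual state surgery is Terraform's declarative `import` block (Terraform 1.5+) in the new root and `removed` block (Terraform 1.7+) in the old one. The sketch below assumes a hypothetical old module call named `module.prod` and a resource `aws_ecs_cluster.main` inside the shared module; adjust both addresses to your own layout. The two fragments live in different root directories.

```hcl
# --- New per-environment root, e.g. environments/prod/imports.tf ---
# Adopts the existing cluster into this environment's own state.
# Add one import block per resource the module manages.
import {
  to = module.environment.aws_ecs_cluster.main
  id = "use1-prod-main" # ECS clusters are imported by cluster name
}

# --- Old shared root, after the new root's plan shows no changes ---
# Drops the environment from the shared state without destroying anything.
removed {
  from = module.prod # hypothetical name of the old module call

  lifecycle {
    destroy = false
  }
}
```

Between the two steps both states track the same resources, so avoid applying the old root until the `removed` block is in place.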

Set CloudWatch retention on day one

The default CloudWatch log group setting is “never expire.” Teams routinely forget to change this. At $0.50/GB ingested, a fleet of 50 containers each writing 5 GB/day generates $3,750/mo in ingestion costs alone, before storage and before metrics.

CloudWatch Logs at scale

50 containers × 5 GB/day: 7,500 GB/mo × $0.50/GB = $3,750/mo

Double the fleet to 100 containers: 15 TB/mo = $7,500/mo. We've seen this.

Add Container Insights: $0.21/hr per cluster

The fix: set retention_in_days in Terraform. 30 days for dev/staging, 90 for prod. Never “never expire.”

resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/${local.env_prefix}-${var.service_name}"
  retention_in_days = var.env_type == "prod" ? 90 : 30

  # Optional: switch non-prod to Infrequent Access (50% cheaper ingestion)
  # for logs read less than once a week
  log_group_class = var.env_type == "prod" ? "STANDARD" : "INFREQUENT_ACCESS"
}

Also: SSM parameters creep unnoticed. Standard-tier parameters are free, but advanced-tier parameters cost $0.05/parameter/month. At 10 environments × 8 services × 5 advanced parameters each = 400 parameters = $20/mo. Small, but nobody accounts for it.

KEY INSIGHT: We've seen teams discover a $7,500/mo CloudWatch bill six months after launching their 15th environment. The Terraform was deployed with default retention, and nobody looked at the CloudWatch line in Cost Explorer until the CFO asked. Set retention in your module defaults. It costs nothing to set and thousands to miss.

CloudWatch is one piece of the ECS cost puzzle. For the full picture — Fargate compute, data transfer, load balancing, and the 65% savings playbook — see how to cut AWS ECS Fargate costs by 65%.

Use Fargate Spot where it belongs

Fargate Spot offers a 68% discount over on-demand: $0.01291/vCPU-hr vs $0.04048. The trade-off is a 2-minute interruption notice when AWS reclaims capacity, per the AWS Fargate pricing page (verified May 2026).

“Fargate Spot runs tasks on spare AWS EC2 capacity at up to a 70% discount compared to Fargate On-Demand. If AWS needs the capacity back, your running tasks will be given a two-minute warning and then stopped.”

— AWS Fargate Pricing, verified May 2026

Real interruption rates: large instance families see under 5% interruption; common instance types see 5–15%.

Best practice: use a capacity provider strategy with a 70/30 or 80/20 Spot/On-Demand split. Spot for CI/CD runners, staging, automated tests, and non-interactive batch jobs. On-Demand for production, customer-facing staging, and demo environments.

To enable: create a capacity provider strategy that includes both FARGATE_SPOT and FARGATE. The base value is the minimum number of tasks placed on a given capacity provider (set it on FARGATE to guarantee an On-Demand floor, and on only one provider per strategy); the weights determine how tasks beyond the base are split between providers.

Capacity provider strategy with weighted split

# Define capacity providers for the ECS cluster
resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name = aws_ecs_cluster.main.name

  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 1
    base              = 0  # No guaranteed minimum: non-prod runs entirely on Spot
  }

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    # Weight 0: no tasks are placed here by default. ECS does not fall back
    # to On-Demand automatically when Spot capacity is unavailable.
    weight            = 0
  }
}

# Per-service: adjust weights based on workload criticality
# Prod services use base=2 + more FARGATE weight
# Non-prod services use base=0 + FARGATE_SPOT only
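For a service that needs an On-Demand floor plus the 70/30 split mentioned earlier, a per-service strategy might look like this. It is a sketch only: the service name, variables, and task count are hypothetical, and the cluster is assumed to have both capacity providers associated already.

```hcl
# Hypothetical inputs.
variable "cluster_arn" {
  type = string
}

variable "task_definition_arn" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

variable "service_security_group_id" {
  type = string
}

resource "aws_ecs_service" "api" {
  name            = "use1-stg-main-api"
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 11

  # One task always on On-Demand, then 3 of every 10 further tasks.
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 1
    weight            = 3
  }

  # The remaining 7 of every 10 tasks go to Spot.
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 7
  }

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = [var.service_security_group_id]
  }
}
```

With 11 desired tasks, the base takes one On-Demand task and the remaining ten are divided roughly 3 to 7, so about four tasks run On-Demand and seven on Spot. A service-level strategy overrides the cluster default, and `launch_type` must be left unset when a strategy is specified.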

One operational note: Fargate Spot sends SIGTERM along with the 2-minute warning, and SIGKILL follows once the container's stopTimeout expires (30 seconds by default, 120 seconds maximum on Fargate). Set stopTimeout to 120 in the container definition to use the full window, and make sure your containers shut down gracefully within it: drain connections, flush buffers, checkpoint state. If your app takes 3+ minutes to shut down, Spot tasks will be force-killed mid-flight. For CI/CD runners and stateless workers this is fine; for anything with in-flight state, On-Demand is the safer choice. For more on Spot savings strategy, see how to cut ECS Fargate costs by 65%.

Split your Fargate quota before dev takes down prod

Fargate vCPU quota is per-region, per-account. If dev and prod share an account, they share the same quota pool. A developer running load tests against a dev environment can exhaust the regional Fargate quota — and production can't scale up during a traffic spike.

AWS has no native mechanism to reserve quota for production. The default Fargate On-Demand vCPU quota is 6 vCPUs per region (soft limit, increasable to 10,000+ via support ticket). Dev and prod compete for the same pool.

KEY INSIGHT: Fargate quota sharing is invisible until it bites you. You won't know it happened until prod fails to scale during an incident. At that point, the fix takes hours — filing a support ticket and waiting for the quota increase to propagate. Account-level separation (prod in one account, non-prod in another) eliminates this class of incident.

The fix: separate accounts for prod vs non-prod. If that's not immediately feasible, monitor quota utilization proactively. Go to Service Quotas → AWS Fargate → Running On-Demand Fargate vCPUs in the AWS Console. Set a CloudWatch alarm at 70% utilization so you have time to react before hitting the limit. Quota increase requests can take 24–72 hours — at 70% you have days of runway; at 95% you have hours.
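The 70% alarm can be expressed in Terraform with CloudWatch metric math, dividing current usage by the applied quota via the `SERVICE_QUOTA()` function. Treat this as a sketch: the variable names are hypothetical, and the `AWS/Usage` dimension values below are an assumption about how Fargate On-Demand vCPU usage is reported. Confirm them under CloudWatch Metrics, Usage namespace, in your own account before relying on the alarm.

```hcl
# Hypothetical inputs.
variable "alarm_name_prefix" {
  type    = string
  default = "use1-nonprod"
}

variable "alarm_topic_arn" {
  type        = string
  description = "SNS topic that receives the alarm"
}

resource "aws_cloudwatch_metric_alarm" "fargate_vcpu_quota" {
  alarm_name          = "${var.alarm_name_prefix}-fargate-vcpu-quota-70pct"
  alarm_description   = "Fargate On-Demand vCPU usage is at or above 70% of the regional quota"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = 1
  threshold           = 70
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.alarm_topic_arn]

  metric_query {
    id          = "usage"
    return_data = false

    metric {
      namespace   = "AWS/Usage"
      metric_name = "ResourceCount"
      period      = 300
      stat        = "Maximum"

      # ASSUMPTION: verify these dimension values in the CloudWatch console.
      dimensions = {
        Service  = "Fargate"
        Type     = "Resource"
        Resource = "vCPU"
        Class    = "Standard/OnDemand"
      }
    }
  }

  metric_query {
    id          = "pct"
    expression  = "(usage / SERVICE_QUOTA(usage)) * 100"
    label       = "Fargate On-Demand vCPU quota used (%)"
    return_data = true
  }
}
```

Because the quota is per region and per account, one alarm per account and region is enough; it does not need to be repeated per environment.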

Two more constraints that hit at fleet scale: (1) Fargate launch rate — 20 tasks/second sustained in older regions, 5/second in newer ones. If your scheduler tries to start 100 tasks across 10 environments simultaneously, you'll hit the throttle. Add jitter to scheduled starts. (2) ECS API throttle — 10 burst requests/second, 1 sustained. Scripts that poll DescribeServices across 50 services will get rate-limited. Add exponential backoff and batch calls.

The ECS multi-environment strategy guide covers account structure patterns in detail, including when to split further and how to set up cross-account IAM for Fortem-style tooling.

