ECS Fargate Best Practices: Running a Fleet of 10+ Environments Without the Pain
Matt · 2026-06-04 · via DEV Community


Originally published at https://fortem.dev/blog/ecs-fargate-best-practices
Seven ECS Fargate best practices for teams running 10+ environments. Fix hidden costs, Terraform state sprawl, Fargate quota sharing, and scheduling before they break your fleet.



Most ECS Fargate best practices guides tell you what to do. This one tells you what breaks between environment 5 and environment 20 — and gives you the exact fix for each. The numbers come from AWS published pricing, service quotas, and patterns we've seen managing fleets at scale. If you're running fewer than 5 environments, most of this won't matter yet. Bookmark it.

TL;DR

  • Name everything consistently from day one; retrofitting naming across 10+ environments takes weeks.
  • Fixed overhead is $85–100/mo per environment before a single container runs — at 50 envs that's $4,250–5,000/mo invisible spend.
  • Schedule dev/staging off-hours first. It cuts compute cost 60–70% and requires zero infrastructure changes.
  • Set CloudWatch log retention before ingestion hits 15 TB/mo and you get a $7,500 bill.
  • Isolate Terraform state per environment before the 25 MB threshold makes plans take 30+ minutes.

Start with naming and account structure

At 3 environments you can get away with ad-hoc names. At 10 you can't — because every AWS resource name is simultaneously a billing dimension, an IAM scope, and a CloudWatch filter. Inconsistent names mean you can't attribute cost, can't write scoped IAM policies, and can't build dashboards without a lookup table.

The convention that scales: {region_short}-{account}-{envname}. Applied to every resource from day one. One Terraform local generates every downstream resource name — ECS cluster, task definition, SSM parameter path, IAM role, CloudWatch log group — all from one source.

Ready to use — copy this today

locals {
  env_prefix = "${var.region_short}-${var.account}-${var.envname}"
}

resource "aws_ecs_cluster" "main" {
  name = local.env_prefix  # → "use1-prod-main"
}

resource "aws_ecs_task_definition" "api" {
  family = "${local.env_prefix}-api-td"
  # → "use1-prod-main-api-td"
}

resource "aws_ssm_parameter" "db_host" {
  name = "/${local.env_prefix}/api/DB_HOST"
  # → "/use1-prod-main/api/DB_HOST"
}

resource "aws_iam_role" "task_role" {
  name = "${local.env_prefix}-api-task-role"
  # → "use1-prod-main-api-task-role"
}

resource "aws_cloudwatch_log_group" "api" {
  name = "/ecs/${local.env_prefix}-api"
  retention_in_days = var.log_retention_days
}


Map naming to account structure. The most common pattern that works at 10+ environments: one AWS account for production, one for all non-prod. This separates Fargate vCPU quota pools, hardens IAM boundaries, and makes Cost Explorer attribution clean.

One constraint your naming convention must handle: ALB target group names are capped at 32 characters, and each ALB has a hard limit of 100 target groups. At 20 environments with 6 services each, you're at 120 target groups, past the limit. This forces per-environment ALBs sooner than you think, which increases your fixed overhead. A short naming prefix (use1-prod-api, 13 chars) leaves room for the target group suffix.
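Both limits can be checked before anything is applied. Here is a minimal sketch, assuming a hypothetical `<prefix>-<service>-tg` layout for target group names (the `-tg` suffix and the function names are illustrative, not from the original):

```python
# Limits quoted in the text above.
TG_NAME_MAX = 32      # max characters in an ALB target group name
TG_PER_ALB_MAX = 100  # target groups per ALB


def target_group_name(prefix: str, service: str, suffix: str = "tg") -> str:
    """Build a target group name and fail fast if it exceeds 32 characters.

    The "<prefix>-<service>-tg" layout is a hypothetical convention.
    """
    name = f"{prefix}-{service}-{suffix}"
    if len(name) > TG_NAME_MAX:
        raise ValueError(f"{name!r} is {len(name)} chars; limit is {TG_NAME_MAX}")
    return name


def albs_needed(environments: int, services_per_env: int) -> int:
    """Minimum number of ALBs if every service gets one target group."""
    total = environments * services_per_env
    return -(-total // TG_PER_ALB_MAX)  # ceiling division


print(target_group_name("use1-prod-main", "api"))  # use1-prod-main-api-tg
print(albs_needed(20, 6))                          # 2
```

Running a check like this in CI catches an over-long service name at review time rather than at apply time.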

For the full naming pattern table, including the 32-character target group constraint and per-resource examples, see the dedicated section on consistent naming conventions for ECS environments.

Know your fixed overhead per environment

When engineers estimate ECS costs, they calculate compute: vCPU hours, memory hours, maybe RDS. What they miss is the fixed overhead that exists before a single container runs.

Every environment needs its own ALB and NAT Gateway. These costs are flat — they don't scale with usage, they don't go away when you stop tasks at night, and they don't appear on the compute line in Cost Explorer.

| Resource | Monthly cost | Notes |
|---|---|---|
| Application Load Balancer | $22/mo | $0.0225/hr base + $0.008/LCU-hr |
| NAT Gateway (2 AZs) | ~$66/mo | $0.045/hr × 2 + $0.045/GB data |
| CloudWatch log basics | $3–15/mo | Depends on log volume + retention |
| SSM, ECR, other | $1–5/mo | Small but additive at scale |
| Total fixed overhead | $85–100/mo | Before first task runs |

At 10 environments, that's $850–1,000/mo invisible spend. At 50 environments, it's $4,250–5,000/mo before a single task runs.
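To see where these figures come from, here is a rough sketch that converts the hourly rates above into monthly costs and scales the per-environment range to a fleet. It assumes a 730-hour month; the helper names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month

# Hourly rates quoted in the cost breakdown above (USD).
ALB_BASE_PER_HOUR = 0.0225
NAT_PER_HOUR = 0.045


def monthly(hourly_rate: float, count: int = 1) -> float:
    """Convert an hourly rate into a flat monthly cost."""
    return hourly_rate * count * HOURS_PER_MONTH


def fleet_overhead(environments: int, low: int = 85, high: int = 100) -> tuple[int, int]:
    """Fleet-wide fixed overhead range, using the $85-100 per-environment figure."""
    return environments * low, environments * high


print(round(monthly(NAT_PER_HOUR, count=2), 1))  # 65.7 (two NAT Gateways, before data charges)
print(round(monthly(ALB_BASE_PER_HOUR), 1))      # 16.4 (ALB base only, before LCU charges)
print(fleet_overhead(10))                        # (850, 1000)
print(fleet_overhead(50))                        # (4250, 5000)
```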

KEY INSIGHT: NAT Gateway is the single most expensive fixed line item in any ECS environment — and the easiest to eliminate for non-prod. Teams that care about NAT cost switch non-prod environments to public subnet placement with strict security group rules and Network ACLs instead of private subnets with a NAT. This is meaningfully cheaper but does reduce your network boundary — regulated environments (PCI, HIPAA) and prod should keep the NAT. Evaluate your compliance posture before cutting this corner.

One more lever: VPC Endpoints. If your containers only need to reach AWS services (S3, ECR, CloudWatch, SSM), an Interface Endpoint costs ~$7.20/mo per endpoint, roughly 1/5th of one NAT Gateway, and Gateway Endpoints (S3, DynamoDB) are free. ECR pulls and CloudWatch pushes go through Interface Endpoints; ECR image layers are served from S3, so the free S3 Gateway Endpoint covers that part of the pull. Combined with the public-subnet approach above, this is the cheapest path to eliminating NAT entirely for non-prod. Strategy: use VPC Endpoints for AWS dependencies and public subnets for outbound internet, and you drop NAT from non-prod without sacrificing functionality.
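A quick way to compare the two options is to price out the endpoints you would actually need. This sketch uses the ~$7.20 and ~$66 figures from this section; the endpoint list is a hypothetical example, and it treats the per-endpoint price as flat, which is a simplification.

```python
NAT_MONTHLY = 66.0                 # ~2-AZ NAT Gateway figure from this section
INTERFACE_ENDPOINT_MONTHLY = 7.20  # per-endpoint figure quoted above

# Hypothetical set of paid endpoints for a typical ECS service.
# S3 is omitted because its Gateway Endpoint is free.
interface_endpoints = ["ecr.api", "ecr.dkr", "logs", "ssm"]


def endpoint_cost(endpoints: list[str]) -> float:
    """Monthly cost of the listed paid endpoints."""
    return len(endpoints) * INTERFACE_ENDPOINT_MONTHLY


def break_even_endpoints() -> int:
    """Largest number of paid endpoints that still costs less than NAT."""
    return int(NAT_MONTHLY // INTERFACE_ENDPOINT_MONTHLY)


print(f"{endpoint_cost(interface_endpoints):.2f}")  # 28.80
print(break_even_endpoints())                       # 9
```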

We broke down the full per-environment cost — including ALB, NAT Gateway, CloudWatch, and data transfer — in our guide to how much an ECS environment actually costs.

Schedule dev/staging before the bill bleeds

Your environments run 168 hours a week. Your team works 40–55. Scheduling alone cuts compute cost by 60–70%. For most teams it's the single largest ECS cost lever available, and it requires zero code changes. The spread: 70% savings on a strict 40-hour Mon–Fri schedule, 60–65% on a 55-hour week. The exact number depends on your team's working hours, but either way it's the fastest path to a lower AWS bill.
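As a sanity check on those percentages, here is a minimal sketch of the ceiling on compute savings from hours alone. It assumes tasks cost nothing while stopped and ignores fixed costs; `compute_savings` is an illustrative name, not part of any AWS tool.

```python
HOURS_PER_WEEK = 168


def compute_savings(hours_on_per_week: float) -> float:
    """Fraction of compute cost avoided by running only hours_on_per_week.

    Upper bound: ignores startup time, tasks left running, and fixed costs
    such as ALB and NAT Gateway that do not stop with the tasks.
    """
    if not 0 <= hours_on_per_week <= HOURS_PER_WEEK:
        raise ValueError("hours_on_per_week must be between 0 and 168")
    return 1 - hours_on_per_week / HOURS_PER_WEEK


for hours in (40, 55):
    print(f"{hours} h/week on -> {compute_savings(hours):.0%} compute saved at most")
```

A 40-hour week gives a ceiling of about 76% and a 55-hour week about 67%. The figures quoted above sit below that ceiling, which is plausible once startup time and schedule exceptions are counted.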

The problem: AWS-native scheduling operates at the service level. To schedule one environment with 8 services, you need 16 Auto Scaling actions (stop + start per service). At 10 environments that's 160 actions to create, maintain, and update when schedules change.

| Environments | Services each | Auto Scaling actions | Schedule change cost |
|---|---|---|---|
| 3 | 8 | 48 | 8 updates |
| 10 | 8 | 160 | 8–16 updates |
| 20 | 10 | 400 | 10–20 updates |
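The action counts follow directly from "one stop and one start per service per environment". A small sketch that enumerates them, with hypothetical environment and service names:

```python
from itertools import product


def scheduled_actions(environments: list[str], services: list[str]) -> list[tuple[str, str, str]]:
    """Enumerate one stop and one start action per service per environment."""
    return list(product(environments, services, ("stop", "start")))


envs = [f"dev-{i:02d}" for i in range(1, 11)]  # 10 hypothetical environments
svcs = [f"svc-{i}" for i in range(1, 9)]       # 8 hypothetical services each

actions = scheduled_actions(envs, svcs)
print(len(actions))  # 160
print(actions[0])    # ('dev-01', 'svc-1', 'stop')
```

Every tuple in that list is something to create, keep in sync, and update when the schedule changes.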

Figure: Monthly AWS Fargate cost for 12 environments: $1,730/mo running 24/7 vs $515/mo on a business-hours schedule (55 hrs/week), a 70% saving.

What teams actually do

Teams start with EventBridge + Lambda at 3–5 environments and it works beautifully. By 10 environments they're maintaining a scheduling codebase with a full test suite. By 15–20 environments, the maintenance burden outweighs the savings — and environments quietly drift back to 24/7. The economics of scheduling are sound; the tooling to maintain it at scale is the bottleneck.

For a deep dive on the AWS-native approach and a comparison with environment-level scheduling, read the complete guide to ECS environment scheduling.

Isolate Terraform state before it isolates you

A single Terraform state file containing all environments starts fast. At 25–50 MB, plans take 30+ minutes. At the HCP Terraform hard limit of ~100 MB (from base64 encoding), Terraform stops working entirely.

The blast radius is worse than the speed problem: one module bug in a shared state file can take down every environment in a single apply. A typo in a variable that propagates to 10 environments creates 10 simultaneous incidents.

The fix is per-environment state, applied independently. One folder per environment, each with its own S3 backend. No shared state files, no workspaces, no extra tooling — just directories you can see and reason about:

Folder-per-environment pattern

# Directory structure — one folder per environment, independent state
# terraform/environments/
#   prod/
#     backend.tf        → prod's own S3 backend (separate state file)
#     main.tf           → calls the shared module
#     terraform.tfvars
#   staging/
#     backend.tf        → staging's own S3 backend
#     main.tf
#     terraform.tfvars
#   dev-01/
#     ...

# environments/prod/backend.tf — each environment has its own state
terraform {
  backend "s3" {
    bucket         = "tfstate-org"
    key            = "envs/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

# environments/prod/main.tf — thin, calls the shared module
module "environment" {
  source = "../../modules/ecs-environment"

  env_name   = "prod"
  account_id = "111111111111"
  # Plans run independently, blast radius is one environment
}


Each environments/<name>/ folder is self-contained: its own backend, its own tfvars, its own plan/apply lifecycle. You can see the entire fleet structure by looking at the directory tree — no jumping between files to trace configuration inheritance. Adding an environment means copying one folder and changing three lines. This is the pattern teams converge on after workspaces stop scaling, and it works with vanilla Terraform — no extra tooling required.

How to know when to split — check your state file size:

terraform state pull | wc -c


Under 5 MB — fine. 10–25 MB — start planning the migration. Over 25 MB — plans take 30+ minutes and locking contention becomes noticeable. The 3-minute plan threshold is also a strong signal: if a plan against one environment takes longer than 3 minutes, your state file is too large regardless of its byte count.
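Those thresholds are easy to turn into a CI check. A minimal sketch, assuming decimal megabytes and adding a hypothetical "watch" band for the 5–10 MB range the guidance above leaves open:

```python
MB = 1_000_000  # assumption: decimal megabytes


def state_size_advice(size_bytes: int) -> str:
    """Map the byte count from `terraform state pull | wc -c` to the guidance above."""
    size_mb = size_bytes / MB
    if size_mb < 5:
        return "fine"
    if size_mb < 10:
        return "watch"  # hypothetical band: not covered by the thresholds above
    if size_mb <= 25:
        return "plan migration"
    return "split now"


print(state_size_advice(3_200_000))   # fine
print(state_size_advice(18_000_000))  # plan migration
print(state_size_advice(40_000_000))  # split now
```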

Practical guidance: teams managing 10+ environments should move to per-environment state before hitting 25 MB, not after. The migration is mechanical — extract each environment into its own directory, run one init per directory, and verify with a plan. It takes an afternoon and prevents a week of incidents. For the full implementation guide, see managing ECS Fargate with Terraform: what works and what doesn't.

Set CloudWatch retention on day one

The default CloudWatch log group setting is “never expire.” Teams routinely forget to change this. At $0.50/GB ingested, a fleet of 50 containers each writing 5 GB/day generates $3,750/mo in ingestion costs alone, before storage and before metrics.

CloudWatch Logs at scale

50 containers × 5 GB/day: 7,500 GB/mo × $0.50/GB = $3,750/mo

Double the fleet to 100 containers: 15 TB/mo = $7,500/mo. We've seen this.

Add Container Insights: $0.21/hr per cluster
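The arithmetic behind those numbers, as a small sketch you can rerun with your own volumes. It assumes a 30-day month and counts ingestion only; the function name is illustrative.

```python
INGEST_PRICE_PER_GB = 0.50  # USD, rate quoted above
DAYS_PER_MONTH = 30         # assumption used for the monthly estimate


def monthly_ingestion_cost(containers: int, gb_per_container_per_day: float) -> float:
    """Estimated monthly CloudWatch Logs ingestion cost, excluding storage."""
    gb_per_month = containers * gb_per_container_per_day * DAYS_PER_MONTH
    return gb_per_month * INGEST_PRICE_PER_GB


print(monthly_ingestion_cost(50, 5))   # 3750.0
print(monthly_ingestion_cost(100, 5))  # 7500.0
```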

The fix: set retention_in_days in Terraform. 30 days for dev/staging, 90 for prod. Never “never expire.”

resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/${local.env_prefix}-${var.service_name}"
  retention_in_days = var.env_type == "prod" ? 90 : 30

  # Optional: switch non-prod to Infrequent Access, which halves the
  # ingestion price for logs read less than once a week. The class cannot
  # be changed on an existing log group.
  log_group_class = var.env_type == "prod" ? "STANDARD" : "INFREQUENT_ACCESS"
}


Also: advanced-tier SSM parameters at $0.05/parameter/month creep unnoticed (standard-tier parameters are free). At 10 environments × 8 services × 5 parameters each = 400 parameters = $20/mo. Small, but nobody accounts for it.

KEY INSIGHT: We've seen teams discover a $7,500/mo CloudWatch bill six months after launching their 15th environment. The Terraform was deployed with default retention, and nobody looked at the CloudWatch line in Cost Explorer until the CFO asked. Set retention in your module defaults. It costs nothing to set and thousands to miss.

CloudWatch is one piece of the ECS cost puzzle. For the full picture — Fargate compute, data transfer, load balancing, and the 65% savings playbook — see how to cut AWS ECS Fargate costs by 65%.

Use Fargate Spot where it belongs

Fargate Spot offers a 68% discount over on-demand: $0.01291/vCPU-hr vs $0.04048. The trade-off is a 2-minute interruption notice when AWS reclaims capacity, per the AWS Fargate pricing page (verified May 2026).

“Fargate Spot runs tasks on spare AWS EC2 capacity at up to a 70% discount compared to Fargate On-Demand. If AWS needs the capacity back, your running tasks will be given a two-minute warning and then stopped.”

AWS Fargate Pricing, verified May 2026

Real interruption rates: large instance families see under 5% interruption; common instance types see 5–15%.

Best practice: use a capacity provider strategy with a 70/30 or 80/20 Spot/On-Demand split. Spot for CI/CD runners, staging, automated tests, and non-interactive batch jobs. On-Demand for production, customer-facing staging, and demo environments.
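To see what a split is worth, here is a rough sketch of the blended vCPU-hour price at different Spot shares. It uses the two vCPU rates quoted above and ignores memory pricing; the function names are illustrative.

```python
ON_DEMAND_VCPU_HR = 0.04048  # USD per vCPU-hour, quoted above
SPOT_VCPU_HR = 0.01291       # USD per vCPU-hour, quoted above


def blended_vcpu_price(spot_share: float) -> float:
    """Average vCPU-hour price when spot_share of tasks run on Fargate Spot."""
    if not 0.0 <= spot_share <= 1.0:
        raise ValueError("spot_share must be between 0 and 1")
    return spot_share * SPOT_VCPU_HR + (1 - spot_share) * ON_DEMAND_VCPU_HR


def savings_vs_on_demand(spot_share: float) -> float:
    """Fractional saving of the blended price relative to pure On-Demand."""
    return 1 - blended_vcpu_price(spot_share) / ON_DEMAND_VCPU_HR


for share in (0.7, 0.8, 1.0):
    print(f"{share:.0%} Spot -> {savings_vs_on_demand(share):.1%} cheaper per vCPU-hour")
```

A 70/30 split lands at roughly 48% off the On-Demand vCPU rate and an 80/20 split at roughly 54%, so most of the 68% discount survives even with an On-Demand cushion.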

To enable: create a capacity provider strategy that includes both FARGATE_SPOT and FARGATE, with a base and a weight per provider. The base is the minimum number of tasks placed on that provider (set it on FARGATE to guarantee an On-Demand floor); the weights determine how tasks beyond the base are split proportionally.

Capacity provider strategy with weighted split

# Define capacity providers for the ECS cluster
resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name = aws_ecs_cluster.main.name

  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 1
    base              = 0  # no minimum task count pinned to this provider
  }

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 0  # weight 0: no tasks land on On-Demand by default;
                           # ECS does not fall back here when Spot is unavailable
  }
}

# Per-service: adjust weights based on workload criticality
# Prod services use base=2 + more FARGATE weight
# Non-prod services use base=0 + FARGATE_SPOT only


One operational note: on a Fargate Spot interruption the task gets a 2-minute warning, and its containers receive SIGTERM followed by SIGKILL once the container's stopTimeout expires (30 seconds by default, 120 seconds maximum on Fargate). Your containers must handle graceful shutdown within this window: drain connections, flush buffers, checkpoint state. If your app needs longer than that to shut down, Spot tasks will be force-killed mid-flight. For CI/CD runners and stateless workers this is fine; for anything with in-flight state, On-Demand is the safer choice. For more on Spot savings strategy, see how to cut ECS Fargate costs by 65%.
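What graceful shutdown looks like depends on the app, but the shape is usually the same: catch SIGTERM, stop taking new work, finish what is in flight. A minimal sketch for a polling worker, where `fetch_job` and `process_job` are hypothetical callables supplied by your application:

```python
import signal
import threading

# Set when the orchestrator asks the task to stop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the stop request; the main loop does the actual cleanup."""
    shutdown_requested.set()


signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(fetch_job, process_job, poll_seconds: float = 1.0) -> int:
    """Process jobs until SIGTERM arrives, then finish the current job and exit.

    fetch_job and process_job are hypothetical callables supplied by the app.
    Returns the number of jobs completed.
    """
    done = 0
    while not shutdown_requested.is_set():
        job = fetch_job()
        if job is None:
            # Idle: wait, but wake immediately if shutdown is requested.
            shutdown_requested.wait(poll_seconds)
            continue
        process_job(job)  # must complete well inside the shutdown window
        done += 1
    return done
```

The handler only sets a flag, so the loop never abandons a job halfway; the trade-off is that a single job has to be short enough to finish inside the window.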

Split your Fargate quota before dev takes down prod

Fargate vCPU quota is per-region, per-account. If dev and prod share an account, they share the same quota pool. A developer running load tests against a dev environment can exhaust the regional Fargate quota — and production can't scale up during a traffic spike.

AWS has no native mechanism to reserve quota for production. The default Fargate On-Demand vCPU quota is 6 vCPUs per region (soft limit, increasable to 10,000+ via support ticket). Dev and prod compete for the same pool.

KEY INSIGHT: Fargate quota sharing is invisible until it bites you. You won't know it happened until prod fails to scale during an incident. At that point, the fix takes hours — filing a support ticket and waiting for the quota increase to propagate. Account-level separation (prod in one account, non-prod in another) eliminates this class of incident.

The fix: separate accounts for prod vs non-prod. If that's not immediately feasible, monitor quota utilization proactively. Go to Service Quotas → AWS Fargate → Running On-Demand Fargate vCPUs in the AWS Console. Set a CloudWatch alarm at 70% utilization so you have time to react before hitting the limit. Quota increase requests can take 24–72 hours — at 70% you have days of runway; at 95% you have hours.

Two more constraints that hit at fleet scale: (1) Fargate launch rate — 20 tasks/second sustained in older regions, 5/second in newer ones. If your scheduler tries to start 100 tasks across 10 environments simultaneously, you'll hit the throttle. Add jitter to scheduled starts. (2) ECS API throttle — 10 burst requests/second, 1 sustained. Scripts that poll DescribeServices across 50 services will get rate-limited. Add exponential backoff and batch calls.
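Both mitigations are a few lines of code. This sketch shows one way to do each: random start offsets sized to a launch-rate limit, and exponential backoff with full jitter. `fn` is a hypothetical callable that raises `RuntimeError` when throttled; real code would catch the SDK's own throttling exception instead.

```python
import random
import time


def jittered_offsets(task_count: int, max_rate_per_sec: float, rng: random.Random) -> list[float]:
    """Spread task starts over a window sized so the average rate stays under the limit.

    Returns sorted start offsets in seconds from the schedule trigger.
    Random placement bounds the average rate, not every instantaneous burst.
    """
    window = task_count / max_rate_per_sec
    return sorted(rng.uniform(0, window) for _ in range(task_count))


def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0,
                      max_delay: float = 30.0, sleep=time.sleep, rng=None):
    """Retry fn with exponential backoff and full jitter."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng.uniform(0, cap))


offsets = jittered_offsets(100, max_rate_per_sec=5, rng=random.Random(42))
print(f"{len(offsets)} starts spread over {offsets[-1]:.1f}s")
```

With 100 tasks and a 5/second limit the window is 20 seconds, which is usually invisible to developers but enough to stay clear of the throttle.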

The ECS multi-environment strategy guide covers account structure patterns in detail, including when to split further and how to set up cross-account IAM for Fortem-style tooling.

Common questions

How do I track per-environment costs in AWS?

Should I use one ECS cluster or one per environment?

What's the fastest way to start saving on ECS Fargate costs?

How do I prevent developers from leaving environments running 24/7?

Does Fortem replace Terraform?
