ECS vs EKS in 2026: An Honest Comparison from Someone Who Has Run Both in Production
Rahul Pandya · 2026-05-23 · via DEV Community

No marketing fluff. Just what I learned running both in production across three different companies over five years.


I've had this conversation more times than I can count.

Someone joins a new team, looks at the infrastructure, and asks: "Why are we on ECS and not Kubernetes?" Or the opposite — they look at the EKS cluster and ask: "Why aren't we using ECS? This seems way more complicated than it needs to be."

Both questions are valid. Both have real answers. And the answers depend entirely on context that a blog post comparison table can't capture — which is exactly why most ECS vs EKS comparisons are useless. They list features side by side and conclude with something like "it depends on your use case," which tells you nothing.

I want to do something different. I want to tell you what it actually feels like to operate these two platforms, what goes wrong, what the costs look like in practice, and what signals I use to make the call when starting something new.

Let me start with where I've been.


My Background With Both Platforms

Company 1 — a mid-size SaaS startup, around 40 engineers. We were on ECS from 2020 and stayed there until I left. About 35 services, mix of Fargate and EC2 launch types, deployed via CodePipeline. No major incidents directly attributable to ECS. Oncall was manageable.

Company 2 — a larger company migrating from a legacy monolith to microservices. They chose EKS because "that's what everyone uses now." By the time I joined, the cluster had been running for 18 months. There were 12 engineers who could modify cluster config and 2 who actually understood it. Oncall had a lot of "why is this pod in CrashLoopBackOff" at 3am.

Company 3 — a data-heavy platform, around 80 engineers. Started with EKS for the ML workloads, added ECS for the API layer later. Both running in parallel to this day for valid reasons that I'll get into.

I'm not here to tell you one is better. I'm here to tell you the truth about both.


The Core Difference That Everything Else Flows From

Before we talk about networking, autoscaling, pricing, or anything else — there's one fundamental difference you need to internalize:

ECS is a managed container orchestrator. EKS is a managed Kubernetes control plane.

That sounds like a subtle distinction but the implications are massive.

With ECS, AWS owns the scheduling logic, the service reconciliation loop, the load balancer integration, the secrets injection, and the deployment mechanics. You configure these things, but AWS operates them. When something breaks, you look at your task definition, your service events, and your CloudWatch logs. The number of things that can go wrong is bounded.

With EKS, AWS manages the Kubernetes control plane (the API server, etcd, the scheduler, the controller manager). But the cluster is still Kubernetes. The worker nodes, the CNI plugin, the ingress controller, the cert-manager, the service mesh, the cluster autoscaler, the pod disruption budgets, the RBAC policies — all of that is yours to configure, operate, and debug. The number of things that can go wrong is essentially unbounded.

This isn't a criticism of EKS. Kubernetes is powerful precisely because it's extensible. But that extensibility has an operational cost, and that cost is real and ongoing.


Setup and Initial Complexity

ECS

Getting a working ECS cluster with a deployed service behind a load balancer takes a few hours with Terraform the first time. Here's the core of the setup (the IAM roles, ALB, target group, log group, and secret it references are defined elsewhere):

# ECS Cluster
resource "aws_ecs_cluster" "main" {
  name = "${var.project_name}-${var.environment}"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name       = aws_ecs_cluster.main.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 100
    base              = 1
  }
}

# Task Definition
resource "aws_ecs_task_definition" "app" {
  family                   = "${var.project_name}-${var.environment}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 512
  memory                   = 1024
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([{
    name      = "app"
    image     = "${var.ecr_repository_url}:${var.image_tag}"
    essential = true

    portMappings = [{
      containerPort = 8080
      protocol      = "tcp"
    }]

    environment = [
      { name = "APP_ENV", value = var.environment }
    ]

    secrets = [{
      name      = "DATABASE_URL"
      valueFrom = aws_secretsmanager_secret.db_url.arn
    }]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.app.name
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = "ecs"
      }
    }

    healthCheck = {
      command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 60
    }
  }])
}

# ECS Service
resource "aws_ecs_service" "app" {
  name            = "${var.project_name}-${var.environment}"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 100
    base              = 1
  }

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 8080
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  lifecycle {
    ignore_changes = [task_definition, desired_count]
  }
}

That's it. No additional tooling, no CNI configuration, no ingress controller to install. ALB integration is native. Secrets come from Secrets Manager through the task execution role. Logs reach CloudWatch through the awslogs driver, with no agent to install.

A reasonably experienced engineer can own this end-to-end.

EKS

The EKS setup story is longer. Much longer. Here's what a real production EKS setup involves beyond just the cluster:

# EKS Cluster
resource "aws_eks_cluster" "main" {
  name     = "${var.project_name}-${var.environment}"
  role_arn = aws_iam_role.eks_cluster.arn
  version  = "1.29" # example only; pin to a version EKS currently supports

  vpc_config {
    subnet_ids              = concat(var.private_subnet_ids, var.public_subnet_ids)
    endpoint_private_access = true
    endpoint_public_access  = true
    public_access_cidrs     = var.allowed_cidrs
  }

  enabled_cluster_log_types = [
    "api", "audit", "authenticator", "controllerManager", "scheduler"
  ]

  depends_on = [
    aws_iam_role_policy_attachment.eks_cluster_policy,
    aws_iam_role_policy_attachment.eks_vpc_resource_controller,
    aws_cloudwatch_log_group.eks,
  ]

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Node Group
resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "${var.project_name}-${var.environment}-ng"
  node_role_arn   = aws_iam_role.eks_node.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["m6g.large"]
  ami_type        = "AL2_ARM_64"

  scaling_config {
    desired_size = 3
    max_size     = 20
    min_size     = 2
  }

  update_config {
    max_unavailable = 1
  }

  labels = {
    role        = "general"
    environment = var.environment
  }

  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }

  depends_on = [
    aws_iam_role_policy_attachment.eks_worker_node_policy,
    aws_iam_role_policy_attachment.eks_cni_policy,
    aws_iam_role_policy_attachment.ecr_read_only,
  ]
}

# OIDC Provider (required for IRSA - IAM Roles for Service Accounts)
data "tls_certificate" "eks" {
  url = aws_eks_cluster.main.identity[0].oidc[0].issuer
}

resource "aws_iam_openid_connect_provider" "eks" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
  url             = aws_eks_cluster.main.identity[0].oidc[0].issuer
}

But the Terraform cluster resource is just the beginning. You also need to install and configure:

AWS Load Balancer Controller — because EKS doesn't have native ALB integration the way ECS does. You install this as a Helm chart.

helm repo add eks https://aws.github.io/eks-charts
helm repo update

helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=${CLUSTER_NAME} \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller

Cluster Autoscaler — EKS won't automatically scale your node count based on pending pods. You need to install and configure the cluster autoscaler.

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=${CLUSTER_NAME} \
  --set awsRegion=${AWS_REGION} \
  --set rbac.serviceAccount.name=cluster-autoscaler \
  --set extraArgs.balance-similar-node-groups=true \
  --set extraArgs.skip-nodes-with-system-pods=false

EBS or EFS CSI Driver — if any of your workloads need persistent volumes. Another Helm chart, another IAM role for service account.

External Secrets Operator — because Kubernetes Secrets are base64-encoded, not encrypted. If you want your secrets to come from AWS Secrets Manager (which you do), you need an operator to bridge the two.

# external-secrets/secret-store.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secretsmanager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
            namespace: external-secrets

metrics-server — required for Horizontal Pod Autoscaler to work. Another installation.

By the time you have a production-ready EKS cluster, you've installed at minimum five separate components, each with its own configuration, versioning, and upgrade lifecycle. This isn't EKS being bad — it's just the nature of the platform. You're building on top of a general-purpose orchestrator, not a purpose-built AWS service.

Verdict on setup complexity: ECS wins decisively. Not slightly — decisively. The first-time setup difference is measured in days, not hours.


Networking

ECS Networking

ECS networking in Fargate mode is remarkably simple. Each task gets its own ENI with its own IP address. Security groups work exactly like they do for EC2 instances. You create a security group for your tasks, you define the ingress and egress rules, done.

resource "aws_security_group" "ecs_tasks" {
  name        = "${var.project_name}-ecs-tasks"
  description = "Security group for ECS tasks"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
    description     = "Allow traffic from ALB only"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    description = "Allow all outbound"
  }
}

Service-to-service communication in ECS works through internal ALBs or AWS Cloud Map for service discovery. Neither requires deep networking knowledge.

# Service discovery with Cloud Map
resource "aws_service_discovery_private_dns_namespace" "internal" {
  name = "${var.project_name}.internal"
  vpc  = var.vpc_id
}

resource "aws_service_discovery_service" "app" {
  name = "user-service"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.internal.id

    dns_records {
      ttl  = 10
      type = "A"
    }

    routing_policy = "MULTIVALUE"
  }

  health_check_custom_config {
    failure_threshold = 1
  }
}

# Now your services can reach user-service.projectname.internal

EKS Networking

Kubernetes networking is famously complex. There are multiple layers: the CNI plugin that assigns pod IPs, kube-proxy for service routing, CoreDNS for service discovery, and your ingress controller for external traffic. Each layer has its own configuration surface.

With EKS and the VPC CNI plugin (the default), pods get real VPC IP addresses. This is a significant advantage over some other Kubernetes setups: your pods are first-class VPC citizens, and you can attach security groups directly to pods through a SecurityGroupPolicy.

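# pod-security-group.yaml
# Pods get a security group through a SecurityGroupPolicy that selects them by label.
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: my-app
spec:
  podSelector:
    matchLabels:
      app: my-app
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0  # placeholder for the pod security group ID
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  securityContext:
    runAsNonRoot: true
  containers:
  - name: app
    image: my-app:latest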

# Security group for pods (requires VPC CNI security groups for pods feature)
resource "aws_security_group" "pod_sg" {
  name        = "${var.project_name}-pod-sg"
  description = "Security group for application pods"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Service-to-service communication in Kubernetes uses DNS names automatically. Any service is reachable at service-name.namespace.svc.cluster.local. This is one area where Kubernetes is genuinely more elegant than ECS.

# Internal service call - no configuration needed beyond the Service object
apiVersion: v1
kind: Service
metadata:
  name: user-service
  namespace: production
spec:
  selector:
    app: user-service
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
---
# Any pod can now reach this at: http://user-service.production.svc.cluster.local

For external traffic, you need an ingress controller. With the AWS Load Balancer Controller installed, you annotate your Ingress objects and it provisions ALBs automatically:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: production
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/xxx
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443},{"HTTP":80}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/healthcheck-path: /health
spec:
  ingressClassName: alb  # replaces the deprecated kubernetes.io/ingress.class annotation
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service
            port:
              number: 80

This works well once it's set up. But getting to "works well" requires understanding ALB controller annotations, troubleshooting why the ALB isn't being provisioned, and handling the IRSA (IAM Roles for Service Accounts) configuration for the controller itself.

Verdict on networking: ECS is simpler for straightforward use cases. EKS networking is more powerful and flexible, especially for multi-service architectures, but requires more operational knowledge.


Autoscaling

ECS Autoscaling

ECS autoscaling has two dimensions: scaling the number of tasks (Application Auto Scaling) and scaling the underlying compute (if using EC2 launch type — not needed for Fargate).

# Scale tasks based on CPU
resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = 50
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu_scaling" {
  name               = "${var.project_name}-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 60.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}

# Scale based on ALB request count per target
resource "aws_appautoscaling_policy" "request_scaling" {
  name               = "${var.project_name}-request-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 1000.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.app.arn_suffix}"
    }
  }
}

With Fargate, you don't manage nodes at all. AWS handles the underlying compute. You define the CPU and memory your task needs, and Fargate provisions capacity. Scaling the task count scales your actual compute footprint automatically.

This is genuinely magical for teams that don't want to think about node sizing, node upgrades, or compute capacity planning.

EKS Autoscaling

EKS autoscaling has three dimensions: HPA (Horizontal Pod Autoscaler) for pods, Cluster Autoscaler or Karpenter for nodes, and optionally VPA (Vertical Pod Autoscaler) for right-sizing resource requests.

HPA for pods:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15

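Karpenter for node provisioning (the modern approach, replacing Cluster Autoscaler). The manifest below uses the older v1alpha5 Provisioner and AWSNodeTemplate APIs; Karpenter v1 renamed these to NodePool and EC2NodeClass, so adapt the kinds and field names to the version you run: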

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["arm64"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m6g.large", "m6g.xlarge", "m6g.2xlarge", "c6g.large", "c6g.xlarge"]
  limits:
    resources:
      cpu: "200"
      memory: 800Gi
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 2592000  # 30 days - forces node rotation
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  instanceProfile: KarpenterNodeInstanceProfile-${CLUSTER_NAME}
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true

Karpenter is genuinely excellent. It provisions nodes in seconds, uses spot instances intelligently, and right-sizes node types based on actual pod requirements. If you're running EKS, Karpenter is worth the setup investment.

But notice what I mean about layers: you have the application (pods), the HPA controlling pod count, Karpenter controlling node count, and then the interaction between the two when a scale event happens. All of that works well when properly configured. Debugging it when it doesn't work requires understanding all three layers simultaneously.
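
For the pod layer, the Kubernetes documentation describes the HPA's core rule as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default), with the largest proposal winning when several metrics are configured. Here is a minimal Python sketch of that rule, using the min/max replicas from the manifest above. It leaves out stabilization windows, the `behavior` rate limits, and unready-pod handling, and the function name `hpa_desired_replicas` is illustrative rather than anything Kubernetes exposes.

```python
import math


def hpa_desired_replicas(current_replicas, metrics, min_replicas=2, max_replicas=50, tolerance=0.1):
    """Approximate the HPA replica calculation.

    metrics: non-empty list of (current_value, target_value) pairs,
    for example [(cpu_utilization, 60), (memory_utilization, 70)].
    """
    proposals = []
    for current_value, target_value in metrics:
        ratio = current_value / target_value
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to target: this metric proposes no change.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # With several metrics, the HPA takes the largest proposed replica count.
    desired = max(proposals)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 10 pods at 90% CPU (target 60) and 50% memory (target 70).
    print(hpa_desired_replicas(10, [(90, 60), (50, 70)]))
```

Whatever count this produces is only a request for pods. Whether they actually get scheduled depends on the node layer, which is where the interaction with Karpenter comes in.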

Verdict on autoscaling: ECS with Fargate is simpler and honestly sufficient for most use cases. EKS with Karpenter is more powerful and cost-optimized but requires more operational investment.


Cost

This is the section people want most and where comparison posts are most misleading.

Let me be honest: it depends heavily on your workload profile, and anyone who gives you a simple "ECS is cheaper" or "EKS is cheaper" is oversimplifying.

That said, here are the real cost levers:

ECS Fargate Costs

Fargate pricing is per-second based on the vCPU and memory you allocate to each task.

Fargate pricing (us-east-1, arm64):
- $0.03238 per vCPU-hour
- $0.00356 per GB-hour

Example: 0.5 vCPU, 1GB memory task running 24/7 for 30 days:
- vCPU: 0.5 × $0.03238 × 720 hours = $11.66/month
- Memory: 1 × $0.00356 × 720 hours = $2.56/month
- Total per task: ~$14.22/month

With 10 tasks: ~$142/month. With 50 tasks: ~$711/month.
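
If you want to rerun that arithmetic for your own task sizes, a minimal Python sketch follows. It assumes the us-east-1 arm64 rates quoted above and a 720-hour month; the function name `fargate_monthly_cost` is illustrative, and the rates should be checked against current AWS pricing before you rely on the output.

```python
# Rates quoted above for Fargate on arm64 in us-east-1 (assumed still current).
VCPU_HOUR_USD = 0.03238
GB_HOUR_USD = 0.00356
HOURS_PER_MONTH = 720  # 30 days x 24 hours, as in the example above


def fargate_monthly_cost(vcpu: float, memory_gb: float, tasks: int = 1) -> float:
    """Monthly cost in USD for `tasks` identical tasks running 24/7."""
    per_task_hour = vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return per_task_hour * HOURS_PER_MONTH * tasks


if __name__ == "__main__":
    for count in (1, 10, 50):
        cost = fargate_monthly_cost(vcpu=0.5, memory_gb=1, tasks=count)
        print(f"{count:>2} tasks: ${cost:,.2f}/month")
```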

The key insight: you pay for allocated resources, not actual usage. A task allocated 1 vCPU that runs at 10% CPU still costs the same as one running at 90%. If your workloads have inconsistent utilization, Fargate has waste built into it.
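
One way to see the waste is to divide what you pay by what you actually use. The sketch below is an illustration, not an AWS formula: `cost_per_used_vcpu_hour` is a hypothetical helper that takes the vCPU rate quoted earlier and an average utilization between 0 and 1.

```python
VCPU_HOUR_USD = 0.03238  # Fargate arm64 vCPU rate quoted earlier (us-east-1)


def cost_per_used_vcpu_hour(avg_utilization: float, rate: float = VCPU_HOUR_USD) -> float:
    """Effective price of one vCPU-hour of work actually done."""
    if not 0 < avg_utilization <= 1:
        raise ValueError("avg_utilization must be in (0, 1]")
    return rate / avg_utilization


if __name__ == "__main__":
    for utilization in (0.9, 0.5, 0.1):
        effective = cost_per_used_vcpu_hour(utilization)
        print(f"{utilization:.0%} utilized: ${effective:.4f} per used vCPU-hour")
```

The bill is the same in every case; what changes is how much useful work each dollar buys, which is why right-sizing task definitions matters more on Fargate than the headline rate suggests.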

ECS EC2 Launch Type Costs

If you use ECS with EC2 instead of Fargate, you pay for the EC2 instances whether they're fully utilized or not. This can be cheaper than Fargate at high, consistent utilization, and more expensive at low or variable utilization.

EKS Costs

EKS charges $0.10/hour per cluster ($72/month regardless of what's running on it). Your actual compute is charged at EC2 rates (or Fargate rates if using Fargate with EKS).

With Karpenter and spot instances, EKS workloads can be significantly cheaper than ECS Fargate at scale. Spot instances for m6g.large run at about a 70% discount compared to on-demand (the exact discount moves with region and demand), and Karpenter will prefer spot when its capacity-type requirement allows both and spot capacity is available.

But this cost advantage only materializes if:

  1. Your workloads tolerate spot interruptions (most stateless services do)
  2. You have enough workload to pack nodes efficiently
  3. You've invested in proper resource requests/limits so scheduling is efficient
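
Conditions 2 and 3 are really a bin-packing question. The following is a rough sketch under stated assumptions, not a measurement: the allocatable figures (1.9 vCPU and 7 GiB on an m6g.large, which has 2 vCPU and 8 GiB in total), the $0.077 on-demand hourly price, and the helper names are all placeholders to swap for your own numbers.

```python
import math

HOURS_PER_MONTH = 720  # 30 days x 24 hours


def pods_per_node(pod_cpu, pod_mem_gib, node_cpu=1.9, node_mem_gib=7.0):
    """How many identical pods fit on one node, limited by the scarcer resource.

    node_cpu and node_mem_gib are assumed allocatable values, not AWS figures.
    """
    return min(math.floor(node_cpu / pod_cpu), math.floor(node_mem_gib / pod_mem_gib))


def nodes_needed(pod_count, pod_cpu, pod_mem_gib):
    """Nodes required to run pod_count identical pods with naive packing."""
    fit = pods_per_node(pod_cpu, pod_mem_gib)
    if fit < 1:
        raise ValueError("pod does not fit on this node size")
    return math.ceil(pod_count / fit)


def monthly_node_cost(node_count, on_demand_hourly=0.077, spot_discount=0.70):
    """Monthly spot cost; the price and discount are assumptions to replace."""
    return node_count * on_demand_hourly * (1 - spot_discount) * HOURS_PER_MONTH


if __name__ == "__main__":
    nodes = nodes_needed(pod_count=50, pod_cpu=0.5, pod_mem_gib=1.0)
    print(nodes, round(monthly_node_cost(nodes), 2))
```

With these placeholder inputs, pods requesting 0.5 vCPU and 1 GiB are CPU-bound at 3 per node, so most of each node's memory sits idle. That kind of packing inefficiency is what erodes the spot discount when resource requests don't match the node shape.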

Below a certain scale (roughly under 20 concurrent tasks for most workloads), Fargate's simplicity is worth the premium. Above that scale, the math starts favoring managed EC2 nodes with spot.

My rough real-world numbers from Company 3:

| Workload | ECS Fargate | EKS + Karpenter (spot) | Savings |
| --- | --- | --- | --- |
| 50 API pods (0.5 vCPU, 1GB) | $711/month | $290/month | 59% |
| 10 background workers (2 vCPU, 4GB) | $512/month | $195/month | 62% |
| 5 ML inference (4 vCPU, 16GB) | $1,024/month | $380/month | 63% |

These numbers are real but context-dependent. The EKS cluster itself costs $72/month, plus you need to budget time for cluster maintenance and upgrades.
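
To see how much the flat cluster fee eats into those savings, here is a small Python sketch using only the monthly figures from the Company 3 comparison and the $72 fee. The workload keys and the function name are made up for illustration, and engineering time for maintenance is not modelled.

```python
CLUSTER_FEE = 72  # USD per month for the EKS control plane

# (ECS Fargate, EKS + Karpenter spot) monthly costs in USD, from the comparison above.
WORKLOADS = {
    "api": (711, 290),
    "workers": (512, 195),
    "ml_inference": (1024, 380),
}


def net_savings(workloads, cluster_fee=CLUSTER_FEE):
    """Return (fargate_total, eks_total_including_fee, fractional_saving)."""
    fargate = sum(f for f, _ in workloads.values())
    eks = sum(e for _, e in workloads.values()) + cluster_fee
    return fargate, eks, (fargate - eks) / fargate


print(net_savings(WORKLOADS)[:2])                      # (2247, 937)
print(net_savings({"api": WORKLOADS["api"]})[:2])      # (711, 362)
```

With all three workloads on one cluster the saving stays at roughly 58%. With only the API pods, it drops from 59% to about 49%, which is why the fee matters more at small scale.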


Deployments and Day-2 Operations

ECS Deployments

ECS rolling deployments are controlled by two parameters: deployment_maximum_percent and deployment_minimum_healthy_percent. The defaults (200% and 100%) mean ECS will bring up new tasks before draining old ones, ensuring no capacity loss during deployments.

resource "aws_ecs_service" "app" {
  # ...

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

The circuit breaker with automatic rollback is the feature I care most about. If a deployment fails health checks, ECS rolls back automatically. No human intervention needed at 2am.
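
To make the two rolling-deployment percentages concrete, this Python sketch computes how many tasks ECS may keep running during a deployment. It assumes the lower bound is rounded up and the upper bound is rounded down, which is how the ECS documentation describes it; the function name is illustrative.

```python
def rolling_bounds(desired_count, minimum_healthy_percent=100, maximum_percent=200):
    """Return (min_running, max_running) tasks allowed during a rolling deployment."""
    # Lower bound rounds up (ceiling division), upper bound rounds down.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    max_running = desired_count * maximum_percent // 100
    return min_running, max_running


print(rolling_bounds(4))            # (4, 8)
print(rolling_bounds(5, 50, 150))   # (3, 7)
```

With the defaults, a 4-task service never drops below 4 running tasks and may briefly run 8, which is why you need headroom (or Fargate) for the extra tasks.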

For blue/green deployments, you can use CodeDeploy with ECS:

resource "aws_codedeploy_deployment_group" "ecs" {
  app_name               = aws_codedeploy_app.main.name
  deployment_group_name  = "${var.project_name}-${var.environment}"
  deployment_config_name = "CodeDeployDefault.ECSAllAtOnce"
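  service_role_arn       = aws_iam_role.codedeploy.arn

  # Required for ECS blue/green. The ECS service itself must also use
  # deployment_controller { type = "CODE_DEPLOY" }.
  deployment_style {
    deployment_option = "WITH_TRAFFIC_CONTROL"
    deployment_type   = "BLUE_GREEN"
  }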

  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE"]
  }

  blue_green_deployment_config {
    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }

    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 5
    }
  }

  ecs_service {
    cluster_name = aws_ecs_cluster.main.name
    service_name = aws_ecs_service.app.name
  }

  load_balancer_info {
    target_group_pair_info {
      prod_traffic_route {
        listener_arns = [aws_lb_listener.https.arn]
      }

      target_group { name = aws_lb_target_group.blue.name }
      target_group { name = aws_lb_target_group.green.name }
    }
  }
}

EKS Deployments

Kubernetes rolling deployments are controlled by the Deployment spec's strategy field:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: production
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
      terminationGracePeriodSeconds: 60
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app

That topologySpreadConstraints block at the bottom is something most tutorials don't show but matters in production — it ensures your pods are spread across availability zones instead of all landing on nodes in the same AZ.
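
If the maxSkew rule feels abstract, this Python sketch shows the core check for a single constraint: a zone is eligible only if placing the new pod there keeps its count within maxSkew of the least-loaded zone. It is a simplification that ignores node capacity, taints, and any other constraints, and the zone names are placeholders.

```python
def schedulable_zones(pods_per_zone, max_skew=1):
    """Zones where one more matching pod keeps the skew within max_skew."""
    fewest = min(pods_per_zone.values())
    return sorted(
        zone
        for zone, count in pods_per_zone.items()
        if count + 1 - fewest <= max_skew
    )


print(schedulable_zones({"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 1}))
# ['us-east-1c']
```

With whenUnsatisfiable: DoNotSchedule, a pod with no eligible zone stays Pending, so make sure you have nodes (or a node autoscaler) in every zone you expect to use.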

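Kubernetes doesn't have automatic rollback on failed deployments out of the box. A rollout that stalls past progressDeadlineSeconds is reported as failed, but the Deployment controller leaves it in place. You either set up a deployment process that monitors rollout status and rolls back on failure, or you rely on tooling such as ArgoCD or Flux configured to handle this for you.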

# Manual rollback
kubectl rollout undo deployment/app -n production

# Check rollout status
kubectl rollout status deployment/app -n production

# View rollout history
kubectl rollout history deployment/app -n production
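
A deployment process that "monitors rollout status and rolls back on failure" can be as small as a wrapper around those same commands. This is a minimal Python sketch, assuming kubectl is on the PATH and already pointed at the right cluster; the function name and the 120s timeout are my own choices, not a standard.

```python
import subprocess


def deploy_with_rollback(deployment, namespace, timeout="120s", run=subprocess.run):
    """Wait for a rollout to finish; undo it if it fails or times out.

    `run` is injectable so the logic can be exercised without a cluster.
    """
    target = f"deployment/{deployment}"
    status = run(
        ["kubectl", "rollout", "status", target, "-n", namespace, f"--timeout={timeout}"]
    )
    if status.returncode == 0:
        return "rolled-out"
    # Non-zero exit: the rollout failed or timed out, so revert to the previous revision.
    run(["kubectl", "rollout", "undo", target, "-n", namespace], check=True)
    return "rolled-back"
```

In CI you would call `deploy_with_rollback("app", "production")` right after `kubectl apply` and fail the pipeline when it returns "rolled-back".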

For true blue/green or canary deployments in EKS, Argo Rollouts is the best option:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
  namespace: production
spec:
  replicas: 5
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {duration: 5m}
      - setWeight: 40
      - pause: {duration: 5m}
      - setWeight: 60
      - pause: {duration: 5m}
      - setWeight: 80
      - pause: {duration: 5m}
      canaryService: app-canary
      stableService: app-stable
      trafficRouting:
        alb:
          ingress: app-ingress
          servicePort: 80
  selector:
    matchLabels:
      app: my-app
  template:
    # ... same as Deployment spec

This is genuinely more powerful than anything ECS offers for deployments. But it requires Argo Rollouts installed, the ALB controller configured, and someone who understands how canary routing works.
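
One practical consequence of the canary steps is how long a healthy release takes. This Python sketch walks the same step list (pauses converted to seconds) and returns when each traffic weight takes effect, assuming the rollout is promoted to 100% once the last step completes. It ignores pod startup time and any analysis runs.

```python
# Steps mirror the Rollout above; a 5m pause is 300 seconds.
STEPS = [
    {"setWeight": 20}, {"pause": 300},
    {"setWeight": 40}, {"pause": 300},
    {"setWeight": 60}, {"pause": 300},
    {"setWeight": 80}, {"pause": 300},
]


def canary_timeline(steps):
    """Return [(seconds_elapsed, canary_weight_percent), ...] for a step list."""
    elapsed, timeline = 0, []
    for step in steps:
        if "setWeight" in step:
            timeline.append((elapsed, step["setWeight"]))
        elif "pause" in step:
            elapsed += step["pause"]
    timeline.append((elapsed, 100))  # full promotion after the final step
    return timeline


print(canary_timeline(STEPS))
# [(0, 20), (300, 40), (600, 60), (900, 80), (1200, 100)]
```

So every deploy takes at least 20 minutes with this configuration, which is worth knowing before you adopt it for a service that ships many times a day.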

Verdict on deployments: ECS is simpler with sensible defaults and automatic rollback built in. EKS is more powerful with proper tooling but requires more setup and expertise.


IAM and Security

ECS

ECS uses task-level IAM roles, which is clean and intuitive. Each task gets a role, that role has permissions, done.

resource "aws_iam_role" "ecs_task" {
  name = "${var.project_name}-ecs-task-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "ecs_task_permissions" {
  name = "task-permissions"
  role = aws_iam_role.ecs_task.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "${aws_s3_bucket.app.arn}/*"
      },
      {
        Effect   = "Allow"
        Action   = ["sqs:SendMessage", "sqs:ReceiveMessage", "sqs:DeleteMessage"]
        Resource = aws_sqs_queue.app.arn
      }
    ]
  })
}

The task role is bound to the task definition, so every task that runs from that definition gets those permissions. Simple and auditable.

EKS

EKS uses IRSA (IAM Roles for Service Accounts) to bind IAM permissions to Kubernetes service accounts. It's more flexible and pod-level, but the setup is more involved.

# IAM role with trust policy for the specific service account
resource "aws_iam_role" "app_service_account" {
  name = "${var.project_name}-app-sa-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.eks.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub" = "system:serviceaccount:production:app-service-account"
          "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy" "app_permissions" {
  name = "app-permissions"
  role = aws_iam_role.app_service_account.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "${aws_s3_bucket.app.arn}/*"
      }
    ]
  })
}

# kubernetes/service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service-account
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-project-app-sa-role  # placeholder 12-digit account ID

# kubernetes/deployment.yaml
spec:
  template:
    spec:
      serviceAccountName: app-service-account  # Links to the IAM role
      containers:
      - name: app
        # ...

IRSA is secure and granular — individual pods get individual IAM roles. But it's also more moving parts. The OIDC provider, the trust policy with the exact service account reference, the Kubernetes service account with the annotation, and the deployment referencing the service account. All four pieces have to be correct for it to work.
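
Because a typo in any of those pieces fails silently until the pod gets an access-denied error, a small pre-flight check can help. This Python sketch compares a trust policy document against the namespace and service account you intend to use. The function names and the issuer URL in the usage example are placeholders, and it only checks StringEquals conditions.

```python
def oidc_host(issuer_url):
    """Strip the scheme, as the Terraform replace() call above does."""
    prefix = "https://"
    return issuer_url[len(prefix):] if issuer_url.startswith(prefix) else issuer_url


def expected_conditions(issuer_url, namespace, service_account):
    host = oidc_host(issuer_url)
    return {
        f"{host}:sub": f"system:serviceaccount:{namespace}:{service_account}",
        f"{host}:aud": "sts.amazonaws.com",
    }


def trust_policy_mismatches(trust_policy, issuer_url, namespace, service_account):
    """Return human-readable problems; an empty list means the conditions line up."""
    expected = expected_conditions(issuer_url, namespace, service_account)
    problems = []
    for statement in trust_policy["Statement"]:
        actual = statement.get("Condition", {}).get("StringEquals", {})
        for key, value in expected.items():
            if actual.get(key) != value:
                problems.append(f"{key}: expected {value!r}, found {actual.get(key)!r}")
    return problems
```

For example, checking a policy written for the production namespace against a service account in staging returns one mismatch on the `:sub` key, which is exactly the mistake that is hard to spot by eye.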

Verdict on IAM: ECS is simpler. EKS is more granular (pod-level vs task-level). Both are secure when used correctly.


Observability

ECS

CloudWatch is the native home for ECS logs and metrics. Container Insights gives you CPU, memory, network, and storage metrics per task. Log routing is configured in the task definition.

resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/${var.project_name}-${var.environment}"
  retention_in_days = 30
}

For ECS on Fargate, FireLens is the way to route logs to multiple destinations (Datadog, Splunk, S3) without changing your application code:

{
  "name": "log_router",
  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
  "essential": true,
  "firelensConfiguration": {
    "type": "fluentbit"
  }
}

EKS

EKS ships logs and metrics to CloudWatch via the CloudWatch agent and Fluent Bit, but this requires setup. The AWS-managed add-on for CloudWatch observability simplifies this:

aws eks create-addon \
  --cluster-name ${CLUSTER_NAME} \
  --addon-name amazon-cloudwatch-observability \
  --service-account-role-arn ${CLOUDWATCH_ROLE_ARN}

For application logs, you deploy Fluent Bit as a DaemonSet that collects from all nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: fluent-bit
  template:
    metadata:
      labels:
        name: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluent-bit
        image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 200m
            memory: 256Mi
        env:
        - name: AWS_REGION
          value: us-east-1
        - name: CLUSTER_NAME
          value: my-cluster
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        # Legacy Docker path; on containerd-based nodes container logs live under /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

The Kubernetes-native observability story with Prometheus + Grafana is powerful and widely adopted. If your team already operates a Prometheus/Grafana stack or uses a platform like Datadog, EKS integrates naturally.

# Service Monitor for Prometheus scraping
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - production
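
For context on what that scrape actually reads, the `/metrics` path returns plain text in the Prometheus exposition format. This Python sketch renders a counter in that format by hand; in a real service you would more likely use a Prometheus client library, and the metric name here is invented for the example.

```python
def render_counter(name, help_text, samples):
    """Render one counter metric in Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


print(render_counter(
    "app_requests_total",           # hypothetical metric name
    "Total requests handled.",
    [({"method": "GET"}, 3)],
))
```

The ServiceMonitor only finds this endpoint if the Service carries the `app: my-app` label and names its port `metrics`, so those two details are the first things to check when a target never shows up.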

Verdict on observability: ECS integrates more naturally with CloudWatch. EKS has a richer ecosystem but requires more setup. If your company is already invested in Prometheus/Grafana, EKS wins this one.


When I'd Actually Choose Each One

Here's where I give you the real opinion instead of the "it depends" cop-out.

Choose ECS When:

Your team is primarily application engineers, not platform engineers. ECS doesn't require deep operational expertise. A team of four backend engineers can own an ECS-based infrastructure without a dedicated DevOps/platform role. EKS cannot make that claim.

You're building on AWS and want tight native integration. ECS is an AWS service. IAM, ALB, CloudWatch, Secrets Manager, Service Connect — all integrate natively without additional tooling. If you're all-in on AWS (and most startups are), this matters.

You have fewer than 50 services and don't need advanced traffic management. Below this threshold, the operational overhead of EKS rarely pays off. You'll spend more time managing the cluster than you save from its flexibility.

Fargate's simplicity is genuinely valuable to you. No node management, no AMI updates, no node security patching. For teams that don't want to think about compute, Fargate is remarkable. You describe what your task needs, it runs.

You need to move fast. A working ECS environment with CI/CD can be set up in a day. EKS done properly takes a week the first time. When time-to-production matters, ECS has a clear advantage.

Choose EKS When:

You have multi-cloud requirements or may need to migrate off AWS. Kubernetes is portable. Your application manifests, your Helm charts, your operational knowledge — it all works on GKE, AKS, or self-managed Kubernetes. ECS knowledge doesn't transfer. If there's any chance you'll need to run workloads outside AWS, this is a significant factor.

You have specialized workloads that need Kubernetes-specific features. That covers GPU workloads for ML inference, jobs that use init containers and sidecars extensively, workloads that need custom schedulers, and anything that benefits from the Kubernetes extension ecosystem (custom operators, CRDs). EKS handles these; ECS handles them less gracefully or not at all (GPUs, for example, need the EC2 launch type because Fargate doesn't offer them).

You have a large team with dedicated platform engineers. If you have people whose job is operating the container platform, EKS's complexity becomes manageable and its power becomes accessible. A 3-engineer platform team can run EKS well. A 1-person DevOps team probably can't give it the attention it needs while keeping everything else running.

Cost optimization at scale matters. At 100+ pods, the combination of EC2 spot instances and Karpenter's bin-packing can deliver meaningful savings over Fargate. The math becomes compelling at scale in a way it doesn't for smaller deployments.

You're already using Kubernetes elsewhere. If your team runs EKS and you're adding a new service, it goes on EKS. The operational patterns are established, the monitoring is set up, the oncall runbook exists. Don't introduce ECS complexity just to avoid adding another EKS service.


The Hard Truths Nobody Puts in Comparison Posts

EKS oncall is harder. When a production incident happens at 2am on EKS, you're debugging Kubernetes. You need to understand pod states, node conditions, CNI issues, RBAC errors, resource quota exhaustion, and admission webhook failures in addition to your actual application. ECS incidents are usually simpler: bad task definition, failing health check, insufficient IAM permissions. The debugging surface is smaller.

ECS has an upgrade story that doesn't wake you up at night. EKS clusters need to be upgraded every 14 months or so (AWS keeps each Kubernetes minor version in standard support for about that long, after which the cluster moves to paid extended support). Node groups need upgrading separately from the control plane. Add-ons need upgrading separately from nodes. Each upgrade is a project. On Fargate, ECS manages runtime upgrades for you. It's not zero effort, but it's dramatically less.

Kubernetes expertise is widely available; ECS expertise is less so. If you're hiring, more engineers know Kubernetes than ECS. This cuts both ways: it's easier to hire for EKS, but it also means your engineers may resist ECS or want to migrate away from it.

ECS doesn't have a great multi-tenant story. If you're building a platform that hosts multiple teams' workloads and you need namespace-level isolation, RBAC, resource quotas per team — Kubernetes does this natively. ECS doesn't have a clean equivalent. You end up using separate clusters or complex IAM setups to achieve similar isolation.


The Honest Recommendation

If you're starting fresh in 2026 with a team of fewer than 15 engineers and no specific requirements driving you toward Kubernetes:

Start with ECS.

You'll ship faster. Your oncall will be less complex. The AWS integration is better. When (if) you outgrow it, migration to EKS is a well-understood project, not a crisis.

If you're starting fresh with a large engineering organization, existing Kubernetes knowledge, multi-cloud requirements, or complex workload requirements (GPU, custom schedulers, extensive sidecar patterns):

Start with EKS.

But staff it properly. An EKS cluster that nobody fully understands is worse than an ECS setup that everyone can operate. Kubernetes complexity in the wrong hands creates incidents. I've seen it happen at Company 2, and it's entirely avoidable.


Quick Reference

| Factor | ECS | EKS |
| --- | --- | --- |
| Initial setup time | Hours | Days |
| Operational complexity | Low | High |
| AWS integration | Native | Via add-ons |
| Multi-cloud portability | No | Yes |
| Cost (small scale) | Competitive | Higher (cluster overhead) |
| Cost (large scale) | Higher | Lower (spot + packing) |
| Fargate support | First-class | Supported |
| Blue/Green deployments | Via CodeDeploy | Via Argo Rollouts |
| Service mesh | App Mesh / Service Connect | Istio, Linkerd, Cilium |
| Advanced scheduling | Limited | Full Kubernetes scheduler |
| Team expertise required | Moderate | High |
| Upgrade burden | Low (Fargate handles it) | High (nodes + control plane) |

Final Thought

The engineers who have the strongest opinions about ECS vs EKS are usually the ones who've only used one of them in production. The engineers who've used both tend to be more pragmatic — they reach for whichever tool fits the problem.

Both ECS and EKS are mature, well-supported platforms. Both can run production workloads reliably. The choice is about operational tradeoffs, team capabilities, and what your workloads actually need — not about which one is objectively better.

Use the simpler one until you have a concrete reason not to.


I'm curious what others have found in practice — especially if you've migrated from one to the other. Drop a comment with what surprised you most. Migrations usually surface the tradeoffs that no comparison post captures.