ECS vs EKS in 2026: An Honest Comparison from Someone Who Has Run Both in Production

No marketing fluff. Just what I learned running both in production across three different companies over five years.

I've had this conversation more times than I can count.

Someone joins a new team, looks at the infrastructure, and asks: "Why are we on ECS and not Kubernetes?" Or the opposite — they look at the EKS cluster and ask: "Why aren't we using ECS? This seems way more complicated than it needs to be."

Both questions are valid. Both have real answers. And the answers depend entirely on context that a blog post comparison table can't capture — which is exactly why most ECS vs EKS comparisons are useless. They list features side by side and conclude with something like "it depends on your use case," which tells you nothing.

I want to do something different. I want to tell you what it actually feels like to operate these two platforms, what goes wrong, what the costs look like in practice, and what signals I use to make the call when starting something new.

Let me start with where I've been.

My Background With Both Platforms

Company 1 — a mid-size SaaS startup, around 40 engineers. We were on ECS from 2020 and stayed there until I left. About 35 services, mix of Fargate and EC2 launch types, deployed via CodePipeline. No major incidents directly attributable to ECS. Oncall was manageable.

Company 2 — a larger company migrating from a legacy monolith to microservices. They chose EKS because "that's what everyone uses now." By the time I joined, the cluster had been running for 18 months. There were 12 engineers who could modify cluster config and 2 who actually understood it. Oncall had a lot of "why is this pod in CrashLoopBackOff" at 3am.

Company 3 — a data-heavy platform, around 80 engineers. Started with EKS for the ML workloads, added ECS for the API layer later. Both running in parallel to this day for valid reasons that I'll get into.

I'm not here to tell you one is better. I'm here to tell you the truth about both.

The Core Difference That Everything Else Flows From

Before we talk about networking, autoscaling, pricing, or anything else — there's one fundamental difference you need to internalize:

ECS is a managed container orchestrator. EKS is a managed Kubernetes control plane.

That sounds like a subtle distinction, but the implications are massive.

With ECS, AWS owns the scheduling logic, the service reconciliation loop, the load balancer integration, the secrets injection, and the deployment mechanics. You configure these things, but AWS operates them. When something breaks, you look at your task definition, your service events, and your CloudWatch logs. The number of things that can go wrong is bounded.

With EKS, AWS manages the Kubernetes control plane (the API server, etcd, the scheduler, the controller manager). But the cluster is still Kubernetes. The worker nodes, the CNI plugin, the ingress controller, the cert-manager, the service mesh, the cluster autoscaler, the pod disruption budgets, the RBAC policies — all of that is yours to configure, operate, and debug. The number of things that can go wrong is essentially unbounded.

This isn't a criticism of EKS. Kubernetes is powerful precisely because it's extensible. But that extensibility has an operational cost, and that cost is real and ongoing.

Setup and Initial Complexity

ECS

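Getting a working ECS cluster with a deployed service behind a load balancer takes a few hours with Terraform the first time. Here's the core of the setup (the IAM roles, load balancer, target group, secret, and log group it references are defined separately):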

# ECS Cluster
resource "aws_ecs_cluster" "main" {
  name = "${var.project_name}-${var.environment}"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name       = aws_ecs_cluster.main.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 100
    base              = 1
  }
}

# Task Definition
resource "aws_ecs_task_definition" "app" {
  family                   = "${var.project_name}-${var.environment}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 512
  memory                   = 1024
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([{
    name      = "app"
    image     = "${var.ecr_repository_url}:${var.image_tag}"
    essential = true

    portMappings = [{
      containerPort = 8080
      protocol      = "tcp"
    }]

    environment = [
      { name = "APP_ENV", value = var.environment }
    ]

    secrets = [{
      name      = "DATABASE_URL"
      valueFrom = aws_secretsmanager_secret.db_url.arn
    }]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.app.name
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = "ecs"
      }
    }

    healthCheck = {
      # Assumes curl is installed in the image; use wget or an app-level check otherwise
      command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 60
    }
  }])
}

# ECS Service
resource "aws_ecs_service" "app" {
  name            = "${var.project_name}-${var.environment}"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 100
    base              = 1
  }

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 8080
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  lifecycle {
    ignore_changes = [task_definition, desired_count]
  }
}

That's it. No additional tooling, no CNI configuration, no ingress controller to install. ALB integration is native. Secrets come from Secrets Manager through IAM. Logs go to CloudWatch through the awslogs driver configured above, with no agent to run.

A reasonably experienced engineer can own this end-to-end.

EKS

The EKS setup story is longer. Much longer. Here's what a real production EKS setup involves beyond just the cluster:

# EKS Cluster
resource "aws_eks_cluster" "main" {
  name     = "${var.project_name}-${var.environment}"
  role_arn = aws_iam_role.eks_cluster.arn
  version  = "1.29" # example only; pin a version that is still in EKS standard support

  vpc_config {
    subnet_ids              = concat(var.private_subnet_ids, var.public_subnet_ids)
    endpoint_private_access = true
    endpoint_public_access  = true
    public_access_cidrs     = var.allowed_cidrs
  }

  enabled_cluster_log_types = [
    "api", "audit", "authenticator", "controllerManager", "scheduler"
  ]

  depends_on = [
    aws_iam_role_policy_attachment.eks_cluster_policy,
    aws_iam_role_policy_attachment.eks_vpc_resource_controller,
    aws_cloudwatch_log_group.eks,
  ]

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Node Group
resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "${var.project_name}-${var.environment}-ng"
  node_role_arn   = aws_iam_role.eks_node.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["m6g.large"]
  ami_type        = "AL2_ARM_64"

  scaling_config {
    desired_size = 3
    max_size     = 20
    min_size     = 2
  }

  update_config {
    max_unavailable = 1
  }

  labels = {
    role        = "general"
    environment = var.environment
  }

  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }

  depends_on = [
    aws_iam_role_policy_attachment.eks_worker_node_policy,
    aws_iam_role_policy_attachment.eks_cni_policy,
    aws_iam_role_policy_attachment.ecr_read_only,
  ]
}

# OIDC Provider (required for IRSA - IAM Roles for Service Accounts)
data "tls_certificate" "eks" {
  url = aws_eks_cluster.main.identity[0].oidc[0].issuer
}

resource "aws_iam_openid_connect_provider" "eks" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
  url             = aws_eks_cluster.main.identity[0].oidc[0].issuer
}

But the Terraform cluster resource is just the beginning. You also need to install and configure:

AWS Load Balancer Controller — because EKS doesn't have native ALB integration the way ECS does. You install this as a Helm chart.

helm repo add eks https://aws.github.io/eks-charts
helm repo update

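# serviceAccount.create=false assumes the service account already exists
# in kube-system and is annotated with the controller's IAM role
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \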
  -n kube-system \
  --set clusterName=${CLUSTER_NAME} \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller

Cluster Autoscaler — EKS won't automatically scale your node count based on pending pods. You need to install and configure the cluster autoscaler.

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=${CLUSTER_NAME} \
  --set awsRegion=${AWS_REGION} \
  --set rbac.serviceAccount.name=cluster-autoscaler \
  --set extraArgs.balance-similar-node-groups=true \
  --set extraArgs.skip-nodes-with-system-pods=false

EBS or EFS CSI Driver — if any of your workloads need persistent volumes. Another Helm chart, another IAM role for service account.

External Secrets Operator: Kubernetes Secrets are only base64-encoded in manifests and API responses, which is encoding, not encryption, so you don't want them to be the source of truth. If you want your secrets to come from AWS Secrets Manager (which you do), you need an operator to bridge the two.

# external-secrets/secret-store.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secretsmanager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
            namespace: external-secrets

metrics-server — required for Horizontal Pod Autoscaler to work. Another installation.

By the time you have a production-ready EKS cluster, you've installed at minimum five separate components, each with its own configuration, versioning, and upgrade lifecycle. This isn't EKS being bad — it's just the nature of the platform. You're building on top of a general-purpose orchestrator, not a purpose-built AWS service.

Verdict on setup complexity: ECS wins decisively. Not slightly — decisively. The first-time setup difference is measured in days, not hours.

Networking

ECS Networking

ECS networking in Fargate mode is remarkably simple. Each task gets its own ENI with its own IP address. Security groups work exactly like they do for EC2 instances. You create a security group for your tasks, you define the ingress and egress rules, done.

resource "aws_security_group" "ecs_tasks" {
  name        = "${var.project_name}-ecs-tasks"
  description = "Security group for ECS tasks"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
    description     = "Allow traffic from ALB only"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    description = "Allow all outbound"
  }
}

Service-to-service communication in ECS works through internal ALBs or AWS Cloud Map for service discovery. Neither requires deep networking knowledge.

# Service discovery with Cloud Map
resource "aws_service_discovery_private_dns_namespace" "internal" {
  name = "${var.project_name}.internal"
  vpc  = var.vpc_id
}

resource "aws_service_discovery_service" "app" {
  name = "user-service"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.internal.id

    dns_records {
      ttl  = 10
      type = "A"
    }

    routing_policy = "MULTIVALUE"
  }

  health_check_custom_config {
    failure_threshold = 1
  }
}

# Now your services can reach user-service.projectname.internal

EKS Networking

Kubernetes networking is famously complex. There are multiple layers: the CNI plugin that assigns pod IPs, kube-proxy for service routing, CoreDNS for service discovery, and your ingress controller for external traffic. Each layer has its own configuration surface.

With AWS EKS and the VPC CNI plugin (the default), pods get real VPC IP addresses. This is actually a significant advantage over some other Kubernetes setups — your pods are first-class VPC citizens and you can use security groups directly on pods.

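# pod-security-group-policy.yaml
# Requires ENABLE_POD_ENI=true on the VPC CNI (aws-node) and a supported instance type
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: my-app
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: my-app
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0  # placeholder: ID of the pod security group defined below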

# Security group for pods (requires VPC CNI security groups for pods feature)
resource "aws_security_group" "pod_sg" {
  name        = "${var.project_name}-pod-sg"
  description = "Security group for application pods"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Service-to-service communication in Kubernetes uses DNS names automatically. Any service is reachable at service-name.namespace.svc.cluster.local. This is one area where Kubernetes is genuinely more elegant than ECS.

# Internal service call - no configuration needed beyond the Service object
apiVersion: v1
kind: Service
metadata:
  name: user-service
  namespace: production
spec:
  selector:
    app: user-service
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
---
# Any pod can now reach this at: http://user-service.production.svc.cluster.local

For external traffic, you need an ingress controller. With the AWS Load Balancer Controller installed, you annotate your Ingress objects and it provisions ALBs automatically:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: production
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/xxx
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443},{"HTTP":80}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/healthcheck-path: /health
spec:
  ingressClassName: alb  # replaces the deprecated kubernetes.io/ingress.class annotation
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service
            port:
              number: 80

This works well once it's set up. But getting to "works well" requires understanding ALB controller annotations, troubleshooting why the ALB isn't being provisioned, and handling the IRSA (IAM Roles for Service Accounts) configuration for the controller itself.

Verdict on networking: ECS is simpler for straightforward use cases. EKS networking is more powerful and flexible, especially for multi-service architectures, but requires more operational knowledge.

Autoscaling

ECS Autoscaling

ECS autoscaling has two dimensions: scaling the number of tasks (Application Auto Scaling) and scaling the underlying compute (if using EC2 launch type — not needed for Fargate).

# Scale tasks based on CPU
resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = 50
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu_scaling" {
  name               = "${var.project_name}-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 60.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}

# Scale based on ALB request count per target
resource "aws_appautoscaling_policy" "request_scaling" {
  name               = "${var.project_name}-request-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 1000.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.app.arn_suffix}"
    }
  }
}

With Fargate, you don't manage nodes at all. AWS handles the underlying compute. You define the CPU and memory your task needs, and Fargate provisions capacity. Scaling the task count scales your actual compute footprint automatically.

This is genuinely magical for teams that don't want to think about node sizing, node upgrades, or compute capacity planning.

EKS Autoscaling

EKS autoscaling has three dimensions: HPA (Horizontal Pod Autoscaler) for pods, Cluster Autoscaler or Karpenter for nodes, and optionally VPA (Vertical Pod Autoscaler) for right-sizing resource requests.

HPA for pods:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
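
To make the targets in that manifest concrete: the Kubernetes documentation describes the HPA's core rule as desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The following is a minimal Python sketch of that rule only. The function name and sample values are illustrative, and it deliberately ignores the behavior policies and stabilization windows configured above.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_utilization: float,
                         target_utilization: float,
                         min_replicas: int = 2,
                         max_replicas: int = 50,
                         tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 10 pods averaging 90% CPU against the 60% target
    print(hpa_desired_replicas(10, 90, 60))
    # 10 pods at 63% CPU: inside the tolerance band, so no change
    print(hpa_desired_replicas(10, 63, 60))
```

When several metrics are configured, as with CPU and memory above, the HPA evaluates each one and uses the largest result.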

Karpenter for node provisioning (the modern approach, replacing Cluster Autoscaler):

# Karpenter v1 API: NodePool and EC2NodeClass replace the older
# v1alpha5 Provisioner and AWSNodeTemplate kinds
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6g.large", "m6g.xlarge", "m6g.2xlarge", "c6g.large", "c6g.xlarge"]
      expireAfter: 720h  # 30 days - forces node rotation
  limits:
    cpu: "200"
    memory: 800Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest  # required in v1; pin a specific version in production
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  instanceProfile: KarpenterNodeInstanceProfile-${CLUSTER_NAME}
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true

Karpenter is genuinely excellent. It reacts to unschedulable pods within seconds by launching instances directly rather than resizing an Auto Scaling group, uses spot instances intelligently, and right-sizes node types based on actual pod requirements. If you're running EKS, Karpenter is worth the setup investment.

But notice what I mean about layers: you have the application (pods), the HPA controlling pod count, Karpenter controlling node count, and then the interaction between the two when a scale event happens. All of that works well when properly configured. Debugging it when it doesn't work requires understanding all three layers simultaneously.

Verdict on autoscaling: ECS with Fargate is simpler and honestly sufficient for most use cases. EKS with Karpenter is more powerful and cost-optimized but requires more operational investment.

Cost

This is the section people want most and where comparison posts are most misleading.

Let me be honest: it depends heavily on your workload profile, and anyone who gives you a simple "ECS is cheaper" or "EKS is cheaper" is oversimplifying.

That said, here are the real cost levers:

ECS Fargate Costs

Fargate pricing is per-second based on the vCPU and memory you allocate to each task.

Fargate pricing (us-east-1, arm64):
- $0.03238 per vCPU-hour
- $0.00356 per GB-hour

Example: 0.5 vCPU, 1GB memory task running 24/7 for 30 days:
- vCPU: 0.5 × $0.03238 × 720 hours = $11.66/month
- Memory: 1 × $0.00356 × 720 hours = $2.56/month
- Total per task: ~$14.22/month

With 10 tasks: ~$142/month. With 50 tasks: ~$711/month.
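
If you want to rerun that arithmetic for your own task sizes, here is a minimal Python sketch. It uses the two rates quoted above and the same 720-hour month; the function name is illustrative, and you should check current Fargate pricing for your region before relying on the output.

```python
# Rates quoted above for Fargate on arm64 in us-east-1 (USD).
VCPU_HOUR_USD = 0.03238
GB_HOUR_USD = 0.00356
HOURS_PER_MONTH = 720  # 30 days x 24 hours, matching the example above


def fargate_monthly_cost(vcpu: float, memory_gb: float, tasks: int = 1) -> float:
    """Monthly cost in USD for `tasks` identical tasks running 24/7."""
    per_task = (vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD) * HOURS_PER_MONTH
    return per_task * tasks


if __name__ == "__main__":
    for count in (1, 10, 50):
        cost = fargate_monthly_cost(0.5, 1, count)
        print(f"{count:>2} tasks: ${cost:,.2f}/month")
```

The cost depends only on what you allocate, so the same function applies whether the task is idle or saturated.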

The key insight: you pay for allocated resources, not actual usage. A task allocated 1 vCPU that runs at 10% CPU still costs the same as one running at 90%. If your workloads have inconsistent utilization, Fargate has waste built into it.

ECS EC2 Launch Type Costs

If you use ECS with EC2 instead of Fargate, you pay for the EC2 instances whether they're fully utilized or not. This can be cheaper than Fargate at high, consistent utilization, and more expensive at low or variable utilization.

EKS Costs

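EKS charges $0.10/hour per cluster in standard support (about $72 for a 720-hour month, regardless of what's running on it). Clusters left on a Kubernetes version that has moved into extended support are billed at a higher hourly rate, so factor upgrades into the cost. Your actual compute is charged at EC2 rates (or Fargate rates if using Fargate with EKS).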

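With Karpenter and spot instances, EKS workloads can be significantly cheaper than ECS Fargate at scale. Spot pricing fluctuates by instance type and availability zone, but m6g.large has run at roughly a 70% discount compared to on-demand, and Karpenter will pick spot first when your node configuration allows both capacity types and spot capacity is available.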

But this cost advantage only materializes if:

- Your workloads tolerate spot interruptions (most stateless services do)
- You have enough workload to pack nodes efficiently
- You've invested in proper resource requests/limits so scheduling is efficient

Below a certain scale (roughly under 20 concurrent tasks for most workloads), Fargate's simplicity is worth the premium. Above that scale, the math starts favoring managed EC2 nodes with spot.

My rough real-world numbers from Company 3:

Workload                             ECS Fargate    EKS + Karpenter (spot)   Savings
50 API pods (0.5 vCPU, 1GB)          $711/month     $290/month               59%
10 background workers (2 vCPU, 4GB)  $512/month     $195/month               62%
5 ML inference (4 vCPU, 16GB)        $1,024/month   $380/month               63%

These numbers are real but context-dependent. The EKS cluster itself costs $72/month, plus you need to budget time for cluster maintenance and upgrades.
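
To see how much the $72 control-plane fee changes the picture, here is a small Python sketch over the table figures. It assumes all three workloads share a single EKS cluster, which is an illustrative simplification, and it leaves out engineering time entirely. The function names are hypothetical.

```python
# Monthly figures copied from the table above: (ECS Fargate, EKS + Karpenter spot), in USD.
WORKLOADS = {
    "50 API pods": (711, 290),
    "10 background workers": (512, 195),
    "5 ML inference": (1024, 380),
}
EKS_CLUSTER_FEE = 72  # flat control-plane charge quoted above


def savings_pct(fargate: float, eks: float) -> float:
    """Percentage saved by moving from the Fargate figure to the EKS figure."""
    return (fargate - eks) / fargate * 100


def combined_savings_pct(workloads: dict, cluster_fee: float) -> float:
    """Savings across all workloads once the cluster fee is added to the EKS side."""
    fargate_total = sum(fargate for fargate, _ in workloads.values())
    eks_total = sum(eks for _, eks in workloads.values()) + cluster_fee
    return savings_pct(fargate_total, eks_total)


if __name__ == "__main__":
    for name, (fargate, eks) in WORKLOADS.items():
        print(f"{name}: {savings_pct(fargate, eks):.0f}%")
    print(f"combined, including cluster fee: {combined_savings_pct(WORKLOADS, EKS_CLUSTER_FEE):.1f}%")
```

At this size the flat fee barely dents the savings. For a cluster running only a handful of small tasks, the same fee can wipe them out.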

Deployments and Day-2 Operations

ECS Deployments

ECS rolling deployments are controlled by two parameters: deployment_maximum_percent and deployment_minimum_healthy_percent. The defaults (200% and 100%) mean ECS will bring up new tasks before draining old ones, ensuring no capacity loss during deployments.

resource "aws_ecs_service" "app" {
  # ...

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}
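
The two percentages translate into hard bounds on how many tasks can be running mid-deployment. A minimal Python sketch of that translation follows, assuming the rounding described in the ECS documentation (the lower bound rounds up, the upper bound rounds down). The function name is illustrative.

```python
def ecs_rolling_bounds(desired_count: int,
                       minimum_healthy_percent: int,
                       maximum_percent: int) -> tuple:
    """Return (min_running, max_running) task counts during a rolling deployment."""
    # Lower bound rounds up: ceil(desired * min% / 100) using integer math.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    # Upper bound rounds down: floor(desired * max% / 100).
    max_running = desired_count * maximum_percent // 100
    return min_running, max_running


if __name__ == "__main__":
    # The service above: desired_count = 2, 100% minimum healthy, 200% maximum
    print(ecs_rolling_bounds(2, 100, 200))
    # A tighter budget: 5 tasks, 50% minimum healthy, 150% maximum
    print(ecs_rolling_bounds(5, 50, 150))
```

With 100/200, ECS can start a full second set of tasks before stopping any old ones, which is why capacity never dips. On EC2 launch type, that only works if the instances have room for the extra tasks.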

The circuit breaker with automatic rollback is the feature I care most about. If a deployment fails health checks, ECS rolls back automatically. No human intervention needed at 2am.

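For blue/green deployments, you can use CodeDeploy with ECS. The service has to use the CODE_DEPLOY deployment controller (which replaces the rolling-update settings and circuit breaker shown above), and the deployment group has to declare a blue/green deployment style:

resource "aws_codedeploy_deployment_group" "ecs" {
  app_name               = aws_codedeploy_app.main.name
  deployment_group_name  = "${var.project_name}-${var.environment}"
  deployment_config_name = "CodeDeployDefault.ECSAllAtOnce"
  service_role_arn       = aws_iam_role.codedeploy.arn

  # Required for ECS: traffic is shifted between the two target groups
  deployment_style {
    deployment_option = "WITH_TRAFFIC_CONTROL"
    deployment_type   = "BLUE_GREEN"
  }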

  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE"]
  }

  blue_green_deployment_config {
    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }

    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 5
    }
  }

  ecs_service {
    cluster_name = aws_ecs_cluster.main.name
    service_name = aws_ecs_service.app.name
  }

  load_balancer_info {
    target_group_pair_info {
      prod_traffic_route {
        listener_arns = [aws_lb_listener.https.arn]
      }

      target_group { name = aws_lb_target_group.blue.name }
      target_group { name = aws_lb_target_group.green.name }
    }
  }
}

EKS Deployments

Kubernetes rolling deployments are controlled by the Deployment spec's strategy field:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: production
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
      terminationGracePeriodSeconds: 60
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app

That topologySpreadConstraints block at the bottom is something most tutorials don't show but matters in production — it ensures your pods are spread across availability zones instead of all landing on nodes in the same AZ.
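
The maxSurge and maxUnavailable values in the strategy block bound the pod count in much the same way. Here is a minimal Python sketch, assuming the rounding rules in the Kubernetes documentation (percentage maxSurge rounds up, percentage maxUnavailable rounds down). The helper names are illustrative.

```python
from typing import Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a 'NN%' string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        if round_up:
            return -(-replicas * percent // 100)  # ceiling division
        return replicas * percent // 100          # floor division
    return int(value)


def rollout_bounds(replicas: int,
                   max_surge: IntOrPercent,
                   max_unavailable: IntOrPercent) -> tuple:
    """Return (min_available, max_total) pods during a RollingUpdate."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


if __name__ == "__main__":
    # The Deployment above: 5 replicas, maxSurge 2, maxUnavailable 0
    print(rollout_bounds(5, 2, 0))
    # The Kubernetes defaults (25% / 25%) on 10 replicas
    print(rollout_bounds(10, "25%", "25%"))
```

With maxUnavailable set to 0, the rollout never drops below the full replica count, so it can only progress if the cluster has room for the surge pods.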

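Kubernetes doesn't have automatic rollback on failed deployments out of the box. A rollout that stalls is marked as failed once progressDeadlineSeconds expires, but the previous ReplicaSet is not restored for you. You either set up a deployment process that monitors rollout status and rolls back on failure, or you use a GitOps tool like Argo CD or Flux and configure it to handle this for you.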

# Manual rollback
kubectl rollout undo deployment/app -n production

# Check rollout status
kubectl rollout status deployment/app -n production

# View rollout history
kubectl rollout history deployment/app -n production

For true blue/green or canary deployments in EKS, Argo Rollouts is the best option:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
  namespace: production
spec:
  replicas: 5
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {duration: 5m}
      - setWeight: 40
      - pause: {duration: 5m}
      - setWeight: 60
      - pause: {duration: 5m}
      - setWeight: 80
      - pause: {duration: 5m}
      canaryService: app-canary
      stableService: app-stable
      trafficRouting:
        alb:
          ingress: app-ingress
          servicePort: 80
  selector:
    matchLabels:
      app: my-app
  template:
    # ... same as Deployment spec

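This is genuinely more flexible than what ECS offers for deployments: CodeDeploy gives ECS a set of predefined canary and linear traffic-shifting configurations, while Argo Rollouts lets you define arbitrary steps. But it requires Argo Rollouts installed, the ALB controller configured, and someone who understands how canary routing works.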
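
One practical consequence of the canary steps in that Rollout is how long a release takes. The Python sketch below walks the same steps and reports when each traffic weight is reached. It counts only the configured pauses, so treat the result as a lower bound; pod startup and readiness time are not modeled, and the function name is illustrative.

```python
from datetime import timedelta

# Steps copied from the Rollout above: (traffic weight in percent, pause in minutes).
STEPS = [(20, 5), (40, 5), (60, 5), (80, 5)]


def canary_timeline(steps):
    """Return [(elapsed, weight), ...], ending with full promotion at 100%."""
    elapsed = timedelta()
    timeline = []
    for weight, pause_minutes in steps:
        timeline.append((elapsed, weight))
        elapsed += timedelta(minutes=pause_minutes)
    # After the last pause the rollout promotes the canary to all traffic.
    timeline.append((elapsed, 100))
    return timeline


if __name__ == "__main__":
    for elapsed, weight in canary_timeline(STEPS):
        print(f"t+{int(elapsed.total_seconds() // 60):>2} min: {weight}% of traffic on the new version")
```

If a hotfix cannot wait that long, you need either a shorter step list or a documented way to promote manually.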

Verdict on deployments: ECS is simpler with sensible defaults and automatic rollback built in. EKS is more powerful with proper tooling but requires more setup and expertise.

IAM and Security

ECS

ECS uses task-level IAM roles, which is clean and intuitive. Each task gets a role, that role has permissions, done.

resource "aws_iam_role" "ecs_task" {
  name = "${var.project_name}-ecs-task-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "ecs_task_permissions" {
  name = "task-permissions"
  role = aws_iam_role.ecs_task.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "${aws_s3_bucket.app.arn}/*"
      },
      {
        Effect   = "Allow"
        Action   = ["sqs:SendMessage", "sqs:ReceiveMessage", "sqs:DeleteMessage"]
        Resource = aws_sqs_queue.app.arn
      }
    ]
  })
}

The task role is bound to the task definition, so every task that runs from that definition gets those permissions. Simple and auditable.

EKS

EKS uses IRSA (IAM Roles for Service Accounts) to bind IAM permissions to Kubernetes service accounts. It's more flexible and pod-level, but the setup is more involved.

# IAM role with trust policy for the specific service account
resource "aws_iam_role" "app_service_account" {
  name = "${var.project_name}-app-sa-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.eks.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub" = "system:serviceaccount:production:app-service-account"
          "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy" "app_permissions" {
  name = "app-permissions"
  role = aws_iam_role.app_service_account.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "${aws_s3_bucket.app.arn}/*"
      }
    ]
  })
}

# kubernetes/service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service-account
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-project-app-sa-role

# kubernetes/deployment.yaml
spec:
  template:
    spec:
      serviceAccountName: app-service-account  # Links to the IAM role
      containers:
      - name: app
        # ...

IRSA is secure and granular: each service account, and therefore each workload that uses it, gets its own IAM role. But it also has more moving parts: the OIDC provider, the trust policy with the exact service account reference, the Kubernetes service account with the annotation, and the deployment referencing the service account. All four pieces have to be correct for it to work.
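
The piece that most often goes wrong is the exact string in the trust policy condition. As a debugging aid, here is a small Python sketch that builds the two condition keys the same way the Terraform above does, so you can compare them against what is actually deployed. The function name is hypothetical and the issuer URL in the example is a placeholder, not a real cluster.

```python
import json


def irsa_trust_conditions(issuer_url: str, namespace: str, service_account: str) -> dict:
    """Build the StringEquals block of an IRSA trust policy."""
    # Condition keys use the OIDC issuer without the scheme,
    # mirroring replace(url, "https://", "") in the Terraform above.
    issuer = issuer_url.replace("https://", "", 1)
    return {
        "StringEquals": {
            f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
            f"{issuer}:aud": "sts.amazonaws.com",
        }
    }


if __name__ == "__main__":
    conditions = irsa_trust_conditions(
        "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",  # placeholder issuer
        "production",
        "app-service-account",
    )
    print(json.dumps(conditions, indent=2))
```

A typo in the namespace or service account name here does not fail at apply time. It shows up later as an access-denied error when the pod tries to assume the role.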

Verdict on IAM: ECS is simpler. EKS offers comparable per-workload granularity (roles bind to service accounts rather than task definitions) at the cost of more setup. Both are secure when used correctly.

Observability

ECS

CloudWatch is the native home for ECS logs and metrics. Container Insights gives you CPU, memory, network, and storage metrics per task. Log routing is configured in the task definition.

resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/${var.project_name}-${var.environment}"
  retention_in_days = 30
}

For ECS on Fargate, FireLens is the way to route logs to multiple destinations (Datadog, Splunk, S3) without changing your application code:

{
  "name": "log_router",
  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
  "essential": true,
  "firelensConfiguration": {
    "type": "fluentbit"
  }
}

EKS

EKS ships logs and metrics to CloudWatch via the CloudWatch agent and Fluent Bit, but this requires setup. The AWS-managed add-on for CloudWatch observability simplifies this:

aws eks create-addon \
  --cluster-name "${CLUSTER_NAME}" \
  --addon-name amazon-cloudwatch-observability \
  --service-account-role-arn "${CLOUDWATCH_ROLE_ARN}"

For application logs, you deploy Fluent Bit as a DaemonSet that collects from all nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: fluent-bit
  template:
    metadata:
      labels:
        name: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluent-bit
        image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 200m
            memory: 256Mi
        env:
        - name: AWS_REGION
          value: us-east-1
        - name: CLUSTER_NAME
          value: my-cluster
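        volumeMounts:
        # On containerd nodes (the default on current EKS AMIs), container logs
        # are under /var/log/pods, with symlinks in /var/log/containers.
        - name: varlog
          mountPath: /var/log
        # Only relevant on older Docker-runtime nodes.
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers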

The Kubernetes-native observability story with Prometheus + Grafana is powerful and widely adopted. If your team already operates a Prometheus/Grafana stack or uses a platform like Datadog, EKS integrates naturally.

# Service Monitor for Prometheus scraping
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - production
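A ServiceMonitor like this one fails silently when nothing matches: the selector has to match labels on the Service (not the pods), and `port: metrics` has to be the name of a port on that Service. A quick sanity check, sketched below with function names of my own and the namespace and label from the example as defaults:

```shell
#!/bin/sh
# Sketch: list Services matching the ServiceMonitor's selector, with their port names.
# $1 = namespace (default: production), $2 = label selector (default: app=my-app).
matching_service_ports() {
  kubectl get service -n "${1:-production}" -l "${2:-app=my-app}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.ports[*].name}{"\n"}{end}'
}

# Succeeds only if at least one matching Service exposes a port named "metrics".
has_metrics_port() {
  matching_service_ports "$@" | grep -qw metrics
}
```

If `has_metrics_port` fails, Prometheus has nothing to scrape no matter how correct the ServiceMonitor itself looks.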

Verdict on observability: ECS integrates more naturally with CloudWatch. EKS has a richer ecosystem but requires more setup. If your company is already invested in Prometheus/Grafana, EKS wins this one.

When I'd Actually Choose Each One

Here's where I give you the real opinion instead of the "it depends" cop-out.

Choose ECS When:

Your team is primarily application engineers, not platform engineers. ECS doesn't require deep operational expertise. A team of four backend engineers can own an ECS-based infrastructure without a dedicated DevOps/platform role. EKS cannot make that claim.

You're building on AWS and want tight native integration. ECS is an AWS service. IAM, ALB, CloudWatch, Secrets Manager, Service Connect — all integrate natively without additional tooling. If you're all-in on AWS (and most startups are), this matters.

You have fewer than 50 services and don't need advanced traffic management. Below this threshold, the operational overhead of EKS rarely pays off. You'll spend more time managing the cluster than you save from its flexibility.

Fargate's simplicity is genuinely valuable to you. No node management, no AMI updates, no node security patching. For teams that don't want to think about compute, Fargate is remarkable. You describe what your task needs, it runs.

You need to move fast. A working ECS environment with CI/CD can be set up in a day. EKS done properly takes a week the first time. When time-to-production matters, ECS has a clear advantage.

Choose EKS When:

You have multi-cloud requirements or may need to migrate off AWS. Kubernetes is portable. Your application manifests, your Helm charts, your operational knowledge — it all works on GKE, AKS, or self-managed Kubernetes. ECS knowledge doesn't transfer. If there's any chance you'll need to run workloads outside AWS, this is a significant factor.

You have specialized workloads that need Kubernetes-specific features. GPU workloads for ML inference, jobs that use init containers and sidecars extensively, workloads that need custom schedulers, anything that benefits from the Kubernetes extension ecosystem (custom operators, CRDs). EKS handles these; ECS handles them less gracefully or not at all.

You have a large team with dedicated platform engineers. If you have people whose job is operating the container platform, EKS's complexity becomes manageable and its power becomes accessible. A 3-engineer platform team can run EKS well. A 1-person DevOps team probably can't give it the attention it needs while keeping everything else running.

Cost optimization at scale matters. At 100+ pods, the combination of EC2 spot instances and Karpenter's bin-packing can deliver meaningful savings over Fargate. The math becomes compelling at scale in a way it doesn't for smaller deployments.
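To make "the math" concrete, here is a back-of-the-envelope sketch. It is deliberately simplistic: it assumes identical tasks, 730 hours per month, and a flat "usable fraction" per node to stand in for system overhead and imperfect packing. It ignores the EKS control plane fee, data transfer, and spot interruptions. All prices are arguments, so you supply current numbers from the AWS pricing pages; the function names are mine.

```shell
#!/bin/sh
# Monthly cost of N identical tasks on Fargate.
# Args: tasks, vCPU per task, GiB per task, price per vCPU-hour, price per GiB-hour.
fargate_monthly() {
  awk -v n="$1" -v cpu="$2" -v mem="$3" -v pc="$4" -v pm="$5" \
    'BEGIN { printf "%.2f\n", n * (cpu * pc + mem * pm) * 730 }'
}

# Monthly cost of packing the same tasks onto EC2 nodes.
# Args: tasks, vCPU per task, GiB per task, node vCPU, node GiB,
#       usable fraction of a node (0-1), node price per hour.
ec2_monthly() {
  awk -v n="$1" -v cpu="$2" -v mem="$3" -v ncpu="$4" -v nmem="$5" -v u="$6" -v p="$7" \
    'function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }
     BEGIN {
       by_cpu = ceil(n * cpu / (ncpu * u))   # nodes needed if CPU is the constraint
       by_mem = ceil(n * mem / (nmem * u))   # nodes needed if memory is the constraint
       nodes = (by_cpu > by_mem) ? by_cpu : by_mem
       printf "%d nodes, %.2f\n", nodes, nodes * p * 730
     }'
}
```

Run both with your real task sizes and the spot price you actually see for your instance type. At a handful of tasks the node count rounds up and eats the savings; at hundreds of tasks the gap is usually what makes Karpenter worth operating.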

You're already using Kubernetes elsewhere. If your team runs EKS and you're adding a new service, it goes on EKS. The operational patterns are established, the monitoring is set up, the oncall runbook exists. Don't introduce ECS complexity just to avoid adding another EKS service.

The Hard Truths Nobody Puts in Comparison Posts

EKS oncall is harder. When a production incident happens at 2am on EKS, you're debugging Kubernetes. You need to understand pod states, node conditions, CNI issues, RBAC errors, resource quota exhaustion, and admission webhook failures in addition to your actual application. ECS incidents are usually simpler: bad task definition, failing health check, insufficient IAM permissions. The debugging surface is smaller.
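For a sense of the difference, this is roughly the first thing I'd run on each platform when paged. Treat it as a sketch of a starting point, not a runbook; the function names are mine and every cluster, service, namespace, and pod name is a placeholder you pass in.

```shell
#!/bin/sh
# EKS: why is this pod unhealthy?  Args: namespace, pod name.
eks_triage() {
  ns="$1"; pod="$2"
  kubectl describe pod "$pod" -n "$ns"                  # state, restarts, probe and scheduling failures
  kubectl logs "$pod" -n "$ns" --previous --tail=50     # output from the container that just crashed
  kubectl get events -n "$ns" --sort-by=.lastTimestamp  # recent events in the namespace
}

# ECS: why did tasks stop?  Args: cluster, service.
ecs_triage() {
  cluster="$1"; service="$2"
  # Latest service events (failed placements, health check failures, and so on).
  aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].events[:5].message' --output text
  # Recently stopped tasks and the reason ECS recorded for each.
  tasks="$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
    --desired-status STOPPED --query 'taskArns' --output text)"
  if [ -n "$tasks" ] && [ "$tasks" != "None" ]; then
    # $tasks is intentionally unquoted so each ARN becomes its own argument.
    aws ecs describe-tasks --cluster "$cluster" --tasks $tasks \
      --query 'tasks[].[taskArn,stoppedReason]' --output text
  fi
}
```

On ECS, the stopped reason usually tells you which of the three usual suspects you are dealing with. On EKS, the output of the first three commands often just tells you which of the other layers to look at next.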

ECS has an upgrade story that doesn't wake you up at night. EKS clusters need to be upgraded at least once a year or so: AWS gives each Kubernetes minor version about 14 months of standard support, after which you either pay extra for extended support or get upgraded automatically. Node groups need upgrading separately from the control plane. Add-ons need upgrading separately from nodes. Each upgrade is a project. On Fargate, ECS manages runtime upgrades for you. It's not zero effort, but it's dramatically less.
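The mechanical part of an EKS upgrade, stripped of all the testing around it, looks something like the sketch below. It assumes one managed node group and one add-on, both passed in as placeholders, and the function name is mine. EKS only lets you move one minor version at a time, so a cluster that has fallen behind repeats this per version.

```shell
#!/bin/sh
# Sketch of the upgrade order: control plane, then nodes, then add-ons.
# Args: cluster name, target Kubernetes version, managed node group name, add-on name.
upgrade_eks() {
  cluster="$1"; version="$2"; nodegroup="$3"; addon="$4"

  # 1. Control plane first, then wait for it to settle.
  aws eks update-cluster-version --name "$cluster" \
    --kubernetes-version "$version" || return 1
  aws eks wait cluster-active --name "$cluster" || return 1

  # 2. Then the managed node group, rolled to the same version.
  aws eks update-nodegroup-version --cluster-name "$cluster" \
    --nodegroup-name "$nodegroup" --kubernetes-version "$version" || return 1
  aws eks wait nodegroup-active --cluster-name "$cluster" \
    --nodegroup-name "$nodegroup" || return 1

  # 3. Then list add-on versions compatible with the new Kubernetes version,
  #    so you can pick one for `aws eks update-addon`.
  aws eks describe-addon-versions --addon-name "$addon" \
    --kubernetes-version "$version" \
    --query 'addons[0].addonVersions[].addonVersion' --output text
}
```

The commands are the easy part. The project is everything this sketch leaves out: checking for removed APIs, upgrading self-managed controllers, and rehearsing it in a non-production cluster first.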

Kubernetes expertise is widely available; ECS expertise is less so. If you're hiring, more engineers know Kubernetes than ECS. This cuts both ways: it's easier to hire for EKS, but it also means your engineers may resist ECS or want to migrate away from it.

ECS doesn't have a great multi-tenant story. If you're building a platform that hosts multiple teams' workloads and you need namespace-level isolation, RBAC, resource quotas per team — Kubernetes does this natively. ECS doesn't have a clean equivalent. You end up using separate clusters or complex IAM setups to achieve similar isolation.
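For contrast, here is about how much it takes to carve out an isolated slice of a Kubernetes cluster for one team. This is a sketch with made-up quota numbers and a function name of my own. It assumes the team's identity reaches the cluster as a Kubernetes group with the same name as the namespace, which on EKS means mapping it from IAM through an access entry or the aws-auth ConfigMap.

```shell
#!/bin/sh
# Sketch: namespace + resource quota + RBAC for one team.  Arg: team name.
onboard_team() {
  team="$1"
  # Isolation boundary for the team's workloads.
  kubectl create namespace "$team"
  # Cap what the namespace can request in total (numbers are placeholders).
  kubectl create quota "${team}-quota" -n "$team" \
    --hard=requests.cpu=20,requests.memory=64Gi,pods=100
  # Give the team's group the built-in "edit" role, scoped to its namespace only.
  kubectl create rolebinding "${team}-edit" -n "$team" \
    --clusterrole=edit --group="$team"
}
```

Calling `onboard_team team-a` gives that team a namespace it can deploy into and nothing else. Getting the same boundaries on ECS means stitching together clusters, IAM conditions, and tagging conventions yourself.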

The Honest Recommendation

If you're starting fresh in 2026 with a team of fewer than 15 engineers and no specific requirements driving you toward Kubernetes:

Start with ECS.

You'll ship faster. Your oncall will be less complex. The AWS integration is better. When (if) you outgrow it, migration to EKS is a well-understood project, not a crisis.

If you're starting fresh with a large engineering organization, existing Kubernetes knowledge, multi-cloud requirements, or complex workload requirements (GPU, custom schedulers, extensive sidecar patterns):

Start with EKS.

But staff it properly. An EKS cluster that nobody fully understands is worse than an ECS setup that everyone can operate. Kubernetes complexity in the wrong hands creates incidents. I've seen it happen at Company 2, and it's entirely avoidable.

Quick Reference

Factor	ECS	EKS
Initial setup time	Hours	Days
Operational complexity	Low	High
AWS integration	Native	Via add-ons
Multi-cloud portability	No	Yes
Cost (small scale)	Competitive	Higher (cluster overhead)
Cost (large scale)	Higher	Lower (spot + packing)
Fargate support	First-class	Supported
Blue/Green deployments	Via CodeDeploy	Via Argo Rollouts
Service mesh	Service Connect (App Mesh is being retired)	Istio, Linkerd, Cilium
Advanced scheduling	Limited	Full Kubernetes scheduler
Team expertise required	Moderate	High
Upgrade burden	Low (Fargate handles it)	High (nodes + control plane)

Final Thought

The engineers who have the strongest opinions about ECS vs EKS are usually the ones who've only used one of them in production. The engineers who've used both tend to be more pragmatic — they reach for whichever tool fits the problem.

Both ECS and EKS are mature, well-supported platforms. Both can run production workloads reliably. The choice is about operational tradeoffs, team capabilities, and what your workloads actually need — not about which one is objectively better.

Use the simpler one until you have a concrete reason not to.

I'm curious what others have found in practice — especially if you've migrated from one to the other. Drop a comment with what surprised you most. Migrations usually surface the tradeoffs that no comparison post captures.
