No marketing fluff. Just what I learned running both in production across three different companies over five years.
I've had this conversation more times than I can count.
Someone joins a new team, looks at the infrastructure, and asks: "Why are we on ECS and not Kubernetes?" Or the opposite — they look at the EKS cluster and ask: "Why aren't we using ECS? This seems way more complicated than it needs to be."
Both questions are valid. Both have real answers. And the answers depend entirely on context that a blog post comparison table can't capture — which is exactly why most ECS vs EKS comparisons are useless. They list features side by side and conclude with something like "it depends on your use case," which tells you nothing.
I want to do something different. I want to tell you what it actually feels like to operate these two platforms, what goes wrong, what the costs look like in practice, and what signals I use to make the call when starting something new.
Let me start with where I've been.
My Background With Both Platforms
Company 1 — a mid-size SaaS startup, around 40 engineers. We were on ECS from 2020 and stayed there until I left. About 35 services, mix of Fargate and EC2 launch types, deployed via CodePipeline. No major incidents directly attributable to ECS. Oncall was manageable.
Company 2 — a larger company migrating from a legacy monolith to microservices. They chose EKS because "that's what everyone uses now." By the time I joined, the cluster had been running for 18 months. There were 12 engineers who could modify cluster config and 2 who actually understood it. Oncall had a lot of "why is this pod in CrashLoopBackOff" at 3am.
Company 3 — a data-heavy platform, around 80 engineers. Started with EKS for the ML workloads, added ECS for the API layer later. Both running in parallel to this day for valid reasons that I'll get into.
I'm not here to tell you one is better. I'm here to tell you the truth about both.
The Core Difference That Everything Else Flows From
Before we talk about networking, autoscaling, pricing, or anything else — there's one fundamental difference you need to internalize:
ECS is a managed container orchestrator. EKS is a managed Kubernetes control plane.
That sounds like a subtle distinction but the implications are massive.
With ECS, AWS owns the scheduling logic, the service reconciliation loop, the load balancer integration, the secrets injection, and the deployment mechanics. You configure these things, but AWS operates them. When something breaks, you look at your task definition, your service events, and your CloudWatch logs. The number of things that can go wrong is bounded.
With EKS, AWS manages the Kubernetes control plane (the API server, etcd, the scheduler, the controller manager). But the cluster is still Kubernetes. The worker nodes, the CNI plugin, the ingress controller, the cert-manager, the service mesh, the cluster autoscaler, the pod disruption budgets, the RBAC policies — all of that is yours to configure, operate, and debug. The number of things that can go wrong is essentially unbounded.
This isn't a criticism of EKS. Kubernetes is powerful precisely because it's extensible. But that extensibility has an operational cost, and that cost is real and ongoing.
Setup and Initial Complexity
ECS
Getting a working ECS cluster with a deployed service behind a load balancer takes a few hours with Terraform the first time. Here's the core of the setup (the VPC, ALB, IAM roles, secret, and log group it references are defined elsewhere):
# ECS Cluster
resource "aws_ecs_cluster" "main" {
name = "${var.project_name}-${var.environment}"
setting {
name = "containerInsights"
value = "enabled"
}
tags = {
Environment = var.environment
ManagedBy = "terraform"
}
}
resource "aws_ecs_cluster_capacity_providers" "main" {
cluster_name = aws_ecs_cluster.main.name
capacity_providers = ["FARGATE", "FARGATE_SPOT"]
default_capacity_provider_strategy {
capacity_provider = "FARGATE"
weight = 100
base = 1
}
}
# Task Definition
resource "aws_ecs_task_definition" "app" {
family = "${var.project_name}-${var.environment}"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = 512
memory = 1024
execution_role_arn = aws_iam_role.ecs_execution.arn
task_role_arn = aws_iam_role.ecs_task.arn
container_definitions = jsonencode([{
name = "app"
image = "${var.ecr_repository_url}:${var.image_tag}"
essential = true
portMappings = [{
containerPort = 8080
protocol = "tcp"
}]
environment = [
{ name = "APP_ENV", value = var.environment }
]
secrets = [{
name = "DATABASE_URL"
valueFrom = aws_secretsmanager_secret.db_url.arn
}]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = aws_cloudwatch_log_group.app.name
"awslogs-region" = var.region
"awslogs-stream-prefix" = "ecs"
}
}
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60
}
}])
}
# ECS Service
resource "aws_ecs_service" "app" {
name = "${var.project_name}-${var.environment}"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.app.arn
desired_count = 2
capacity_provider_strategy {
capacity_provider = "FARGATE"
weight = 100
base = 1
}
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.ecs_tasks.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.app.arn
container_name = "app"
container_port = 8080
}
deployment_circuit_breaker {
enable = true
rollback = true
}
deployment_maximum_percent = 200
deployment_minimum_healthy_percent = 100
lifecycle {
ignore_changes = [task_definition, desired_count]
}
}
That's it. No additional tooling, no CNI configuration, no ingress controller to install. ALB integration is native. Secrets come from Secrets Manager through IAM. Logs go to CloudWatch through the awslogs driver configured in the task definition.
A reasonably experienced engineer can own this end-to-end.
EKS
The EKS setup story is longer. Much longer. Here's what a real production EKS setup involves beyond just the cluster:
# EKS Cluster
resource "aws_eks_cluster" "main" {
name = "${var.project_name}-${var.environment}"
role_arn = aws_iam_role.eks_cluster.arn
version = "1.29"
vpc_config {
subnet_ids = concat(var.private_subnet_ids, var.public_subnet_ids)
endpoint_private_access = true
endpoint_public_access = true
public_access_cidrs = var.allowed_cidrs
}
enabled_cluster_log_types = [
"api", "audit", "authenticator", "controllerManager", "scheduler"
]
depends_on = [
aws_iam_role_policy_attachment.eks_cluster_policy,
aws_iam_role_policy_attachment.eks_vpc_resource_controller,
aws_cloudwatch_log_group.eks,
]
tags = {
Environment = var.environment
ManagedBy = "terraform"
}
}
# Node Group
resource "aws_eks_node_group" "main" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "${var.project_name}-${var.environment}-ng"
node_role_arn = aws_iam_role.eks_node.arn
subnet_ids = var.private_subnet_ids
instance_types = ["m6g.large"]
ami_type = "AL2_ARM_64"
scaling_config {
desired_size = 3
max_size = 20
min_size = 2
}
update_config {
max_unavailable = 1
}
labels = {
role = "general"
environment = var.environment
}
lifecycle {
ignore_changes = [scaling_config[0].desired_size]
}
depends_on = [
aws_iam_role_policy_attachment.eks_worker_node_policy,
aws_iam_role_policy_attachment.eks_cni_policy,
aws_iam_role_policy_attachment.ecr_read_only,
]
}
# OIDC Provider (required for IRSA - IAM Roles for Service Accounts)
data "tls_certificate" "eks" {
url = aws_eks_cluster.main.identity[0].oidc[0].issuer
}
resource "aws_iam_openid_connect_provider" "eks" {
client_id_list = ["sts.amazonaws.com"]
thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
url = aws_eks_cluster.main.identity[0].oidc[0].issuer
}
But the Terraform cluster resource is just the beginning. You also need to install and configure:
AWS Load Balancer Controller — because EKS doesn't have native ALB integration the way ECS does. You install this as a Helm chart.
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=${CLUSTER_NAME} \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller
Cluster Autoscaler — EKS won't automatically scale your node count based on pending pods. You need to install and configure the cluster autoscaler.
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
--set autoDiscovery.clusterName=${CLUSTER_NAME} \
--set awsRegion=${AWS_REGION} \
--set rbac.serviceAccount.name=cluster-autoscaler \
--set extraArgs.balance-similar-node-groups=true \
--set extraArgs.skip-nodes-with-system-pods=false
EBS or EFS CSI Driver — if any of your workloads need persistent volumes. Another Helm chart, another IAM role for service account.
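External Secrets Operator: Kubernetes Secrets are only base64-encoded in manifests and API responses, which is encoding, not encryption (EKS can envelope-encrypt them at rest with KMS, but that does not source or rotate the values for you). If you want your secrets to come from AWS Secrets Manager (which you do), you need an operator to bridge the two.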
# external-secrets/secret-store.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secretsmanager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
            namespace: external-secrets
metrics-server — required for Horizontal Pod Autoscaler to work. Another installation.
By the time you have a production-ready EKS cluster, you've installed at minimum five separate components, each with its own configuration, versioning, and upgrade lifecycle. This isn't EKS being bad — it's just the nature of the platform. You're building on top of a general-purpose orchestrator, not a purpose-built AWS service.
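The point about Kubernetes Secrets being base64-encoded rather than encrypted is easy to verify. The snippet below is a minimal sketch using only the Python standard library; the connection string is a made-up value for illustration.

```python
import base64

# Hypothetical secret value, used only for illustration.
plaintext = "postgres://user:pass@db.internal:5432/app"

# This is the form a value takes in the `data` field of a Secret manifest.
encoded = base64.b64encode(plaintext.encode("utf-8")).decode("ascii")

# Anyone who can read the manifest can reverse it; no key is involved.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded == plaintext)
```

Because decoding needs no key, access to the Secret object is effectively access to the value, which is why the source of truth usually lives in Secrets Manager.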
Verdict on setup complexity: ECS wins decisively. Not slightly — decisively. The first-time setup difference is measured in days, not hours.
Networking
ECS Networking
ECS networking in Fargate mode is remarkably simple. Each task gets its own ENI with its own IP address. Security groups work exactly like they do for EC2 instances. You create a security group for your tasks, you define the ingress and egress rules, done.
resource "aws_security_group" "ecs_tasks" {
name = "${var.project_name}-ecs-tasks"
description = "Security group for ECS tasks"
vpc_id = var.vpc_id
ingress {
from_port = 8080
to_port = 8080
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
description = "Allow traffic from ALB only"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
description = "Allow all outbound"
}
}
Service-to-service communication in ECS works through internal ALBs or AWS Cloud Map for service discovery. Neither requires deep networking knowledge.
# Service discovery with Cloud Map
resource "aws_service_discovery_private_dns_namespace" "internal" {
name = "${var.project_name}.internal"
vpc = var.vpc_id
}
resource "aws_service_discovery_service" "app" {
name = "user-service"
dns_config {
namespace_id = aws_service_discovery_private_dns_namespace.internal.id
dns_records {
ttl = 10
type = "A"
}
routing_policy = "MULTIVALUE"
}
health_check_custom_config {
failure_threshold = 1
}
}
# Now your services can reach user-service.projectname.internal
EKS Networking
Kubernetes networking is famously complex. There are multiple layers: the CNI plugin that assigns pod IPs, kube-proxy for service routing, CoreDNS for service discovery, and your ingress controller for external traffic. Each layer has its own configuration surface.
With EKS and the VPC CNI plugin (the default), pods get real VPC IP addresses. This is a significant advantage over some other Kubernetes setups: your pods are first-class VPC citizens, and with the VPC CNI's security groups for pods feature (configured through a SecurityGroupPolicy resource) you can attach security groups directly to pods.
# pod-security-group.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    vpc.amazonaws.com/pod-eni: "true"
spec:
  securityContext:
    runAsNonRoot: true
  containers:
    - name: app
      image: my-app:latest
# Security group for pods (requires VPC CNI security groups for pods feature)
resource "aws_security_group" "pod_sg" {
name = "${var.project_name}-pod-sg"
description = "Security group for application pods"
vpc_id = var.vpc_id
ingress {
from_port = 8080
to_port = 8080
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
Service-to-service communication in Kubernetes uses DNS names automatically. Any service is reachable at service-name.namespace.svc.cluster.local. This is one area where Kubernetes is genuinely more elegant than ECS.
# Internal service call - no configuration needed beyond the Service object
apiVersion: v1
kind: Service
metadata:
  name: user-service
  namespace: production
spec:
  selector:
    app: user-service
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
---
# Any pod can now reach this at: http://user-service.production.svc.cluster.local
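To make the two naming schemes concrete, here is a small sketch that builds the DNS names described above. The helper names are hypothetical, and `cluster.local` is the common default cluster domain, which a cluster can override.

```python
def k8s_service_dns(service: str, namespace: str,
                    cluster_domain: str = "cluster.local") -> str:
    """In-cluster DNS name of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def cloud_map_dns(service: str, private_namespace: str) -> str:
    """DNS name of an ECS service registered in a Cloud Map private namespace."""
    return f"{service}.{private_namespace}"


print(k8s_service_dns("user-service", "production"))
print(cloud_map_dns("user-service", "projectname.internal"))
```

The Kubernetes name exists as soon as the Service object does; the Cloud Map name needs the namespace and discovery service resources shown in the ECS section.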
For external traffic, you need an ingress controller. With the AWS Load Balancer Controller installed, you annotate your Ingress objects and it provisions ALBs automatically:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789:certificate/xxx
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443},{"HTTP":80}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/healthcheck-path: /health
spec:
  rules:
    - host: api.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80
This works well once it's set up. But getting to "works well" requires understanding ALB controller annotations, troubleshooting why the ALB isn't being provisioned, and handling the IRSA (IAM Roles for Service Accounts) configuration for the controller itself.
Verdict on networking: ECS is simpler for straightforward use cases. EKS networking is more powerful and flexible, especially for multi-service architectures, but requires more operational knowledge.
Autoscaling
ECS Autoscaling
ECS autoscaling has two dimensions: scaling the number of tasks (Application Auto Scaling) and scaling the underlying compute (if using EC2 launch type — not needed for Fargate).
# Scale tasks based on CPU
resource "aws_appautoscaling_target" "ecs_target" {
max_capacity = 50
min_capacity = 2
resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.app.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
resource "aws_appautoscaling_policy" "cpu_scaling" {
name = "${var.project_name}-cpu-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs_target.resource_id
scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs_target.service_namespace
target_tracking_scaling_policy_configuration {
target_value = 60.0
scale_in_cooldown = 300
scale_out_cooldown = 60
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
}
}
# Scale based on ALB request count per target
resource "aws_appautoscaling_policy" "request_scaling" {
name = "${var.project_name}-request-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs_target.resource_id
scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs_target.service_namespace
target_tracking_scaling_policy_configuration {
target_value = 1000.0
scale_in_cooldown = 300
scale_out_cooldown = 60
predefined_metric_specification {
predefined_metric_type = "ALBRequestCountPerTarget"
resource_label = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.app.arn_suffix}"
}
}
}
With Fargate, you don't manage nodes at all. AWS handles the underlying compute. You define the CPU and memory your task needs, and Fargate provisions capacity. Scaling the task count scales your actual compute footprint automatically.
This is genuinely magical for teams that don't want to think about node sizing, node upgrades, or compute capacity planning.
EKS Autoscaling
EKS autoscaling has three dimensions: HPA (Horizontal Pod Autoscaler) for pods, Cluster Autoscaler or Karpenter for nodes, and optionally VPA (Vertical Pod Autoscaler) for right-sizing resource requests.
HPA for pods:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
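It helps to see how the HPA turns those targets into a replica count. The sketch below follows the documented core formula, desired = ceil(current × currentMetric / targetMetric), takes the largest proposal across metrics, and applies the default 10% tolerance. It deliberately ignores the `behavior` rate limits and stabilization windows, and the function name is my own.

```python
import math


def desired_replicas(current, metrics, min_replicas=2, max_replicas=50,
                     tolerance=0.1):
    """metrics: list of (current_utilization, target_utilization) pairs."""
    proposals = []
    for current_value, target_value in metrics:
        ratio = current_value / target_value
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to target: this metric proposes no change.
            proposals.append(current)
        else:
            proposals.append(math.ceil(current * ratio))
    # The HPA acts on the metric that asks for the most replicas.
    wanted = max(proposals)
    return max(min_replicas, min(max_replicas, wanted))


# 4 pods at 90% CPU (target 60) and 50% memory (target 70)
print(desired_replicas(4, [(90, 60), (50, 70)]))
```

In this example CPU asks for 6 replicas and memory for 3, so the HPA scales to 6. A memory target alongside a CPU target can therefore hold replica counts up even after CPU load drops.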
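Karpenter for node provisioning (the modern approach, replacing Cluster Autoscaler). The manifests below use the v1alpha5 Provisioner API; newer Karpenter releases replace it with NodePool and EC2NodeClass: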
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["arm64"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m6g.large", "m6g.xlarge", "m6g.2xlarge", "c6g.large", "c6g.xlarge"]
  limits:
    resources:
      cpu: "200"
      memory: 800Gi
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 2592000 # 30 days - forces node rotation
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  instanceProfile: KarpenterNodeInstanceProfile-${CLUSTER_NAME}
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
Karpenter is genuinely excellent. It provisions nodes in seconds, uses spot instances intelligently, and right-sizes node types based on actual pod requirements. If you're running EKS, Karpenter is worth the setup investment.
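To give a feel for what "right-sizing based on pod requirements" means, here is a toy sketch. It is not Karpenter's actual algorithm, which also weighs price, availability, daemonset overhead, and node allocatable limits. It only picks the smallest instance type from the Provisioner's list that fits the summed requests of the pending pods. The function name is hypothetical; the vCPU and memory figures are the published sizes for these instance types.

```python
# (vCPU, memory GiB) per instance type
INSTANCE_TYPES = {
    "c6g.large": (2, 4),
    "m6g.large": (2, 8),
    "c6g.xlarge": (4, 8),
    "m6g.xlarge": (4, 16),
    "m6g.2xlarge": (8, 32),
}


def smallest_fitting_type(pending_pods):
    """pending_pods: list of (cpu_request_vcpu, memory_request_gib)."""
    cpu_needed = sum(cpu for cpu, _ in pending_pods)
    mem_needed = sum(mem for _, mem in pending_pods)
    candidates = [
        (cpu, mem, name)
        for name, (cpu, mem) in INSTANCE_TYPES.items()
        if cpu >= cpu_needed and mem >= mem_needed
    ]
    # Smallest by vCPU, then by memory; None if a single node cannot fit them.
    return min(candidates)[2] if candidates else None


print(smallest_fitting_type([(0.25, 0.5)] * 6))  # CPU-bound batch of small pods
print(smallest_fitting_type([(0.5, 2.0)] * 3))   # memory-heavier pods
```

The practical takeaway is that accurate resource requests matter: the node choice is only as good as the numbers the pods declare.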
But notice what I mean about layers: you have the application (pods), the HPA controlling pod count, Karpenter controlling node count, and then the interaction between the two when a scale event happens. All of that works well when properly configured. Debugging it when it doesn't work requires understanding all three layers simultaneously.
Verdict on autoscaling: ECS with Fargate is simpler and honestly sufficient for most use cases. EKS with Karpenter is more powerful and cost-optimized but requires more operational investment.
Cost
This is the section people want most and where comparison posts are most misleading.
Let me be honest: it depends heavily on your workload profile, and anyone who gives you a simple "ECS is cheaper" or "EKS is cheaper" is oversimplifying.
That said, here are the real cost levers:
ECS Fargate Costs
Fargate pricing is per-second based on the vCPU and memory you allocate to each task.
Fargate pricing (us-east-1, arm64):
- $0.03238 per vCPU-hour
- $0.00356 per GB-hour
Example: 0.5 vCPU, 1GB memory task running 24/7 for 30 days:
- vCPU: 0.5 × $0.03238 × 720 hours = $11.66/month
- Memory: 1 × $0.00356 × 720 hours = $2.56/month
- Total per task: ~$14.22/month
With 10 tasks: ~$142/month. With 50 tasks: ~$711/month.
The key insight: you pay for allocated resources, not actual usage. A task allocated 1 vCPU that runs at 10% CPU still costs the same as one running at 90%. If your workloads have inconsistent utilization, Fargate has waste built into it.
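The arithmetic above is simple enough to put in a function. This sketch uses the arm64 us-east-1 rates quoted in this section and a 720-hour month; the function name is my own, and the rates should be checked against current pricing before you rely on them.

```python
VCPU_HOUR_USD = 0.03238   # per vCPU-hour, as quoted above
GB_HOUR_USD = 0.00356     # per GB-hour, as quoted above
HOURS_PER_MONTH = 24 * 30


def fargate_monthly_cost(vcpu, memory_gb, tasks=1):
    """Monthly cost of `tasks` always-on Fargate tasks of the given size."""
    per_task_hourly = vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return per_task_hourly * HOURS_PER_MONTH * tasks


for count in (1, 10, 50):
    print(count, round(fargate_monthly_cost(0.5, 1, count), 2))
```

Utilization does not appear anywhere in the function, which is the point: the bill depends only on the allocated size and the running time.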
ECS EC2 Launch Type Costs
If you use ECS with EC2 instead of Fargate, you pay for the EC2 instances whether they're fully utilized or not. This can be cheaper than Fargate at high, consistent utilization, and more expensive at low or variable utilization.
EKS Costs
EKS charges $0.10/hour per cluster ($72/month regardless of what's running on it). Your actual compute is charged at EC2 rates (or Fargate rates if using Fargate with EKS).
With Karpenter and spot instances, EKS workloads can be significantly cheaper than ECS Fargate at scale. Spot instances for m6g.large run at roughly a 70% discount compared to on-demand (the exact figure moves with region, AZ, and time), and Karpenter favors spot when a Provisioner allows both capacity types and spot capacity is available.
But this cost advantage only materializes if:
- Your workloads tolerate spot interruptions (most stateless services do)
- You have enough workload to pack nodes efficiently
- You've invested in proper resource requests/limits so scheduling is efficient
Below a certain scale (roughly under 20 concurrent tasks for most workloads), Fargate's simplicity is worth the premium. Above that scale, the math starts favoring managed EC2 nodes with spot.
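A rough model shows why the crossover exists. The sketch below compares Fargate against spot-backed EKS nodes, using the $72 cluster fee and the 70% spot discount from this section. The m6g.large on-demand rate ($0.077/hour) and the 80% packing efficiency are my assumptions for illustration, so treat the output as a shape, not a quote.

```python
import math

HOURS_PER_MONTH = 24 * 30
EKS_CLUSTER_FEE = 0.10 * HOURS_PER_MONTH  # $72/month


def fargate_monthly_cost(tasks, vcpu, memory_gb):
    return tasks * (vcpu * 0.03238 + memory_gb * 0.00356) * HOURS_PER_MONTH


def eks_spot_monthly_cost(tasks, vcpu, memory_gb,
                          node_vcpu=2, node_mem_gb=8,
                          on_demand_hourly=0.077,  # assumed m6g.large rate
                          spot_discount=0.70,      # from the text above
                          packing_efficiency=0.8): # assumed usable fraction
    usable_vcpu = node_vcpu * packing_efficiency
    usable_mem = node_mem_gb * packing_efficiency
    nodes = max(math.ceil(tasks * vcpu / usable_vcpu),
                math.ceil(tasks * memory_gb / usable_mem))
    node_cost = nodes * on_demand_hourly * (1 - spot_discount) * HOURS_PER_MONTH
    return EKS_CLUSTER_FEE + node_cost


for tasks in (5, 20, 50):
    print(tasks,
          round(fargate_monthly_cost(tasks, 0.5, 1), 2),
          round(eks_spot_monthly_cost(tasks, 0.5, 1), 2))
```

Under these assumed inputs, 5 small tasks come out cheaper on Fargate because the fixed cluster fee dominates, while 50 tasks come out cheaper on EKS with spot. The model leaves out engineering time, which is usually the larger cost at small scale.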
My rough real-world numbers from Company 3:
| Workload | ECS Fargate | EKS + Karpenter (spot) | Savings |
|---|---|---|---|
| 50 API pods (0.5 vCPU, 1GB) | $711/month | $290/month | 59% |
| 10 background workers (2 vCPU, 4GB) | $512/month | $195/month | 62% |
| 5 ML inference (4 vCPU, 16GB) | $1,024/month | $380/month | 63% |
These numbers are real but context-dependent. The EKS cluster itself costs $72/month, plus you need to budget time for cluster maintenance and upgrades.
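The savings column is just (Fargate − EKS) / Fargate, rounded to a whole percent. A quick check against the table's own figures:

```python
def savings_percent(fargate_usd, eks_usd):
    """Percentage saved by moving from the Fargate cost to the EKS cost."""
    return round((fargate_usd - eks_usd) / fargate_usd * 100)


rows = {
    "50 API pods": (711, 290),
    "10 background workers": (512, 195),
    "5 ML inference": (1024, 380),
}

for workload, (fargate_usd, eks_usd) in rows.items():
    print(workload, savings_percent(fargate_usd, eks_usd))
```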
Deployments and Day-2 Operations
ECS Deployments
ECS rolling deployments are controlled by two parameters: deployment_maximum_percent and deployment_minimum_healthy_percent. The defaults (200% and 100%) mean ECS will bring up new tasks before draining old ones, ensuring no capacity loss during deployments.
resource "aws_ecs_service" "app" {
# ...
deployment_maximum_percent = 200
deployment_minimum_healthy_percent = 100
deployment_circuit_breaker {
enable = true
rollback = true
}
}
The circuit breaker with automatic rollback is the feature I care most about. If a deployment fails health checks, ECS rolls back automatically. No human intervention needed at 2am.
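The two percentages translate into concrete task-count bounds during a rolling deployment. This sketch assumes the rounding AWS documents for these settings (the minimum healthy count rounds up, the maximum rounds down); the function name is hypothetical.

```python
import math


def rolling_deploy_bounds(desired_count, minimum_healthy_percent=100,
                          maximum_percent=200):
    """Return (min running tasks, max running tasks) during a deployment."""
    lower = math.ceil(desired_count * minimum_healthy_percent / 100)
    upper = math.floor(desired_count * maximum_percent / 100)
    return lower, upper


print(rolling_deploy_bounds(2))           # the service defined earlier
print(rolling_deploy_bounds(5, 50, 150))  # a tighter-capacity variant
```

With the 100/200 settings and a desired count of 2, ECS never drops below 2 running tasks and may run up to 4 while old and new tasks overlap, so you need headroom for double capacity during a deploy.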
For blue/green deployments, you can use CodeDeploy with ECS:
resource "aws_codedeploy_deployment_group" "ecs" {
  app_name               = aws_codedeploy_app.main.name
  deployment_group_name  = "${var.project_name}-${var.environment}"
  deployment_config_name = "CodeDeployDefault.ECSAllAtOnce"
  service_role_arn       = aws_iam_role.codedeploy.arn
  # Required for ECS blue/green. The ECS service itself must also use the
  # CODE_DEPLOY deployment controller instead of the default rolling one.
  deployment_style {
    deployment_option = "WITH_TRAFFIC_CONTROL"
    deployment_type   = "BLUE_GREEN"
  }
  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE"]
  }
  blue_green_deployment_config {
    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }
    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 5
    }
  }
  ecs_service {
    cluster_name = aws_ecs_cluster.main.name
    service_name = aws_ecs_service.app.name
  }
  load_balancer_info {
    target_group_pair_info {
      prod_traffic_route {
        listener_arns = [aws_lb_listener.https.arn]
      }
      target_group { name = aws_lb_target_group.blue.name }
      target_group { name = aws_lb_target_group.green.name }
    }
  }
}
EKS Deployments
Kubernetes rolling deployments are controlled by the Deployment spec's strategy field:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: production
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
      terminationGracePeriodSeconds: 60
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
That topologySpreadConstraints block at the bottom is something most tutorials don't show but matters in production — it ensures your pods are spread across availability zones instead of all landing on nodes in the same AZ.
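Here is a small sketch of the check the scheduler performs for that constraint, as I understand it: a zone is eligible if placing the pod there keeps that zone's count within `maxSkew` of the least-loaded zone. The function name and zone names are hypothetical.

```python
def eligible_zones(pods_per_zone, max_skew=1):
    """Zones where one more matching pod would not violate maxSkew."""
    global_minimum = min(pods_per_zone.values())
    return sorted(
        zone
        for zone, count in pods_per_zone.items()
        if (count + 1) - global_minimum <= max_skew
    )


# 5 replicas spread 2/2/1: the next pod may only land in the lightest zone.
print(eligible_zones({"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 1}))
# Perfectly even spread: any zone is acceptable.
print(eligible_zones({"us-east-1a": 1, "us-east-1b": 1, "us-east-1c": 1}))
```

With `whenUnsatisfiable: DoNotSchedule`, a pod with no eligible zone stays Pending, so this setting trades scheduling flexibility for AZ balance.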
Kubernetes doesn't have automatic rollback on failed deployments out of the box. You either set up a deployment process that monitors rollout status and rolls back on failure, or you use a GitOps tool like ArgoCD or Flux that handles this for you.
# Manual rollback
kubectl rollout undo deployment/app -n production
# Check rollout status
kubectl rollout status deployment/app -n production
# View rollout history
kubectl rollout history deployment/app -n production
For true blue/green or canary deployments in EKS, Argo Rollouts is the best option:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
  namespace: production
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 40
        - pause: {duration: 5m}
        - setWeight: 60
        - pause: {duration: 5m}
        - setWeight: 80
        - pause: {duration: 5m}
      canaryService: app-canary
      stableService: app-stable
      trafficRouting:
        alb:
          ingress: app-ingress
          servicePort: 80
  selector:
    matchLabels:
      app: my-app
  template:
    # ... same as Deployment spec
This is genuinely more powerful than anything ECS offers for deployments. But it requires Argo Rollouts installed, the ALB controller configured, and someone who understands how canary routing works.
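To see what that step list does over time, the sketch below walks the same steps and records when each traffic weight takes effect, assuming every pause runs its full duration and nothing aborts. After the last step the rollout is fully promoted, which I model as a final weight of 100. The helper name is my own.

```python
STEPS = [
    ("setWeight", 20), ("pause", 5),
    ("setWeight", 40), ("pause", 5),
    ("setWeight", 60), ("pause", 5),
    ("setWeight", 80), ("pause", 5),
]


def canary_timeline(steps):
    """Return [(minutes_elapsed, canary_weight_percent), ...]."""
    elapsed = 0
    timeline = []
    for kind, value in steps:
        if kind == "setWeight":
            timeline.append((elapsed, value))
        elif kind == "pause":
            elapsed += value
    # Once all steps complete, the new version takes all traffic.
    timeline.append((elapsed, 100))
    return timeline


for minute, weight in canary_timeline(STEPS):
    print(f"t+{minute}m: {weight}% canary")
```

A full rollout with these steps takes about 20 minutes of pauses, which is worth knowing before you wire it into a pipeline with its own timeout.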
Verdict on deployments: ECS is simpler with sensible defaults and automatic rollback built in. EKS is more powerful with proper tooling but requires more setup and expertise.
IAM and Security
ECS
ECS uses task-level IAM roles, which is clean and intuitive. Each task gets a role, that role has permissions, done.
resource "aws_iam_role" "ecs_task" {
name = "${var.project_name}-ecs-task-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "ecs-tasks.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy" "ecs_task_permissions" {
name = "task-permissions"
role = aws_iam_role.ecs_task.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["s3:GetObject", "s3:PutObject"]
Resource = "${aws_s3_bucket.app.arn}/*"
},
{
Effect = "Allow"
Action = ["sqs:SendMessage", "sqs:ReceiveMessage", "sqs:DeleteMessage"]
Resource = aws_sqs_queue.app.arn
}
]
})
}
The task role is bound to the task definition, so every task that runs from that definition gets those permissions. Simple and auditable.
EKS
EKS uses IRSA (IAM Roles for Service Accounts) to bind IAM permissions to Kubernetes service accounts. It's more flexible and pod-level, but the setup is more involved.
# IAM role with trust policy for the specific service account
resource "aws_iam_role" "app_service_account" {
name = "${var.project_name}-app-sa-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Federated = aws_iam_openid_connect_provider.eks.arn
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub" = "system:serviceaccount:production:app-service-account"
"${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:aud" = "sts.amazonaws.com"
}
}
}]
})
}
resource "aws_iam_role_policy" "app_permissions" {
name = "app-permissions"
role = aws_iam_role.app_service_account.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["s3:GetObject", "s3:PutObject"]
Resource = "${aws_s3_bucket.app.arn}/*"
}
]
})
}
# kubernetes/service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service-account
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/my-project-app-sa-role
# kubernetes/deployment.yaml
spec:
  template:
    spec:
      serviceAccountName: app-service-account # Links to the IAM role
      containers:
        - name: app
          # ...
IRSA is secure and granular — individual pods get individual IAM roles. But it's also more moving parts. The OIDC provider, the trust policy with the exact service account reference, the Kubernetes service account with the annotation, and the deployment referencing the service account. All four pieces have to be correct for it to work.
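Most IRSA failures I would expect here come down to a mismatched string in the trust policy. The sketch below builds the same condition block as the Terraform above so you can compare it with what is deployed. The issuer URL is a made-up example and the function name is hypothetical.

```python
def irsa_trust_condition(issuer_url, namespace, service_account):
    """Build the StringEquals condition used in an IRSA trust policy."""
    # Condition keys use the issuer without its scheme, as in the Terraform replace().
    issuer = issuer_url.replace("https://", "", 1)
    return {
        "StringEquals": {
            f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
            f"{issuer}:aud": "sts.amazonaws.com",
        }
    }


condition = irsa_trust_condition(
    "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",  # hypothetical issuer
    "production",
    "app-service-account",
)
print(condition)
```

The `sub` value must match the namespace and service account name exactly, so renaming either one in Kubernetes silently breaks the role assumption.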
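Verdict on IAM: ECS is simpler. EKS gives comparable per-workload granularity (a role per service account, and so per pod, versus a role per task definition) at the cost of more moving parts. Both are secure when used correctly.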
Observability
ECS
CloudWatch is the native home for ECS logs and metrics. Container Insights gives you CPU, memory, network, and storage metrics per task. Log routing is configured in the task definition.
resource "aws_cloudwatch_log_group" "app" {
name = "/ecs/${var.project_name}-${var.environment}"
retention_in_days = 30
}
For ECS on Fargate, FireLens is the way to route logs to multiple destinations (Datadog, Splunk, S3) without changing your application code:
{
"name": "log_router",
"image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
"essential": true,
"firelensConfiguration": {
"type": "fluentbit"
}
}
EKS
EKS ships logs and metrics to CloudWatch via the CloudWatch agent and Fluent Bit, but this requires setup. The AWS-managed add-on for CloudWatch observability simplifies this:
aws eks create-addon \
--cluster-name ${CLUSTER_NAME} \
--addon-name amazon-cloudwatch-observability \
--service-account-role-arn ${CLOUDWATCH_ROLE_ARN}
For application logs, you deploy Fluent Bit as a DaemonSet that collects from all nodes:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: fluent-bit
  template:
    metadata:
      labels:
        name: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: fluent-bit
          image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
          env:
            - name: AWS_REGION
              value: us-east-1
            - name: CLUSTER_NAME
              value: my-cluster
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
The Kubernetes-native observability story with Prometheus + Grafana is powerful and widely adopted. If your team already operates a Prometheus/Grafana stack or uses a platform like Datadog, EKS integrates naturally.
# Service Monitor for Prometheus scraping
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
Verdict on observability: ECS integrates more naturally with CloudWatch. EKS has a richer ecosystem but requires more setup. If your company is already invested in Prometheus/Grafana, EKS wins this one.
When I'd Actually Choose Each One
Here's where I give you the real opinion instead of the "it depends" cop-out.
Choose ECS When:
Your team is primarily application engineers, not platform engineers. ECS doesn't require deep operational expertise. A team of four backend engineers can own an ECS-based infrastructure without a dedicated DevOps/platform role. EKS cannot make that claim.
You're building on AWS and want tight native integration. ECS is an AWS service. IAM, ALB, CloudWatch, Secrets Manager, Service Connect — all integrate natively without additional tooling. If you're all-in on AWS (and most startups are), this matters.
You have fewer than 50 services and don't need advanced traffic management. Below this threshold, the operational overhead of EKS rarely pays off. You'll spend more time managing the cluster than you save from its flexibility.
Fargate's simplicity is genuinely valuable to you. No node management, no AMI updates, no node security patching. For teams that don't want to think about compute, Fargate is remarkable. You describe what your task needs, it runs.
You need to move fast. A working ECS environment with CI/CD can be set up in a day. EKS done properly takes a week the first time. When time-to-production matters, ECS has a clear advantage.
Choose EKS When:
You have multi-cloud requirements or may need to migrate off AWS. Kubernetes is portable. Your application manifests, your Helm charts, your operational knowledge — it all works on GKE, AKS, or self-managed Kubernetes. ECS knowledge doesn't transfer. If there's any chance you'll need to run workloads outside AWS, this is a significant factor.
You have specialized workloads that need Kubernetes-specific features. Examples include GPU workloads for ML inference, jobs that rely heavily on init containers and sidecars, workloads that need custom schedulers, and anything that benefits from the Kubernetes extension ecosystem (custom operators, CRDs). EKS handles these; ECS handles them less gracefully or not at all.
You have a large team with dedicated platform engineers. If you have people whose job is operating the container platform, EKS's complexity becomes manageable and its power becomes accessible. A 3-engineer platform team can run EKS well. A 1-person DevOps team probably can't give it the attention it needs while keeping everything else running.
Cost optimization at scale matters. At 100+ pods, the combination of EC2 spot instances and Karpenter's bin-packing can deliver meaningful savings over Fargate. The math becomes compelling at scale in a way it doesn't for smaller deployments.
You're already using Kubernetes elsewhere. If your team runs EKS and you're adding a new service, it goes on EKS. The operational patterns are established, the monitoring is set up, the oncall runbook exists. Don't introduce ECS complexity just to avoid adding another EKS service.
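To sanity-check the cost-at-scale point above against your own numbers, a back-of-the-envelope comparison might look like the sketch below. Everything in it is an assumption for illustration: the `Prices` fields, the function names, and the 80% packing efficiency are hypothetical, and no real AWS price is baked in. Plug in current prices for your region and instance type.

```python
import math
from dataclasses import dataclass


@dataclass
class Prices:
    """Hourly prices in USD. Placeholders only: supply real numbers yourself."""
    fargate_per_vcpu: float   # Fargate price per vCPU-hour
    fargate_per_gb: float     # Fargate price per GB-hour of memory
    node: float               # one EC2 worker node (Spot or On-Demand)
    control_plane: float      # fixed EKS control plane fee per cluster


def fargate_hourly(pods: int, vcpu: float, mem_gb: float, prices: Prices) -> float:
    """Fargate bills every task for the vCPU and memory it requests."""
    return pods * (vcpu * prices.fargate_per_vcpu + mem_gb * prices.fargate_per_gb)


def eks_ec2_hourly(pods: int, vcpu: float, mem_gb: float, prices: Prices,
                   node_vcpu: float, node_mem_gb: float,
                   packing: float = 0.8) -> float:
    """Nodes needed after bin-packing, plus the fixed control plane fee.

    `packing` is the assumed fraction of each node that pods can actually use
    (the rest goes to system daemons and fragmentation).
    """
    nodes_for_cpu = math.ceil(pods * vcpu / (node_vcpu * packing))
    nodes_for_mem = math.ceil(pods * mem_gb / (node_mem_gb * packing))
    nodes = max(nodes_for_cpu, nodes_for_mem, 1)
    return nodes * prices.node + prices.control_plane


if __name__ == "__main__":
    # Made-up round numbers, chosen only to show the shape of the comparison.
    demo = Prices(fargate_per_vcpu=0.04, fargate_per_gb=0.004,
                  node=0.05, control_plane=0.10)
    for pods in (5, 100):
        f = fargate_hourly(pods, 0.5, 1.0, demo)
        e = eks_ec2_hourly(pods, 0.5, 1.0, demo, node_vcpu=4, node_mem_gb=16)
        print(f"{pods} pods: Fargate ${f:.2f}/h vs EKS on EC2 ${e:.2f}/h")
```

The fixed control plane fee and the minimum of one node are what make small deployments look worse on EKS in this model; per-pod Fargate pricing is what makes large ones look worse on Fargate. The sketch ignores Spot interruptions, data transfer, and the engineering time spent running the cluster, which is often the larger cost.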
The Hard Truths Nobody Puts in Comparison Posts
EKS oncall is harder. When a production incident happens at 2am on EKS, you're debugging Kubernetes. You need to understand pod states, node conditions, CNI issues, RBAC errors, resource quota exhaustion, and admission webhook failures in addition to your actual application. ECS incidents are usually simpler: bad task definition, failing health check, insufficient IAM permissions. The debugging surface is smaller.
ECS has an upgrade story that doesn't wake you up at night. EKS clusters need a Kubernetes version upgrade roughly every 14 months, because that is about how long AWS keeps each minor version in standard support; staying on a version past that point means paying for extended support. Node groups need upgrading separately from the control plane. Add-ons need upgrading separately from nodes. Each upgrade is a project. On Fargate, ECS manages runtime upgrades for you (on the EC2 launch type you still patch the AMI and the ECS agent yourself). It's not zero effort, but it's dramatically less.
Kubernetes expertise is widely available; ECS expertise is less so. If you're hiring, more engineers know Kubernetes than ECS. This cuts both ways: it's easier to hire for EKS, but it also means your engineers may resist ECS or want to migrate away from it.
ECS doesn't have a great multi-tenant story. If you're building a platform that hosts multiple teams' workloads and you need namespace-level isolation, RBAC, resource quotas per team — Kubernetes does this natively. ECS doesn't have a clean equivalent. You end up using separate clusters or complex IAM setups to achieve similar isolation.
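On the upgrade point above: one way to keep the EKS upgrade clock visible is to compute how many days a cluster version has left and put that number on a dashboard. The sketch below is only an illustration. The function names are hypothetical, the 14-month default comes from the upgrade paragraph above, and the dates in the demo are invented; take real release dates from the AWS EKS release calendar.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return `start` shifted by `months`, clamping to the last day of the month."""
    total = start.month - 1 + months
    year = start.year + total // 12
    month = total % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def support_end(eks_release: date, support_months: int = 14) -> date:
    """Approximate end of support for a version released on EKS at `eks_release`."""
    return add_months(eks_release, support_months)


def days_left(eks_release: date, today: date, support_months: int = 14) -> int:
    """Days remaining before the version leaves support (negative if overdue)."""
    return (support_end(eks_release, support_months) - today).days


if __name__ == "__main__":
    # Invented dates, for illustration only.
    released = date(2025, 1, 15)
    print("support ends:", support_end(released))
    print("days left:", days_left(released, today=date(2026, 3, 1)))
```

Alerting when `days_left` drops below the time your last upgrade actually took is a reasonable starting threshold, since control plane, node groups, and add-ons each need their own pass.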
The Honest Recommendation
If you're starting fresh in 2026 with a team of fewer than 15 engineers and no specific requirements driving you toward Kubernetes:
Start with ECS.
You'll ship faster. Your oncall will be less complex. The AWS integration is better. If and when you outgrow it, migration to EKS is a well-understood project, not a crisis.
If you're starting fresh with a large engineering organization, existing Kubernetes knowledge, multi-cloud requirements, or complex workload requirements (GPU, custom schedulers, extensive sidecar patterns):
Start with EKS.
But staff it properly. An EKS cluster that nobody fully understands is worse than an ECS setup that everyone can operate. Kubernetes complexity in the wrong hands creates incidents. I've seen it happen at Company 2, and it's entirely avoidable.
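If it helps to see the recommendation as a checklist, it can be written down as a small function. This is a loose encoding of the signals in this post, not a formula: the `Context` fields, the `recommend` name, and the exact thresholds (50 services, a 3-person platform team) are my reading of the numbers mentioned earlier and are assumptions, so adjust them to your situation.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    services: int
    platform_engineers: int = 0          # people whose job is the container platform
    needs_multi_cloud: bool = False
    needs_k8s_features: bool = False     # GPUs, custom schedulers, operators/CRDs
    already_runs_kubernetes: bool = False


@dataclass
class Recommendation:
    platform: str
    reasons: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def recommend(ctx: Context) -> Recommendation:
    """Default to ECS; switch to EKS only when there is a concrete reason."""
    reasons = []
    if ctx.needs_multi_cloud:
        reasons.append("multi-cloud or off-AWS portability")
    if ctx.needs_k8s_features:
        reasons.append("workloads that need Kubernetes-specific features")
    if ctx.already_runs_kubernetes:
        reasons.append("Kubernetes patterns and runbooks already exist")
    if ctx.services >= 50 and ctx.platform_engineers >= 3:
        reasons.append("large service count with a staffed platform team")

    if not reasons:
        return Recommendation("ECS", ["no concrete reason to take on Kubernetes"])

    warnings = []
    if ctx.platform_engineers == 0:
        warnings.append("EKS without dedicated platform ownership: staff it first")
    return Recommendation("EKS", reasons, warnings)


if __name__ == "__main__":
    print(recommend(Context(services=12)))
    print(recommend(Context(services=20, needs_multi_cloud=True)))
```

The useful part is the shape rather than the thresholds: ECS is the default branch, every path to EKS has to name a reason, and choosing EKS without anyone to own it produces a warning instead of a silent yes.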
Quick Reference
| Factor | ECS | EKS |
|---|---|---|
| Initial setup time | Hours | Days |
| Operational complexity | Low | High |
| AWS integration | Native | Via add-ons |
| Multi-cloud portability | No | Yes |
| Cost (small scale) | Competitive | Higher (cluster overhead) |
| Cost (large scale) | Higher | Lower (spot + packing) |
| Fargate support | First-class | Supported |
| Blue/Green deployments | Via CodeDeploy | Via Argo Rollouts |
| Service mesh | Service Connect (App Mesh is being retired) | Istio, Linkerd, Cilium |
| Advanced scheduling | Limited | Full Kubernetes scheduler |
| Team expertise required | Moderate | High |
| Upgrade burden | Low (Fargate handles it) | High (nodes + control plane) |
Final Thought
The engineers who have the strongest opinions about ECS vs EKS are usually the ones who've only used one of them in production. The engineers who've used both tend to be more pragmatic — they reach for whichever tool fits the problem.
Both ECS and EKS are mature, well-supported platforms. Both can run production workloads reliably. The choice is about operational tradeoffs, team capabilities, and what your workloads actually need — not about which one is objectively better.
Use the simpler one until you have a concrete reason not to.
I'm curious what others have found in practice — especially if you've migrated from one to the other. Drop a comment with what surprised you most. Migrations usually surface the tradeoffs that no comparison post captures.