Production Lab: ECS Fargate + Prometheus + Grafana + Loki + Alloy + Node Exporter
Aisalkyn Aid · 2026-05-25 · via DEV Community

Goal

You will build this architecture:

ECS Fargate Application
   |
   | metrics/logs
   v
Alloy sidecar
   |
   | remote_write metrics
   | push logs
   v
EC2 Monitoring Server
   - Prometheus :9090
   - Grafana    :3000
   - Loki       :3100
   - Alloy
   - Node Exporter


ECS Fargate tasks use a task execution role for ECS agent actions such as pulling images and shipping logs, and a task role for the AWS permissions the application itself needs. (AWS Documentation) Alloy can collect ECS/Fargate container metrics through the ECS Task Metadata Endpoint v4, and it should run as a sidecar inside the task. (Grafana Labs)


Part 1: What Each Tool Does

Tool | What it does | Why DevOps/SRE uses it
ECS | Runs containers on AWS | Deploy microservices
Fargate | Serverless container runtime | No EC2 patching/management
IAM Role | Gives permission securely | No hardcoded AWS keys
Prometheus | Stores metrics | CPU, memory, request rate, errors
Grafana | Visual dashboard | See health visually
Loki | Stores logs | Troubleshoot errors
Alloy | Collects metrics/logs/traces | Modern agent replacing many old agents
Node Exporter | Exposes EC2 Linux metrics | Monitor EC2 server health

Part 2: EC2 Monitoring Server Check

Your EC2 already has:

Prometheus
Grafana
Node Exporter
Loki
Alloy


Step 1: Check all services

Run on EC2:

sudo systemctl status prometheus
sudo systemctl status grafana-server
sudo systemctl status loki
sudo systemctl status alloy
sudo systemctl status node_exporter


Expected:

active (running)


Why we check this

Before we connect ECS, the central monitoring server must be healthy.

SRE/DevOps checks

DevOps checks:

sudo ss -tulnp | grep -E '3000|9090|9100|3100'


Expected ports:

3000 Grafana
9090 Prometheus
9100 Node Exporter
3100 Loki


SRE checks:

curl http://localhost:9090/-/ready
curl http://localhost:3100/ready
curl http://localhost:9100/metrics


Expected:

Prometheus ready
Loki ready
Node metrics visible
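
The same three checks can be scripted so they are easy to repeat. This is a minimal sketch, not part of the original lab: the function names are hypothetical, and it only treats an HTTP 200 as healthy. The URLs are the ones used in the curl commands above.

```python
from typing import Callable, Dict
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Endpoints taken from the curl checks in this step.
ENDPOINTS: Dict[str, str] = {
    "prometheus": "http://localhost:9090/-/ready",
    "loki": "http://localhost:3100/ready",
    "node_exporter": "http://localhost:9100/metrics",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if no response arrives."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # The server answered, but with an error status (for example 503).
        return err.code
    except (URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return 0


def check_services(endpoints: Dict[str, str],
                   fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True when its endpoint answers with 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Calling check_services(ENDPOINTS) on the EC2 server returns a dictionary of service name to True/False. The fetch parameter exists only so the logic can be exercised without a live server.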



Part 3: Fix Prometheus for Remote Write

Fargate tasks are dynamic: every new task gets a different private IP. So instead of Prometheus scraping every task IP, Alloy inside the Fargate task pushes metrics to Prometheus.

Step 2: Enable Prometheus remote write receiver

Create a systemd drop-in override for the Prometheus service (adjust the binary and data paths below to match your install):

sudo systemctl edit prometheus


Add:

[Service]
ExecStart=
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090 \
  --web.enable-lifecycle \
  --web.enable-remote-write-receiver


Restart:

sudo systemctl daemon-reload
sudo systemctl restart prometheus
sudo systemctl status prometheus


Test:

curl http://localhost:9090/-/ready


Why we do this

Fargate containers cannot easily be scraped by fixed IP because tasks start/stop dynamically. Remote write lets Alloy push metrics to Prometheus.
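
The readiness check does not prove the receiver flag is active. Prometheus reports its command-line flags at /api/v1/status/flags, so one way to confirm it is to inspect that response. A minimal sketch, assuming the response has the shape {"status": "success", "data": {flag: value}}; the function name is hypothetical.

```python
import json


def remote_write_receiver_enabled(flags_response: str) -> bool:
    """Inspect the JSON body of GET /api/v1/status/flags for the receiver flag."""
    payload = json.loads(flags_response)
    # Flag values are reported as strings, so compare against "true".
    flags = payload.get("data", {})
    return flags.get("web.enable-remote-write-receiver") == "true"
```

Feed it the body returned by curl http://localhost:9090/api/v1/status/flags on the EC2 server. A False result means the override from Step 2 was not picked up.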


Part 4: EC2 Security Group

In AWS Console:

Go to:

EC2 → Instances → Select monitoring EC2 → Security → Security Group


Add inbound rules:

Port | Source | Purpose
3000 | Your IP only | Grafana UI
9090 | VPC CIDR only | Prometheus remote write
3100 | VPC CIDR only | Loki logs
9100 | Your IP or VPC only | Node Exporter test only

Example VPC CIDR:

10.0.0.0/16


Do not open 9090, 3100, 9100 to 0.0.0.0/0.

Why we do this

Prometheus and Loki ship without authentication enabled, so anyone who can reach the port can read or write data. Keep them private.
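
The "never open to 0.0.0.0/0" rule is easy to check mechanically. A minimal sketch, assuming you have already exported the inbound rules as (port, source CIDR) pairs; the function name and the input format are hypothetical, not an AWS API.

```python
import ipaddress
from typing import Iterable, List, Tuple

# Ports from the table above that must never be reachable from anywhere.
PRIVATE_PORTS = {9090, 3100, 9100}


def world_open_rules(rules: Iterable[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Return the rules that expose a private port to every address."""
    flagged = []
    for port, source in rules:
        network = ipaddress.ip_network(source, strict=False)
        # A prefix length of 0 means 0.0.0.0/0 (IPv4) or ::/0 (IPv6).
        if port in PRIVATE_PORTS and network.prefixlen == 0:
            flagged.append((port, source))
    return flagged
```

An empty result means none of the three ports is open to the whole internet. It does not check that the remaining sources are as narrow as the table recommends.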


Part 5: Configure Alloy on EC2

Open:

sudo nano /etc/alloy/config.alloy


Use this:

prometheus.exporter.unix "local_host" {
  set_collectors = ["cpu", "meminfo", "diskstats", "filesystem", "netdev", "loadavg"]
}

prometheus.scrape "local_host" {
  targets    = prometheus.exporter.unix.local_host.targets
  forward_to = [prometheus.remote_write.local_prom.receiver]
}

prometheus.remote_write "local_prom" {
  endpoint {
    url = "http://127.0.0.1:9090/api/v1/write"
  }
}

loki.source.file "system_logs" {
  targets = [
    {__path__ = "/var/log/syslog", job = "syslog"},
    {__path__ = "/var/log/auth.log", job = "auth"},
    {__path__ = "/var/log/nginx/access.log", job = "nginx_access"},
    {__path__ = "/var/log/nginx/error.log", job = "nginx_error"},
  ]
  forward_to = [loki.write.local_loki.receiver]
}

loki.write "local_loki" {
  endpoint {
    url = "http://127.0.0.1:3100/loki/api/v1/push"
  }
}


Restart:

sudo alloy fmt --write /etc/alloy/config.alloy
sudo systemctl restart alloy
sudo systemctl status alloy


Important correction

The loopback address is 127.0.0.1, not 123.0.0.1. A mistyped 123.0.0.1 points at an unrelated public address, so the local push to Prometheus and Loki fails.


Part 6: Create ECS IAM Roles

Role 1: ECS Task Execution Role

AWS Console:

IAM → Roles → Create role → AWS service → Elastic Container Service → ECS Task


Attach:

AmazonECSTaskExecutionRolePolicy


Name:

ecsTaskExecutionRole


Why

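This allows ECS/Fargate to:

Pull images from ECR
Send logs to CloudWatch Logs
Read Secrets Manager secrets, once you attach an extra policy for them (the managed policy alone does not grant this)

Note: the task definition in Part 9 sets awslogs-create-group to true. Creating the log group needs logs:CreateLogGroup, which AmazonECSTaskExecutionRolePolicy does not include. Either add that permission to this role or create the log group /ecs/fargate-observability-lab beforehand.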

Role 2: ECS Task Role

Create another role:

IAM → Roles → Create role → ECS Task


Name:

ecsAppTaskRole


For this lab, start with no extra permissions.

If the app needs S3 later, grant only the specific S3 actions and buckets it uses.

Why

The task role is for your application containers, not for ECS itself.


Part 7: Create ECS Cluster

AWS Console:

ECS → Clusters → Create cluster


Choose:

AWS Fargate


Name:

prod-observability-cluster


Click:

Create


Why

A cluster is the logical grouping in which ECS services and tasks run.


Part 8: Create Simple Application Container

To keep the lab simple, use a demo app that exposes Prometheus metrics on port 8080.

Example image:

ghcr.io/brancz/prometheus-example-app:v0.5.0


It exposes:

/metrics


Port:

8080



Part 9: Create Fargate Task Definition

Go to:

ECS → Task Definitions → Create new task definition → Create new task definition with JSON


Use this template:

{
  "family": "fargate-observability-lab",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/ecsAppTaskRole",
  "containerDefinitions": [
    {
      "name": "demo-app",
      "image": "ghcr.io/brancz/prometheus-example-app:v0.5.0",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/fargate-observability-lab",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "demo-app",
          "awslogs-create-group": "true"
        }
      }
    },
    {
      "name": "alloy-sidecar",
      "image": "grafana/alloy:latest",
      "essential": false,
      "command": [
        "run",
        "--server.http.listen-addr=0.0.0.0:12345",
        "--stability.level=experimental",
        "/etc/alloy/fargate.alloy"
      ],
      "environment": [
        {
          "name": "ALLOY_STABILITY_LEVEL",
          "value": "experimental"
        },
        {
          "name": "EC2_PROMETHEUS_URL",
          "value": "http://<EC2_PRIVATE_IP>:9090/api/v1/write"
        },
        {
          "name": "EC2_LOKI_URL",
          "value": "http://<EC2_PRIVATE_IP>:3100/loki/api/v1/push"
        }
      ],
      "portMappings": [
        {
          "containerPort": 12345,
          "protocol": "tcp"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/fargate-observability-lab",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "alloy",
          "awslogs-create-group": "true"
        }
      }
    }
  ]
}


Replace:

<ACCOUNT_ID>
<EC2_PRIVATE_IP>
us-east-2 if your region is different
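
Hand-editing placeholders in a long JSON file is error-prone, so the substitution can be scripted. A minimal sketch, assuming the template above is available as a string; the function name is hypothetical, and the account ID and IP in the usage note are made-up example values.

```python
import json
import re


def render_task_definition(template: str, account_id: str,
                           ec2_private_ip: str,
                           region: str = "us-east-2") -> dict:
    """Fill the placeholders in the JSON template and return the parsed result."""
    rendered = (
        template.replace("<ACCOUNT_ID>", account_id)
        .replace("<EC2_PRIVATE_IP>", ec2_private_ip)
        .replace("us-east-2", region)
    )
    # Catch any <PLACEHOLDER> that was not substituted.
    leftover = re.findall(r"<[A-Z0-9_]+>", rendered)
    if leftover:
        raise ValueError(f"unfilled placeholders: {sorted(set(leftover))}")
    # json.loads raises if the edited template is no longer valid JSON.
    return json.loads(rendered)
```

For example, render_task_definition(template, "123456789012", "10.0.1.25") returns a dictionary you can dump back to JSON and paste into the console. A stray placeholder or a broken comma fails here instead of in ECS.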


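Important note

The stock grafana/alloy image does not contain /etc/alloy/fargate.alloy, so the sidecar needs the config from Part 10 supplied some other way. Options for a real production setup:

EFS
S3 pulled at startup
custom Alloy image

For a class or demo, a custom Alloy image is the easiest. Point the image field of alloy-sidecar at that image.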


Part 10: Alloy Fargate Config

Create file:

fargate.alloy


Content:

prometheus.scrape "app_metrics" {
  targets = [
    {"__address__" = "127.0.0.1:8080", "job" = "demo-app"}
  ]

  forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
}

otelcol.receiver.awsecscontainermetrics "fargate_metrics" {
  collection_interval = "30s"

  output {
    metrics = [otelcol.exporter.prometheus.fargate_to_prom.receiver]
  }
}

otelcol.exporter.prometheus "fargate_to_prom" {
  forward_to = [prometheus.remote_write.ec2_prometheus.receiver]
}

prometheus.remote_write "ec2_prometheus" {
  endpoint {
    url = env("EC2_PROMETHEUS_URL")
  }
}


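Why

This collects:

Application /metrics
Fargate task CPU
Fargate task memory
Container-level metrics

This config forwards metrics only. EC2_LOKI_URL is set in the task definition but is not referenced here, so container logs go to CloudWatch through the awslogs driver rather than to Loki.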


Part 11: Run ECS Service

Go to:

ECS → Clusters → prod-observability-cluster → Services → Create


Choose:

Launch type: Fargate
Task definition: fargate-observability-lab
Service name: demo-app-service
Desired tasks: 1


Networking:

VPC: same VPC as EC2 monitoring server
Subnets: private subnets preferred
Security group: allow outbound to EC2 private IP ports 9090 and 3100
Public IP: disabled if private subnet has NAT


Click:

Create


What to check

Go to:

ECS → Cluster → Service → Tasks


Expected:

Task status: Running
Containers: demo-app running, alloy-sidecar running



Part 12: Verify in Prometheus

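Open the Prometheus UI. Part 4 limits port 9090 to the VPC CIDR, so reach it through an SSH tunnel or temporarily allow your own IP on 9090: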

http://<EC2_PUBLIC_IP>:9090


Go to:

Status → TSDB Status


Then search in Graph:

up


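Check Alloy internal metrics (present only if Alloy also scrapes its own /metrics endpoint, which the configs in this lab do not set up):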

alloy_component_controller_running_components


Check EC2 CPU:

rate(node_cpu_seconds_total[5m])


Check EC2 memory:

node_memory_MemAvailable_bytes


Check app request metrics:

http_requests_total


Check Fargate container metrics:

ecs_task_memory_utilized


or:

container_memory_usage_bytes


Metric names may vary depending on Alloy/OpenTelemetry conversion.
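
The same queries can be run without the UI through the Prometheus HTTP API at /api/v1/query. A minimal sketch of building the request URL and reading an instant-vector response; the function names are hypothetical and only the "vector" result type is handled.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-vector response body into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each sample is {"metric": {...labels}, "value": [timestamp, "value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]
```

Fetch instant_query_url("http://localhost:9090", "up") with curl on the EC2 server and pass the body to parse_vector. An empty list for a metric name means no series with that name has arrived, which is the quickest way to find out what the Fargate metrics are actually called.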


Part 13: Verify in Grafana

Open:

http://<EC2_PUBLIC_IP>:3000


Go to:

Connections → Data sources


Add Prometheus:

URL: http://localhost:9090


Add Loki:

URL: http://localhost:3100


Click:

Save & test


Expected:

Data source is working



Part 14: Grafana Explore Queries

Go to:

Grafana → Explore → Prometheus


Use:

up


rate(node_network_receive_bytes_total[1m])


100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)


rate(http_requests_total[5m])


Go to:

Grafana → Explore → Loki


Use:

{job="syslog"}


{job="auth"}


{job="nginx_access"}


For ECS logs, first check CloudWatch logs:

CloudWatch → Log groups → /ecs/fargate-observability-lab



Part 15: What SRE Must Monitor

1. EC2 monitoring server health

100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)


Alert if:

Memory > 85%


Why:

If monitoring server dies, you lose visibility.


2. Disk usage

100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})


Alert if:

Disk > 80%


Why:

Prometheus and Loki can fill disk quickly.


3. Fargate task memory

ecs_task_memory_utilized / ecs_task_memory_reserved * 100


Alert if:

> 85% for 3 minutes


Why:

Fargate kills containers when memory limit is reached.
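
The "for 3 minutes" part matters: a single spike should not page anyone. A minimal sketch of that logic, assuming one sample every 30 seconds (the collection_interval from Part 10); the function names are hypothetical and this is a simplification of how an alerting rule with a duration behaves.

```python
from typing import Sequence


def memory_percent(utilized: float, reserved: float) -> float:
    """Same arithmetic as the query above: utilized / reserved * 100."""
    return utilized / reserved * 100


def sustained_breach(samples: Sequence[float], threshold: float = 85.0,
                     interval_s: int = 30, duration_s: int = 180) -> bool:
    """True when the newest samples stay above threshold for the whole window."""
    # Samples spaced interval_s apart need one extra point to span duration_s.
    needed = duration_s // interval_s + 1
    if len(samples) < needed:
        return False
    return all(value > threshold for value in samples[-needed:])
```

With the defaults, seven consecutive samples above 85 are required. One dip below the threshold inside the window resets the condition.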


4. Application request rate

sum(rate(http_requests_total[5m]))


Why:

If traffic drops to zero, app or routing may be broken.


5. Error rate

sum(rate(http_requests_total{code=~"5.."}[5m]))


Why:

5xx errors show application or dependency failure.
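
The query above gives 5xx responses per second, not a percentage. Dividing it by the total request rate from item 4 gives the share of failing requests, which is usually what you alert on. A worked sketch of the arithmetic with made-up counter values; the function names are hypothetical, and unlike the real rate() this ignores counter resets and extrapolation.

```python
def per_second_rate(first: float, last: float, window_s: float) -> float:
    """Average per-second increase of a counter across the window."""
    return (last - first) / window_s


def error_ratio(total_first: float, total_last: float,
                errors_first: float, errors_last: float,
                window_s: float = 300.0) -> float:
    """Share of requests that returned 5xx during the window (0.0 to 1.0)."""
    total = per_second_rate(total_first, total_last, window_s)
    errors = per_second_rate(errors_first, errors_last, window_s)
    # No traffic in the window means there is nothing to compute a ratio from.
    return errors / total if total > 0 else 0.0
```

If http_requests_total grows from 1000 to 1600 in five minutes while the 5xx series grows from 10 to 40, the request rate is 2 per second, the error rate is 0.1 per second, and the error ratio is 5%.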



Part 16: What DevOps Must Check

DevOps engineer checks:

1. IAM roles are correct
2. ECS task is running
3. Security groups allow only needed ports
4. Fargate can reach EC2 private IP
5. Prometheus remote write is enabled
6. Loki is receiving logs
7. Grafana data sources work
8. No public access to Prometheus/Loki/Node Exporter
9. ECS service has desired count = running count
10. CloudWatch logs exist for both containers



Part 17: Troubleshooting

Problem: ECS task running but no metrics

Check Alloy logs:

ECS → Task → alloy-sidecar → Logs


Look for:

connection refused
timeout
remote write failed


Common causes:

EC2 security group blocks port 9090
Wrong EC2 private IP
Prometheus remote write receiver not enabled
Alloy config error
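
The first two causes are plain network reachability, which can be tested without Alloy at all. A minimal sketch of a TCP connect check; the function name is hypothetical. Run it from something inside the same VPC and subnet as the task, since a check from your laptop says nothing about the task's network path.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

Call can_connect with the EC2 private IP and port 9090, then again with port 3100. A timeout usually points at a security group or routing problem, while an immediate failure usually means nothing is listening on that port.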


Problem: Grafana shows no Loki logs

Check:

curl http://localhost:3100/ready
sudo journalctl -u alloy -f
sudo journalctl -u loki -f


Common causes:

Loki not running
Wrong Loki URL
Alloy cannot read log files
No permissions on /var/log/*


Problem: Node Exporter works but Fargate metrics missing

Cause:

Node Exporter monitors EC2 only.
It cannot monitor Fargate hosts.


Correct approach:

Use Alloy sidecar with ECS container metrics receiver.



Final Teaching Summary

This lab demonstrates a real DevOps/SRE production pattern:

ECS Fargate runs application containers.
IAM secures container permissions.
Alloy collects telemetry.
Prometheus stores metrics.
Loki stores logs.
Grafana visualizes everything.
Node Exporter monitors the EC2 monitoring server.


The most important SRE mindset:

Metrics tell you what is happening.
Logs tell you why it happened.
Grafana helps you see the story.
IAM and security groups control who can access what.
