Building a Production-Grade Observability Platform for the Anvila API with LGTM, SLOs, DORA Metrics, and Game Day Testing
Adeolu · 2026-05-19 · via DEV Community

Introduction

For the HNG DevOps Stage 6 task, our team built a production-grade observability and reliability platform for the Anvila API.

The goal was not just to check whether a server was up or down. We needed to build a monitoring system that could help a team understand the health of an application from different angles:

  • Is the application available?
  • Is it fast enough for users?
  • Are requests failing?
  • Is the server under pressure?
  • Are deployments healthy?
  • Can engineers investigate logs, metrics, and traces from one place?
  • Can alerts reach the team with enough context to act quickly?

To achieve this, we used the LGTM observability stack, with Prometheus filling the metrics role that Mimir holds in the original acronym:

  • Loki for logs
  • Grafana for dashboards
  • Tempo for traces
  • Prometheus for metrics

We also added Alertmanager for alert routing, Node Exporter for server metrics, Blackbox Exporter for uptime probing, OpenTelemetry Collector for logs and traces, and a custom GitHub Actions DORA exporter.

[Prometheus targets all up]

Architecture Overview

Our architecture uses a separate monitoring server. The Anvila API continues to run on its application server, while the monitoring stack runs on a dedicated AWS EC2 instance.

The monitoring server collects:

  • system metrics from Node Exporter
  • uptime and response time metrics from Blackbox Exporter
  • application metrics from the FastAPI /metrics endpoint
  • logs from the app server through OpenTelemetry Collector and Loki
  • traces from the instrumented FastAPI staging service through OpenTelemetry and Tempo
  • CI/CD metrics from GitHub Actions through a custom DORA exporter

Grafana connects all these data sources together so that we can view metrics, logs, and traces in one place.

[LGTM dashboards list]

At a high level, the data flow is:

Anvila API -> OpenTelemetry Collector -> Loki / Tempo
Anvila API /metrics -> Prometheus
Node Exporter -> Prometheus
Blackbox Exporter -> Prometheus
GitHub Actions exporter -> Prometheus
Prometheus / Loki / Tempo -> Grafana
Prometheus alerts -> Alertmanager -> Slack


Why We Chose LGTM Over Managed Alternatives

Managed observability platforms are useful because they reduce operational work. However, for this task, we chose LGTM because we needed to understand and control the full observability pipeline.

LGTM gave us:

  • full control over metrics, logs, and traces
  • version-controlled configuration
  • no dependency on a paid managed observability provider
  • a better learning experience for how observability systems actually work
  • the ability to provision dashboards, alert rules, and data sources as code

Prometheus is strong for metrics and alerting. Loki is useful for storing logs in a way that works well with Grafana. Tempo stores distributed traces and helps us understand request flow. Grafana brings all three together in one interface.

This made LGTM a good fit for a platform engineering task where the goal was to build the stack from the ground up.

Infrastructure as Code and Systemd Deployment

One important requirement was that the stack should be reproducible. We used Terraform to provision the monitoring EC2 server and upload the configuration files.

The services are installed and managed with systemd instead of Docker. This was important because the task update said we should not install Docker on the server.

The stack includes systemd services for:

  • Prometheus
  • Grafana
  • Loki
  • Tempo
  • Alertmanager
  • Node Exporter
  • Blackbox Exporter
  • OpenTelemetry Collector
  • DORA exporter

The repository contains:

  • Terraform files
  • Prometheus configuration
  • Alertmanager configuration
  • Grafana dashboard JSON files
  • Loki and Tempo configuration
  • OpenTelemetry Collector configuration
  • alert rules
  • runbooks
  • SLI/SLO documentation
  • Game Day documentation

[Alert rules YAML]

Metrics Collection

Metrics are numerical measurements that tell us what is happening in the system.

We collect several types of metrics:

Server Metrics

Node Exporter gives us system-level metrics such as:

  • CPU usage
  • memory usage
  • disk usage
  • disk I/O
  • network I/O
  • system load

These metrics help us understand saturation, which means how full or stressed the system is.

[Node Exporter dashboard]

Uptime and HTTP Probe Metrics

Blackbox Exporter probes both the staging and production API URLs.

It checks:

  • whether the endpoint is reachable
  • whether it returns a successful HTTP response
  • how long the response takes
  • SSL certificate expiry

This is important because it measures the service from the outside, closer to how a user or client would experience it.

[Blackbox dashboard]

Application Metrics

We also added app-level Prometheus metrics to the FastAPI staging application through a /metrics endpoint.

This gives Prometheus real application request metrics such as:

  • http_requests_total
  • http_request_duration_seconds
  • http_requests_in_progress

These metrics help us track request volume, request latency, and request status codes from inside the application itself.
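To make the /metrics endpoint less abstract, here is a minimal sketch of what a counter such as http_requests_total looks like once it is rendered in the Prometheus text exposition format. It is an illustration only: a real FastAPI service would normally rely on a Prometheus client library, and the RequestCounter class and the method label are assumptions, not taken from the Anvila code.

```python
from collections import Counter


class RequestCounter:
    """Tiny in-memory stand-in for a Prometheus counter (illustration only)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def observe(self, method: str, path: str, status: int) -> None:
        # Each unique label combination becomes its own time series.
        self._counts[(method, path, str(status))] += 1

    def render(self) -> str:
        """Return the counter in the Prometheus text exposition format."""
        lines = [
            "# HELP http_requests_total Total HTTP requests handled.",
            "# TYPE http_requests_total counter",
        ]
        for (method, path, status), value in sorted(self._counts.items()):
            labels = f'method="{method}",path="{path}",status="{status}"'
            lines.append(f"http_requests_total{{{labels}}} {value}")
        return "\n".join(lines) + "\n"


counter = RequestCounter()
counter.observe("GET", "/", 200)
counter.observe("GET", "/", 200)
counter.observe("GET", "/", 500)
print(counter.render())
```

Prometheus scrapes this text on every interval and stores one series per label combination, which is why the label set has to stay small and bounded.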

[App metrics target up]

[App request metrics]

[App latency histogram]

Logs and Traces

Metrics tell us that something is wrong. Logs and traces help us understand why.

We used OpenTelemetry Collector to collect application logs and send them to Loki. In Grafana, we can search these logs using Loki queries.

[Loki logs]

We also instrumented the staging FastAPI service with OpenTelemetry so that it sends traces to Tempo.

A trace shows the path of a request through the application. This is very useful when investigating slow requests, because it helps identify which endpoint or operation caused the delay.
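The idea of finding the operation that caused the delay can be sketched in a few lines. The span records below are hypothetical and much simpler than what Tempo stores. The sketch also assumes that child spans run one after another, so a span's own time is its duration minus the time spent in its direct children.

```python
from collections import defaultdict

# Hypothetical spans of a single trace; names and durations are made up.
SPANS = [
    {"span_id": "a", "parent_id": None, "name": "GET /", "duration_s": 0.50},
    {"span_id": "b", "parent_id": "a", "name": "db.query", "duration_s": 0.40},
    {"span_id": "c", "parent_id": "a", "name": "render", "duration_s": 0.05},
]


def slowest_operation(spans: list[dict]) -> str:
    """Return the name of the span with the largest self time."""
    child_total = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["duration_s"]

    def self_time(span: dict) -> float:
        # Time spent in the span itself, excluding its direct children.
        return span["duration_s"] - child_total[span["span_id"]]

    return max(spans, key=self_time)["name"]


print(slowest_operation(SPANS))
```

In Grafana the same conclusion comes from reading the trace waterfall; the code only spells out the reasoning.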

[Tempo traces]

The Unified Observability dashboard combines metrics, logs, and traces. The idea is that if latency or error rate increases, an engineer can look at the metric, check related logs in Loki, and then inspect traces in Tempo.

[Unified observability dashboard]

The Four Golden Signals

The Four Golden Signals are a simple way to define service reliability from a user-focused perspective.

They are:

  1. Latency
  2. Traffic
  3. Errors
  4. Saturation

Latency

Latency means how long it takes the service to respond.

For Anvila, we track latency using both Blackbox response time and application request duration metrics.

Example PromQL:

histogram_quantile(
  0.95,
  sum by (le, path) (
    rate(http_request_duration_seconds_bucket{job="anvila-api-staging",status!~"5.."}[5m])
  )
)


This gives us the 95th percentile request latency per path over the last 5 minutes, excluding requests that ended in a 5xx response.
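Because the percentile is estimated from histogram buckets rather than from raw request timings, it helps to see the interpolation spelled out. The following is a simplified sketch of the estimation that histogram_quantile performs on cumulative buckets. The bucket boundaries and counts are invented for illustration, and edge cases handled by Prometheus itself (such as negative bounds or native histograms) are left out.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) buckets.

    The bucket list must include the +Inf bucket (math.inf).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket, so the best
                # available answer is the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the matching bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Invented example: 100 requests spread over four buckets (seconds).
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

The result depends on where the bucket boundaries sit, so a p95 close to an SLO threshold is only as precise as the buckets around that threshold.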

Traffic

Traffic means how much demand the service is receiving.

Example PromQL:

sum(rate(http_requests_total{job="anvila-api-staging"}[5m]))


This tells us the average number of requests per second over the last 5 minutes.

Errors

Errors measure how many requests are failing.

Example PromQL:

sum(rate(http_requests_total{job="anvila-api-staging",status=~"5.."}[5m]))
/
clamp_min(sum(rate(http_requests_total{job="anvila-api-staging"}[5m])), 1)


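This gives us the fraction of requests that returned a 5xx status. The clamp_min(..., 1) guard prevents division by zero, but it also understates the ratio whenever total traffic is below 1 request per second, so a smaller floor is safer for a low-traffic service.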

Saturation

Saturation shows how full the system is.

For example, CPU usage:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
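The CPU query can look opaque at first, so here is a simplified sketch of the arithmetic behind it. node_cpu_seconds_total{mode="idle"} is a counter of idle seconds per CPU core, and usage is whatever fraction of the interval was not idle. The sample values are invented, and the real rate() function also handles counter resets and extrapolation, which this sketch ignores.

```python
def cpu_usage_percent(
    first: dict[str, float], second: dict[str, float], interval_seconds: float
) -> float:
    """Compute busy CPU percentage from two samples of per-core idle counters."""
    # Idle seconds gained per second of wall time, per core (0.0 to 1.0).
    idle_rates = [
        (second[cpu] - first[cpu]) / interval_seconds for cpu in first
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Invented samples taken 300 seconds (5 minutes) apart on a 2-core host.
sample_t0 = {"cpu0": 1000.0, "cpu1": 2000.0}
sample_t1 = {"cpu0": 1240.0, "cpu1": 2180.0}
print(round(cpu_usage_percent(sample_t0, sample_t1, 300), 1))
```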


This goes beyond simply checking if the server is alive. A server can be up but overloaded. The Four Golden Signals help us understand the actual quality of the service.

SLOs and Error Budgets

An SLI is a Service Level Indicator. It is a measurement.

An SLO is a Service Level Objective. It is the target we want to meet.

An error budget is the amount of unreliability we are allowed within a time window.

For Anvila, our main availability SLO is:

99.5% successful probes over 30 days


The error budget calculation is:

(1 - 0.995) * 30 days = 0.15 days = 3.6 hours


This means the service can be unavailable for about 3 hours and 36 minutes in a 30-day window before the error budget is fully consumed.
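The same arithmetic can be written as a small helper, which also makes it easy to see how much budget is left after some failures. This is a sketch under one stated assumption: a probe runs once per minute, which gives 43,200 probes in 30 days. The real probe interval is not stated here, so the probe counts below are illustrative.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Total allowed unavailability for the given SLO and window."""
    return window * (1 - slo)


def budget_remaining(slo: float, total_probes: int, failed_probes: int) -> float:
    """Fraction of the error budget still unspent (1.0 means untouched)."""
    allowed_failures = total_probes * (1 - slo)
    return 1 - failed_probes / allowed_failures


budget = error_budget(0.995, timedelta(days=30))
print(budget)  # total downtime allowed in the window

# Assumed: one probe per minute, 54 failed probes so far.
print(round(budget_remaining(0.995, 43_200, 54), 2))
```

A negative result from budget_remaining means the SLO has already been missed for the window.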

The SLO dashboard shows:

  • 30-day availability SLI
  • SLO target
  • error budget remaining
  • error budget time remaining
  • burn rate
  • 7-day and 30-day compliance history

[SLO and Error Budget dashboard]

Burn Rate Alerting

Burn rate tells us how quickly the service is consuming its error budget.

This is better than alerting on every individual failure because it reduces alert fatigue.

For example, a brief, minor failure may not need to wake everyone up. But if the service is burning through its error budget very quickly, the team should respond immediately.

We configured:

  • Fast Burn alert: critical
  • Slow Burn alert: warning

The Fast Burn alert means the system is consuming the error budget too quickly and needs immediate action.

The Slow Burn alert means the system may become a serious incident if the trend continues.
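To make the two severities concrete, the sketch below computes burn rate as the observed error ratio divided by the error ratio the SLO allows. The thresholds 14.4 and 6 are values commonly cited for a 30-day window in Google's SRE Workbook. They are assumptions for illustration and may differ from the thresholds in the repository's rule files, as may the choice of windows.

```python
FAST_BURN_THRESHOLD = 14.4  # assumed, typically paired with a 1-hour window
SLOW_BURN_THRESHOLD = 6.0   # assumed, typically paired with a 6-hour window


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def severity(fast_window_ratio: float, slow_window_ratio: float, slo: float = 0.995) -> str:
    """Map error ratios from a short and a long window to an alert severity."""
    if burn_rate(fast_window_ratio, slo) >= FAST_BURN_THRESHOLD:
        return "critical"
    if burn_rate(slow_window_ratio, slo) >= SLOW_BURN_THRESHOLD:
        return "warning"
    return "ok"


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    """Days until the whole budget is gone if the burn rate stays constant."""
    return window_days / rate


print(severity(0.10, 0.10))   # 10% of probes failing in both windows
print(severity(0.01, 0.04))   # mild short-term, sustained long-term
print(severity(0.001, 0.002))
```

A burn rate of 1 means the budget would last exactly the full 30 days; a burn rate of 20 would exhaust it in about a day and a half.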

DORA Metrics

DORA metrics help connect engineering work to business outcomes.

The four DORA metrics are:

  • Deployment Frequency
  • Lead Time for Changes
  • Change Failure Rate
  • Mean Time to Restore

These metrics help answer questions such as:

  • How often do we deploy?
  • How long does it take for code to reach production?
  • How often do deployments fail?
  • How quickly can we recover from incidents?

We built a GitHub Actions DORA exporter that exposes deployment workflow data to Prometheus.

The dashboard shows:

  • deployment frequency
  • lead time
  • change failure rate
  • MTTR
  • deployment frequency classification

[DORA dashboard]

Our current DORA implementation measures lead time as the interval from commit to workflow completion. A future improvement would be to split it into finer sub-intervals using four timestamps: commit time, workflow trigger time, workflow completion time, and deployment confirmation time.
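As a rough illustration of how the four metrics can be derived from workflow data, here is a sketch that works on a list of run records. The WorkflowRun shape, the sample data, and the exact definitions (median lead time over successful runs, restore time measured from a failed run to the next successful one) are assumptions made for this example, not a description of the exporter in the repository.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class WorkflowRun:  # hypothetical record shape
    commit_at: datetime
    completed_at: datetime
    succeeded: bool


def deployment_frequency(runs, window_days):
    """Successful deployments per day over the window."""
    return sum(r.succeeded for r in runs) / window_days


def lead_time_minutes(runs):
    """Median commit-to-workflow-completion time of successful runs."""
    return median(
        (r.completed_at - r.commit_at).total_seconds() / 60
        for r in runs if r.succeeded
    )


def change_failure_rate(runs):
    return sum(not r.succeeded for r in runs) / len(runs)


def mean_time_to_restore_minutes(runs):
    """Average time from a failed run to the next successful run."""
    durations, failed_at = [], None
    for run in sorted(runs, key=lambda r: r.completed_at):
        if not run.succeeded and failed_at is None:
            failed_at = run.completed_at
        elif run.succeeded and failed_at is not None:
            durations.append((run.completed_at - failed_at).total_seconds() / 60)
            failed_at = None
    return sum(durations) / len(durations) if durations else None


RUNS = [  # invented sample data covering two days
    WorkflowRun(datetime(2026, 5, 1, 10, 0), datetime(2026, 5, 1, 10, 10), True),
    WorkflowRun(datetime(2026, 5, 1, 12, 0), datetime(2026, 5, 1, 12, 20), False),
    WorkflowRun(datetime(2026, 5, 1, 12, 40), datetime(2026, 5, 1, 13, 0), True),
    WorkflowRun(datetime(2026, 5, 2, 9, 0), datetime(2026, 5, 2, 9, 15), True),
]

print(deployment_frequency(RUNS, window_days=2))
print(lead_time_minutes(RUNS))
print(change_failure_rate(RUNS))
print(mean_time_to_restore_minutes(RUNS))
```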

Alerting and Slack Notifications

Alert rules are stored in version-controlled YAML files, not manually created in the Grafana UI.

We configured alerts for:

  • host downtime
  • high CPU
  • high memory
  • disk usage
  • SLO burn rate
  • change failure rate
  • MTTR threshold

Alertmanager handles routing and sends alerts to Slack.

The Alertmanager configuration includes:

  • route grouping by alert name, service, and severity
  • warning and critical routes
  • inhibition rules
  • Slack receiver configuration
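Grouping and inhibition are easier to reason about with a small model. The sketch below mimics both ideas on plain label dictionaries: alerts are grouped by alert name, service, and severity, and a warning is suppressed while a critical alert with the same alert name and instance is firing. The specific inhibition rule is a common pattern and an assumption here; the actual rules live in the Alertmanager configuration.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "service", "severity")

# Assumed rule: a firing critical alert mutes the matching warning.
INHIBIT_RULE = {
    "source": {"severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ("alertname", "instance"),
}


def matches(labels: dict, matchers: dict) -> bool:
    return all(labels.get(key) == value for key, value in matchers.items())


def is_inhibited(target: dict, firing: list[dict], rule: dict) -> bool:
    """True if some other firing alert suppresses the target alert."""
    if not matches(target, rule["target"]):
        return False
    return any(
        source is not target
        and matches(source, rule["source"])
        and all(source.get(l) == target.get(l) for l in rule["equal"])
        for source in firing
    )


def group_alerts(alerts: list[dict]) -> dict:
    """Bucket alerts so each group produces one notification."""
    groups = defaultdict(list)
    for labels in alerts:
        groups[tuple(labels.get(k) for k in GROUP_BY)].append(labels)
    return dict(groups)


firing = [  # invented alerts
    {"alertname": "HighCPU", "service": "anvila", "severity": "critical", "instance": "mon-1"},
    {"alertname": "HighCPU", "service": "anvila", "severity": "warning", "instance": "mon-1"},
    {"alertname": "HighCPU", "service": "anvila", "severity": "warning", "instance": "app-1"},
]
to_notify = [a for a in firing if not is_inhibited(a, firing, INHIBIT_RULE)]
print(len(to_notify), "alerts in", len(group_alerts(to_notify)), "groups")
```

The warning on mon-1 is dropped because the critical alert already covers it, while the warning on app-1 still goes out.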

[Alertmanager config]

We also used a structured Slack template. The alert payload includes:

  • alert name
  • severity
  • affected host or target
  • current value
  • dashboard link
  • runbook link
  • firing or resolved status
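The field list above maps naturally onto the status, labels, and annotations that Alertmanager attaches to each alert. The real template is written in Alertmanager's Go templating, so the Python sketch below only mirrors the layout of the message. The annotation keys value, dashboard_url, and runbook_url and the example.com links are hypothetical.

```python
def format_alert(alert: dict) -> str:
    """Render one alert as a multi-line chat message."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "firing").upper()
    lines = [
        f"[{status}] {labels.get('alertname', 'unknown')} ({labels.get('severity', 'none')})",
        f"Target: {labels.get('instance', 'unknown')}",
        f"Current value: {annotations.get('value', 'n/a')}",
        f"Dashboard: {annotations.get('dashboard_url', 'n/a')}",
        f"Runbook: {annotations.get('runbook_url', 'n/a')}",
    ]
    return "\n".join(lines)


example_alert = {  # invented payload
    "status": "firing",
    "labels": {"alertname": "HighCPU", "severity": "warning", "instance": "mon-1"},
    "annotations": {
        "value": "91%",
        "dashboard_url": "https://grafana.example.com/d/node",
        "runbook_url": "https://example.com/runbooks/high-cpu",
    },
}
print(format_alert(example_alert))
```

When the same alert later arrives with status "resolved", the first line changes and the rest of the context stays identical, which makes the firing and resolved messages easy to pair up.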

[Slack template]

Here is an example of a firing and resolved Slack notification:

[Slack firing and resolved alert]

Runbooks and Incident Management

Every alert needs a runbook. A runbook explains what the alert means and what to do first.

Our runbooks include:

  • what the alert means
  • likely causes
  • first investigation steps
  • how to resolve it
  • when to roll back
  • when to escalate

Example runbook:

[SLO burn rate runbook]

We also documented a blameless post-incident review for the simulated latency incident.

The goal of a blameless PIR is not to blame a person. The goal is to understand what happened, what the impact was, what went well, what did not go well, and what should be improved.

Game Day Testing

Game Day is where we intentionally create controlled failures to test whether our monitoring and response system works.

We ran three scenarios.

Scenario 1: Deployment Failure

We created a temporary PR with an intentionally failing GitHub Actions workflow.

The goal was to prove that a deployment or pipeline failure can be detected without touching production.

The PR was closed without merging after screenshots were captured.

[Game Day deployment failure PR]

[Game Day deployment failure check]

Scenario 2: Latency Injection

During the latency Game Day, we temporarily added a 2-second delay to the staging root endpoint and generated test requests. The evidence below shows the resulting Tempo traces and the recovery after the delay was removed.

[Game Day latency traces]

[Game Day latency recovery]

Scenario 3: Resource Pressure

We created CPU pressure on the monitoring server using controlled background processes.

This triggered a CPU warning alert in Slack. After stopping the pressure, we confirmed the resolved alert was also sent.
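One way to generate bounded CPU pressure is sketched below. The tool actually used during the Game Day is not named here, so the helpers burn_cpu and apply_pressure are hypothetical. The key property is that every worker stops on its own after a fixed duration, which keeps the experiment controlled.

```python
import multiprocessing
import time


def burn_cpu(seconds: float) -> int:
    """Spin in a tight loop until the deadline; returns loop iterations."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def apply_pressure(workers: int, seconds: float) -> None:
    """Run one busy loop per worker process and wait for all to finish."""
    processes = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(workers)
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
```

Calling apply_pressure(workers=multiprocessing.cpu_count(), seconds=600) from a script guarded by `if __name__ == "__main__":` would keep every core busy for ten minutes, which should be long enough for a rate-based CPU alert with a pending period to fire. The right duration depends on the alert rule's own window.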

[Game Day resource pressure trigger]

[Game Day resource pressure Slack alert]

[Game Day resource pressure recovery]

Toil Identified and Automation

Toil is repetitive manual work that can be automated.

We identified several examples of toil:

1. Manual Dashboard Creation

Creating dashboards manually in Grafana is slow and hard to reproduce.

Automation: All dashboards are stored as JSON and provisioned automatically.

2. Manual Monitoring Server Setup

Installing each service manually would be time-consuming and error-prone.

Automation: Terraform creates the monitoring server, and systemd installation scripts configure the observability stack.

3. Vague Alert Investigation

If an alert only says, "CPU high", the engineer still has to search for the server, dashboard, and runbook.

Automation: Alertmanager Slack messages include service, host, metric value, dashboard link, and runbook link.

4. Manual CI/CD Metrics Collection

Checking GitHub Actions manually does not scale.

Automation: The DORA exporter pulls GitHub Actions data and exposes it as Prometheus metrics.

What We Learned

This project helped us understand that observability is more than installing tools.

The tools are only useful when they are connected to reliability goals.

Prometheus helped us collect and alert on metrics. Loki helped us inspect logs. Tempo helped us understand request traces. Grafana helped us bring everything together. Alertmanager helped us route actionable alerts to Slack.

We also learned that good reliability engineering requires:

  • clear SLIs
  • realistic SLOs
  • error budgets
  • runbooks
  • incident reviews
  • controlled failure testing

Current Limitations and Next Steps

The platform is functional, but there are still improvements we can make.

Current limitations:

  • App-level /metrics is currently verified only on staging.
  • Production is monitored through Blackbox probes and shared host-level metrics.
  • DORA lead time is currently simplified to commit-to-workflow-completion.
  • A follow-up PR was opened to normalize unmatched route labels in Prometheus metrics and reduce high-cardinality risk.
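The cardinality risk mentioned in the last limitation comes from using raw request paths as label values: every scanner probing a random URL would create a new time series. A minimal sketch of the normalization idea follows. The route templates and the "__unmatched__" label value are hypothetical and not taken from the follow-up PR.

```python
# Hypothetical route templates known to the application.
ROUTE_TEMPLATES = ["/", "/health", "/items/{item_id}"]
UNMATCHED = "__unmatched__"


def _segment_matches(template_segment: str, segment: str) -> bool:
    is_placeholder = template_segment.startswith("{") and template_segment.endswith("}")
    if is_placeholder:
        return segment != ""
    return template_segment == segment


def normalize_path(path: str, templates: list[str] = ROUTE_TEMPLATES) -> str:
    """Map a raw request path to a bounded label value."""
    segments = path.strip("/").split("/")
    for template in templates:
        template_segments = template.strip("/").split("/")
        if len(template_segments) != len(segments):
            continue
        if all(_segment_matches(t, s) for t, s in zip(template_segments, segments)):
            return template
    # Anything the app does not route collapses into a single series.
    return UNMATCHED


for raw in ["/items/42", "/health", "/wp-login.php", "/items/42/extra"]:
    print(raw, "->", normalize_path(raw))
```

With this in place the path label can only take as many values as there are templates, plus one.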

Next steps:

  • promote app-level metrics to production after staging validation
  • merge the metrics label normalization improvement
  • add more detailed DORA lead-time sub-intervals
  • add deployment annotations to Grafana dashboards
  • continue reviewing SLOs as the service matures

Conclusion

For this task, we built a complete observability platform around the Anvila API using the LGTM stack.

We deployed the monitoring stack with Terraform and systemd, collected metrics, logs, and traces, built dashboards, configured Slack alerts, wrote runbooks, and tested the system through Game Day simulations.

The biggest lesson is that observability is not just about knowing when something is broken. It is about giving the team enough context to understand the problem, respond quickly, and improve the system after every incident.

Repository:

https://github.com/Adeolu1024/anvila-observability-platform


Evidence screenshots:

https://drive.google.com/drive/folders/1v_DiT6XQtq0iJtn5JqxfZecAmxhCdd82?usp=sharing

