How We Built a CTO-Grade Grafana Dashboard With Codex
Alexander Schneider · 2026-05-31 · via DEV Community

A good dashboard is not a wall of charts. It is an answer to a question.

For us, the question was simple:

Is production healthy, and if not, where should we look first?

We already had the usual observability ingredients: application metrics, host metrics, container metrics, structured logs, traces, synthetic checks, and database telemetry. The hard part was not collecting more data. The hard part was turning that data into a dashboard that a technical leader could open during a normal day, a deploy, or an incident and understand the state of the system quickly.

Grafana gave us the observability platform. Codex helped us keep the dashboard honest, small, tested, and aligned with the codebase.

This is the playbook we ended up with.

Start With Decisions, Not Charts

The first version of almost every dashboard grows by accumulation:

  • CPU chart
  • memory chart
  • request rate chart
  • latency chart
  • error chart
  • database chart
  • worker chart
  • logs chart
  • another latency chart

That is useful for exploration, but it is not a good top-level operational view.

For our CTO dashboard, we started from decisions:

  • Is the public API reachable?
  • Are enough API nodes alive to serve traffic redundantly?
  • Are workers alive and producing useful data?
  • Is PostgreSQL healthy?
  • Are alerts firing?
  • Is observability itself working?
  • If something is slow, is it API, worker, database, logs, or tracing?

Only after writing down those questions did we choose panels.

That one change made the dashboard much smaller.

Separate Product Health From Observability Coverage

One of the most important design choices was splitting these two ideas:

  1. Product health: is the service working for users?
  2. Observability coverage: can we see the service clearly?

Those are related, but they are not the same.

If an observability agent stops reporting from one host, that is bad. But it should not automatically make the product look down. Likewise, an API outage should not be hidden because the metrics pipeline is still green.

So our top-level dashboard has separate areas for:

  • production health
  • observability coverage
  • API node readiness
  • worker health
  • database health
  • alerts and drill-down links

This makes incidents easier to reason about. A broken monitoring path is visible, but it does not masquerade as a product outage.

Manage Grafana Assets Like Code

The biggest quality jump came when we stopped treating Grafana as a browser-only editing surface.

Our dashboards, alert rules, synthetic checks, and collector configs live in git:

ops/grafana/
  dashboards/
  alerts/
  synthetics/
  alloy/

This gives us normal engineering controls:

  • pull requests
  • code review
  • CI checks
  • repeatable sync
  • history
  • rollback

Dashboard JSON is not pretty, but it is still production behavior. If it decides what engineers see during incidents, it deserves the same review discipline as application code.

Grafana's MCP Is Actually Useful

One thing that surprised us in a good way: Grafana's MCP integration was not a toy.

It was practical enough to help with real dashboard work:

  • finding dashboards and panels
  • inspecting datasource-backed queries
  • checking alerting and dashboard structure
  • moving between Grafana context and repository context
  • turning operational questions into concrete dashboard changes

That matters because AI assistance gets much better when it can inspect the actual observability system instead of guessing from exported JSON alone. The Grafana MCP made Codex feel connected to the running operational surface, while the repository still stayed the source of truth.

We also tried similar MCP-style integrations from other observability vendors. In our evaluation, Grafana's was the most useful and reliable for day-to-day engineering work. Some alternatives, including New Relic's, did not feel mature enough for our workflow yet, so I would not recommend adopting them as the primary AI-observability interface today.

That may change, but if I were setting this up again now, I would start with Grafana.

Use Grafana Alloy as the Collection Layer

Grafana Alloy became the edge collector for our production telemetry.

In broad terms, the pipeline looks like this:

application metrics
host metrics
container metrics
database metrics
structured logs
OTLP traces
synthetic checks
        |
        v
Grafana Alloy
        |
        v
Grafana Cloud: Prometheus, Loki, Tempo, dashboards, alerts

Alloy lets us keep the collection profile explicit:

  • scrape host and container metrics
  • collect textfile metrics for custom worker state
  • forward structured logs
  • receive OTLP traces
  • remote-write metrics
  • apply relabeling and cardinality controls before data leaves the host

The important lesson: treat the collector config as part of the product. It defines what you can and cannot see during an incident.

Metrics For SLIs, Logs And Traces For Investigation

We try not to build core service-level indicators from log parsing.

For top-level API health, metrics are the right source:

rate(api_requests_total[5m])
rate(api_request_duration_seconds_bucket[5m])

Logs are still critical, but they are better for answering:

  • Which request failed?
  • Which worker cycle failed?
  • Which trace ID connects API, worker, and database behavior?
  • What did the application say at the time?

Traces are the next step after metrics and logs. They answer:

  • Which route was slow?
  • Where did time go?
  • Did the database dominate the request?
  • Was the problem isolated to one worker or one dependency?

The top-level dashboard should point to those tools. It should not try to replace them.

Add Domain Metrics, Not Just Infrastructure Metrics

Infrastructure metrics tell you whether machines are alive.

They do not always tell you whether the business process is useful.

For worker systems, we added domain-specific metrics such as:

  • worker alive
  • heartbeat age
  • last cycle status
  • last cycle duration
  • items produced in the latest cycle
  • items observed over recent windows
  • consecutive failures

This matters because a worker can be technically alive and still produce no useful output.

That distinction changed the quality of our alerts. We could alert on:

the worker is alive, but it has produced no useful data for a while

That is much better than only alerting when the process is dead.
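To make the worker metrics above concrete, here is a minimal sketch of a metric writer that renders worker state in the Prometheus text exposition format and writes it where a textfile collector can pick it up. The metric names, the `worker_type` label, and the function names are illustrative assumptions, not the names from our production code.

```python
import os
import tempfile
import time


def render_worker_metrics(worker_type, alive, last_heartbeat_ts, last_cycle_ok,
                          last_cycle_seconds, items_last_cycle,
                          consecutive_failures, now=None):
    """Render one worker's state as Prometheus text exposition lines.

    All metric names here are hypothetical examples.
    """
    now = time.time() if now is None else now
    label = f'{{worker_type="{worker_type}"}}'
    samples = [
        ("worker_alive", int(alive)),
        ("worker_heartbeat_age_seconds", round(now - last_heartbeat_ts, 3)),
        ("worker_last_cycle_success", int(last_cycle_ok)),
        ("worker_last_cycle_duration_seconds", last_cycle_seconds),
        ("worker_items_last_cycle", items_last_cycle),
        ("worker_consecutive_failures", consecutive_failures),
    ]
    lines = []
    for name, value in samples:
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{label} {value}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write to a temp file in the target directory, then rename it,
    so the collector never reads a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as owner-only; widen it so a collector
    # running as a different user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

With metrics shaped like this, "alive but useless" becomes an ordinary alert condition: `worker_alive` is 1 while `worker_items_last_cycle` has stayed at 0 for longer than the worker's normal cadence.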

Be Ruthless About Metric Semantics

One small example: we had a non-scraper background worker showing up in a "mentions produced" panel with a value of zero.

Technically, the metric existed.

Operationally, it was noise.

The fix was not to hide the worker everywhere. Its status and duration still mattered. The correct fix was to exclude it only from panels where "mentions" was the business meaning.

That is the kind of dashboard maintenance Codex is very good at:

  1. inspect the dashboard JSON
  2. find the PromQL targets
  3. understand which panels use which metric
  4. make the smallest scoped change
  5. add a regression test

Small semantic fixes like this are what make dashboards feel trustworthy.
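Steps 1 to 3 of that list can be sketched in a few lines. The helper below walks a dashboard's panels, including panels nested inside collapsed rows, and returns the titles of panels whose query expressions mention a given metric. The function names are hypothetical, and the match is a plain substring check, so a metric name that is a prefix of another will over-match.

```python
def iter_panels(panels):
    """Yield every panel, including panels nested inside row panels."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def panels_using_metric(dashboard, metric_name):
    """Return titles of panels whose target expressions mention metric_name."""
    titles = []
    for panel in iter_panels(dashboard.get("panels", [])):
        exprs = [target.get("expr", "") for target in panel.get("targets", [])]
        if any(metric_name in expr for expr in exprs):
            titles.append(panel.get("title", ""))
    return titles
```

In a repository, `dashboard` would come from `json.load` on a file under `ops/grafana/dashboards/`. Knowing exactly which panels use a metric is what keeps the fix scoped to the panels where the business meaning applies.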

Test The Dashboard

Testing dashboards sounds strange until the first time a dashboard breaks during an incident.

We test things like:

  • dashboard JSON is valid
  • dashboard titles and UIDs stay stable
  • required panels exist
  • important PromQL expressions are still present
  • alert rule groups keep stable identifiers
  • datasource UIDs are correct
  • high-cardinality labels do not leak into metrics or logs
  • business panels exclude irrelevant worker types

The tests do not need to render the dashboard. They need to protect the contract.

Example contract:

assert "worker_type" in query
assert "analytics_rollup" not in mention_panel_series

In practice, the real tests are a little more robust than this, but the idea is simple: encode the intent.
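A slightly fuller sketch of such a contract check is below. It is not our real test suite: the dashboard UID, panel titles, and the `worker_items_last_cycle` metric are made-up examples, and the dashboard is inlined so the snippet runs on its own. Only `worker_type` and `analytics_rollup` come from the example above.

```python
import re

# Hypothetical, trimmed-down dashboard. A real test would json.load the
# dashboard file from the repository instead of inlining it.
DASHBOARD = {
    "uid": "cto-overview",
    "title": "CTO Overview",
    "panels": [
        {"title": "API Nodes Ready", "targets": [{"expr": 'sum(up{job="api"})'}]},
        {
            "title": "Mentions Produced",
            "targets": [
                {"expr": 'sum by (worker_type) '
                         '(worker_items_last_cycle{worker_type!="analytics_rollup"})'}
            ],
        },
    ],
}

EXPECTED_UID = "cto-overview"
REQUIRED_PANELS = ("API Nodes Ready", "Mentions Produced")


def panel_exprs(dashboard, title):
    """Return the query expressions of the named panel, or None if it is missing."""
    for panel in dashboard["panels"]:
        if panel.get("title") == title:
            return [t.get("expr", "") for t in panel.get("targets", [])]
    return None


def excludes_worker_type(expr, worker_type):
    """True if expr carries a worker_type != "<worker_type>" label matcher."""
    pattern = r'worker_type\s*!=\s*"' + re.escape(worker_type) + r'"'
    return re.search(pattern, expr) is not None


def contract_violations(dashboard):
    """Collect human-readable contract violations; an empty list means pass."""
    violations = []
    if dashboard.get("uid") != EXPECTED_UID:
        violations.append("dashboard UID changed")
    for title in REQUIRED_PANELS:
        if panel_exprs(dashboard, title) is None:
            violations.append(f"missing panel: {title}")
    for expr in panel_exprs(dashboard, "Mentions Produced") or []:
        if not excludes_worker_type(expr, "analytics_rollup"):
            violations.append("mention panel does not exclude analytics_rollup")
    return violations
```

A test runner then only needs `assert contract_violations(DASHBOARD) == []`. Note that this sketch recognizes only the `!=` matcher; a panel that excludes the worker with a regex matcher such as `!~` would need an extra case.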

Where Codex Helped Most

Codex was useful because it could work across the repo, not just inside one file.

A dashboard change often touches several places:

  • dashboard JSON
  • alert YAML
  • collector config
  • metric writer script
  • tests
  • docs
  • deployment sync logic

Codex could inspect those relationships and avoid local-only fixes that looked correct but broke the wider system.

The best workflow was:

  1. state the operational problem in plain language
  2. ask Codex to inspect the relevant repo paths
  3. require a small scoped diff
  4. require tests
  5. review the PromQL and dashboard semantics like production code

Codex is not a replacement for operational judgment. It is a very fast assistant for applying that judgment consistently.

The Dashboard Architecture We Recommend

For a production service, I would structure Grafana like this:

1. CTO Overview

The top-level dashboard should answer:

  • is production healthy?
  • are users affected?
  • are alerts firing?
  • are enough nodes serving traffic?
  • are workers producing useful output?
  • is the database healthy?
  • is observability coverage intact?

Keep this dashboard short.

2. Production Infrastructure

This is where you put:

  • host CPU and memory
  • container CPU and memory
  • disk usage
  • process freshness
  • container freshness
  • agent health

3. Database Performance

This is where you put:

  • connection usage
  • cache hit ratio
  • lock pressure
  • transaction rate
  • temp file churn
  • slow query groups
  • table pressure
  • top query fingerprints

4. Tracing

This is where you put:

  • accepted spans
  • failed exports
  • slow API traces
  • recent worker traces
  • database spans

5. Synthetic Monitoring

This is where you put:

  • public liveness
  • readiness
  • deeper health checks
  • multi-region latency

The overview links to the diagnostic dashboards. It does not try to become all of them.

Security And Cardinality Rules

A public article should say this plainly: observability can leak data if you are careless.

Our rules:

  • never put API keys in labels
  • never put customer identifiers in labels
  • never put raw URLs with query strings in labels
  • never put request IDs or trace IDs in Prometheus labels
  • keep secrets out of dashboard JSON
  • keep collector credentials outside git
  • use route templates instead of raw paths
  • prefer bounded enum labels
  • drop or avoid high-cardinality metrics before remote write

Logs and traces can contain richer context, but even there you should be deliberate. Query-time parsing is often safer than promoting every field to a label.
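Two of these rules are easy to enforce mechanically. The sketch below shows a fallback route normalizer and a small label linter. The forbidden label names and the `:id` placeholder are assumptions chosen for illustration; when your web framework exposes the matched route template, prefer that over normalizing raw paths after the fact.

```python
import re

# Hypothetical denylist; extend it with whatever identifies customers or
# requests in your own system.
FORBIDDEN_LABEL_NAMES = {"api_key", "customer_id", "request_id", "trace_id"}

# A path segment that is purely numeric or a UUID is treated as an identifier.
_ID_SEGMENT = re.compile(
    r"\d+|[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"
)


def route_template(raw_path):
    """Drop the query string and replace identifier segments with ':id'."""
    path = raw_path.split("?", 1)[0]
    return "/".join(
        ":id" if _ID_SEGMENT.fullmatch(segment) else segment
        for segment in path.split("/")
    )


def label_problems(labels):
    """Return a list of problems found in a metric's label set."""
    problems = []
    for name, value in sorted(labels.items()):
        if name.lower() in FORBIDDEN_LABEL_NAMES:
            problems.append(f"forbidden label name: {name}")
        elif "?" in str(value):
            problems.append(f"label {name} contains a query string")
    return problems
```

For example, `route_template("/users/123/orders?page=2")` yields `/users/:id/orders`, which keeps the `route` label bounded no matter how many users exist.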

Practical Tips

Here are the rules I would reuse on any team.

Write The Dashboard Goal First

Before adding panels, write one sentence:

This dashboard exists to help us decide ...

If you cannot finish that sentence, you are probably building a chart collection, not a dashboard.

Use Boring Names

Panel names should be operationally obvious:

  • API 5xx Error Rate
  • API Nodes Ready
  • Workers Alive
  • PostgreSQL Connections
  • Worker Heartbeat Age

Avoid clever names. During incidents, nobody wants to decode poetry.

Prefer Fewer Panels

The best overview dashboard is the one people actually use.

If a panel does not change a decision, move it to a drill-down dashboard.

Add Links

Every top-level panel should have an obvious next place to go:

  • production overview
  • database dashboard
  • tracing dashboard
  • logs exploration
  • runbook

Test The Intent

Do not only test JSON validity.

Test the meaning:

  • this panel uses the API metrics datasource
  • this alert has a stable UID
  • this worker is excluded from this business metric
  • this latency alert has a traffic floor
  • this dashboard does not depend on a deprecated metric
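The "traffic floor" item deserves a short illustration. The idea, as a sketch: a latency alert should fire only when there is enough traffic for the percentile to mean something, so that two slow requests at 3am do not page anyone. The function name and the default thresholds below are made-up values, not our production settings.

```python
def latency_alert_should_fire(p95_seconds, requests_per_second,
                              latency_threshold=1.0, traffic_floor=0.5):
    """Fire only when latency is high AND traffic is above the floor."""
    if requests_per_second < traffic_floor:
        # Too few requests for the percentile to be trustworthy.
        return False
    return p95_seconds > latency_threshold
```

In an actual alert rule the same logic lives in the query expression, and the intent test checks that the expression still contains the traffic condition.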

Keep Manual Edits Temporary

Manual Grafana edits are great for exploration. They are not a source of truth.

Once the panel matters, export it, commit it, review it, and sync it from code.
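Exported dashboards diff badly if you commit them raw, because fields such as the instance-local `id` and the save counter `version` change on every save. A minimal normalization sketch, assuming you want to drop exactly those two fields and sort keys for stable diffs (the function name is hypothetical):

```python
import json

# Fields that change between Grafana instances or on every save.
VOLATILE_KEYS = ("id", "version")


def normalize_dashboard(exported):
    """Return stable, diff-friendly JSON text for an exported dashboard.

    Accepts either the bare dashboard model or an API-style response
    that wraps the model under a "dashboard" key.
    """
    dashboard = exported.get("dashboard", exported)
    cleaned = {k: v for k, v in dashboard.items() if k not in VOLATILE_KEYS}
    return json.dumps(cleaned, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
```

The `uid` is deliberately kept, since links, tests, and sync logic depend on it staying stable.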

What "Perfect" Means

The perfect dashboard is not the biggest dashboard.

It is the dashboard where every panel earns its place.

For us, that meant:

  • metrics for health
  • logs for detail
  • traces for causality
  • alerts tied to action
  • dashboards stored in git
  • tests that protect dashboard meaning
  • Codex helping us keep changes small and consistent

Grafana made the system visible. Codex helped us make that visibility maintainable. That combination turned observability from a collection of charts into an operational product.