Observability for AI Systems: Monitoring Drift, Hallucinations, and Reliability in Production
Abhi Chatter · 2026-05-25 · via DEV Community

Part 5 of a series on building reliable AI systems


So far in this series, we explored:

  • AI testing fundamentals
  • Evaluation pipelines
  • RAG evaluation
  • Agent tracing and reliability

But there’s a major gap between:

“The system passed evaluation”

and

“The system is behaving reliably in production.”

That gap is where observability becomes critical.

Because AI systems don’t just fail once.

They drift.


Why AI Systems Need Observability

Traditional applications are usually monitored for:

  • CPU usage
  • Latency
  • Error rates
  • API failures

AI systems introduce an entirely different layer of operational risk:

  • Hallucinations
  • Behavioral drift
  • Retrieval degradation
  • Prompt regressions
  • Tool misuse
  • Silent quality decay

And most of these issues won’t show up in infrastructure metrics.


AI Failures Are Often Silent

This is what makes production AI systems dangerous.

The system:

  • returns 200 OK
  • responds within latency limits
  • appears operational

…but produces low-quality or misleading outputs.

Infrastructure monitoring says:

“Everything is healthy.”

Users experience:

“The system is getting worse.”


What Should You Monitor?

AI observability is about monitoring both:

  1. System performance
  2. Behavior quality

You need visibility into both layers.


Core Dimensions of AI Observability


1. Input Monitoring

Question:

What kinds of inputs is the system receiving?

Track:

  • Query distribution
  • Input length
  • Language changes
  • New user patterns
  • Adversarial inputs

Example issue:
A support chatbot tuned and evaluated mostly on short queries suddenly starts receiving multi-step enterprise requests.

Answer quality drops, even though the model hasn’t changed.

That’s drift.
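
One way to catch this kind of shift is to compare the distribution of input lengths in live traffic against a baseline sample. The sketch below uses the Population Stability Index (PSI) for that comparison. The bucket edges, the sample data, and the 0.25 threshold are assumptions chosen for illustration, not values from a real system.

```python
import math
from bisect import bisect_right

# Hypothetical bucket edges (in tokens or words); tune these for your traffic.
BUCKET_EDGES = [10, 25, 50, 100, 200]

def bucket_shares(lengths, edges=BUCKET_EDGES, eps=1e-6):
    """Return the share of inputs in each length bucket, floored at eps."""
    counts = [0] * (len(edges) + 1)
    for n in lengths:
        counts[bisect_right(edges, n)] += 1
    total = max(len(lengths), 1)
    return [max(c / total, eps) for c in counts]

def psi(baseline_lengths, live_lengths):
    """Population Stability Index between two samples of input lengths."""
    base = bucket_shares(baseline_lengths)
    live = bucket_shares(live_lengths)
    return sum((l - b) * math.log(l / b) for b, l in zip(base, live))

baseline = [8, 12, 9, 15, 20, 11, 7, 14]      # mostly short queries
live = [120, 95, 180, 14, 210, 160, 9, 130]   # multi-step requests appear
score = psi(baseline, live)
drifted = score > 0.25  # a common rule of thumb, not a standard
```

The same function works for any numeric input feature, so language share or per-intent counts can be monitored the same way once they are bucketed.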


2. Output Quality Monitoring

Question:

Are outputs still reliable?

Track:

  • Hallucination frequency
  • Response consistency
  • Formatting failures
  • Grounding quality
  • Toxicity / unsafe outputs

This is where online evaluation becomes important.
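
Formatting failures are the easiest of these signals to measure automatically. As a minimal sketch, assume the system is expected to return a JSON object with `answer` and `sources` keys. That output contract is hypothetical and only serves the example.

```python
import json

REQUIRED_KEYS = {"answer", "sources"}  # hypothetical output contract

def format_failure(raw_output: str) -> bool:
    """True when the output is not a JSON object with the required keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return True
    return not (isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed))

def failure_rate(outputs):
    """Fraction of outputs that violate the format contract."""
    if not outputs:
        return 0.0
    return sum(format_failure(o) for o in outputs) / len(outputs)

samples = [
    '{"answer": "Reset it from Settings.", "sources": ["kb-12"]}',
    "Sure! Here is the answer: ...",       # free text instead of JSON
    '{"answer": "Contact support."}',      # missing the sources key
    '["not", "an", "object"]',             # valid JSON, wrong shape
]
rate = failure_rate(samples)
```

Tracking this rate per prompt version makes prompt regressions visible soon after a deploy.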


3. Retrieval Monitoring (for RAG)

RAG systems need dedicated observability.

Track:

  • Retrieval success rate
  • Context relevance
  • Empty retrievals
  • Retrieval latency
  • Top-K quality trends

Example:

Good model
    +
Poor retrieval
    =
Bad user experience

Many “LLM issues” are actually retrieval degradation problems.
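
The retrieval signals listed above can be summarized from logged retrieval events. This is a rough sketch: the `RetrievalEvent` fields and the 0.5 relevance cutoff are assumptions, and the similarity score scale depends on your vector store.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RetrievalEvent:
    scores: list[float]   # similarity scores of returned chunks, best first
    latency_ms: float

def retrieval_summary(events, min_top_score=0.5):
    """Aggregate empty-retrieval rate, weak-match rate, and latency."""
    total = len(events)
    if total == 0:
        return {}
    top_scores = [e.scores[0] for e in events if e.scores]
    latencies = sorted(e.latency_ms for e in events)
    return {
        "empty_rate": sum(1 for e in events if not e.scores) / total,
        "weak_rate": sum(1 for s in top_scores if s < min_top_score) / total,
        "mean_top_score": mean(top_scores) if top_scores else 0.0,
        "p95_latency_ms": latencies[min(total - 1, int(0.95 * total))],
    }

events = [
    RetrievalEvent([0.82, 0.70], 40),
    RetrievalEvent([], 35),              # empty retrieval
    RetrievalEvent([0.31], 55),          # weak best match
    RetrievalEvent([0.90, 0.88, 0.40], 120),
]
summary = retrieval_summary(events)
```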


4. Agent Workflow Monitoring

Agent systems require workflow-level visibility.

Monitor:

  • Tool usage patterns
  • Retry frequency
  • Loop detection
  • Failed actions
  • Average execution steps

Example issue:
An agent starts making 4x more tool calls after a prompt update.

Outputs still look correct.

Operational cost quietly explodes.
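
A jump like that is easy to detect if each agent run is logged as a list of tool calls. The sketch below assumes a run is a list of `(tool_name, arguments)` tuples. That log format and the repeat threshold are illustrative assumptions.

```python
from collections import Counter

def tool_calls_per_run(runs):
    """Average number of tool calls per agent run."""
    return sum(len(r) for r in runs) / len(runs) if runs else 0.0

def has_loop(calls, max_repeats=3):
    """True when an identical (tool, arguments) call repeats too often."""
    return any(n > max_repeats for n in Counter(calls).values())

baseline_runs = [
    [("search", "q1"), ("fetch", "d1")],
    [("search", "q2")],
    [("search", "q3"), ("fetch", "d3"), ("summarize", "d3")],
]
# Hypothetical runs logged after a prompt update.
looping_run = [("search", "q1")] * 5 + [("fetch", "d1"), ("fetch", "d2"), ("summarize", "d1")]
busy_run = [("search", f"q{i}") for i in range(8)]
current_runs = [looping_run, busy_run]

ratio = tool_calls_per_run(current_runs) / tool_calls_per_run(baseline_runs)
needs_review = ratio >= 2.0 or any(has_loop(r) for r in current_runs)
```

Comparing the ratio per prompt version ties a cost jump back to the change that caused it.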


5. Drift Detection

Drift is one of the hardest problems to catch in production.

Drift happens when:

  • user behavior changes
  • prompts evolve
  • retrieval data changes
  • model behavior shifts over time

Even small changes compound.

Common drift signals:

  • Lower task success rate
  • Increased hallucinations
  • More retries
  • Reduced grounding quality
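
Any of these signals can be watched with a rolling window compared against a known baseline. Here is a minimal sketch for task success rate. The class name, window size, and tolerance are assumptions, and a real setup would probably add a statistical test instead of a fixed margin.

```python
from collections import deque

class SuccessRateMonitor:
    """Compare a rolling task success rate against a fixed baseline."""

    def __init__(self, baseline_rate, window=200, tolerance=0.05, min_samples=50):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self.min_samples = min_samples
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically

    def record(self, success: bool) -> bool:
        """Store one outcome; return True when the rolling rate signals drift."""
        self.outcomes.append(bool(success))
        if len(self.outcomes) < self.min_samples:
            return False  # not enough data to judge yet
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.baseline_rate - self.tolerance

monitor = SuccessRateMonitor(baseline_rate=0.90, window=100, min_samples=20)
# Simulated traffic in which every fourth task fails (75% success).
alerts = [monitor.record(i % 4 != 0) for i in range(100)]
```

The same pattern applies to hallucination rate or retry rate by flipping the comparison so that a rise, not a fall, triggers the alert.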

The Difference Between Monitoring and Evaluation

This distinction is important.

Evaluation:

Usually offline and controlled.

Example:

Run dataset → Measure metrics

Observability:

Continuous monitoring in production.

Example:

Live traffic → Detect anomalies → Trigger alerts

You need both.


A Practical AI Observability Flow

Production Traffic
        ↓
Capture Inputs & Outputs
        ↓
Run Online Checks
        ↓
Detect Drift / Failures
        ↓
Trigger Alerts
        ↓
Feed Back Into Evaluation Pipeline

This creates a continuous reliability loop.
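
In code, the flow above could look roughly like this. The record fields, check names, and in-memory `alerts` and `eval_queue` lists are placeholders for whatever logging, alerting, and dataset tooling you already have.

```python
from typing import Callable

Check = Callable[[dict], bool]  # a check returns True when the interaction passes

def non_empty_answer(record: dict) -> bool:
    return bool(record.get("output", "").strip())

def has_retrieved_context(record: dict) -> bool:
    return len(record.get("retrieved", [])) > 0

ONLINE_CHECKS: dict[str, Check] = {
    "non_empty_answer": non_empty_answer,
    "has_retrieved_context": has_retrieved_context,
}

def observe(record: dict, alerts: list, eval_queue: list) -> list[str]:
    """Run online checks on one captured interaction and route any failures."""
    failed = [name for name, check in ONLINE_CHECKS.items() if not check(record)]
    if failed:
        alerts.append({"id": record["id"], "failed_checks": failed})  # trigger alert
        eval_queue.append(record)  # feed back into the evaluation pipeline
    return failed

alerts, eval_queue = [], []
traffic = [
    {"id": "r1", "input": "How do I reset my password?",
     "output": "Use the reset link.", "retrieved": ["kb-7"]},
    {"id": "r2", "input": "What is the refund window?",
     "output": "30 days.", "retrieved": []},
]
results = [observe(r, alerts, eval_queue) for r in traffic]
```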


Online Evaluation in Production

Many teams now run lightweight evaluations on live traffic.

Examples:

  • Hallucination checks
  • Grounding verification
  • Response quality scoring
  • Toxicity detection

This helps identify:

  • silent regressions
  • degraded prompts
  • retrieval failures

before users escalate issues.
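
As an illustration of a lightweight grounding check, the sketch below scores how many content words in an answer also appear in the retrieved context. This lexical overlap is a deliberately crude stand-in. Production systems often use an entailment model or an LLM judge instead, and the stopword list here is an arbitrary assumption.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for", "it", "on"}

def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with common stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def grounding_score(answer: str, context: str) -> float:
    """Share of the answer's content words that also occur in the context."""
    words = content_words(answer)
    if not words:
        return 1.0  # nothing to verify
    return len(words & content_words(context)) / len(words)

context = "Refunds are processed within 5 business days of approval."
grounded = grounding_score("Refunds are processed within 5 business days.", context)
ungrounded = grounding_score("Refunds are instant and include a 20% bonus.", context)
```

A cheap score like this is only useful as a trend or as a trigger for a heavier check, since paraphrased but correct answers will also score low.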


Real-World Example

Consider a production RAG assistant.

Initial state:

  • Strong retrieval quality
  • Stable outputs
  • Good user satisfaction

What changed:

A large set of new documents was added to the vector database.

What happened next:

  • Retrieval relevance dropped
  • Context became noisy
  • Hallucinations increased

Infrastructure metrics remained healthy.

Only behavior-level metrics, such as retrieval relevance and grounding scores, exposed the degradation.


Common Mistakes Teams Make

1. Monitoring only infrastructure

AI quality problems are behavioral—not just operational.


2. No production sampling

If you never inspect real outputs, you’ll miss drift entirely.
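
Sampling does not need to be elaborate. A minimal sketch, assuming every request has a stable ID: hash the ID and keep a fixed share of traffic for review. The 5% rate and the ID format are illustrative.

```python
import hashlib

def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests for review."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

request_ids = [f"req-{i}" for i in range(10_000)]
sampled = [rid for rid in request_ids if should_sample(rid)]
share = len(sampled) / len(request_ids)
```

Hashing instead of calling a random generator means the same request is always in or out of the sample, which keeps reviews reproducible across services.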


3. No feedback loop

Observability should improve:

  • datasets
  • evaluations
  • prompts
  • retrieval quality

Otherwise monitoring becomes passive reporting.


4. Ignoring cost observability

AI systems also drift operationally:

  • token usage
  • tool calls
  • latency
  • retries

Reliability includes efficiency.
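
Cost can be tracked per request from the same usage data most model APIs already return. The prices and field names in this sketch are made up for illustration, so substitute your provider's actual rates.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {"input": 0.001, "output": 0.002}

@dataclass
class Usage:
    input_tokens: int
    output_tokens: int
    tool_calls: int = 0
    retries: int = 0

def request_cost(u: Usage) -> float:
    """Token cost of a single request in USD."""
    return (u.input_tokens * PRICE_PER_1K["input"]
            + u.output_tokens * PRICE_PER_1K["output"]) / 1000

def cost_per_request(usages) -> float:
    """Average cost across a batch of requests."""
    return sum(map(request_cost, usages)) / len(usages) if usages else 0.0

last_week = [Usage(1000, 500, tool_calls=2), Usage(800, 400, tool_calls=2)]
this_week = [Usage(4000, 2000, tool_calls=8), Usage(3200, 1600, tool_calls=8)]
cost_ratio = cost_per_request(this_week) / cost_per_request(last_week)
```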


Practical Signals Worth Tracking

Here are some high-value production metrics:

Area            Signals
Output Quality  Hallucination rate, grounding score
RAG             Retrieval relevance, empty retrievals
Agents          Tool failures, retries, loops
Usage           Query distribution, prompt drift
Operations      Latency, token usage, cost

Start small. Expand over time.


Building Feedback Loops

The best AI teams continuously feed production insights back into evaluation.

Example loop:

Production Failure
        ↓
Add to Dataset
        ↓
Run Evaluations
        ↓
Improve System
        ↓
Deploy

This is how reliable systems mature.
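
The "Add to Dataset" step can be as simple as appending failing interactions to a regression file. This sketch assumes a JSONL dataset and a record with `input`, `output`, and `reason` fields. All of those names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def add_failure_to_dataset(path, record) -> bool:
    """Append a failing interaction as a regression case; skip duplicate inputs."""
    path = Path(path)
    existing = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["input"] for line in f if line.strip()}
    if record["input"] in existing:
        return False
    case = {
        "input": record["input"],
        "bad_output": record["output"],
        "reason": record["reason"],
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
    return True

failure = {
    "input": "What is the refund window?",
    "output": "Refunds are instant.",
    "reason": "ungrounded answer",
}
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "regression_cases.jsonl"
    first = add_failure_to_dataset(dataset, failure)    # written
    second = add_failure_to_dataset(dataset, failure)   # duplicate, skipped
    case_count = len(dataset.read_text(encoding="utf-8").splitlines())
```

Each stored case then runs in the offline evaluation suite, so the same failure has to be fixed only once.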


What’s Next

In the next part of this series, I’ll go deeper into:

  • Red teaming AI systems
  • Prompt injection attacks
  • Jailbreak testing
  • Adversarial evaluation strategies

Because reliability without security is incomplete.


Final Thoughts

AI systems are not static applications.

They evolve continuously through:

  • changing inputs
  • retrieval updates
  • prompt modifications
  • model behavior shifts

And that means reliability cannot depend on testing alone.

It requires continuous observability.

The teams building resilient AI systems are the ones that:

  • monitor behavior, not just infrastructure
  • detect drift early
  • build strong feedback loops
  • continuously evaluate production quality

Because in AI systems, failures rarely announce themselves.

They emerge gradually—until users notice first.