Why 91% of AI Agents Fail in Production (And What the 9% Do Differently)
Hari Sathwik · 2026-05-23 · via DEV Community

Everyone is building AI agents right now.

Autonomous systems that reason, plan, and act without humans in the loop. Agents that write code, manage workflows, analyze data, make decisions. The demos are incredible. The hype is deafening.

But here's what nobody talks about: 91% of AI agents that get built never make it to production successfully. They work in the demo. They fail in the real world.

And the reason is almost never the model.


The Real Problem Isn't Intelligence — It's Infrastructure

Most teams building agentic AI focus 90% of their energy on the agent itself. The prompts. The reasoning chain. The tool selection. The agent architecture.

Then they ship it and wonder why it falls apart after two weeks.

The problem is everything around the agent. The boring, unglamorous systems engineering that nobody wants to talk about at conferences. The stuff that doesn't make for a good demo but determines whether the agent actually works on day 30, day 90, day 365.

I'm talking about MLOps. Or more broadly, the discipline of making AI systems reliable in production.

And here's the thing — agentic AI is the hardest MLOps problem you can have.

Let me explain why.


Traditional ML vs Agentic AI: A Systems Engineering Gap

A traditional ML system is relatively simple: input goes in, model makes a prediction, output goes out. You monitor the prediction quality, retrain when drift happens, and you're done.

An agentic system is fundamentally different. It's not one model making one prediction. It's one or more models called repeatedly in a loop. The agent reasons, plans, acts, observes the result, and reasons again. Each step depends on the previous one, so errors compound.

Here's what that means in practice:

Failure modes multiply. A wrong prediction in a traditional ML system is a single bad output. A wrong action by an agent can cascade — it takes a bad step, observes the wrong result, reasons from bad context, and takes another bad step. By the time you notice, the agent has been making confident mistakes for hours.
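To put a rough number on how errors compound, here is a minimal sketch. It assumes every step succeeds independently with the same probability, which real agents will not satisfy exactly, and the 95% figure is purely illustrative rather than a measured value.

```python
def task_success_probability(step_success: float, steps: int) -> float:
    """Chance that every step in a chain succeeds, assuming independent steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


# Hypothetical agent whose individual steps succeed 95% of the time.
for steps in (1, 5, 10, 20):
    print(steps, round(task_success_probability(0.95, steps), 3))
```

Under these assumptions, a 10-step task finishes cleanly only about 60% of the time, and a 20-step task about 36% of the time, even though each step looks reliable on its own.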

Monitoring gets harder. With a traditional model, you monitor prediction distributions and accuracy. With an agent, you need to monitor action quality, loop detection, cost per task, tool failure rates, and whether the agent is even pursuing the right goal.
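Loop detection, one of the signals listed above, can be approximated by counting repeated identical actions in a recent window. This is only a sketch: `is_looping`, the window size, and the repeat threshold are illustrative assumptions, not part of any particular agent framework.

```python
from collections import Counter
from typing import List, Tuple

# An action is modelled here as (tool_name, serialized_arguments).
Action = Tuple[str, str]


def is_looping(actions: List[Action], window: int = 6, max_repeats: int = 3) -> bool:
    """Return True if any single action appears max_repeats or more times
    among the last `window` actions."""
    recent = actions[-window:]
    if not recent:
        return False
    _, count = Counter(recent).most_common(1)[0]
    return count >= max_repeats
```

A check like this could run after every step, with a positive result feeding an alert or stopping the run.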

Versioning explodes. A traditional model has one set of weights. An agent has multiple model versions, prompt versions, tool configurations, and orchestration logic. All of them need to be versioned and tracked together.

Drift becomes unpredictable. Traditional data drift is usually gradual: input distributions shift slowly. Agent drift can be sudden: a tool API changes, a new edge case appears, or the environment the agent operates in evolves.

This is why agentic AI needs more MLOps discipline, not less. And why most teams are building on a foundation that can't support what they're creating.


The 5 Failure Modes That Kill Agents in Production

I've studied production ML failures — my own and others'. The same five patterns show up again and again. They're not model problems. They're systems problems.

1. No Monitoring — Flying Blind

This is the biggest one. Most agent demos have zero production monitoring. The agent runs, and the team only finds out something is wrong when a user complains or a business metric drops.

By then, it's too late.

Production agents need real-time monitoring of: action success rates, error patterns, cost per task, latency, and — most importantly — whether the agent is actually achieving its intended outcome.
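As a rough illustration of the metrics listed above, the sketch below aggregates per-task records into a summary. The names `TaskRecord` and `AgentMonitor` are hypothetical, and a real deployment would more likely ship these numbers to an existing metrics system than keep them in memory.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class TaskRecord:
    succeeded: bool       # all actions completed without error
    goal_achieved: bool   # the task reached its intended outcome
    cost_usd: float       # total spend for the task
    latency_s: float      # wall-clock duration of the task


@dataclass
class AgentMonitor:
    records: List[TaskRecord] = field(default_factory=list)

    def record(self, rec: TaskRecord) -> None:
        self.records.append(rec)

    def summary(self) -> Dict[str, float]:
        """Aggregate the recorded tasks; returns an empty dict if there are none."""
        if not self.records:
            return {}
        n = len(self.records)
        return {
            "action_success_rate": sum(r.succeeded for r in self.records) / n,
            "goal_achievement_rate": sum(r.goal_achieved for r in self.records) / n,
            "avg_cost_usd": mean(r.cost_usd for r in self.records),
            "avg_latency_s": mean(r.latency_s for r in self.records),
        }
```

Tracking `goal_achieved` separately from `succeeded` is deliberate: an agent can complete every action without error and still miss the outcome it was built for.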

If you can't see it, you can't fix it.

2. No Versioning — The One-Time Result

An agent worked once. It worked beautifully. But nobody recorded the exact configuration — the model version, the prompt version, the tool settings, the orchestration logic.

Two weeks later, something changed. The agent degrades. And the team has no idea what broke because they can't reproduce the last known good state.

Version everything. Code, data, model weights, prompts, configuration, environment. All of it. If you can't reproduce it, you can't debug it.
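One lightweight way to tie these pieces together is to hash the full configuration into a single fingerprint and log it with every run. The sketch below assumes the configuration fits in a JSON-serializable dict; the field names and values are made up for illustration.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Short, stable hash of an agent configuration.

    Keys are sorted before hashing, so two dicts with the same content
    produce the same fingerprint regardless of insertion order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical configuration; every value here is a placeholder.
run_config = {
    "model": "example-model-v1",
    "prompt_version": "2024-01-a",
    "tools": {"search": {"timeout_s": 10}},
    "orchestrator_commit": "abc1234",
}
print(config_fingerprint(run_config))
```

If the agent degrades, comparing the fingerprint of the last good run with the current one shows immediately whether any tracked component changed.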

3. No Guardrails — Unbounded Behavior

Agents without guardrails are agents waiting to cause damage. I've seen agents that kept retrying a failing tool until they hit rate limits and took down a service, agents that generated increasingly verbose responses and burned through token budgets, and agents that pursued a goal past the point where they should have stopped and escalated.

Guardrails aren't optional. Circuit breakers, cost limits, retry budgets, human-in-the-loop checkpoints — these are what separate a demo from a production system.
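A minimal sketch of what such limits might look like in code follows. `RunGuard` and its default limits are hypothetical; the idea is simply that the orchestration loop calls `after_step` once per step and treats `GuardrailTripped` as the signal to stop and escalate to a human.

```python
from dataclasses import dataclass


class GuardrailTripped(Exception):
    """Raised when a run exceeds one of its budgets and should stop."""


@dataclass
class RunGuard:
    max_steps: int = 20
    max_cost_usd: float = 1.00
    max_consecutive_failures: int = 3
    steps: int = 0
    cost_usd: float = 0.0
    consecutive_failures: int = 0

    def after_step(self, step_cost_usd: float, tool_ok: bool) -> None:
        """Update counters after one agent step; raise if any budget is exceeded."""
        self.steps += 1
        self.cost_usd += step_cost_usd
        # A successful tool call resets the failure streak; a failure extends it.
        self.consecutive_failures = 0 if tool_ok else self.consecutive_failures + 1

        if self.consecutive_failures >= self.max_consecutive_failures:
            raise GuardrailTripped("retry budget exhausted")
        if self.cost_usd > self.max_cost_usd:
            raise GuardrailTripped("cost limit exceeded")
        if self.steps >= self.max_steps:
            raise GuardrailTripped("step limit reached")
```

Keeping the guard outside the agent's own reasoning matters: the limits hold even when the model is confidently wrong about whether it should continue.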

4. Training-Serving Skew — The Twin That Isn't

The agent was tested in a sandbox. The production environment is different. Tool latencies are higher. Data formats are slightly different. Error messages look different.

The agent that worked perfectly in testing behaves unpredictably in production because it was never tested against the real world.

This is the same problem that kills traditional ML models, but it's worse for agents because they make sequences of decisions. A small skew at each step compounds into a large deviation by the end.

5. No Rollback — Stuck With a Bad Version

An agent starts degrading in production. The team knows something is wrong. But there's no quick way to revert to the previous version. They're stuck debugging a live system while users are affected.

Every production agent needs instant rollback. One command, back to the last known good version. No debate.
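The core of a one-command rollback is just an ordered history of promoted versions. The sketch below keeps that history in memory; `VersionRegistry` is a hypothetical name, and a real system would persist the history and point a serving layer at the live version.

```python
from typing import List, Optional


class VersionRegistry:
    """Tracks promoted versions so the previous one can be restored quickly."""

    def __init__(self) -> None:
        self._history: List[str] = []

    @property
    def live(self) -> Optional[str]:
        """The version currently serving traffic, or None if nothing is promoted."""
        return self._history[-1] if self._history else None

    def promote(self, version: str) -> None:
        self._history.append(version)

    def rollback(self) -> str:
        """Drop the live version and return the one that is live afterwards."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]
```

For an agent, a "version" here would be the whole bundle (model, prompts, tool configuration, orchestration code), not the model weights alone.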


[Figure: Demo vs. production]

What the 9% Do Differently

  • The teams that successfully ship agentic AI to production aren't smarter. They're not using better models. They're not better prompt engineers.
  • They just treat AI systems engineering as systems engineering.
  • They build the infrastructure first. Monitoring, versioning, guardrails, rollback. Before the agent is impressive, it's reliable.
  • They test in production-like environments from day one. Not in a notebook. Not in a demo. In an environment that looks and feels like the real world.
  • They set up drift detection. They know that the world changes, and their agent needs to adapt. They build automated retraining pipelines that validate new versions before promoting them.
  • They measure what matters. Not just "does the agent work?" but "does the agent work consistently, safely, and cost-effectively over time?"

A Real Example: Building a Self-Healing ML Pipeline

I recently built a customer churn prediction system for a telecom provider. On the surface, it's a simple binary classification problem — predict which customers will leave.

But I designed it as a self-healing system, because I knew the alternative was a model that degrades silently until the retention team notices they're losing more customers than usual.

Here's what that looks like:

Automated drift detection. Every day, the system compares incoming customer data against the training baseline. If feature distributions shift beyond a threshold — say, the company launches a new pricing plan and customer behavior changes — the system flags it.
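The post does not say which drift statistic the system uses. As one common option, the sketch below computes the Population Stability Index (PSI) for a single numeric feature in plain Python. The bin count and the 0.2 alert threshold are widely used rules of thumb, treated here as assumptions rather than the author's settings.

```python
import math
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and a current sample.

    Bins are equal-width over the baseline's range; current values outside
    that range are counted in the first or last bin.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant feature

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def has_drifted(baseline: Sequence[float], current: Sequence[float],
                threshold: float = 0.2) -> bool:
    return psi(baseline, current) > threshold
```

A daily job could run `has_drifted` per feature against the training baseline and flag any feature that crosses the threshold.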

Automated retraining. When drift is detected, the system automatically retrains the model on fresh data. Not a human deciding to retrain. The system detects the need and triggers the pipeline.

Quality gates. A new model doesn't go live just because it was retrained. It has to beat the current production model on F2-score, recall, and false positive rate. If it doesn't, the old model stays in place and the team gets an alert.
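A gate like this can be sketched as a comparison of confusion-matrix metrics on a shared holdout set. The code below is an illustration, not the author's implementation: `Confusion` and `passes_gate` are hypothetical names, and it assumes "beat" means strictly better on all three metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for a binary classifier on a holdout set."""
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def false_positive_rate(self) -> float:
        denom = self.fp + self.tn
        return self.fp / denom if denom else 0.0

    def f_beta(self, beta: float = 2.0) -> float:
        """F-beta score; beta=2 weights recall more heavily than precision."""
        p, r = self.precision, self.recall
        denom = beta**2 * p + r
        return (1 + beta**2) * p * r / denom if denom else 0.0


def passes_gate(candidate: Confusion, production: Confusion) -> bool:
    """Promote only if the candidate is strictly better on all three metrics."""
    return (
        candidate.f_beta(2.0) > production.f_beta(2.0)
        and candidate.recall > production.recall
        and candidate.false_positive_rate < production.false_positive_rate
    )
```

Favoring F2 over F1 fits churn prediction, where missing a customer who is about to leave usually costs more than flagging one who would have stayed.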

Instant rollback. If a promoted model starts underperforming, one command reverts to the previous version. No downtime. No debugging under pressure.

Full observability. Every prediction is logged. Every retraining run is tracked. Every drift report is stored. If something goes wrong, the full history is there to debug.

This is the same discipline that agentic AI systems need. The scale is different, but the principles are identical.


The Checklist: Is Your Agent Production-Ready?

Before you ship an agent to production, answer these questions honestly:

  • [ ] Can I monitor the agent's action quality in real time?
  • [ ] Can I reproduce any past run exactly (code + data + config + environment)?
  • [ ] Are there circuit breakers that stop the agent when it goes off track?
  • [ ] Has the agent been tested in an environment that matches production?
  • [ ] Can I roll back to the previous version in under 60 seconds?
  • [ ] Do I have drift detection that alerts me when the environment changes?
  • [ ] Do I have automated quality gates that prevent bad versions from going live?
  • [ ] Can I explain, to a non-technical stakeholder, what the agent did and why?

If you answered "no" to more than two of these, you're building a demo, not a product.


The Bottom Line

The AI agent hype is real. The technology is genuinely impressive. But technology without infrastructure is a demo.

The teams that win in agentic AI won't be the ones with the best models. They'll be the ones with the best systems. The ones who invested in monitoring, versioning, guardrails, drift detection, and rollback before they needed them.

The boring stuff. The stuff that doesn't make for a good demo. The stuff that determines whether your agent is still working six months from now.

Build the infrastructure first. Then build the agent.

Your future self — and your users — will thank you.