5 Silent Failure Patterns I Keep Finding in Production AI Systems
Temur Khan · 2026-05-03 · via DEV Community

I've spent roughly two years debugging production AI systems for engineering teams that have already shipped, with production traffic, real users, and real cost surfaces. Different stacks (LangChain, LlamaIndex, vanilla SDK calls, custom agent harnesses), different audiences (B2B SaaS, internal tools, consumer features), and different scales. But the failure modes are remarkably consistent.

Here's what's surprised me: the failures that hurt the most aren't the obvious ones. Models hallucinate, sure, but most teams have at least some defense against that. APIs go down, and that's an exit code, that's a metric, that's an alert. Those failures get caught.

The ones that hurt are the silent failures. The job that ran successfully but produced nothing useful. The agent that returned an "ok" status while having done literally nothing. The cost line that slowly drifted up because one feature was hitting the LLM 4× per request instead of once. These don't trigger any alarms. They don't show up in error logs. They make it to production and stay there for weeks because the monitoring all says "healthy."

This is a catalog of the five I see most often, with the failure mode, how it actually surfaces, and what I now check for.

1. Exit code zero with empty output

The classic. A scheduled job, which could be a daily summary, a web search refresh, or an audit snapshot, runs, returns exit code 0, and finishes in normal time. The cron monitor turns green. Everyone's happy.

Except the output was empty. Or it was the literal string <no rows>. Or it was a 0-byte file. Or it was a 200 response with {"results": []} while the query was supposed to return roughly a thousand rows.

Why this happens: the script's "success" check is too lenient. Something like:

import sys

def run_summary():
    rows = fetch_data()
    if rows is None:
        sys.exit(1)  # explicit failure: the only failure path
    summary = summarize(rows)  # returns "" if rows == []
    send_email(summary)  # an empty summary still goes out
    sys.exit(0)  # everything's fine?

The if rows is None check is the only failure path. But rows = [] (empty list) flows through as if it were a normal day. The LLM dutifully summarizes nothing into nothing. The email goes out with an empty body. Exit code 0.

I've seen this pattern in:

  • Daily summary emails that gradually started arriving empty because an upstream API key expired silently
  • Web-search-backed agents that started returning empty results because of a query template change
  • Backup scripts that uploaded 0-byte files for weeks because the source path was wrong
  • Audit snapshot crons that returned exit 0 without writing the snapshot file because the disk was full and the write silently failed

What I check for now:

  • Output length anomaly versus historical median (if today's output is less than 30% of typical size, flag it)
  • Output presence; empty stdout from a job that's supposed to produce output is itself a failure
  • Expected pattern matching; if the job's manifest says it should produce a summary line, verify that line exists

The mental model shift: exit code is one signal. Output content is a second signal. Both must be checked independently. A job that exits 0 with empty output is a silent failure, not a success.
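A minimal sketch of those three checks, assuming the job's output is available as a string and you keep a list of recent output sizes. The function name check_output, its parameters, and the 30% default are illustrative, not part of any particular framework:

```python
import re
from statistics import median

def check_output(output, history_sizes, expected_pattern=None, min_ratio=0.3):
    """Return a list of problems with a job's output; an empty list means it looks healthy."""
    problems = []
    # Presence: empty output from a job that should produce output is a failure.
    if not output.strip():
        return ["empty output"]
    # Length anomaly: compare against the historical median size.
    if history_sizes:
        typical = median(history_sizes)
        if len(output) < min_ratio * typical:
            problems.append(
                f"output is {len(output)} chars, under {min_ratio:.0%} of the typical {typical:.0f}"
            )
    # Expected pattern: e.g. a summary line the job is supposed to emit.
    if expected_pattern and not re.search(expected_pattern, output, re.MULTILINE):
        problems.append(f"expected pattern {expected_pattern!r} not found")
    return problems
```

The job wrapper would run this after the process exits and treat a non-empty result as a failure even when the exit code is 0.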

2. The "just this once" hook bypass that becomes permanent

Engineering needs to ship a hotfix. There's a validation hook in the way. Someone disables the hook for "just this deployment, we'll re-enable next sprint." The hotfix ships. The hook stays disabled.

Six months later, an audit catches that the validation has been off for the entire window, and every release in the meantime has shipped without the check.

I've seen this pattern in:

  • LLM output validators disabled "temporarily" for a launch
  • PII redaction guards turned off because a customer support workflow needed raw logs
  • Cost cap circuit breakers raised "just for the holiday season" and never lowered
  • Tool argument schema validators bypassed because a model started passing nonsensical arguments and "we'll fix it later"

The pattern is universal: constraint X feels like it's blocking shipping, X gets disabled, the underlying reason X existed gets forgotten, and X never comes back.

What I check for now (and put in the framework):

  • Hygiene exception registry: every hook bypass is logged with reason, owner, explicit expiry date, and renewal review
  • Monthly audit ritual that walks the registry and asks "is this exception still needed?"
  • Hooks themselves emit a metric when bypassed, so even if the registry is forgotten, the production telemetry surfaces the bypass

The mental model shift: disabling a guard is a temporary action that needs an expiration date. Not "we'll re-enable it eventually" but "this exception expires on $DATE and the owner is $NAME." If the date arrives and the exception is still needed, it's a real product decision, not background drift.
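One way to make the expiry concrete, sketched under the assumption that the registry is just a list of records kept in code or config. HookException, expired_exceptions, and hook_enabled are hypothetical names for illustration:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class HookException:
    hook: str      # which guard is bypassed
    reason: str    # why it was bypassed
    owner: str     # who answers for the bypass
    expires: date  # last day the bypass is valid

def expired_exceptions(registry, today):
    """Exceptions past their expiry date; each needs a renew-or-restore decision."""
    return [e for e in registry if e.expires < today]

def hook_enabled(hook, registry, today):
    """A hook is skipped only while an unexpired exception covers it."""
    return not any(e.hook == hook and e.expires >= today for e in registry)
```

The design choice here is that an expired exception stops working by itself: the hook comes back on unless someone actively renews the entry, which turns forgetting into the safe default.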

3. Action budget leak through agent loops

You build an agent. You give it a budget, say "20 tool calls per run, max." You ship it. Three weeks later, you're looking at your LLM bill and one specific feature's cost has grown 5×.

The bug: the budget was checked at the start of the run, not per action. The agent runs, makes 20 calls, the loop's recursion logic doesn't notice the budget is exhausted, makes a 21st call, then a 22nd, then a 23rd, and by call 80 the agent has solved the problem (or given up) but has burned through 4× the intended cost.

Worse: most agent frameworks don't expose per-action budgets natively. The pattern is something like:

class Agent:
    def __init__(self, max_actions=20):
        self.max_actions = max_actions
        self.action_count = 0

    def run(self, task):
        done = False
        while not done:
            if self.action_count >= self.max_actions:
                return  # this check is correct here, but...
            result = self.tool_call(task)  # ...this might recurse or retry internally
            self.action_count += 1  # counts one action, however many calls ran inside
            done = result.done  # assumed: the tool result reports completion

If tool_call internally invokes another agent, or has its own retry loop, the parent's action_count doesn't track those nested calls. The "20 max" is really "20 top-level, unbounded total."

I've seen this manifest as:

  • A summarization agent that recursed when input was too long, with no recursion depth check
  • A search-and-rewrite loop that "kept trying" when results were empty (see also pattern 1; empty output triggering a retry cascade)
  • Tool calls that internally made multiple LLM calls each, while the budget was tracking tool calls, not LLM calls
  • Multi-agent harnesses where each sub-agent had its own budget but the parent had no global budget

What I check for now:

  • Budget should be decremented per action at the innermost call site, not per task at the outermost
  • Hard stop: budget at zero means return early, do not pass go, dead-letter the run for review
  • Per-call cost tracking and alerting on outliers (not just totals; an outlier run that 5×s normal cost should fire an alert before the end-of-day summary catches it)
  • For multi-agent setups: a shared budget pool that all sub-agents decrement, not per-agent budgets

The mental model shift: a budget enforced once at the start is not a budget; it's a suggestion. Real budgets are decremented per action, hard stop on zero, with an alert path so you find out about the depletion before the bill arrives.
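A sketch of a shared budget charged at the innermost call site, assuming every model call goes through one function. ActionBudget, BudgetExhausted, call_llm, search_tool, and run_agent are stand-in names, and call_llm returns a placeholder string instead of calling a real model:

```python
class BudgetExhausted(Exception):
    """Raised when a run tries to act with no budget left."""

class ActionBudget:
    """One shared pool; every LLM call spends from it before running."""
    def __init__(self, limit):
        self.remaining = limit

    def spend(self, cost=1):
        if self.remaining < cost:
            raise BudgetExhausted(f"needed {cost}, only {self.remaining} left")
        self.remaining -= cost

def call_llm(prompt, budget):
    budget.spend()  # innermost call site: charged no matter who called us
    return f"response to {prompt}"  # placeholder for a real model call

def search_tool(query, budget):
    # A tool that makes two LLM calls internally; both are charged.
    rewritten = call_llm(f"rewrite: {query}", budget)
    return call_llm(f"answer: {rewritten}", budget)

def run_agent(tasks, budget):
    results = []
    try:
        for task in tasks:
            results.append(search_tool(task, budget))
    except BudgetExhausted:
        # Hard stop: return what we have and let the caller dead-letter the run.
        return results, "budget_exhausted"
    return results, "ok"
```

Because the same ActionBudget object is passed down, a sub-agent or a retry loop inside a tool draws from the same pool as the parent, so "20 max" means 20 in total.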

4. Tool argument semantic validation gap

Your agent calls a tool: escalate_to_human(user_id, reason). Your tool has a JSON schema validator on the input. The schema says user_id: string. The LLM passes user_id="the user mentioned in the conversation". The schema is happy. Your tool dispatcher is happy. The escalation goes through.

You now have a support ticket against a literal user named "the user mentioned in the conversation."

I've seen this pattern in:

  • Tools that accepted user identifiers as strings but actually needed UUIDs or database IDs
  • Tools that took email arguments and got passed strings like "his email" or "the email from earlier"
  • Tools that took amount arguments and accepted strings like "the same amount as last time" (which the LLM thought was specific but the tool received as raw text)
  • Multi-tool chains where output of tool A was supposed to become input of tool B, but the LLM paraphrased rather than passing through verbatim

JSON schema validation is necessary but not sufficient. It catches type mismatches but not semantic mismatches.

What I check for now:

  • Semantic post-validation after JSON parse, before tool dispatch:
    • Does user_id resolve to a real user record? Reject if not.
    • Does email match an email regex? Reject if not.
    • Does amount parse as a number? Reject if not.
    • Does date parse as a real date in a plausible range? Reject if not.
  • For tool chains: explicit pass-through tokens (the LLM is told "use the literal value from tool A's output, do not paraphrase")
  • Semantic validators return errors back to the LLM so it can self-correct, not just hard fail

The mental model shift: type validation is for the parser; semantic validation is for the agent. A string that's correctly typed but semantically nonsense is a silent failure waiting to happen.
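A sketch of a semantic validator that runs after schema validation and before dispatch. It assumes user IDs are UUIDs and that known_user_ids stands in for a real user lookup; validate_escalation and the optional email and amount fields are illustrative, not the actual tool's signature:

```python
import re
import uuid

# Deliberately loose shape check, not a full address validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_escalation(args, known_user_ids):
    """Return error messages to feed back to the LLM; an empty list means dispatch is safe."""
    errors = []
    user_id = args.get("user_id", "")
    try:
        uuid.UUID(user_id)
    except (ValueError, AttributeError, TypeError):
        errors.append(f"user_id {user_id!r} is not a UUID; pass the literal id, not a description")
    else:
        if user_id not in known_user_ids:
            errors.append(f"user_id {user_id!r} does not match any user record")
    email = args.get("email")
    if email is not None and not EMAIL_RE.match(str(email)):
        errors.append(f"email {email!r} is not an email address")
    amount = args.get("amount")
    if amount is not None:
        try:
            float(amount)
        except (TypeError, ValueError):
            errors.append(f"amount {amount!r} is not a number")
    return errors
```

Returning messages rather than raising lets the dispatcher hand the list back to the model as a tool error, which gives it a chance to retry with the literal value.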

5. The "successful retry" that hides repeated failure

Your agent retries on failure. That's good. Your retry policy is exponential backoff with up to 3 retries. That's also good. On the third retry, the agent might succeed. Reported status: success.

But the actual user-visible behavior was: 3-second delay, then 6-second delay, then 12-second delay, then success. Total: 21 seconds of waiting. The user has long since given up.

Or: the call "succeeds" only because the retry condition is too lenient. The first call returns a 200 with garbage content (silent failure pattern 1). The retry logic says "no exception and no non-zero exit code, so no retry." So the system "succeeded" on the first try, with garbage.

Or: the retries are masking a real upstream issue. The downstream service has a 50% error rate. With up to three retries (four attempts in total), you get a 93.75% success rate at the cost of 1.875× the average call count. From the outside, "things look okay." From the inside, your costs are inflated by about 88% and you don't know why.
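The amplification arithmetic is easy to check with a few lines, assuming every attempt fails independently at the same rate. Real outages are usually correlated, so treat the numbers as illustrative. retry_stats is a hypothetical helper:

```python
def retry_stats(failure_rate, max_attempts):
    """Success probability and expected calls per request, assuming independent failures."""
    # The request fails only if every attempt fails.
    success = 1 - failure_rate ** max_attempts
    # Attempt k+1 happens only if the first k attempts all failed.
    expected_calls = sum(failure_rate ** k for k in range(max_attempts))
    return success, expected_calls

for attempts in (1, 3, 4):
    s, c = retry_stats(0.5, attempts)
    print(f"{attempts} attempts: success {s:.2%}, {c:.3f} calls per request")
```

At a 50% failure rate, three attempts in total give 87.5% success at 1.75 calls per request, and a first try plus three retries gives 93.75% at 1.875 calls. Either way the success-rate dashboard looks fine while the call volume has nearly doubled.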

I've seen this manifest as:

  • Latency p99 spikes that nobody noticed because the success rate metric was unaffected
  • Cost overruns where the retry count was 3× normal but never alerted because no individual call failed visibly
  • "The product works fine" reports from QA followed by "the product is unusably slow" reports from real users, because QA's environment had ideal conditions and triggered no retries
  • Cascading retry storms where one upstream blip caused 3× downstream load, which caused other timeouts, which caused more retries

What I check for now:

  • Retry count as a first-class metric, with alerts on outliers (not just averages)
  • Latency p99 measured after retries, not just per-attempt latency
  • Retry rate per route; if a specific endpoint has a retry rate above 10%, that's a bug, not a normal mode
  • Per-attempt logging so you can see the chain of attempts, not just the final outcome
  • Retry on content anomaly, not just retry on exception (if pattern 1 fires, that's a retry trigger)

The mental model shift: retries are not a fix; they're a deferral. They turn one immediately visible problem into many problems that surface slowly. Every retry is a signal that something is wrong upstream, and if you're not measuring the retry rate per route, you're letting the upstream issue persist invisibly.
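A sketch of a retry wrapper that keeps the whole chain of attempts and treats a content anomaly as a retry trigger. call_with_retries, is_valid, and the default backoff values are assumptions for illustration; the sleep parameter exists so tests can skip real delays:

```python
import time

def call_with_retries(fn, is_valid, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn until is_valid(result) holds. Returns (result, attempts).

    attempts records every try, so retry count and post-retry latency can be
    emitted as metrics instead of disappearing behind a final "success".
    """
    attempts = []
    start = time.monotonic()
    for n in range(1, max_attempts + 1):
        try:
            result = fn()
            # A "successful" call with bad content is still a failed attempt.
            outcome = "ok" if is_valid(result) else "content_anomaly"
        except Exception as exc:
            result, outcome = None, f"error: {exc}"
        attempts.append({"attempt": n, "outcome": outcome,
                         "elapsed_s": time.monotonic() - start})
        if outcome == "ok":
            return result, attempts
        if n < max_attempts:
            sleep(base_delay * 2 ** (n - 1))  # exponential backoff between tries
    return None, attempts
```

The caller can report len(attempts) - 1 as the retry count and the last elapsed_s as the latency the user actually experienced, per route.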

What to do with this catalog

These five aren't exhaustive. I have nine more in a longer catalog, including error keywords in stdout despite exit zero, audit-trail completeness drift, action budget per tick versus per task, expected-pattern-missing detection, and duration anomaly variants. But these five are the highest-frequency ones.

If you want a starting point in your own production AI system:

  1. Pick one pattern from this list that you suspect is happening in your own stack. Don't pick the least likely one for variety; pick the one your gut says you've already hit.
  2. Spend 30 minutes looking for evidence. Grep your retry counts, look at p99 latencies after retries, sample 10 recent agent runs and check their output content (not just exit codes), and inspect any "temporarily disabled" hooks. You'll find the pattern.
  3. Write the corrective action. Not "we'll fix this someday," but a specific code change, a specific hook, a specific check. With an owner and a date.
  4. Schedule a recurring audit. Monthly is cheap (90 minutes if your data is wired up). Quarterly is the absolute floor. The patterns rot back without an audit cadence.

If you'd rather have someone outside your team do the first audit so you have a baseline to compare against, that's literally the service I run. Reach out to admin@pixelette.tech with subject AI audit inquiry. Three tiers, from $1,500 lite (one system, top 5 findings) to $7,500 audit and workshop.

Or if you want a free first pass on the same methodology without a commitment, paste your config or agent setup into the AI Production Auditor GPT on the GPT Store. Same five-pattern framework, same 5 Cs report format, no signup beyond a ChatGPT account. Useful as a first look or when the real engagement isn't justified yet.

But you don't need to hire me or use the GPT to act on this article. The patterns above are public, the catalog they come from is openly available, and the framework that implements them is documented.

Tools I built around this

If you want the operational layer rather than just the patterns:

  • silentwatch-mcp: an open-source MCP server that surfaces patterns 1, 3, and 5 (silent failures, action budget leaks, retry anomalies) for any cron or scheduled job source. Drop-in for system cron, systemd timers, or custom JSONL run logs. MIT license, no SaaS subscription. Install with pip install silentwatch-mcp.
  • AI Production Discipline Framework: a Notion template, 74 pages, the full 14-pattern catalog plus the audit ritual, the 5 Cs post-mortem format, the hook patterns, and the database wiring. $29 one-time. Free preview of the pattern catalog.
  • AI Production Auditor (GPT Store): drop your config or agent setup in, get a 5 Cs audit report against the same pattern framework. Free with a ChatGPT account. Use it for a self-serve first pass before commissioning a paid audit.
  • 4 more MCP servers queued for the production AI deployment niche: health monitoring, skill registry vetting, upgrade orchestration, and cost tracking. Bundled when at least 3 ship.

Ending thought

Two years ago I wouldn't have called any of these patterns "silent failures." I would have called them "weird production bugs." Naming them was half the work; once you have a name for a pattern, you start spotting it everywhere, and you stop accepting "it just happens sometimes" as an explanation.

This catalog exists because every system I worked on had at least three of these, and most teams hadn't caught them yet. The patterns are public knowledge now. What you do with them is up to you.

If you found this useful, the longer catalog is here. For audit consulting: admin@pixelette.tech with subject AI audit inquiry.

Built by Temur Khan, an independent practitioner on production AI systems.