You Can’t Prompt Away Your LLM Problems
Venkat Peri · 2026-06-18 · via Towards AI

Originally published on Towards AI.

When an LLM feature breaks in production, the first instinct in the room is to open the prompt and reword it. We had that instinct too. Then we built a production assistant for financial advisors and kept a record of every LLM-related failure the system hit, along with the fix that actually closed each one. Across the whole build, almost nothing that mattered was fixable by editing a prompt. The durable fixes were architectural. The single time we tried a prompt-only fix on the hardest problem, it made things measurably worse, and we reverted it.

This is the case for treating the model as one untrusted component in a larger system, written from the failures that taught us to do it.

The fix that made it worse

Routing was our most unstable surface, and it was unstable in a way that prompt editing could not reach. The same question routed one way on one run and a different way on the next, with no code change in between. On the ambiguous edges, routing accuracy sat around 56 to 64 percent and was non-deterministic from run to run. A question like “rank my households by AUM” (assets under management) came back as a clarification request on one run and a confident answer on the next.

The obvious move was to explain the boundary better in the routing prompt. We added a block of guidance describing how to tell the categories apart. The classifier got less stable, not more, so we took the prose back out.

The reason mattered more than the symptom. The flakiness was not bad luck. It was structural, and it got worse every time we added a domain. The router guessed an abstract category first, and then code mapped that category to a concrete tool. The category was a lossy middle step. When the guess was wrong, the right tool was unreachable, and there was no signal left to recover from.

The fix was to delete the abstract category. We collapsed routing into a single stage that picks one concrete tool directly from the catalog, with the scope derived from the tool it picked rather than guessed up front. We narrowed what each decision had to weigh, so any one call discriminates among a handful of tools instead of the full set. We grounded each tool with a few example utterances carried as structured data, not as prose inside the prompt. Accuracy moved from a flaky 98 percent on the old design, down through a real regression to 72 percent right after the rewrite, and up to 100 percent on both of our evaluation suites once the grounding was in place.
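A minimal sketch of that single-stage shape, under stated assumptions: the tool names, scopes, and example utterances below are hypothetical, and the word-overlap shortlist is only a stand-in for whatever narrowing the real system uses. The point it illustrates is that the scope is read off the chosen tool and the examples travel as data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    scope: str                      # scope belongs to the tool; it is never guessed up front
    description: str
    examples: tuple[str, ...] = ()  # grounding utterances carried as structured data


# Hypothetical catalog entries, for illustration only.
CATALOG = (
    Tool("rank_households", "book", "Rank households by a metric such as AUM.",
         ("rank my households by AUM", "which households are largest")),
    Tool("list_accounts", "household", "List the accounts in one household.",
         ("show the first account", "what accounts does this household have")),
    Tool("get_holdings", "account", "Return the holdings of one account.",
         ("what does this account hold",)),
)


def shortlist(question, catalog, limit=2):
    """Narrow the decision to a handful of tools (assumed heuristic: word overlap with examples)."""
    words = set(question.lower().split())

    def overlap(tool):
        return max((len(words & set(e.lower().split())) for e in tool.examples), default=0)

    return sorted(catalog, key=overlap, reverse=True)[:limit]


def routing_request(question, candidates):
    """Build the payload for the one routing call; examples stay data, not prompt prose."""
    return {
        "question": question,
        "tools": [
            {"name": t.name, "description": t.description, "examples": list(t.examples)}
            for t in candidates
        ],
    }


def resolve(tool_name, candidates):
    """Map the model's pick back to a Tool, or None if it named something not offered."""
    return {t.name: t for t in candidates}.get(tool_name)
```

With this layout, a pick of `"rank_households"` resolves to a tool whose `scope` is already known, and a pick outside the shortlist resolves to `None`, which the caller can treat as a cue to ask for clarification.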

The improvement came from removing decisions the model did not need to make. That set the pattern for most of what followed: take work away from the model wherever code can do it, and catch what is left in deterministic guardrails.

A related failure looked like a bad model decision and was actually a missing option. Asked to “show the first account,” the model picked the nearest available tool, a holdings lookup, because no account-listing tool existed, and it invented an ordinal to make the answer fit. The model was choosing the closest thing within reach. We fixed it by building the tool it needed, not by writing a prompt that apologized for the gap.

A value the model invented is not the value you computed

A model fills a blank confidently, and confidence is the dangerous part. “Create a task for 2pm” put the string “2pm” into a field our code expected to hold a computed timestamp. The parser tried to read “2pm” as an ISO instant, threw, and the user saw a generic server error. This crash only reproduced with real model output. Every offline test we had written passed an empty argument map, and an empty map never triggers the bug. The lesson we wrote down for the next person chasing a live crash they cannot reproduce: vary the arguments the model actually produces, because empty-argument mocks hide a whole class of failure.
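The lesson can be sketched as a tiny harness that feeds a handler the argument shapes a model plausibly emits. The handler, the field names `due_at` and `subject`, and the sample arguments are all hypothetical; only `datetime.fromisoformat` is a real standard-library call.

```python
from datetime import datetime


def create_task(args):
    """Stand-in handler: parses due_at as an ISO timestamp whenever it is present."""
    due_at = args.get("due_at")
    parsed = datetime.fromisoformat(due_at) if due_at is not None else None
    return {"subject": args.get("subject"), "due_at": parsed}


# Argument maps shaped like real model output, not just the empty map.
MODEL_LIKE_ARGS = [
    {},
    {"due_at": "2026-06-18T14:00:00"},
    {"due_at": "2pm"},
    {"due_at": "tomorrow afternoon"},
    {"subject": "create_task"},
]


def find_crashing_args(handler, cases):
    """Run the handler over every case and collect the ones that raise."""
    crashes = []
    for args in cases:
        try:
            handler(args)
        except Exception as exc:
            crashes.append((args, type(exc).__name__))
    return crashes
```

Running `find_crashing_args(create_task, [{}])` reports nothing, which is exactly how an empty-argument mock hides the bug; running it over `MODEL_LIKE_ARGS` surfaces the `"2pm"` case.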

The crash was one instance of a wider habit. When asked to create a task with no subject, the model invented a subject that was a bare echo of the tool’s own name, and because the field was now non-empty, it sent the task straight to confirmation instead of asking. It presented a specific row as “the first” on a list our schema never ordered, reading fixture order as if it were real order. Given a comparison of two figures, it would do the arithmetic itself and sometimes get it wrong.

These were not instruction problems, and the fixes had the same shape every time. Detect the value the model invented and refuse it. A matcher catches tool-name-echo placeholders and forces a clarification. A check drops a non-ISO time argument and re-derives the time from the user’s words in code. A flag on unordered lists, plus a guard that returns the full list with a notice instead of asserting an ordinal. Percentages, deltas, and filtered totals moved into code, with the prompt reduced to one line: do not compute this yourself. The principle underneath all of them is to never trust a model-populated value as if it were the typed or computed value. Validate before you parse.
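Three of those guards might look roughly like this. It is a sketch, not the production code: the function names, the `subject` and `due_at` fields, and the clock-time regex are assumptions made for illustration, and a real time parser would need to handle far more phrasings.

```python
import re
from datetime import date, datetime, time

# Assumed pattern for simple clock times such as "2pm" or "10:30 am".
CLOCK = re.compile(r"\b(1[0-2]|0?[1-9])(?::([0-5]\d))?\s*(am|pm)\b", re.IGNORECASE)


def _squash(text):
    """Lowercase and strip everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def is_tool_name_echo(value, tool_name):
    """True when a model-supplied value is just the tool's own name respelled."""
    return bool(value) and _squash(value) == _squash(tool_name)


def iso_or_none(value):
    """Validate before parsing: anything that is not an ISO timestamp is dropped."""
    try:
        return datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def time_from_words(utterance, day):
    """Re-derive the time in code from the user's own words."""
    match = CLOCK.search(utterance)
    if match is None:
        return None
    hour = int(match.group(1)) % 12 + (12 if match.group(3).lower() == "pm" else 0)
    return datetime.combine(day, time(hour, int(match.group(2) or 0)))


def sanitize_task_args(args, utterance, tool_name, day):
    """Return cleaned args plus the fields that still need a clarification."""
    clean, missing = dict(args), []
    if not clean.get("subject") or is_tool_name_echo(clean["subject"], tool_name):
        clean.pop("subject", None)
        missing.append("subject")
    due = iso_or_none(clean.get("due_at"))
    if due is None:
        due = time_from_words(utterance, day)
    if due is None:
        missing.append("due_at")
    clean["due_at"] = due
    return clean, missing


def percent_of(part, whole):
    """Arithmetic stays in code; the renderer only formats the result."""
    return round(100.0 * part / whole, 2)
```

Given `{"subject": "Create Task", "due_at": "2pm"}` and the utterance "Create a task for 2pm", the sketch discards both model-populated values, recovers a 14:00 timestamp from the utterance, and flags `subject` for clarification instead of sending the task to confirmation.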

The guardrail is also code

We run a deterministic grounding check over every answer that cites figures. It compares the numbers in the rendered answer against the numbers the tool actually returned, and when the answer cites a figure with no support, it withholds the answer behind a message saying we could not fully verify it. This is the reason a fabricated “that’s 5% of your $8.3M,” which the renderer invented on three or four runs out of six, never reached a user. The render prompt already banned invented statistics, and the ban alone did not hold. What took those failures from four in six to one in six was deterministic shaping: code that drops the unreliable input and renders from the authoritative summary.
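In its simplest form, a check like this is a set comparison between cited numbers and source numbers. The sketch below assumes plain numerals only (no magnitude suffixes such as "M"), and the names `ground`, `cited_numbers`, and the fallback wording are hypothetical.

```python
import re

FALLBACK = "We could not fully verify this answer."
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def cited_numbers(text):
    """Pull every plain numeral out of the rendered answer."""
    return [float(token.replace(",", "")) for token in NUMBER.findall(text)]


def ground(answer, tool_values, rel_tol=1e-6):
    """Return the answer only if every number it cites appears in the tool result."""
    for cited in cited_numbers(answer):
        supported = any(
            abs(cited - value) <= rel_tol * max(1.0, abs(value)) for value in tool_values
        )
        if not supported:
            return FALLBACK  # withhold rather than show an unsupported figure
    return answer
```

An answer that cites only figures present in `tool_values` passes through unchanged; an answer that adds a percentage the tool never returned is replaced by the fallback message.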

That same layer was also our most common source of false holds, where a correct answer was withheld because of a verification bug. A correct answer held back is expensive precisely because the fallback message is the right behavior when verification genuinely fails, which made every false positive look legitimate.

The verification bugs were all in code, never in a prompt. A currency regex read the leading “t” of the word “today” as a trillion suffix, turned a small dollar figure into a number on the order of ten to the twelfth, failed to find it in the source, and hard-blocked about 80 percent of runs. The first attempt to fix it widened the accepted values in a test fixture, which masked the symptom rather than the cause, and we reverted that. The real fix was a magnitude token that terminates on a word boundary. The check was sign-sensitive, so a negative withdrawal rendered as a positive figure failed to ground until we compared absolute values. It scanned only numeric fields, so a figure living inside a human-readable explanation string was flagged as fabricated until we extracted numbers from source strings too.
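The three bugs are easy to reproduce in miniature. The patterns below are illustrative reconstructions, not the original regexes: `BUGGY` accepts any following `k`, `m`, `b`, or `t` as a magnitude, while `FIXED` requires the suffix to end on a word boundary. The helper names and the dollar-only extraction from strings are assumptions.

```python
import math
import re

MAGNITUDE = {"k": 1e3, "m": 1e6, "b": 1e9, "t": 1e12}

# Illustrative patterns: the buggy one lets the "t" of "today" act as a trillion suffix.
BUGGY = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)\s?([kmbt])?", re.IGNORECASE)
FIXED = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)(?:\s?([kmbt])\b)?", re.IGNORECASE)


def amounts(text, pattern=FIXED):
    """Extract dollar amounts from text, applying a magnitude suffix when one is present."""
    return [
        float(digits.replace(",", "")) * MAGNITUDE.get(suffix.lower(), 1.0)
        for digits, suffix in pattern.findall(text)
    ]


def source_numbers(value):
    """Collect numbers from nested tool output, including amounts inside strings."""
    if isinstance(value, bool):
        return []
    if isinstance(value, (int, float)):
        return [float(value)]
    if isinstance(value, str):
        return amounts(value)
    if isinstance(value, dict):
        return [n for item in value.values() for n in source_numbers(item)]
    if isinstance(value, (list, tuple)):
        return [n for item in value for n in source_numbers(item)]
    return []


def is_grounded(cited, sources, rel_tol=1e-9):
    """Compare absolute values so a sign flip in rendering does not fail the check."""
    return any(math.isclose(abs(cited), abs(s), rel_tol=rel_tol) for s in sources)
```

On the text "$500 today", `amounts(..., BUGGY)` inflates the figure by twelve orders of magnitude while the default pattern returns 500. A withdrawal stored as `-2500.0` grounds a rendered "2,500", and a fee mentioned only inside an explanation string is still found by `source_numbers`.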

Two other changes to that layer were design decisions rather than bug fixes. We stopped running the non-deterministic rubric judge on executed writes, because the confirmation there is deterministic and templated, so judging it added flake for no signal. We made the data path fail closed, so a tool-backed read that cites figures against an empty source is blocked outright, and an empty tool result returns a fixed unavailable message instead of calling the renderer with nothing to render.
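Both decisions reduce to a few branches. This is a sketch under assumptions: the message wording, the `"executed_write"` label, and the rule "any digit in the answer counts as a cited figure" are hypothetical simplifications.

```python
import re

UNAVAILABLE = "That information is unavailable right now."
UNVERIFIED = "We could not fully verify this answer."


def should_run_rubric_judge(turn_kind):
    """Executed writes get a templated confirmation, so the LLM judge is skipped for them."""
    return turn_kind != "executed_write"


def answer_read(tool_result, render):
    """Fail closed on the read path."""
    if not tool_result:
        # Empty result: fixed message, and the renderer is never called.
        return UNAVAILABLE
    answer = render(tool_result)
    source_has_numbers = any(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in tool_result.values()
    )
    if re.search(r"\d", answer) and not source_has_numbers:
        # Figures cited against a source with no figures: block outright.
        return UNVERIFIED
    return answer
```

The ordering matters: the empty-result branch returns before `render` is invoked, so there is no model output to validate in the first place.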

There is a cost to running a guardrail this strict, and we are still paying it. A required gate that flakes trains people to bypass it, which is the subject of the second part of this series.

Where prompts earned their keep

Prompt edits did real work in this system, inside a narrow band. A one-line rule so “compare A and B” is not misread as a request that needs clarification. A broadened description of an activity category so a class of question routes correctly. A shared preamble that owns the never-invent-a-link rule so it cannot drift across the several render prompts that each need it. Each of these is a small, generalizable nudge sitting on top of code that is already correct. Correctness lives in the code beneath them.
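The shared-preamble idea is simple enough to show directly. The rule text and the prompt names here are hypothetical; the only point is that the rule has a single owner and every render prompt is composed from it.

```python
# Hypothetical wording; the rule lives in exactly one place.
SHARED_PREAMBLE = "Never invent a link. Only use URLs that appear in the tool result."

# Hypothetical render prompts that each need the rule.
RENDER_PROMPTS = {
    "summary": "Summarize the tool result for the advisor in two sentences.",
    "comparison": "Compare the two items using only the figures provided.",
    "task_confirmation": "Confirm the task details back to the advisor.",
}


def build_prompt(name):
    """Compose a render prompt so the shared rule cannot drift between prompts."""
    return f"{SHARED_PREAMBLE}\n\n{RENDER_PROMPTS[name]}"
```

Editing the rule means editing `SHARED_PREAMBLE` once, rather than hunting for copies across prompts.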

What we have not solved

We did not close all of it. One residual case where the renderer occasionally fabricates a derived percentage is still left to the prompt, with the grounding check as the backstop. The grounding layer is tuned conservatively, and reducing its false holds without reopening a fabrication gap is a standing piece of work. There is a model-variance lever we have not pulled: running the tool-selection call at temperature zero with a self-consistency vote. We deferred it because example-grounding already reached 100 percent on our current suites, which also means the lever is untested against the cases that would actually exercise it.
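For reference, the vote itself is small. This sketch assumes a caller-supplied `select_tool` function that makes one model call and returns a tool name; the sample count and vote threshold are arbitrary placeholders, and nothing here has been run against the system described above.

```python
from collections import Counter


def vote_on_tool(select_tool, question, samples=5, min_votes=3):
    """Call the selector several times and keep the majority pick, or abstain."""
    picks = [select_tool(question) for _ in range(samples)]
    winner, count = Counter(picks).most_common(1)[0]
    # Abstaining (None) lets the caller fall back to a clarification request.
    return winner if count >= min_votes else None
```

A split vote is itself a signal: it marks the question as sitting on an ambiguous edge, which is where the untested cases would be.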

We work on AI and data infrastructure for wealth management at Advisor360°. We know these failure modes at this resolution because we kept the receipts: every incident here came out of the build history of a system we shipped, not from a whiteboard.

The reflex to fix an LLM by talking to it differently is strong because the prompt is the part of the system that feels like it is in our hands. The work that held up was less satisfying than that. We took decisions away from the model, we validated what it returned before we trusted it, and we let deterministic code block the outcomes we could not afford. The model is one component, and treating its output as untrusted input is what made the rest of the system safe to build on.

This is the first of a two-part series. Part 2 covers what happens when the test harness meant to catch these failures is itself non-deterministic: the rerun-until-green trap, accuracy floors that stay green while a real regression merges, and LLM judges that disagree with themselves run to run.


Note: Article content contains the views of the contributing authors and not Towards AI.