Agentic AI FinOps: Why Claude Agent Loops Cost 30x a Single Inference
Muskan · 2026-06-15 · via DEV Community

TL;DR A single Claude API call is predictable. An agent with tool access is not.

Spec sheets price agents the way they price single calls. Architects look at Claude Sonnet 4.5 at $3 per million input tokens, multiply by one expected reasoning context of about 60,000 tokens, and tell finance the agent will cost $0.20 per invocation. Six weeks after launch, the cloud bill arrives at $50,000 a month for a fleet that processes 10,000 invocations a month. The cost per invocation, against the math nobody redid, is $5.

The 30x markup is not bad math. It is a structural property of how agent loops consume tokens. Each tool call replays most of the prior context. Each parse error retries the call. Each sub-agent spawn carries its own full context. The token bill grows quadratically with tool-call count, not linearly, and the production reality of parse retries and tool description bloat compounds the curve further.
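A minimal sketch of why replay makes the bill grow roughly quadratically. It assumes every step resends a fixed base prompt plus all tool results produced so far; the function name and the token sizes are illustrative assumptions, not measurements from any real agent.

```python
def total_input_tokens(steps: int, base_tokens: int, tokens_per_result: int) -> int:
    """Sum the input tokens billed across a loop where step k resends
    the base prompt plus the (k - 1) tool results produced so far."""
    return sum(base_tokens + (k - 1) * tokens_per_result for k in range(1, steps + 1))


# Doubling the number of steps roughly quadruples the billed input tokens.
for steps in (2, 4, 8, 16):
    print(steps, total_input_tokens(steps, base_tokens=2_000, tokens_per_result=5_000))
```

The sum contains the term tokens_per_result × steps × (steps − 1) / 2, which is where the quadratic shape comes from.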

This post is about where that 30x markup comes from in tokens, how to instrument the cost at the right level (per tool call, not per invocation), and what closed-loop budget enforcement looks like. The pattern composes with read-only MCP servers and LLM FinOps per-feature token budgets without re-architecting either.

The 30x markup nobody priced in

The cost asymmetry between a single call and an agent loop is the defining surprise of 2026 production AI. Teams that budget for a single call price the agent as if it were one call, and the bill diverges silently for weeks before anyone catches the curve.

| Shape | Spec-sheet cost | Production cost | Why the gap |
| --- | --- | --- | --- |
| Single call, 8k input + 1k output | $0.04 | $0.04 | None |
| 4-tool agent, 25k context, 4 tool calls | $0.30 | $0.85 | Context replay grows quadratically |
| 8-tool agent, 50k context, 8 tool calls | $0.20 quoted | $4.00 | Context replay + tool description bloat |
| Multi-agent with 3 sub-agent spawns | $0.50 quoted | $7.50 | Each sub-agent carries its own context window |

The 8-tool agent line is the hard one. Architects routinely under-quote it because the system prompt feels small (8 tools at maybe 200 tokens each is "just" 1,600 tokens). The trap is that these 1,600 tokens replay at every tool-call step. Across 8 tool calls, that is 12,800 tokens of system prompt alone, before any user message or tool result. The full context (system + user + tool results so far) at step 8 of an 8-tool agent commonly hits 80,000 input tokens for that single step.

Anatomy of one agent loop

Walk through the token math for a realistic 8-tool agent loop. The agent answers a question by calling 4 read tools, processing the results, calling 4 more, and synthesizing an answer.


Summed across the eight steps in the table below, the loop consumes 340,000 input tokens. At Claude Sonnet 4.5 input pricing of $3 per million tokens, that is $1.02 in context replay alone. Add 6,000 output tokens (reasoning at each step plus the final synthesis) at $15 per million for $0.09. The base cost of one clean invocation: $1.11. The trap is that the spec sheet quoted $0.20, because the architect did the math from "$3 per million times one 60k-token reasoning context."

| Step | Input tokens | Cumulative input cost |
| --- | --- | --- |
| 1 (system + tools + user) | 12,000 | $0.036 |
| 2 (reasoning, no tool yet) | 13,000 | $0.075 |
| 3 (tool result 1 replayed) | 22,000 | $0.141 |
| 4 (tool 2 reasoning) | 30,000 | $0.231 |
| 5 (tools 3-4 results) | 48,000 | $0.375 |
| 6 (tool 5 reasoning) | 60,000 | $0.555 |
| 7 (tools 6-7 results) | 72,000 | $0.771 |
| 8 (final synth, all context) | 83,000 | $1.020 |
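The table can be reproduced in a few lines, which is a useful check before handing any number to finance. This sketch uses only the per-step token counts and prices quoted in this post; the variable and function names are illustrative.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens (figure quoted in the post)
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens (figure quoted in the post)

# Input tokens billed at each of the eight steps, copied from the table above.
STEP_INPUT_TOKENS = [12_000, 13_000, 22_000, 30_000, 48_000, 60_000, 72_000, 83_000]
OUTPUT_TOKENS = 6_000


def cumulative_input_cost(step_tokens):
    """Return the running input cost in USD after each step."""
    costs, running_tokens = [], 0
    for tokens in step_tokens:
        running_tokens += tokens
        costs.append(round(running_tokens * INPUT_PRICE_PER_MTOK / 1_000_000, 3))
    return costs


input_cost = cumulative_input_cost(STEP_INPUT_TOKENS)[-1]
output_cost = OUTPUT_TOKENS * OUTPUT_PRICE_PER_MTOK / 1_000_000
total_cost = round(input_cost + output_cost, 2)
print(sum(STEP_INPUT_TOKENS), input_cost, total_cost)
```

The point of keeping the per-step list explicit is that the single-context estimate (one 60k-token call) and the replay total (340k tokens) can be compared side by side.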

That is the clean path. Production paths are not clean.

The four cost multipliers

Four failure modes inflate the clean number into the $4-8 production reality.

Token bloat and retry costs

Tool description bloat. Each tool description in the system prompt replays at every step. A 200-token description is 200 tokens at step 1, plus another 200 at step 2 replay, plus 200 at step 3, and so on. Across 8 tool calls, a single 200-token tool description costs 1,600 input tokens, or about $0.005. Five over-described tools cost an extra $0.024 per invocation. At 10,000 invocations per day, that is $7,200 per month for tool descriptions that nobody trimmed.
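The arithmetic behind that figure, as a small sketch. The helper name is hypothetical, and the 30-day month is an assumption; the token counts and price are the ones used in the paragraph above.

```python
INPUT_PRICE_PER_MTOK = 3.00  # USD per million input tokens, as quoted in the post


def description_replay_cost(desc_tokens: int, replays: int, tools: int = 1) -> float:
    """USD spent per invocation resending tool descriptions at every step."""
    return desc_tokens * replays * tools * INPUT_PRICE_PER_MTOK / 1_000_000


per_invocation = description_replay_cost(desc_tokens=200, replays=8, tools=5)
per_month = per_invocation * 10_000 * 30  # 10,000 invocations a day, assumed 30-day month
print(f"${per_invocation:.3f} per invocation, ${per_month:,.0f} per month")
# → $0.024 per invocation, $7,200 per month
```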

Parse-error retries. Tool calls return JSON. Production tool-call parse failure rates run 5 to 15 percent depending on the schema strictness and the model. Each parse failure replays the full prior context for the retry. A 10 percent retry rate on an 8-tool agent means the average invocation has 0.8 retries, each costing roughly $0.10 to $0.30 depending on which step failed. That is another $0.08 to $0.24 per invocation on average.
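A rough expected-value sketch of the retry overhead. It assumes each tool call fails independently and is retried once, which simplifies real retry policies; the function name is hypothetical and the inputs are the figures from the paragraph above. The product lands at about the range quoted there.

```python
def expected_retry_cost(tool_calls: int, failure_rate: float,
                        retry_cost_low: float, retry_cost_high: float):
    """Return (expected retries, low USD, high USD) per invocation.

    Assumes independent failures and a single retry per failure.
    """
    expected_retries = tool_calls * failure_rate
    return (expected_retries,
            expected_retries * retry_cost_low,
            expected_retries * retry_cost_high)


retries, low, high = expected_retry_cost(8, 0.10, 0.10, 0.30)
print(f"{retries:.1f} retries, ${low:.2f} to ${high:.2f} per invocation")
```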

Sub-agent and result sprawl

Sub-agent spawning. A parent agent that spawns 3 specialist sub-agents to handle subtasks now has 4 distinct context windows in flight. If the parent holds 30k tokens and each sub-agent holds 20k, the orchestration carries 90k tokens of context across four windows, about 3x what the parent holds alone, plus the inter-agent message-passing overhead. A 3-spawn pattern that returns to the parent for 2 more tool calls easily reaches $5 per invocation on its own.

Context window growth from verbose tool results. A tool that returns 5,000 tokens of formatted output gets replayed at every subsequent step. If that tool is called at step 2, its 5,000 tokens replay at steps 3 through 8, contributing 30,000 input tokens to the total. The fix is summarization at the tool boundary, but most teams ship the raw output by default.
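One way to sketch the tool-boundary fix, under loud assumptions: plain truncation stands in for a real summarizer, the in-memory dict stands in for real storage, and every name here is hypothetical. The second helper just restates the replay arithmetic from the paragraph above.

```python
import hashlib

RESULT_STORE: dict[str, str] = {}  # hypothetical in-memory store; swap for real storage


def compact_tool_result(raw: str, max_chars: int = 400) -> str:
    """Keep the full result out of the context: store it by reference and
    return a truncated preview plus the key needed to fetch the rest."""
    key = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    RESULT_STORE[key] = raw
    if len(raw) <= max_chars:
        return raw
    return f"{raw[:max_chars]}\n[truncated; full result stored as ref:{key}]"


def replay_tokens(result_tokens: int, called_at_step: int, last_step: int) -> int:
    """Input tokens a result adds by being resent at every later step."""
    return result_tokens * (last_step - called_at_step)


print(replay_tokens(5_000, called_at_step=2, last_step=8))  # raw result
print(replay_tokens(500, called_at_step=2, last_step=8))    # after compaction to ~500 tokens
```

Storing by reference keeps the full output available to a follow-up tool call without paying to resend it at every step.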

| Failure mode | Mechanism | Typical multiplier | Fix |
| --- | --- | --- | --- |
| Tool description bloat | 200-token description × 8 replays | +0.5x to +1x | Trim descriptions to 60-80 tokens; lazy-load detailed schemas |
| Parse-error retries | 5-15% retry rate × full context | +0.2x to +0.4x | Strict JSON schema; structured output mode |
| Sub-agent spawning | N parallel context windows | +2x to +4x | Single agent with conditional routing |
| Verbose tool results | 5,000-token result × N step replays | +1x to +2x | Summarize at tool boundary; store full result by reference |

Multipliers combined

A clean 8-tool invocation costs $1.11, of which $1.02 is input replay. A production invocation with all four multipliers active hits $4 to $8. Against the $0.20 quote, that is 20x to 40x, which is the structural source of the 30x gap.

Per-tool-call attribution that dashboards miss

Most agent frameworks log token totals per invocation, not per step. The dashboard shows "average cost per invocation: $4.80" without revealing that step 6, with its verbose tool result, drives 60 percent of that cost. Teams cannot fix what they cannot see, so they argue about whether to switch models when the actual win is at the tool-result-summarization step.

The fix is per-step token attribution. OpenTelemetry's GenAI semantic conventions cover the pieces: a gen_ai.client.token.usage metric, an execute_tool span per tool call, and gen_ai.usage.input_tokens and gen_ai.usage.output_tokens attributes on each model-call span. Log every step. Aggregate by tool call, not by invocation. Now the dashboard says "tool aws:get_cost_and_usage averages 8,400 input tokens across calls" and the team trims that tool's response shape.
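A minimal sketch of the aggregation itself, independent of any telemetry backend. The record fields, the lookup_owner tool, and the sample numbers are made up for illustration; in practice the records would come from whatever per-step spans or logs the agent framework emits.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StepRecord:
    """One model call inside an agent loop (field names are illustrative)."""
    invocation_id: str
    step: int
    tool_name: str       # tool whose result this step consumed, or "none"
    input_tokens: int
    output_tokens: int


def average_input_tokens_by_tool(records):
    """Aggregate by tool rather than by invocation."""
    totals, counts = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r.tool_name] += r.input_tokens
        counts[r.tool_name] += 1
    return {tool: totals[tool] / counts[tool] for tool in totals}


records = [
    StepRecord("inv-1", 3, "aws:get_cost_and_usage", 8_000, 300),
    StepRecord("inv-2", 3, "aws:get_cost_and_usage", 8_800, 280),
    StepRecord("inv-1", 4, "lookup_owner", 1_200, 150),
]
averages = average_input_tokens_by_tool(records)
for tool, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{tool}: {avg:,.0f} avg input tokens")
```

Sorting by average input tokens puts the token-heavy tools at the top, which is the view an invocation-level total hides.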


Per-tool-call attribution surfaces three patterns invocation-level dashboards never show: which tools are token-heavy, which retry most, and which produce verbose results that compound downstream. Teams that ship attribution before scaling fleets avoid the FinOps surprise. Teams that scale first and instrument second discover the bill in panic.

Soft budget caps that do not kill the task

The naive enforcement is a hard cap: at 50,000 input tokens, abort. The agent stops mid-task, partial work is wasted, the user sees an error, and the user retries, re-spending everything the cap saved, plus some. Hard caps are correct in spec and wrong in practice.

The better pattern is a soft cap delivered as an in-context system message at 80 percent of budget. The agent receives a directive: "You have used 40,000 of 50,000 budgeted tokens. Synthesize the answer with current information rather than calling more tools." The agent finishes gracefully, the user gets an answer, and the budget is enforced without the partial-work waste.
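A sketch of the soft cap as a pure function, assuming the agent loop tracks cumulative tokens and can append a system message before the next model call. The function name and the percentage parameter are assumptions; how the message is injected depends on the framework in use.

```python
from typing import Optional


def soft_cap_message(used_tokens: int, budget_tokens: int,
                     threshold_percent: int = 80) -> Optional[str]:
    """Return a directive to append to the context once usage crosses the
    soft threshold, or None while the loop is still under it."""
    # Integer arithmetic avoids floating-point edge cases at the boundary.
    if used_tokens * 100 < budget_tokens * threshold_percent:
        return None
    return (
        f"You have used {used_tokens:,} of {budget_tokens:,} budgeted tokens. "
        "Synthesize the answer with current information rather than calling more tools."
    )


print(soft_cap_message(25_000, 50_000))
print(soft_cap_message(40_000, 50_000))
```

Returning None below the threshold lets the caller skip the injection entirely, so the directive itself does not add replayed tokens to cheap invocations.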

| Cap type | Behavior | Cost outcome | UX outcome |
| --- | --- | --- | --- |
| No cap | Agent runs to completion regardless | $4-8 average, $20+ tail | Best UX, worst bill |
| Hard cap (abort) | Truncates mid-task | Caps spend at threshold | Wasted partial work; user retries |
| Soft cap (in-context) | Agent finishes with what it has | Caps spend with completion | Slightly degraded answer, budget held |

Soft caps composed with per-tool-call attribution produce a system where a 95th-percentile-cost invocation auto-degrades to a 50th-percentile-cost answer instead of producing a 99th-percentile bill.

Closed-loop agent FinOps

The pattern composes with the existing closed-loop work. Closed-loop FinOps for cloud cost runs detect-decide-act-verify in 5 minutes. The same loop applies to agent invocations, just at a 5-second timescale.

Detect: per-loop input tokens exceed the p99 baseline for this agent class.

Decide: route the next tool call through a smaller model (Haiku instead of Sonnet) or summarize the prior tool result before continuing.

Act: continue the loop with the route or summarization in place.

Verify: the invocation completed under budget with an acceptable answer.
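The four stages can be sketched as two small functions. Everything here is an assumption made for illustration: the state fields, the action names, and the 2,000-token cutoff for calling a result "verbose". A real loop would feed these from live token counts.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    input_tokens_so_far: int
    last_result_tokens: int
    budget_tokens: int


def decide(state: LoopState, p99_baseline_tokens: int,
           verbose_result_tokens: int = 2_000) -> str:
    """Detect + decide: choose the action for the next step of the loop."""
    if state.input_tokens_so_far <= p99_baseline_tokens:
        return "continue"                 # within the normal range
    if state.last_result_tokens > verbose_result_tokens:
        return "summarize_last_result"    # shrink what gets replayed
    return "route_to_smaller_model"       # e.g. Haiku instead of Sonnet


def verify(state: LoopState, answer_acceptable: bool) -> bool:
    """Verify: the invocation finished under budget with a usable answer."""
    return state.input_tokens_so_far <= state.budget_tokens and answer_acceptable


state = LoopState(input_tokens_so_far=62_000, last_result_tokens=5_000, budget_tokens=80_000)
print(decide(state, p99_baseline_tokens=50_000))
print(verify(state, answer_acceptable=True))
```

Summarization is tried before model routing in this sketch because it reduces every later step's input, while routing only makes each of those steps cheaper.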

The MCP layer is where this composes cleanly. A policy-aware governance MCP reports per-tool-call cost back into the agent's context, so the agent can make cost-aware budget decisions during the loop. The same agent can also degrade gracefully because the cost signal is in its context, not buried in an external dashboard the agent cannot read.

This works when the team commits to per-tool-call attribution as a first-class observability layer. It breaks when teams treat agent cost as an after-the-fact budget review and only instrument when the bill arrives. The 30x markup is not a model problem or a pricing problem. It is a visibility problem with a structural cost shape, and the fix is the same shape as every other FinOps closed-loop the cloud has needed for the last decade.


Drop a comment if you've audited a similar spike. What was the dominant cause for your team? Share what worked or what blew up.