Agentic AI FinOps: Why Claude Agent Loops Cost 30x a Single Inference

TL;DR: A single Claude API call is predictable. An agent with tool access is not.

Spec sheets price agents the way they price single calls. Architects look at Claude Sonnet 4.5 at $3 per million input tokens, multiply by one expected 60,000-token reasoning context per task, and tell finance the agent will cost about $0.20 per invocation. Six weeks after launch, the actual cost per invocation, against the math nobody redid, is $5. For a fleet that processes 10,000 daily invocations, that is a run rate of roughly $1.5 million a month.

The 30x markup is not bad math. It is a structural property of how agent loops consume tokens. Each tool call replays most of the prior context. Each parse error retries the call. Each sub-agent spawn carries its own full context. Because every step re-sends a context that has itself grown by one more tool result, the token bill grows quadratically with tool-call count, not linearly, and the production reality of parse retries and tool description bloat compounds the curve further.

This post is about where that 30x markup comes from in tokens, how to instrument the cost at the right level (per tool call, not per invocation), and what closed-loop budget enforcement looks like. The pattern composes with read-only MCP servers and LLM FinOps per-feature token budgets without re-architecting either.

The 30x markup nobody priced in

The cost asymmetry between a single call and an agent loop is the defining surprise of 2026 production AI. Teams that spec a single call price the agent as if it were a single call, and the bill diverges silently for weeks before anyone catches the curve.

Shape	Spec-sheet cost	Production cost	Why the gap
Single call, 8k input + 1k output	$0.04	$0.04	None
4-tool agent, 25k context, 4 tool calls	$0.30	$0.85	Context replay grows quadratically
8-tool agent, 50k context, 8 tool calls	$0.20	$4.00	Context replay + tool description bloat
Multi-agent with 3 sub-agent spawns	$0.50	$7.50	Each sub-agent carries its own context window

The 8-tool agent line is the hard one. Architects routinely under-quote it because the system prompt feels small (8 tools at maybe 200 tokens each is "just" 1,600 tokens). The trap is that this 1,600 tokens replays at every tool-call step. Across 8 tool calls, that is 12,800 tokens of system prompt alone, before any user message or tool result.

The full context (system + user + tool results so far) at step 8 of an 8-tool agent commonly hits 80,000 input tokens for that single step.

Anatomy of one agent loop

Walk through the token math for a realistic 8-tool agent loop. The agent answers a question by calling 4 read tools, processing the results, calling 4 more, and synthesizing an answer.

At Claude Sonnet 4.5 input pricing of $3 per million tokens, the 340,000 input tokens this loop consumes across its eight steps (itemized in the table below) cost $1.02 in context replay alone. Add 6,000 output tokens (reasoning at each step plus the final synthesis) at $15 per million for $0.09. The base cost of one clean invocation: $1.11. The spec sheet quoted $0.20 because the architect did the math as "$3 per million times one 60k-token reasoning context," pricing the context once instead of once per step.

Step	Input tokens	Cumulative cost
1 (system + tools + user)	12,000	$0.036
2 (reasoning, no tool yet)	13,000	$0.075
3 (tool result 1 replayed)	22,000	$0.141
4 (tool 2 reasoning)	30,000	$0.231
5 (tools 3-4 results)	48,000	$0.375
6 (tool 5 reasoning)	60,000	$0.555
7 (tools 6-7 results)	72,000	$0.771
8 (final synth, all context)	83,000	$1.020
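
The running totals in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the per-step token counts and prices quoted in this post; the function and constant names are made up for illustration.

```python
# Per-step input token counts copied from the table above.
STEP_INPUT_TOKENS = [12_000, 13_000, 22_000, 30_000, 48_000, 60_000, 72_000, 83_000]

INPUT_USD_PER_MTOK = 3.00    # Sonnet 4.5 input price quoted in this post
OUTPUT_USD_PER_MTOK = 15.00  # output price quoted in this post
OUTPUT_TOKENS = 6_000        # total output tokens assumed in this post


def cumulative_input_cost(step_tokens, usd_per_mtok):
    """Return the running input cost in USD after each step."""
    running_tokens = 0
    costs = []
    for tokens in step_tokens:
        running_tokens += tokens  # every step is billed for its full context
        costs.append(round(running_tokens * usd_per_mtok / 1_000_000, 3))
    return costs


input_costs = cumulative_input_cost(STEP_INPUT_TOKENS, INPUT_USD_PER_MTOK)
output_cost = OUTPUT_TOKENS * OUTPUT_USD_PER_MTOK / 1_000_000
total_cost = round(input_costs[-1] + output_cost, 2)

print(input_costs)  # running input cost per step
print(total_cost)   # input replay plus output
```

Swapping in your own per-step token counts from production traces is the quickest way to see how far a real loop drifts from the quoted number.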

That is the clean path. Production paths are not clean.

The four cost multipliers

Four failure modes inflate the clean number into the $4-8 production reality.

Token bloat and retry costs

Tool description bloat. Each tool description in the system prompt replays at every step. A 200-token description is 200 tokens at step 1, plus another 200 at step 2 replay, plus 200 at step 3, and so on. Across 8 tool calls, a single 200-token tool description costs 1,600 input tokens, or about $0.005. Five over-described tools cost an extra $0.024 per invocation.

At 10,000 invocations per day, that is $7,200 per month for tool descriptions that nobody trimmed.
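
The description-bloat arithmetic is easy to check and to rerun with your own numbers. A rough sketch, assuming the $3 per million input price and a 30-day month; the helper name is hypothetical.

```python
INPUT_USD_PER_MTOK = 3.00  # input price quoted in this post


def description_replay_cost(desc_tokens, n_tools, steps,
                            usd_per_mtok=INPUT_USD_PER_MTOK):
    """USD spent per invocation replaying tool descriptions at every step."""
    replayed_tokens = desc_tokens * n_tools * steps
    return replayed_tokens * usd_per_mtok / 1_000_000


# Five tools at 200 tokens each, replayed across 8 steps.
per_invocation = description_replay_cost(desc_tokens=200, n_tools=5, steps=8)

# 10,000 invocations per day over an assumed 30-day month.
per_month = per_invocation * 10_000 * 30

print(round(per_invocation, 3), round(per_month))
```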

Parse-error retries. The model emits each tool call as JSON, and production tool-call parse failure rates run 5 to 15 percent depending on schema strictness and the model. Each parse failure replays the full prior context for the retry. A 10 percent retry rate on an 8-tool agent means the average invocation has 0.8 retries, each costing roughly $0.10 to $0.30 depending on which step failed.

Averaged out, that is roughly another $0.08 to $0.24 per invocation.
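
The retry estimate is an expected-value calculation. The sketch below assumes failures are independent and that each failed call is retried once; both are simplifications, and the function name is hypothetical. With the figures above it lands at roughly $0.08 to $0.24 per invocation, close to the range quoted.

```python
def expected_retry_cost(tool_calls, retry_rate, cost_per_retry_usd):
    """Expected retry spend per invocation.

    Assumes independent failures and a single retry per failed call.
    """
    expected_retries = tool_calls * retry_rate
    return expected_retries * cost_per_retry_usd


# 8 tool calls, 10 percent retry rate, $0.10 to $0.30 per retry.
low = expected_retry_cost(8, 0.10, 0.10)
high = expected_retry_cost(8, 0.10, 0.30)

print(round(low, 2), round(high, 2))
```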

Sub-agent and result sprawl

Sub-agent spawning. A parent agent that spawns 3 specialist sub-agents to handle subtasks now has 4 distinct context windows in flight. If the parent holds 30k tokens and each sub-agent holds 20k, the orchestration carries 90k tokens of live context, three times what the parent holds alone, and every one of those windows replays on each of its own tool calls, plus the inter-agent message-passing overhead. A 3-spawn pattern that returns to the parent for 2 more tool calls easily reaches $5 per invocation on its own.

Context window growth from verbose tool results. A tool that returns 5,000 tokens of formatted output gets replayed at every subsequent step. If that tool is called at step 2, its 5,000 tokens replay at steps 3 through 8, contributing 30,000 input tokens to the total. The fix is summarization at the tool boundary, but most teams ship the raw output by default.

Failure mode	Mechanism	Typical multiplier	Fix
Tool description bloat	200-token description × 8 replays	+0.5x to +1x	Trim descriptions to 60-80 tokens; lazy-load detailed schemas
Parse-error retries	5-15% retry rate × full context	+0.2x to +0.4x	Strict JSON schema; structured output mode
Sub-agent spawning	N parallel context windows	+2x to +4x	Single agent with conditional routing
Verbose tool results	5,000-token result × N step replays	+1x to +2x	Summarize at tool boundary; store full result by reference
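
The "store full result by reference" fix from the last row can be sketched in a few lines. This is an illustration, not a production design: it truncates where a real system would more likely summarize, it measures characters instead of tokens, and the in-memory dictionary stands in for durable storage. All names are hypothetical.

```python
import hashlib

# In-memory stand-in for a blob store; a real system would use durable storage.
RESULT_STORE: dict[str, str] = {}


def compact_tool_result(raw: str, max_chars: int = 400) -> str:
    """Store the full tool output and return a short stub for the context."""
    if len(raw) <= max_chars:
        return raw  # small results go into the context unchanged
    ref = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    RESULT_STORE[ref] = raw
    preview = raw[:max_chars]  # truncation as a placeholder for summarization
    return f"{preview}\n[truncated: {len(raw)} chars total, full result ref={ref}]"


def fetch_full_result(ref: str) -> str:
    """Look up a stored result when the agent explicitly asks for it."""
    return RESULT_STORE[ref]
```

Only the stub replays at later steps; the agent pays for the full result again only if it asks for it by reference.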

Multipliers combined

A clean 8-tool invocation costs about $1.11, of which $1.02 is input replay. A production invocation with all four multipliers active hits $4 to $8. That is the structural source of the 30x gap.

Per-tool-call attribution that dashboards miss

Most agent frameworks log token totals per invocation, not per step. The dashboard shows "average cost per invocation: $4.80" without revealing that step 6, with its verbose tool result, drives 60 percent of that cost. Teams cannot fix what they cannot see, so they argue about whether to switch models when the actual win is summarizing one tool result.

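The fix is per-step token attribution. OpenTelemetry's GenAI semantic conventions, which are still evolving, cover the pieces: a span per tool execution tagged with gen_ai.tool.name, input and output token counts recorded as the gen_ai.usage.input_tokens and gen_ai.usage.output_tokens span attributes, and a gen_ai.client.token.usage metric. Log every step. Aggregate by tool call, not by invocation.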

Now the dashboard says "tool aws:get_cost_and_usage averages 8,400 input tokens across calls" and the team trims that tool's response shape.
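
The aggregation itself is small once per-step records exist. The sketch below uses plain Python and a hypothetical StepRecord shape, not the OpenTelemetry schema; in practice the records would come from whatever your tracing backend exports.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepRecord:
    """One agent step. Field names are illustrative, not a standard schema."""
    invocation_id: str
    step: int
    tool: str
    input_tokens: int
    output_tokens: int
    retried: bool = False


def aggregate_by_tool(records):
    """Group step records by tool and report average tokens and retry rate."""
    by_tool = defaultdict(list)
    for rec in records:
        by_tool[rec.tool].append(rec)
    report = {}
    for tool, recs in by_tool.items():
        report[tool] = {
            "calls": len(recs),
            "avg_input_tokens": mean(r.input_tokens for r in recs),
            "retry_rate": sum(r.retried for r in recs) / len(recs),
        }
    return report


# Made-up sample records for illustration only.
sample = [
    StepRecord("inv-1", 3, "aws:get_cost_and_usage", 8_000, 300),
    StepRecord("inv-2", 3, "aws:get_cost_and_usage", 8_800, 280, retried=True),
    StepRecord("inv-1", 4, "lookup_owner", 1_200, 90),
]
report = aggregate_by_tool(sample)
print(report["aws:get_cost_and_usage"])
```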

Per-tool-call attribution surfaces three patterns invocation-level dashboards never show: which tools are token-heavy, which retry most, and which produce verbose results that compound downstream. Teams that ship attribution before scaling fleets avoid the FinOps surprise. Teams that scale first and instrument second discover the bill in a panic.

Soft budget caps that do not kill the task

The naive enforcement is a hard cap: at 50,000 input tokens, abort. The agent stops mid-task, partial work is wasted, the user sees an error, and the user retries, re-spending everything the cap saved, plus some. Hard caps are correct in spec and wrong in practice.

The better pattern is a soft cap delivered as an in-context system message at 80 percent of budget. The agent receives a directive: "You have used 40,000 of 50,000 budgeted tokens. Synthesize the answer with current information rather than calling more tools." The agent finishes gracefully, the user gets an answer, and the budget is enforced without the partial-work waste.
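
A soft cap reduces to one check per loop iteration. This is a minimal sketch with a hypothetical function name; how the returned text gets injected into the agent's context depends on the framework and is left out here.

```python
from typing import Optional


def soft_cap_message(used_tokens: int, budget_tokens: int,
                     threshold: float = 0.80) -> Optional[str]:
    """Return a budget directive once usage crosses the threshold, else None."""
    if used_tokens < threshold * budget_tokens:
        return None  # under the soft cap: let the agent keep working
    return (
        f"You have used {used_tokens:,} of {budget_tokens:,} budgeted tokens. "
        "Synthesize the answer with current information rather than "
        "calling more tools."
    )


print(soft_cap_message(30_000, 50_000))
print(soft_cap_message(40_000, 50_000))
```

Keeping a hard cap somewhere above the soft one is still reasonable as a backstop for agents that ignore the directive.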

Cap type	Behavior	Cost outcome	UX outcome
No cap	Agent runs to completion regardless	$4-8 average, $20+ tail	Best UX, worst bill
Hard cap (abort)	Truncates mid-task	Caps spend at threshold	Wasted partial work; user retries
Soft cap (in-context)	Agent finishes with what it has	Caps spend with completion	Slightly degraded answer, budget held

Soft caps composed with per-tool-call attribution produce a system where a 95th-percentile-cost invocation auto-degrades to a 50th-percentile-cost answer instead of producing a 99th-percentile bill.

Closed-loop agent FinOps

The pattern composes with the existing closed-loop work. Closed-loop FinOps for cloud cost runs detect-decide-act-verify in 5 minutes. The same loop applies to agent invocations, just at a 5-second timescale.

Detect: per-loop input tokens exceed the p99 baseline for this agent class. Decide: route the next tool call through a smaller model (Haiku instead of Sonnet) or summarize the prior tool result before continuing. Act: continue the loop with the route or summarization in place. Verify: the invocation completed under budget with an acceptable answer.
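
The four stages map onto four small functions. The sketch below is one way to wire them, under assumptions: the model labels are placeholders and not real model IDs, the 2,000-token summarization threshold is invented, and the baseline and budget values would come from your own attribution data.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    input_tokens_used: int
    p99_baseline_tokens: int  # from historical per-tool-call attribution
    budget_tokens: int
    model: str = "sonnet"     # placeholder label, not a real model ID


def detect(state: LoopState) -> bool:
    """Detect: usage has exceeded the p99 baseline for this agent class."""
    return state.input_tokens_used > state.p99_baseline_tokens


def decide(last_result_tokens: int, summarize_over: int = 2_000) -> str:
    """Decide: summarize a verbose prior result, otherwise use a smaller model."""
    return "summarize" if last_result_tokens > summarize_over else "downshift"


def act(state: LoopState, action: str) -> None:
    """Act: apply the routing change; summarization is left to the caller."""
    if action == "downshift":
        state.model = "haiku"  # placeholder label for the smaller model


def verify(state: LoopState) -> bool:
    """Verify: the invocation stayed under budget (answer quality not checked)."""
    return state.input_tokens_used <= state.budget_tokens


def control_step(state: LoopState, last_result_tokens: int) -> str:
    """Run detect, decide, and act for one loop iteration."""
    if not detect(state):
        return "continue"
    action = decide(last_result_tokens)
    act(state, action)
    return action
```

Verification of answer quality is the hard part and is out of scope for this sketch; the verify step here only checks the budget.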

The MCP layer is where this composes cleanly. A policy-aware governance MCP reports per-tool-call cost back into the agent's context, so the agent can make budget-aware decisions during the loop. The same agent can also degrade gracefully because the cost signal is in its context, not buried in an external dashboard the agent cannot read.

This works when the team commits to per-tool-call attribution as a first-class observability layer. It breaks when teams treat agent cost as an after-the-fact budget review and only instrument when the bill arrives. The 30x markup is not a model problem or a pricing problem. It is a visibility problem with a structural cost shape, and the fix is the same shape as every other FinOps closed loop the cloud has needed for the last decade.

Drop a comment if you've audited a similar spike. What was the dominant cause for your team? Share what worked or what blew up.
