惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

V
Vulnerabilities – Threatpost
F
Fortinet All Blogs
Vercel News
Vercel News
C
Check Point Blog
P
Privacy International News Feed
Know Your Adversary
Know Your Adversary
Google DeepMind News
Google DeepMind News
T
Troy Hunt's Blog
TaoSecurity Blog
TaoSecurity Blog
I
Intezer
T
The Exploit Database - CXSecurity.com
Security Archives - TechRepublic
Security Archives - TechRepublic
H
Hacker News: Front Page
P
Proofpoint News Feed
GbyAI
GbyAI
Engineering at Meta
Engineering at Meta
Attack and Defense Labs
Attack and Defense Labs
S
Security @ Cisco Blogs
IT之家
IT之家
D
DataBreaches.Net
Hacker News: Ask HN
Hacker News: Ask HN
SecWiki News
SecWiki News
Y
Y Combinator Blog
Project Zero
Project Zero
H
Hackread – Cybersecurity News, Data Breaches, AI and More
L
Lohrmann on Cybersecurity
T
Tenable Blog
大猫的无限游戏
大猫的无限游戏
L
LINUX DO - 最新话题
G
Google Developers Blog
The GitHub Blog
The GitHub Blog
Recorded Future
Recorded Future
有赞技术团队
有赞技术团队
Martin Fowler
Martin Fowler
K
Kaspersky official blog
PCI Perspectives
PCI Perspectives
A
Arctic Wolf
Latest news
Latest news
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
N
Netflix TechBlog - Medium
雷峰网
雷峰网
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
Google Online Security Blog
Google Online Security Blog
P
Palo Alto Networks Blog
The Hacker News
The Hacker News
WordPress大学
WordPress大学
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
月光博客
月光博客
Schneier on Security
Schneier on Security
M
MIT News - Artificial intelligence

Finisky Garden

The Hivemind of Language Models From RAG to Knowledge Compilation Theoretical Ceiling of Vector Retrieval Unexpected Perks of Talking to AI How Claude Dreams: Background Memory Defragmentation AI and Employment: A 200-Year-Old Debate Three Evolutions of Agent Engineering Context Management in Claude Code vs OpenClaw Foundation Models Plateau, Applications Take Off How OpenClaw Hit 350K Stars in 4 Months Deferred Tool Loading in Claude Code Why Claude Code's Edit Tool Doesn't Mangle Your Files Claude Code's Undercover Mode: When AI Learns to Hide Itself Context Compaction in Claude Code: A Five-Layer Cascade and the Art of Free Summaries How Claude Code Defends Against Bash Injection
How Forked Sub-Agents Share Prompt Cache for 90% Savings
finisky · 2026-04-05 · via Finisky Garden

When tackling complex tasks, Claude Code spawns multiple sub-agents in parallel, each needing the full parent conversation context. If the parent has accumulated 100K tokens and three sub-agents are spawned, a naive implementation charges 300K tokens of input.

Anyone familiar with LLM inference optimization will recognize this immediately: it’s a KV Cache sharing problem. When multiple requests share the same prefix, the Attention layer’s Key/Value tensors can be reused, skipping redundant computation. Anthropic exposes this capability to API users as Prompt Cache, offering a 90% discount on cached prefix portions — but only if the prefix bytes are exactly identical across requests. Claude Code’s fork sub-agents are deliberately constructed so that over 99% of the bytes are identical, compressing the effective input cost of three sub-agents to roughly 120K token-equivalent (100K full price + 2 × 100K × 10%).

Byte-Identical, Not Semantically Equivalent

The prompt cache key is determined by five dimensions: the full system prompt, the JSON serialization of tool definitions, the model name, the messages array prefix, and the thinking config. A single byte of difference in any dimension causes a complete cache miss. There’s no partial hit.

An extra space, a field ordering change, a boolean that differs because a feature flag was read at a slightly different moment — any of these will invalidate the cache. This constraint shapes the entire design.

Push All Differences to the Last Line

This is the most elegant part of the design. When the parent agent emits three parallel tool calls, Claude Code constructs the following message sequence for each fork:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
Fork 1:
  [...parent_history,
   assistant { tool_use_1, tool_use_2, tool_use_3 },
   user { tool_result_1="Fork started...",
          tool_result_2="Fork started...",
          tool_result_3="Fork started...",
          <fork-directive> directive_1 </fork-directive> }]

Fork 2:
  [...parent_history,
   assistant { tool_use_1, tool_use_2, tool_use_3 },
   user { tool_result_1="Fork started...",
          tool_result_2="Fork started...",
          tool_result_3="Fork started...",
          <fork-directive> directive_2 </fork-directive> }]

Fork 3:  (same pattern, directive_3)

Except for the directive text at the very end, the entire message sequence is byte-for-byte identical across all three forks. In pseudocode:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
FORK_PLACEHOLDER_RESULT = "Fork started - processing in background"

def build_forked_messages(directive, assistant_message):
    full_assistant = clone(assistant_message)

    tool_use_blocks = [b for b in assistant_message.content
                       if b.type == "tool_use"]

    # ALL tool_uses get IDENTICAL placeholder results
    tool_results = [
        {"type": "tool_result",
         "tool_use_id": block.id,
         "content": FORK_PLACEHOLDER_RESULT}
        for block in tool_use_blocks
    ]

    user_message = create_user_message(
        content=[*tool_results,
                 {"type": "text",
                  "text": build_child_directive(directive)}]
    )
    return [full_assistant, user_message]

Several counterintuitive choices are baked into this logic.

The placeholder is a constant string — every tool_result across every fork uses the exact same text. Why not use real results? Because forks launch in parallel. When Fork 1 starts, Fork 2 and Fork 3’s results don’t exist yet. And even if they did, different results would break prefix consistency.

The parent’s assistant message is preserved in full, including all tool_use blocks, not just the one for the current fork. Seems wasteful at first: Fork 1 only executes directive_1, so why does it need to see tool_use_2 and tool_use_3? Caching again. If each fork only saw its own tool_use, the three forks would diverge at the assistant message, and prefix consistency would break right there.

Every fork’s user message includes all three tool_results, regardless of which one it actually cares about. This keeps every byte identical from the tool_result sequence up to the directive. The directive itself is wrapped in a fixed XML template shared across all forks — the only difference is the last few dozen tokens of specific task description.

All differences get pushed to the very end of the message sequence. This resembles clustered index design in databases: put the high-cardinality column last to make the common prefix as long as possible.

It’s Not Just Messages That Need Alignment

Message prefix construction solves one of the five dimensions. The other four can’t have any differences either.

The subtlest is the system prompt. Its rendering may depend on feature flag state, which could change between the parent’s turn and fork creation. Even with identical logic, a flag change means different output bytes. So fork sub-agents use the parent’s already-rendered bytes directly, rather than re-rendering.

Tool definitions follow the same principle: forks use useExactTools=true to take the parent’s tool array reference directly, with no filtering or reordering. Thinking config is inherited wholesale — no maxOutputTokens override, to avoid downstream clamping logic altering the byte representation.

There’s one easy-to-miss dimension: ContentReplacementState. Claude Code has a tool result budget manager that truncates overly long results. For the same tool_use_id, a fresh state and a cloned state might make different truncation decisions. So forks clone the parent’s replacement state rather than creating a new one, ensuring inherited messages get identical treatment. From observable behavior, forks clone rather than create fresh state, ensuring identical truncation decisions for the same tool_use_ids.

The five-dimensional alignment requirement means fork sub-agents essentially can’t have any “ideas of their own.” Their system prompt isn’t theirs, their tool set isn’t their choice, their thinking config isn’t their setting, even their truncation strategy is cloned. Everything serves that byte-identical prefix.

The Math

Let’s see what this saves. Assume parent_history is 100K tokens, the assistant message with three tool_uses is 500 tokens, placeholder results total 200 tokens, each directive is about 100 tokens.

Fork 1 pays full price for the assistant message + placeholders + directive (~800 tokens), with the preceding 100K hitting cache at 10%. Fork 2 and Fork 3 cache-hit even on the assistant message and placeholders, paying full price only for the 100-token directive.

Total input cost across three forks: 100K × 10% × 3 + 800 + 100 × 2 ≈ 31K token-equivalent. Compared to 300K+ without cache sharing, that’s nearly 90% savings. The longer the conversation, the bigger the savings — when parent_history goes from 100K to 200K, the absolute savings double.

This benefit extends beyond parallel forks. Post-turn background tasks like session memory extraction and prompt suggestions also ride the same cache. The system saves a parameter snapshot after each main loop turn; subsequent sidecar forks read from this snapshot rather than independently re-collecting parameters, avoiding timing-window byte inconsistencies.

Keeping a Tool That Will Never Be Called

Fork sub-agents retain the Agent tool in their pool but are actually forbidden from calling it. Why keep it? Removing the Agent tool would change the tool schema’s serialized bytes, breaking cache alignment on the tools dimension. A tool’s presence exists purely for the bytes it produces during serialization — completely unrelated to its functionality.

Recursive fork prevention uses two layers: the first checks querySource (set on options at spawn time, survives message compaction), the second scans message history for the fork boilerplate tag (fallback in case querySource wasn’t properly propagated). Each layer has blind spots; together they cover all paths.

Sharing cache doesn’t mean sharing state. Each fork gets an isolated runtime: cloned file cache, fresh abort controller (linked to parent for downward signal propagation), no-op mutation callbacks. Resources are explicitly released after completion to prevent memory leaks in long sessions.

In this system, a tool can have zero chance of being invoked, but it cannot lack the necessity of being serialized. That’s perhaps the ultimate demand that byte-level compatibility makes on engineering aesthetics.