How Forked Sub-Agents Share Prompt Cache for 90% Savings

When tackling complex tasks, Claude Code spawns multiple sub-agents in parallel, each needing the full parent conversation context. If the parent has accumulated 100K tokens and three sub-agents are spawned, a naive implementation charges 300K tokens of input.

Anyone familiar with LLM inference optimization will recognize this immediately: it’s a KV Cache sharing problem. When multiple requests share the same prefix, the Attention layer’s Key/Value tensors can be reused, skipping redundant computation. Anthropic exposes this capability to API users as Prompt Cache, offering a 90% discount on cached prefix portions — but only if the prefix bytes are exactly identical across requests. Claude Code’s fork sub-agents are deliberately constructed so that over 99% of the bytes are identical, compressing the effective input cost of three sub-agents to roughly 120K token-equivalent (100K full price + 2 × 100K × 10%).

Byte-Identical, Not Semantically Equivalent

The prompt cache key is determined by five dimensions: the full system prompt, the JSON serialization of tool definitions, the model name, the messages array prefix, and the thinking config. A single byte of difference in any dimension causes a complete cache miss. There’s no partial hit.

An extra space, a field ordering change, a boolean that differs because a feature flag was read at a slightly different moment — any of these will invalidate the cache. This constraint shapes the entire design.

Push All Differences to the Last Line

This is the most elegant part of the design. When the parent agent emits three parallel tool calls, Claude Code constructs the following message sequence for each fork:

Fork 1:
  [...parent_history,
   assistant { tool_use_1, tool_use_2, tool_use_3 },
   user { tool_result_1="Fork started...",
          tool_result_2="Fork started...",
          tool_result_3="Fork started...",
          <fork-directive> directive_1 </fork-directive> }]

Fork 2:
  [...parent_history,
   assistant { tool_use_1, tool_use_2, tool_use_3 },
   user { tool_result_1="Fork started...",
          tool_result_2="Fork started...",
          tool_result_3="Fork started...",
          <fork-directive> directive_2 </fork-directive> }]

Fork 3:  (same pattern, directive_3)

Except for the directive text at the very end, the entire message sequence is byte-for-byte identical across all three forks. In pseudocode:

FORK_PLACEHOLDER_RESULT = "Fork started - processing in background"

def build_forked_messages(directive, assistant_message):
    full_assistant = clone(assistant_message)

    tool_use_blocks = [b for b in assistant_message.content
                       if b.type == "tool_use"]

    # ALL tool_uses get IDENTICAL placeholder results
    tool_results = [
        {"type": "tool_result",
         "tool_use_id": block.id,
         "content": FORK_PLACEHOLDER_RESULT}
        for block in tool_use_blocks
    ]

    user_message = create_user_message(
        content=[*tool_results,
                 {"type": "text",
                  "text": build_child_directive(directive)}]
    )
    return [full_assistant, user_message]

Several counterintuitive choices are baked into this logic.

The placeholder is a constant string — every tool_result across every fork uses the exact same text. Why not use real results? Because forks launch in parallel. When Fork 1 starts, Fork 2 and Fork 3’s results don’t exist yet. And even if they did, different results would break prefix consistency.

The parent’s assistant message is preserved in full, including all tool_use blocks, not just the one for the current fork. Seems wasteful at first: Fork 1 only executes directive_1, so why does it need to see tool_use_2 and tool_use_3? Caching again. If each fork only saw its own tool_use, the three forks would diverge at the assistant message, and prefix consistency would break right there.

Every fork’s user message includes all three tool_results, regardless of which one it actually cares about. This keeps every byte identical from the tool_result sequence up to the directive. The directive itself is wrapped in a fixed XML template shared across all forks — the only difference is the last few dozen tokens of specific task description.

All differences get pushed to the very end of the message sequence. This resembles clustered index design in databases: put the high-cardinality column last to make the common prefix as long as possible.

It’s Not Just Messages That Need Alignment

Message prefix construction solves one of the five dimensions. The other four can’t have any differences either.

The subtlest is the system prompt. Its rendering may depend on feature flag state, which could change between the parent’s turn and fork creation. Even with identical logic, a flag change means different output bytes. So fork sub-agents use the parent’s already-rendered bytes directly, rather than re-rendering.

Tool definitions follow the same principle: forks use useExactTools=true to take the parent’s tool array reference directly, with no filtering or reordering. Thinking config is inherited wholesale — no maxOutputTokens override, to avoid downstream clamping logic altering the byte representation.

There’s one easy-to-miss dimension: ContentReplacementState. Claude Code has a tool result budget manager that truncates overly long results. For the same tool_use_id, a fresh state and a cloned state might make different truncation decisions. So forks clone the parent’s replacement state rather than creating a new one, ensuring inherited messages get identical treatment. From observable behavior, forks clone rather than create fresh state, ensuring identical truncation decisions for the same tool_use_ids.

The five-dimensional alignment requirement means fork sub-agents essentially can’t have any “ideas of their own.” Their system prompt isn’t theirs, their tool set isn’t their choice, their thinking config isn’t their setting, even their truncation strategy is cloned. Everything serves that byte-identical prefix.

The Math

Let’s see what this saves. Assume parent_history is 100K tokens, the assistant message with three tool_uses is 500 tokens, placeholder results total 200 tokens, each directive is about 100 tokens.

Fork 1 pays full price for the assistant message + placeholders + directive (~800 tokens), with the preceding 100K hitting cache at 10%. Fork 2 and Fork 3 cache-hit even on the assistant message and placeholders, paying full price only for the 100-token directive.

Total input cost across three forks: 100K × 10% × 3 + 800 + 100 × 2 ≈ 31K token-equivalent. Compared to 300K+ without cache sharing, that’s nearly 90% savings. The longer the conversation, the bigger the savings — when parent_history goes from 100K to 200K, the absolute savings double.

This benefit extends beyond parallel forks. Post-turn background tasks like session memory extraction and prompt suggestions also ride the same cache. The system saves a parameter snapshot after each main loop turn; subsequent sidecar forks read from this snapshot rather than independently re-collecting parameters, avoiding timing-window byte inconsistencies.

Keeping a Tool That Will Never Be Called

Fork sub-agents retain the Agent tool in their pool but are actually forbidden from calling it. Why keep it? Removing the Agent tool would change the tool schema’s serialized bytes, breaking cache alignment on the tools dimension. A tool’s presence exists purely for the bytes it produces during serialization — completely unrelated to its functionality.

Recursive fork prevention uses two layers: the first checks querySource (set on options at spawn time, survives message compaction), the second scans message history for the fork boilerplate tag (fallback in case querySource wasn’t properly propagated). Each layer has blind spots; together they cover all paths.

Sharing cache doesn’t mean sharing state. Each fork gets an isolated runtime: cloned file cache, fresh abort controller (linked to parent for downward signal propagation), no-op mutation callbacks. Resources are explicitly released after completion to prevent memory leaks in long sessions.

In this system, a tool can have zero chance of being invoked, but it cannot lack the necessity of being serialized. That’s perhaps the ultimate demand that byte-level compatibility makes on engineering aesthetics.

推荐订阅源