Deferred Tool Loading in Claude Code
finisky · 2026-04-05 · via Finisky Garden

When you use Claude Code, there’s something you probably never notice: it has over 40 registered tools, but when you ask it to read a file or edit a few lines of code, it only uses three or four. The definitions for the remaining 30-plus tools, each around 500 tokens, add up to over 10,000 tokens of fixed overhead per request. You just want to change one line of CSS, but you’re paying for WebSearch, NotebookEdit, CronCreate, and a bunch of tools you’ll never touch.

This problem has a well-known solution in traditional software: lazy loading of dynamically linked libraries. Instead of loading every shared library at startup, the program waits until a function is actually called before loading its library. Claude Code does something similar, except it’s not managing memory address space. It’s managing token budget.

How Big Is the Problem

The more capable an LLM Agent becomes, the more tools it registers. File reading, writing, searching, and command execution are the basics. Add notebook editing, scheduled tasks, plan mode, web search, and various MCP extension tools, and you easily pass 40. Every API request must include the name, description, and parameter schema of every tool as part of the context sent to the model.

In a 200K context window, the definitions of those rarely used tools alone account for nearly 7%. This isn’t a one-time cost; it’s paid every turn. And since Anthropic charges by the token, even with cache hits these definitions consume cache space, reducing caching efficiency for everything else.

More critically, the more tokens tool definitions consume, the less room there is for actual conversation and code. In long sessions, this 7% fixed overhead can be the difference between triggering an extra compression and not.
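
The overhead can be checked with a few lines of arithmetic. This is a rough sketch using the estimates quoted above (about 500 tokens per definition, 30 rarely used tools, a 200K window); the function names and the 50-turn session length are assumptions made for the illustration, not measured values.

```python
# Figures are the article's rough estimates, not measurements.
CONTEXT_WINDOW = 200_000   # tokens
TOKENS_PER_TOOL = 500      # "around 500 tokens" per tool definition
RARELY_USED_TOOLS = 30     # "30-plus" tools beyond the core set


def fixed_overhead(tool_count: int, tokens_per_tool: int = TOKENS_PER_TOOL) -> int:
    """Tokens spent on tool definitions in every single request."""
    return tool_count * tokens_per_tool


def share_of_window(tokens: int, window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the context window those tokens occupy."""
    return tokens / window


overhead = fixed_overhead(RARELY_USED_TOOLS)
print(overhead)                            # 15000
print(f"{share_of_window(overhead):.1%}")  # 7.5%
print(overhead * 50)                       # 750000 tokens over a hypothetical 50-turn session
```

With exactly 30 tools at 500 tokens each the share comes out at 7.5%, in the same range as the figure of roughly 7% cited above.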

Two Classes of Tools

Claude Code’s solution is to split tools into two categories.

The first category is always-loaded tools, included with full definitions in every request:

Always loaded:
  Bash, Read, Edit, Write    (core file ops)
  Glob, Grep                 (search)
  Agent, ToolSearch, Skill   (infrastructure)

These are high-frequency tools used in almost every task. Note that ToolSearch itself is always loaded because it’s the entry point for loading other tools. If it were deferred, nothing could load anything.

The second category is deferred tools. Only their names are sent, not their full definitions. The model knows these tools exist but can’t see their parameters or usage. This includes WebSearch, TodoWrite, NotebookEdit, all Cron tools, plan mode tools, and every MCP extension tool.

The heuristic is intuitive: tools likely needed in the first turn stay loaded; tools that might go unused the entire session are deferred. MCP tools are natural candidates for deferral. They’re configured per-project, their count is unpredictable, and some projects connect a dozen MCP servers with tool definitions easily exceeding 10,000 tokens.
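
The split can be sketched as a simple partition over the tool registry. This is a minimal illustration, not Claude Code's actual implementation: the `Tool` class, `is_deferred`, and `build_request_tools` are hypothetical names, and the rule "defer anything outside the core set, plus everything prefixed `mcp__`" is an assumption inferred from the lists above.

```python
from dataclasses import dataclass, field

# Names taken from the "always loaded" list above.
ALWAYS_LOADED = {"Bash", "Read", "Edit", "Write", "Glob", "Grep",
                 "Agent", "ToolSearch", "Skill"}


@dataclass
class Tool:
    name: str
    description: str = ""
    schema: dict = field(default_factory=dict)


def is_deferred(tool: Tool) -> bool:
    """Hypothetical rule: MCP tools and anything outside the core set are deferred."""
    return tool.name.startswith("mcp__") or tool.name not in ALWAYS_LOADED


def build_request_tools(tools: list[Tool]) -> tuple[list[dict], list[str]]:
    """Return (full definitions to send, names-only list of deferred tools)."""
    full: list[dict] = []
    names_only: list[str] = []
    for tool in tools:
        if is_deferred(tool):
            names_only.append(tool.name)
        else:
            full.append({"name": tool.name,
                         "description": tool.description,
                         "input_schema": tool.schema})
    return full, names_only


registry = [Tool("Read"), Tool("WebSearch"), Tool("mcp__slack__post"), Tool("ToolSearch")]
full, deferred = build_request_tools(registry)
print([t["name"] for t in full])  # ['Read', 'ToolSearch']
print(deferred)                   # ['WebSearch', 'mcp__slack__post']
```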

The Discovery Mechanism

How does the model load a deferred tool? By calling ToolSearch.

Say the user asks “search for Claude Code release notes.” The model determines it needs WebSearch, but the context only contains the tool’s name with no parameter schema. The model first calls ToolSearch, specifying it wants WebSearch. ToolSearch returns a special reference marker. When the API server sees this marker, it injects WebSearch’s full definition into the model’s context. The model can then call WebSearch normally.

The flow looks roughly like this:

Turn 1:
  Model sees: Bash, Read, Edit, Write, Glob, Grep,
              Agent, ToolSearch, Skill
  Model sees names only: WebSearch, TodoWrite,
              NotebookEdit, mcp__slack__post, ...

User: "search for Claude Code release notes"

  Model calls ToolSearch("select:WebSearch")
    --> returns reference marker
    --> API injects WebSearch full schema

  Model calls WebSearch("Claude Code release notes")
    --> returns results

Turn 2:
  WebSearch now included as loaded tool
  Other deferred tools still name-only

The cost is one extra ToolSearch call, roughly 200 input tokens. But the savings are the fixed overhead of 30-plus tool definitions, over 10,000 tokens. As long as you’re not searching for new tools every turn, the net benefit is positive. And once a tool has been loaded, it stays loaded for subsequent requests.
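
The break-even point is easy to estimate. The sketch below uses the figures quoted above (200 tokens per ToolSearch call, 10,000 tokens of deferred definitions, about 500 tokens per definition). It makes a deliberately pessimistic assumption that every tool the session ends up needing is loaded on the first turn, and it ignores prompt-cache discounts; `net_savings` is a hypothetical helper written for this illustration.

```python
SEARCH_COST = 200           # tokens per ToolSearch call (article's estimate)
DEFERRED_OVERHEAD = 10_000  # tokens of definitions kept out of each request
TOKENS_PER_TOOL = 500       # tokens per tool definition once it is loaded


def net_savings(turns: int, tools_loaded: int) -> int:
    """Tokens saved over a session compared with inlining every definition.

    Assumption: each loaded tool is loaded on turn 1, so its definition is
    paid for on every turn (the worst case for deferral).
    """
    inline_cost = DEFERRED_OVERHEAD * turns
    deferred_cost = tools_loaded * (SEARCH_COST + TOKENS_PER_TOOL * turns)
    return inline_cost - deferred_cost


print(net_savings(turns=20, tools_loaded=2))   # 179600
print(net_savings(turns=20, tools_loaded=20))  # -4000
```

Under these assumptions deferral only loses when a session ends up loading nearly every deferred tool, at which point the search calls are pure overhead.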

ToolSearch also supports fuzzy search. When the model isn’t sure of a tool’s exact name, it can query by keyword. Searching “notebook jupyter” finds NotebookEdit, and “slack send” finds the corresponding MCP tool.
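
Both lookup styles can be sketched in a few lines. This is a toy illustration only: the catalogue descriptions are invented, the substring-count scoring is an assumption (the article does not say how matches are ranked), and the function returns plain names rather than the reference markers the real tool produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeferredTool:
    name: str
    description: str


# Hypothetical catalogue; the descriptions are made up for this example.
CATALOGUE = [
    DeferredTool("WebSearch", "search the web for up to date information"),
    DeferredTool("NotebookEdit", "edit cells in a jupyter notebook"),
    DeferredTool("mcp__slack__post", "send a message to a slack channel"),
]


def tool_search(query: str, limit: int = 3) -> list[str]:
    """Return the names of matching deferred tools.

    "select:Name" is an exact lookup; anything else is treated as
    space-separated keywords matched against name and description.
    """
    if query.startswith("select:"):
        wanted = query[len("select:"):]
        return [t.name for t in CATALOGUE if t.name == wanted]

    keywords = query.lower().split()
    scored: list[tuple[int, str]] = []
    for tool in CATALOGUE:
        haystack = f"{tool.name} {tool.description}".lower()
        score = sum(1 for kw in keywords if kw in haystack)
        if score:
            scored.append((score, tool.name))
    scored.sort(key=lambda pair: -pair[0])  # stable sort: ties keep catalogue order
    return [name for _, name in scored[:limit]]


print(tool_search("select:WebSearch"))  # ['WebSearch']
print(tool_search("notebook jupyter"))  # ['NotebookEdit']
print(tool_search("slack send"))        # ['mcp__slack__post']
```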

Surviving Compression

A previous post discussed Claude Code’s context compression (compaction) mechanism. Compression condenses message history into summaries, but tool references loaded earlier get lost in the process. Without handling this, the model forgets after compression that it ever loaded WebSearch and has to search for it again.

Before compaction:
  [msg1] [msg2] ... [msg30] [WebSearch loaded at msg15]
                                      |
                              compaction happens
                                      |
                                      v
After compaction (without recovery):
  [summary] [msg29] [msg30]
  WebSearch? what WebSearch?

After compaction (with recovery):
  [summary + preCompactDiscoveredTools: [WebSearch]]
  [msg29] [msg30]
  WebSearch still loaded

Claude Code solves this with two mechanisms. The first records the list of loaded tools in the compaction boundary message. After compression, the system scans this boundary and restores the previous state. The second is incremental notification: before each turn, the system compares currently available deferred tools against what the model has already been told about. If anything changed, such as an MCP server disconnecting or a new one connecting, it generates a message to inform the model.

Together, these two layers ensure that the API-level tool filtering stays correct and the model’s awareness of available tools remains intact. Compression doesn’t break tool state.
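
Both mechanisms can be sketched as small functions over the message history. Everything here is a simplified assumption: the `Message` class, the function names, and the wording of the notification are hypothetical, and only the `preCompactDiscoveredTools` idea (written in snake case below) comes from the diagram above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Message:
    role: str
    text: str
    # Only populated on the compaction boundary message.
    pre_compact_discovered_tools: list[str] = field(default_factory=list)


def compact(history: list[Message], loaded: set[str], keep_last: int = 2) -> list[Message]:
    """Replace older messages with a summary that also records loaded tools."""
    boundary = Message("system", "[summary]", sorted(loaded))
    return [boundary] + history[-keep_last:]


def restore_loaded_tools(history: list[Message]) -> set[str]:
    """Scan for boundary messages and rebuild the set of loaded tools."""
    restored: set[str] = set()
    for msg in history:
        restored.update(msg.pre_compact_discovered_tools)
    return restored


def deferred_tool_delta(announced: set[str], available: set[str]) -> Optional[str]:
    """Describe what changed since the model was last told; None if nothing did."""
    added = sorted(available - announced)
    removed = sorted(announced - available)
    if not added and not removed:
        return None
    return f"deferred tools added: {added}; removed: {removed}"


history = [Message("user", f"msg{i}") for i in range(1, 31)]
compacted = compact(history, {"WebSearch"})
print(len(compacted))                   # 3
print(restore_loaded_tools(compacted))  # {'WebSearch'}
print(deferred_tool_delta({"WebSearch", "mcp__slack__post"}, {"WebSearch"}))
# deferred tools added: []; removed: ['mcp__slack__post']
```

The first two functions cover the boundary record and its recovery; the third covers the per-turn comparison, for example after an MCP server disconnects.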

Adaptive Activation

Not every scenario needs deferred loading. If a project only uses two or three MCP tools totaling under 1,500 tokens, the extra ToolSearch call is just wasted time.

deferred tool definitions < 10% of context window?
            |
     +------+------+
     |             |
    YES            NO
     |             |
     v             v
  all inline    deferred mode
  (no search    (ToolSearch
   overhead)     on demand)

Claude Code has an automatic mode: it calculates the total token count of all deferrable tool definitions, and if it exceeds 10% of the context window, deferred loading kicks in. Below that threshold, everything is inlined. Small projects with few tools run more efficiently with everything loaded; large projects with many tools save more with deferred loading. The system decides on its own. Users don’t need to think about it.
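
The decision rule reduces to a single comparison. The sketch below assumes the 10% threshold and a 200K window as described above; `should_defer` and its parameter names are hypothetical, and the per-tool token counts are illustrative.

```python
def should_defer(deferrable_tool_tokens: list[int],
                 context_window: int = 200_000,
                 threshold_percent: int = 10) -> bool:
    """True when deferrable definitions exceed the threshold share of the window.

    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    return sum(deferrable_tool_tokens) * 100 > threshold_percent * context_window


print(should_defer([500, 500, 500]))  # False: 1,500 tokens, everything stays inline
print(should_defer([500] * 45))       # True: 22,500 tokens exceeds 20,000
print(should_defer([500] * 40))       # False: exactly 20,000 does not exceed the threshold
```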

Third-party API proxies have this feature disabled by default, since the tool reference marker is an Anthropic-specific capability that proxies may not support. But if users confirm their proxy can handle it, they can enable it manually.

Why This Design Is Interesting

Deferred tool loading looks like a very specific engineering optimization, but the tension behind it is quite universal: the more capable an Agent becomes, the more tools it registers, the larger the context overhead, and the less room is left for actual work. Capability itself becomes a burden.

This tension showed up in traditional software long ago. The more functions a program can call, the larger and slower it gets if everything is bundled in. The solution was lazy loading: load each library only when it’s first called, trading one extra level of indirection for dramatically reduced resource consumption. What Claude Code does with tools is essentially the same thing, except it’s not saving memory. It’s saving tokens.