Deferred Tool Loading in Claude Code

When you use Claude Code, there’s something you probably never notice: it has over 40 registered tools, but when you ask it to read a file or edit a few lines of code, it only uses three or four. The definitions for the remaining 30-plus tools, each around 500 tokens, add up to over 10,000 tokens of fixed overhead per request. You just want to change one line of CSS, but you’re paying for WebSearch, NotebookEdit, CronCreate, and a bunch of tools you’ll never touch.

This problem has a well-known solution in traditional software: lazy loading of shared libraries. Instead of loading every library at startup, the program waits until a function is actually called before loading the library that contains it. Claude Code does something similar, except it’s not managing memory address space. It’s managing a token budget.

How Big Is the Problem

The more capable an LLM agent becomes, the more tools it registers. File reading, writing, searching, and command execution are the basics. Add notebook editing, scheduled tasks, plan mode, web search, and various MCP extension tools, and you easily pass 40. Every API request must include the name, description, and parameter schema of every tool as part of the context sent to the model.

In a 200K context window, the definitions of these rarely used tools alone account for nearly 7%. This isn’t a one-time cost; it’s paid every turn. Anthropic charges by the token, so the overhead shows up directly on the bill, and even when the definitions hit the cache they still consume cache space, reducing caching efficiency for everything else.
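
As a rough check on these numbers, here is the arithmetic as a small sketch. It uses the round figures from this post (30 rarely used tools, about 500 tokens each, a 200K window); real definitions vary in size, so the result only lands in the same ballpark as the 7% figure. The function name is made up for illustration.

```python
def fixed_overhead(num_tools: int, tokens_per_tool: int, context_window: int):
    """Return (tokens spent on tool definitions per request, share of the window)."""
    total = num_tools * tokens_per_tool
    return total, total / context_window


# Round numbers taken from the text; actual per-tool sizes differ.
tokens, share = fixed_overhead(num_tools=30, tokens_per_tool=500, context_window=200_000)
print(f"{tokens} tokens per request, {share:.1%} of the window")
# prints "15000 tokens per request, 7.5% of the window"
```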

More critically, the more tokens tool definitions consume, the less room there is for actual conversation and code. In long sessions, this 7% fixed overhead can be the difference between triggering an extra round of context compression and avoiding it.

Two Classes of Tools

Claude Code’s solution is to split tools into two categories.

The first category is always-loaded tools, included with full definitions in every request:

Always loaded:
  Bash, Read, Edit, Write    (core file ops)
  Glob, Grep                 (search)
  Agent, ToolSearch, Skill   (infrastructure)

These are high-frequency tools used in almost every task. Note that ToolSearch itself is always loaded because it’s the entry point for loading other tools. If it were deferred, nothing could load anything.

The second category is deferred tools. Only their names are sent, not their full definitions. The model knows these tools exist but can’t see their parameters or usage. This includes WebSearch, TodoWrite, NotebookEdit, all Cron tools, plan mode tools, and every MCP extension tool.

The heuristic is intuitive: tools likely needed in the first turn stay loaded; tools that might go unused the entire session are deferred. MCP tools are natural candidates for deferral. They’re configured per-project, their count is unpredictable, and some projects connect a dozen MCP servers with tool definitions easily exceeding 10,000 tokens.
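
The split described above can be sketched as follows. This is a minimal illustration, not Claude Code’s actual implementation: the Tool class, the build_request_tools function, and the shape of the returned payload are all assumptions. Only the tool names come from this post.

```python
from dataclasses import dataclass, field

# Names taken from the always-loaded list in this post.
ALWAYS_LOADED = {"Bash", "Read", "Edit", "Write", "Glob", "Grep",
                 "Agent", "ToolSearch", "Skill"}


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool record: name, description, parameter schema."""
    name: str
    description: str
    input_schema: dict = field(default_factory=dict)


def build_request_tools(tools, discovered=frozenset()):
    """Split tools into full definitions and name-only entries.

    `discovered` holds deferred tools already loaded through ToolSearch;
    they are sent with full definitions from then on.
    """
    full, names_only = [], []
    for tool in tools:
        if tool.name in ALWAYS_LOADED or tool.name in discovered:
            full.append({"name": tool.name,
                         "description": tool.description,
                         "input_schema": tool.input_schema})
        else:
            names_only.append(tool.name)  # the model sees the name, nothing else
    return full, names_only


tools = [Tool("Read", "Read a file."),
         Tool("WebSearch", "Search the web."),
         Tool("mcp__slack__post", "Post a message to Slack.")]
full, names_only = build_request_tools(tools)
print([t["name"] for t in full], names_only)
```

In this sketch every tool outside the always-loaded set is deferred, which covers MCP tools automatically, however many servers a project connects.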

The Discovery Mechanism

How does the model load a deferred tool? By calling ToolSearch.

Say the user asks “search for Claude Code release notes.” The model determines it needs WebSearch, but the context only contains the tool’s name with no parameter schema. The model first calls ToolSearch, specifying it wants WebSearch. ToolSearch returns a special reference marker. When the API server sees this marker, it injects WebSearch’s full definition into the model’s context. The model can then call WebSearch normally.

The flow looks roughly like this:

Turn 1:
  Model sees: Bash, Read, Edit, Write, Glob, Grep,
              Agent, ToolSearch, Skill
  Model sees names only: WebSearch, TodoWrite,
              NotebookEdit, mcp__slack__post, ...

User: "search for Claude Code release notes"

  Model calls ToolSearch("select:WebSearch")
    --> returns reference marker
    --> API injects WebSearch full schema

  Model calls WebSearch("Claude Code release notes")
    --> returns results

Turn 2:
  WebSearch now included as loaded tool
  Other deferred tools still name-only
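
The same flow can be sketched in code. Everything here is a stand-in: the registry contents, the dictionary shape of the reference marker, and the expand_references function (which plays the role of the API server) are assumptions for illustration. The post does not show the real marker format.

```python
# Hypothetical registry of full definitions for deferred tools.
REGISTRY = {
    "WebSearch": {
        "name": "WebSearch",
        "description": "Search the web.",
        "input_schema": {"type": "object",
                         "properties": {"query": {"type": "string"}},
                         "required": ["query"]},
    },
}


def tool_search(query: str, registry: dict) -> list:
    """Handle the exact-name form 'select:<ToolName>' and return markers.

    The marker carries only a name, not the schema; its shape is assumed.
    """
    if not query.startswith("select:"):
        raise ValueError("this sketch only handles the select: form")
    name = query[len("select:"):].strip()
    if name not in registry:
        return []
    return [{"type": "tool_reference", "tool_name": name}]


def expand_references(loaded: dict, markers: list, registry: dict) -> dict:
    """Stand-in for the server-side step: swap each marker for the full schema."""
    updated = dict(loaded)
    for marker in markers:
        updated[marker["tool_name"]] = registry[marker["tool_name"]]
    return updated


loaded = {}  # full definitions currently visible to the model
markers = tool_search("select:WebSearch", REGISTRY)
loaded = expand_references(loaded, markers, REGISTRY)
print(sorted(loaded))  # WebSearch is now callable with its full schema
```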

The cost is one extra ToolSearch call, roughly 200 input tokens. But the savings are the fixed overhead of 30-plus tool definitions, over 10,000 tokens. As long as you’re not searching for new tools every turn, the net benefit is positive. And once a tool has been loaded, it stays loaded for subsequent requests.
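
To make the trade-off concrete, here is a back-of-the-envelope comparison under simplifying assumptions: every definition costs the same, the name-only list is treated as free, and caching is ignored. The function names and the 20-turn scenario are made up for illustration.

```python
def eager_cost(turns: int, deferred_tools: int, tokens_per_tool: int) -> int:
    """Tokens spent if every rarely used definition is sent on every turn."""
    return turns * deferred_tools * tokens_per_tool


def deferred_cost(turns: int, load_turns: list, tokens_per_tool: int,
                  search_cost: int) -> int:
    """Tokens spent with deferred loading.

    load_turns lists the 1-based turn at which each tool is first loaded.
    Each load pays one ToolSearch call, then the definition is carried
    on that turn and every later one.
    """
    total = 0
    for turn in load_turns:
        total += search_cost
        total += (turns - turn + 1) * tokens_per_tool
    return total


# 20-turn session, 30 deferrable tools of ~500 tokens, one tool loaded at turn 5.
eager = eager_cost(turns=20, deferred_tools=30, tokens_per_tool=500)
lazy = deferred_cost(turns=20, load_turns=[5], tokens_per_tool=500, search_cost=200)
print(eager, lazy)  # prints "300000 8200"
```

Under these assumptions deferred loading only loses if nearly every deferrable tool ends up being loaded early in the session.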

ToolSearch also supports fuzzy search. When the model isn’t sure of a tool’s exact name, it can query by keyword. Searching “notebook jupyter” finds NotebookEdit, and “slack send” finds the corresponding MCP tool.
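
The post does not say how the keyword matching is implemented. One simple way to get this behavior, shown here purely as an assumed sketch, is to count how many query terms appear in each tool’s name or description. The catalog descriptions below are invented for the example.

```python
def keyword_search(query: str, catalog: dict, limit: int = 3) -> list:
    """Rank tools by how many query terms appear in their name or description."""
    terms = query.lower().split()
    scored = []
    for name, description in catalog.items():
        haystack = f"{name} {description}".lower()
        score = sum(1 for term in terms if term in haystack)
        if score:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Hypothetical one-line descriptions; the real ones are longer.
CATALOG = {
    "WebSearch": "Search the web and return result snippets.",
    "NotebookEdit": "Edit cells in a Jupyter notebook.",
    "mcp__slack__post": "Send a message to a Slack channel.",
    "CronCreate": "Create a scheduled task.",
}

print(keyword_search("notebook jupyter", CATALOG))  # prints "['NotebookEdit']"
print(keyword_search("slack send", CATALOG))        # prints "['mcp__slack__post']"
```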

Surviving Compression

A previous post discussed Claude Code’s context compression (compaction) mechanism. Compression condenses message history into summaries, but tool references loaded earlier get lost in the process. Without handling this, the model forgets it ever loaded WebSearch after compression and has to search for it again.

Before compaction:
  [msg1] [msg2] ... [msg30] [WebSearch loaded at msg15]
                                      |
                              compaction happens
                                      |
                                      v
After compaction (without recovery):
  [summary] [msg29] [msg30]
  WebSearch? what WebSearch?

After compaction (with recovery):
  [summary + preCompactDiscoveredTools: [WebSearch]]
  [msg29] [msg30]
  WebSearch still loaded

Claude Code solves this with two mechanisms. The first records the list of loaded tools in the compaction boundary message. After compression, the system scans this boundary and restores the previous state. The second is incremental notification: before each turn, the system compares currently available deferred tools against what the model has already been told about. If anything changed, such as an MCP server disconnecting or a new one connecting, it generates a message to inform the model.

Together, these two layers ensure that the API-level tool filtering stays correct and the model’s awareness of available tools remains intact. Compression doesn’t break tool state.
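
Both mechanisms can be sketched in a few lines. The preCompactDiscoveredTools field name comes from the diagram above; the message shape, the "compact_boundary" type, and all function names are assumptions for illustration, not Claude Code’s actual code.

```python
from typing import Optional


def make_boundary(summary: str, discovered: set) -> dict:
    """Build the compaction boundary message, recording loaded tools in it."""
    return {"type": "compact_boundary",
            "summary": summary,
            "preCompactDiscoveredTools": sorted(discovered)}


def restore_discovered(messages: list) -> set:
    """Scan history for boundary messages and rebuild the loaded-tool set."""
    restored = set()
    for message in messages:
        if message.get("type") == "compact_boundary":
            restored.update(message.get("preCompactDiscoveredTools", []))
    return restored


def deferred_tools_delta(announced: set, available: set) -> Optional[str]:
    """Compare what the model was told with what exists now.

    Returns a notification string, or None when nothing changed.
    """
    added = sorted(available - announced)
    removed = sorted(announced - available)
    if not added and not removed:
        return None
    parts = []
    if added:
        parts.append("now available: " + ", ".join(added))
    if removed:
        parts.append("no longer available: " + ", ".join(removed))
    return "Deferred tools changed; " + "; ".join(parts)


history = [make_boundary("summary of msg1..msg28", {"WebSearch"}),
           {"type": "user", "text": "msg29"},
           {"type": "assistant", "text": "msg30"}]
print(restore_discovered(history))
print(deferred_tools_delta({"WebSearch", "mcp__slack__post"},
                           {"WebSearch", "mcp__github__search"}))
```

The first function pair keeps the request-side filtering correct after compaction; the delta function keeps the model’s picture of the name-only list current when MCP servers come and go.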

Adaptive Activation

Not every scenario needs deferred loading. If a project only uses two or three MCP tools totaling under 1,500 tokens, the extra ToolSearch call is just wasted time.

deferred tool definitions < 10% of context window?
            |
     +------+------+
     |             |
    YES            NO
     |             |
     v             v
  all inline    deferred mode
  (no search    (ToolSearch
   overhead)     on demand)

Claude Code has an automatic mode: it calculates the total token count of all deferrable tool definitions, and if it exceeds 10% of the context window, deferred loading kicks in. Below that threshold, everything is inlined. Small projects with few tools run more efficiently with everything loaded; large projects with many tools save more with deferred loading. The system decides on its own. Users don’t need to think about it.

For third-party API proxies, this feature is disabled by default, since the tool reference marker is an Anthropic-specific capability that proxies may not support. Users who have confirmed that their proxy can handle it can enable it manually.
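
The activation rule, including the proxy default, can be sketched as a single decision function. The parameter names are hypothetical, and treating a manual opt-in as "run the same threshold check" is an assumption; the post only says the feature can be enabled manually.

```python
def should_defer(deferrable_tokens: int, context_window: int, *,
                 third_party_proxy: bool = False,
                 proxy_opt_in: bool = False,
                 threshold_percent: int = 10) -> bool:
    """Decide whether deferred loading is active for this session."""
    # Proxies may not understand the reference marker, so default to inline.
    if third_party_proxy and not proxy_opt_in:
        return False
    # Integer math avoids float rounding right at the threshold.
    return deferrable_tokens * 100 > threshold_percent * context_window


print(should_defer(1_500, 200_000))    # small project: prints "False"
print(should_defer(25_000, 200_000))   # many MCP tools: prints "True"
print(should_defer(25_000, 200_000, third_party_proxy=True))  # prints "False"
```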

Why This Design Is Interesting

Deferred tool loading looks like a very specific engineering optimization, but the tension behind it is quite universal: the more capable an agent becomes, the more tools it registers, the larger the context overhead, and the less room is left for actual work. Capability itself becomes a burden.

This tension showed up in traditional software long ago. The more functions a program can call, the larger and slower it gets if everything is bundled in. The solution was lazy loading: load each library only when it’s first called, trading one extra level of indirection for dramatically reduced resource consumption. What Claude Code does with tools is exactly the same thing, except it’s not saving memory. It’s saving tokens.
