The incident report said the agent called search() 847 times in a single run.
Nobody noticed until the invoice arrived. The agent was tasked with researching a topic. It got into a loop: search, parse the result, decide it needed more context, search again. The search results kept being relevant enough to continue. The termination condition never triggered. The agent was not broken. It was doing exactly what it was told, just too many times, for too long.
The fix people reach for is a global step counter. if steps > 100: break. That works until you have an agent with ten tools and one tool accounts for ninety of those steps. A global cap is too blunt. You want to say: search can run 5 times, fetch_url can run 10 times, send_email can run 1 time. Each tool has its own ceiling.
tool-call-budgets is the library that does that.
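To make the bluntness of a global cap concrete, here is a small simulation in plain Python with no library involved. The call trace, the cap values, and the two helper functions are made up for illustration. It compares what a single step counter lets through against what per-tool ceilings let through.

```python
from collections import Counter

# Hypothetical trace of tool calls from a looping agent: mostly search.
trace = ["search"] * 90 + ["fetch_url"] * 8 + ["send_email"] * 2

GLOBAL_CAP = 100
PER_TOOL_CAPS = {"search": 5, "fetch_url": 10, "send_email": 1}


def run_with_global_cap(calls, cap):
    """Allow calls until the total number of steps exceeds the cap."""
    allowed = Counter()
    for steps, tool in enumerate(calls, start=1):
        if steps > cap:
            break
        allowed[tool] += 1
    return allowed


def run_with_per_tool_caps(calls, caps):
    """Stop at the first call that would exceed its own tool's ceiling."""
    allowed = Counter()
    for tool in calls:
        if allowed[tool] >= caps[tool]:
            break
        allowed[tool] += 1
    return allowed


global_result = run_with_global_cap(trace, GLOBAL_CAP)
per_tool_result = run_with_per_tool_caps(trace, PER_TOOL_CAPS)

print(dict(global_result))    # all 90 search calls fit under the global cap
print(dict(per_tool_result))  # the run stops once search reaches its ceiling of 5
```

The global cap never fires on this trace, because the total is exactly 100 steps. The per-tool version stops the run at the sixth search call.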
The shape of the fix
The core API is three things: a ToolBudgets dict, a RunContext, and a @guarded decorator.
from tool_call_budgets import ToolBudgets, RunContext, guarded, ToolBudgetExceeded
# Define your limits once.
budgets = ToolBudgets({
    "search": 5,
    "fetch_url": 10,
    "send_email": 1,
})
# Create a context for each agent invocation.
ctx = RunContext(budgets)
Wrap each tool function with @guarded:
import requests  # third-party; used only by the fetch_url example below
# my_search_api and my_email_client are placeholders for your own clients.
@guarded(ctx, name="search")
def search(query: str) -> list[str]:
    return my_search_api(query)
@guarded(ctx, name="fetch_url")
def fetch_url(url: str) -> str:
    return requests.get(url).text
@guarded(ctx, name="send_email")
def send_email(to: str, body: str) -> None:
    my_email_client.send(to=to, body=body)
Now pass those wrapped functions to your agent. When search is called a sixth time, ToolBudgetExceeded is raised before the function body runs.
try:
    result = run_agent(
        tools=[search, fetch_url, send_email],
        task="Research the competitive landscape for ...",
    )
except ToolBudgetExceeded as e:
    print(f"Budget exceeded: {e.tool_name} hit its limit of {e.limit}")
    # log it, return partial results, alert, etc.
The exception carries the tool name and the limit. Your agent harness catches it and decides what to do: return partial results, retry with a smaller task, page an operator.
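The "return partial results" option can be sketched as follows. This is a toy harness, not the library: the ToolBudgetExceeded class here is a local stand-in that only mirrors the tool_name and limit attributes shown above, and make_limited_search and research are hypothetical helpers. The point is that findings are accumulated outside the try block, so they survive when the budget trips.

```python
class ToolBudgetExceeded(Exception):
    """Local stand-in for the library's exception, for illustration only."""

    def __init__(self, tool_name: str, limit: int):
        super().__init__(f"{tool_name} exceeded its budget of {limit}")
        self.tool_name = tool_name
        self.limit = limit


def make_limited_search(limit: int):
    """Hypothetical search tool that raises after `limit` successful calls."""
    calls = 0

    def search(query: str) -> str:
        nonlocal calls
        if calls >= limit:
            raise ToolBudgetExceeded("search", limit)
        calls += 1
        return f"result for {query!r}"

    return search


def research(queries: list[str], search) -> dict:
    """Toy harness: collect findings and keep them if a budget trips."""
    findings: list[str] = []
    try:
        for query in queries:
            findings.append(search(query))
    except ToolBudgetExceeded as e:
        return {
            "status": "partial",
            "stopped_by": e.tool_name,
            "limit": e.limit,
            "findings": findings,
        }
    return {"status": "complete", "findings": findings}


outcome = research([f"q{i}" for i in range(8)], make_limited_search(limit=5))
print(outcome["status"], len(outcome["findings"]))
```

Eight queries against a ceiling of five yield a partial outcome with the first five findings intact, plus enough detail to log which tool stopped the run.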
What it does NOT do
- It does not track token or USD cost. If you need a dollar cap, pair this library with token-budget-py. These two cover different axes: call count vs. spend.
- It does not reset automatically over time. If you want a "5 search calls per minute" window, that is llm-budget-window. tool-call-budgets is per-run scoped, not time-windowed.
- It does not inspect or validate tool arguments. If you need to reject a call because the args look wrong, that is agentvet. tool-call-budgets only counts calls, regardless of what is passed.
- It does not modify or wrap the return value. The decorated function runs normally if within budget. The only change is the guard on the way in.
Inside the lib: the RunContext design
The design choice that took the most iteration was how to scope the counters.
The obvious approach is a shared global dict: call_counts["search"] += 1. That is simple but breaks immediately when you run two agents at the same time. Agent A's search calls pollute Agent B's counter. You end up refusing Agent B's third call because Agent A already burned two of the five slots.
The next approach is a thread-local. One counter per thread. That works for thread-based concurrency but falls apart with async agents running on an event loop. Multiple coroutines share a thread, so to thread-local storage they all look like one agent.
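The async failure is easy to reproduce with the standard library alone. In this sketch, count_call and agent are hypothetical names. Two coroutines each make three calls, but because both run on the same thread they increment the same thread-local counter.

```python
import asyncio
import threading

# One thread-local store, the way a per-thread counter would be kept.
_local = threading.local()


def count_call() -> int:
    """Increment and return the calling thread's counter."""
    _local.calls = getattr(_local, "calls", 0) + 1
    return _local.calls


async def agent(name: str, n_calls: int, seen: dict) -> None:
    """Hypothetical agent making n_calls tool calls, yielding between them."""
    for _ in range(n_calls):
        seen[name] = count_call()
        await asyncio.sleep(0)  # give the other coroutine a turn


async def main() -> dict:
    seen: dict = {}
    await asyncio.gather(agent("A", 3, seen), agent("B", 3, seen))
    return seen


seen = asyncio.run(main())
print(seen)  # the highest count seen is 6, although each agent made only 3 calls
```

With a ceiling of five, one of these agents would be refused even though neither came close to its own limit.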
RunContext is the answer. You create a new RunContext for each agent run. It is a plain object that holds the counters. You pass it to @guarded when you decorate your functions. The counters live on the context, not on a global or thread-local.
# Each invocation gets its own context.
def handle_request(task: str) -> str:
    ctx = RunContext(budgets)  # fresh counters, zero cost
    @guarded(ctx, name="search")
    def search(query: str) -> list[str]:
        return my_search_api(query)
    return run_agent(tools=[search], task=task)
The RunContext is thread-safe. Its internal counter uses a lock. Two threads calling guarded functions against the same context, rare but possible in some agent frameworks, do not race past the cap.
You do not have to call ctx.reset() between runs. You just make a new RunContext. The old one gets garbage collected. No shared state to clean up.
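For readers curious how a context-scoped, lock-protected counter can work, here is a minimal sketch. It is not the library's source: SketchContext, sketch_guarded, and BudgetExceeded are hypothetical names, and the real RunContext may be structured differently. The key detail is that the check and the increment happen under one lock, so two threads cannot both claim the last slot.

```python
import functools
import threading
from concurrent.futures import ThreadPoolExecutor


class BudgetExceeded(Exception):
    """Illustrative exception; not the library's ToolBudgetExceeded."""

    def __init__(self, tool_name: str, limit: int):
        super().__init__(f"{tool_name} hit its limit of {limit}")
        self.tool_name = tool_name
        self.limit = limit


class SketchContext:
    """Per-run counters guarded by a lock (hypothetical stand-in)."""

    def __init__(self, limits: dict[str, int]):
        self._limits = dict(limits)
        self._counts = {name: 0 for name in limits}
        self._lock = threading.Lock()

    def acquire(self, name: str) -> None:
        # Check and increment atomically so the cap cannot be overshot.
        with self._lock:
            if self._counts[name] >= self._limits[name]:
                raise BudgetExceeded(name, self._limits[name])
            self._counts[name] += 1


def sketch_guarded(ctx: SketchContext, name: str):
    """Decorator that charges the context before the wrapped body runs."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            ctx.acquire(name)
            return fn(*args, **kwargs)
        return wrapper
    return decorate


ctx = SketchContext({"search": 5})
executed = []  # records every call whose body actually ran


@sketch_guarded(ctx, name="search")
def search(query: str) -> str:
    executed.append(query)
    return f"results for {query}"


def attempt(i: int) -> bool:
    """Return True if the call was allowed, False if the budget refused it."""
    try:
        search(f"q{i}")
        return True
    except BudgetExceeded:
        return False


# Fifty attempts from eight threads against a ceiling of five.
with ThreadPoolExecutor(max_workers=8) as pool:
    outcomes = list(pool.map(attempt, range(50)))

print(sum(outcomes), len(executed))
```

However the threads interleave, exactly five calls get through and the function body runs exactly five times. A second SketchContext built from the same limits would start again at zero, which is the "just make a new one" property described above.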
For agents that spawn sub-agents, you can pass the same RunContext down. The child agent's tool calls count against the parent's budget. Or you create a child RunContext with tighter limits. Both patterns are supported.
# Sub-agent shares the parent's budget.
def run_sub_agent(ctx: RunContext, subtask: str):
    @guarded(ctx, name="search")
    def search(query: str) -> list[str]:
        return my_search_api(query)
    return run_agent(tools=[search], task=subtask)
When this is useful
- You have an agent with tools that have real-world costs or side effects: send_email, post_to_slack, create_ticket. You never want those called more than once or twice per run.
- Your agent runs in a loop (ReAct pattern, tool-use loops, multi-step planners) and you want a hard ceiling on how many iterations any given tool can contribute.
- You are debugging a looping agent in staging and want a fast fail instead of waiting 20 minutes and paying for 500 calls before killing the process.
- You are deploying to users and want to guarantee that a single bad input cannot cause an agent to spend an unlimited amount on search or API calls.
When this is NOT what you want
- For simple scripts with one LLM call and no loop. A plain if-statement is enough.
- For caps based on token count or USD spend. Use token-budget-py or llm-budget-window for those.
- For preventing duplicate calls with the same arguments. That is caching, not counting. Use tool-call-cache to memoize tool results so budget is not wasted on repeated identical calls.
Install
pip install tool-call-budgets
No dependencies. Zero. The library is pure Python with no third-party imports.
GitHub: MukundaKatta/tool-call-budgets
44 tests, all passing.
Sibling libraries
| Lib | Boundary | Repo |
|---|---|---|
| token-budget-py | Token and USD cap per run | MukundaKatta/token-budget-py |
| llm-budget-window | Time-windowed cap (per minute, hour, day) | MukundaKatta/llm-budget-window |
| tool-call-cache | Memoize tool results so budget is not wasted on repeated calls | MukundaKatta/tool-call-cache |
| llm-circuit-breaker-py | Error-rate circuit breaker for LLM calls | MukundaKatta/llm-circuit-breaker-py |
| agent-deadline | Cooperative time cap per agent run | MukundaKatta/agent-deadline |
token-budget-py and tool-call-budgets are the most common pair. Token budget says "stop spending money." Call-count budget says "stop calling this tool." Combined, you have both a USD ceiling and a per-tool ceiling. Neither one alone covers the full picture.
What is next
A few things are on the list:
- A ctx.summary() method that returns a dict of {tool_name: {calls: N, limit: M, remaining: K}} so agent harnesses can log or display budget status mid-run.
- A soft-limit mode that logs a warning at 80% of the cap without raising, giving the agent a chance to wrap up gracefully before hitting the hard stop.
- A ToolBudgets.from_config(path) loader for teams that want to define limits in a YAML or JSON config file rather than in code.
The core loop is stable. @guarded, RunContext, ToolBudgetExceeded. Those three pieces cover the most common failure mode: a tool called too many times in a single run, unnoticed, until the invoice arrives.
Built for the Hermes Agent Challenge. Part of a series of small libraries for production agent infrastructure.





















