Per-Key Rate Limiting for Agent Tool Calls: Stop One User From Breaking Everything

Multi-tenant agents share infrastructure. When one user's agent calls the web search tool 200 times in a minute, every other user's agent slows down or gets errors.

Global rate limits protect the provider. Per-key rate limits protect your users from each other. agent-rate-fence is a sliding-window rate limiter where the key is whatever you want: user ID, session ID, tool name, or any combination.
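Because the key is just a string, a combined key can be built by joining its parts. The sketch below is one way to do that; `make_key` is a hypothetical helper for illustration, not part of agent-rate-fence, and the `:` separator is an arbitrary choice.

```python
def make_key(*parts: str) -> str:
    """Join key parts into one string key (hypothetical helper, not a library function)."""
    # Escape backslashes and the separator so that ("a:b", "c") and
    # ("a", "b:c") cannot collapse into the same key.
    escaped = (p.replace("\\", "\\\\").replace(":", "\\:") for p in parts)
    return ":".join(escaped)


per_user = make_key("user-42")                                # one counter per user
per_user_tool = make_key("user-42", "search_web")             # per user, per tool
per_session = make_key("user-42", "session-9", "search_web")  # per user, session, and tool
```

The more parts a key has, the more independent counters exist, so a per-user-per-tool key lets a user hit each tool's limit separately.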

The Shape of the Fix

from agent_rate_fence import RateFence, RateLimitExceeded

fence = RateFence(
    max_calls=10,
    window_seconds=60.0,
)

def rate_limited_search(user_id: str, query: str) -> list | dict:  # dict on rate-limit error (3.10+ syntax)
    try:
        with fence.allow(key=user_id):
            return web_search(query)
    except RateLimitExceeded as e:
        return {"error": f"Rate limit exceeded. Retry after {e.retry_after_seconds:.0f}s"}

Ten calls per user per 60-second window. The eleventh call raises RateLimitExceeded with a retry_after_seconds hint.

What It Does NOT Do

agent-rate-fence does not rate limit across processes or machines. It is in-process only. For distributed rate limiting across a fleet of workers, you need Redis or a similar shared store.

It does not differentiate between tool types. All calls through the same fence share the same counter for the same key. If you want different limits for different tools, create a fence per tool.

It does not queue requests that exceed the limit. Excess requests fail immediately. For a queueing approach, you need a task queue.
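If failing immediately is too strict but a task queue is overkill, the caller can wait for the hinted time and try again. The following is a minimal sketch of that idea, not a library feature: `call_with_retry` is a hypothetical helper, and `RateLimitExceeded` is redefined here as a stand-in so the block runs on its own.

```python
import time


class RateLimitExceeded(Exception):
    """Stand-in for the library's exception so this sketch is self-contained."""

    def __init__(self, retry_after_seconds: float):
        super().__init__(f"retry after {retry_after_seconds:.2f}s")
        self.retry_after_seconds = retry_after_seconds


def call_with_retry(fn, max_attempts=3, sleep=time.sleep):
    """Call fn(); on RateLimitExceeded, sleep for the hinted time and retry.

    Hypothetical helper. `sleep` is injectable so tests do not have to wait.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitExceeded as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the rate-limit error
            sleep(max(exc.retry_after_seconds, 0.0))
```

This blocks the calling thread while it waits, so it suits background work better than latency-sensitive request paths.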

Inside the Library

The sliding window is implemented with one deque of timestamps per key:

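import threading
import time
from collections import deque
from contextlib import contextmanager  # needed for the @contextmanager decorator below


# Minimal definition so this excerpt runs on its own; the library's class may differ.
class RateLimitExceeded(Exception):
    def __init__(self, retry_after_seconds: float):
        super().__init__(f"Rate limit exceeded; retry after {retry_after_seconds:.1f}s")
        self.retry_after_seconds = retry_after_seconds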

class RateFence:
    def __init__(self, max_calls: int, window_seconds: float):
        self._max = max_calls
        self._window = window_seconds
        self._calls: dict[str, deque] = {}
        self._lock = threading.Lock()

    @contextmanager
    def allow(self, key: str):
        with self._lock:
            now = time.monotonic()
            if key not in self._calls:
                self._calls[key] = deque()

            # Remove expired entries
            dq = self._calls[key]
            while dq and dq[0] < now - self._window:
                dq.popleft()

            if len(dq) >= self._max:
                oldest = dq[0]
                retry_after = self._window - (now - oldest)
                raise RateLimitExceeded(retry_after_seconds=retry_after)

            dq.append(now)

        yield

The sliding window is more accurate than a fixed window (which can allow 2x the limit at window boundaries). A deque per key is efficient: popleft() is O(1), and the deque only holds timestamps within the current window.
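The boundary effect can be reproduced with a small simulation. The sketch below compares a naive fixed-window counter with the sliding-window check on the same burst of timestamps; both functions are illustrative stand-ins written for this comparison, not library code.

```python
from collections import deque


def fixed_window_admitted(timestamps, max_calls, window):
    """Count calls admitted by a counter that resets every `window` seconds."""
    counts = {}
    admitted = 0
    for t in timestamps:
        bucket = int(t // window)  # which fixed window this call falls in
        if counts.get(bucket, 0) < max_calls:
            counts[bucket] = counts.get(bucket, 0) + 1
            admitted += 1
    return admitted


def sliding_window_admitted(timestamps, max_calls, window):
    """Count calls admitted by the deque-based sliding window."""
    dq = deque()
    admitted = 0
    for t in timestamps:
        while dq and dq[0] < t - window:  # drop timestamps older than the window
            dq.popleft()
        if len(dq) < max_calls:
            dq.append(t)
            admitted += 1
    return admitted


# Ten calls just before the 60 s boundary, ten more just after it.
burst = [59.0 + i * 0.05 for i in range(10)] + [60.0 + i * 0.05 for i in range(10)]

fixed = fixed_window_admitted(burst, max_calls=10, window=60.0)      # 20
sliding = sliding_window_admitted(burst, max_calls=10, window=60.0)  # 10
```

The fixed window admits all 20 calls inside about 1.5 seconds because the counter resets at t=60, while the sliding window admits only the first 10.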

Thread safety: the entire check-and-append is under one lock. This prevents the TOCTOU race where two concurrent callers both see len(dq) < max and both proceed.
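One way to check this property is to release many threads at once and count how many get through. The sketch below uses `MiniFence`, a stand-in that mirrors the check-and-append logic above but returns a boolean instead of raising; it is an assumption made for the demo, not the library class.

```python
import threading
import time
from collections import deque


class MiniFence:
    """Stand-in mirroring the locked check-and-append; not the library class."""

    def __init__(self, max_calls, window_seconds):
        self._max = max_calls
        self._window = window_seconds
        self._calls = {}
        self._lock = threading.Lock()

    def try_acquire(self, key):
        with self._lock:  # check and append happen as one atomic step
            now = time.monotonic()
            dq = self._calls.setdefault(key, deque())
            while dq and dq[0] < now - self._window:
                dq.popleft()
            if len(dq) >= self._max:
                return False
            dq.append(now)
            return True


def count_admitted(fence, key, n_threads):
    """Start n_threads that all call try_acquire at the same moment."""
    results = []
    barrier = threading.Barrier(n_threads)

    def worker():
        barrier.wait()  # line all threads up before they race
        results.append(fence.try_acquire(key))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)


admitted = count_admitted(MiniFence(max_calls=10, window_seconds=60.0), "user-42", 50)
```

With the lock in place, exactly 10 of the 50 concurrent callers are admitted; without it, two threads could both pass the length check before either appends.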

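Cleanup: expired timestamps are only removed when allow() is next called for the same key, so idle keys accumulate in self._calls. A background cleanup thread is optional. By default, the deque for an idle key stays in memory holding at most max_calls stale timestamps, which is cheap per key but grows with the number of distinct keys seen.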
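A cleanup pass can be as small as one function over the key-to-deque mapping. The sketch below is an assumption about how such a pass might look; `prune_idle_keys` is a hypothetical name, and in a real fence it would need to run while holding the same lock as allow().

```python
from collections import deque


def prune_idle_keys(calls: dict, now: float, window: float) -> int:
    """Delete keys with no timestamp inside the window; return how many were removed.

    Hypothetical helper. A key is idle when its deque is empty or when its
    newest timestamp (the right end of the deque) is older than the window.
    """
    idle = [key for key, dq in calls.items() if not dq or dq[-1] < now - window]
    for key in idle:
        del calls[key]
    return len(idle)


calls = {
    "user-a": deque([1.0, 2.0]),  # last call long ago
    "user-b": deque([95.0]),      # still inside the window
    "user-c": deque(),            # already empty
}
removed = prune_idle_keys(calls, now=100.0, window=60.0)
```

Only the newest timestamp needs checking, because each deque is appended in time order.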

When to Use It

Use it for multi-tenant agents where users share tool infrastructure. Web search, database queries, external API calls — any tool where one user consuming excessive capacity affects others.

Use it by tool category, not just by user. A user making 10 web searches per minute might be fine. A user making 10 DELETE operations per minute might not be. Create a fence per tool category with different limits.

Use it for cost attribution and control. Pairing per-user rate limiting with per-user cost tracking gives you a complete picture of resource consumption per tenant.

Skip it for single-user agents or closed environments where all agent instances belong to the same user. The overhead is not worth it.

Install

pip install git+https://github.com/MukundaKatta/agent-rate-fence

from agent_rate_fence import RateFence, RateLimitExceeded

# Different limits for different tool categories
search_fence = RateFence(max_calls=20, window_seconds=60.0)
write_fence = RateFence(max_calls=5, window_seconds=60.0)
delete_fence = RateFence(max_calls=2, window_seconds=300.0)

FENCES = {
    "search_web": search_fence,
    "create_record": write_fence,
    "update_record": write_fence,
    "delete_record": delete_fence,
}

def execute_tool(tool_name: str, user_id: str, args: dict):
    fence = FENCES.get(tool_name)
    if fence:
        try:
            with fence.allow(key=user_id):
                return call_tool(tool_name, args)
        except RateLimitExceeded as e:
            return {
                "error": "rate_limit_exceeded",
                "retry_after": e.retry_after_seconds,
            }
    return call_tool(tool_name, args)

Sibling Libraries

| Library | What it solves |
| --- | --- |
| `llm-rate-limit-bucket` | Token-bucket rate limiter for LLM API calls |
| `token-budget-pool` | Shared USD/token budget across concurrent agents |
| `llm-cost-cap` | Per-call cost gate |
| `llm-batch-coalesce` | Collapse duplicate calls from concurrent callers |
| `agent-deadline` | Time-bound agent execution |

The multi-tenant stack: agent-rate-fence per user per tool, token-budget-pool for shared fleet budget, llm-cost-cap for individual call bounds.

What's Next

Redis backend for distributed rate limiting is the most requested missing feature. The interface would stay the same, for example RateFence(max_calls=10, window_seconds=60, backend=RedisBackend(url="redis://...")), but the per-key timestamps would live in Redis, with atomic Lua scripts doing the check-and-append.

Burst allowance: RateFence(max_calls=10, window_seconds=60, burst=20) would allow up to 20 calls in a short burst before rate limiting kicks in, while still enforcing the 10-per-60s average over the long run. This matches how many API rate limits work.
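The usual way to get "burst up to N, then a steady average" is a token bucket. The sketch below shows that technique under stated assumptions; it is not the planned implementation, and `TokenBucket`, its parameters, and the injectable `clock` are all hypothetical names.

```python
import threading
import time


class TokenBucket:
    """Token bucket: `capacity` bounds the burst, `rate_per_second` sets the average."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self._rate = rate_per_second
        self._capacity = float(capacity)
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            now = self._clock()
            # Refill for the elapsed time, never exceeding the burst capacity.
            refill = (now - self._last) * self._rate
            self._tokens = min(self._capacity, self._tokens + refill)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False


# 10 calls per 60 s on average, with bursts of up to 20.
bucket = TokenBucket(rate_per_second=10 / 60, capacity=20)
burst_admitted = sum(bucket.try_acquire() for _ in range(20))
```

A full bucket admits 20 back-to-back calls, after which one new call becomes available roughly every 6 seconds.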

Per-key limit overrides: some users legitimately need higher limits (enterprise tiers). A fence.set_limit(key="enterprise-user-1", max_calls=100) would allow per-key overrides without a separate fence.
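A per-key override could be a second dictionary consulted before the default limit. The sketch below illustrates the idea only; `OverridableFence`, `try_acquire`, and the injectable `clock` are assumptions made for this example and say nothing about how the library will implement it.

```python
import threading
import time
from collections import deque


class OverridableFence:
    """Sliding-window limiter with optional per-key limits (illustrative only)."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self._max = max_calls
        self._window = window_seconds
        self._clock = clock
        self._calls = {}
        self._overrides = {}  # key -> max_calls for that key
        self._lock = threading.Lock()

    def set_limit(self, key, max_calls):
        with self._lock:
            self._overrides[key] = max_calls

    def try_acquire(self, key) -> bool:
        with self._lock:
            now = self._clock()
            limit = self._overrides.get(key, self._max)  # fall back to the default
            dq = self._calls.setdefault(key, deque())
            while dq and dq[0] < now - self._window:
                dq.popleft()
            if len(dq) >= limit:
                return False
            dq.append(now)
            return True


fence = OverridableFence(max_calls=2, window_seconds=60.0)
fence.set_limit(key="enterprise-user-1", max_calls=5)

free_admitted = sum(fence.try_acquire("free-user") for _ in range(6))
enterprise_admitted = sum(fence.try_acquire("enterprise-user-1") for _ in range(6))
```

The window length stays shared across keys here; only the call count is overridden.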

Built as part of the agent-stack family: composable Python primitives for production LLM agents.
