Why AI Agent Policies Must Be Deterministic, Not Probabilistic

There's a philosophical split in how the AI industry thinks about agent safety. One camp says the model should govern itself — better prompts, better training, better alignment. The other says external enforcement is necessary because models are inherently probabilistic and shouldn't be trusted to enforce their own constraints.

Both camps are partially right — but in practice, almost every production MCP agent today relies entirely on the first approach. Safety rules live in the system prompt. The model interprets them. The model decides whether to follow them. There's no external check. This post argues for deterministic AI agent policies: rules evaluated outside the model, at the transport layer, on every tool call.

What "Deterministic" Means Here

A deterministic policy produces the same result for the same input, every time. Given a tool call with args.amount = 60000 and a rule that says args.amount <= 50000, the call is denied. No interpretation. No context. No probability.
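As a minimal sketch of what "same input, same result" means in code, the check below resolves a dotted path against a tool call and compares it to a fixed limit. The helper names (lookup, amount_within_limit) and the shape of the call dictionary are assumptions made for illustration, not part of any particular policy engine.

```python
from typing import Any


def lookup(path: str, call: dict[str, Any]) -> Any:
    """Resolve a dotted path such as "args.amount" against a tool call."""
    value: Any = call
    for key in path.split("."):
        value = value[key]
    return value


def amount_within_limit(call: dict[str, Any], limit: int = 50000) -> bool:
    """Return True only when args.amount <= limit. Nothing else is consulted."""
    return lookup("args.amount", call) <= limit


# Hypothetical tool call, with the amount in cents.
call = {"tool": "create_charge", "args": {"amount": 60000, "currency": "usd"}}

# Evaluating the same call a thousand times yields exactly one distinct answer.
results = {amount_within_limit(call) for _ in range(1000)}
print(results)  # {False}
```

The function has no access to conversation history, temperature, or phrasing, so none of those can change the outcome.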

Compare this to how a language model evaluates the same constraint. The system prompt says "do not allow charges over $500." The model sees a tool call for $600. In most cases, it will refuse. But its decision is influenced by conversation history, prompt injection, context window length, temperature, and the specific phrasing of the instruction. The outcome is probabilistic.

For many agent behaviours, probabilistic is fine. You want the model to use judgment about which tools to call, how to interpret user requests, and when to ask for clarification. These are inherently fuzzy decisions.

But safety constraints are not fuzzy decisions. "Don't exceed $500 per charge" has a definitive answer for any given amount. "Don't call delete_repository" is either enforced or it isn't. "Maximum 5 issues per hour" requires exact counting, not estimation.

Deterministic policies handle the definitive cases. The model handles everything else.

The Prompt Enforcement Problem

Consider a concrete example. You're running an agent with access to a Stripe MCP server. Your system prompt says:

You must not create charges exceeding $500.
You must not create more than $10,000 in charges per day.
You must only charge in USD or EUR.

Three rules. All seem clear. Here's where they break down:

Rule 1 works reasonably well for single calls. The model can see amount: 60000 and recognise it exceeds $500. But what about amount: 50001? The amount is in cents, so the model needs to convert it to dollars ($500.01), apply the comparison, and decide. It usually gets this right. Not always.

Rule 2 is where things get interesting. To enforce a daily spend cap, the model needs to track cumulative spending across all create_charge calls in the current day. This requires maintaining a running total in its context window. After 20 calls, the model is summing numbers from earlier in the conversation. After 50 calls, some of those earlier calls may have been compressed or summarised. The model is now estimating its cumulative spend, not calculating it.

Rule 3 seems simple until the model encounters an edge case. What about "USD"? "us_dollar"? "dollars"? The model interprets these flexibly. A deterministic policy using the in operator matches exact strings — "usd" and "eur" — with no ambiguity.
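The exact-match behaviour can be illustrated with a plain membership test. This is a sketch that assumes "exact" also means case-sensitive; the constant and function names are hypothetical.

```python
ALLOWED_CURRENCIES = frozenset({"usd", "eur"})


def currency_allowed(currency: object) -> bool:
    """Exact membership test: no normalisation, no synonyms, no guessing."""
    return isinstance(currency, str) and currency in ALLOWED_CURRENCIES


# Only the two exact strings pass; every lookalike is denied.
for candidate in ["usd", "eur", "USD", "us_dollar", "dollars"]:
    verdict = "allow" if currency_allowed(candidate) else "deny"
    print(f"{candidate!r}: {verdict}")
```

If upper-case codes should be accepted, that becomes an explicit choice (normalise before the check, or list both spellings) rather than something left to interpretation.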

What Deterministic Enforcement Looks Like

The same three rules as a deterministic policy:

version: "1"
description: "Stripe spending controls"

tools:
  create_charge:
    rules:
      - name: "max single charge"
        conditions:
          - path: "args.amount"
            op: "lte"
            value: 50000
        on_deny: "Single charge cannot exceed $500.00"

      - name: "daily spend cap"
        conditions:
          - path: "state.create_charge.daily_spend"
            op: "lte"
            value: 1000000
        on_deny: "Daily spending cap of $10,000.00 reached"
        state:
          counter: "daily_spend"
          window: "day"
          increment_from: "args.amount"

      - name: "allowed currencies"
        conditions:
          - path: "args.currency"
            op: "in"
            value: ["usd", "eur"]
        on_deny: "Only USD and EUR charges are permitted"

Each rule evaluates against the raw tool call arguments and persistent state. The daily spend counter is maintained in a state store (SQLite or Redis), not in the model's context window. It's exact, not estimated. It survives context compression. It survives process restarts.

The increment_from: "args.amount" directive tells the counter to add the actual charge amount, not just count calls. A $50 charge increments by 5000. A $200 charge increments by 20000. The arithmetic is precise because a computer is doing it, not a language model. See How to Add Spending Controls to Any MCP Agent for a full walkthrough of building these policies.
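To make the counter idea concrete, here is a rough sketch of a daily spend counter kept in SQLite rather than in the model's context. The table layout and function names are assumptions for illustration and are not Intercept's actual storage schema. The sketch also assumes a charge is denied when the running total plus the new amount would exceed the cap; the engine's real check-then-increment order may differ.

```python
import sqlite3
from datetime import date

DAILY_CAP_CENTS = 1_000_000  # $10,000.00 in cents, as in the policy above


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the counter store; pass a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counters ("
        "name TEXT NOT NULL, day TEXT NOT NULL, total INTEGER NOT NULL, "
        "PRIMARY KEY (name, day))"
    )
    conn.commit()
    return conn


def daily_total(conn: sqlite3.Connection, day: date) -> int:
    """Return the exact amount recorded for the given day, in cents."""
    row = conn.execute(
        "SELECT total FROM counters WHERE name = ? AND day = ?",
        ("daily_spend", day.isoformat()),
    ).fetchone()
    return row[0] if row else 0


def try_charge(conn: sqlite3.Connection, amount_cents: int, day: date) -> bool:
    """Record the charge if it fits under the cap; return True for allow."""
    new_total = daily_total(conn, day) + amount_cents
    if new_total > DAILY_CAP_CENTS:
        return False  # deny: the counter is left unchanged
    with conn:  # commits on success, rolls back if the write raises
        conn.execute(
            "INSERT OR REPLACE INTO counters (name, day, total) VALUES (?, ?, ?)",
            ("daily_spend", day.isoformat(), new_total),
        )
    return True


conn = open_store()
today = date(2025, 1, 15)  # arbitrary example date
print(try_charge(conn, 5000, today))   # a $50 charge
print(try_charge(conn, 20000, today))  # a $200 charge
print(daily_total(conn, today))        # 25000
```

Keying the row on the calendar day is one simple way to model a "day" window: a new day starts from zero without any reset job.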

Three Properties of Good Policies

Deterministic policies have three properties that prompt-based rules lack:

1. Verifiability

You can prove a policy does what it claims. Given a policy YAML file, you can enumerate every possible outcome for any tool call. The result depends only on the call's arguments and the counters the policy declares: there are no hidden states, no contextual dependencies, no model-specific behaviours.

Intercept's validate command checks policies statically:

intercept validate -c policy.yaml

This catches missing counters, invalid operators, type mismatches, and logical conflicts. You know the policy is correct before deploying it. Try proving the same thing about a system prompt.

2. Auditability

Every policy evaluation produces a deterministic trace. Tool X was called with arguments Y. Rule Z evaluated condition A against value B. Result: allow or deny.
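A trace of this kind might be sketched as a small structured record per rule evaluation. The field names below are hypothetical and are not Intercept's log format; the point is only that the record is a pure function of the call and the rule.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TraceEntry:
    tool: str      # tool X
    rule: str      # rule Z
    path: str      # where the checked value came from
    op: str        # comparison operator
    expected: int  # value A from the policy
    actual: int    # value B from the tool call
    result: str    # "allow" or "deny"


def evaluate_lte(tool: str, rule: str, path: str, actual: int, expected: int) -> TraceEntry:
    """Evaluate an lte condition and return the full record of the decision."""
    result = "allow" if actual <= expected else "deny"
    return TraceEntry(tool, rule, path, "lte", expected, actual, result)


entry = evaluate_lte("create_charge", "max single charge", "args.amount", 60000, 50000)
line = json.dumps(asdict(entry), sort_keys=True)
print(line)
```

Because the entry contains both the policy value and the observed value, an auditor can re-run the comparison from the log line alone and get the same result.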

This matters for compliance. When a regulator asks "how do you prevent agents from exceeding spending limits?", you can point to a policy file and its enforcement logs. The policy is the spec. The logs prove enforcement. There's no gap between intent and implementation.

3. Composability

Policies compose cleanly. A single YAML file can combine default-deny, tool hiding, per-tool argument validation, stateful counters, rate limits, and wildcard rules. Rules are evaluated independently and ANDed together, so a call is allowed only when every applicable rule passes.

For example, a policy can simultaneously say: deny everything by default, hide destructive tools, allow create_charge with spending limits, allow read_balance unconditionally, and cap total calls at 60 per minute. Each constraint is independent. Adding a new rule doesn't affect existing ones. Removing a rule doesn't break the others.

Try expressing this in a system prompt. It's possible, but the interaction between rules becomes ambiguous. Does "deny everything by default" override "allow create_charge with limits"? The model has to interpret the priority. A policy engine has explicit evaluation order.
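One way to picture an explicit evaluation order is the sketch below: hidden tools first, then default deny, then the tool's own rules ANDed together. The order, the data structures, and the function names are assumptions chosen for illustration and may not match how a given engine sequences its checks; the rate cap from the example is left out to keep the sketch short.

```python
from typing import Any, Callable

Rule = Callable[[dict[str, Any]], bool]


def max_single_charge(args: dict[str, Any]) -> bool:
    amount = args.get("amount")
    return isinstance(amount, int) and amount <= 50000


def allowed_currency(args: dict[str, Any]) -> bool:
    return args.get("currency") in ("usd", "eur")


HIDDEN_TOOLS = {"delete_repository"}
TOOL_RULES: dict[str, list[Rule]] = {
    "create_charge": [max_single_charge, allowed_currency],
    "read_balance": [],  # listed with no rules: allowed unconditionally
}


def decide(tool: str, args: dict[str, Any]) -> str:
    """Apply the checks in a fixed order and return "allow" or "deny"."""
    if tool in HIDDEN_TOOLS:
        return "deny"  # step 1: hidden tools are never callable
    if tool not in TOOL_RULES:
        return "deny"  # step 2: default deny for anything not listed
    # step 3: every rule must pass; all() of an empty list is True
    return "allow" if all(rule(args) for rule in TOOL_RULES[tool]) else "deny"


print(decide("create_charge", {"amount": 4000, "currency": "usd"}))
print(decide("create_charge", {"amount": 4000, "currency": "gbp"}))
print(decide("list_customers", {}))
```

Removing allowed_currency from the list changes nothing about how max_single_charge behaves, which is the independence property described above.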

The Separation of Concerns

The deeper argument for deterministic policies is about separation of concerns. Language models are good at reasoning, planning, and adapting to context. They're not good at exact arithmetic, precise state tracking, or consistent rule enforcement.

A well-designed agent system gives each component what it's good at:

The model decides which tools to call and what arguments to pass. It interprets user intent, plans multi-step workflows, handles errors gracefully, and communicates results clearly.

The policy engine decides whether each tool call is allowed. It checks arguments against constraints, tracks cumulative state, enforces rate limits, and blocks prohibited operations.

This separation means the model doesn't need perfect safety training to produce a safe system. It just needs to be good enough that it doesn't constantly trigger policy denials. The policies handle the rest.

"But What About Dynamic Policies?"

A common objection: deterministic policies are too rigid. Real-world scenarios need context-sensitive rules. "Allow $5,000 charges on weekdays but not weekends." "Rate limit based on the user's subscription tier." "Block certain tools only during incident response."

Some of this is addressable with static policies. Time-windowed counters already handle temporal patterns. Different policy files can be loaded for different contexts.

But the objection has merit. There's a spectrum between "fully static YAML" and "ask the model." The right answer is probably: static enforcement for safety-critical constraints (spending limits, destructive operation blocks, rate limits), with more flexible mechanisms for context-dependent rules.

The important principle is that the safety floor — the set of constraints that must never be violated — should be deterministic. Everything above the floor can be as dynamic and context-sensitive as needed.
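The "different policy files for different contexts" idea can stay deterministic if the selection itself is a plain function of the context. The sketch below picks a file by day of the week; the file names are hypothetical, and how the chosen path is handed to the policy engine is left open.

```python
from datetime import date

# Hypothetical file names, one static policy per context.
WEEKDAY_POLICY = "policy.weekday.yaml"
WEEKEND_POLICY = "policy.weekend.yaml"


def policy_for(day: date) -> str:
    """Choose a policy file from the calendar day alone."""
    # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    return WEEKEND_POLICY if day.weekday() >= 5 else WEEKDAY_POLICY


print(policy_for(date(2025, 1, 15)))  # a Wednesday
print(policy_for(date(2025, 1, 18)))  # a Saturday
```

Each file remains a fully static policy that can be reviewed and validated on its own; only the choice between them depends on context.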

Getting Started

If you're currently relying on prompt guardrails for MCP agent safety, here's a pragmatic migration path:

1. Audit your system prompt for safety-relevant rules. Anything that says "do not," "never," "must not," or "limit to" is a candidate for a deterministic policy.
2. Scan your MCP servers to see what tools are exposed:

   intercept scan -o policy.yaml -- npx -y @modelcontextprotocol/server-github

3. Start with tool hiding and unconditional blocks. Remove tools the agent doesn't need. Block destructive operations. These are zero-risk, high-impact changes.
4. Add rate limits next. Even conservative limits (50 calls/hour per tool, 200 calls/hour global) prevent the worst runaway scenarios.
5. Add argument validation last. This requires understanding the tool's parameter schema, but it's where the highest-value constraints live: spending caps, region locks, permission boundaries.
6. Keep your prompt guardrails. They're still useful as a first line of intent. But don't rely on them as the only line of enforcement.
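For a sense of what the rate-limit step involves, here is a rough sliding-window sketch using the conservative numbers mentioned above (50 calls/hour per tool, 200 calls/hour global). The class and its in-memory storage are assumptions for illustration; a real deployment would keep the counts in a persistent store, and an engine may use fixed windows instead of a sliding one.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600
PER_TOOL_LIMIT = 50
GLOBAL_LIMIT = 200


class RateLimiter:
    """Counts calls exactly over the last hour, per tool and globally."""

    def __init__(self) -> None:
        self._per_tool: dict[str, deque] = defaultdict(deque)
        self._global: deque = deque()

    @staticmethod
    def _expire(calls: deque, now: float) -> None:
        # Drop timestamps that have fallen out of the window.
        while calls and calls[0] <= now - WINDOW_SECONDS:
            calls.popleft()

    def allow(self, tool: str, now: float) -> bool:
        """Return True and record the call if both limits have room."""
        tool_calls = self._per_tool[tool]
        self._expire(tool_calls, now)
        self._expire(self._global, now)
        if len(tool_calls) >= PER_TOOL_LIMIT or len(self._global) >= GLOBAL_LIMIT:
            return False  # denied calls are not counted
        tool_calls.append(now)
        self._global.append(now)
        return True


limiter = RateLimiter()
# 51 calls to one tool, one second apart (timestamps are passed in explicitly).
decisions = [limiter.allow("create_issue", float(t)) for t in range(51)]
print(decisions.count(True), decisions.count(False))  # 50 1
```

Passing the timestamp in as an argument, rather than reading the clock inside allow, keeps the limiter a pure function of its inputs and easy to test.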

The gap between "the model will probably follow this rule" and "this rule is enforced at the transport layer" is the gap between hope and engineering. Safety-critical constraints belong on the engineering side. See what happens when they're not.

FAQ

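What does "deterministic" mean for AI agent policies?

A deterministic policy produces the same result for the same input, every time. The rule is evaluated outside the model, at the transport layer, on every tool call, so a call with args.amount = 60000 is always denied by a rule requiring args.amount <= 50000.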

Can deterministic policies handle dynamic or context-sensitive rules?

Safety-critical constraints (spending limits, destructive operation blocks, rate limits) should be deterministic. For context-dependent rules — like different limits by time of day or user tier — you can load different policy files for different contexts or use time-windowed counters. The important principle is that the safety floor is always deterministic.

How do deterministic policies work alongside prompt guardrails?

They're complementary. The system prompt sets behavioural intent (the model should respect limits). The deterministic policy enforces hard constraints (the policy will enforce limits). The model handles fuzzy decisions like which tools to call and how to interpret user requests. The policy handles definitive decisions like "is this amount under the cap?"
