Why the retry loop is usually the expensive part of agent work

The first failure is usually not the expensive one.

The expensive part is what happens after the first failure, when the system keeps trying, keeps spending, and keeps producing the same outcome because nothing about the situation has changed.

We kept running into a simple pattern: the agent would miss a step, the runtime would retry, the next attempt would see the same state, and the loop would repeat until the cost was visible in the bill or the operator log. At that point the problem stops being a model-quality issue and becomes a control-system issue.

Why the loop hurts more than the mistake

A single bad step is recoverable. An unbounded retry loop compounds the mistake.

That is true for token spend, API calls, and operator attention. It is also true for trust. Once a system gets a reputation for wandering, people stop letting it touch real work.

The failure mode is boring, which is why it gets missed. Nobody looks at a happy-path demo and thinks about what happens after the third identical error. But that is where the real cost lives.

What we tried first

The obvious moves are usually the wrong ones:

make the prompt longer
add a generic retry
increase the timeout
let the model reason more
rerun the same command with slightly different wording

Those changes can make a demo look better, but they do not fix a stuck loop.

If the environment is unchanged, a retry is often just a second copy of the same mistake.

What actually worked

The fix was not smarter language. It was stricter boundaries.

We had to make the runtime answer four questions before it kept going:

What is the budget?
What counts as success?
What is the verifier?
What happens when the same failure repeats?

A small policy block is often enough to make that concrete:

{
  "budget_cap": 250,
  "max_attempts": 3,
  "stop_on_same_error": true,
  "require_verifier": true,
  "emit_receipt": true
}

That does not sound ambitious. That is the point.

The biggest reliability gain came from refusing to treat repeated failure as progress. Once the runtime could detect the same blocker two or three times in a row, it had permission to stop instead of pretending the next rerun would somehow be different.
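
As a rough illustration, the stop rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any particular runtime: the field names mirror the policy block above, while `Attempt`, `run_bounded`, the stop-reason strings, and the idea that `budget_cap` and `cost` share a unit are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Policy:
    # Field names mirror the policy block; the unit of budget_cap is an assumption.
    budget_cap: float = 250
    max_attempts: int = 3
    stop_on_same_error: bool = True


@dataclass
class Attempt:
    ok: bool                     # did the verifier accept the result?
    cost: float                  # spend for this attempt, same unit as budget_cap
    error: Optional[str] = None  # normalized error signature, None on success


def run_bounded(step: Callable[[], Attempt], policy: Policy) -> str:
    """Call `step` until it succeeds or a policy limit is hit; return the stop reason."""
    spent = 0.0
    last_error: Optional[str] = None
    for _ in range(policy.max_attempts):
        attempt = step()
        spent += attempt.cost
        if attempt.ok:
            return "success"
        if spent >= policy.budget_cap:
            return "budget_exhausted"
        # Same blocker twice in a row: stop instead of rerunning blind.
        if (
            policy.stop_on_same_error
            and attempt.error is not None
            and attempt.error == last_error
        ):
            return "repeated_error"
        last_error = attempt.error
    return "max_attempts"
```

The design choice worth noticing is that the repeated-error check sits inside the loop, so a run that keeps hitting the same blocker ends on the second occurrence rather than consuming every allowed attempt.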

Why receipts matter

Receipts turn a run from a vague story into a checkable fact.

A receipt should show:

what the agent tried
what changed
what failed
why the run stopped

Without that, a loop can hide inside a confident-sounding summary. With it, you can see the exact stopping point and decide whether the next action should be a human intervention, a different tool, or no action at all.
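
One way to make a receipt concrete is a small record with one field per item in the list above. The sketch below is an assumption about shape only: the class name `Receipt`, the field names, and the choice of JSON as the output format are hypothetical, picked to show that a receipt can be plain data rather than prose.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Receipt:
    tried: List[str] = field(default_factory=list)    # what the agent tried
    changed: List[str] = field(default_factory=list)  # what changed
    failed: List[str] = field(default_factory=list)   # what failed
    stop_reason: str = ""                             # why the run stopped

    def to_json(self) -> str:
        # Sorted keys keep receipts stable and easy to diff between runs.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


receipt = Receipt(
    tried=["apply migration", "apply migration"],
    changed=[],
    failed=["permission denied", "permission denied"],
    stop_reason="repeated_error",
)
print(receipt.to_json())
```

An empty `changed` list next to two identical entries in `failed` is exactly the pattern a summary tends to smooth over and a receipt makes visible.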

That is also why this kind of work ends up feeling less like prompt engineering and more like operations.

The tradeoff

Stricter control means the system stops earlier.

That can feel annoying when you want the agent to push through friction. But earlier stopping is cheaper than a long blind retry sequence. More importantly, it preserves operator trust.

A bounded agent is less flashy than an agent that never gives up. It is also much more usable.

That is the core of the control-layer approach we keep coming back to in MartinLoop: the runtime should know when to stop, when to ask for help, and when to write down what happened.

What we are watching next

The next improvement is not more retries.

It is better failure classification so the runtime can separate:

missing permission
stale state
tool mismatch
external outage
real task completion

When those are distinct, the system can choose a better next step instead of recycling the same command.
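
A sketch of what that separation could look like, with heavy caveats: the substring rules, the `Outcome` names, and the next-step labels below are all hypothetical placeholders, and a real classifier would more likely key off structured error codes than message text. The point is only that each class maps to a different next step.

```python
from enum import Enum


class Outcome(Enum):
    MISSING_PERMISSION = "missing_permission"
    STALE_STATE = "stale_state"
    TOOL_MISMATCH = "tool_mismatch"
    EXTERNAL_OUTAGE = "external_outage"
    COMPLETED = "completed"
    UNKNOWN = "unknown"


# Hypothetical substring rules, checked in order; first match wins.
RULES = [
    ("permission denied", Outcome.MISSING_PERMISSION),
    ("stale", Outcome.STALE_STATE),
    ("unknown tool", Outcome.TOOL_MISMATCH),
    ("timed out", Outcome.EXTERNAL_OUTAGE),
]

# Hypothetical next steps; none of them is "rerun the same command".
NEXT_STEP = {
    Outcome.MISSING_PERMISSION: "ask_human",
    Outcome.STALE_STATE: "refresh_state",
    Outcome.TOOL_MISMATCH: "switch_tool",
    Outcome.EXTERNAL_OUTAGE: "back_off",
    Outcome.COMPLETED: "stop",
    Outcome.UNKNOWN: "stop_and_report",
}


def classify(message: str, verified: bool) -> Outcome:
    """Map a verifier result and an error message to an outcome class."""
    if verified:
        return Outcome.COMPLETED
    text = message.lower()
    for needle, outcome in RULES:
        if needle in text:
            return outcome
    return Outcome.UNKNOWN
```

For example, `NEXT_STEP[classify("Permission denied: deploy key", verified=False)]` evaluates to `"ask_human"`, while an unrecognized message falls through to `"stop_and_report"` instead of another retry.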

That is the line between an agent that looks autonomous and an agent that is actually operable.

What failure shape are you still letting your runtime retry too many times?
