The Grilling

In Part 1 I argued that every spec-driven AI framework on the market - sixteen of them in my survey - has the same structural blind spot. They all check the implementation against the spec. None of them puts the idea behind the spec under attack before the spec gets written.

Part 2 is the operational deep dive.

What does the missing phase actually look like when you build it?
How does it run?
What are the agents, the prompts, the termination conditions, the artifacts?
When should you not use it?

This part assumes you’ve read Part 1, or at least bought the premise: the spec needs to be on trial before it becomes gospel.

Why “multi-agent debate” isn’t enough

A few research papers and a couple of frameworks have something they call multi-agent debate. Two agents argue, a third synthesizes. This is a real technique with real research behind it, and it’s a meaningful improvement over single-agent reasoning.

It is not Grilling.

The differences matter, and they’re worth being precise about.

The first difference is grounding.

Most debate setups in current frameworks operate on whatever’s in the prompt - they don’t first survey the codebase, the existing tests, the past failures, the applicable constraints. The result is two LLMs hallucinating at each other politely.

The Advocate invents objections that don’t apply to the actual system; the Proposer defends positions against attacks that wouldn’t matter even if they landed. Without a Recon Dossier in front of both agents, the debate is theater. It produces dialogue, not decisions.

Grilling refuses to start until the ground truth is established and verified. That’s not a stylistic choice - it’s the only way the attacks have weight.

The second difference is the optimization target.

Standard debate optimizes for the best version of a chosen position. Two agents start with opposing views and the synthesizer extracts what’s strongest from each. This is genuinely useful when you’ve already decided to do something and you’re trying to figure out the best way.

Grilling optimizes for a different thing entirely: whether the position should be held at all. The Proposer isn’t defending a position because it was assigned to them; they’re proposing a solution they actually think is correct, and the Advocate is trying to dismantle that proposal.

“Kill the idea entirely” is a legitimate output of Grilling. “Neither side has a point” is rarely a legitimate output of standard debate.

The third difference is the stopping condition.

And this might be the most important one. Standard debate ends when both sides have made their case - typically after a fixed number of rounds, or when the orchestrator decides the discussion has matured. That’s a procedural ending, not a substantive one. The debate stops because the schedule says it stops, not because the question has been resolved.

Grilling has a structural stopping condition: equilibrium between two opposing pressures. The attacker has nothing left. The Don has nothing left. Both pressures simultaneously exhausted. Until that condition is met, the rounds continue (up to the hard ceiling). After that condition is met, no more rounds - they’d add nothing.

The stopping condition is the whole game.

If your debate stops on “we’re done arguing,” you’re polishing turds - you exit with whatever the agents converged on, regardless of whether what they converged on was correct.

If it stops on “no new valid objection AND no remaining concerns,” you have something stronger: a verdict that survived attack, with the surviving objections explicitly logged.

Multi-agent debate is a useful tool. It’s just a different tool, solving a different problem.

How a Grilling session is structured

Grilling sits as Phase 2 of the Heist Pipeline. It’s not a prompt and it’s not a standalone tool - it’s a phase with hard gates before it (Reconnaissance must complete and produce a Recon Dossier) and after it (the Don must sign off on the verdict before anything moves to the Sit-Down).

The process runs like a structured interrogation. Three subagents have specific roles, the Don (the user) participates in every round, and the rounds follow a fixed order.

The Proposer opens. It reads the Recon Dossier - the verified findings from Phase 1 - and proposes a solution: architecture, file changes, identified risks, expected behavior.

The Proposer’s job is to put the strongest possible version of the idea on the table. Not the safest version, not the most diplomatic version. The strongest. If the idea is bad, you want it to die fighting, not die mumbling.

The Devil’s Advocate attacks. Architectural flaws. Security gaps. Constitution violations. Performance regressions. Scalability ceilings. Edge cases the Proposer didn’t think about. The Devil’s Advocate’s job - and this is important - is to find the failure mode.

Not to be polite.
Not to suggest improvements.
To attack.

If the Proposer says “we’ll cache this in Redis,” the Devil’s Advocate says:

What happens when Redis is down?
What happens when the cache is poisoned?
Have you measured the actual cache hit rate or are you guessing?

Bad attacks get filtered by the Proposer’s response.
Good attacks force a revision.

The Don - that’s the user, you - weighs in every round. One question at a time. Never bundled. This rule matters more than it sounds.

If the Don asks three questions at once, the agent will answer the easy one fully, the medium one partially, and quietly skip the hard one. One question forces an actual answer. The Don’s questions are usually the most valuable in the whole Grilling, because the Don has context the agents don’t have - about the team, about the business, about the politics, about what’s been tried before that didn’t make it into the codebase.

The Synthesizer closes each round. It incorporates the valid attacks and the Don’s feedback and produces a revised solution. Not a defense of the original - a revision. If nothing valid came up that round, the revision is small. If something hit hard, the revision is structural. Sometimes the revision is “kill this idea entirely and propose a different approach,” and that’s a legitimate outcome.

Then the next round begins.
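The round order above can be sketched as a plain function. This is a minimal illustration, not the Gangsta Agents implementation: runRound, its parameter names, and the shape of the attack objects are all hypothetical, and the agents are stubbed as ordinary callbacks so the control flow stays visible. In a real session, judging whether an attack is valid is the agents’ job; the valid flag here is only a stand-in.

```javascript
// Hypothetical sketch of one Grilling round. None of these names come from
// the Gangsta Agents codebase; the agents are plain callbacks.
function runRound({ dossier, proposal, advocate, respond, askDon, donQuestions, synthesize }) {
  // Hard gate: no verified Recon Dossier, no Grilling.
  if (!dossier || dossier.verified !== true) {
    throw new Error("Grilling cannot start without a verified Recon Dossier");
  }

  // 1. The Devil's Advocate attacks the current proposal, citing the Dossier.
  const attacks = advocate(proposal, dossier);

  // 2. The Proposer's response filters out the attacks that do not land.
  const validAttacks = respond(proposal, attacks);

  // 3. The Don answers one question at a time, never a bundle.
  const donAnswers = [];
  for (const question of donQuestions) {
    donAnswers.push({ question, answer: askDon(question) });
  }

  // 4. The Synthesizer produces a revision, not a defense of the original.
  const revised = synthesize(proposal, validAttacks, donAnswers);
  return { validAttacks, donAnswers, revised };
}

// Usage with stub agents (all values are illustrative).
const result = runRound({
  dossier: { verified: true },
  proposal: { trigger: "state.events[]" },
  advocate: () => [
    { claim: "events may be drained before main.js reads them", valid: true },
    { claim: "an objection that does not apply to this system", valid: false },
  ],
  respond: (_proposal, attacks) => attacks.filter((a) => a.valid),
  askDon: () => "Agree fully.",
  donQuestions: ["Do you agree with the attack?"],
  synthesize: (proposal, validAttacks) =>
    validAttacks.length > 0 ? { ...proposal, trigger: "prevStatus transition" } : proposal,
});
console.log(result.revised.trigger); // prints "prevStatus transition"
```

The gate at the top is the part that matters most: the function refuses to run at all unless the Dossier is marked verified.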

What a round actually looks like

Theory is cheap.

Here’s a real Round 1, lightly edited for length, from a Grilling session on a small feature: adding a local high-score leaderboard to a browser Tetris game.

The task sounds trivial. Watch how fast “trivial” falls apart under attack.

The Proposer opens. It reads the Recon Dossier - a three-layer pure state machine, zero DOM in the logic layer, 261 passing tests, a constitution whose first commandment is “game logic has zero DOM dependencies” - and puts a solution on the table:

Add a name-entry status to the state machine. Store the typed initials in state.nameEntry = { buffer: '', maxLen: 3 }. Detect game-over by reading the gameOver event from state.events[], then route keystrokes into the buffer. Render the leaderboard as a DOM <div> overlay, mirroring the existing settings.js. A pure isTopTen(score, entries) function decides qualification.

Reasonable on its face. Mirrors existing patterns. Names real files. This is the strongest version, not the safest — exactly what the Proposer is supposed to deliver.

The Devil’s Advocate attacks. It doesn’t suggest improvements. It goes for failure modes, citing the Dossier as ammunition:

Attack 1 - state.nameEntry violates Constitution Commandment 1. The name buffer is pure UI interaction state. It has zero effect on scoring, physics, or piece movement. Putting it in state.js pollutes the logic layer that the entire test suite depends on being DOM-free. It belongs in main.js as local orchestration state. Evidence: state.js is the logic layer; 261 tests import it headless.

Attack 2 - state.events[] is the wrong trigger. The Dossier describes that array as existing “for renderer/audio coupling.” It does not specify whether it’s drained per-frame or persistent. If the renderer or audio handler consumes it before main.js reads it, the game-over trigger is silently missed. That’s a race condition waiting to happen. Evidence: Dossier - “Events array for renderer/audio coupling,” no drain semantics specified.

Attack 3 - key suppression is unspecified. S (settings), M (mute), P (pause), R (restart), L (leaderboard) all have live bindings in main.js. Type a name containing “S” and the settings panel opens mid-entry. The Proposer flags this as a risk but offers no mechanism. Evidence: existing main.js key handlers.

Plus a secondary finding the Proposer never mentioned: names rendered from localStorage straight into the DOM are a stored XSS vector unless the write uses textContent instead of innerHTML.

Verdict: CHALLENGE. The idea is sound. The solution has three holes that need closing before anyone writes a spec.

The Don weighs in — one question at a time, never bundled.

Do you agree with the attack? → Agree fully.

Anything the Devil’s Advocate missed? → Nothing missed.

Want to override any part of the proposal before synthesis? → No override.

Three forced answers. No place to hide an evasive “well, sort of.”

The Synthesizer revises. Not a defense of the original - a revision that absorbs every valid hit:

Remove nameEntry from state.js entirely. The buffer becomes let nameBuffer = '' local to main.js. The state machine keeps only the 'name-entry' status string — that genuinely is game state, because it gates input routing and rendering.

Abandon state.events[] for orchestration. main.js tracks prevStatus across ticks; when status flips to game-over, the transition fires. No array drain, no race. The events array keeps its existing renderer/audio role untouched.

Hard early-return guard at the top of the keydown handler: while status === 'name-entry', all keys route to handleNameEntryKey() and nothing else can fire.

All DOM writes use textContent. XSS closed.
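Two of those revisions, the prevStatus transition and the early-return guard, are small enough to sketch. This is a hedged illustration of the idea, not code from the actual game: createStatusTracker, createKeyRouter, and the bindings map are hypothetical names, and only the status strings ('name-entry', game-over) and handleNameEntryKey come from the session above.

```javascript
// Hypothetical sketch of the revised main.js orchestration.

// Tracks the previous status across ticks and reports the flip to game-over
// exactly once, without reading state.events[] at all.
function createStatusTracker() {
  let prevStatus = null;
  return function gameJustEnded(status) {
    const flipped = status === "game-over" && prevStatus !== "game-over";
    prevStatus = status;
    return flipped;
  };
}

// Builds a keydown handler with a hard early-return guard: while the status
// is 'name-entry', every key goes to handleNameEntryKey and nothing else fires.
function createKeyRouter({ handleNameEntryKey, bindings }) {
  return function onKeyDown(status, key) {
    if (status === "name-entry") {
      handleNameEntryKey(key);
      return; // S, M, P, R, L bindings are unreachable from here
    }
    const action = bindings.get(key.toLowerCase());
    if (action) action();
  };
}

// Usage with stand-in handlers.
const typed = [];
const opened = [];
const onKeyDown = createKeyRouter({
  handleNameEntryKey: (key) => typed.push(key),
  bindings: new Map([["s", () => opened.push("settings")]]),
});
onKeyDown("name-entry", "S"); // routed to name entry; settings stays closed
onKeyDown("playing", "S");    // normal binding fires

const gameJustEnded = createStatusTracker();
gameJustEnded("playing");   // false
gameJustEnded("game-over"); // true, the transition tick
gameJustEnded("game-over"); // false, already handled
```

Both helpers take the status as an argument instead of reaching into the state machine, which keeps them testable headless, in the same spirit as the logic layer.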

One round.

The “trivial” feature now has a clean separation between UI and logic state, a race condition designed out before it existed, and a security hole closed before a line of code was written. That’s the difference between “we agreed” and “it survived attack.”

This particular session ran a second round — the Devil’s Advocate came back with six more attacks (no max-length cap on the buffer, a pure-vs-DOM module split, JSON schema validation on load) — before reaching equilibrium. The full verdict, with every rejected alternative and every accepted risk, flowed straight into the Sit-Down as the basis for the Contract.
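To make three of the round-two items concrete, here is a sketch of what the pure side of that module split might look like. Everything in it is an assumption layered on the article: the function names other than isTopTen, the entry shape { name, score }, the cap of 3 characters (borrowed from maxLen in the original proposal), and the choice that a tie with the lowest score does not qualify.

```javascript
// Hypothetical pure leaderboard helpers: no DOM, no localStorage access.
const MAX_ENTRIES = 10;
const MAX_NAME_LEN = 3; // assumption: reuses maxLen from the original proposal

// Appends one typed character, enforcing the length cap.
function appendToBuffer(buffer, key) {
  if (!/^[A-Za-z0-9]$/.test(key)) return buffer;     // ignore non-alphanumerics
  if (buffer.length >= MAX_NAME_LEN) return buffer;  // cap reached
  return buffer + key.toUpperCase();
}

// Schema check for a single stored entry.
function isValidEntry(entry) {
  return (
    typeof entry === "object" && entry !== null &&
    typeof entry.name === "string" &&
    entry.name.length > 0 && entry.name.length <= MAX_NAME_LEN &&
    Number.isInteger(entry.score) && entry.score >= 0
  );
}

// Parses stored JSON defensively: corrupt or tampered data yields fewer
// entries (or none), never an exception.
function parseEntries(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(data)) return [];
  return data
    .filter(isValidEntry)
    .sort((a, b) => b.score - a.score)
    .slice(0, MAX_ENTRIES);
}

// Decides qualification. Assumption: ties with the lowest score do not qualify.
function isTopTen(score, entries) {
  if (entries.length < MAX_ENTRIES) return true;
  const lowest = Math.min(...entries.map((e) => e.score));
  return score > lowest;
}
```

The DOM side of the split would call parseEntries on whatever it reads from localStorage and render the result with textContent; nothing in this file needs a browser to be tested.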

Hard rules on rounds.

Minimum 2 rounds - you can’t grill an idea in a single pass, because the first attack is always shallow. Default maximum 5 rounds - most ideas resolve here, either by surviving or by transforming into something different.

Hard ceiling 7 - the Don can extend, but not beyond, because past round 7 returns diminish sharply and you’re usually just rationalizing at that point. Early exit only after round 2, only by explicit Don call - used when convergence is genuinely fast and continuing would be theater.

Termination is not “we agreed.”

Agreement is the easiest thing in the world to manufacture between an LLM and another LLM, and between an LLM and a tired user. Termination is one of three structural conditions:

Nash Equilibrium is the canonical one. The Devil’s Advocate raises no new valid objection AND the Don has no remaining concerns. Both attacking pressures have run out of ammunition simultaneously. The idea has genuinely survived attack - not because the attack stopped, but because the attack hit nothing that wasn’t already accounted for. This is the outcome you want.

Explicit consensus is the fast-path. The Don ends the Grilling after round 2, declaring that the idea has been adequately tested. This is appropriate when the problem is genuinely simple, when the Recon Dossier already addressed the major risks, or when the team has high confidence from prior similar work. It’s a real exit, but it’s the Don’s call to make, not the agents’.

Round limit is the safety valve. If neither equilibrium nor consensus is reached by round 7, the Grilling ends - but the unresolved objections don’t disappear. They get logged into the verdict as accepted risks. The Don is explicitly carrying them forward.

This matters: it means a forced termination doesn’t pretend the idea is clean. It just makes the dirt explicit. Six months later, when something breaks, you can look at the verdict and see exactly which risk was knowingly accepted.
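The round rules and the three termination conditions fit in one small function. This is a sketch under stated assumptions rather than the framework’s code: terminationReason, its parameter names, and the reason strings are hypothetical, and it reads “the Don can extend” as raising the limit from the default 5 up to the ceiling of 7.

```javascript
// Hypothetical sketch of the termination check, run after each completed round.
const MIN_ROUNDS = 2;
const DEFAULT_MAX_ROUNDS = 5;
const HARD_CEILING = 7;

// Returns a termination reason, or null when another round is required.
function terminationReason({
  round,
  newValidObjections,          // count raised by the Devil's Advocate this round
  donHasConcerns,              // does the Don still have open concerns?
  donCallsConsensus = false,   // explicit early exit, Don's call only
  maxRounds = DEFAULT_MAX_ROUNDS,
}) {
  // The Don can extend past the default, but never past the hard ceiling.
  const limit = Math.min(maxRounds, HARD_CEILING);

  // A single pass is never enough, whatever the agents say.
  if (round < MIN_ROUNDS) return null;

  // Both pressures exhausted at the same time.
  if (newValidObjections === 0 && !donHasConcerns) return "nash-equilibrium";

  // Fast path: only the Don can declare the idea adequately tested.
  if (donCallsConsensus) return "explicit-consensus";

  // Safety valve: open objections get logged as accepted risks by the caller.
  if (round >= limit) return "round-limit";

  return null;
}

console.log(terminationReason({ round: 1, newValidObjections: 0, donHasConcerns: false })); // null
console.log(terminationReason({ round: 2, newValidObjections: 0, donHasConcerns: false })); // "nash-equilibrium"
console.log(terminationReason({ round: 5, newValidObjections: 1, donHasConcerns: true }));  // "round-limit"
```

Note what is absent: there is no branch for “the agents agree.” Agreement between agents never appears as an input.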

The output of Grilling isn’t a spec. It’s a verdict - Key Decisions (and why), Rejected Alternatives (and why they were rejected), Unresolved Objections (and what risks the Don is carrying forward), and the Termination Reason.

The verdict is held in-context, not written to a file - it flows directly into the next phase, the Sit-Down, where the actual Contract gets drafted.
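As a rough picture of what that verdict carries, here is a sketch of it as an in-memory object. The four top-level fields mirror the four parts named above; everything else (buildVerdict, the what/why/acceptedRisk keys, the reason strings) is a hypothetical shape, and the example values are an abridged, illustrative reading of the Tetris session, not the session’s actual output.

```javascript
// Hypothetical in-memory shape for a Grilling verdict.
const TERMINATION_REASONS = ["nash-equilibrium", "explicit-consensus", "round-limit"];

function buildVerdict({ keyDecisions, rejectedAlternatives, unresolvedObjections, terminationReason }) {
  if (!TERMINATION_REASONS.includes(terminationReason)) {
    throw new Error(`Unknown termination reason: ${terminationReason}`);
  }
  // Every decision and every rejection must carry its reason.
  for (const item of [...keyDecisions, ...rejectedAlternatives]) {
    if (!item.why) throw new Error(`Missing "why" for: ${item.what}`);
  }
  // Every unresolved objection must name the risk the Don is carrying forward.
  for (const objection of unresolvedObjections) {
    if (!objection.acceptedRisk) throw new Error(`No accepted risk recorded for: ${objection.what}`);
  }
  return Object.freeze({ keyDecisions, rejectedAlternatives, unresolvedObjections, terminationReason });
}

// Abridged, illustrative example based on Round 1 of the Tetris session.
// unresolvedObjections is left empty because the article does not list them.
const verdict = buildVerdict({
  keyDecisions: [
    { what: "nameBuffer is local to main.js", why: "UI state stays out of the DOM-free logic layer" },
    { what: "Detect game-over via prevStatus", why: "state.events[] has no specified drain semantics" },
  ],
  rejectedAlternatives: [
    { what: "state.nameEntry in state.js", why: "Violates Constitution Commandment 1" },
  ],
  unresolvedObjections: [],
  terminationReason: "nash-equilibrium",
});
```

The validation is the point of the sketch: a verdict with a decision but no “why,” or an open objection with no named risk, is rejected before it can reach the Sit-Down.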

Only after Grilling does anything get written down as a Contract.
Only after the Contract gets signed does code get planned.
Only after the plan does code get written.

Five gates before a single line of implementation. That sounds heavy, and on small tasks it is - which is why Grilling has explicit “skip” conditions, which I’ll get to below.

But for anything that’s actually load-bearing, the cost of skipping any of those gates is higher than the cost of running them. Always. The point of the pipeline is that the friction is real friction, not theatrical friction. It catches things.

Yes, this adds tokens. Recon plus Grilling cost real money on every feature, and on a moderate-sized change the overhead is non-trivial - I’ll publish hard numbers from instrumented runs separately. The bet is that the cost of arguing about a bad idea is always smaller than the cost of building one. So far that bet has held.

When NOT to use Grilling

I’m not going to pretend this is universal. It isn’t. Grilling is a serious tool with serious overhead, and it has clear failure modes when applied wrong.

The first failure mode is using Grilling on changes that don’t deserve it. If the task is fixing a typo, bumping a dependency version, or renaming a variable - Recon plus 2 rounds of Grilling is absurd. You’ll spend more tokens debating the change than implementing it, and the agents will start manufacturing fake objections to fill the rounds because there genuinely aren’t real ones to raise.

The Devil’s Advocate will say something like “have we considered backwards compatibility for users who depend on this exact variable name?” and you’ll know the system has descended into theater.

The second failure mode is using Grilling on pure refactors with a verified baseline. If the existing code already works, the tests already pass, and the goal is to clean up structure without changing behavior - the original decision was already grilled (or should have been) when the original code was written.

Re-grilling at refactor time is litigating a settled question. The right thing in that case is a different gate: a behavior-preservation check, not a should-this-exist check.

The third failure mode is using Grilling during exploratory prototyping, where the entire point is to fail fast and learn. If you’re spiking out three different approaches to see which one is even tractable, you don’t want each spike to get a full adversarial review - you want to throw cheap code at the problem and see what survives contact with reality. Grilling here actively kills the exploration.

The fourth failure mode is using Grilling under genuine time pressure when the cost of being wrong is small. Production is on fire, the fix is small, you’re confident in the diagnosis, and the cost of an extra hour of debate is real customer pain. Skip it.
Document what you did. If the fix turns out to be wrong, that’s what the Ledger is for - you log the failure and feed it into Reconnaissance for next time.

So when should you grill?

Use Grilling for new features that touch architectural decisions - anything where the structural shape of the change matters, not just its correctness.
Use Grilling for changes that introduce a new dependency, a new external integration, a new data model - these are the changes where the cost of getting it wrong propagates for years.
Use Grilling for security-relevant changes, where the failure mode is “we shipped a vulnerability” - the Devil’s Advocate role is genuinely valuable here, because security failures are exactly the failures that careful, well-meaning people miss.
Use Grilling any time the cost of building the wrong thing is meaningfully larger than the cost of arguing about it for an hour.

The decision rule is brutal but simple: how much will it cost to undo this if you’re wrong? If the answer is more than the Grilling itself, grill it. If the answer is less, don’t.
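Written down, the rule plus the skip conditions looks something like the sketch below. The function name, the kind labels, and the idea of expressing both costs in one shared unit (hours, say) are assumptions made for illustration; the article only gives the comparison itself and does not say what to do on an exact tie, so this sketch skips in that case.

```javascript
// Hypothetical sketch of the grill-or-skip decision.
// Both costs must be in the same unit (for example, engineer-hours).
const SKIP_KINDS = new Map([
  ["typo", "too small to deserve it"],
  ["dependency-bump", "too small to deserve it"],
  ["rename", "too small to deserve it"],
  ["pure-refactor", "use a behavior-preservation check instead"],
  ["spike", "exploration should fail fast"],
]);

function grillingDecision({ kind, undoCost, grillingCost }) {
  // Explicit skip conditions come first.
  if (SKIP_KINDS.has(kind)) {
    return { grill: false, reason: SKIP_KINDS.get(kind) };
  }
  // The rule itself: grill only when undoing costs more than arguing.
  if (undoCost > grillingCost) {
    return { grill: true, reason: "undoing this costs more than grilling it" };
  }
  return { grill: false, reason: "cheaper to undo than to grill" };
}

console.log(grillingDecision({ kind: "new-data-model", undoCost: 400, grillingCost: 1 }).grill); // true
console.log(grillingDecision({ kind: "hotfix", undoCost: 0.5, grillingCost: 1 }).grill);         // false
console.log(grillingDecision({ kind: "rename", undoCost: 400, grillingCost: 1 }).grill);         // false
```

The code is the easy half. The undoCost you feed it is an estimate, and the function is only as honest as that number.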

The hard part isn’t applying the rule.
The hard part is being honest about which side of the rule a given task falls on.

Most engineers underestimate the cost of being wrong, because the cost is mostly invisible - it shows up later, in the form of technical debt, integration headaches, security audits that find old shortcuts, and refactors that take months to unwind.

Grilling is the moment you pay that cost up front, in tokens and minutes, instead of paying it later in engineer-years.

The uncomfortable implication

If your framework doesn’t have a Grilling phase, your framework is a productivity tool for shipping bad ideas faster.

That’s a real product. There’s a market for it. Plenty of people want their bad idea shipped quickly and don’t want to be told it’s bad. Fine. Ship it. Sell it.

To be fair, most existing frameworks aren’t claiming to do this - they’re claiming to enforce rigor in implementation, and they do that genuinely well.

Spec-Kit, MUSUBI, Tessl, the rest - within their scope, they’re honest about what they offer. The problem is the gap between what they offer and what users think they’re getting.

If you read the marketing, “spec-driven” sounds like the spec is the source of truth. It isn’t. The spec is just the input that the rigor machinery operates on. The spec itself was never on trial.

The next generation of AI frameworks won’t be the ones with more agents, longer context, or fancier orchestration. It’ll be the ones brave enough to tell the user no before writing a single line of spec.

That’s the bar. Almost everyone is below it. The hole is right there in the middle of every framework, and we’re all stepping around it pretending it isn’t there.

Stop pretending.

Recon the ground. Grill the idea. Kill the bad ones. Build the survivors.

That’s the whole job.

Wrapping the series

Part 1 mapped the landscape and named the gap. Part 2 showed what filling the gap actually looks like - the agents, the rounds, the termination conditions, the failure modes.

The next pieces in this series will go deeper into the rest of the Heist Pipeline: the Sit-Down (where the Contract gets signed), Resource Development (where the plan gets built), the Hit (where code finally gets written), and Laundering (where everything gets verified and logged into the Ledger).

Each of these phases has the same general design philosophy - explicit gates, named artifacts, no phase skippable - but they solve different problems.

If you build agentic systems and you’ve felt the productivity-tool-shipping-bad-ideas-faster problem, Gangsta Agents is open source. It’s a young project (first stable release in April 2026, v1.1.1). Issues, PRs, and adversarial critique of the framework itself are all welcome. Especially the last one - it would be embarrassing to ship a framework about Grilling without grilling the framework.

← Part 1: Sixteen Frameworks. One Blind Spot.

Gangsta Agents is an open-source agentic framework built around a 6-phase Heist Pipeline: Reconnaissance → Grilling → Sit-Down → Resource Development → The Hit → Laundering. Every phase has a gate. No phase is skipped.

github.com/kucherenko/gangsta

gangsta.page
