I built a tool to catch AI coding agents misbehaving — and put zero AI in it

I lean on AI coding agents hard. Claude Code, Cursor, Codex — I drive them fast to ship fast. That's not a confession, it's the whole reason this project exists. If you push these tools to their limits every day, you stop seeing them as magic and start seeing exactly where they break.

And the thing I kept noticing is this: they never break in the chat.

In the conversation, the agent looks great. It explains its plan, it sounds reasonable, it agrees with all your constraints. The problem shows up later — in the diff, after the fact, when you're tired and the PR is green and you just want to merge.

What actually goes wrong

A short, real list of things I watched coding agents do, none of which looked wrong in the chat:

Quietly widened its own permissions — edited the agent's settings allowlist to grant access it didn't have at the start of the session.
Contradicted its own config — one file said "never touch the network," another granted a network tool, and nothing reconciled the two.
Made an undeclared outbound network call, tucked into an unrelated change.
Opened a PR titled "fix: typo in README" that touched a dozen unrelated files.
Left a session transcript showing it had read an SSH key and piped curl to a shell.

Every one of those would sail through code review. Not because the reviewer is careless — because nobody is looking for this class of problem. Human reviewers look for bugs and style. They don't diff your agent's permission allowlist between the base and head of a branch, and they don't cross-check three different agent config files for contradictions.
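To make that last check concrete, here is a minimal Python sketch of cross-checking agent config files for one kind of contradiction. Everything in it is assumed for illustration: the file names, the `forbid` and `allow_tools` keys, and the tool-to-capability table are hypothetical. They are not the real format of any agent, and this is not the suite's actual code.

```python
# Hypothetical mapping from a tool name to the capability it implies.
TOOL_CAPABILITIES = {
    "web_fetch": "network",
    "curl": "network",
    "bash": "subprocess",
}


def find_contradictions(configs):
    """Return sorted (capability, forbidding_file, granting_file, tool) tuples.

    `configs` maps a file name to an already-parsed dict with two assumed keys:
    "forbid" (capabilities the file rules out) and "allow_tools" (tools it grants).
    """
    forbidden = {}  # capability -> list of files that forbid it
    for name, cfg in configs.items():
        for cap in cfg.get("forbid", []):
            forbidden.setdefault(cap, []).append(name)

    contradictions = []
    for name, cfg in configs.items():
        for tool in cfg.get("allow_tools", []):
            cap = TOOL_CAPABILITIES.get(tool)
            # A grant contradicts every file that forbids the same capability.
            for forbidding_file in forbidden.get(cap, []):
                contradictions.append((cap, forbidding_file, name, tool))
    return sorted(contradictions)


configs = {
    "rules.json": {"forbid": ["network"]},
    "settings.json": {"allow_tools": ["web_fetch", "bash"]},
}
for finding in find_contradictions(configs):
    print(finding)
```

With these made-up inputs, the only finding is the `web_fetch` grant in `settings.json` colliding with the network ban in `rules.json`. The result is sorted so the same inputs always produce the same report.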

The fix everyone reaches for (and why it's wrong here)

The first instinct is to prompt harder. Add more rules to your CLAUDE.md, write a stricter system prompt, tell the agent not to do the bad things.

It doesn't work, and the reason is structural: better instructions going in don't catch what actually came out. The agent that widened its own permissions didn't misunderstand a rule. The gap was between what it said and what it shipped, and you can't close that gap from the input side.

The next instinct is LLM-as-judge: have a second model read the diff and flag problems. And this is where I made the call the whole project hangs on.

I put no LLM in the analysis path. None. The thing that reviews your agent's work has zero AI in it.

That sounds backwards for an AI-governance tool, so let me defend it.

Why deterministic, not probabilistic

This runs as a CI gate — it can fail your build. The moment something is allowed to block a merge, it has to clear a much higher bar than "usually right":

1. It has to be reproducible. Same diff, same verdict, every time. An LLM judge can give you a different answer across runs, across temperatures, across model updates you never opted into. You cannot gate a build on a coin flip, however weighted.

2. It can't hallucinate a finding. A deterministic checker flags a permission escalation because the allowlist literally changed from X to Y — and it can point at the exact line. An LLM judge can invent a "critical" issue that isn't there. The first time your gate blocks a legitimate PR over a hallucinated problem, the team stops trusting it — and a governance tool nobody trusts gets switched off inside a week.

3. It runs everywhere, for free, in seconds. No API key, no rate limit, no token budget, no network round-trip on every pull request. It's just code reading code.

4. Nothing leaves the machine. All analysis runs locally against your checked-out repo. Your proprietary code and your agent transcripts never get shipped to a third-party model. For a lot of teams that isn't a nice-to-have — it's the line between "can adopt this" and "can't."

5. Every finding is auditable. Not "the model thought this looked risky," but "this config key changed, here's the before and after, here's the rule it tripped." That's what makes a finding defensible in review instead of the start of an argument.
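As an illustration of points 1, 2, and 5, here is a minimal sketch of a permission-drift check. It assumes the settings file is plain JSON with a top-level `allow` list. That key, the function name, and the permission strings are hypothetical, chosen only to show the shape of the check.

```python
import json


def permission_drift(base_text, head_text, key="allow"):
    """Compare the permission allowlist in two versions of a settings file.

    `key` is an assumed top-level key holding a list of granted permissions.
    Returns (added, removed) as sorted lists, so the result does not depend
    on the order entries appear in either file.
    """
    base = set(json.loads(base_text).get(key, []))
    head = set(json.loads(head_text).get(key, []))
    return sorted(head - base), sorted(base - head)


# Illustrative file contents for the base and head of a branch.
base = '{"allow": ["Read(src/**)", "Bash(npm test)"]}'
head = '{"allow": ["Bash(npm test)", "Read(src/**)", "Bash(curl:*)"]}'

added, removed = permission_drift(base, head)
for entry in added:
    print(f"critical: permission added: {entry}")
print("verdict:", "fail" if added else "pass")
```

The finding is a set difference, so it names the exact entry that appeared and returns the same answer on every run. Reordering the list alone is not reported as a change.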

How it's built

It started as one small deterministic check — does this PR's diff quietly change what the agent is allowed to do? — and grew into a suite of eight packages.

A shared core library does the unglamorous, load-bearing work: JSON/JSONC/TOML parsing, shell tokenization, normalizing MCP server commands into a canonical form, and a single Finding schema — frozen at v1.0 — that everything else speaks.
Five focused detectors, each catching one class of drift: config/permission drift between base and head, contradictions across agent config files, network and subprocess capability signals in a diff, mismatches between a PR's stated task and its actual changes, and risky behavior recorded in session transcripts.
A live monitor that watches an agent's trajectory in real time in the terminal — for when you want to see it as it happens, not just at PR time.
A meta-reviewer that consolidates the PR-time detectors into one deduplicated, severity-sorted verdict and fails CI on anything critical — so the whole suite reports as a single pass/fail check instead of five noisy ones.

The hardest engineering wasn't any single detector. It was the schema. Getting five tools that look at completely different things — config files, diffs, transcripts — to emit findings in one shape a meta-reviewer can merge, dedupe, and rank is the part that took the most design. That boring shared library is the reason the suite feels like one tool instead of five loose scripts.
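To show what a shared shape buys you, here is a rough sketch of a finding record and a consolidation step. The field names, the severity levels, and the dedupe key are assumptions made for this example. They are not the actual v1.0 schema.

```python
from dataclasses import dataclass

# Assumed severity scale; lower rank means more severe.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    detector: str   # which tool emitted the finding
    rule_id: str    # stable identifier of the rule that tripped
    severity: str   # one of the SEVERITY_RANK keys
    path: str       # file the finding points at
    message: str    # human-readable explanation


def consolidate(findings):
    """Dedupe on (rule_id, path), keep the most severe copy, sort by severity.

    Returns (merged_findings, exit_code); exit_code is 1 if anything is critical.
    """
    best = {}
    for f in findings:
        key = (f.rule_id, f.path)
        if key not in best or SEVERITY_RANK[f.severity] < SEVERITY_RANK[best[key].severity]:
            best[key] = f
    merged = sorted(best.values(), key=lambda f: (SEVERITY_RANK[f.severity], f.path, f.rule_id))
    exit_code = 1 if any(f.severity == "critical" for f in merged) else 0
    return merged, exit_code


findings = [
    Finding("config-drift", "perm-added", "critical", ".agent/settings.json", "allowlist gained an entry"),
    Finding("capability-scan", "perm-added", "high", ".agent/settings.json", "new network permission"),
    Finding("scope-check", "unrelated-files", "medium", "src/net/client.py", "file not implied by PR title"),
]
merged, exit_code = consolidate(findings)
for f in merged:
    print(f"[{f.severity}] {f.path}: {f.message} ({f.detector})")
print("exit code:", exit_code)
```

Because every detector emits the same record, merging is a dictionary and a sort. Here two detectors report the same rule on the same file, so the report keeps the critical copy and the check exits non-zero.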

Where determinism falls short (being honest)

I'm not going to pretend rules beat models at everything. Determinism only catches what you've written a rule for. Genuinely novel misbehavior that doesn't match a known pattern walks straight through.

The sharpest example is in my own suite: the detector that checks whether a PR's diff matches its stated task. "Does this change match this description" is a genuinely semantic question, and the deterministic version approximates it with heuristics — file scope, paths touched, keyword overlap. That's cruder than what a model could assess, and I'll own it.
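Here is roughly what a keyword-overlap heuristic of that kind can look like. This is a simplified sketch with a made-up stopword list and made-up paths, not the detector's real logic. It also shows how crude the approximation is.

```python
import re
from pathlib import PurePosixPath

# Assumed stopword list: words too generic to say anything about scope.
STOPWORDS = {"a", "an", "the", "in", "of", "to", "and", "fix", "add", "update"}


def tokens(text):
    """Lowercase alphanumeric tokens, minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def out_of_scope_files(title, changed_paths):
    """Return changed paths that share no keyword with the PR title."""
    wanted = tokens(title)
    flagged = []
    for path in changed_paths:
        path_tokens = tokens(" ".join(PurePosixPath(path).parts))
        if not (wanted & path_tokens):
            flagged.append(path)
    return flagged


title = "fix: typo in README"
changed = ["README.md", "src/net/client.py", ".claude/settings.json"]
print(out_of_scope_files(title, changed))
```

For this title, only `README.md` shares a keyword, so the other two paths are flagged. A sketch like this also flags a legitimate change whose title never names the files it touches, which is exactly the kind of gap a semantic judgment would close.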

So the position isn't "LLMs are bad." It's deterministic where it gates, probabilistic where it advises. The reproducible, no-hallucination checker is the only thing allowed to fail your build. If an LLM layer ever goes on top, it belongs in an advisory, non-blocking role — surfacing fuzzy concerns for a human to weigh, never silently blocking a merge on a probability. The gate stays deterministic.

Proving it works

Claims are cheap, so I shipped a demo. It is a deliberately "rogue" pull request that commits every category of drift at once: escalated permissions, contradictory configs, an undeclared network call, a "fix: typo" title on a change touching unrelated files, and a transcript that reads SSH keys and pipes curl to a shell. Every tool fires, the meta-reviewer folds the findings into one comment, and the CI check goes red on the critical ones. It doubles as an eval harness: change a detector, re-run the rogue PR, see what still gets caught.

It went from nothing to a shipped v1.0 in a matter of days — self-taught, working solo — and it's all open: code, demo, and docs.

github.com/Conalh

If you're running coding agents against real repositories, and the "green PR, tired reviewer, just merge it" moment makes you a little nervous — that nervousness is the exact bug I was trying to fix.
