Two AI reviews agreeing is not two reviews: how I learned to test claims before adopting them

One night, two audits, one identical score

The evening of 17 May, I finish version 0.4.1 of the Counterpart Toolkit and decide to submit it to two external reviews. I paste the manifesto and the fourteen rules into a ChatGPT-4o session, then paste exactly the same content into a Claude.ai web session. I wait. A few minutes later, both verdicts land. Score 8/10 on one side. Score 8/10 on the other. Near-identical criticisms about the theoretical apparatus — Bourdieu invoked without operational traction — identical simplification suggestions, same angle on the freshness of the M1-M5 instrumentation. My initial reflex holds for thirty seconds. Two independent reviewers, same score, same criticisms — the doctrine is calibrated right, I can publish.

Then I stop. Because something in that convergence rings like a barometer bought in duplicate from the same supplier.

Why two converging AIs are not two measurements

I understand fairly quickly what the convergence is actually measuring. Two language models trained on corpora that overlap to a very large degree — technical articles, public GitHub repos, Stack Overflow discussions, a decade of blogs — produce correlated errors. What they have in common is their shared learning intersection, not the external reality I am submitting to them. When both find the theoretical apparatus disproportionate, I am not learning that it is true. I am learning that the shared statistics of their two corpora recognise this as a typical flaw in a text of this format.

Granted, two human reviewers who converge are, on their own, two separate measurements. But the parallel is misleading. Two human reviewers have disjoint biographies, different readings, sometimes opposing schools of thought. Two LLMs share a substrate that does not carry that texture. Cross-corroboration, in classical epistemology, presupposes independence of sources. With two statistical models trained on the same corpora, independence is not given — it has to be demonstrated, and it almost never is.

This intuition probably spared me a few days of pointless rewriting. But it remained speculative. I wanted material probes.

Three probes in three days

First probe, that same evening. A side project I had been running in parallel — a game dev repo — had led me to study WebFetch the previous week. A Claude assistant had told me that the tool returned "the complete OCR text of a 25 MB PDF" and I had built an ingestion pipeline on that claim. I run the command in the current session, just to verify. Raw output printed to the console: maxContentLength size of 10485760 exceeded. The claim was technically impossible. The pipeline rested on a non-existent mechanism. I had not tested it because the assistant's phrasing had been confident, structured, and plausibly true.

Second probe, the next day. Same game dev substrate. A conversational audit had pointed me toward a technical blog described as "containing the 18 domain regency actions" — exactly the material I was after. Before scraping, I run WebFetch on the blog's index. Raw return: eight development-chronicle articles, zero articles on the announced subject. The claim was a coherent hallucination. The pattern "so-and-so's blog contains X" is a statistical combination the model produces readily because it is syntactically plausible, without any internal mechanism verifying its factual existence.

Third probe, that morning, on the doctrine itself. An external Claude shared via claude.ai had audited my doctrine-counterpart repo the day before and asserted, screenshots included, that "6 out of 6 SKILL.md files have a broken YAML frontmatter" — diagnosis delivered with high confidence, used to justify a 7.5/10 downgrade. That morning I run yaml.safe_load on all twelve SKILL.md files in the repo. Raw result: 11/12 OK, 1/12 broken. The systemic flaw announced did not exist. The one genuinely broken file, the visual evaluator had not isolated — because it had concluded "systemic flaw" without naming a single individual case. What it had seen was GitHub's Markdown rendering eating the frontmatter and displaying it as a table, a host-rendering quirk that has nothing to do with the state of the raw source.

Three external claims tested, three falsifications. Not a partial overlap, not a nuanced rebuttal — three claims out of three knocked down by their probes. The ratio would have been invisible without material testing.

The canonical rule — Am.R12 of the Counterpart Toolkit v0.7

I amended R12 on 20 May, two days after these three incidents. The official text:

"Any claim formulated by an external AI (other Claude, ChatGPT, conversational sparring) about (a) the behavior of a concrete tool you can probe, (b) the content of an external resource, (c) the structure of a system whose ground truth you can sample — must be tested materially before being taken as input for an architectural decision. Test cost ≈ 1 shell command. Cost of believing without testing = pipeline entirely based on a non-existent mechanism. Two external AI reviews converging on the same diagnostic = one source for R5 purposes, not two — cross-substrate independent corroboration requires one human or one mechanically distinct probe (logs, metrics, sample run)."

Who would contest that two reviewers are better than one? Nobody — and yet the formula deserves to be taken apart. With two LLMs, convergence is not corroboration. It is orphaned convergence — a statistical intersection without verifiable exteriority. Three material counter-arguments, in answer to the objection I would have heard from a meta-AI tech lead six months ago.

First, shared training statistics. Two models whose corpora overlap on the essentials produce correlated errors on the same syntactic ranges. Their agreement measures their learning intersection. It seems that convergence, in this case, is more likely on typical statements — those that pre-training recognises as well-formed — than on true statements. These are not the same thing.

Second, correlated hallucination on tools and resources. On claims about the behaviour of a specific tool or the factual content of an external resource, two models tend to hallucinate the most statistically plausible result, which is often wrong and almost always stated with confidence. My three probes are the raw illustration of this.

Third, the asymmetric cost. A material probe costs one shell command, fifteen seconds, sometimes less. An architectural decision built on untested convergence can cost several dev-days of redo and a pipeline to rebuild from scratch. The amended R12 arbitrates this asymmetric cost by making the probe mandatory before any commit that takes the claim as input.

Coda

The reflex to cultivate is not distrust. It is the probe. And the probe starts with oneself — applying R12 first to my own assertions before imposing it on others, verifying my own repo with yaml.safe_load before trusting a visual audit that flatters or condemns. An agent that does not disagree materially is not a counterpart — it is a typist that speaks. Two agents that agree without either being tested are not two reviewers — it is the same typist in duplicate. The rule fits in one command.

# Before adopting an external claim about a tool / resource / structure:
$ <the material command that could have falsified it>

The repo: github.com/michelfaure/doctrine-counterpart. Am.R12 lives in plain sight in CLAUDE.md, with its three founding incidents documented in v0.7-candidates.md. If a single one of your next architectural decisions avoids the untested belief, the rule has already paid for itself.

Counterpart Toolkit v0.7. R12 amended 20 May 2026 on N=3 multi-substrate incidents. Three external claims, three falsifications, one shell command each. License CC-BY-4.0.

推荐订阅源