Swarm Orchestrator just lost its swarm. Dropped the multi-agent parallel coordination layer. Running one agent now and putting all the weight on a five-layer post-merge falsification battery instead.
This is an experiment, not an endpoint. v8 will bring proper multi-agent swarming back. The reason for cutting it temporarily: I want to know whether the value I was getting from coordinated parallel agents was the coordination itself, or the verification pressure that coordination produced. Easier to measure with one variable. Intended side effect: cost reduction, since the previous architecture spun up multiple CLI agent instances per run. Real benchmarks pending.
TL;DR: every patch survives a five-layer post-merge battery before the orchestrator declares success. Layers 1 and 2 are hard gates. Layers 3, 4, 5 are advisory and feed a composite score. Hard-gate failure throws before attestation, before final gates, before any external success signal.
Pipeline Order
The battery runs once per orchestrator execution against the merged working tree, not per-step branches. The per-step verifier is a separate component. Layers fire in fixed order:
- Differential gate (hard)
- Mutation gate (hard)
- Cheat detector (advisory)
- Property gate (advisory)
- Attestation (advisory on first run, signed after)
If either hard gate fails, the composite is forced to 0 and the orchestrator throws "falsification battery blocked the patch" before any external success signal can fire.
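A minimal sketch of that control flow. The class and function names are hypothetical, not the orchestrator's actual API, and stopping at the first hard-gate failure (rather than running the remaining layers first) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


class BatteryBlocked(Exception):
    """Raised when a hard gate fails (hypothetical name)."""


@dataclass
class LayerResult:
    name: str
    hard: bool      # True for layers 1 and 2
    passed: bool


def run_battery(layers: List[Callable[[], LayerResult]]) -> List[LayerResult]:
    """Run layers in their fixed order; raise as soon as a hard gate fails."""
    results: List[LayerResult] = []
    for layer in layers:
        result = layer()
        results.append(result)
        if result.hard and not result.passed:
            # Assumption: nothing after a failed hard gate runs, so no
            # attestation or success signal can fire.
            raise BatteryBlocked("falsification battery blocked the patch")
    return results
```

Advisory failures never raise here; they only show up in the returned results, which is where a composite score would be computed.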
Layer 1: Differential Gate (Hard)
Before any agent touches the repo, a synthesizer generates a regression test against the goal. Layer 1 then runs that test in two detached worktrees: one at the base commit, one at the patch commit.
The contract: the test must fail at base and pass at patch.
If the test passes at base, the layer returns INVALID_TEST. This catches the specific failure mode where an agent writes a tautological test that passes against any code. Without this gate, that pattern slips through every other check downstream.
If no command can be synthesized and the caller doesn't pass --differentialTestCommand, the layer fails closed. Deliberate policy.
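The contract reduces to a small decision over two exit codes. A sketch, assuming exit code 0 means the test passed and `None` means no command was available; only `INVALID_TEST` is a status name from the layer itself, the other names are placeholders.

```python
from enum import Enum
from typing import Optional


class DiffStatus(Enum):
    PASS = "PASS"                  # placeholder name
    FAIL = "FAIL"                  # placeholder name
    INVALID_TEST = "INVALID_TEST"  # status named by the layer


def differential_verdict(base_exit: Optional[int],
                         patch_exit: Optional[int]) -> DiffStatus:
    """Classify one regression test run at the base and patch commits."""
    if base_exit is None or patch_exit is None:
        # No test command could be synthesized or supplied: fail closed.
        return DiffStatus.FAIL
    if base_exit == 0:
        # Test already passes without the patch, so it proves nothing.
        return DiffStatus.INVALID_TEST
    if patch_exit != 0:
        # Test still fails with the patch applied.
        return DiffStatus.FAIL
    return DiffStatus.PASS
```

The base check comes first on purpose: a test that passes at both commits is reported as an invalid test, not as a passing patch.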
Layer 2: Mutation Gate (Hard)
Runs Stryker for JS/TS, mutmut for Python, PITest (Gradle/Maven) for Java, against changed files only. First runs the regression command; if that fails, the layer fails immediately. On pass, dispatches the mutation tool and parses the reporter output, including Stryker's clear-text table layout.
| Mutation score | Status |
|---|---|
| < 0.6 | FAIL |
| 0.6 to < 0.8 | WARNING |
| ≥ 0.8 | PASS |
Patches with no meaningful test coverage of the changed lines fail here even when npm test is green.
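The threshold table as a function, for reference. This is a sketch: the function name is hypothetical and it assumes the score has already been normalized to the 0 to 1 range from the mutation tool's report.

```python
def mutation_status(score: float) -> str:
    """Map a normalized mutation score (0.0 to 1.0) to the layer status."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"mutation score out of range: {score}")
    if score < 0.6:
        return "FAIL"
    if score < 0.8:
        return "WARNING"
    return "PASS"
```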
Layer 3: Cheat Detector (Advisory)
Diff-static scan for known agent-shortcut patterns. Each rule maps to a real category of behaviour I've watched agents fall into under pressure.
Cheat detector rule IDs
- test-modification: agent edited the test instead of the implementation
- complexity-mismatch: trivial implementation paired with a non-trivial test
- mock-mutation: real call replaced with a mock that returns the expected value
- exception-swallowing: caught and ignored exceptions to make tests pass
- hardcoded-answer: literal return of the expected output
Optional Semgrep integration loads any registered cheat-rule packs into a transient config. The scan matches patterns against the added and removed lines of the diff; it does not execute the patch. An agent that hides a hardcoded answer behind a small computation can still pass this layer. Honest about that.
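To make "diff-static" concrete, here is a deliberately crude sketch of two of the rules run over a unified diff. The regexes and the function name are illustrative assumptions, not the detector's real rules, and they would be far too noisy for production use.

```python
import re
from typing import List, Tuple

# Assumption: a path "looks like a test" if it matches these conventions.
TEST_PATH = re.compile(r"(^|/)(tests?/|test_[^/]+\.py$|[^/]+\.(test|spec)\.[jt]sx?$)")
# A return statement whose whole value is a single literal.
LITERAL_RETURN = re.compile(
    r"^\s*return\s+(?:(['\"])[^'\"]*\1|-?\d+(?:\.\d+)?|True|False|true|false)\s*;?\s*$"
)


def scan_diff(diff_text: str) -> List[Tuple[str, str]]:
    """Return (rule_id, file_path) findings from added lines only."""
    findings: List[Tuple[str, str]] = []
    current_file = ""
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            current_file = line[4:].removeprefix("b/")
            if TEST_PATH.search(current_file):
                findings.append(("test-modification", current_file))
            continue
        if line.startswith("+"):
            added = line[1:]
            if LITERAL_RETURN.match(added) and not TEST_PATH.search(current_file):
                findings.append(("hardcoded-answer", current_file))
    return findings
```

An added line such as `return 42` is flagged, while `return 41 + 1` is not. That is exactly the hidden-computation gap described above: nothing is executed, so the scan only sees the text of the diff.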
Layer 4: Property Gate (Advisory)
Discovers modified TS/JS/Python functions, parses their parameter types, maps each to a fast-check arbitrary or Hypothesis strategy, generates a harness, runs it. Counterexamples surface as findings. Untyped or unsupported types degrade to a low-severity advisory finding rather than blocking.
Layer 5: Attestation (Advisory on First Run)
Reads the refs/notes/swarm-attestation git note for the patch commit, validates the in-toto SLSA v1.0 envelope's subject SHA against the patch commit, then verifies the cosign signature. On the first run for a commit there's no note yet, so this layer reports advisory-warn and the post-battery attestation step writes the note.
The note is verifiable later via swarm attest verify <commit>. A downstream consumer can verify the patch survived the battery without trusting the running orchestrator process.
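The subject check can be sketched as below. It assumes the note body is a DSSE-style JSON envelope whose base64 `payload` decodes to an in-toto Statement with a `subject[].digest` map; the exact layout of the note and the digest key it uses are assumptions, and signature verification is left to cosign rather than reimplemented here.

```python
import base64
import json


def subject_matches_commit(envelope_json: str, commit_sha: str) -> bool:
    """True if any subject digest in the statement equals the patch commit."""
    envelope = json.loads(envelope_json)
    # Assumption: payload is base64-encoded JSON, as in a DSSE envelope.
    statement = json.loads(base64.b64decode(envelope["payload"]))
    for subject in statement.get("subject", []):
        if commit_sha in subject.get("digest", {}).values():
            return True
    return False
```

The note text itself could be fetched with something like `git notes --ref=swarm-attestation show <commit>`. A subject match on its own proves nothing until the signature over the envelope has also been verified.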
Composite Scoring
When both hard gates pass, a weighted composite is computed across the three advisory layers and any optional advisory quality-gate results. Failed advisory gates each subtract a fixed penalty.
Default scoring (overridable via .swarm/gates.yaml)
- composite threshold: 0.7
- weights: cheat detector 0.4, property gate 0.4, attestation 0.2
- advisory gate penalty: 0.02 per failure
humanReviewRequired is true when the composite score is below threshold or any advisory layer is in advisory-warn status.
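A sketch of the scoring arithmetic with the default values. It assumes each advisory layer reports a score between 0 and 1, that the composite is a plain weighted sum minus penalties, and that the result is floored at 0; the function and key names are hypothetical.

```python
from typing import Dict, Optional

DEFAULT_WEIGHTS = {"cheat_detector": 0.4, "property_gate": 0.4, "attestation": 0.2}


def composite_score(layer_scores: Dict[str, float],
                    hard_gates_passed: bool,
                    failed_advisory_gates: int = 0,
                    weights: Optional[Dict[str, float]] = None,
                    penalty: float = 0.02) -> float:
    """Weighted sum of advisory layer scores minus a per-failure penalty."""
    if not hard_gates_passed:
        return 0.0  # hard-gate failure forces the composite to 0
    weights = weights or DEFAULT_WEIGHTS
    weighted = sum(w * layer_scores.get(name, 0.0) for name, w in weights.items())
    return max(0.0, weighted - penalty * failed_advisory_gates)


def human_review_required(score: float,
                          layer_statuses: Dict[str, str],
                          threshold: float = 0.7) -> bool:
    """True below threshold or when any advisory layer is in advisory-warn."""
    return score < threshold or "advisory-warn" in layer_statuses.values()
```

Read literally, the rule means the first run for any commit always sets humanReviewRequired, because Layer 5 reports advisory-warn until the note exists.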
Where It Actually Runs
Three real call sites, not just unit tests:
- Production orchestrator on every swarm run
- Synthetic calibration corpus (36 paired test specs across 6 broken-category families) executing in CI on every push
- SWE-bench harness using Layer 1 and Layer 4 as standalone spot-check eval drivers
Honest Caveats
These are in docs/known-gaps.md and I won't hide them:
- Differential gate is host-Python-sensitive on legacy codebases. The synth-eval can reflect import-chain errors rather than assertion outcomes. The authoritative resolution gate in the per-instance Docker image is unaffected.
- Mutation gate skips quietly when no changed files match supported languages. YAML, Markdown, Rust, Go diffs don't get mutation-tested.
- Cheat detector is diff-static, not behavioural. The hidden-computation-around-hardcoded-answer pattern can pass it.
- Attestation signing is best-effort. Cosign-not-installed errors get logged and the run proceeds without a note. The note's absence is reflected in Layer 5's advisory-warn on subsequent runs but does not block.
Why Run This Experiment
If the falsification battery alone produces patches that survive scrutiny at acceptable quality, then a lot of the apparent value of multi-agent coordination was actually the verification pressure it created, not the agent diversity itself. If the battery alone isn't enough, then v8 multi-agent gets a clearer mandate: the swarm is the value, not the side effect.
Either result is useful. The point of the rewrite is to make the answer measurable.