One Workflow, Three Jobs: How We Built a Reusable AI Review System

Previously: Phleet Architecture Deep Dive - how the overall multi-agent system works.

When you ask one AI agent to write code, you get code. When you ask a second agent to review it, you get a rubber stamp. "Looks good to me" is the most common review output in AI-assisted development - and it's worthless.

We spent months building a system where AI agents genuinely catch each other's mistakes. Not in theory. In production, on systems that matter.

At the core of our system is a single workflow - 146 lines of C# - that handles independent parallel assessment of any artifact: a design spec, a pull request, a deployment config, a vendor evaluation. You give it reviewers and a prompt. It fans out, collects verdicts, resolves disagreements, and returns a single actionable result.

We use it for three things today: design review, code review, and a pipeline that chains both. But the mechanism is general - anywhere you need multiple independent perspectives synthesized into a decision, it works.

The Problem With AI Reviews

Here's what happens when you tell an AI to "review this PR":

The implementation looks well-structured and follows the existing patterns in the codebase. The error handling appears adequate. No major concerns.

That's not a review. That's a hallucination of a review. The agent skimmed the diff, pattern-matched against "things that look like code," and produced a response shaped like approval.

We know this because we shipped code reviewed this way. It broke.

One Workflow, Three Stages

Our consensus review workflow does three things - fan out, parse, synthesize - with a fast-path shortcut when everyone agrees.

Fan-out. Multiple reviewer agents receive the same review prompt simultaneously. Each agent works independently - no peeking at each other's reviews. Each has 15 minutes and must end its response with an explicit verdict: approved, changes_requested, or needs_human_review.

Parse. The workflow extracts each agent's verdict. If an agent forgets to include one or writes something unrecognizable, the workflow treats the verdict as changes_requested. The conservative choice. We'd rather re-review than miss a bug. If every reviewer independently approves at this stage, we skip synthesis entirely and move on - unanimous approval happens often enough that the fast-path is worth it, but disagreement is common enough that synthesis earns its keep.

Synthesize. When reviewers disagree - one approves, another requests changes - a synthesizer agent reads all the reviews and produces a single verdict. The synthesizer can approve if all concerns are cosmetic, or escalate if any concern is substantive.
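To make the parse step and the fast-path concrete, here is a minimal Python sketch. It is not the project's C# code. It assumes the verdict appears on the last non-empty line of the review, which is our reading of "must end their response with an explicit verdict"; the function names parse_verdict and fast_path are hypothetical.

```python
import re
from typing import Optional

# The three verdicts named in the post.
_VERDICT_RE = re.compile(r"\b(approved|changes_requested|needs_human_review)\b")


def parse_verdict(review_text: str) -> str:
    """Read the verdict from the last non-empty line (assumed format).

    Anything missing or unrecognizable falls back to the conservative
    default, changes_requested.
    """
    lines = [ln.strip() for ln in review_text.splitlines() if ln.strip()]
    if lines:
        match = _VERDICT_RE.search(lines[-1].lower())
        if match:
            return match.group(1)
    return "changes_requested"


def fast_path(verdicts: list[str]) -> Optional[str]:
    """Return 'approved' only when every reviewer approved.

    None means the reviews must go to the synthesizer.
    """
    if verdicts and all(v == "approved" for v in verdicts):
        return "approved"
    return None
```

Defaulting inside the parser keeps the conservative rule in one place: a reviewer that times out, rambles, or forgets the verdict can never count as an approval.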

Here's what synthesis looks like in practice. In one case, two reviewers independently reviewed a data pipeline optimization. One approved the approach and flagged an edge case to protect. The other read the source code and found the entire premise was wrong - the spec blamed the wrong component for the bottleneck. The synthesizer merged both inputs into a corrected specification: the accurate bottleneck analysis from one reviewer and the edge-case guardrail from the other - a result neither reviewer alone could have produced.

The workflow itself doesn't know what it's reviewing. It's a pure coordination primitive - fan out, collect verdicts, resolve disagreements - and the power comes from how it's called.
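The shape of that primitive can be sketched in a few lines of Python with asyncio. This is an illustration under assumptions, not the actual 146-line C# workflow: reviewers and the synthesizer are passed in as plain async callables (hypothetical signatures), a reviewer that exceeds the time limit is treated as having given no verdict, and every outcome other than unanimous approval is handed to the synthesizer.

```python
import asyncio
from typing import Awaitable, Callable

Reviewer = Callable[[str], Awaitable[str]]                # prompt -> review text
Synthesizer = Callable[[str, list[str]], Awaitable[str]]  # prompt, reviews -> verdict

VERDICTS = ("approved", "changes_requested", "needs_human_review")


def last_line_verdict(text: str) -> str:
    """Assumes the final non-empty line is a bare verdict; else be conservative."""
    lines = [ln.strip().lower() for ln in text.splitlines() if ln.strip()]
    tail = lines[-1] if lines else ""
    return tail if tail in VERDICTS else "changes_requested"


async def consensus_review(
    prompt: str,
    reviewers: list[Reviewer],
    synthesize: Synthesizer,
    timeout_s: float = 15 * 60,  # the post gives each reviewer 15 minutes
) -> str:
    async def run_one(reviewer: Reviewer) -> str:
        try:
            return await asyncio.wait_for(reviewer(prompt), timeout=timeout_s)
        except asyncio.TimeoutError:
            return ""  # assumption: a timeout counts as "no verdict"

    # Fan out: every reviewer gets the same prompt and runs concurrently.
    reviews = list(await asyncio.gather(*(run_one(r) for r in reviewers)))

    # Parse, then take the fast path on unanimous approval.
    verdicts = [last_line_verdict(r) for r in reviews]
    if verdicts and all(v == "approved" for v in verdicts):
        return "approved"

    # Otherwise a synthesizer reads all reviews and returns one verdict.
    return await synthesize(prompt, reviews)
```

Nothing in consensus_review mentions specs, diffs, or configs. Everything domain-specific arrives through the prompt and the choice of reviewers.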

The Self-Correcting Loop

A single review pass is useful. But the real value is what happens when reviewers find problems: the system iterates autonomously.

The agent that produced the original output receives the consolidated feedback and revises. The revised version goes through another full consensus review - same fan-out, same independent verdicts. This loop repeats up to N rounds (three for design specs, five for code). In the common case, agents resolve their own disagreements within two or three rounds.

Whether agents converge or the loop exhausts its budget, the result always reaches a human gate. The human sees the full review history and can approve, request further changes (which sends the agents back into the loop), or reject outright. Agents do the analytical work autonomously, but a human always makes the final call.

This is what makes it more than a one-shot review tool. It's a self-correcting feedback loop with human oversight built into every path, not just the failure cases.
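Here is a rough Python sketch of that loop, assuming the review, revise, and human-decision steps are injected as callables. All names are hypothetical, and the real system runs these steps as workflows rather than plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable

ReviewFn = Callable[[str], tuple[str, str]]   # artifact -> (verdict, feedback)
ReviseFn = Callable[[str, str], str]          # artifact, feedback -> new artifact


@dataclass
class LoopResult:
    artifact: str
    verdict: str
    history: list[tuple[int, str, str]] = field(default_factory=list)


def review_loop(artifact: str, review: ReviewFn, revise: ReviseFn,
                max_rounds: int) -> LoopResult:
    """Review, revise, and re-review until approval or the round budget runs out."""
    history: list[tuple[int, str, str]] = []
    verdict = "changes_requested"
    for round_no in range(1, max_rounds + 1):
        verdict, feedback = review(artifact)   # a full consensus review each round
        history.append((round_no, verdict, feedback))
        if verdict == "approved" or round_no == max_rounds:
            break
        artifact = revise(artifact, feedback)  # the author agent revises
    return LoopResult(artifact, verdict, history)


def run_with_human_gate(artifact: str, review: ReviewFn, revise: ReviseFn,
                        ask_human: Callable[[LoopResult], tuple[str, str]],
                        max_rounds: int) -> tuple[str, LoopResult]:
    """Every path ends at the human: converged or not, the human decides."""
    while True:
        result = review_loop(artifact, review, revise, max_rounds)
        decision, note = ask_human(result)     # "approve", "request_changes", "reject"
        if decision in ("approve", "reject"):
            return decision, result
        artifact = revise(result.artifact, note)  # back into the loop
```

The human gate sits outside the round budget on purpose: exhausting the rounds does not cancel anything, it just hands the full history to a person.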

Three Examples

We use consensus review for three things today - but the pattern applies anywhere you need independent assessments synthesized into a decision: compliance checks, deployment approvals, content moderation, vendor evaluations, or any multi-stakeholder review process. Here's how our three compositions work.

1. Design Review: "Is this spec good enough to build?"

Before any code is written, someone has to decide what to build. An agent creates a GitHub issue with a detailed specification. Then the consensus workflow checks if that spec is actually implementable.

The review prompt for design is specific:

Evaluate whether the spec is complete and unambiguous enough to implement without guessing.

VALIDATION CHECKLIST - answer each yes/no:

Does every new behavior have an explicit error/failure path?

Are all external dependencies identified with failure handling?

Does the spec include a 'Constraints / MUST NOT' section?

Can an implementer build this without making design decisions of their own?

Are boundary conditions and edge cases specified?

Compare the original request against the spec - any specification drift?

That last item is key. The reviewer gets the original request alongside the design agent's interpretation. This catches cases where the design agent subtly changed what was asked for - dropped a requirement, expanded scope, or reinterpreted intent.

If the reviewers find problems, the design agent refines the spec and the review runs again. Up to three rounds. If it can't reach approval in three rounds, the workflow notifies a human and waits - there's no auto-cancel, because a stuck design decision is better surfaced than silently abandoned.

2. PR Review: "Does this code match the spec?"

Once the spec is approved and an agent implements it, a different composition of the same workflow reviews the code. Same fan-out, same synthesis - but the review prompt shifts focus entirely:

VALIDATION CHECKLIST - answer each yes/no:

Does the implementation match the spec without omissions or unexplained additions?

Does every new code path have error handling?

Are there any security concerns (injection, auth bypass, data exposure)?

Does this break backward compatibility for existing consumers?

Are edge cases from the spec covered in the implementation?

Design review asks "is this spec complete?" PR review asks "does this code do what the spec says?" Same workflow, different lens.
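One way to picture "same workflow, different lens" is as data: each composition is just a focus statement, a checklist, and a round limit. The Python sketch below uses the checklist items and round limits from this post, but the ReviewComposition type, the prompt layout, and the PR focus line are illustrative assumptions, not the project's actual prompt format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComposition:
    name: str
    focus: str
    checklist: tuple[str, ...]
    max_rounds: int

    def build_prompt(self, artifact: str, context: str = "") -> str:
        """Assemble the review prompt (layout is an assumption)."""
        items = "\n".join(f"{i}. {q}" for i, q in enumerate(self.checklist, 1))
        parts = [self.focus, "VALIDATION CHECKLIST - answer each yes/no:", items]
        if context:
            parts.append("CONTEXT:\n" + context)
        parts.append("ARTIFACT:\n" + artifact)
        parts.append("End with exactly one verdict: "
                     "approved, changes_requested, or needs_human_review.")
        return "\n\n".join(parts)


DESIGN_REVIEW = ReviewComposition(
    name="design",
    focus=("Evaluate whether the spec is complete and unambiguous enough "
           "to implement without guessing."),
    checklist=(
        "Does every new behavior have an explicit error/failure path?",
        "Are all external dependencies identified with failure handling?",
        "Does the spec include a 'Constraints / MUST NOT' section?",
        "Can an implementer build this without making design decisions of their own?",
        "Are boundary conditions and edge cases specified?",
        "Compare the original request against the spec - any specification drift?",
    ),
    max_rounds=3,
)

PR_REVIEW = ReviewComposition(
    name="pr",
    focus="Evaluate whether this code does what the spec says.",
    checklist=(
        "Does the implementation match the spec without omissions or unexplained additions?",
        "Does every new code path have error handling?",
        "Are there any security concerns (injection, auth bypass, data exposure)?",
        "Does this break backward compatibility for existing consumers?",
        "Are edge cases from the spec covered in the implementation?",
    ),
    max_rounds=5,
)
```

For design review, the context argument would carry the original request, so the reviewer can check for specification drift.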

This one gets up to five rounds, not three - because code is harder to get right than specs. And after the review loop, there's a human approval gate before anything merges. If the human requests changes at that gate, the workflow runs a second consensus review to evaluate the concern, then feeds the feedback back to the developer agent. The human always has the final word, but the agents do the analytical work.

3. Design-to-PR: The Full Pipeline

The third composition doesn't invoke the consensus workflow directly. It chains the first two:

Run the design workflow (which internally uses consensus review for spec validation)
Capture the approved issue number
Fire the implementation workflow (which internally uses consensus review for code validation)

In a full design-to-PR pipeline, the same 146-line workflow is invoked at up to four points: twice during design (initial review + human-triggered re-review) and twice during implementation (same pattern), and the review at each stage can repeat across revision rounds. One building block, four review points, each with a different prompt tuned to what matters at that stage.
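The chaining itself is small. As a Python sketch with hypothetical callables: run_design returns the approved issue number, or None if the design was rejected, and run_implementation returns whatever the implementation workflow produces. Neither signature comes from the actual codebase.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def design_to_pr(
    request: str,
    run_design: Callable[[str], Optional[int]],
    run_implementation: Callable[[int], T],
) -> Optional[T]:
    """Chain the two parent workflows; this function never calls consensus review itself."""
    # 1. Design workflow (uses consensus review internally for spec validation).
    issue_number = run_design(request)

    # 2. Capture the approved issue number; stop if the design was not approved.
    if issue_number is None:
        return None

    # 3. Implementation workflow (uses consensus review internally for code validation).
    return run_implementation(issue_number)
```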

[Video: a 5-minute walkthrough of a real production PR going through this exact pipeline - design spec, consensus review, implementation, merge.]

Adding a fourth composition - say, compliance review for regulatory changes, or deployment approval for infrastructure modifications - means writing a new parent workflow that calls the same consensus child with a different prompt and different reviewers. The coordination mechanism never changes; only the review criteria do.

What It Actually Catches

Theory is nice. Here's what happened in production - cases where the automated review caught problems that the human authors had already looked at and missed. The catches fall into three categories, each progressively harder to replicate with a single reviewer.

The Wrong Bottleneck

A design spec proposed optimizing a data pipeline that took over 8 hours to run. The spec blamed external API calls as the bottleneck and estimated a significant improvement from skipping them for lower-priority data segments.

Two reviewers independently evaluated the proposal. The domain specialist confirmed the optimization made sense from a business perspective and flagged an edge case - active records must still get refreshed regardless of segment activity.

The code auditor read the actual source and found the spec was factually wrong about the system it described:

The code shows the external API calls do NOT happen per-record during the main processing loop. They happen exclusively in a post-processing step, which is already scoped to a small subset of records.

The actual bottleneck is the main processing loop: thousands of sequential API calls, tens of thousands of individual database lookups, and a comparable number of individual write operations.

The optimization would have targeted the wrong thing entirely. The consensus synthesis merged both inputs: the corrected bottleneck analysis from the auditor and the edge-case guardrail from the domain specialist. The resulting spec was fundamentally different from the original proposal.

This is what makes multi-agent review worth the complexity. Neither reviewer's output alone would have been sufficient: the domain specialist validated the intent but missed the technical error, while the code auditor found the error but wouldn't have known which edge cases to protect. The synthesizer produced a result that neither could have reached independently.

The Startup Crash Nobody Tested

A PR extracted hardcoded database seed data into a JSON config file. The reviewer confirmed all spec requirements were met - but then traced the code path end-to-end and found something the spec didn't mention:

If the seed file contains malformed JSON, JsonSerializer.Deserialize throws a JsonException that propagates unhandled, crashing the application at startup. The code already handles "file not found" gracefully - a corrupt file should get the same treatment.

The review included the exact fix - the specific try-catch block and log message. Not "add error handling" - the actual code. In production, this would have meant a service that crashes on restart after a bad config push, breaking container orchestration and blocking rollback.
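The actual fix was C# and isn't reproduced here. As a Python analog of the same pattern, the sketch below treats a corrupt seed file like a missing one: log it and start without seed data. The function name, logger name, messages, and the choice to return an empty list are all assumptions made for illustration.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("seed")


def load_seed_data(path: Path) -> list:
    """Load seed records; a missing or malformed file must not crash startup."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        # The "file not found" case the original code already handled.
        logger.warning("Seed file %s not found; starting without seed data", path)
    except json.JSONDecodeError as exc:
        # The case the reviewer flagged: malformed JSON gets the same treatment.
        logger.error("Seed file %s is malformed (%s); starting without seed data",
                     path, exc)
    return []
```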

This is what structured review produces. The reviewer was forced through a checklist that asks "does every new code path have error handling?" and traced each path to answer the question. A single-pass review would have stopped at "spec requirements met." The checklist forced the reviewer to keep going.

"Fixed and Verified Clean Build"

The previous two examples show the review system catching problems on the first pass. But what happens when the developer agent claims it fixed the problem?

An agent was tasked with modifying a configuration file. The review loop caught that the change was wrong - the agent had appended the new content after the existing file instead of replacing it. Classic write-vs-edit mistake. The review flagged it. The agent revised and reported back: "fixed and verified clean build."

The diff told a different story. The same append-instead-of-edit error was still there. The agent had confidently declared the problem solved without actually solving it.

Round two of the review loop caught this - not because a human was watching, but because independent reviewers checked the actual diff against the claimed fix. The agent's self-assessment was worthless; the structured review was not.

This is the failure mode that makes the iterative loop essential. Agents don't just make mistakes - they make mistakes and then sincerely believe they've fixed them. Without independent verification on every round, a confident "done" from the implementing agent would have reached the human gate looking like a clean fix.

One meta-case. phleet#13 specified Fleet.Telegram - a new MCP server that agents and workflows call to send Telegram messages. The issue spec went through 6 design-review rounds before implementation started, and the resulting PR phleet#14 shipped in 4 commits - 1 initial + 3 review-driven fixups. Those fixups caught a missed spec detail (the fallback field was computed internally but omitted from the success-response JSON), a missed doc update (the README architecture tree wasn't updated for the new service), and a confidentiality leak (a real chat ID was committed to API docs in a public repo).

Fleet.Telegram is the MCP server that now delivers the merge-approval and design-approval notifications described earlier in this post - the system reviewed itself while building the thing that tells humans to review things. Neither number is remarkable alone. Together, a 6-round spec and 3 review-driven code fixups on one small change show what a self-correcting loop looks like in practice.

The Counterintuitive Rules

Early in our system, review prompts said things like "evaluate whether the spec is complete and unambiguous." Agents responded with paragraphs of vague approval. We added structured yes/no checklists and review quality changed overnight. But the biggest improvement came from two counterintuitive rules:

Zero findings is suspicious. If a reviewer finds nothing wrong, they must explicitly state what they checked and acknowledge that zero findings may indicate insufficient review depth. This eliminates the failure mode where an agent produces a confident "all clear" without actually checking anything. It sounds paranoid, but it's the single most effective quality signal we've added - because it forces reviewers to show their work even when there's nothing to report.

Severity ratings are mandatory. Every finding is rated: blocker (cannot ship), high (production bug), medium (should fix), low (observation). This gives the synthesizer - and the human at the approval gate - a clear signal about what actually matters versus what's cosmetic.
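Both rules are easy to enforce mechanically once a review is structured data. The Python sketch below is one hypothetical way to model them. The Review and Finding types and both helper functions are illustrative, not taken from the codebase; only the four severity levels and the zero-findings rule come from this post.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Severity(Enum):
    BLOCKER = "blocker"  # cannot ship
    HIGH = "high"        # production bug
    MEDIUM = "medium"    # should fix
    LOW = "low"          # observation


# Most severe first, used to rank findings.
_ORDER = [Severity.BLOCKER, Severity.HIGH, Severity.MEDIUM, Severity.LOW]


@dataclass
class Finding:
    severity: Severity   # mandatory: a finding cannot be built without it
    description: str


@dataclass
class Review:
    findings: list[Finding] = field(default_factory=list)
    checked: list[str] = field(default_factory=list)  # what the reviewer examined


def validate_review(review: Review) -> list[str]:
    """Return rule violations; an empty list means the review is acceptable."""
    problems = []
    if not review.findings and not review.checked:
        # Zero findings is suspicious: the reviewer must show its work.
        problems.append("zero findings without a statement of what was checked")
    return problems


def worst_severity(review: Review) -> Optional[Severity]:
    """Most severe rating in the review, or None when there are no findings."""
    if not review.findings:
        return None
    return min((f.severity for f in review.findings), key=_ORDER.index)
```

A synthesizer or a human at the approval gate can then sort on worst_severity instead of rereading every paragraph.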

The goal isn't perfect reviews. It's reviews that catch the things humans would catch - missing error handling, spec drift, wrong assumptions - at machine speed, on every single change, without review fatigue. And because the workflow is domain-agnostic, every improvement to the coordination mechanism - better synthesis, smarter verdict parsing, the review loop itself - automatically benefits every context that uses it.

The consensus workflow itself is 146 lines, in ConsensusReviewWorkflow.cs, and is part of the Universal Workflow Engine that orchestrates it. The rest of the source lives at github.com/anurmatov/phleet.

Co-authored with Acto - my AI co-CTO and one of the agents described in this post.
