One Workflow, Three Jobs: How We Built a Reusable AI Review System
Anvar Nurmat · 2026-04-30 · via DEV Community

Previously: Phleet Architecture Deep Dive - how the overall multi-agent system works.

When you ask one AI agent to write code, you get code. When you ask a second agent to review it, you get a rubber stamp. "Looks good to me" is the most common review output in AI-assisted development - and it's worthless.

We spent months building a system where AI agents genuinely catch each other's mistakes. Not in theory. In production, on systems that matter.

At the core of our system is a single workflow - 146 lines of C# - that handles independent parallel assessment of any artifact: a design spec, a pull request, a deployment config, a vendor evaluation. You give it reviewers and a prompt. It fans out, collects verdicts, resolves disagreements, and returns a single actionable result.

We use it for three things today: design review, code review, and a pipeline that chains both. But the mechanism is general - anywhere you need multiple independent perspectives synthesized into a decision, it works.

The Problem With AI Reviews

Here's what happens when you tell an AI to "review this PR":

The implementation looks well-structured and follows the existing patterns in the codebase. The error handling appears adequate. No major concerns.

That's not a review. That's a hallucination of a review. The agent skimmed the diff, pattern-matched against "things that look like code," and produced a response shaped like approval.

We know this because we shipped code reviewed this way. It broke.

One Workflow, Three Stages

Our consensus review workflow does three things - fan out, parse, synthesize - with a fast-path shortcut when everyone agrees.

[Figure: Consensus review flow]

Fan-out. Multiple reviewer agents receive the same review prompt simultaneously. Each agent works independently - no peeking at each other's reviews. Each has 15 minutes and must end their response with an explicit verdict: approved, changes_requested, or needs_human_review.

Parse. The workflow extracts each agent's verdict. If an agent forgets to include one or writes something unrecognizable, it defaults to changes_requested. The conservative choice. We'd rather re-review than miss a bug. If every reviewer independently approves at this stage, we skip synthesis entirely and move on - unanimous approval happens often enough that the fast-path is worth it, but disagreement is common enough that synthesis earns its keep.
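The parse step and the fast-path check can be sketched in a few lines of Python. This is an illustration, not the project's C# code: the names `parse_verdict` and `is_unanimous_approval` are hypothetical, and the sketch assumes the verdict sits alone on the last non-empty line of the review, optionally prefixed with "Verdict:". The post does not show the real output format.

```python
import re
from typing import Iterable

DEFAULT_VERDICT = "changes_requested"  # the conservative fallback described above

# Assumption: the final line is just the verdict token, optionally prefixed
# by "Verdict:" and optionally followed by a period.
VERDICT_LINE = re.compile(
    r"(?:verdict\s*:\s*)?(approved|changes_requested|needs_human_review)\.?",
    re.IGNORECASE,
)


def parse_verdict(review_text: str) -> str:
    """Return the verdict on the last non-empty line, else the default."""
    lines = [line.strip() for line in review_text.splitlines() if line.strip()]
    if not lines:
        return DEFAULT_VERDICT
    match = VERDICT_LINE.fullmatch(lines[-1])
    return match.group(1).lower() if match else DEFAULT_VERDICT


def is_unanimous_approval(verdicts: Iterable[str]) -> bool:
    """True only when there is at least one verdict and all are 'approved'."""
    verdicts = list(verdicts)
    return bool(verdicts) and all(v == "approved" for v in verdicts)
```

Matching the whole final line, rather than searching for a token anywhere in the text, means a closing line such as "Verdict: not approved" falls back to `changes_requested` instead of being read as approval.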

Synthesize. When reviewers disagree - one approves, another requests changes - a synthesizer agent reads all the reviews and produces a single verdict. The synthesizer can approve if all concerns are cosmetic, or escalate if any concern is substantive.

Here's what synthesis looks like in practice. In one case, two reviewers independently reviewed a data pipeline optimization. One approved the approach and flagged an edge case to protect. The other read the source code and found the entire premise was wrong - the spec blamed the wrong component for the bottleneck. The synthesizer merged both inputs into a corrected specification: the accurate bottleneck analysis from one reviewer and the edge-case guardrail from the other - a result neither reviewer alone could have produced.

The workflow itself doesn't know what it's reviewing. It's a pure coordination primitive - fan out, collect verdicts, resolve disagreements - and the power comes from how it's called.
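As a rough sketch of that coordination primitive, here is a Python version built on `asyncio`. It is not the project's C# workflow. `consensus_review`, the reviewer and synthesizer callables, and the toy parser are all hypothetical names. The sketch also makes two assumptions the post does not confirm: a reviewer that times out counts as a missing verdict, and every outcome other than unanimous approval goes to the synthesizer.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

Reviewer = Callable[[str], Awaitable[str]]                    # prompt -> review text
Synthesizer = Callable[[str, Sequence[str]], Awaitable[str]]  # prompt, reviews -> verdict


async def consensus_review(
    prompt: str,
    reviewers: Sequence[Reviewer],
    parse: Callable[[str], str],
    synthesize: Synthesizer,
    timeout_s: float = 15 * 60,
) -> str:
    """Fan out one prompt, collect verdicts, synthesize unless all approve."""

    async def run_one(reviewer: Reviewer) -> str:
        try:
            return await asyncio.wait_for(reviewer(prompt), timeout=timeout_s)
        except asyncio.TimeoutError:
            # Assumption: a timed-out reviewer counts as a missing verdict.
            return ""

    # Reviewers run concurrently and never see each other's output.
    reviews = await asyncio.gather(*(run_one(r) for r in reviewers))
    verdicts = [parse(text) for text in reviews]

    if verdicts and all(v == "approved" for v in verdicts):
        return "approved"  # fast path: skip synthesis
    # Assumption: every non-unanimous outcome goes to the synthesizer.
    return await synthesize(prompt, reviews)


def last_line_verdict(text: str) -> str:
    """Toy parser for the demo: the last line must be exactly a verdict."""
    lines = text.strip().splitlines()
    last = lines[-1].strip() if lines else ""
    known = {"approved", "changes_requested", "needs_human_review"}
    return last if last in known else "changes_requested"


async def approver(prompt: str) -> str:
    return "Traced every error path.\napproved"


async def skeptic(prompt: str) -> str:
    return "The spec blames the wrong component.\nchanges_requested"


async def strict_synthesizer(prompt: str, reviews: Sequence[str]) -> str:
    # Stand-in for an LLM call: escalate whenever synthesis is needed.
    return "changes_requested"


result = asyncio.run(
    consensus_review("Review this spec", [approver, skeptic],
                     last_line_verdict, strict_synthesizer)
)
print(result)  # prints "changes_requested"
```

Nothing in `consensus_review` refers to specs, diffs, or code. The prompt and the reviewer list carry all the domain knowledge, which is the property the post relies on.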

The Self-Correcting Loop

A single review pass is useful. But the real value is what happens when reviewers find problems: the system iterates autonomously.

[Figure: Review loop]

The agent that produced the original output receives the consolidated feedback and revises. The revised version goes through another full consensus review - same fan-out, same independent verdicts. This loop repeats up to N rounds (three for design specs, five for code). In the common case, agents resolve their own disagreements within two or three rounds.

Whether agents converge or the loop exhausts its budget, the result always reaches a human gate. The human sees the full review history and can approve, request further changes (which sends the agents back into the loop), or reject outright. Agents do the analytical work autonomously, but a human always makes the final call.

[Figure: Human gate: dashboard signal approval UI]

This is what makes it more than a one-shot review tool. It's a self-correcting feedback loop with human oversight built into every path, not just the failure cases.
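The loop and the gate can be sketched as two small Python functions. This is a minimal illustration under stated assumptions, not the project's implementation. `review_loop`, `run_with_human_gate`, and `LoopResult` are hypothetical names, `review` stands in for a full consensus review, and the sketch assumes a `needs_human_review` verdict ends the loop early.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

ReviewFn = Callable[[str], Tuple[str, str]]  # artifact -> (verdict, feedback)
ReviseFn = Callable[[str, str], str]         # artifact, feedback -> revised artifact


@dataclass
class LoopResult:
    artifact: str
    converged: bool
    history: List[Tuple[int, str, str]] = field(default_factory=list)


def review_loop(artifact: str, review: ReviewFn, revise: ReviseFn,
                max_rounds: int) -> LoopResult:
    """Review, revise, and re-review until approval or the budget runs out."""
    history: List[Tuple[int, str, str]] = []
    for round_no in range(1, max_rounds + 1):
        verdict, feedback = review(artifact)
        history.append((round_no, verdict, feedback))
        if verdict == "approved":
            return LoopResult(artifact, True, history)
        if verdict == "needs_human_review":
            break  # assumption: escalate to the human immediately
        if round_no < max_rounds:
            # Do not revise after the final round, so the human sees
            # exactly the artifact that was last reviewed.
            artifact = revise(artifact, feedback)
    return LoopResult(artifact, False, history)


def run_with_human_gate(artifact, review, revise, decide, max_rounds):
    """Every path ends at `decide`, which sees the full review history.

    `decide` returns (decision, feedback) where decision is one of
    "approve", "request_changes", or "reject" (hypothetical labels).
    """
    while True:
        result = review_loop(artifact, review, revise, max_rounds)
        decision, human_feedback = decide(result)
        if decision != "request_changes":
            return decision, result
        # Human-requested changes send the agents back into the loop.
        artifact = revise(result.artifact, human_feedback)


# Toy stand-ins: the "reviewers" approve once two revisions have been applied.
def toy_review(artifact: str) -> Tuple[str, str]:
    if artifact.count("+") >= 2:
        return "approved", ""
    return "changes_requested", "needs another pass"


def toy_revise(artifact: str, feedback: str) -> str:
    return artifact + "+"


design = review_loop("spec", toy_review, toy_revise, max_rounds=3)
print(design.converged, len(design.history))  # prints "True 3"
```

The human gate sits outside the loop on purpose: it is reached whether the loop converged or ran out of rounds, which matches the post's point that oversight is on every path.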

Three Examples

We use consensus review for three things today - but the pattern applies anywhere you need independent assessments synthesized into a decision: compliance checks, deployment approvals, content moderation, vendor evaluations, or any multi-stakeholder review process. Here's how our three compositions work.

1. Design Review: "Is this spec good enough to build?"

Before any code is written, someone has to decide what to build. An agent creates a GitHub issue with a detailed specification. Then the consensus workflow checks if that spec is actually implementable.

The review prompt for design is specific:

Evaluate whether the spec is complete and unambiguous enough to implement without guessing.

VALIDATION CHECKLIST - answer each yes/no:

  1. Does every new behavior have an explicit error/failure path?
  2. Are all external dependencies identified with failure handling?
  3. Does the spec include a 'Constraints / MUST NOT' section?
  4. Can an implementer build this without making design decisions of their own?
  5. Are boundary conditions and edge cases specified?
  6. Compare the original request against the spec - any specification drift?

That last item is key. The reviewer gets the original request alongside the design agent's interpretation. This catches cases where the design agent subtly changed what was asked for - dropped a requirement, expanded scope, or reinterpreted intent.

If the reviewers find problems, the design agent refines the spec and the review runs again. Up to three rounds. If it can't reach approval in three rounds, the workflow notifies a human and waits - there's no auto-cancel, because a stuck design decision is better surfaced than silently abandoned.

2. PR Review: "Does this code match the spec?"

Once the spec is approved and an agent implements it, a different composition of the same workflow reviews the code. Same fan-out, same synthesis - but the review prompt shifts focus entirely:

VALIDATION CHECKLIST - answer each yes/no:

  1. Does the implementation match the spec without omissions or unexplained additions?
  2. Does every new code path have error handling?
  3. Are there any security concerns (injection, auth bypass, data exposure)?
  4. Does this break backward compatibility for existing consumers?
  5. Are edge cases from the spec covered in the implementation?

Design review asks "is this spec complete?" PR review asks "does this code do what the spec says?" Same workflow, different lens.
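One way to picture "same workflow, different lens" is to treat each composition as plain data. The Python sketch below is illustrative only. `ReviewComposition` and `VERDICT_INSTRUCTION` are hypothetical, and the wording of the verdict instruction is an assumption. The checklist items and round limits are copied from this post.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: the exact wording is not shown in the post, only the three verdicts.
VERDICT_INSTRUCTION = (
    "End with exactly one verdict: approved, changes_requested, or needs_human_review."
)


@dataclass(frozen=True)
class ReviewComposition:
    name: str
    intro: str
    checklist: Tuple[str, ...]
    max_rounds: int

    def render_prompt(self) -> str:
        """Build the review prompt that every reviewer receives."""
        lines = [self.intro, ""] if self.intro else []
        lines.append("VALIDATION CHECKLIST - answer each yes/no:")
        lines += [f"  {i}. {item}" for i, item in enumerate(self.checklist, start=1)]
        lines += ["", VERDICT_INSTRUCTION]
        return "\n".join(lines)


DESIGN_REVIEW = ReviewComposition(
    name="design",
    intro=("Evaluate whether the spec is complete and unambiguous enough "
           "to implement without guessing."),
    checklist=(
        "Does every new behavior have an explicit error/failure path?",
        "Are all external dependencies identified with failure handling?",
        "Does the spec include a 'Constraints / MUST NOT' section?",
        "Can an implementer build this without making design decisions of their own?",
        "Are boundary conditions and edge cases specified?",
        "Compare the original request against the spec - any specification drift?",
    ),
    max_rounds=3,
)

PR_REVIEW = ReviewComposition(
    name="pr",
    intro="",  # the post shows only the checklist for PR review
    checklist=(
        "Does the implementation match the spec without omissions or unexplained additions?",
        "Does every new code path have error handling?",
        "Are there any security concerns (injection, auth bypass, data exposure)?",
        "Does this break backward compatibility for existing consumers?",
        "Are edge cases from the spec covered in the implementation?",
    ),
    max_rounds=5,
)

print(PR_REVIEW.render_prompt())
```

Under this framing, a new review context is a new `ReviewComposition` value (plus a reviewer list), and the coordination code stays untouched.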

This one gets up to five rounds, not three - because code is harder to get right than specs. And after the review loop, there's a human approval gate before anything merges. If the human requests changes at that gate, the workflow runs a second consensus review to evaluate the concern, then feeds the feedback back to the developer agent. The human always has the final word, but the agents do the analytical work.

3. Design-to-PR: The Full Pipeline

The third composition doesn't invoke the consensus workflow directly. It chains the first two:

  1. Run the design workflow (which internally uses consensus review for spec validation)
  2. Capture the approved issue number
  3. Fire the implementation workflow (which internally uses consensus review for code validation)

In a full design-to-PR pipeline, the same 146-line workflow can execute up to four times: twice during design (initial review + human-triggered re-review) and twice during implementation (same pattern). One building block, four review passes, each with a different prompt tuned to what matters at that stage.
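The chaining itself is small. A hedged Python sketch follows, with hypothetical names throughout: `design_to_pr` and the two workflow callables are placeholders, and returning `None` when the design is rejected is an assumption about how a rejection would surface.

```python
from typing import Callable, Optional


def design_to_pr(
    request: str,
    run_design_workflow: Callable[[str], Optional[int]],
    run_implementation_workflow: Callable[[int], str],
) -> Optional[str]:
    """Chain the two parent workflows; each runs its own consensus reviews."""
    # Steps 1 and 2: run design review and capture the approved issue number.
    issue_number = run_design_workflow(request)
    if issue_number is None:
        return None  # assumption: a rejected design stops the pipeline
    # Step 3: fire implementation against the approved issue.
    return run_implementation_workflow(issue_number)


# Toy stand-ins for the two parent workflows (the issue number is made up).
def fake_design(request: str) -> Optional[int]:
    return 101 if request else None


def fake_implementation(issue_number: int) -> str:
    return f"PR implementing issue #{issue_number}"


print(design_to_pr("Add retry to uploader", fake_design, fake_implementation))
```

The pipeline function never calls the consensus workflow directly, which mirrors the description above: it only sequences the two compositions that do.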

The original post embeds a 5-minute walkthrough of a real production PR going through this exact pipeline - design spec, consensus review, implementation, merge.

Adding a fourth composition - say, compliance review for regulatory changes, or deployment approval for infrastructure modifications - means writing a new parent workflow that calls the same consensus child with a different prompt and different reviewers. The coordination mechanism never changes; only the review criteria do.

What It Actually Catches

Theory is nice. Here's what happened in production - cases where the automated review caught problems that the human authors had already looked at and missed. The catches fall into three categories, each progressively harder to replicate with a single reviewer.

The Wrong Bottleneck

A design spec proposed optimizing a data pipeline that took over 8 hours to run. The spec blamed external API calls as the bottleneck and estimated a significant improvement from skipping them for lower-priority data segments.

Two reviewers independently evaluated the proposal. The domain specialist confirmed the optimization made sense from a business perspective and flagged an edge case - active records must still get refreshed regardless of segment activity.

The code auditor read the actual source and found the spec was factually wrong about the system it described:

The code shows the external API calls do NOT happen per-record during the main processing loop. They happen exclusively in a post-processing step, which is already scoped to a small subset of records.

The actual bottleneck is the main processing loop: thousands of sequential API calls, tens of thousands of individual database lookups, and a comparable number of individual write operations.

The optimization would have targeted the wrong thing entirely. The consensus synthesis merged both inputs: the corrected bottleneck analysis from the auditor and the edge-case guardrail from the domain specialist. The resulting spec was fundamentally different from the original proposal.

This is what makes multi-agent review worth the complexity. Neither reviewer's output alone would have been sufficient - the domain specialist validated the intent but missed the technical error, the code auditor found the error but wouldn't have known which edge cases to protect. The synthesizer produced a result that neither could have reached independently.

The Startup Crash Nobody Tested

A PR extracted hardcoded database seed data into a JSON config file. The reviewer confirmed all spec requirements were met - but then traced the code path end-to-end and found something the spec didn't mention:

If the seed file contains malformed JSON, JsonSerializer.Deserialize throws a JsonException that propagates unhandled, crashing the application at startup. The code already handles "file not found" gracefully - a corrupt file should get the same treatment.

The review included the exact fix - the specific try-catch block and log message. Not "add error handling" - the actual code. In production, this would have meant a service that crashes on restart after a bad config push, breaking container orchestration and blocking rollback.
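The shape of that fix is easy to show. The sketch below is a Python analog, not the C# from the actual review (which the post does not reproduce). `load_seed_data`, the logger name, and the fallback behavior are assumptions. The point is that a corrupt file takes the same graceful path as a missing one.

```python
import json
import logging
import tempfile
from pathlib import Path
from typing import Any

log = logging.getLogger("seed")


def load_seed_data(path: Path, fallback: Any = None) -> Any:
    """Load seed data, treating a corrupt file like a missing one."""
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        log.warning("Seed file %s not found; using fallback", path)
        return fallback
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Without this handler the exception would propagate and
        # crash the application at startup.
        log.error("Seed file %s is malformed (%s); using fallback", path, exc)
        return fallback


with tempfile.TemporaryDirectory() as tmp:
    corrupt = Path(tmp) / "seed.json"
    corrupt.write_text('{"users": [', encoding="utf-8")  # truncated JSON
    valid = Path(tmp) / "ok.json"
    valid.write_text('{"users": []}', encoding="utf-8")

    from_corrupt = load_seed_data(corrupt, fallback=[])
    from_missing = load_seed_data(Path(tmp) / "missing.json", fallback=[])
    from_valid = load_seed_data(valid, fallback=[])

print(from_corrupt, from_missing, from_valid)  # prints "[] [] {'users': []}"
```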

This is what structured review produces. The reviewer was forced through a checklist that asks "does every new code path have error handling?" and traced each path to answer the question. A single-pass review would have stopped at "spec requirements met." The checklist forced the reviewer to keep going.

"Fixed and Verified Clean Build"

The previous two examples show the review system catching problems on the first pass. But what happens when the developer agent claims it fixed the problem?

An agent was tasked with modifying a configuration file. The review loop caught that the change was wrong - the agent had appended the new content after the existing file instead of replacing it. Classic write-vs-edit mistake. The review flagged it. The agent revised and reported back: "fixed and verified clean build."

The diff told a different story. The same append-instead-of-edit error was still there. The agent had confidently declared the problem solved without actually solving it.

[Figure: Review loop catching a false fix]

Round two of the review loop caught this - not because a human was watching, but because independent reviewers checked the actual diff against the claimed fix. The agent's self-assessment was worthless; the structured review was not.

This is the failure mode that makes the iterative loop essential. Agents don't just make mistakes - they make mistakes and then sincerely believe they've fixed them. Without independent verification on every round, a confident "done" from the implementing agent would have reached the human gate looking like a clean fix.

One meta-case. phleet#13 specified Fleet.Telegram - a new MCP server that agents and workflows call to send Telegram messages. The issue spec went through 6 design-review rounds before implementation started, and the resulting PR phleet#14 shipped in 4 commits - 1 initial + 3 review-driven fixups. Those fixups caught a missed spec detail (the fallback field was computed internally but omitted from the success-response JSON), a missed doc update (the README architecture tree wasn't updated for the new service), and a confidentiality leak (a real chat ID was committed to API docs in a public repo). Fleet.Telegram is the MCP server that now delivers the merge-approval and design-approval notifications described earlier in this post - the system reviewed itself while building the thing that tells humans to review things. Neither number is remarkable alone; together, a 6-round spec and 3 review-driven code fixups on one small change is what a self-correcting loop looks like in wall-clock terms.

The Counterintuitive Rules

Early in our system, review prompts said things like "evaluate whether the spec is complete and unambiguous." Agents responded with paragraphs of vague approval. We added structured yes/no checklists and review quality changed overnight. But the biggest improvement came from two counterintuitive rules:

Zero findings is suspicious. If a reviewer finds nothing wrong, they must explicitly state what they checked and acknowledge that zero findings may indicate insufficient review depth. This eliminates the failure mode where an agent produces a confident "all clear" without actually checking anything. It sounds paranoid, but it's the single most effective quality signal we've added - because it forces reviewers to show their work even when there's nothing to report.

Severity ratings are mandatory. Every finding is rated: blocker (cannot ship), high (production bug), medium (should fix), low (observation). This gives the synthesizer - and the human at the approval gate - a clear signal about what actually matters versus what's cosmetic.
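Both rules can be enforced mechanically once a review is structured data. The following Python sketch is an assumption about how that might look, not the project's schema. `Severity`, `Finding`, `Review`, and `validate_review` are hypothetical names. The four severity levels come from the text above.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    LOW = 1       # observation
    MEDIUM = 2    # should fix
    HIGH = 3      # production bug
    BLOCKER = 4   # cannot ship


@dataclass
class Finding:
    severity: Severity  # mandatory: a finding cannot be created without one
    description: str


@dataclass
class Review:
    findings: List[Finding] = field(default_factory=list)
    checked: List[str] = field(default_factory=list)  # what the reviewer examined
    acknowledges_possible_shallow_review: bool = False


def validate_review(review: Review) -> List[str]:
    """Return rule violations; an empty list means the review is acceptable."""
    problems: List[str] = []
    if not review.findings:
        # Rule 1: zero findings is suspicious, so the reviewer must show its work.
        if not review.checked:
            problems.append("zero findings without stating what was checked")
        if not review.acknowledges_possible_shallow_review:
            problems.append("zero findings without acknowledging possible shallow review")
    return problems


def max_severity(review: Review) -> Optional[Severity]:
    """Highest severity in the review, or None when there are no findings."""
    return max((f.severity for f in review.findings), default=None)


rubber_stamp = Review()
careful_clear = Review(checked=["error paths", "spec drift"],
                       acknowledges_possible_shallow_review=True)
with_bug = Review(findings=[Finding(Severity.LOW, "typo in log message"),
                            Finding(Severity.BLOCKER, "unhandled JsonException at startup")])

print(len(validate_review(rubber_stamp)), validate_review(careful_clear),
      max_severity(with_bug).name)  # prints "2 [] BLOCKER"
```

Using an ordered enum lets a synthesizer or dashboard sort findings and separate blockers from cosmetic notes with a single comparison.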

The goal isn't perfect reviews. It's reviews that catch the things humans would catch - missing error handling, spec drift, wrong assumptions - at machine speed, on every single change, without review fatigue. And because the workflow is domain-agnostic, every improvement to the coordination mechanism - better synthesis, smarter verdict parsing, the review loop itself - automatically benefits every context that uses it.


The consensus workflow itself is 146 lines in ConsensusReviewWorkflow.cs, part of the Universal Workflow Engine that orchestrates it. The rest of the source lives at github.com/anurmatov/phleet.

Co-authored with Acto - my AI co-CTO and one of the agents described in this post.