How I Run 7 AI Models 24/7: Multi-Agent Architecture in Practice
Judy · 2026-05-23 · via DEV Community

TL;DR: I organized seven different models into a 24/7 AI team using a Multi-Agent architecture: Claude Opus supervises and breaks down tasks, MiniMax writes code, Hermes writes articles, Gemini CLI checks facts, and Groq Llama makes trading decisions. Linear is the control console. Task cards get picked up within 5 minutes and pass through Gate review and QA fact-checking before anything is reported back to me. This article covers the full architecture, the logic behind each model's role, how Gate blocked 300+ fake completions, and how you can start from the smallest unit.


Why Everyone Asks Me "How Do You Make AI Work Automatically"

I've been running an almost all-AI company for a while now, with a Multi-Agent architecture running internally that coordinates seven different AI models like a real company working together. Whenever I introduce this Multi-Agent architecture to people, the most common thing they want to know isn't "what model do you use", but rather this:

"How exactly do you make AI work on its own?"

A lot of people have tried letting AI take over tasks, but they all hit the same bottleneck — they still have to constantly monitor it, and eventually it feels faster to just do it themselves. The problem isn't that AI isn't strong enough, the problem is architecture. A single AI, no matter how powerful, can't do "pick up tasks on its own, divide work, review, and deliver" well. To achieve true AI automation, you don't need a stronger model, you need Multi-Agent architecture.

My daily routine doesn't look like that. I think of a task in the morning, open Linear, create a card, tag it "J", hit Enter. Within 5 minutes, that card gets picked up by my supervisor Agent, which judges whether it's writing an article, coding, research or marketing promotion, assigns it to the corresponding role, that role starts executing, passes through Gate review, then QA fact-checking, and only reports back to me after everything passes.

The only two things I do: post the card, read the results.

In this article, I'll explain the entire Multi-Agent architecture clearly, including my detours from "one dedicated Agent per pipeline" to "seven different models working by specialization".


Common Mistakes: My Multi-Agent Architecture Detours

Initially, I made one intuitive mistake too.

I thought: "Different things need different Agents." So I created a bunch: Blog-writing Agent, X-posting Agent, trading signal Agent, market intel Agent, SEO Agent, image generation Agent…

Three months later, I discovered several issues:

  • Memory isn't shared. The Blog Agent doesn't know what the X-posting Agent posted last week.
  • They don't know what each other is doing. One Agent wrote a blog post, another Agent rewrote it when posting to X.
  • Debugging is a nightmare. When something goes wrong, you have to check five logs.

I tore it all down and rebuilt it, using "specialty" rather than "task" as the classification axis.

What this means: whichever pipeline needs writing goes to the Writing Agent, and whichever pipeline needs fact-checking goes to the QA Agent. There is one Agent per specialty, shared by all pipelines.

With this setup:

  • Writing Agent handles all writing (Blog, X, Email, product copy).
  • QA Agent handles all reviews (writing quality, code review, fact-checking).
  • Engineering Agent handles all code implementation.
  • Supervisor Agent handles breaking down tasks, assigning, and sending back.

More isn't always better. Less but correct is.

Only after this architecture stabilized did the rest of the story happen.


What the Multi-Agent Architecture Looks Like

       Issue Task
         │
         ▼
   ┌──────────┐  Every 5 min polling  ┌───────────────────┐
   │  Linear  │ ─────────────────────▶│ Central Scheduler │
   └──────────┘                       └───┬───────────────┘
                                          │
                                          ▼
                         ┌──────────────────────┐
                         │  Supervisor Agent    │ ── Break tasks, assign, send back
                         │ (Opus)              │
                         └──────────┬───────────┘
                                    │
            ┌────────────┬──────────┼──────────┬────────────┐
            ▼            ▼          ▼          ▼            ▼
       ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐
       │ Engineer│ │ Writer │  │  QA    │  │Marketing│  │ Trading│
       │MiniMax │  │Hermes  │  │Gemini  │  │MiniMax │  │ Groq   │
       └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘
           │           │           │           │           │
           └───────────┴────┬──────┴───────────┴───────────┘
                            │
                            ▼
                    ┌──────────────┐
                    │  Gate Review │ ── Block vagueness, fabrication, leaks
                    └──────┬───────┘
                           │
                           ▼
                    ┌──────────────┐
                    │ QA Fact-Check│ ── Built-in search verification
                    └──────┬───────┘
                           │
                           ▼
                        Report to Judy


Each box is an independent script or service, communicating via file-based mailboxes. The whole thing runs on an Oracle cloud VM, using cron scheduling, no message queue, no webhook, no extra middleware. Simple enough that I can debug directly by reading files myself.


How It Runs: Step-by-Step Breakdown From Card to Publication

Let me give a concrete example — this morning I wanted to write an introduction article about some AI tool.

Step 1 (00:00) — I open Linear, create a card: "Write an XX tool introduction, 800 words, target audience is indie developers who want to automate with AI". Tag it "J", hit Enter. Close Linear, move on to other things.

Step 2 (+5 minutes) — Central scheduler polls and finds this new card. It sees the tag "J", writes the card content to the Supervisor Agent's mailbox.

Step 3 (+6 minutes) — Supervisor Agent wakes up, reads mailbox, judges this is a "writing" type task, assigns it to the Writer Agent's mailbox with instructions: "800 words, tutorial style, target audience indie developers, include FAQ".

Step 4 (+10 minutes) — Writer Agent drafts. Hermes model writes and drops to draft folder.

Step 5 (+15 minutes) — Gate review scans the draft, checking for vague hedging language that shifts responsibility back to humans, fabricated KPI numbers, internal path leaks. This time Gate caught one sentence: "data source is internal database" — automatically sent back to Writer Agent to change to "estimated based on public data".

Step 6 (+20 minutes) — After revision, passes Gate. Moves to QA stage.

Step 7 (+30 minutes) — QA Agent uses Gemini's built-in search to verify each fact mentioned in the article. It found a version number error and flagged it automatically. Sent back for a minor fix.

Step 8 (+45 minutes) — Fixed, all passes. Article auto-syncs to Notion, status "Waiting for Judy's approval".

Step 9 — After breakfast, I open Notion and see a ready-to-publish draft. After reviewing, I click "Ready to publish". System automatically translates to English and Korean, deploys to blog.

Total time I spent: 30 seconds to post card + 5 minutes to review. Rest of the time I'm doing other stuff.


My Collaboration Mode: Opus Gives Framework, MiniMax Implements

This is the core know-how of the entire architecture, let me give a concrete example — coding.

Say I'm writing a new strategy for a trading system. The most intuitive approach is to have Claude Opus write it from start to finish. That works, but it is expensive and slow.

My approach is:

  1. Opus handles framework breakdown — I give Opus requirements, it outputs: file structure, function signatures, responsibilities for each function, edge cases, tests needed. No implementation details, just the skeleton.
  2. MiniMax fills in implementation — The skeleton Opus broke down gets handed to MiniMax, which fills in one function at a time.
  3. Sonnet handles code review — After MiniMax writes, Sonnet does a round of review, catching logic holes and edge cases.
  4. Opus handles final polish — For issues Sonnet flags, back to Opus to decide whether to fix and how.

Why split this way?

  • Opus is strong at reasoning, best for judgment work (breaking down, polishing). But running at scale is expensive, running full capacity daily hurts the wallet.
  • MiniMax is subscription-based, fast and cheap to write, context is long enough, best for implementing according to spec.
  • Sonnet sits between the two, perfect for review — cheaper than Opus, but still solid logic.

This pattern applies to writing too: Opus gives structural outline, Hermes writes first draft, Sonnet does factual review, Opus polishes the final round.
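One way to picture this hand-off is as a chain of stages, where each stage is a function from text to text. The sketch below is illustrative only: `Stage`, `run_pipeline`, and the stub model functions are hypothetical names, and a real setup would wrap CLI or API calls and likely pass the reviewer's notes together with the code, not just the previous output.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A model call is modeled as a plain function from input text to output text.
ModelFn = Callable[[str], str]


@dataclass
class Stage:
    name: str       # e.g. "plan", "implement", "review", "polish"
    model: ModelFn  # hypothetical wrapper around a real CLI or API call


def run_pipeline(stages: List[Stage], requirement: str) -> List[Tuple[str, str]]:
    """Feed each stage's output into the next and keep a trace for auditing."""
    trace = []
    text = requirement
    for stage in stages:
        text = stage.model(text)
        trace.append((stage.name, text))
    return trace


# Stub models so the sketch runs without any API keys.
def planner(req: str) -> str:
    return f"SKELETON for: {req}"


def implementer(skeleton: str) -> str:
    return f"CODE filling [{skeleton}]"


def reviewer(code: str) -> str:
    return f"REVIEW NOTES on [{code}]"


def polisher(notes: str) -> str:
    return f"FINAL after [{notes}]"


STAGES = [
    Stage("plan", planner),           # strongest model: structure only
    Stage("implement", implementer),  # cheap model: fills in the skeleton
    Stage("review", reviewer),        # mid-tier model: looks for holes
    Stage("polish", polisher),        # strongest model: decides what to fix
]
```

Swapping the writer in for the implementer only changes one `Stage` entry, which is the point of keeping the stages uniform.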

Let each model do what it's best at. That's the only principle in my Multi-Agent architecture.


Seven Models, Seven Specialties

I'm currently running seven different models, each with its dedicated position:

1. Claude Opus 4.x — Strategist / COO

Strongest judgment. Handles breaking down tasks, assigning, sending back, programming framework, final code review, dispute arbitration.

When to use it: When choices need to be made. For example "which bug fix A or B is correct", "who should this card go to", "this blog post is going off direction halfway, how to save it".

2. Claude Sonnet 4.6 — Writing fact-check / code review / backup trading analyst

Cheaper version of Opus. Logic is almost as solid, price is half.

When to use it: Scenarios requiring rigorous reasoning at high volume. Blog tutorial fact-checking, code review, trading analyst (Groq's backup).

3. MiniMax M2.7 — Engineering implementation / marketing execution

Subscription-based, long context, fast writer.

When to use it: Implementing according to already-broken-down framework (no need to make own judgments), marketing copy execution, translation.

4. Hermes (via OpenRouter) — Writing role

Writing quality is sufficient, low cost. Long-form style is stable.

When to use it: Blog draft, social post draft, product copy draft. All writing that needs output length but doesn't require fact-checking.

5. Gemini CLI subscription — QA fact-checking

Built-in web search is its irreplaceable advantage. Some models need external search APIs to check facts, Gemini CLI searches on its own.

Note: There's also a practical reason for picking Gemini CLI — I bought a full-year subscription at the start of the year, might as well use it.

When to use it: Check facts before Blog goes live, verify press releases, market research source comparison.

6. Gemini API (Flash) — News pipeline

Free tier is enough for daily news aggregation, fast enough, API is stable.

When to use it: Daily news fetching and organization pipeline, high volume, not much reasoning needed. The free tier means I don't have to worry about costs exploding.

7. Groq Llama-4-Scout 17B — Trading signals / position management

Extremely fast, inference latency is 1/10 of other models.

When to use it: Scenarios where trading strategies need instant response — signal review (decide whether to enter), position management (suggest HOLD / tight stop loss / profit taking / close immediately). Things where losing a second means losing money.
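The role-to-model assignments above amount to a small routing table. A minimal sketch, assuming each model is reachable through a callable keyed by a label: the labels, `ROLE_MODELS`, and `call_with_fallback` are hypothetical, and the only fallback shown is the Sonnet backup for trading mentioned earlier.

```python
from typing import Callable, Dict, List, Tuple

# Role -> ordered list of model labels; the first entry is the primary.
# Labels are illustrative shorthand, not real API model identifiers.
ROLE_MODELS: Dict[str, List[str]] = {
    "supervisor": ["opus"],
    "engineer": ["minimax"],
    "marketing": ["minimax"],
    "writer": ["hermes"],
    "qa": ["gemini-cli"],
    "news": ["gemini-flash"],
    "trading": ["groq-llama", "sonnet"],  # Sonnet acts as the backup analyst
}


def call_with_fallback(
    role: str, prompt: str, clients: Dict[str, Callable[[str], str]]
) -> Tuple[str, str]:
    """Try each model configured for the role until one returns a result."""
    errors = []
    for label in ROLE_MODELS[role]:
        try:
            return label, clients[label](prompt)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{label}: {exc}")
    raise RuntimeError(f"all models failed for role {role!r}: {errors}")
```

Keeping the table in one place means changing a role's model is a one-line edit rather than a change to every pipeline.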


Why Use Linear as Control Console

I tried Notion, tried Slack, tried building my own dashboard, but kept Linear. Reason is simple:

Issue tracker is already the most natural interface for assigning work to humans.

Labels are routing, status is workflow, comments are communication, sub-tasks are breakdown. These features don't need designing, they're built-in. The only thing I need to do is connect "AI Agents that pick up cards on their own", treating Linear as their source of work.

More importantly: comment round-trips also go through Linear. Agent leaves comments when done, I comment questions back, Agent sees comments and replies. The entire conversation flow automatically deposits on the card — for future review, I just look at that card.

Notion is too flexible, Slack is too linear. I do have a custom dashboard — but Agents keep forgetting to update it, requiring me to nag them to write, which defeats the purpose. Linear is just right — status flow is built-in, Agents must go through Linear to do work, there's no "forgot to update" option.


Files Are the Communication Protocol

A lot of people ask how Agents communicate. They use files.

Each Agent has a mailbox folder. Supervisor assigning work means writing a message file into the corresponding mailbox. Executing Agents poll their own mailboxes, see new messages, process them, then write results to a "completed" folder.

Why not use message queue, Redis, or webhook?

  • Simple — cron can run it. No servers to maintain.
  • Auditable — every message is a file, to trace debugging just read the file content.
  • Rerunnable — result wrong? Move file back to mailbox and reprocess.
  • Zero dependencies — I don't need to learn any new tools, what's built into the OS is enough.

Sounds primitive, but runs stably. The simplest tools are often the most durable.
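The mailbox flow described in this section can be sketched in a few lines of standard-library Python. This is an assumption-laden illustration, not the actual scripts: the folder layout, the JSON message format, and the names `send_message` and `poll_mailbox` are all hypothetical.

```python
import json
import os
import time
import uuid
from pathlib import Path


def send_message(mailbox: Path, payload: dict) -> Path:
    """Write a message atomically: write a temp file, then rename it into place."""
    mailbox.mkdir(parents=True, exist_ok=True)
    name = f"{time.time_ns()}-{uuid.uuid4().hex}.json"
    tmp = mailbox / (name + ".tmp")  # .tmp suffix keeps pollers from reading it early
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    final = mailbox / name
    os.replace(tmp, final)           # atomic when both paths are on one filesystem
    return final


def poll_mailbox(mailbox: Path, done_dir: Path, handler) -> int:
    """Process every pending message in filename order; return how many were handled."""
    done_dir.mkdir(parents=True, exist_ok=True)
    handled = 0
    for msg in sorted(mailbox.glob("*.json")):
        payload = json.loads(msg.read_text(encoding="utf-8"))
        result = handler(payload)
        out = done_dir / msg.name
        out.write_text(json.dumps({"task": payload, "result": result}), encoding="utf-8")
        msg.unlink()                 # remove the message only after the result is saved
        handled += 1
    return handled
```

Rerunning a task is then just a matter of moving its file back into the mailbox, and a cron entry calling `poll_mailbox` stands in for a queue consumer.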


Gate + QA + TA Three-Layer Review: Why I Trust AI But Not Completely

AI's biggest problem isn't that it can't produce output. It's that it produces output that looks right but isn't.

I've encountered several specific situations:

  • Finishes code and tells me it's all good — turns out it wasn't tested at all.
  • Writes an article citing a data point, the data was made up.
  • Copy accidentally leaks internal paths or API variable names.
  • Same bug fixed three times, all wrong, each time saying "this one is right".

That's why I added three layers of automated review.

Gate Layer — Detect Lazy Patterns

Gate is a series of regex + rules, automatically scanning Agent outputs:

  • Vague hedging detection — catches language that shifts responsibility back to humans (asking humans to check themselves, vaguely describing "should be fine", hedging in English), automatically FAIL. To PASS, evidence must be attached (command output, file content, API response).
  • Same-card bounce count — a card bounced by the same Agent more than 3 times is automatically reassigned to another Agent, preventing infinite loops.
  • Timeout alerts — Agent picks up task but no report after 6 hours, push notification sent.
  • Internal info leak — paths, hostnames, API variable names, private accounts, financial data, blocks upon detection.

The Gate layer alone has blocked over 380 fake completions and triggered over 90 automatic bounces. None of those fake completions ever reached my eyes.
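A rule-based check of this kind might look like the sketch below. The patterns are illustrative assumptions rather than the real rule set: `HEDGING`, `LEAKS`, `EVIDENCE`, and `gate_review` are hypothetical names, and "evidence" is approximated as a line that looks like a shell command, a REPL prompt, or an HTTP status line.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Illustrative patterns only; a production rule set would be longer and tuned over time.
HEDGING = re.compile(r"\b(probably|should be fine|please check|might work)\b", re.IGNORECASE)
LEAKS = [
    re.compile(r"(?:/home|/opt|/etc|/var)/[\w./-]+"),         # absolute server paths
    re.compile(r"\b[A-Z][A-Z0-9_]*_(?:KEY|TOKEN|SECRET)\b"),  # env-style secret names
]
EVIDENCE = re.compile(r"^(?:\$ |>>> |HTTP/\d)", re.MULTILINE)  # command, REPL, or response line


@dataclass
class GateResult:
    passed: bool
    reasons: List[str] = field(default_factory=list)


def gate_review(report: str, require_evidence: bool = True) -> GateResult:
    """Fail on hedging or leaked internals, and optionally on missing evidence."""
    reasons = []
    match = HEDGING.search(report)
    if match:
        reasons.append(f"hedging: {match.group(0)!r}")
    for pattern in LEAKS:
        match = pattern.search(report)
        if match:
            reasons.append(f"possible leak: {match.group(0)!r}")
    if require_evidence and not EVIDENCE.search(report):
        reasons.append("no evidence attached")
    return GateResult(passed=not reasons, reasons=reasons)
```

Returning the reasons alongside the verdict matters, because the bounce message sent back to the Agent needs to say what to fix.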

QA Layer — Real Fact Verification

Passing Gate doesn't mean the article is correct. The QA layer uses Gemini's built-in search to verify every factual claim in the article (version numbers, dates, references, statistics) against the web. Anything that doesn't match is sent back for a minor fix.

TA Layer — Target Audience Perspective Review

Passing QA doesn't mean the piece is "worth reading". TA (Target Audience) review has another Agent play the target reader — for example, if this article's target audience is indie developers, TA Agent reads from that perspective: finished reading, is there something actionable? Want to click the next one? Are technical terms too heavy? Is emotional resonance there? TA fails it, send back.

All three layers combined let me trust the results directly, without needing to fact-check every piece myself.

Note: The three-layer review described here is the basic version, different pipelines have their own custom settings — for example, trading pipeline adds a risk control Gate (single trade >2% risk gets BLOCKED), product launch pipeline adds a copyright and brand consistency Gate, Blog pipeline adds a lead generation rhythm Gate. The basic skeleton is the same, details grow according to scenario.
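As an example of a pipeline-specific Gate, the 2% risk rule can be written as a one-line check. The definition of "risk" here is an assumption (the fraction of account equity lost if the stop-loss is hit), and `risk_fraction` and `risk_gate` are hypothetical names.

```python
def risk_fraction(entry: float, stop: float, quantity: float, equity: float) -> float:
    """Fraction of account equity lost if the stop is hit (assumed definition of risk)."""
    if equity <= 0:
        raise ValueError("equity must be positive")
    return abs(entry - stop) * quantity / equity


def risk_gate(entry: float, stop: float, quantity: float, equity: float,
              max_risk: float = 0.02) -> str:
    """Return BLOCKED when a single trade risks more than max_risk of equity."""
    if risk_fraction(entry, stop, quantity, equity) > max_risk:
        return "BLOCKED"
    return "PASS"
```

For instance, buying 50 units at 100 with a stop at 95 on a 10,000 account risks 250, or 2.5% of equity, so the trade is blocked.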


What We Can Do

Now that this Multi-Agent architecture has stabilized, a lot becomes possible:

  • Auto-write blogs (Chinese, English, Korean trilingual) with SEO/AEO optimization
  • Automatically scan market news daily, summarize key points, push out
  • Run crypto quantitative trading, from auto strategy research, signal review to position management fully automated — system runs backtests, discovers new strategies, throws into Testnet to verify, promotes to real money only when target is hit
  • Auto-process email, important ones translated to Chinese and pushed to me
  • Run social marketing, X posts, Threads, Reddit auto-distributed
  • Conduct market research, from keyword research to competitive analysis to Notion reports

Most representative capability is product development — I mandated that I must see one new product when I wake up every morning. So the system must complete market research → opportunity assessment → product design → copy → launch → marketing in those few hours while I sleep, delivering the finished product the moment I open my eyes. One per day, no exceptions.

I can achieve these not because I used the strongest model, but because each specialty has the right person on the job.


You Can Start From the Smallest Unit Too

Don't aim for seven models from the start. Here's the minimum version you can get running in 90 minutes over a weekend.

Step 1: One Linear board (10 minutes)

Sign up for a free Linear workspace → create a project → go to Settings → API → generate LINEAR_API_KEY and save it in [REDACTED].

Minimum config:

  • Three labels: role:executor, role:qa, status:rejected
  • Three statuses: Todo, In Progress, Done (Linear default works)

Step 2: Two crons (20 minutes)

Cron A — Dispatcher (polls Linear every 5 minutes)

# dispatcher.py pseudocode
new_cards = linear.list_issues(status="Todo", missing_label=["role:*"])
for card in new_cards:
    role = classify(card.title)        # keyword match or call an LLM
    linear.add_label(card.id, f"role:{role}")
    linear.set_status(card.id, "In Progress")
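
The `classify` step in the pseudocode above can start as plain keyword matching before any LLM is involved. A minimal sketch, assuming the two roles from the minimum config; the keyword table is a hypothetical starting point.

```python
import re

# Hypothetical keyword table; anything that matches no keyword goes to the executor.
ROLE_KEYWORDS = {
    "qa": ("verify", "fact-check", "review", "check"),
}
DEFAULT_ROLE = "executor"


def classify(title: str) -> str:
    """Return the first role whose keyword appears as a whole word in the title."""
    words = set(re.findall(r"[a-z][a-z-]*", title.lower()))
    for role, keywords in ROLE_KEYWORDS.items():
        if words & set(keywords):
            return role
    return DEFAULT_ROLE
```

Matching whole words avoids false hits such as "preview" triggering the "review" keyword. New roles only need a new row in the table.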


[REDACTED]:

*/5 * * * * cd /opt/agent && python dispatcher.py >> logs/dispatch.log 2>&1


Cron B — Executor (fires the matching agent every 5 minutes)

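# executor.py pseudocode
# Skip cards that already carry needs-qa; without this filter every
# 5-minute run would redo cards that stay In Progress while waiting for QA.
cards = linear.list_issues(status="In Progress", label="role:executor",
                           missing_label=["needs-qa"])
for card in cards:
    prompt = f"Task: {card.title}\nSpec: {card.description}"
    result = run_cli("claude-code", prompt)   # or codex / aider / gemini cli
    linear.add_comment(card.id, result)       # post the output where QA can read it
    linear.add_label(card.id, "needs-qa")     # hand the card over to QA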


*/5 * * * * cd /opt/agent && python executor.py >> logs/exec.log 2>&1


Step 3: Three role prompts (30 minutes, copy-paste ready)

Each role = one system prompt + one trigger condition.

Supervisor (Opus or Sonnet) — breaks requirements into executable cards:

You are a PM. Read user requirements and output a Linear card as JSON:
{"title": "...", "description": "what specifically to do", "acceptance": "pass criteria"}
Acceptance criteria MUST be verifiable. No fluff like "do it well".
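
Since the supervisor replies in JSON, its output can be validated before a card is created. The sketch below is one possible check under stated assumptions: `parse_card` and the `VAGUE_PHRASES` list are hypothetical, and only the three field names come from the prompt above.

```python
import json

REQUIRED_FIELDS = ("title", "description", "acceptance")
# Assumed examples of unverifiable phrasing; extend to taste.
VAGUE_PHRASES = ("do it well", "high quality", "as good as possible")


def parse_card(raw: str) -> dict:
    """Parse the supervisor's JSON reply and reject incomplete or fluffy cards."""
    card = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(card, dict):
        raise ValueError("card must be a JSON object")
    missing = [f for f in REQUIRED_FIELDS if not str(card.get(f, "")).strip()]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    lowered = str(card["acceptance"]).lower()
    vague = [p for p in VAGUE_PHRASES if p in lowered]
    if vague:
        raise ValueError(f"unverifiable acceptance criteria: {vague}")
    return card
```

A rejected reply can simply be sent back to the supervisor with the error message attached.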


Executor (any cheap model — MiniMax / Codex / Haiku all work) — actually does the work:

You are the executor. Read the task, do it, output the result.
Iron rule: NEVER output hedging like "probably", "should", "please check manually".
If unsure, write "I don't know, reason: ..." instead.


QA (Gemini CLI or any tool with built-in web search) — fact-check + block vagueness:

You are QA. Read the executor's output. Three checks:
1. Are the facts correct? (use web search to verify at least 1 data point)
2. Any vague language? (grep "probably|should|please check|可能|應該")
3. Does it meet the acceptance criteria?
Any check fails → output FAIL + specific reason; all pass → output PASS.


Step 4: Run the first card (20 minutes)

In Linear, manually create:

  • Title: Write a 100-word intro to Claude Code
  • Description: For non-technical readers, must be factual, no hedging, must cover "what it is" and "what it does"

Expected within 10 minutes:

  1. Cron A picks it up → adds role:executor → moves to In Progress
  2. Cron B picks it up → calls the executor → comments the result → adds needs-qa
  3. QA cron (or chained after Cron B) → reads the comment → web-searches facts + greps hedging → comments PASS/FAIL

PASS → move to Done. FAIL → bounce back to Todo + status:rejected label + comment with the reason.
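
The PASS/FAIL rule above is a small state transition, sketched here as a pure function. `CardUpdate` and `apply_verdict` are hypothetical names, and treating any unclear reply as FAIL is an assumption chosen to fail safe.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class CardUpdate:
    status: str
    add_labels: List[str] = field(default_factory=list)
    comment: str = ""


def apply_verdict(qa_output: str) -> CardUpdate:
    """Map the QA reply to the next card state; anything that is not PASS counts as FAIL."""
    text = qa_output.strip()
    if re.match(r"PASS\b", text):
        return CardUpdate(status="Done")
    return CardUpdate(
        status="Todo",
        add_labels=["status:rejected"],
        comment=text or "QA returned no verdict",
    )
```

Keeping this logic free of API calls makes it easy to test. The cron script then only has to apply the returned `CardUpdate` to the card.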

Step 5: Expand once it works

Once the first card flows end-to-end, layer on:

  • Second executor model — different labels route to different CLIs (lang:code → Codex, lang:writing → MiniMax)
  • Bounce mechanism — same card hits FAIL 3 times → auto-Blocked + Telegram alert
  • More roles — marketing, trading, data each get one system prompt + one label rule
  • Notification layer — Telegram bot pushes done cards to you, so you don't sit in Linear

Don't try to build everything at once. Get one card running first, the rest is expansion.


Closing: 1.0 is multi-instance, 2.0 is multi-model multi-role

Recently OpenAI open-sourced something called Symphony that does similar things: Linear as the control console, one Codex agent per card, and agents that pick up cards and do the work themselves. Internally, OpenAI reported PR growth of 500%.

After reading, I have one thought: They only solved half the problem.

Symphony's setup is "same model, many instances": all Codex. That can solve mass-produced coding, but it can't solve the "different tasks need different specialties" problem. Coding, writing, fact-checking, trading, and marketing are five fundamentally different things and shouldn't be done by the same model.

My Multi-Agent architecture is closer to how real companies operate: multiple roles, multiple models, each doing its part. A supervisor, an engineer, a writer, a QA reviewer, a marketer, a trader, a data analyst, like a real small team.

Multi-instance homogeneous agents are 1.0. A multi-role, multi-model Multi-Agent architecture is 2.0.

If you're also trying to make AI truly work on its own, I hope this helps you skip my detours. Start from one Linear card, three roles, grow slowly.

Side note, the prototype of this Multi-Agent architecture was open-sourced just over two months ago — a repo called ai-night-shift, mainly solving the "let Agents keep working while I sleep at night" problem.

What was already running back then:

  • Files as the communication protocol: bot_inbox/ + night_chat.md dual channels, no message queue dependency
  • Adapter abstraction layer — one script supports Claude Code, Codex CLI, Aider, custom CLI
  • Autonomy Rules to prevent interactive deadlocks — every prompt template carries an anti-interactive block
  • Atomic PID lock — uses mkdir instead of files, prevents TOCTOU race conditions
  • Plugin system — pre/task/post phases, pluggable
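
The mkdir-based lock mentioned in the list works because creating a directory either succeeds or fails in one step, so there is no gap between checking and creating. The sketch below shows the general technique in Python and is not taken from the repo; `dir_lock` and the `pid` file inside the lock directory are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def dir_lock(path: str):
    """Hold a lock for the duration of the with-block by owning a directory."""
    try:
        os.mkdir(path)  # check-and-create in one step; fails if the lock is held
    except FileExistsError:
        raise RuntimeError(f"another run holds {path}") from None
    pid_file = os.path.join(path, "pid")
    try:
        with open(pid_file, "w") as fh:
            fh.write(str(os.getpid()))  # helps a human debug a stale lock
        yield
    finally:
        if os.path.exists(pid_file):
            os.remove(pid_file)
        os.rmdir(path)  # release the lock
```

Wrapping a cron-driven script's main body in `with dir_lock(...)` keeps a slow run from overlapping with the next 5-minute tick. A crashed process can leave the directory behind, so stale-lock handling would still be needed.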

What this article covers that grew later:

  • Linear as control tower — upgraded from local bot_inbox/ to cloud board, cross-machine cross-agent
  • Multi-model division of labor — went from "one night-shift model" to "seven models each on their own beat"
  • Gate-9 blocking vague language — only added as a post-hoc check layer after hundreds of fake completions burned me
  • Rejection + second-pass QA — Agent saying PASS doesn't count, J reruns + Moon QA must clear it

Those interested can start from that repo, it's the minimal viable ancestor of the architecture in this article, MIT licensed, free to fork. Read it first to grasp the "file communication + autonomous execution" skeleton, then come back to see how it grew into a 24/7 multi-model team.




This Multi-Agent architecture is still iterating. If you're working on something similar, or have questions about any section, feel free to comment.

Want first-look access to architecture iterations, lessons learned the hard way, and product development notes? Subscribe to the Judy AI Lab newsletter — every week I send out what this week's Multi-Agent team built, plus my own thinking, straight to your inbox. You'll also get the AI Team Architecture Guide PDF we use daily.


Originally published at Judy AI Lab. Visit for more articles on AI engineering and development.