AutoML for Agent Fleets, Without the Vendor Bill
Matthias | S · 2026-05-16 · via DEV Community

Last night I shipped AutoML to a 10-agent fleet in a single session. The added monthly cost was zero euros. Not because we found a discount, but because the math at the heart of agent routing does not need an LLM call.

The fleet runs every other Sunday and writes 10 to 15 page reports for a real customer who pays for the service. Until yesterday, all nine worker agents ran every single time, even when only four or five of them really had something to say about that particular customer. The math layer I added watches how well each worker actually performs, learns which workers are pulling their weight for which customer profile, and in a few weeks will be ready to route only the four to six that earn the spot. The bill stays the same. The throughput goes up.

I am writing this down because the pattern is dead simple, transferable to almost any multi-agent setup, and almost nobody outside academic circles talks about how cheap it really is.

The Setup Nobody Else Has

We run a service called StudioMeyer Agents. Ten specialized agents work on one customer at a time: nine workers and a master agent that stitches their findings into a single coherent report. Four agents check website-side signals (visibility, traffic, competitors, technical SEO). Three check AI visibility (LLM citations, brand mentions, cited sources). Two check industry trends. The last one, the master synthesizer, reads all nine reports and writes the customer-facing version.

For our pilot customer, an anti-luxury real-estate agency on Mallorca, the master fires roughly every other Sunday. For StudioMeyer's own site, every other Sunday too, on a different slot. Each run consumes a fair chunk of Anthropic Max-Plan tokens. Each run also produces about 40 to 80 KB of structured worker reports plus the customer-facing markdown.

Here is the question nobody had asked yet: which of those nine workers are actually contributing? Some weeks the SEO-technical agent has nothing to say because nothing changed on the technical layer. Some weeks the AI-visibility agent finds twelve new citations and the master ends up building half its report around those. Different customer types pull on different agents. A tourism client probably benefits more from the visibility and the local-search agents. A B2B SaaS client probably pulls harder on the citation-source and competitor agents.

The fleet has been live since Phase D, mid-May 2026. It works. But we were leaving signal on the floor by treating all nine agents as equally relevant for every customer.

Why AutoML Usually Means a Vendor Bill

If you tell most engineers "we should add AutoML to our agent fleet," they hear "let's pay DataRobot, SageMaker Autopilot, or Vertex AI for the privilege." That is a real solution for a different problem. None of those platforms is cheap, and none of them was built for the question "which subset of my LLM agents should I run on customer X this Tuesday."

The other instinct is "let the LLM decide." Build a meta-agent whose job is to read each customer's profile, decide which sub-agents to fire, and dispatch them. That works. It also means every single routing decision is now an LLM call, with its latency, its token budget, and its hallucination surface area.

There is a third option, and it has been the production standard for routing problems in adtech and recommender systems since the early 2010s. It just took until AAAI 2026 for somebody to put a tutorial together explicitly applying it to LLM agent routing. IBM Research presented two of them this January: "Bandits, LLMs, and Agentic AI" and "Multi-Armed Bandits Meet Large Language Models". The vLLM Semantic Router team made the same point in their April 2026 vision paper, recommending "multi-armed bandits to route queries by context-aware features."

The pattern is older than the LLM era. The multi-armed bandit problem assumes you have a fixed number of options (slot machines, ad creatives, content blocks, or in our case worker agents) and a finite budget of trials. You want to learn which options pay off and exploit them, while still occasionally trying the others to make sure your beliefs are not outdated. Production code does it in dozens of lines.

The AdaptOrch benchmark from the Augment Code orchestration guide measured routing overhead at less than 50 milliseconds. Compare that to the 2 to 15 seconds of LLM inference latency per agent call. The math layer is essentially free.

Twelve Lines of Math

Here is the formula I shipped. It is Bayesian additive smoothing, also known as Laplace smoothing or the Beta-Binomial conjugate-prior estimate, depending on which Wikipedia article you land on first. The additive smoothing page has the cleanest version:

export function bayesianMean(
  observed: Array<number | null>,
  priorMean: number,
  priorWeight: number,
): number {
  const valid = observed.filter(
    (x): x is number => x !== null && Number.isFinite(x),
  );
  if (valid.length === 0) return priorMean;
  const sum = valid.reduce((acc, x) => acc + x, 0);
  return (priorWeight * priorMean + sum) / (priorWeight + valid.length);
}


That is the entire ranking core. The intuition: you do not start from "I have no data, so I cannot rank." You start from a prior belief, expressed as a mean and a pseudo-sample-count. With priorMean = 0.6 and priorWeight = 5, the prior says: "I think each worker is decently good (0.6 on a 0 to 1 scale), and I am as confident in that as if I had observed five samples already."

When the first real sample arrives, it gets averaged in with the five pseudo-samples. The estimate moves, but not violently. After five real observations the prior has exactly as much weight as the data. After twenty real observations, the prior carries only a fifth of the total weight and the actual measurements dominate.
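To make the shrinkage concrete, here is a small worked example. It restates the estimator so the snippet runs on its own, and it assumes a hypothetical worker that reports 0.9 on every run; the 0.6 prior mean and the prior weight of 5 are the values from the text.

```typescript
// Same estimator as above, restated so this snippet is self-contained.
function bayesianMean(
  observed: Array<number | null>,
  priorMean: number,
  priorWeight: number,
): number {
  const valid = observed.filter(
    (x): x is number => x !== null && Number.isFinite(x),
  );
  if (valid.length === 0) return priorMean;
  const sum = valid.reduce((acc, x) => acc + x, 0);
  return (priorWeight * priorMean + sum) / (priorWeight + valid.length);
}

const PRIOR_MEAN = 0.6;
const PRIOR_WEIGHT = 5;

// Hypothetical worker that scores 0.9 on every run.
for (const n of [0, 1, 5, 20]) {
  const samples: number[] = Array.from({ length: n }, () => 0.9);
  const estimate = bayesianMean(samples, PRIOR_MEAN, PRIOR_WEIGHT);
  // Share of the total weight that still belongs to the prior.
  const priorShare = PRIOR_WEIGHT / (PRIOR_WEIGHT + n);
  console.log(
    `n=${n} estimate=${estimate.toFixed(3)} priorShare=${priorShare.toFixed(2)}`,
  );
}
// n=0 estimate=0.600 priorShare=1.00
// n=1 estimate=0.650 priorShare=0.83
// n=5 estimate=0.750 priorShare=0.50
// n=20 estimate=0.840 priorShare=0.20
```

The estimate creeps from 0.6 toward 0.9 as samples accumulate, and a single lucky or unlucky run moves it by only a few hundredths.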

What does each worker get scored on? In our case three signals, all extracted from the worker's own report:

  • Verify-confidence: a 0 to 1 score the worker assigns to itself in a "Verify-Confidence" block at the end of every report. We made it mandatory in Session 1068 as part of the anti-hallucination layer. Now it is the primary input to the ranking layer.
  • Source citation count: how many tool calls and external sources the worker cited in its "Datenquellen" (data sources) block. A high number means evidence-backed work. A low number means the worker leaned on its training data.
  • Domain-lock pass rate: a yes/no per run. Did the worker stay on the customer's actual domain or did it drift to staging subdomains or competitor sites?

The composite score is a weighted sum:

rankScore =
  smoothedConfidence * 0.5 +
  normalize(smoothedSourceDensity) * 0.3 +
  domainLockPassRate * 0.2;


50 percent on the worker's own confidence claim, 30 percent on evidence density, 20 percent on hygiene. Three knobs you can tune later when you have enough data to argue about the right ratio. None of those three signals required a new piece of infrastructure. They were already in every worker report, written by the agents themselves, for the anti-hallucination guard. The ranking layer just reads them.
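The weighted sum can be sketched as a standalone function. The weights are the ones from the text. The shape of `WorkerSignals`, the `normalizeDensity` helper, and the ceiling of 20 sources are assumptions for illustration, since the article does not show how `normalize` is defined.

```typescript
// Hypothetical shape of the smoothed per-worker signals.
interface WorkerSignals {
  smoothedConfidence: number; // 0..1, smoothed self-reported confidence
  smoothedSourceDensity: number; // smoothed count of cited sources per report
  domainLockPassRate: number; // 0..1, share of runs that stayed on-domain
}

// Assumption: cap the source count at a ceiling and map it linearly to 0..1.
const SOURCE_DENSITY_CEILING = 20;

function normalizeDensity(density: number): number {
  if (!Number.isFinite(density) || density <= 0) return 0;
  return Math.min(density, SOURCE_DENSITY_CEILING) / SOURCE_DENSITY_CEILING;
}

// The 50/30/20 split from the text, kept in one place so it is easy to tune.
const WEIGHTS = { confidence: 0.5, sources: 0.3, domainLock: 0.2 } as const;

function rankScore(s: WorkerSignals): number {
  return (
    s.smoothedConfidence * WEIGHTS.confidence +
    normalizeDensity(s.smoothedSourceDensity) * WEIGHTS.sources +
    s.domainLockPassRate * WEIGHTS.domainLock
  );
}

const exampleScore = rankScore({
  smoothedConfidence: 0.8,
  smoothedSourceDensity: 10,
  domainLockPassRate: 1,
});
console.log(exampleScore.toFixed(2)); // 0.40 + 0.15 + 0.20 = 0.75
```

Because every term is already in the 0 to 1 range and the weights sum to 1, the composite stays in the 0 to 1 range as well, which keeps scores comparable across workers.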

Cold Start Is the Actual Problem

Most multi-armed bandit tutorials lead with exploration versus exploitation. The classic dilemma: should you keep playing the slot machine that has paid the best so far, or try the one you have not pulled in a while?

In production, that is not the hard problem. The hard problem is what to do on day one when you have zero data, or day three when you have data on three of nine workers and nothing on the rest.

Facebook's Reels team solved this in 2023 by using Thompson Sampling with posterior samples for content cold-start, drawing from the posterior distribution rather than a point estimate so brand-new content still had a fair shot. The 2026 papers on LLM-augmented bandits go further: they let an LLM predict the missing observations and feed them into the bandit as pseudo-data, weighted by how well the LLM's predictions have matched reality so far.

I considered both. For now I shipped something simpler: a hard cold-start guard. If the total number of observed worker runs is below three, the recommendation function just returns "all nine workers, exploration phase." No routing decision is made on a dataset that small. After three runs we have nine workers times three samples plus the prior, which is enough signal to make a soft recommendation. After ten to twenty runs, the observations outweigh the five-sample prior by a factor of two to four.

if (totalRunsObserved < MIN_SAMPLES_FOR_RECOMMENDATION) {
  return {
    coldStart: true,
    recommendedWorkers: rankings.map((r) => r.agentKuerzel), // all 9
    ...
  };
}


This is a deliberate trade-off. A more sophisticated bandit, like LinUCB or Thompson Sampling, would make a soft recommendation even on day one. But a soft recommendation on day one is exactly the kind of thing that bites you in week three when you realize the system has been disproportionately favoring the agent who got lucky in its first run. I would rather pay for nine full runs through the cold-start window and ship a confident routing decision in week six than ship a wobbly one immediately.
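A self-contained sketch of the guard, for readers who want to try it: only `agentKuerzel`, `coldStart`, `recommendedWorkers`, and `MIN_SAMPLES_FOR_RECOMMENDATION` come from the snippet above. The types, the warm-path top-N cut, and the default of six are assumptions, not the shipped implementation.

```typescript
// Hypothetical types; the real ranking objects carry more fields.
interface WorkerRanking {
  agentKuerzel: string;
  rankScore: number;
}

interface Recommendation {
  coldStart: boolean;
  recommendedWorkers: string[];
}

const MIN_SAMPLES_FOR_RECOMMENDATION = 3; // "below three" in the text
const DEFAULT_TOP_N = 6; // assumption for illustration only

function recommendWorkers(
  rankings: WorkerRanking[],
  totalRunsObserved: number,
  topN: number = DEFAULT_TOP_N,
): Recommendation {
  // Cold start: too little data, so every worker runs (exploration phase).
  if (totalRunsObserved < MIN_SAMPLES_FOR_RECOMMENDATION) {
    return {
      coldStart: true,
      recommendedWorkers: rankings.map((r) => r.agentKuerzel),
    };
  }
  // Warm path: sort a copy by score, highest first, and keep the top N.
  const sorted = [...rankings].sort((a, b) => b.rankScore - a.rankScore);
  return {
    coldStart: false,
    recommendedWorkers: sorted.slice(0, topN).map((r) => r.agentKuerzel),
  };
}
```

The guard is a plain early return, so it adds no state of its own: the only input it needs is the run counter that the ranking layer already keeps.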

Closure-Locked Tools, or Why Tenant Isolation Costs You Nothing Here Either

The master synthesizer needs to actually call this. We did that with two inline tools: track_worker_performance and get_worker_ranking. Both are registered on the master agent at startup.

The Customer-Slug Closure pattern is worth a paragraph because it is the kind of thing that bites you the day you onboard customer number two. Here is the relevant signature:

export function buildTrackPerformanceInlineTool(
  customerSlug: string,
  agentResolver: (kuerzel: string) => SmaAgentDef | undefined,
  options: { dryRun?: boolean } = {},
): SdkMcpToolDefinition {
  return {
    name: "track_worker_performance",
    description: `... Customer-Slug is locked to "${customerSlug}". ...`,
    handler: async (args) => {
      // customerSlug is captured by closure, NOT a tool argument
      const metrics = buildMetricsFromReport({ customerSlug, ... });
      return await recordWorkerPerformance(metrics);
    },
  };
}


The LLM never sees the customer slug as a parameter it can write. The slug is baked into the tool at build time. Even if the master synthesizer hallucinates "actually let me also track this report for the other customer" mid-run, there is no parameter for it to pass and no path that could route the write to anyone else's bucket. This is the same isolation pattern we use for the analytics-sources inline tool we shipped in Session 1069, and it has not let us down once across about 30 master runs.
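The closure lock can be reproduced in a few lines without any SDK. In this sketch the `InlineTool` interface, the in-memory `performanceLog`, and `buildTrackTool` are hypothetical stand-ins for the real tool definition and database layer; only the tool name comes from the article.

```typescript
// Hypothetical minimal tool shape; the real SDK type is not shown here.
interface InlineTool {
  name: string;
  handler: (args: { agentKuerzel: string; report: string }) => Promise<string>;
}

// Hypothetical in-memory store standing in for the database layer.
const performanceLog: Array<{ customerSlug: string; agentKuerzel: string }> = [];

function buildTrackTool(customerSlug: string): InlineTool {
  return {
    name: "track_worker_performance",
    handler: async (args) => {
      // customerSlug comes from the enclosing scope, never from args.
      // Any slug-like field the caller adds to args is simply never read.
      performanceLog.push({ customerSlug, agentKuerzel: args.agentKuerzel });
      return `recorded ${args.agentKuerzel} for ${customerSlug}`;
    },
  };
}
```

Building one tool per customer, for example `buildTrackTool("customer-a")`, gives each master run a handler that can only ever write to its own bucket, whatever arguments the model produces.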

For defense in depth, the database layer also validates the slug format itself, in case somebody later builds a script that calls the library directly and accidentally hands it a path-traversal-like value. Our Code Critic agent caught that one and made me add it during the same session.
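A minimal sketch of what such a format check might look like. The allowed alphabet (lowercase letters, digits, hyphens) and the 64-character limit are assumptions; the article does not state the actual rule.

```typescript
// Assumption: a slug is 1 to 64 chars of lowercase letters, digits, and
// hyphens, and it neither starts nor ends with a hyphen.
const SLUG_PATTERN = /^[a-z0-9](?:[a-z0-9-]{0,62}[a-z0-9])?$/;

function assertValidCustomerSlug(slug: string): string {
  if (!SLUG_PATTERN.test(slug)) {
    // JSON.stringify makes odd characters visible in the error message.
    throw new Error(`Invalid customer slug: ${JSON.stringify(slug)}`);
  }
  return slug;
}
```

An allow-list like this rejects dots and slashes outright, so a value such as `../other-customer` fails before it reaches any query or file path.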

What Phase 1 Ships, and What Phase 2 Will Ship

Phase 1, the one that went live last night, is informational only. The master collects performance data from every worker report it reads, persists it to a new sma_worker_performance table with a hard 5,000-row cap per customer to keep memory bounded, and offers a ranking view to whoever asks. The actual routing logic, the part that decides "only run smasicht and smakonk for tourism customers," is not yet wired up. The fleet still fires all nine agents every cycle.

That is deliberate. The fleet has run a handful of times in production. We do not have enough data to draw conclusions yet. If I had shipped routing right away, we would now be optimizing against a noise pattern.

Phase 2 is the routing layer. It will live in sma-run-all.ts, the script that fires the cron schedule. It reads the recommendation from the ranking layer and picks the top two of the four website-module agents, all three GEO (AI-visibility) agents, and the top two business (industry-trend) agents, for a default of seven instead of nine. It also respects an anti-stale guard: any worker that has not run in the last 60 days runs anyway, no matter what its current rank is. That keeps exploration alive even after exploitation kicks in.
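Since Phase 2 is not shipped yet, the following is only a sketch of how the selection rule described above could look. The group labels, the `RankedWorker` shape, and the choice to treat a never-run worker as stale are assumptions; the quotas and the 60-day window are taken from the text.

```typescript
type WorkerGroup = "website" | "geo" | "business";

// Hypothetical shape of one ranked worker.
interface RankedWorker {
  agentKuerzel: string;
  group: WorkerGroup;
  rankScore: number;
  lastRunAt: Date | null; // null means the worker has never run
}

const STALE_AFTER_DAYS = 60;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Best n workers of one group, highest score first (filter returns a copy).
function topOfGroup(
  workers: RankedWorker[],
  group: WorkerGroup,
  n: number,
): RankedWorker[] {
  return workers
    .filter((w) => w.group === group)
    .sort((a, b) => b.rankScore - a.rankScore)
    .slice(0, n);
}

function selectWorkers(workers: RankedWorker[], now: Date): string[] {
  const picked = new Set<string>();

  // Quotas from the text: top two website, all GEO, top two business.
  for (const w of topOfGroup(workers, "website", 2)) picked.add(w.agentKuerzel);
  for (const w of workers) if (w.group === "geo") picked.add(w.agentKuerzel);
  for (const w of topOfGroup(workers, "business", 2)) picked.add(w.agentKuerzel);

  // Anti-stale guard: anything idle for more than 60 days runs regardless.
  for (const w of workers) {
    const ageDays =
      w.lastRunAt === null
        ? Infinity // assumption: never-run workers count as stale
        : (now.getTime() - w.lastRunAt.getTime()) / MS_PER_DAY;
    if (ageDays > STALE_AFTER_DAYS) picked.add(w.agentKuerzel);
  }
  return Array.from(picked);
}
```

Passing `now` in as a parameter rather than reading the clock inside the function keeps the rule deterministic and easy to test against a fixed date.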

The token-budget saving from skipping two agents per run, every other week, across two customers: about 20 to 25 percent fewer Anthropic Max-Plan tokens spent on each cycle (two of nine worker runs is roughly 22 percent). Multiplied by the cycle count over a year, that translates into roughly an extra customer's worth of headroom in the same Max-Plan flat rate.

What I Will Be Watching

A few things that could go wrong, and that I want to catch before they do:

The verify-confidence score is self-reported by each worker. Workers might learn to inflate it, the same way employees learn what their performance metric is and game it. In our case the workers do not actually know the score is being used for ranking. The system prompt does not mention it. But the moment we put this into the prompt, that incentive shows up. I will keep the ranking signal sources unmentioned in worker prompts.

The "all nine for cold-start" rule could trap us. If a customer is fundamentally never going to need the AI-visibility agent (because they are a B2B SaaS company with no public-facing brand), the system will keep firing it forever, scoring it low forever, and never quite cross the threshold to drop it. A future refinement is a low-confidence floor: if a worker scores below 0.4 across more than five runs, ask the master to argue for or against dropping it, with the customer profile in context.
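The low-confidence floor is a small predicate. This sketch reads "scores below 0.4 across more than five runs" as "the mean score over more than five runs is below 0.4", which is one possible interpretation; the function name and that reading are assumptions.

```typescript
const LOW_SCORE_FLOOR = 0.4;
const MIN_RUNS_BEFORE_REVIEW = 5;

// True when a worker has more than five scored runs and their mean is below
// the floor, meaning the master should be asked to argue for or against it.
function shouldReviewWorker(runScores: number[]): boolean {
  if (runScores.length <= MIN_RUNS_BEFORE_REVIEW) return false;
  const mean = runScores.reduce((acc, x) => acc + x, 0) / runScores.length;
  return mean < LOW_SCORE_FLOOR;
}
```

The predicate only flags a worker for review. The decision to drop it stays with the master, which has the customer profile in context.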

The 50/30/20 weight split is a guess. After ten Phase 2 cycles we should have enough variance to ask whether that split actually correlates with customer-facing report quality. If not, the weights should move.

The Replicable Part

I keep coming back to this: the math layer is twelve lines, the SQL is one table, the integration is two inline tools. The Phase 1 shipping cost was one focused session. The Phase 2 shipping cost will be similar, mostly because all the data plumbing already exists.

If you run any kind of multi-agent fleet, whether it is a customer-onboarding pipeline, a research squad, a content-generation system, or a code-review orchestrator, the same pattern applies. You probably already have a confidence signal somewhere in your pipeline (eval scores, judge models, retry rates, output lengths, or just a self-reported number). You probably already have a signal for hygiene (did the agent stay on task? did it cite sources? did it write more than 500 characters of actual content?). What you do not have, until you add it, is a record of those signals over time, normalized across customers or queries, and a sub-100ms function that turns the record into a routing decision.

This is what AutoML actually looks like when you are not buying it from a vendor. It looks like a table, a function, and a guard. The "ML" is a 1.96 KB SQL file and a Bayesian estimator that an undergraduate could write. The "Auto" comes from the fact that nobody has to look at the data: the system updates itself every run.

The vendor bill is zero because the LLM is the thing being routed, not the thing doing the routing. The math does not need a model with a billion parameters. It needs a prior, a counter, and a sort.


If you want to see the full implementation, the migration SQL, the inline tool wiring on the master synthesizer, and the test suite covering Bayesian smoothing, extraction logic, and cold-start, the StudioMeyer Agents source is documented at studiomeyer.io/services/agents. Or if you want a similar pattern designed for your own fleet, the same service handles the implementation.