Our AI Inference Bill Dropped 65% After We Stopped Treating Every Query the Same
Karthik S · 2026-05-22 · via DEV Community
Every query hitting our AI layer was going straight to the most powerful model we had. A user asking "what does HIPAA Section 164.312 say?" got the same compute budget as one asking "should we shut down the payment processor during this active incident?" That was expensive and stupid, and it took embarrassingly long to fix.

This is the story of how we built a routing layer called CascadeFlow into SentinelOps AI, an enterprise decision intelligence platform, and what actually happened when we turned it on.

The Problem With "One Model Fits All"

When you're building an AI system for enterprise operations teams—people making real decisions about infrastructure, compliance posture, and incident response—you face a genuine tension. You need the model to be good when it matters. But "good" on a documentation lookup is a different thing from "good" on "we have a potential SOC2 violation, walk me through the remediation path."

Before routing, every query went to our primary reasoning model (Llama 3.3 70B via Groq). The latency was fine. The quality was fine. The cost was not fine. At scale, routing simple factual queries through a 70B parameter model is just burning money.

The naive fix is to have engineers triage queries manually, which doesn't scale. The correct fix is a classifier that does it automatically.

CascadeFlow: A Lightweight Routing Engine

We integrated @cascadeflow/core as our routing middleware. The idea is straightforward: before a query hits the expensive model, a cheap, fast classifier decides which tier it belongs to.

Our routing logic looks roughly like this:

import { CascadeFlow } from '@cascadeflow/core';

const cascade = new CascadeFlow({
  classifier: {
    model: 'llama-3.1-8b-instant', // fast, cheap
    provider: 'groq',
  },
  tiers: [
    {
      name: 'simple',
      model: 'llama-3.1-8b-instant',
      triggers: ['documentation', 'lookup', 'definition', 'what is'],
    },
    {
      name: 'complex',
      model: 'llama-3.3-70b-versatile',
      triggers: ['incident', 'compliance', 'risk', 'critical', 'breach'],
    },
  ],
});

The classifier runs first—it's an 8B model, so it's fast and cheap—and classifies the incoming query into a complexity tier. Simple queries (policy lookups, definition requests, status checks) stay on the 8B model. Complex queries (active incidents, compliance risk assessments, multi-system decisions) escalate to the 70B.

From our LLM service layer, the routing call is transparent:

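import Groq from 'groq-sdk';

// Reads GROQ_API_KEY from the environment by default.
const groq = new Groq();

async function routeAndExecute(query, context) {
  // classify() is assumed to resolve to the tier name: 'simple' or 'complex'.
  const tier = await cascade.classify(query);
  const model = tier === 'complex'
    ? 'llama-3.3-70b-versatile'
    : 'llama-3.1-8b-instant';

  // buildMessages (defined elsewhere) assembles the system and user messages.
  return groq.chat.completions.create({
    model,
    messages: buildMessages(query, context),
    response_format: { type: 'json_object' }, // JSON mode: the reply must parse as JSON
  });
}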

That response_format: json_object constraint is important—we'll come back to it.

What Routing Actually Costs You

There's a hidden cost to routing that nobody talks about: the classifier itself can be wrong.

In our early testing, the 8B classifier was misrouting about 12% of complex queries down to the cheap tier. A question like "is our current encryption at rest sufficient for PHI storage?" looks superficially like a documentation query. The classifier saw "encryption" and "PHI" as lookup-adjacent terms and routed it to the cheap model, which gave a technically accurate but shallow answer that lacked the risk-weighted framing an auditor would need.
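One way to put a number like that 12% on a classifier is to replay a hand-labeled sample of queries through it and count how many complex queries land on the cheap tier. A minimal sketch, assuming a hypothetical array of { expected, routed } records; neither the record shape nor the function name comes from CascadeFlow:

```javascript
// Hypothetical helper: share of truly complex queries that were routed to the cheap tier.
function complexMisrouteRate(samples) {
  const complex = samples.filter(s => s.expected === 'complex');
  if (complex.length === 0) return 0; // nothing that could be misrouted
  const missed = complex.filter(s => s.routed === 'simple').length;
  return missed / complex.length;
}

// Illustrative labels only, not real traffic.
const samples = [
  { expected: 'complex', routed: 'complex' },
  { expected: 'complex', routed: 'simple' },
  { expected: 'simple', routed: 'simple' },
  { expected: 'complex', routed: 'complex' },
  { expected: 'complex', routed: 'complex' },
];

console.log(complexMisrouteRate(samples)); // 0.25
```

Only the complex-labeled rows count toward the denominator, because the failure being measured is a complex query getting a shallow answer.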

We fixed this in two ways:

  1. Conservative misclassification bias. When the classifier's confidence is below a threshold, escalate to the expensive tier. False positives (routing simple queries high) cost money. False negatives (routing complex queries low) cost credibility. In an enterprise governance context, credibility is more expensive.

  2. Domain keyword pre-checks. Before the classifier even runs, we scan for a hardcoded list of high-stakes terms. If a query contains words like breach, PHI, incident, remediation, or SOC2, it goes to the 70B model unconditionally.

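const HIGH_STAKES_KEYWORDS = [
  'breach', 'incident', 'PHI', 'PII', 'SOC2', 'HIPAA',
  'remediation', 'critical', 'violation', 'audit', 'penalty'
].map(kw => kw.toLowerCase()); // the query is lowercased below, so the keywords must be too

function requiresComplexModel(query) {
  const lower = query.toLowerCase();
  // Substring match: 'audit' also matches 'auditor', which errs toward escalation.
  return HIGH_STAKES_KEYWORDS.some(kw => lower.includes(kw));
}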

This is not elegant, but it's safe. The performance overhead is one .includes() check per keyword on each query, which is negligible next to a model call.
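Put together, the two fixes amount to a small decision function that sits in front of the model call. The sketch below is an illustration under stated assumptions: the { tier, confidence } classification shape, the pickTier name, and the 0.8 threshold are hypothetical and not part of the CascadeFlow API, and the keyword list is abbreviated.

```javascript
// Abbreviated, lowercased copy of the high-stakes list.
const HIGH_STAKES = ['breach', 'incident', 'phi', 'soc2', 'hipaa', 'remediation'];

// Hypothetical threshold; tune it against labeled traffic.
const CONFIDENCE_THRESHOLD = 0.8;

// classification: { tier: 'simple' | 'complex', confidence: number in [0, 1] } (assumed shape)
function pickTier(query, classification, threshold = CONFIDENCE_THRESHOLD) {
  const lower = query.toLowerCase();
  // 1. Keyword gate: high-stakes terms escalate no matter what the classifier says.
  if (HIGH_STAKES.some(kw => lower.includes(kw))) return 'complex';
  // 2. Conservative bias: a low-confidence verdict escalates.
  if (classification.confidence < threshold) return 'complex';
  // 3. Otherwise trust the classifier.
  return classification.tier;
}

console.log(pickTier('What is a runbook?', { tier: 'simple', confidence: 0.95 }));
// simple
console.log(pickTier('Is our encryption at rest sufficient for PHI storage?', { tier: 'simple', confidence: 0.9 }));
// complex (keyword gate)
console.log(pickTier('Can we skip the change window?', { tier: 'simple', confidence: 0.55 }));
// complex (low confidence)
```

In a real pipeline the keyword check can run before the classifier is called at all, which also saves the classifier call for those queries.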

The Numbers

After deploying CascadeFlow routing against a realistic mix of enterprise queries, roughly 68% of queries fell into the "simple" tier. The remaining 32% were genuinely complex—incident-related, compliance-heavy, or multi-system risk assessments that benefited from the more capable model.

That routing split—combined with the price difference between an 8B and 70B parameter model—accounts for most of the cost reduction. The exact figure depends on your query distribution and your provider's pricing, but 60-65% is a reasonable estimate for an enterprise operational workload where most interactions are informational rather than analytical.
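The arithmetic behind that estimate is simple enough to sanity-check yourself. The sketch below uses made-up relative prices (the small model at 6% of the large model's per-query cost, and a classifier pass at 1%); these are assumptions for illustration, not any provider's actual pricing, so plug in your own numbers.

```javascript
// Fraction of spend saved versus sending every query to the large model.
// All prices are relative per-query costs; classifierPrice is paid on every query.
function routingSavings({ simpleShare, smallPrice, largePrice, classifierPrice = 0 }) {
  const blended =
    classifierPrice +
    simpleShare * smallPrice +
    (1 - simpleShare) * largePrice;
  return 1 - blended / largePrice;
}

// Hypothetical prices: large = 1.0, small = 0.06, classifier pass = 0.01.
const savings = routingSavings({
  simpleShare: 0.68,
  smallPrice: 0.06,
  largePrice: 1.0,
  classifierPrice: 0.01,
});

console.log(`${(savings * 100).toFixed(1)}% saved`); // 62.9% saved
```

The simple-tier share is the ceiling: with 68% of queries on the cheap tier, savings can approach 68% but never exceed it, and the classifier pass and the small model's own cost both eat into that.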

Forcing Structure Out of Both Models

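One consequence of routing to two different models is that you now have two sources of unstructured text to deal with. JSON mode (the response_format setting shown earlier) guarantees parseable JSON but not any particular set of fields, so we solved this by enforcing a strict JSON response schema at the prompt level, regardless of which model is running.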

Every response from SentinelOps AI conforms to this shape:

{
  "summary": "One-sentence decision summary",
  "risk_level": "LOW | MEDIUM | HIGH | CRITICAL",
  "confidence": 0.87,
  "recommendation": "Specific, actionable recommendation",
  "tradeoffs": ["Tradeoff A", "Tradeoff B"],
  "governance_flags": [],
  "citations": []
}

The frontend renders this as a Decision Card—not a chat bubble. Risk level gets a color-coded badge. Confidence is displayed as a progress bar. Tradeoffs are rendered as a checklist. Governance flags trigger a separate UI element that routes to the compliance dashboard.
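Because the Decision Card depends on every field being present, it helps to check the shape before rendering. A dependency-free sketch of such a check follows; the function name is hypothetical, and the 0 to 1 range for confidence is an assumption inferred from the 0.87 example rather than a documented rule.

```javascript
const RISK_LEVELS = ['LOW', 'MEDIUM', 'HIGH', 'CRITICAL'];

// Returns a list of problems; an empty list means the card is safe to render.
function validateDecision(raw) {
  let d;
  try {
    d = JSON.parse(raw);
  } catch {
    return ['not valid JSON'];
  }
  if (d === null || typeof d !== 'object' || Array.isArray(d)) {
    return ['not a JSON object'];
  }
  const errors = [];
  for (const key of ['summary', 'recommendation']) {
    if (typeof d[key] !== 'string' || d[key].length === 0) {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  if (!RISK_LEVELS.includes(d.risk_level)) {
    errors.push('risk_level must be LOW, MEDIUM, HIGH, or CRITICAL');
  }
  // Assumed range: the example value 0.87 suggests a 0..1 score.
  if (typeof d.confidence !== 'number' || d.confidence < 0 || d.confidence > 1) {
    errors.push('confidence must be a number between 0 and 1');
  }
  for (const key of ['tradeoffs', 'governance_flags', 'citations']) {
    if (!Array.isArray(d[key])) errors.push(`${key} must be an array`);
  }
  return errors;
}

const good = JSON.stringify({
  summary: 'Keep the processor online',
  risk_level: 'HIGH',
  confidence: 0.87,
  recommendation: 'Isolate the affected service first',
  tradeoffs: ['Slower containment'],
  governance_flags: [],
  citations: [],
});

console.log(validateDecision(good)); // []
console.log(validateDecision('not json')); // [ 'not valid JSON' ]
```

A response that fails the check can be retried on the same tier or escalated, instead of reaching the UI half-formed.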

When you force both the cheap and expensive model into the same output schema, the quality difference between tiers becomes measurable. You can compare confidence scores, count governance_flags, and track whether the 8B model's recommendations match the 70B model's on borderline queries. This becomes a feedback loop for improving your routing thresholds over time.
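That comparison can start as something very small. A sketch, assuming the same borderline queries have been replayed through both tiers and parsed into the schema above; the pair structure and function name are hypothetical, and matching on risk_level is used here as a simple proxy for whether the recommendations agree.

```javascript
// Share of query pairs where the 8B and 70B responses assign the same risk_level.
function riskAgreement(pairs) {
  if (pairs.length === 0) return null; // no data, no rate
  const same = pairs.filter(p => p.small.risk_level === p.large.risk_level).length;
  return same / pairs.length;
}

// Illustrative pairs only.
const pairs = [
  { small: { risk_level: 'LOW' }, large: { risk_level: 'LOW' } },
  { small: { risk_level: 'MEDIUM' }, large: { risk_level: 'HIGH' } },
  { small: { risk_level: 'HIGH' }, large: { risk_level: 'HIGH' } },
  { small: { risk_level: 'LOW' }, large: { risk_level: 'MEDIUM' } },
];

console.log(riskAgreement(pairs)); // 0.5
```

A low agreement rate on a class of queries is a signal to escalate that class, either by adding keywords to the pre-check or by raising the confidence threshold.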

Lessons

1. Start with keyword gating, not just ML classification. A simple list of high-stakes terms as a pre-filter saved us from the worst misrouting failures. ML classifiers are probabilistic. Safety-critical routing decisions shouldn't be.

2. Misrouting costs are asymmetric. Routing a simple query to a powerful model costs you money. Routing a complex query to a weak model costs you trust. Size your misclassification bias accordingly.

3. A common output schema across tiers is essential. Without it, you're comparing apples and oranges and your frontend needs to handle two different response shapes. Force the schema at the prompt level.

4. Routing is a product decision, not just an infrastructure one. The thresholds you set for escalation reflect your platform's risk tolerance. In a governance context, we erred conservative. A developer tool might err aggressive. Know which direction your users would rather you fail.

You can read more about how CascadeFlow handles multi-tier routing in the cascadeflow docs. The cost savings are real, but the more important outcome is that complex queries now get the compute they actually need instead of competing on the same tier as "what does this acronym stand for."