Our AI Inference Bill Dropped 65% After We Stopped Treating Every Query the Same

Every query hitting our AI layer was going straight to the most powerful model we had. A user asking "what does HIPAA Section 164.312 say?" got the same compute budget as one asking "should we shut down the payment processor during this active incident?" That was expensive and stupid, and it took embarrassingly long to fix.

This is the story of how we built a routing layer called CascadeFlow into SentinelOps AI, an enterprise decision intelligence platform, and what actually happened when we turned it on.

The Problem With "One Model Fits All"

When you're building an AI system for enterprise operations teams—people making real decisions about infrastructure, compliance posture, and incident response—you face a genuine tension. You need the model to be good when it matters. But "good" on a documentation lookup is a different thing from "good" on "we have a potential SOC2 violation, walk me through the remediation path."

Before routing, every query went to our primary reasoning model (Llama 3.3 70B via Groq). The latency was fine. The quality was fine. The cost was not fine. At scale, routing simple factual queries through a 70B parameter model is just burning money.

The naive fix is to have engineers triage queries manually, which doesn't scale. The correct fix is a classifier that does it automatically.

CascadeFlow: A Lightweight Routing Engine

We integrated @cascadeflow/core as our routing middleware. The idea is straightforward: before a query hits the expensive model, a cheap, fast classifier decides which tier it belongs to.

Our routing logic looks roughly like this:

import { CascadeFlow } from '@cascadeflow/core';

const cascade = new CascadeFlow({
  classifier: {
    model: 'llama-3.1-8b-instant', // fast, cheap
    provider: 'groq',
  },
  tiers: [
    {
      name: 'simple',
      model: 'llama-3.1-8b-instant',
      triggers: ['documentation', 'lookup', 'definition', 'what is'],
    },
    {
      name: 'complex',
      model: 'llama-3.3-70b-versatile',
      triggers: ['incident', 'compliance', 'risk', 'critical', 'breach'],
    },
  ],
});

The classifier runs first—it's an 8B model, so it's fast and cheap—and classifies the incoming query into a complexity tier. Simple queries (policy lookups, definition requests, status checks) stay on the 8B model. Complex queries (active incidents, compliance risk assessments, multi-system decisions) escalate to the 70B.

From our LLM service layer, the routing call is transparent:

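import Groq from 'groq-sdk';

// The client reads GROQ_API_KEY from the environment by default.
const groq = new Groq();

// Classify the query, then call the model for the chosen tier.
// buildMessages is assumed to be defined elsewhere in the service layer.
async function routeAndExecute(query, context) {
  const tier = await cascade.classify(query);
  const model = tier === 'complex'
    ? 'llama-3.3-70b-versatile'
    : 'llama-3.1-8b-instant';

  return groq.chat.completions.create({
    model,
    messages: buildMessages(query, context),
    // JSON mode: the prompt should also describe the expected JSON shape.
    response_format: { type: 'json_object' },
  });
}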

That response_format setting (type: 'json_object') is important; we'll come back to it.

What Routing Actually Costs You

There's a hidden cost to routing that nobody talks about: the classifier itself can be wrong.

In our early testing, the 8B classifier was misrouting about 12% of complex queries down to the cheap tier. A question like "is our current encryption at rest sufficient for PHI storage?" looks superficially like a documentation query. The classifier saw "encryption" and "PHI" as lookup-adjacent terms and routed it to the cheap model, which gave a technically accurate but shallow answer that lacked the risk-weighted framing an auditor would need.
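
One way to measure a rate like this is to replay a hand-labeled sample through the router and count how many complex queries landed on the cheap tier. A minimal sketch; the record shape and the `underRoutingRate` name are illustrative assumptions, not part of CascadeFlow:

```javascript
// Each record pairs a human label with the tier the router actually chose.
// Example record: { expected: 'complex', routed: 'simple' }
function underRoutingRate(labeled) {
  const complex = labeled.filter((r) => r.expected === 'complex');
  if (complex.length === 0) return null; // no complex queries in the sample

  const missed = complex.filter((r) => r.routed === 'simple').length;
  return missed / complex.length;
}

const sample = [
  { expected: 'complex', routed: 'complex' },
  { expected: 'complex', routed: 'simple' },
  { expected: 'simple', routed: 'simple' },
  { expected: 'complex', routed: 'complex' },
];
console.log(underRoutingRate(sample)); // 1 of 3 complex queries was under-routed
```

Simple queries routed high are deliberately ignored here, since they cost money rather than answer quality.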

We fixed this in two ways:

1. Conservative misclassification bias. When the classifier's confidence is below a threshold, escalate to the expensive tier. False positives (routing simple queries high) cost money. False negatives (routing complex queries low) cost credibility. In an enterprise governance context, credibility is more expensive.
2. Domain keyword pre-checks. Before the classifier even runs, we scan for a hardcoded list of high-stakes terms. If a query contains words like breach, PHI, incident, remediation, or SOC2, it goes to the 70B model unconditionally.

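// Stored lowercase because the query is lowercased before matching.
const HIGH_STAKES_KEYWORDS = [
  'breach', 'incident', 'phi', 'pii', 'soc2', 'hipaa',
  'remediation', 'critical', 'violation', 'audit', 'penalty'
];

// Returns true if the query mentions any high-stakes term.
// Substring matching over-matches (e.g. 'phi' in "philosophy"), which only
// ever escalates a query, so it errs in the safe direction.
function requiresComplexModel(query) {
  const lower = query.toLowerCase();
  return HIGH_STAKES_KEYWORDS.some(kw => lower.includes(kw));
}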

This is not elegant, but it's safe. The performance overhead is one .includes() check per keyword per query, which is negligible next to a model call.
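
The two fixes compose into a single decision. A minimal sketch, assuming the classifier returns a tier label plus a confidence score between 0 and 1; that return shape, the `decideTier` name, and the 0.8 threshold are illustrative assumptions, not CascadeFlow's documented API:

```javascript
// Hypothetical sketch: keyword gate first, then a confidence-gated classifier result.
// Keywords are lowercase because the query is lowercased before matching.
const HIGH_STAKES = ['breach', 'incident', 'phi', 'pii', 'soc2', 'hipaa', 'remediation'];

// Assumed value; tune it against labeled traffic.
const CONFIDENCE_THRESHOLD = 0.8;

/**
 * @param {string} query
 * @param {{ tier: 'simple' | 'complex', confidence: number }} classification
 *        Assumed classifier output shape.
 * @returns {'simple' | 'complex'}
 */
function decideTier(query, classification) {
  const lower = query.toLowerCase();

  // 1. Deterministic pre-check: any high-stakes term escalates unconditionally.
  if (HIGH_STAKES.some((kw) => lower.includes(kw))) return 'complex';

  // 2. Conservative bias: an uncertain classifier is treated as "complex".
  if (classification.confidence < CONFIDENCE_THRESHOLD) return 'complex';

  // 3. Otherwise trust the classifier.
  return classification.tier;
}

console.log(decideTier('What is a runbook?', { tier: 'simple', confidence: 0.95 })); // simple
console.log(decideTier('What is a runbook?', { tier: 'simple', confidence: 0.55 })); // complex
console.log(decideTier('Is PHI exposed?', { tier: 'simple', confidence: 0.99 }));    // complex
```

In a live pipeline the keyword check would run before the classifier is called at all, which also saves the classification call. It is folded into one pure function here so the decision logic is easy to test.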

The Numbers

After deploying CascadeFlow routing against a realistic mix of enterprise queries, roughly 68% of queries fell into the "simple" tier. The remaining 32% were genuinely complex—incident-related, compliance-heavy, or multi-system risk assessments that benefited from the more capable model.

That routing split—combined with the price difference between an 8B and 70B parameter model—accounts for most of the cost reduction. The exact figure depends on your query distribution and your provider's pricing, but 60-65% is a reasonable estimate for an enterprise operational workload where most interactions are informational rather than analytical.
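
A back-of-the-envelope version of that estimate, as a sketch. The price ratio (what a small-model query costs relative to a large-model query) and the classifier overhead are placeholder assumptions, not our measured figures or any provider's published pricing:

```javascript
// Estimated fractional saving versus sending every query to the large model.
// All costs are expressed relative to one large-model query (= 1.0).
function estimateSavings({ simpleShare, priceRatio, classifierOverhead }) {
  const complexShare = 1 - simpleShare;
  const routedCost =
    simpleShare * priceRatio + // simple queries answered by the small model
    complexShare * 1.0 +       // complex queries still hit the large model
    classifierOverhead;        // every query pays for one classification call
  return 1 - routedCost;
}

const saving = estimateSavings({
  simpleShare: 0.68,        // the routing split reported above
  priceRatio: 0.1,          // assumed: small model costs 10% of the large one
  classifierOverhead: 0.01, // assumed: classification costs 1% of a large-model query
});
console.log(`${(saving * 100).toFixed(1)}%`); // prints 60.2%
```

Under these assumptions the saving lands near 60%. The result is most sensitive to simpleShare, so every query the keyword gate or confidence threshold escalates eats directly into it.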

Forcing Structure Out of Both Models

One consequence of routing to two different models is that you now have two sources of unstructured text to deal with. We solved this by enforcing a strict JSON response schema at the prompt level, regardless of which model is running.

Every response from SentinelOps AI conforms to this shape:

{
  "summary": "One-sentence decision summary",
  "risk_level": "LOW | MEDIUM | HIGH | CRITICAL",
  "confidence": 0.87,
  "recommendation": "Specific, actionable recommendation",
  "tradeoffs": ["Tradeoff A", "Tradeoff B"],
  "governance_flags": [],
  "citations": []
}

The frontend renders this as a Decision Card—not a chat bubble. Risk level gets a color-coded badge. Confidence is displayed as a progress bar. Tradeoffs are rendered as a checklist. Governance flags trigger a separate UI element that routes to the compliance dashboard.
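
JSON mode constrains the output to JSON but does not by itself enforce this particular shape, so it helps to validate a response before rendering it. A minimal sketch; `validateDecision` is a hypothetical helper, and the field rules are inferred from the example above rather than taken from a published schema:

```javascript
const RISK_LEVELS = ['LOW', 'MEDIUM', 'HIGH', 'CRITICAL'];

// Returns a list of problems; an empty list means the payload is usable.
function validateDecision(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    return ['not valid JSON'];
  }
  if (data === null || typeof data !== 'object' || Array.isArray(data)) {
    return ['top-level value is not an object'];
  }

  const errors = [];
  for (const key of ['summary', 'recommendation']) {
    if (typeof data[key] !== 'string' || data[key].length === 0) {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  if (!RISK_LEVELS.includes(data.risk_level)) {
    errors.push('risk_level must be one of ' + RISK_LEVELS.join(', '));
  }
  if (typeof data.confidence !== 'number' || data.confidence < 0 || data.confidence > 1) {
    errors.push('confidence must be a number between 0 and 1');
  }
  for (const key of ['tradeoffs', 'governance_flags', 'citations']) {
    if (!Array.isArray(data[key])) errors.push(`${key} must be an array`);
  }
  return errors;
}
```

What to do with a failed validation is a product choice. One option is to retry the query once on the larger model before showing an error state.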

When you force both the cheap and expensive model into the same output schema, the quality difference between tiers becomes measurable. You can compare confidence scores, count governance_flags, and track whether the 8B model's recommendations match the 70B model's on borderline queries. This becomes a feedback loop for improving your routing thresholds over time.
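
One way to turn that comparison into numbers, as a sketch: replay a sample of borderline queries through both tiers and measure how often they agree. The record shape and the `tierAgreement` name are assumptions for illustration, and matching risk_level is used as a rough proxy for matching recommendations, which are free text:

```javascript
// Each sample holds the parsed responses from both tiers for the same query:
// { small: { risk_level, confidence }, large: { risk_level, confidence } }
function tierAgreement(samples) {
  if (samples.length === 0) {
    return { riskAgreement: null, meanConfidenceGap: null };
  }

  let sameRisk = 0;
  let gapSum = 0;
  for (const { small, large } of samples) {
    if (small.risk_level === large.risk_level) sameRisk += 1;
    gapSum += large.confidence - small.confidence;
  }
  return {
    riskAgreement: sameRisk / samples.length,     // share of queries with the same risk level
    meanConfidenceGap: gapSum / samples.length,   // positive when the large model is more confident
  };
}
```

A low agreement rate on a class of queries suggests that class should escalate. A consistently high rate suggests the threshold for it can be relaxed.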

Lessons

1. Start with keyword gating, not just ML classification. A simple list of high-stakes terms as a pre-filter saved us from the worst misrouting failures. ML classifiers are probabilistic. Safety-critical routing decisions shouldn't be.

2. The cost of misrouting is asymmetric. Routing a simple query to a powerful model costs you money. Routing a complex query to a weak model costs you trust. Set your misclassification bias accordingly.

3. A common output schema across tiers is essential. Without it, you're comparing apples and oranges and your frontend needs to handle two different response shapes. Force the schema at the prompt level.

4. Routing is a product decision, not just an infrastructure one. The thresholds you set for escalation reflect your platform's risk tolerance. In a governance context, we erred conservative. A developer tool might err aggressive. Know which direction your users would rather you fail.

You can read more about how CascadeFlow handles multi-tier routing in the cascadeflow docs. The cost savings are real, but the more important outcome is that complex queries now get the compute they actually need instead of competing on the same tier as "what does this acronym stand for."
