How I built a 3-provider LLM fallback system in production (and what actually broke)
Ayush Not so great · 2026-06-18 · via DEV Community

I'm a pre-final year student. I built Socra (https://socra-production.up.railway.app/), a multi-agent LLM SaaS that interrogates your startup idea using 5 specialist AI personas before generating an architecture masterplan. It has paying users. It runs on Railway. And for the first two weeks of production, it was quietly broken in a way I didn't notice until real users hit it.

This is the story of how I built the 3-provider fallback chain (Anthropic → Google → Groq), what broke along the way, and the actual code that runs in production today.


Why you need a fallback chain at all

When I first deployed Socra, the LLM routing was simple: one provider, one model, one API key. It worked fine in development.

Then real users started using it.

Groq's free tier allows 6,000 tokens per minute. A single Socra masterplan pipeline runs 5 specialist agents in parallel, each with ~1,500 input tokens and ~400 output tokens, so it consumes roughly 9,500 tokens in one burst. The result: 3 out of 5 agents were returning Error code: 429 on every session with any real traffic.

The app was showing agent cards to users. Some said "Error" in amber text. I thought it was a race condition. It wasn't. It was me naively assuming one free-tier API could handle a multi-agent pipeline.

The fix wasn't to optimize — it was to add redundancy.


The routing priority chain

The final production routing order:

1. Anthropic Claude Haiku   — if ANTHROPIC_API_KEY is set
2. Google Gemini 2.0 Flash  — if GOOGLE_API_KEY is set  ← production default
3. Groq LLaMA 3.1 8B        — if GROQ_API_KEY is set    ← fallback
4. Stub mode                — demo scenarios, no API key needed

Why this order? Cost and rate limits, not model quality:

Provider    Model                  Input $/MTok   Output $/MTok   Free tier TPM
Anthropic   claude-haiku-4-5       $0.80          $4.00           None
Google      gemini-2.0-flash       $0.075         $0.30           1,000,000
Groq        llama-3.1-8b-instant   $0.06          $0.06           6,000

Google's free tier offers roughly 167× the headroom of Groq's (1,000,000 vs. 6,000 TPM) for a pipeline that fires 5 LLM calls simultaneously. For a student-built SaaS where LLM cost needs to be near zero while testing, that's not a small difference: it's the difference between the app working and not working.


The implementation

The routing check

Every LLM call in the system goes through one of two entrypoints: _call_llm (non-streaming, for structured JSON) and _stream_llm_tokens (streaming, for conversation text). Both use the same routing logic:

# backend/llm_client.py

async def _call_llm(system: str, messages: list[dict], max_tokens: int, json_mode: bool = False) -> str:
    if settings.anthropic_api_key:
        return await _call_anthropic(system, messages, max_tokens, json_mode)
    elif settings.google_api_key:
        return await _call_google(system, messages, max_tokens, json_mode)
    elif settings.groq_api_key:
        return await _call_groq(system, messages, max_tokens, json_mode)
    else:
        return _stub_response(messages)

Dead simple. The routing is just: which key is set? The first match wins.
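As shown, the dispatch picks a provider by which key is configured; the snippet doesn't retry on a different provider when the chosen one errors. If runtime failover is wanted, one possible shape is sketched below. This is not Socra's code: `call_with_failover`, `AllProvidersFailed`, and the two stand-in providers are hypothetical names, and real code would catch the SDK's specific error types rather than bare `Exception`.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical type: any coroutine function that takes a prompt and returns text.
ProviderCall = Callable[[str], Awaitable[str]]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain raised an error."""


async def call_with_failover(
    prompt: str, providers: list[tuple[str, ProviderCall]]
) -> tuple[str, str]:
    """Try each (name, call) pair in priority order.

    Returns (provider_name, text) from the first call that succeeds.
    """
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, await call(prompt)
        except Exception as exc:  # assumption: narrow this to SDK errors in real code
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers so the sketch runs without any API keys.
async def flaky_primary(prompt: str) -> str:
    raise RuntimeError("Error code: 429")


async def healthy_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


result = asyncio.run(
    call_with_failover(
        "hello", [("primary", flaky_primary), ("secondary", healthy_secondary)]
    )
)
print(result)  # ('secondary', 'echo: hello')
```

The trade-off is that a failed call now costs extra latency before the next provider is tried, so it pairs naturally with short timeouts.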

Google via the OpenAI SDK (the elegant hack)

Google AI Studio exposes an OpenAI-compatible endpoint. This means you don't need the Google SDK — just point the OpenAI SDK at a different base URL:

async def _call_google(system: str, messages: list[dict], max_tokens: int, json_mode: bool = False) -> str:
    from openai import AsyncOpenAI
    client = AsyncOpenAI(
        api_key=settings.google_api_key,
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    )
    kwargs = {
        "model": "gemini-2.0-flash",
        "max_tokens": max_tokens,
        "messages": [{"role": "system", "content": system}, *messages],
    }
    if json_mode:
        kwargs["response_format"] = {"type": "json_object"}
    response = await client.chat.completions.create(**kwargs)
    return response.choices[0].message.content or ""

Same pattern works for streaming — just use stream=True and iterate async for chunk in stream.

This is a pattern worth knowing: Groq, Azure OpenAI, and Google AI Studio all support the OpenAI-compatible endpoint format. If you write against the OpenAI SDK with configurable base_url and api_key, you get multi-provider support with almost no extra code.
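One way to make "configurable base_url and api_key" concrete is a small provider table. This is a sketch, not the post's code: `Provider`, `OPENAI_COMPATIBLE`, and `pick_provider` are hypothetical names, and the Groq base URL is the one Groq documents for its OpenAI-compatible API, so check it against current docs before relying on it.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Provider:
    name: str
    env_var: str
    base_url: str
    model: str


# Priority order matters: the first provider whose key is present wins.
OPENAI_COMPATIBLE = [
    Provider(
        "google",
        "GOOGLE_API_KEY",
        "https://generativelanguage.googleapis.com/v1beta/openai/",
        "gemini-2.0-flash",
    ),
    Provider(
        "groq",
        "GROQ_API_KEY",
        "https://api.groq.com/openai/v1",  # assumption: verify against Groq's docs
        "llama-3.1-8b-instant",
    ),
]


def pick_provider(env: Mapping[str, str]) -> Optional[tuple[Provider, str]]:
    """Return (provider, api_key) for the first configured provider, else None."""
    for provider in OPENAI_COMPATIBLE:
        key = env.get(provider.env_var, "").strip()
        if key:
            return provider, key
    return None


chosen = pick_provider({"GROQ_API_KEY": " example-key \n"})
print(chosen[0].name, chosen[0].base_url)  # groq https://api.groq.com/openai/v1
```

The returned pair would then feed `AsyncOpenAI(api_key=key, base_url=provider.base_url)`, so adding a provider becomes a one-entry change to the table.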

The structured output problem

Here's where it got messy. After the multi-agent pipeline runs and generates a masterplan, Socra needs structured JSON back from the LLM — eval scores, assumption tracking, quick reply choices. The original approach was a separator in the stream:

Stream: "Here are my questions... ###JSON###{"eval_delta": {...}, "choices": [...]}"

This worked fine with Anthropic (Claude follows formatting instructions reliably). It broke completely with smaller models.

The 8B Groq model would occasionally include the separator, occasionally not, occasionally put it in the middle of a sentence. Parsing failed silently and choices came back empty — users saw no quick reply options after the first message.

The fix: two separate calls.

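# Call 1: Stream plain text, no format requirements
full_message = ""
async for token in _stream_llm_tokens(system, messages):
    yield token
    full_message += token

# Call 2: After streaming ends, get structured data separately.
# _call_llm returns a JSON string; parse it (e.g. json.loads) before use.
eval_data = await _call_llm(
    system=eval_system_prompt,
    messages=messages + [{"role": "assistant", "content": full_message}],
    max_tokens=max_tokens,  # required by _call_llm's signature; the value is the caller's choice
    json_mode=True,
)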

The Anthropic path still uses the separator (it's reliable there and saves one API call). The Groq and Google paths use two calls. A bit more latency, zero parsing failures.
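For the path that keeps the separator, the silent failure described earlier comes down to treating "no separator" and "bad JSON" the same as "empty choices". A defensive parser can make both cases explicit. This is a sketch under assumptions: `split_stream_output` is a hypothetical helper, and only the `###JSON###` marker is taken from the post.

```python
import json
from typing import Optional

SEPARATOR = "###JSON###"


def split_stream_output(raw: str) -> tuple[str, Optional[dict]]:
    """Split streamed output into (visible_text, parsed_json_or_None).

    None signals that the structured part is missing or unusable, so the
    caller can fall back to a second JSON-mode call instead of showing
    empty quick replies.
    """
    text, sep, tail = raw.partition(SEPARATOR)
    if not sep:
        return raw.strip(), None  # the model never emitted the separator
    try:
        data = json.loads(tail)
    except json.JSONDecodeError:
        return text.strip(), None  # separator present, payload not valid JSON
    return text.strip(), data if isinstance(data, dict) else None


print(split_stream_output('Here are my questions... ###JSON###{"choices": ["a", "b"]}'))
# ('Here are my questions...', {'choices': ['a', 'b']})
print(split_stream_output("Mid ###JSON### sentence continues"))
# ('Mid', None)
```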


What actually broke in production

The trailing newline API key

This one cost me 45 minutes.

After deploying to Railway, every LLM call was failing with Illegal header value. The API key was correct — I'd copied it straight from the Groq console. Except I hadn't. I'd pasted it into Railway's Variables tab and there was an invisible \n at the end.

The fix was two things:

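  1. Re-enter the key manually (don't paste from clipboard)
  2. Add .strip() defensively in config.py:

# config.py
from pydantic import validator
from pydantic_settings import BaseSettings  # pydantic v2; on v1 use: from pydantic import BaseSettings

class Settings(BaseSettings):
    groq_api_key: str = ""
    anthropic_api_key: str = ""
    google_api_key: str = ""

    # pre=True runs before type validation, so the raw env value is stripped first.
    # (pydantic v2 still accepts @validator but prefers field_validator(..., mode="before").)
    @validator('groq_api_key', 'anthropic_api_key', 'google_api_key', pre=True)
    def strip_keys(cls, v):
        return v.strip() if isinstance(v, str) else v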

Now the app is defensive against copy-paste mistakes. The .strip() costs nothing and prevents a class of errors that are genuinely hard to debug.
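The same idea can go one step further without any framework: strip the key, then fail loudly at startup if anything that can't legally sit in an HTTP header value is still inside it. A minimal stdlib-only sketch; `clean_api_key` is a hypothetical helper, not part of Socra's config.

```python
def clean_api_key(raw: str, name: str = "API key") -> str:
    """Strip surrounding whitespace; reject keys with embedded whitespace or control chars."""
    key = raw.strip()
    bad = [ch for ch in key if ch.isspace() or not ch.isprintable()]
    if bad:
        # Fail at startup instead of at the first request with "Illegal header value".
        raise ValueError(f"{name} contains {len(bad)} whitespace/control character(s)")
    return key


print(repr(clean_api_key("example-key-123\n")))  # 'example-key-123'

try:
    clean_api_key("example\nkey", name="GROQ_API_KEY")
except ValueError as exc:
    print(exc)  # GROQ_API_KEY contains 1 whitespace/control character(s)
```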

The startup log that lied

After adding Google as the second provider, I pushed to Railway and checked the logs. They said:

Using Groq LLaMA for LLM calls

But I'd set GOOGLE_API_KEY. For two days I thought Google wasn't working. It was. The startup log was wrong.

The main.py lifespan check had a bug:

# Before — skipped Google entirely
if settings.anthropic_api_key:
    logger.info("Using Anthropic Claude")
elif settings.groq_api_key:         # ← checked Groq before Google
    logger.info("Using Groq LLaMA")

The actual routing in _call_llm was correct (Google checked second, before Groq). But the log check had a different order — so if Groq was also set (it was), it logged "Using Groq" even though every actual call was going to Google.

Fix: mirror the routing logic exactly in the startup log.
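One way to keep the two from drifting apart again is to derive both the log line and the dispatch from a single priority list. A sketch, assuming a settings object with the three key attributes from the post; `PROVIDER_PRIORITY`, `active_provider`, and `startup_message` are hypothetical names.

```python
from types import SimpleNamespace

# Single source of truth for priority: (settings attribute, human-readable label).
PROVIDER_PRIORITY = [
    ("anthropic_api_key", "Anthropic Claude"),
    ("google_api_key", "Google Gemini"),
    ("groq_api_key", "Groq LLaMA"),
]


def active_provider(settings) -> str:
    """Return the label of the first provider whose key is set, else 'stub'."""
    for attr, label in PROVIDER_PRIORITY:
        if getattr(settings, attr, ""):
            return label
    return "stub"


def startup_message(settings) -> str:
    """Build the startup log line from the same function the router would use."""
    return f"Using {active_provider(settings)} for LLM calls"


# Both Google and Groq keys set, as in the incident above.
settings = SimpleNamespace(anthropic_api_key="", google_api_key="g-key", groq_api_key="q-key")
print(startup_message(settings))  # Using Google Gemini for LLM calls
```

If the router also branches on `active_provider(settings)`, reordering the list changes routing and logging together.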

The 429 cascade

Running 5 parallel specialist agents against Groq's 6k TPM free tier: the math never worked and I was pretending it did.

Each agent gets ~1,500 input tokens + generates ~400 output tokens = ~1,900 tokens per call. 5 parallel calls = 9,500 tokens launched simultaneously. Groq's rate limiter sees all 9,500 in the same minute window and rejects the overflow.
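The arithmetic is small enough to run as a pre-deploy check. A sketch using the figures from this section; the function names are hypothetical, and the provider's actual limiter may count tokens differently than this simple sum.

```python
def burst_tokens(agents: int, input_tokens: int, output_tokens: int) -> int:
    """Total tokens requested when all agents fire at once."""
    return agents * (input_tokens + output_tokens)


def overshoot(burst: int, tpm_limit: int) -> int:
    """Tokens over the per-minute limit (0 if the burst fits)."""
    return max(0, burst - tpm_limit)


burst = burst_tokens(agents=5, input_tokens=1_500, output_tokens=400)
print(burst)                        # 9500
print(overshoot(burst, 6_000))      # 3500
print(overshoot(burst, 1_000_000))  # 0
```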

Three approaches I tried, in order:

Approach 1: Retry with backoff. Added 3-attempt retry with 4s/8s exponential backoff on 429 errors. Helped slightly. Didn't fix the underlying math.

Approach 2: Sequential execution with delays. Switched from asyncio.gather() to sequential calls with 1.5s gaps between agents. This spread the token burst across multiple rate-limit windows. Worked on Groq, but added ~7.5s to the masterplan pipeline — noticeable.

Approach 3: Switch to Google. Google's free tier is 1,000,000 TPM. Problem disappeared entirely. Now Groq is the fallback, not the primary.
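For reference, Approach 1's retry schedule (3 attempts, waiting 4s then 8s) can be sketched as below. This is not the production code: `RateLimited` stands in for whatever 429 exception the SDK raises, and `sleep` is injectable only so the example runs instantly.

```python
import asyncio


class RateLimited(Exception):
    """Stand-in for the SDK's 429 error type (hypothetical name)."""


async def call_with_retry(call, delays=(4.0, 8.0), sleep=asyncio.sleep):
    """Run call(); on RateLimited, wait the next delay and retry.

    Makes len(delays) + 1 attempts in total; the last failure propagates.
    """
    for delay in delays:
        try:
            return await call()
        except RateLimited:
            await sleep(delay)
    return await call()  # final attempt: let any error reach the caller


# Demo: fail twice, then succeed, recording waits instead of really sleeping.
waits: list[float] = []
state = {"attempts": 0}


async def fake_sleep(seconds: float) -> None:
    waits.append(seconds)


async def flaky() -> str:
    state["attempts"] += 1
    if state["attempts"] < 3:
        raise RateLimited("Error code: 429")
    return "ok"


print(asyncio.run(call_with_retry(flaky, sleep=fake_sleep)), waits)  # ok [4.0, 8.0]
```

Retries like this smooth over brief spikes, but as noted above they can't help when the burst itself is larger than the per-minute budget.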

The real lesson: design for the rate limits of your fallback providers, not just your primary. Groq is fast and cheap but not meant for parallel multi-agent workloads on the free tier.


The cost analysis

After switching to Google as the production default, I did a full token and cost breakdown per session:

Stage                        Input tokens   Output tokens
Conversation (7 turns avg)   ~16,700        ~3,500
5 specialist agents          ~24,000        ~3,500
Synthesis                    ~12,700        ~2,500
Devil's advocate             ~2,800         ~600
Total per session            ~56,200        ~10,100

At Google Gemini Flash pricing ($0.075 input / $0.30 output per million tokens):

Input cost:  56,200 / 1,000,000 × $0.075 = $0.0042
Output cost: 10,100 / 1,000,000 × $0.30  = $0.0030
Total:       ~$0.007 per session
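The same calculation as a reusable function, so other providers' prices can be plugged in. A sketch using the token totals and per-million prices quoted in this post; `session_cost` is a hypothetical helper name.

```python
def session_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """LLM cost in dollars for one session, given prices per million tokens."""
    return (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000


gemini = session_cost(56_200, 10_100, 0.075, 0.30)
haiku = session_cost(56_200, 10_100, 0.80, 4.00)
print(f"{gemini:.4f}")  # 0.0072
print(f"{haiku:.4f}")   # 0.0854
```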

Socra charges ₹499 (~$6) for a full masterplan session. LLM cost per session: $0.007. That's about a 99.9% gross margin when counting LLM cost alone.

Railway hosting is ~$30/month fixed. Break-even is roughly 6 paid sessions per month.

This math only works because of the provider choice. The same session on Anthropic Haiku costs ~$0.085 — 12× more expensive, which would put margins at ~98.6%. Still fine, but the point is: provider selection is a product decision, not just a technical one.
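The margin and break-even figures follow from three inputs: session price, per-session LLM cost, and fixed hosting. A sketch with this section's rounded numbers (~$6 price, $0.007 or $0.085 LLM cost, ~$30/month hosting); the function names are hypothetical.

```python
import math


def gross_margin(price: float, llm_cost: float) -> float:
    """Fraction of the session price left after LLM cost."""
    return 1 - llm_cost / price


def break_even_sessions(fixed_monthly: float, price: float, llm_cost: float) -> int:
    """Smallest whole number of paid sessions that covers fixed monthly cost."""
    return math.ceil(fixed_monthly / (price - llm_cost))


print(f"{gross_margin(6.0, 0.007):.2%}")       # 99.88%
print(f"{gross_margin(6.0, 0.085):.2%}")       # 98.58%
print(break_even_sessions(30.0, 6.0, 0.007))   # 6
```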


What I'd do differently

1. Design for multi-provider from day one. I added the fallback chain in Phase 3 after production broke. It should have been in the architecture from the start. The routing abstraction (_call_llm with provider detection) is simple enough to add in 30 minutes — there's no reason to start with a single provider.

2. Test the rate limit math before deploying parallel calls. 5 parallel agents × 1,900 tokens = 9,500 tokens in one burst. Groq's free tier is 6,000 TPM. This is elementary arithmetic that I didn't do until users were getting errors.

3. Strip API keys at the config layer. .strip() in your settings class is a 5-minute change that eliminates an entire class of deployment bugs.

4. Make your startup log mirror your routing logic exactly. A log that says "Using Groq" when you're actually using Google is worse than no log — it actively misleads debugging.


The full stack for context

Socra is built on: FastAPI + React + PostgreSQL + Railway + LangGraph (for the multi-agent pipeline) + Langfuse v4 (for per-call LLM observability) + Clerk (auth) + Razorpay (payments). The LLM fallback chain described here handles all LLM calls across the entire system — conversation, agents, synthesis, pitch deck generation, and the tribunal verdict scoring.

The live app is at socra-production.up.railway.app. The approach described here — OpenAI-compatible endpoints, two-call structured output, provider detection at the config layer — is all running in production today.


I'm a pre-final year student at HBTU Kanpur building production ML systems. If you're working on something similar or have questions about the multi-agent architecture, I'm on LinkedIn and GitHub.