How I built a 3-provider LLM fallback system in production (and what actually broke)

I'm a pre-final year student. I built Socra (https://socra-production.up.railway.app/), a multi-agent LLM SaaS that interrogates your startup idea using 5 specialist AI personas before generating an architecture masterplan. It has paying users. It runs on Railway. And for the first two weeks of production, it was quietly broken in a way I didn't notice until real users hit it.

This is the story of how I built the 3-provider fallback chain (Anthropic → Google → Groq), what broke along the way, and the actual code that runs in production today.

Why you need a fallback chain at all

When I first deployed Socra, the LLM routing was simple: one provider, one model, one API key. It worked fine in development.

Then real users started using it.

Groq's free tier is 6,000 tokens per minute. A single Socra masterplan pipeline (5 specialist agents running in parallel, each with ~1,500 input tokens and ~400 output tokens) consumes roughly 9,500 tokens in one burst. The result: 3 out of 5 agents were returning Error code: 429 on every session with any real traffic.

The app was showing agent cards to users. Some said "Error" in amber text. I thought it was a race condition. It wasn't. It was me naively assuming one free-tier API could handle a multi-agent pipeline.

The fix wasn't to optimize — it was to add redundancy.

The routing priority chain

The final production routing order:

1. Anthropic Claude Haiku   — if ANTHROPIC_API_KEY is set
2. Google Gemini 2.0 Flash  — if GOOGLE_API_KEY is set  ← production default
3. Groq LLaMA 3.1 8B        — if GROQ_API_KEY is set    ← fallback
4. Stub mode                — demo scenarios, no API key needed

Why this order? Cost and rate limits, not model quality:

Provider	Model	Input $/MTok	Output $/MTok	Free tier TPM
Anthropic	claude-haiku-4-5	$0.80	$4.00	None
Google	gemini-2.0-flash	$0.075	$0.30	1,000,000
Groq	llama-3.1-8b-instant	$0.06	$0.06	6,000

Google's free tier has roughly 167× the headroom of Groq's (1,000,000 vs 6,000 TPM) for a pipeline that fires 5 LLM calls simultaneously. For a student-built SaaS where LLM cost needs to be near zero while testing, that's not a small difference: it's the difference between the app working and not working.

The implementation

The routing check

Every LLM call in the system goes through one of two entrypoints: _call_llm (non-streaming, for structured JSON) and _stream_llm_tokens (streaming, for conversation text). Both use the same routing logic:

# backend/llm_client.py

async def _call_llm(system: str, messages: list[dict], max_tokens: int, json_mode: bool = False) -> str:
    if settings.anthropic_api_key:
        return await _call_anthropic(system, messages, max_tokens, json_mode)
    elif settings.google_api_key:
        return await _call_google(system, messages, max_tokens, json_mode)
    elif settings.groq_api_key:
        return await _call_groq(system, messages, max_tokens, json_mode)
    else:
        return _stub_response(messages)

Dead simple. The routing is just: which key is set? The first match wins.

Google via the OpenAI SDK (the elegant hack)

Google AI Studio exposes an OpenAI-compatible endpoint. This means you don't need the Google SDK — just point the OpenAI SDK at a different base URL:

async def _call_google(system: str, messages: list[dict], max_tokens: int, json_mode: bool = False) -> str:
    from openai import AsyncOpenAI
    client = AsyncOpenAI(
        api_key=settings.google_api_key,
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    )
    kwargs = {
        "model": "gemini-2.0-flash",
        "max_tokens": max_tokens,
        "messages": [{"role": "system", "content": system}, *messages],
    }
    if json_mode:
        kwargs["response_format"] = {"type": "json_object"}
    response = await client.chat.completions.create(**kwargs)
    return response.choices[0].message.content or ""

Same pattern works for streaming — just use stream=True and iterate async for chunk in stream.
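A minimal sketch of that streaming variant, assuming the same OpenAI-compatible client as above. The name `stream_google_tokens` and its `client` parameter are illustrative, not Socra's actual `_stream_llm_tokens`; passing the client in simply keeps the sketch testable without a network call.

```python
from typing import AsyncIterator


async def stream_google_tokens(
    client,  # assumed: an openai.AsyncOpenAI pointed at the Google base_url
    system: str,
    messages: list[dict],
    max_tokens: int,
    model: str = "gemini-2.0-flash",
) -> AsyncIterator[str]:
    """Yield text fragments as they arrive from a chat completion stream."""
    stream = await client.chat.completions.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "system", "content": system}, *messages],
        stream=True,
    )
    async for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

The caller consumes it with `async for token in stream_google_tokens(...)`, which is the shape the two-call approach later in this post relies on.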

This is a pattern worth knowing: Groq, Azure OpenAI, and Google AI Studio all support the OpenAI-compatible endpoint format. If you write against the OpenAI SDK with configurable base_url and api_key, you get multi-provider support with almost no extra code.
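That idea can be sketched as a small provider table. The `OpenAICompatProvider` dataclass and the `client_kwargs` helper are hypothetical names, not part of Socra. The Google base URL and both model names come from this post; the Groq base URL is Groq's documented OpenAI-compatible endpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenAICompatProvider:
    name: str
    base_url: str
    model: str


# Providers reachable through the OpenAI SDK by swapping base_url and api_key.
PROVIDERS = {
    "google": OpenAICompatProvider(
        name="google",
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        model="gemini-2.0-flash",
    ),
    "groq": OpenAICompatProvider(
        name="groq",
        base_url="https://api.groq.com/openai/v1",
        model="llama-3.1-8b-instant",
    ),
}


def client_kwargs(provider_name: str, api_key: str) -> dict:
    """Build the keyword arguments for openai.AsyncOpenAI(**kwargs)."""
    provider = PROVIDERS[provider_name]
    return {"api_key": api_key, "base_url": provider.base_url}
```

A call site would then construct `AsyncOpenAI(**client_kwargs("groq", key))` and pass `model=PROVIDERS["groq"].model`, so adding a provider means adding one table entry.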

The structured output problem

Here's where it got messy. After the multi-agent pipeline runs and generates a masterplan, Socra needs structured JSON back from the LLM — eval scores, assumption tracking, quick reply choices. The original approach was a separator in the stream:

Stream: "Here are my questions... ###JSON###{"eval_delta": {...}, "choices": [...]}"

This worked fine with Anthropic (Claude follows formatting instructions reliably). It broke completely with smaller models.

The 8B Groq model would occasionally include the separator, occasionally not, occasionally put it in the middle of a sentence. Parsing failed silently and choices came back empty — users saw no quick reply options after the first message.
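To make the failure mode concrete, here is a sketch of a separator parser that reports a missing or malformed payload explicitly instead of quietly returning nothing. `split_stream_payload` is a hypothetical helper, not Socra's parser; only the `###JSON###` marker comes from the example above.

```python
import json
from typing import Optional

SEPARATOR = "###JSON###"


def split_stream_payload(full_text: str) -> tuple[str, Optional[dict]]:
    """Split streamed output into (visible text, parsed JSON or None)."""
    text, found, raw = full_text.partition(SEPARATOR)
    if not found:
        # The model never emitted the separator: no structured data at all.
        return full_text.strip(), None
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # Separator landed mid-sentence or the JSON was cut off.
        return text.strip(), None
    # Only a JSON object is useful here; anything else counts as a failure.
    return text.strip(), payload if isinstance(payload, dict) else None
```

A `None` payload is a signal worth logging, since an empty choices list is exactly what users saw when parsing failed.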

The fix: two separate calls.

# Call 1: stream plain text, no format requirements
full_message = ""
async for token in _stream_llm_tokens(system, messages):
    yield token
    full_message += token

# Call 2: after streaming ends, get structured data separately
eval_data = await _call_llm(
    system=eval_system_prompt,
    messages=messages + [{"role": "assistant", "content": full_message}],
    max_tokens=eval_max_tokens,  # required by _call_llm; placeholder name, value is app-specific
    json_mode=True,
)

The Anthropic path still uses the separator (it's reliable there and saves one API call). The Groq and Google paths use two calls. A bit more latency, zero parsing failures.

What actually broke in production

The trailing newline API key

This one cost me 45 minutes.

After deploying to Railway, every LLM call was failing with Illegal header value. The API key was correct — I'd copied it straight from the Groq console. Except I hadn't. I'd pasted it into Railway's Variables tab and there was an invisible \n at the end.

The fix was two things:

1. Re-enter the key manually (don't paste from clipboard)
2. Add .strip() defensively in config.py:

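# config.py (pydantic v1 style; on pydantic v2, BaseSettings lives in the
# pydantic-settings package and @field_validator(..., mode="before") replaces @validator)
from pydantic import BaseSettings, validator

class Settings(BaseSettings):
    groq_api_key: str = ""
    anthropic_api_key: str = ""
    google_api_key: str = ""

    @validator('groq_api_key', 'anthropic_api_key', 'google_api_key', pre=True)
    def strip_keys(cls, v):
        # Remove stray whitespace or newlines picked up when pasting the key
        return v.strip() if isinstance(v, str) else v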

Now the app is defensive against copy-paste mistakes. The .strip() costs nothing and prevents a class of errors that are genuinely hard to debug.
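For code that reads keys outside a pydantic settings class, the same defence can be sketched with the standard library alone. `read_api_key` is a hypothetical helper; it also logs when it had to trim something, so the bad variable can be fixed at the source.

```python
import logging
import os

logger = logging.getLogger(__name__)


def read_api_key(name: str, environ=os.environ) -> str:
    """Return the named variable with surrounding whitespace removed."""
    raw = environ.get(name, "")
    cleaned = raw.strip()
    if cleaned != raw:
        # Log only the variable name, never the key itself.
        logger.warning("%s had leading/trailing whitespace; stripped it", name)
    return cleaned
```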

The startup log that lied

After adding Google as the second provider, I pushed to Railway and checked the logs. They said:

Using Groq LLaMA for LLM calls

But I'd set GOOGLE_API_KEY. For two days I thought Google wasn't working. It was. The startup log was wrong.

The main.py lifespan check had a bug:

# Before — skipped Google entirely
if settings.anthropic_api_key:
    logger.info("Using Anthropic Claude")
elif settings.groq_api_key:         # ← checked Groq before Google
    logger.info("Using Groq LLaMA")

The actual routing in _call_llm was correct (Google checked second, before Groq). But the log check had a different order — so if Groq was also set (it was), it logged "Using Groq" even though every actual call was going to Google.

Fix: mirror the routing logic exactly in the startup log.
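One way to keep the two from drifting apart again is to derive both from a single function. This is a sketch, not Socra's code: `active_provider`, `startup_message`, and the `Keys` dataclass are hypothetical stand-ins for the real settings object and lifespan check.

```python
from dataclasses import dataclass


@dataclass
class Keys:
    """Stand-in for the settings object; an empty string means 'not set'."""
    anthropic_api_key: str = ""
    google_api_key: str = ""
    groq_api_key: str = ""


def active_provider(keys: Keys) -> str:
    """Return the provider the router will use, in priority order."""
    if keys.anthropic_api_key:
        return "anthropic"
    if keys.google_api_key:
        return "google"
    if keys.groq_api_key:
        return "groq"
    return "stub"


def startup_message(keys: Keys) -> str:
    # The log line is built from the same decision the router makes.
    return f"Using {active_provider(keys)} for LLM calls"
```

If `_call_llm` dispatches on `active_provider(...)` as well, reordering the chain changes the log and the routing together.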

The 429 cascade

Running 5 parallel specialist agents against Groq's 6k TPM free tier: the math never worked and I was pretending it did.

Each agent gets ~1,500 input tokens + generates ~400 output tokens = ~1,900 tokens per call. 5 parallel calls = 9,500 tokens launched simultaneously. Groq's rate limiter sees all 9,500 in the same minute window and rejects the overflow.
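The same arithmetic as a quick pre-deploy check. The token counts and limits are the figures quoted in this post; `burst_tokens` and `burst_fits` are hypothetical helpers, and a real rate limiter may count tokens differently.

```python
def burst_tokens(agents: int, input_tokens: int, output_tokens: int) -> int:
    """Tokens requested when all agents fire at once."""
    return agents * (input_tokens + output_tokens)


def burst_fits(burst: int, tokens_per_minute: int) -> bool:
    """True if one burst stays within a per-minute token limit."""
    return burst <= tokens_per_minute


burst = burst_tokens(agents=5, input_tokens=1_500, output_tokens=400)
print(burst)                         # 9500
print(burst_fits(burst, 6_000))      # False: Groq free tier
print(burst_fits(burst, 1_000_000))  # True: Google free tier
```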

Three approaches I tried, in order:

Approach 1: Retry with backoff. Added 3-attempt retry with 4s/8s exponential backoff on 429 errors. Helped slightly. Didn't fix the underlying math.

Approach 2: Sequential execution with delays. Switched from asyncio.gather() to sequential calls with 1.5s gaps between agents. This spread the token burst across multiple rate-limit windows. Worked on Groq, but added ~7.5s to the masterplan pipeline — noticeable.

Approach 3: Switch to Google. Google's free tier is 1,000,000 TPM. Problem disappeared entirely. Now Groq is the fallback, not the primary.
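Approaches 1 and 2 can be sketched as follows. This is illustrative rather than Socra's code: `RateLimitError` stands in for whatever 429 exception the provider SDK raises, and `call_with_backoff`, `run_sequentially`, and the injectable `sleep` parameter are hypothetical names. The 4s/8s delays and the 1.5s gap are the values mentioned above.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


class RateLimitError(Exception):
    """Placeholder for the SDK's HTTP 429 exception (assumption)."""


async def call_with_backoff(
    call: Callable[[], Awaitable[str]],
    delays: Sequence[float] = (4.0, 8.0),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str:
    """Try once, then retry after each delay; 3 attempts with the defaults."""
    for delay in delays:
        try:
            return await call()
        except RateLimitError:
            await sleep(delay)
    # Final attempt: let a 429 propagate to the caller.
    return await call()


async def run_sequentially(
    calls: Sequence[Callable[[], Awaitable[str]]],
    gap: float = 1.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> list[str]:
    """Run agent calls one at a time, pausing between them."""
    results = []
    for index, call in enumerate(calls):
        if index > 0:
            await sleep(gap)  # spread the token burst out over time
        results.append(await call_with_backoff(call, sleep=sleep))
    return results
```

Neither helper changes the total tokens requested; they only change when the requests land, which is why switching providers was the fix that held.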

The real lesson: design for the rate limits of your fallback providers, not just your primary. Groq is fast and cheap but not meant for parallel multi-agent workloads on the free tier.

The cost analysis

After switching to Google as the production default, I did a full token and cost breakdown per session:

Stage	Input tokens	Output tokens
Conversation (7 turns avg)	~16,700	~3,500
5 specialist agents	~24,000	~3,500
Synthesis	~12,700	~2,500
Devil's advocate	~2,800	~600
Total per session	~56,200	~10,100

At Google Gemini Flash pricing ($0.075 input / $0.30 output per million tokens):

Input cost:  56,200 / 1,000,000 × $0.075 = $0.0042
Output cost: 10,100 / 1,000,000 × $0.30  = $0.0030
Total:       ~$0.007 per session

Socra charges ₹499 (~$6) for a full masterplan session. LLM cost per session: $0.007. That's about 99.9% gross margin on the LLM cost alone.

Railway hosting is ~$30/month fixed. Break-even is roughly 6 paid sessions per month.

This math only works because of the provider choice. The same session on Anthropic Haiku costs ~$0.085 — 12× more expensive, which would put margins at ~98.6%. Still fine, but the point is: provider selection is a product decision, not just a technical one.
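The per-session figures can be reproduced in a few lines. Token totals and per-million prices are the ones quoted in this post, and the $6 session price is the rounded conversion used above; `session_cost` is a hypothetical helper.

```python
def session_cost(input_tokens: int, output_tokens: int,
                 input_per_mtok: float, output_per_mtok: float) -> float:
    """LLM cost in USD for one session at per-million-token prices."""
    return (input_tokens * input_per_mtok + output_tokens * output_per_mtok) / 1_000_000


INPUT_TOKENS, OUTPUT_TOKENS = 56_200, 10_100
PRICE_USD = 6.0  # approximate USD value of the session price (assumption)

google = session_cost(INPUT_TOKENS, OUTPUT_TOKENS, 0.075, 0.30)
anthropic = session_cost(INPUT_TOKENS, OUTPUT_TOKENS, 0.80, 4.00)

for name, cost in [("google", google), ("anthropic", anthropic)]:
    margin = 1 - cost / PRICE_USD
    print(f"{name}: ${cost:.4f} per session, margin {margin:.1%}")
```

Swapping in another provider's prices is a one-line change, which makes it easy to rerun the comparison whenever pricing or token usage shifts.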

What I'd do differently

1. Design for multi-provider from day one. I added the fallback chain in Phase 3 after production broke. It should have been in the architecture from the start. The routing abstraction (_call_llm with provider detection) is simple enough to add in 30 minutes — there's no reason to start with a single provider.

2. Test the rate limit math before deploying parallel calls. 5 parallel agents × 1,900 tokens = 9,500 tokens in one burst. Groq's free tier is 6,000 TPM. This is elementary arithmetic that I didn't do until users were getting errors.

3. Strip API keys at the config layer. .strip() in your settings class is a 5-minute change that eliminates an entire class of deployment bugs.

4. Make your startup log mirror your routing logic exactly. A log that says "Using Groq" when you're actually using Google is worse than no log — it actively misleads debugging.

The full stack for context

Socra is built on: FastAPI + React + PostgreSQL + Railway + LangGraph (for the multi-agent pipeline) + Langfuse v4 (for per-call LLM observability) + Clerk (auth) + Razorpay (payments). The LLM fallback chain described here handles all LLM calls across the entire system — conversation, agents, synthesis, pitch deck generation, and the tribunal verdict scoring.

The live app is at socra-production.up.railway.app. The approach described here — OpenAI-compatible endpoints, two-call structured output, provider detection at the config layer — is all running in production today.

I'm a pre-final year student at HBTU Kanpur building production ML systems. If you're working on something similar or have questions about the multi-agent architecture, I'm on LinkedIn and GitHub.
