TL;DR: We measured failover latency across three AI gateways (Bifrost, LiteLLM, Portkey) during 30 days of production traffic at Nexus Labs. Bifrost added 11ms p99 overhead with automatic provider fallback. The model is the easy part. Routing it reliably is not.
Our agent platform at Nexus Labs handles around 2.4M LLM requests per day. Half of those hit OpenAI, the rest spread across Anthropic, Bedrock, and Vertex. When OpenAI had its 4-hour incident on April 23, we lost 38 minutes of traffic before our homegrown retry logic gave up and rerouted.
That hurt. So we replaced the retry layer.
The actual problem
Most gateway benchmarks measure throughput on the happy path, with no failures. That tells you very little about production. I care about two things. How long does a request take to recover when a provider returns 429 or 503? And how much p99 latency does the gateway add when nothing is wrong?
Our team of 9 engineers spent two weeks instrumenting three options. Same hardware (c6i.4xlarge, 2 nodes behind an NLB). Same upstream credentials. Same request distribution sampled from our actual logs.
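To make the overhead question concrete, here is a minimal Python sketch of how p50 and p99 overhead could be computed from two sets of latency samples. It assumes overhead means the gateway-path percentile minus the direct-path percentile, and it uses the nearest-rank method. The function names and sample values are hypothetical, chosen for illustration only; they are not taken from our logs or from any gateway's tooling.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def gateway_overhead(direct_ms, via_gateway_ms, q):
    """Overhead at percentile q: gateway-path percentile minus direct-path percentile."""
    return percentile(via_gateway_ms, q) - percentile(direct_ms, q)

# Hypothetical latencies in milliseconds (illustrative, not benchmark data).
direct = [100, 102, 105, 110, 400]
via_gateway = [103, 105, 108, 114, 411]

p50 = gateway_overhead(direct, via_gateway, 50)
p99 = gateway_overhead(direct, via_gateway, 99)
print(f"p50 overhead: {p50}ms, p99 overhead: {p99}ms")
```

Subtracting percentiles is not the same as taking the percentile of per-request differences. If you can pair each request with its direct-path twin, the per-request version is usually the more honest number.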
Setup
Each gateway sat between our agent service and four providers. We configured identical fallback chains: OpenAI primary, Anthropic secondary, Bedrock tertiary. Cache disabled. Rate limits set to mirror our prod allocation.
Here's the Bifrost config we used:
```yaml
providers:
  openai:
    keys:
      - value: env.OPENAI_API_KEY
        weight: 1.0
  anthropic:
    keys:
      - value: env.ANTHROPIC_API_KEY
        weight: 1.0
  bedrock:
    keys:
      - value: env.AWS_BEDROCK_KEY
        weight: 1.0
fallbacks:
  - provider: openai
    model: gpt-4o
    fallback_to:
      - provider: anthropic
        model: claude-sonnet-4
      - provider: bedrock
        model: anthropic.claude-sonnet-4
```
Documented behavior is at https://docs.getbifrost.ai/features/retries-and-fallbacks. LiteLLM and Portkey have equivalent configs. Different YAML shape, same semantics.
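For readers who think better in code than in YAML, the semantics of that fallback chain can be sketched roughly like this. This is not how Bifrost, LiteLLM, or Portkey implement it. It is a simplified model that assumes one retry per target before switching, and treats only 429 and 503 as retryable. `Target`, `call_with_fallback`, and `fake_send` are hypothetical names.

```python
from dataclasses import dataclass

# Assumption: only these statuses trigger a retry or a switch to the next target.
RETRYABLE = {429, 503}

@dataclass
class Target:
    provider: str
    model: str

def call_with_fallback(chain, send, retries_per_target=1):
    """Try each target in order; return (target, body, attempts) on the first 200."""
    attempts = []
    for target in chain:
        for _ in range(1 + retries_per_target):
            status, body = send(target)
            attempts.append((target.provider, status))
            if status == 200:
                return target, body, attempts
            if status not in RETRYABLE:
                raise RuntimeError(f"{target.provider} returned non-retryable status {status}")
    raise RuntimeError("all targets in the fallback chain failed")

CHAIN = [
    Target("openai", "gpt-4o"),
    Target("anthropic", "claude-sonnet-4"),
    Target("bedrock", "anthropic.claude-sonnet-4"),
]

def fake_send(target):
    """Stand-in for a real provider call: simulates OpenAI being down."""
    if target.provider == "openai":
        return 503, ""
    return 200, f"answer from {target.provider}"

served_by, body, attempts = call_with_fallback(CHAIN, fake_send)
print(served_by.provider, attempts)
```

The caller only sees the final response. The attempts log is what you want in your metrics, because it tells you how often the primary is actually serving traffic.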
Results
We ran 720 hours of mirrored traffic. Numbers below are from the actual logs, not synthetic load.
| Gateway | p50 overhead | p99 overhead | Failover time (provider down) | Memory at 1k RPS |
|---|---|---|---|---|
| Bifrost | 3ms | 11ms | 180ms (one retry + switch) | 412 MB |
| LiteLLM | 8ms | 41ms | 620ms | 890 MB |
| Portkey (self-hosted) | 6ms | 29ms | 340ms | 650 MB |
Bifrost is written in Go. LiteLLM is Python with FastAPI. That accounts for most of the gap on the hot path. Not all of it. Bifrost's fallback chain evaluates synchronously without re-queuing the request, which matters when you're already on retry attempt two.
Portkey was solid, but the self-hosted version lagged behind its managed offering on features. LiteLLM's killer feature for our team was richer support for custom cost-tracking callbacks. We still use those for finance reporting.
What we used Bifrost for
Three things, specifically.
Fallback routing. When OpenAI returns 429, the request goes to Anthropic with the equivalent model. Our agent code never knows. Docs at https://docs.getbifrost.ai/features/retries-and-fallbacks.
Semantic caching. For our evaluation harness specifically. We replay 18,000 prompts against new model versions nightly. Cache hit rate is 73% because the evaluation suite asks the same questions repeatedly. That's around 13k requests we don't pay for each night. Reference: https://docs.getbifrost.ai/features/semantic-caching.
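If semantic caching is new to you, the idea can be sketched in a few lines of Python. This is a toy model, not Bifrost's implementation. It assumes a bag-of-words "embedding" and a cosine-similarity threshold of 0.8, both picked only for illustration; a real cache would use a learned embedding model and a vector index. `SemanticCache` and `fake_model` are hypothetical names.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (assumption, for illustration)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)
        self.hits = 0
        self.misses = 0

    def get_or_call(self, prompt, call_model):
        """Return a cached response for a similar-enough prompt, else call the model."""
        vec = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            self.hits += 1
            return best[1]
        self.misses += 1
        response = call_model(prompt)
        self.entries.append((vec, response))
        return response

calls = []

def fake_model(prompt):
    """Stand-in for a paid LLM call; records every prompt it actually receives."""
    calls.append(prompt)
    return f"response #{len(calls)}"

cache = SemanticCache(threshold=0.8)
first = cache.get_or_call("What is the capital of France?", fake_model)
second = cache.get_or_call("what is the capital of france?", fake_model)
third = cache.get_or_call("Summarize this incident report", fake_model)

# The arithmetic behind "around 13k" in the paragraph above.
avoided_per_night = round(18_000 * 0.73)
print(cache.hits, cache.misses, avoided_per_night)
```

The threshold is the knob that matters. Set it too low and an evaluation harness will happily score a cached answer to a different question.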
Prometheus metrics. Native export. We already had a Prom stack. Five-minute integration. The default dashboards aren't great but the metrics themselves are useful. Reference: https://docs.getbifrost.ai/features/observability/default.
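If you are new to Prometheus histograms, it helps to know how a p99 gets derived from exported buckets. The sketch below mirrors the linear interpolation that PromQL's `histogram_quantile` performs, in plain Python. The bucket boundaries and counts are hypothetical, not real gateway output, and this is a simplified illustration rather than a replacement for the PromQL function.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank,
    assuming the lowest bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative latency buckets in seconds: (le, count).
latency_buckets = [
    (0.005, 900),
    (0.01, 980),
    (0.025, 995),
    (0.05, 1000),
    (math.inf, 1000),
]

p99_seconds = histogram_quantile(0.99, latency_buckets)
print(f"estimated p99: {p99_seconds * 1000:.1f}ms")
```

The estimate is only as fine-grained as the bucket boundaries, which is worth remembering when comparing a dashboard p99 against numbers computed from raw logs.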
What we did not use
MCP gateway, governance, SSO. Our auth sits in front of the gateway, not inside it. The custom plugins interface looked interesting but we haven't needed one yet.
Trade-offs and Limitations
Bifrost is younger than LiteLLM. The provider list is wide (23+) but if you need a niche provider, check the docs first. The plugin interface is straightforward so you can add one yourself, but that's still work.
The web UI is decent for initial setup, but it is not where you want to manage complex governance. Configure things in YAML and version them in git like anything else.
If you're already deep in LiteLLM and using its callback ecosystem, migration cost is real. LiteLLM has more community integrations because it's been around longer. Portkey is also a fine choice if you want a managed control plane and don't want to operate a gateway yourself. Pick based on what your team will actually maintain.
Last caveat. The numbers above are from our workload. Your traffic shape will differ. Run the test yourself before deciding.
Further Reading
- Bifrost retries and fallbacks: https://docs.getbifrost.ai/features/retries-and-fallbacks
- Bifrost semantic caching: https://docs.getbifrost.ai/features/semantic-caching
- Bifrost observability: https://docs.getbifrost.ai/features/observability/default
- Bifrost provider configuration: https://docs.getbifrost.ai/quickstart/gateway/provider-configuration
- Bifrost source: https://github.com/maximhq/bifrost
The model is the easy part. Routing it under failure is the hard part. Spend the time on the boring infrastructure problem.