How I Cut My LLM Costs by 90% Without Changing My App Logic
There’s a particular kind of dread that comes with checking your OpenAI billing dashboard mid-month.
I’ve been building a news automation hub that runs 14 editorial workspaces — summarizing, rewriting, fact-checking, SEO-tagging, and translation pipelines around the clock.
The AI layer was already fairly optimized. It already spread requests across:
- Groq
- Gemini Flash
- DeepSeek
- OpenRouter
On top of that sat provider rotation and fallback logic.
But the final fallback was still OpenAI, and once rate limits hit, costs climbed faster than expected.
What I needed wasn’t more routing logic.
I needed a smarter endpoint.
The Problem
My setup already rotated between multiple providers, but the architecture had a weakness:
Provider exhausted
-> fallback
-> OpenAI
-> credits disappear
The more providers I added, the messier things became:
- more API keys
- more retry logic
- more conditional branches
- more provider-specific handling
I was optimizing infrastructure with application code.
That was the mistake.
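To make the pattern concrete, here is a minimal sketch of an application-level fallback chain that ends in a paid provider. The provider records, the `available` flag, and the per-call price are hypothetical placeholders, not figures from this setup.

```typescript
// Hypothetical provider record: names, flags, and prices are illustrative only.
type Provider = { name: string; available: boolean; costPerCallUsd: number };

// Walk the free providers in order; if all are exhausted, fall through to the paid one.
function pickWithFallback(freeChain: Provider[], paidFallback: Provider): Provider {
  for (const p of freeChain) {
    if (p.available) return p;
  }
  return paidFallback;
}

// Spend for a batch of calls made while the chain is in a given state.
function batchCostUsd(calls: number, freeChain: Provider[], paidFallback: Provider): number {
  return calls * pickWithFallback(freeChain, paidFallback).costPerCallUsd;
}

const paid: Provider = { name: "openai", available: true, costPerCallUsd: 0.01 };
const healthy: Provider[] = [
  { name: "groq", available: true, costPerCallUsd: 0 },
  { name: "openrouter", available: true, costPerCallUsd: 0 },
];
// The same chain after every free provider has hit its rate limit.
const exhausted: Provider[] = healthy.map((p) => ({ ...p, available: false }));
```

With `healthy`, a batch costs nothing. With `exhausted`, every call lands on the paid fallback, which is the failure mode described above.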
The Fix
After digging through self-hosted AI tooling, I found freellmapi.
It’s a lightweight OpenAI-compatible proxy that automatically routes requests across multiple free-tier LLM providers:
- Groq
- Cerebras
- SambaNova
- Cloudflare Workers AI
- GitHub Models
- OpenRouter free models
- and others
Combined free-tier capacity: roughly 800M tokens/month.
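As a rough sanity check on that figure, the arithmetic below spreads 800M tokens/month over a 30-day month and over the 14 workspaces mentioned earlier. The 30-day month and the even split are simplifying assumptions. Free-tier limits are typically enforced per provider and per minute or per day, so the usable amount is likely lower in practice.

```typescript
const MONTHLY_FREE_TOKENS = 800000000; // figure quoted above
const DAYS_PER_MONTH = 30;             // simplifying assumption
const WORKSPACES = 14;                 // from the setup described earlier

// Whole tokens available per day if the monthly budget were spread evenly.
function tokensPerDay(monthly: number, days: number): number {
  return Math.floor(monthly / days);
}

// Whole tokens per workspace per day under an even split.
function tokensPerWorkspacePerDay(monthly: number, days: number, workspaces: number): number {
  return Math.floor(monthly / days / workspaces);
}

const perDay = tokensPerDay(MONTHLY_FREE_TOKENS, DAYS_PER_MONTH); // about 26.7 million
const perWorkspace = tokensPerWorkspacePerDay(
  MONTHLY_FREE_TOKENS,
  DAYS_PER_MONTH,
  WORKSPACES
); // about 1.9 million
```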
The interesting part is that the routing happens inside the proxy — not inside your app.
My Integration
The integration took less than an hour.
1. Deploy the proxy
I ran it on my existing VPS:
- Node.js 20
- ~40MB idle RAM
- localhost only
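Because the proxy only listens on localhost, one cheap safeguard is for the app to refuse to start if its configured base URL is not a loopback address. The sketch below shows that check; `isLoopbackUrl` and `assertLocalProxy` are names made up for this example.

```typescript
// Hypothetical helper: true only for http(s) URLs whose host is a loopback address.
function isLoopbackUrl(url: string): boolean {
  return /^https?:\/\/(localhost|127\.0\.0\.1|\[::1\])(:\d+)?(\/|$)/i.test(url);
}

// Return the URL unchanged if it is local, otherwise fail loudly at startup.
function assertLocalProxy(baseURL: string): string {
  if (!isLoopbackUrl(baseURL)) {
    throw new Error(`Refusing to send API traffic to non-local proxy: ${baseURL}`);
  }
  return baseURL;
}
```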
2. Add provider credentials
In the admin panel, I added:
- Groq key
- Cloudflare credentials
- OpenRouter key
3. Point my app to a single endpoint
    import OpenAI from "openai";

    // Send all requests to the local proxy instead of api.openai.com.
    const client = new OpenAI({
      baseURL: "http://localhost:3001/v1",
      apiKey: process.env.LOCAL_ROUTER_KEY,
    });
That was basically it.
The important detail:
I stopped specifying models for non-critical tasks.
Instead of forcing a specific provider, I let the proxy auto-route requests to whatever free provider was currently available.
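In request terms, that might look like the sketch below. It builds a standard OpenAI-style `/chat/completions` request and only pins a model when the caller asks for one. Whether freellmapi expects the `model` field to be omitted or set to some special value for auto-routing is an assumption here, so check the project's documentation. `buildChatRequest` and the `"local-key"` value are hypothetical.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

type ChatRequest = {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
};

// Build a fetch()-ready request for an OpenAI-compatible endpoint.
// Assumption: leaving out `model` lets the proxy choose a provider.
function buildChatRequest(
  baseURL: string,
  apiKey: string,
  messages: ChatMessage[],
  model?: string
): ChatRequest {
  const body: { messages: ChatMessage[]; model?: string } = { messages };
  if (model !== undefined) body.model = model; // pin only when explicitly requested
  return {
    url: baseURL.replace(/\/+$/, "") + "/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify(body),
    },
  };
}

// A non-critical task: no model pinned, so routing is left to the proxy.
const routed = buildChatRequest("http://localhost:3001/v1", "local-key", [
  { role: "user", content: "Summarize this article in two sentences." },
]);
```

The result can be passed straight to `fetch(routed.url, routed.init)`.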
App
-> freellmapi
   -> Groq
   -> Cloudflare Workers AI
   -> Cerebras
   -> SambaNova
   -> OpenRouter
If Groq rate-limited a request, another provider picked it up.
If a provider became slow, routing shifted away from it automatically.
My application code never needed to know.
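Conceptually, this behavior resembles the selection logic sketched here: skip any provider that is cooling down after a rate limit, then prefer the one with the lowest recent latency. This illustrates the idea only and is not freellmapi's actual implementation. The field names, cooldown length, and latency numbers are invented.

```typescript
type ProviderState = {
  name: string;
  cooldownUntilMs: number; // set after a rate-limit response; 0 means not limited
  avgLatencyMs: number;    // rolling average of recent response times
};

// Pick the fastest provider that is not currently cooling down, or null if none.
function chooseProvider(states: ProviderState[], nowMs: number): string | null {
  const usable = states.filter((s) => s.cooldownUntilMs <= nowMs);
  if (usable.length === 0) return null;
  usable.sort((a, b) => a.avgLatencyMs - b.avgLatencyMs);
  return usable[0].name;
}

// Return a copy of the state that is excluded from routing for `cooldownMs`.
function markRateLimited(state: ProviderState, nowMs: number, cooldownMs: number): ProviderState {
  return { ...state, cooldownUntilMs: nowMs + cooldownMs };
}

const pool: ProviderState[] = [
  { name: "groq", cooldownUntilMs: 0, avgLatencyMs: 200 },
  { name: "cerebras", cooldownUntilMs: 0, avgLatencyMs: 300 },
  { name: "sambanova", cooldownUntilMs: 0, avgLatencyMs: 900 },
];

// Simulate Groq returning a rate-limit error at t = 1000 ms.
const afterLimit = [markRateLimited(pool[0], 1000, 60000), pool[1], pool[2]];
```

Calling `chooseProvider(afterLimit, 1000)` skips Groq until its cooldown expires, and the caller never sees the difference.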
The Result
Within 24 hours:
- OpenAI usage dropped by ~90%
- background AI tasks became almost entirely free-tier
- no additional retry logic was needed
Most importantly:
I removed provider chaos from my application layer.
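For anyone who wants to track the same number, the drop is simply the relative change in paid usage between two comparable periods. The request counts below are placeholders, not measurements from this deployment.

```typescript
// Relative reduction from `before` to `after`, as a percentage.
function percentDrop(before: number, after: number): number {
  if (before <= 0) throw new Error("`before` must be positive");
  return ((before - after) / before) * 100;
}

// Placeholder counts: paid requests per day before and after the change.
const drop = percentDrop(10000, 1000); // about 90
```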
What I Learned
When engineers hit rate limits, the instinct is usually:
- add more providers
- add more fallback logic
- add more code
But sometimes the better solution is adding an abstraction layer that absorbs the complexity for you.
Another realization:
Most AI tasks do not require a specific premium model.
For:
- summaries
- tagging
- drafts
- translations
- background enrichment
…almost any decent modern 70B model works fine.
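One way to encode that split is a small lookup from task type to routing policy, pinning a model only where quality really matters. Most task names mirror the list above. Treating `fact-check` as the critical task and the model identifier itself are hypothetical choices for illustration.

```typescript
type Task = "summary" | "tagging" | "draft" | "translation" | "enrichment" | "fact-check";

// Hypothetical identifier; substitute whatever premium model you actually rely on.
const PINNED_MODEL = "premium-model-placeholder";

// Assumption: only these tasks justify a specific model.
const CRITICAL_TASKS: Task[] = ["fact-check"];

// Returns a model name for critical tasks, or undefined to let the router decide.
function modelFor(task: Task): string | undefined {
  return CRITICAL_TASKS.indexOf(task) !== -1 ? PINNED_MODEL : undefined;
}
```

Returning `undefined` for everything else keeps the default path on whatever free-tier model is available.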
Caveats
Free-tier infrastructure has tradeoffs.
Some providers:
- have cold starts
- introduce latency spikes
- become temporarily unavailable
For real-time user-facing chat systems, you should test failover carefully.
For async pipelines and batch jobs, though, it’s been surprisingly solid.
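Testing failover carefully mostly means looking at tail latency rather than averages, since one slow provider can hide behind a healthy mean. A minimal nearest-rank percentile helper is sketched below; the sample latencies are made up.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Made-up response times in milliseconds, including one cold-start outlier.
const latenciesMs = [180, 210, 190, 2400, 220, 200, 230, 195, 205, 215];
const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
```

Here the median looks fine while the 95th percentile exposes the outlier, which is the kind of gap a user-facing chat would feel.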
Also, run this on infrastructure you control.
A proxy like this holds your upstream API keys, so don’t hand that responsibility to a random hosted service.
Final Thought
The biggest optimization wasn’t changing models.
It was removing complexity from the layer that had to manage them.