How I Cut My LLM Costs by 90% Without Changing My App Logic

There’s a particular kind of dread that comes with checking your OpenAI billing dashboard mid-month.

I’ve been building a news automation hub that runs 14 editorial workspaces — summarizing, rewriting, fact-checking, SEO-tagging, and translation pipelines around the clock.

The AI layer was already fairly optimized:

Groq
Gemini Flash
DeepSeek
OpenRouter
provider rotation
fallback logic

But the final fallback was still OpenAI, and once rate limits hit, costs climbed faster than expected.

What I needed wasn’t more routing logic.

I needed a smarter endpoint.

The Problem

My setup already rotated between multiple providers, but the architecture had a weakness:

Provider exhausted
    -> fallback
        -> OpenAI
            -> credits disappear

The more providers I added, the messier things became:

more API keys
more retry logic
more conditional branches
more provider-specific handling
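
To make that growth concrete, here is a hypothetical sketch of an app-level fallback chain. The `Provider` shape, the `CallResult` type, and `completeWithFallback` are all illustrative names, not code from my project, and the calls are synchronous only to keep the example short (real provider calls would be async).

```typescript
// Hypothetical sketch: none of these names come from a real SDK.
type CallResult =
  | { status: "ok"; text: string }
  | { status: "rate_limited" }
  | { status: "error"; message: string };

type Provider = {
  name: string;
  // In a real app each provider needs its own wrapper, key, and error mapping.
  call: (prompt: string) => CallResult;
};

// The application owns the ordering and the failure handling.
function completeWithFallback(providers: Provider[], prompt: string): string {
  for (const provider of providers) {
    const result = provider.call(prompt);
    if (result.status === "ok") return result.text;
    if (result.status === "rate_limited") continue; // exhausted: try the next one
    throw new Error(`${provider.name} failed: ${result.message}`);
  }
  throw new Error("all providers exhausted");
}
```

Every new provider adds another entry, another wrapper, and another set of error cases to this loop, and the last entry in the list is the one that costs money.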

I was optimizing infrastructure with application code.

That was the mistake.

The Fix

After digging through self-hosted AI tooling, I found freellmapi.

It’s a lightweight OpenAI-compatible proxy that automatically routes requests across multiple free-tier LLM providers:

Groq
Cerebras
SambaNova
Cloudflare Workers AI
GitHub Models
OpenRouter free models
and others

Combined free-tier capacity: roughly 800M tokens/month.
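
For a sense of scale, the figure can be turned into a rough per-workspace budget. This is back-of-the-envelope arithmetic only: the 800M and the 14 workspaces come from this post, while the 30-day month and the even split are assumptions. Real free tiers are capped per provider (often per minute or per day), so the usable number will be lower.

```typescript
// Figures from the article: ~800M free-tier tokens/month across 14 workspaces.
const MONTHLY_FREE_TOKENS = 800_000_000;
const WORKSPACES = 14;
const DAYS_PER_MONTH = 30; // assumption: a 30-day month

// Idealized even split; ignores per-provider and per-minute limits.
function dailyBudgetPerWorkspace(
  monthlyTokens: number,
  workspaces: number,
  days: number
): number {
  return Math.floor(monthlyTokens / days / workspaces);
}

const perWorkspace = dailyBudgetPerWorkspace(
  MONTHLY_FREE_TOKENS,
  WORKSPACES,
  DAYS_PER_MONTH
);
console.log(`~${perWorkspace} tokens per workspace per day`);
```

Under those assumptions each workspace gets roughly 1.9 million tokens a day.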

The interesting part is that the routing happens inside the proxy — not inside your app.

My Integration

The integration took less than an hour.

1. Deploy the proxy

I ran it on my existing VPS:

Node.js 20
~40MB idle RAM
localhost only

2. Add provider credentials

I added:

Groq key
Cloudflare credentials
OpenRouter key

inside the admin panel.

3. Point my app to a single endpoint

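import OpenAI from "openai";

// The official OpenAI SDK, pointed at the local proxy instead of api.openai.com.
const client = new OpenAI({
  baseURL: "http://localhost:3001/v1", // proxy running on the same VPS
  apiKey: process.env.LOCAL_ROUTER_KEY
});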

That was basically it.

The important detail:

I stopped specifying models for non-critical tasks.

Instead of forcing a specific provider, I let the proxy auto-route requests to whatever free provider was currently available.
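
As a rough illustration of what "not specifying a model" means on the wire, the sketch below builds an OpenAI-style chat completion request with an optional `model` field. The `/chat/completions` path and the Bearer header follow the OpenAI API convention. Whether the proxy accepts a request with no `model` at all, or expects a placeholder value, is an assumption to check against its documentation (the official OpenAI SDK's types require `model`). `buildChatRequest` is a hypothetical helper.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

type ChatRequest = {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
};

// Hypothetical helper: builds the request without sending it.
function buildChatRequest(
  baseURL: string,
  apiKey: string,
  messages: ChatMessage[],
  model?: string
): ChatRequest {
  const payload: { messages: ChatMessage[]; model?: string } = { messages };
  // Assumption: leaving `model` out lets the proxy choose a provider.
  if (model !== undefined) payload.model = model;
  return {
    url: `${baseURL.replace(/\/+$/, "")}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(payload),
  };
}
```

The result can be passed to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.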

App
  -> freellmapi
      -> Groq
      -> Cloudflare Workers AI
      -> Cerebras
      -> SambaNova
      -> OpenRouter

If Groq rate-limited a request, another provider picked it up.

If a provider became slow, routing shifted automatically.

My application code never needed to know.
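
The behavior described above can be approximated with a small selection function. This is a conceptual sketch of one way a router could do it, not freellmapi's actual implementation: the `ProviderState` shape, the cooldown idea, and the latency preference are all assumptions made for illustration.

```typescript
// Hypothetical router state; field names are illustrative.
type ProviderState = {
  name: string;
  cooldownUntil: number; // epoch ms; 0 means never rate-limited
  avgLatencyMs: number;
};

// Pick the fastest provider that is not cooling down, or null if none is free.
function pickProvider(
  providers: ProviderState[],
  now: number
): ProviderState | null {
  const available = providers.filter((p) => p.cooldownUntil <= now);
  if (available.length === 0) return null;
  return available.reduce((best, p) =>
    p.avgLatencyMs < best.avgLatencyMs ? p : best
  );
}

// Call this when a provider answers with a rate-limit response (e.g. HTTP 429).
function markRateLimited(p: ProviderState, now: number, cooldownMs: number): void {
  p.cooldownUntil = now + cooldownMs;
}
```

Because this logic sits behind the endpoint, the application only ever sees one URL and one response.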

The Result

Within 24 hours:

OpenAI usage dropped by ~90%
background AI tasks became almost entirely free-tier
no additional retry logic was needed

Most importantly:
I removed provider chaos from my application layer.

What I Learned

When engineers hit rate limits, the instinct is usually:

add more providers
add more fallback logic
add more code

But sometimes the better solution is adding an abstraction layer that absorbs the complexity for you.

Another realization:

Most AI tasks do not require a specific premium model.

For:

summaries
tagging
drafts
translations
background enrichment

…almost any decent modern 70B model works fine.
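
One way to encode that split in an app is a tiny lookup that pins a model only for tasks you consider critical. This is a sketch under assumptions: the task names echo the list above, treating fact-checking as the critical one is my illustration rather than a rule, and `PINNED_MODEL` is a placeholder, not a real model identifier.

```typescript
type Task =
  | "summary"
  | "tagging"
  | "draft"
  | "translation"
  | "enrichment"
  | "fact-check";

// Placeholder: substitute whichever premium model you actually trust.
const PINNED_MODEL = "premium-model-placeholder";

// Assumption: only fact-checking is treated as critical in this sketch.
const CRITICAL_TASKS: ReadonlySet<Task> = new Set<Task>(["fact-check"]);

// Returns a model to pin, or undefined to leave the choice to the router.
function modelFor(task: Task): string | undefined {
  return CRITICAL_TASKS.has(task) ? PINNED_MODEL : undefined;
}
```

Keeping the decision in one function means moving a task between tiers is a one-line change.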

Caveats

Free-tier infrastructure has tradeoffs.

Some providers:

have cold starts
introduce latency spikes
become temporarily unavailable

For real-time user-facing chat systems, you should test failover carefully.

For async pipelines and batch jobs, though, it’s been surprisingly solid.
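
If you do want to evaluate it for user-facing traffic, a simple starting point is to collect latency samples through the proxy and look at the tail rather than the average. The sketch below uses the nearest-rank percentile method. The 95th percentile and the budget value are arbitrary choices for illustration, and `meetsLatencyBudget` is a hypothetical helper.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of
// samples at or below it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical check: does the p95 latency fit the budget you can tolerate?
function meetsLatencyBudget(samplesMs: number[], budgetMs: number): boolean {
  return percentile(samplesMs, 95) <= budgetMs;
}
```

A single cold start in a small sample is enough to fail the check, which is exactly the kind of spike an average would hide.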

Also, run this on infrastructure you control.

A proxy like this handles upstream API keys — don’t hand that responsibility to random hosted services.
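
A cheap guard for the "localhost only" setup from step 1 is to refuse to start when the configured base URL is not a loopback address. This is a minimal sketch, and `isLoopbackURL` is a hypothetical helper. It is deliberately stricter than "infrastructure you control", since a private host you own would also be rejected.

```typescript
// Hostnames the WHATWG URL parser reports for loopback addresses.
const LOOPBACK_HOSTS = new Set(["localhost", "127.0.0.1", "[::1]"]);

// True only when the URL points at this machine.
function isLoopbackURL(baseURL: string): boolean {
  return LOOPBACK_HOSTS.has(new URL(baseURL).hostname);
}

// Example startup check using the endpoint from this article.
const routerURL = "http://localhost:3001/v1";
if (!isLoopbackURL(routerURL)) {
  throw new Error(`refusing to send provider traffic to ${routerURL}`);
}
```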

Final Thought

The biggest optimization wasn’t changing models.

It was removing complexity from the layer that had to manage them.
