Stop guessing your AI bill: one endpoint for GPT-5.5, Claude & Gemini at a flat per-call price

If you build on top of LLMs, you've probably hit this: you ship a feature, traffic spikes, and the API bill comes back way higher than you expected. Per-token pricing makes costs hard to predict — you're billed by how verbose the model is, not by the value you ship.

I got tired of that (plus juggling three API keys), so here's a setup that fixes both: one OpenAI-compatible endpoint that auto-picks the best model and charges a flat price per call.

The core idea

Instead of calling each provider directly, you point your existing OpenAI SDK at a single gateway and send one model name: modelis-auto. It routes each request to the best model for the task (GPT-5.5, Claude Opus 4.8, Gemini 3.1, Grok, DeepSeek…) and bills a flat per-call rate — so your cost is predictable regardless of which model handled it.

Zero migration: just change base_url

If you already use the OpenAI SDK, this is a one-line change.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELIS_KEY",
    base_url="https://modelishub.com/v1",   # the only change
)

resp = client.chat.completions.create(
    model="modelis-auto",                    # let it pick the best model
    messages=[{"role": "user", "content": "Explain CRDTs in two sentences."}],
)
print(resp.choices[0].message.content)

Or with curl:

curl https://modelishub.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MODELIS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"modelis-auto","messages":[{"role":"user","content":"Hi"}]}'

That's it. Your existing code, SDKs, and OpenAI-compatible tools keep working.

"But which model actually answered?"

Fair question — auto-routing shouldn't be a black box. Every response returns a header telling you exactly which model handled the request:

X-Modelis-Routed-Model: claude-opus-4-8

And if you want control, you can stay in a quality tier or call a specific model directly:

model: "modelis-auto:premium"     # stay in a quality tier
model: "gpt-5.5"                   # or pin a specific model

Why flat per-call instead of per-token

The point isn't "cheaper than everyone" — it's predictable. With a flat per-call price:

A verbose response doesn't cost more than a terse one.
A busy day scales with calls, not with token noise.
You can actually budget, and price your own product with confidence.

Honest take: when per-token is still fine

If your workload is steady, you control prompt/response sizes tightly, and you've already optimized model choice per route, per-token billing can be cheaper. Flat per-call shines when traffic is bursty, prompts vary, or you just don't want to babysit model selection and cost. Pick what fits your reality.

Try it

There's a free tier: modelishub.com. I'd genuinely love feedback — especially whether predictable pricing actually matters for how you build, or if you prefer per-token control.

推荐订阅源

DEV Community