I A/B tested 4 LLMs on the same 500 queries. The results surprised me.

I see a lot of claims about which model is "best." Best at what? For whom? At what cost?

I got tired of guessing. So I ran my own comparison.

The setup
I took 500 real queries from my production logs – a mix of:

Code generation (120 queries)

Document summarization (150 queries)

Question answering (180 queries)

Creative writing (50 queries)

I ran each query through four models using the same prompt, same temperature (0.7), same everything.

The models:

DeepSeek-V4 Pro

Kimi 2.6

MiniMax 2.7

Qwen3 235B

I used NovaStack as the gateway – one API endpoint that let me switch models by changing one parameter. Saved me from writing integration code for four different providers.

What I measured
Response time (end-to-end latency)

Cost per query

Accuracy (human-rated on a 1-5 scale, two reviewers)

The surprising results
Fastest model: DeepSeek-V4 Pro (avg 1.8s). Qwen3 was slowest (avg 4.2s) – not surprising given its size.

Cheapest model: MiniMax 2.7 (40% cheaper than DeepSeek on similar tasks).

Most accurate overall: Qwen3 235B (4.3/5). But here's the catch – it wasn't best at everything.

Task type Best model Runner-up
Code generation DeepSeek-V4 Pro (4.6) Qwen3 (4.2)
Long doc summarization Kimi 2.6 (4.7) Qwen3 (4.1)
QA (short context) DeepSeek (4.4) MiniMax (4.2)
Creative writing Qwen3 (4.5) Kimi (4.0)
The biggest surprise: No single model won more than 45% of the task categories. The "best" model depends entirely on what you're doing.

What this means for real-world use
If you're building a production system, picking one model leaves performance on the table.

I now route based on task type:

text
Code task → DeepSeek-V4 Pro
Long document → Kimi 2.6

Image-related → MiniMax 2.7
Complex reasoning → Qwen3 235B
Everything else → DeepSeek (fast + cheap)
What broke during testing
Rate limits were inconsistent – Some models throttled me after 50 requests/minute, others after 200. I had to add per-model rate limiters.

Streaming latency hid real performance – One model sent the first token in 200ms but took 5 seconds to finish. Another took 1s to start but finished in 2s total. Measure end-to-end, not time-to-first-token.

Model responses vary in length – Even with the same prompt, Qwen3 wrote 30% longer responses than MiniMax. This affects cost and user experience.

Human rating is expensive – Two reviewers spent 6 hours rating 500 responses. Worth doing once, but not weekly.

If you want to run your own test
NovaStack (the gateway I used) offers new users credits at novapai.ai/en-US/. Enough to run a few hundred queries through all four models.

The script I used is simple:

python
models = ["deepseek-v4-pro", "kimi-2.6", "minimax-2.7", "qwen3-235b"]
results = []

for model in models:
start = time.time()
response = requests.post(
"https://api.novapai.ai/v1/chat/completions",
headers={"Authorization": f"Bearer {KEY}"},
json={"model": model, "messages": messages}
)
latency = time.time() - start
results.append({"model": model, "latency": latency, "response": response.text})
Questions for the community
What task types have you found surprising differences between models? I want to expand my benchmark.

How do you handle per-model rate limits in production? My simple retry-with-backoff feels inadequate.

Has anyone tried dynamic routing based on real-time cost/latency? Curious if that's worth the complexity.

I'll share the full benchmark dataset and rating rubric if there's interest. Just comment or DM.

推荐订阅源

DEV Community