I see a lot of claims about which model is "best." Best at what? For whom? At what cost?
I got tired of guessing. So I ran my own comparison.
The setup
I took 500 real queries from my production logs – a mix of:
Code generation (120 queries)
Document summarization (150 queries)
Question answering (180 queries)
Creative writing (50 queries)
I ran each query through four models using the same prompt, same temperature (0.7), same everything.
The models:
DeepSeek-V4 Pro
Kimi 2.6
MiniMax 2.7
Qwen3 235B
I used NovaStack as the gateway – one API endpoint that let me switch models by changing one parameter. Saved me from writing integration code for four different providers.
What I measured
Response time (end-to-end latency)
Cost per query
Accuracy (human-rated on a 1-5 scale, two reviewers)
The surprising results
Fastest model: DeepSeek-V4 Pro (avg 1.8s). Qwen3 was slowest (avg 4.2s) – not surprising given its size.
Cheapest model: MiniMax 2.7 (40% cheaper than DeepSeek on similar tasks).
Most accurate overall: Qwen3 235B (4.3/5). But here's the catch – it wasn't best at everything.
Task type Best model Runner-up
Code generation DeepSeek-V4 Pro (4.6) Qwen3 (4.2)
Long doc summarization Kimi 2.6 (4.7) Qwen3 (4.1)
QA (short context) DeepSeek (4.4) MiniMax (4.2)
Creative writing Qwen3 (4.5) Kimi (4.0)
The biggest surprise: No single model won more than 45% of the task categories. The "best" model depends entirely on what you're doing.
What this means for real-world use
If you're building a production system, picking one model leaves performance on the table.
I now route based on task type:
text
Code task → DeepSeek-V4 Pro
Long document → Kimi 2.6
Image-related → MiniMax 2.7
Complex reasoning → Qwen3 235B
Everything else → DeepSeek (fast + cheap)
What broke during testing
Rate limits were inconsistent – Some models throttled me after 50 requests/minute, others after 200. I had to add per-model rate limiters.
Streaming latency hid real performance – One model sent the first token in 200ms but took 5 seconds to finish. Another took 1s to start but finished in 2s total. Measure end-to-end, not time-to-first-token.
Model responses vary in length – Even with the same prompt, Qwen3 wrote 30% longer responses than MiniMax. This affects cost and user experience.
Human rating is expensive – Two reviewers spent 6 hours rating 500 responses. Worth doing once, but not weekly.
If you want to run your own test
NovaStack (the gateway I used) offers new users credits at novapai.ai/en-US/. Enough to run a few hundred queries through all four models.
The script I used is simple:
python
models = ["deepseek-v4-pro", "kimi-2.6", "minimax-2.7", "qwen3-235b"]
results = []
for model in models:
start = time.time()
response = requests.post(
"https://api.novapai.ai/v1/chat/completions",
headers={"Authorization": f"Bearer {KEY}"},
json={"model": model, "messages": messages}
)
latency = time.time() - start
results.append({"model": model, "latency": latency, "response": response.text})
Questions for the community
What task types have you found surprising differences between models? I want to expand my benchmark.
How do you handle per-model rate limits in production? My simple retry-with-backoff feels inadequate.
Has anyone tried dynamic routing based on real-time cost/latency? Curious if that's worth the complexity.
I'll share the full benchmark dataset and rating rubric if there's interest. Just comment or DM.




















