






















The previous project in this series built a World Cup odds movement monitor with Claude Code and claude-fable-5.
That project answered one question:
Can Claude Code build a monitoring pipeline and use
claude-fable-5to summarize odds alerts as valid JSON?
The next question is more important for production:
What happens when the model fails?
So this third project turns the odds monitor into a multi-model alert router.
Instead of trusting one model, we send the same structured task through several routes on Crazyrouter:
claude-fable-5gpt-5.5qwen-plusgemini-2.5-flashThen we measure:
This is still an analytics engineering demo. It is not betting advice.
Most AI examples stop at a successful single model call.
That is not enough for real systems.
If your application depends on structured output, the real question is not:
Which model sounds smartest?
The real question is:
Which model returns a usable object for this exact workflow?
For an odds alert dashboard, the output must be machine-readable. A beautiful paragraph is not enough. The application needs valid JSON with expected keys.
So the router treats these as failures:
That is the difference between a demo and a production workflow.
The input comes from the previous odds movement monitor.
The Python script converted decimal odds into implied probability changes and flagged movements above a threshold.
Example alerts:
The router task is not to predict match results.
The task is to summarize the alerts as a safe engineering report.
Required JSON keys:
The test used the same OpenAI-compatible API base URL:
The request shape was intentionally compact:
The prompt explicitly required:
The router then attempted to parse each response and check required keys.
Here is the real test result:
| Model | HTTP | Latency | Total tokens | Valid JSON | Result |
|---|---|---|---|---|---|
claude-fable-5 | 400 | 1.09s | — | False | Invalid request |
gpt-5.5 | 200 | 8.07s | 950 | True | Valid fallback |
qwen-plus | 200 | 5.68s | 601 | True | Best primary |
gemini-2.5-flash | 200 | 4.70s | 1020 | False | Truncated JSON |
The router recommendation was:
This is exactly why model routing matters.
The fastest HTTP response was not the best production route. Gemini responded quickly, but produced invalid JSON. claude-fable-5 had worked in the previous article with a slightly different payload, but returned HTTP 400 here.
For this exact task, qwen-plus won because it returned valid JSON faster than gpt-5.5.
The qwen-plus response passed all required keys:
That is not a betting recommendation. It is a data-quality and monitoring summary.
gpt-5.5 was slower but also valid.
Its output included stronger caveats:
This makes gpt-5.5 a good fallback candidate.
If the primary route fails, it can provide a more conservative explanation.
This is the most interesting part.
In the previous project, claude-fable-5 successfully returned valid JSON when the request was compact and tuned for that model.
In this router benchmark, the request used the same payload shape across all models.
claude-fable-5 returned:
That does not mean the model is bad.
It means payload compatibility is part of production model quality.
A model can be useful in one request shape and fail in another. If your application routes dynamically, the router must understand those differences.
This is a very practical lesson:
gemini-2.5-flash returned HTTP 200, but failed JSON parsing.
The content started like valid JSON but was truncated:
That is a different failure mode from claude-fable-5.
One model failed at the request layer.
Another model failed at the output layer.
The router must treat both as failures.
This is why HTTP status alone is not enough.
The router rule for this demo is simple:
Pseudo-code:
This is boring code, but it is what makes AI workflows usable.
A pricing page tells you cost per token.
A production workflow cares about cost per valid output.
Those are not the same.
A cheap model that returns invalid JSON may trigger retries and fallback calls. A more expensive model may be cheaper for the actual workflow if it succeeds more often.
For this benchmark, the router would choose:
That does not mean Qwen is always better. It means Qwen was better for this exact payload and schema.
That is the point.
Claude Code’s role here is not to pick a favorite model.
It should build the router and the evidence trail:
This gives you:
That is much more valuable than a single polished answer.
Without an API gateway, this benchmark would require separate provider integrations.
With Crazyrouter, the test uses one interface:
That makes it practical to route by task, not by brand loyalty.
For example:
qwen-plus for fast structured alert summaries;gpt-5.5 when stricter explanation is needed;claude-fable-5 with a compatible payload for tasks where it performs well;This is how multi-model applications should be built.
The lesson from this project is simple:
In production AI, the best model is the one that returns an accepted output for the task.
Not the most hyped model.
Not the model with the fastest HTTP response.
Not the model you personally prefer.
For this odds alert router, the winner was qwen-plus, with gpt-5.5 as fallback. claude-fable-5 remains useful, but this payload needs tuning. gemini-2.5-flash was fast but invalid for the JSON workflow.
That is exactly why routers exist.
If you are building Claude Code projects that need structured output, model comparison, and fallback routing, try Crazyrouter:
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。