




























Codex-style coding agents are most useful when they do more than generate code once. For this experiment, I designed a Codex-style workflow that turns a World Cup 2026 prediction prototype into a reproducible engineering demo: deterministic match probabilities, fixture checks, JSON schema validation, charts, raw API audit files, and a real Crazyrouter multi-model test.
Important context: this is a developer workflow demo, not an official World Cup data product and not betting advice. The fixture and rating data used here is a small demo dataset created for reproducible testing. A production sports model would need official live fixtures, lineups, injuries, travel, odds, and continuous result updates.
The live API layer was tested through:

The weak version of this idea is simple: ask an AI model who will win a match and publish the answer.
The better version is more engineering-heavy:
That is where a Codex-style workflow becomes interesting. The value is not that an AI can guess sports outcomes. The value is that a coding agent can help turn a rough demo into a workflow with gates.
The earlier Claude Code-style version focused on building the first working predictor: fixture data, Elo/Poisson probabilities, charts, and Crazyrouter API calls.
For the Codex-style version, the angle is different:
In short: Claude Code is a good builder story. Codex is a good reviewer-builder story.
The predictor uses a deliberately transparent model:
The expected-goals function is intentionally simple:
This is not a production sports model. For this article, transparency is more important than pretending to have secret predictive power.
| Date | Match | Group | xG | Home / Draw / Away | Pick |
|---|---|---|---|---|---|
| 2026-06-11 | Mexico vs South Africa | A | 1.68-0.98 | 55.8% / 24.2% / 19.9% | Mexico |
| 2026-06-11 | South Korea vs Czechia | A | 1.35-1.21 | 40.1% / 26.6% / 33.3% | South Korea |
| 2026-06-12 | USA vs Paraguay | D | 1.53-1.14 | 48.2% / 25.5% / 26.3% | USA |
| 2026-06-13 | Brazil vs Morocco | C | 1.64-0.92 | 54.9% / 24.7% / 20.4% | Brazil |
| 2026-06-13 | Qatar vs Canada | B | 1.1-1.57 | 24.6% / 25.2% / 50.2% | Canada |
| 2026-06-14 | Germany vs Curaçao | E | 2.08-0.48 | 75.1% / 17.7% / 7.2% | Germany |
| 2026-06-14 | Netherlands vs Japan | F | 1.53-1.03 | 49.5% / 25.7% / 24.8% | Netherlands |

The USA vs Paraguay prediction is a good example. The model gives USA an edge, but not a dominant one: 48.2% home win, 25.5% draw, 26.3% away win. A good workflow should preserve that uncertainty instead of turning it into overconfident prose.
The demo includes these checks:
This is the main workflow lesson: generated content should pass gates before it becomes product output.
After generating probabilities, the workflow asked several model routes to produce a compact JSON match preview for USA vs Paraguay.
Task:
The model-list endpoint worked:
API results:
| Model | HTTP | Latency | Total tokens | Valid JSON | Schema valid |
|---|---|---|---|---|---|
gpt-4o-mini | 200 | 2487 ms | 514 | True | True |
gpt-5.5 | 200 | 4664 ms | 859 | True | True |
gemini-2.5-flash | 200 | 2631 ms | 837 | False | False |
qwen-plus | 200 | 5045 ms | 696 | True | True |
deepseek-chat | 200 | 4192 ms | 738 | True | True |

With a stricter prompt, 4 out of 5 model routes returned schema-valid JSON. That is exactly what we want from a validation experiment: most routes passed, and one route still exposed a failure case.
In this run:
gpt-4o-mini, gpt-5.5, qwen-plus, and deepseek-chat returned schema-valid JSON.gemini-2.5-flash returned truncated JSON in this specific test.This is not a reason to reject any model globally. It is a reason to build retries, stricter prompts, schema repair, and fallback routes.
A plain JSON parser asks:
Is this syntactically valid JSON?
A workflow validator asks:
Can the application safely use this object?
Those are different questions.
A coding-agent workflow should not be tied to one model route. The same task may need:
Crazyrouter makes that operationally simple because the client shape stays OpenAI-compatible:
The useful metric is not raw request price. It is cost per valid output.
If a cheap route often returns malformed or schema-invalid content, the workflow may spend more on retries than expected. If a premium route returns usable structured output more consistently, it may be cheaper per successful task.
Run commands:
That is the real lesson from a World Cup predictor demo: the prediction is the hook, but the workflow is the product.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。