
























This benchmark evaluates how well different LLMs perform on system design tasks. Each model receives the same cold system design prompt (no examples, no hints) and produces a complete design covering architecture, capacity estimation, tradeoffs, and failure analysis. Independent LLM judges then score every transcript on 5 dimensions.
I evaluated 9 models on 9 problems with 3 judges, for 81 scored transcripts in total (9 models × 9 problems). See the methodology.
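As a rough illustration of where the 81 comes from, the evaluation grid can be enumerated as a cross product of models and problems. This is only a sketch: the model names are the ones in the leaderboard, while the problem and judge identifiers are hypothetical placeholders, and the idea that every judge scores every transcript is an assumption drawn from the description above.

```python
from itertools import product

MODELS = [
    "kimi-k2.6", "gpt-5.4", "claude-sonnet-4.6", "gpt-oss-120b",
    "deepseek-v4-pro", "gemini-3.1-pro", "gemma-4-31b-it",
    "gpt-oss-20b", "minimax-m2.7",
]
# Hypothetical placeholders: the real problem and judge names are not listed here.
PROBLEMS = [f"problem-{i}" for i in range(1, 10)]
JUDGES = [f"judge-{i}" for i in range(1, 4)]

# One transcript per (model, problem) pair.
transcripts = list(product(MODELS, PROBLEMS))

# Assumption: every judge scores every transcript.
judgments = list(product(transcripts, JUDGES))

print(len(transcripts))  # 81
print(len(judgments))    # 243
```

Under that assumption, each of the 81 transcripts receives 3 independent judgments, so a model's score on one problem could be taken as the average across judges before any per-model aggregation.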
Feedback or requests? Please submit an issue.
| Rank | Model | Mean Score | ±CI | Runs |
|---|---|---|---|---|
| 1 | kimi-k2.6 | 4.39 | ±0.13 | 9 |
| 2 | gpt-5.4 | 4.34 | ±0.16 | 9 |
| 3 | claude-sonnet-4.6 | 4.26 | ±0.09 | 9 |
| 4 | gpt-oss-120b | 4.02 | ±0.10 | 9 |
| 5 | deepseek-v4-pro | 4.00 | ±0.11 | 9 |
| 6 | gemini-3.1-pro | 3.87 | ±0.14 | 9 |
| 7 | gemma-4-31b-it | 3.44 | ±0.17 | 9 |
| 8 | gpt-oss-20b | 3.39 | ±0.14 | 9 |
| 9 | minimax-m2.7 | 3.28 | ±0.32 | 9 |
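This section does not say how the Mean Score and ±CI columns are computed, so the following is only a sketch of one plausible approach. It assumes that each of a model's 9 runs yields one score and that ±CI is the half-width of a 95% confidence interval based on Student's t-distribution. The function name `mean_with_ci` and the example scores are made up for illustration and are not benchmark data.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 8 degrees of freedom (n = 9 runs).
# Assumption: a different number of runs would need a different critical value.
T_CRIT_95_DF8 = 2.306

def mean_with_ci(scores, t_crit=T_CRIT_95_DF8):
    """Return (mean, half_width) for a list of per-run scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, t_crit * sem

# Made-up per-run scores for one hypothetical model (not benchmark data).
example_scores = [4.2, 4.5, 4.4, 4.1, 4.6, 4.3, 4.4, 4.5, 4.2]
mean, half_width = mean_with_ci(example_scores)
print(f"{mean:.2f} ±{half_width:.2f}")
```

When reading the table, note that the reported intervals of adjacent ranks often overlap (for example 4.39 ±0.13 and 4.34 ±0.16), so small differences in rank may not be meaningful.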