Overseas Vibe Code benchmark leaderboard: opus 4.7, gpt-5.5, deepseek V4, Kimi K2.6, and more
飛空 · 2026-04-24 · via LINUX DO - Latest Topics
Key Takeaways

Claude Opus 4.7 now leads at 71.00% overall accuracy, ahead of GPT 5.4 (67.42%), GPT 5.3 Codex (61.77%), and Claude Opus 4.6 (Nonthinking) (57.57%).

The top seven models are relatively tightly clustered (71.00% down to 51.48%), followed by a sharp drop to the middle tier (rank 9 at 37.91%).

Distribution across apps remains highly uneven: even the top model still has a non-trivial number of low-performing apps, while weaker models remain concentrated in the lowest pass-rate bin.

Open-source/open-weight models underperform relative to the top closed models on this benchmark, and model rankings from SWE-Bench and Terminal Bench do not transfer cleanly to it.

Even at rank 1, roughly one-third of workflows fail, so reliable end-to-end app generation is still unsolved.
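The arithmetic behind these takeaways can be checked with a short sketch. It uses only the four scores quoted above. The helper names `gaps_to_next` and `implied_failure_rate` are made up for illustration, and treating overall accuracy as the workflow pass rate is an assumption that the summary implies but does not state.

```python
# Overall accuracy (%) exactly as quoted in the takeaways; no other data is assumed.
scores = {
    "Claude Opus 4.7": 71.00,
    "GPT 5.4": 67.42,
    "GPT 5.3 Codex": 61.77,
    "Claude Opus 4.6 (Nonthinking)": 57.57,
}


def gaps_to_next(scores):
    """Return (leader, follower, gap in percentage points) for each adjacent pair."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (upper[0], lower[0], round(upper[1] - lower[1], 2))
        for upper, lower in zip(ranked, ranked[1:])
    ]


def implied_failure_rate(accuracy_pct):
    """Assumption: failure rate is simply 100% minus overall accuracy."""
    return round(100.0 - accuracy_pct, 2)


if __name__ == "__main__":
    for upper, lower, gap in gaps_to_next(scores):
        print(f"{upper} leads {lower} by {gap:.2f} points")
    top = max(scores.values())
    print(f"Implied failure rate at rank 1: {implied_failure_rate(top):.2f}%")
```

Under that assumption, the rank 1 failure rate comes out at 29%, which is what the "roughly one-third" figure rounds up from.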