Abstract: We present a systematic evaluation of five large language models on automated code review, comparing Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4 mini, Minimax M2.7, and GLM-5 Turbo across 150 code review samples: 100 synthetic mutation-injected bugs and 50 real bug-fix pull requests mined from eight major open-source repositories. Our principal finding is that Claude Haiku 4.5, a smaller and cheaper model, consistently outperforms the larger Claude Sonnet 4.6, achieving higher F1 (0.365 vs. 0.343), 18% higher recall, and superior qualitative review scores across all four evaluation dimensions, at 3.2x lower cost per review. This result holds across three independent experimental conditions (n=25, n=100, n=150) and is independently confirmed on the Martian Code Review Benchmark, a third-party evaluation with different repositories, golden comments, and judge. We further report three secondary findings: (1) synthetic-only evaluation dramatically overestimates model capability: on real PRs alone, the best model achieves F1 = 0.066, compared to F1 = 0.847 on synthetic samples, a 92% degradation; (2) diff size is the dominant predictor of review quality, with F1 dropping from 0.657 on diffs under 10 lines to 0.043 on diffs over 150 lines; and (3) all models exhibit near-zero recall on performance-related bugs. We release our evaluation framework and dataset for reproducibility.
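For readers who want to check the headline arithmetic, the following is a minimal sketch, not the authors' evaluation framework. It assumes F1 is the standard harmonic mean of precision and recall and that "degradation" means relative drop from the synthetic score; the function names `f1_score` and `relative_degradation` are hypothetical, and only the two F1 values are taken from the abstract.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, which avoids dividing by zero.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def relative_degradation(baseline: float, observed: float) -> float:
    """Fraction of the baseline score that is lost (assumed definition)."""
    return (baseline - observed) / baseline


# F1 values quoted in the abstract for the best model.
synthetic_f1 = 0.847
real_pr_f1 = 0.066

drop = relative_degradation(synthetic_f1, real_pr_f1)
print(f"Relative F1 drop from synthetic to real PRs: {drop:.1%}")  # prints 92.2%
```

Under these assumptions the relative drop comes out at roughly 92%, matching the figure stated in the abstract.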
From: Shivam Pankaj Kumar
[v1] Thu, 9 Apr 2026 06:56:13 UTC (14 KB)