

















Abstract: Large language models (LLMs) have advanced significantly in reasoning and code generation, but efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, which is expensive and time-consuming. Furthermore, existing benchmarks often leak into LLM training data, so novel and diverse benchmarks are needed to accurately assess the models' genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks are consistently more difficult than those from prior work. Moreover, our algorithm provides a way to control the novelty, diversity, and difficulty of the generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, challenging coding benchmarks for LLMs. Project Page: this https URL
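As a rough illustration of how entropy and KL-divergence can act as diversity and novelty scores, the sketch below works on hypothetical topic labels attached to problems. It is not the InfoSynth implementation: the abstract does not say how the distributions are estimated, so the label-based histograms, the additive smoothing, and the names `distribution`, `entropy`, and `kl_divergence` are all assumptions made for this example.

```python
import math
from collections import Counter


def distribution(labels, support, alpha=1.0):
    """Smoothed categorical distribution over `support` (assumed estimator)."""
    counts = Counter(labels)
    total = len(labels) + alpha * len(support)
    return {k: (counts[k] + alpha) / total for k in support}


def entropy(p):
    """Shannon entropy in nats; higher means labels are spread more evenly."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


# Hypothetical topic labels for a seed benchmark and a synthesized one.
seed_topics = ["strings", "strings", "arrays", "math"]
new_topics = ["graphs", "graphs", "dp", "strings"]

support = sorted(set(seed_topics) | set(new_topics))
p_seed = distribution(seed_topics, support)
p_new = distribution(new_topics, support)

diversity = entropy(p_new)              # spread within the new benchmark
novelty = kl_divergence(p_new, p_seed)  # shift away from the seed benchmark

print(f"diversity (entropy): {diversity:.3f} nats")
print(f"novelty (KL vs seed): {novelty:.3f} nats")
```

Both scores are computed from the problem sets alone, with no model in the loop, which is the property the abstract highlights. The additive smoothing keeps the KL term finite when the new set contains a label the seed set lacks.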
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2601.00575 [cs.CL] |
| | (or arXiv:2601.00575v2 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2601.00575 (arXiv-issued DOI via DataCite) |
From: Ishir Garg
[v1] Fri, 2 Jan 2026 05:26:27 UTC (273 KB)
[v2] Tue, 26 May 2026 07:56:19 UTC (268 KB)