
























Abstract: Process Reward Models (PRMs) are widely used in the post-training of Large Language Models (LLMs) because they can evaluate the reasoning steps of generated content at a fine granularity. However, most PRMs lack long-term reasoning and deep thinking capabilities. Although a few works have tried to introduce Chain-of-Thought (CoT) capability into PRMs, CoT-PRM data is too expensive to annotate for this approach to play a stable role across tasks. To address these challenges, we propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Experimental results show that, using only 3.6K CoT-PRM Supervised Fine-Tuning (SFT) samples and 50K non-CoT PRM Reinforcement Learning (RL) samples, VRPRM surpasses a non-thinking PRM with a total data volume of 400K and achieves a relative performance improvement of up to 118% over the base model in the Best-of-N (BoN) experiment. This result confirms that the proposed combined training strategy achieves higher-quality reasoning at a lower data annotation cost, providing a new paradigm for PRM training with more efficient data utilization.
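For readers unfamiliar with the Best-of-N setting mentioned in the abstract, the following is a minimal sketch of how a process reward model is commonly used to pick one of N candidate solutions, and how a relative improvement figure is computed. It is not the authors' implementation: the names `best_of_n`, `score_steps`, and `relative_improvement` are hypothetical, the per-step scorer is a stand-in for a real PRM, and aggregating step scores by their minimum is an assumption (other aggregations such as the product or the last-step score are also used in practice).

```python
from typing import Callable, List, Sequence


def best_of_n(
    candidates: Sequence[List[str]],
    score_steps: Callable[[List[str]], List[float]],
) -> int:
    """Return the index of the candidate whose weakest step scores highest.

    `candidates` holds N solutions, each a list of reasoning steps.
    `score_steps` is a hypothetical stand-in for a PRM: it maps a list of
    steps to one score per step.
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    def aggregate(steps: List[str]) -> float:
        scores = score_steps(steps)
        # Assumption: a solution is only as good as its weakest step.
        return min(scores) if scores else float("-inf")

    totals = [aggregate(steps) for steps in candidates]
    return max(range(len(totals)), key=totals.__getitem__)


def relative_improvement(base: float, new: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    if base == 0:
        raise ValueError("base score must be non-zero")
    return (new - base) / base * 100.0


if __name__ == "__main__":
    # Toy scorer for illustration only: longer steps get higher scores.
    toy_scorer = lambda steps: [len(s) / 10 for s in steps]
    toy_candidates = [["aaaa", "bb"], ["ccc", "ddd"], ["e", "fffff"]]
    print(best_of_n(toy_candidates, toy_scorer))
    # Made-up accuracies, not results from the paper.
    print(relative_improvement(25.0, 40.0))
```

With a real PRM, `score_steps` would wrap a model call that scores each step of a candidate response; the selection logic stays the same.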
| Comments: | 20 pages, 11 figures |
| Subjects: | Machine Learning (cs.LG) |
| Cite as: | arXiv:2508.03556 [cs.LG] (or arXiv:2508.03556v3 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2508.03556 (arXiv-issued DOI via DataCite) |
From: Xinquan Chen
[v1] Tue, 5 Aug 2025 15:25:24 UTC (2,287 KB)
[v2] Thu, 28 Aug 2025 06:17:35 UTC (2,297 KB)
[v3] Thu, 21 May 2026 10:04:30 UTC (2,618 KB)