





















Abstract: Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with $r^*(x) = \sigma^*(\langle \theta^*, x\rangle)$ and $x \sim N(0, I_d)$. We analyze a two-stage neural reward model that first learns the hidden direction $\theta^*$ from reward-weighted samples and then fits the readout layer by weighted ridge regression. Exponential reward weighting changes the Hermite signal available to the first layer; for any feature-learning temperature $\beta_1$ above a dimension-free $O(1)$ threshold, a constant fraction of neurons recover the hidden direction, with weak-recovery complexity governed by the generative exponent. After feature recovery, we derive tilted-policy value-gap bounds for an idealized label-weighted fit with weights $e^{y/\beta_2}$ and a more practical surrogate-weighted fit with weights $e^{r_{a_0}(x)/\beta_2}$. Keeping the $\beta_2$-dependence explicit yields an admissible set of deployment temperatures, balancing the gain from lowering $\beta_2$ against the learning cost amplified by exponential weighting; in the surrogate-weighted case, proxy-dependent factors shrink this admissible set.
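As a rough illustration of the second stage described in the abstract, the sketch below fits a readout by ridge regression with label-dependent weights $e^{y/\beta_2}$. It is not the paper's construction: the link function `tanh`, the polynomial features standing in for a neural readout layer, the exact knowledge of the hidden direction, and every name and constant (`weighted_ridge`, `beta2`, `lam`, and so on) are assumptions made only for this example.

```python
import numpy as np

def weighted_ridge(features, targets, weights, lam):
    """Closed-form minimizer of sum_i w_i (phi_i^T a - y_i)^2 + lam * ||a||^2,
    with the weights normalized to sum to one before solving."""
    w = weights / weights.sum()
    gram = features.T @ (features * w[:, None])
    rhs = features.T @ (w * targets)
    return np.linalg.solve(gram + lam * np.eye(features.shape[1]), rhs)

rng = np.random.default_rng(0)
d, n, beta2 = 8, 4000, 1.0            # hypothetical dimension, sample size, temperature

theta = np.zeros(d)
theta[0] = 1.0                        # hypothetical hidden direction theta^*
x = rng.standard_normal((n, d))       # x ~ N(0, I_d)
z = x @ theta                         # index <theta^*, x>, assumed already recovered
y = np.tanh(z)                        # hypothetical link sigma^* = tanh, noiseless labels

# Polynomial features of the index stand in for the readout layer's inputs.
phi = np.stack([np.ones(n), z, z**2, z**3], axis=1)

# Idealized label-weighted fit: weights e^{y / beta2}.
weights = np.exp(y / beta2)
a = weighted_ridge(phi, y, weights, lam=1e-3)

pred = phi @ a
weighted_mse = np.sum(weights * (pred - y) ** 2) / weights.sum()
print("readout coefficients:", a)
print("weighted MSE:", weighted_mse)
```

Lowering `beta2` in this sketch concentrates the weights on high-reward samples, which is the mechanism the abstract refers to when it describes a learning cost amplified by exponential weighting.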
| Comments: | 35 pages |
| Subjects: | Machine Learning (stat.ML); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.24749 [stat.ML] |
| (or arXiv:2605.24749v1 [stat.ML] for this version) | |
| https://doi.org/10.48550/arXiv.2605.24749 arXiv-issued DOI via DataCite (pending registration) |
From: Rei Higuchi
[v1] Sat, 23 May 2026 22:00:38 UTC (59 KB)