

















Abstract: Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats the response language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, which balances exploration of underutilized languages against exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation during training. We release all the resources at this https URL.
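The bandit-style router described in the abstract can be illustrated with a minimal sketch. The abstract does not say which bandit algorithm, reward definition, or language set LRPO uses, so everything here is an assumption: the sketch uses the standard UCB1 rule, treats the reward as a scalar score for rollouts in the chosen language, and the names `LanguageRouter`, `select`, and `update` are hypothetical rather than taken from the released code.

```python
import math


class LanguageRouter:
    """Illustrative UCB1 bandit over response languages (not the paper's router)."""

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c  # exploration strength (assumed hyperparameter)
        self.counts = {lang: 0 for lang in self.languages}
        self.values = {lang: 0.0 for lang in self.languages}

    def select(self):
        # Try every language once before applying the UCB rule.
        for lang in self.languages:
            if self.counts[lang] == 0:
                return lang
        total = sum(self.counts.values())

        def ucb(lang):
            # Mean reward plus a bonus that shrinks as a language is used more.
            bonus = self.c * math.sqrt(2.0 * math.log(total) / self.counts[lang])
            return self.values[lang] + bonus

        return max(self.languages, key=ucb)

    def update(self, lang, reward):
        # Incremental running mean of the rewards observed for this language.
        self.counts[lang] += 1
        self.values[lang] += (reward - self.values[lang]) / self.counts[lang]


if __name__ == "__main__":
    router = LanguageRouter(["en", "zh", "sw"])
    for _ in range(200):
        lang = router.select()
        # Stand-in reward: pretend rollouts in "zh" are the most informative.
        router.update(lang, 1.0 if lang == "zh" else 0.0)
    print(router.counts)
```

In an actual training loop the stand-in reward would be replaced by whatever signal measures how informative a language's rollouts were for the policy update; the exploration bonus is what keeps rarely used languages from being abandoned entirely.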
| Comments: | Accepted at ICML 2026 |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.25360 [cs.CL] |
| | (or arXiv:2605.25360v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2605.25360 (arXiv-issued DOI via DataCite, pending registration) |
From: Geyang Guo
[v1] Mon, 25 May 2026 02:28:41 UTC (5,963 KB)