





















Abstract:Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety concerns. To address this, we first formalize the notion of individual harm from a counterfactual perspective and define harm as the event in which a chosen action results in a strictly worse outcome than a baseline alternative. We then propose a general two-stage procedure for learning policies that maximize the expected return while accounting for individual harm. We further establish the finite-sample properties of the learned policy, derive an upper bound on its sub-optimality gap, and show that the harm rate remains well-controlled. Numerical experiments on both simulated and real-world datasets demonstrate the effectiveness of the proposed approach.
| Subjects: | Machine Learning (stat.ML); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.25114 [stat.ML] |
| (or arXiv:2605.25114v1 [stat.ML] for this version) | |
| https://doi.org/10.48550/arXiv.2605.25114 arXiv-issued DOI via DataCite (pending registration) |
From: Peng Wu [view email]
[v1]
Sun, 24 May 2026 14:54:06 UTC (1,258 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。