






















Abstract: We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $\beta \neq 0$ controls the agent's risk attitude: $\beta>0$ for risk-averse and $\beta<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive ERM. We introduce a model-based algorithm, called Model-Based ERM $Q$-Value Iteration (MB-RS-QVI), and derive PAC-type bounds on its sample complexity for both value and policy learning. Both PAC bounds scale exponentially with $|\beta|/(1-\gamma)$, where $\gamma$ is the discount factor. We also establish corresponding lower bounds for both value and policy learning, showing that exponential dependence on $|\beta|/(1-\gamma)$ is unavoidable in the worst case. The bounds are tight in the number of states and actions ($S$ and $A$), providing the first rigorous sample complexity guarantees for recursive ERM across both risk-averse and risk-seeking regimes.
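To make the setting concrete, here is a minimal sketch of a model-based recursive-ERM value iteration. It is an illustration under stated assumptions, not the paper's MB-RS-QVI as specified by the authors. It assumes the reward-based convention $\mathrm{ERM}_\beta(X) = -\tfrac{1}{\beta}\log \mathbb{E}[e^{-\beta X}]$ (so that $\beta>0$ is risk-averse) and the recursion $Q(s,a) = r(s,a) + \gamma\,\mathrm{ERM}_\beta\big(\max_{a'} Q(s',a')\big)$ with $s' \sim P(\cdot \mid s,a)$. The names `erm`, `estimate_model`, `erm_q_value_iteration`, and `sample_next_state` are hypothetical, as are the fixed iteration count and the per-pair sample size `n`.

```python
import math
import random


def erm(values, probs, beta):
    """Entropic risk -(1/beta) * log E[exp(-beta * X)] of a discrete X (beta != 0).

    Assumed convention: X is a reward-like quantity, so beta > 0 is risk-averse.
    """
    # Keep only outcomes with positive probability.
    z = [-beta * v for v, p in zip(values, probs) if p > 0.0]
    w = [p for p in probs if p > 0.0]
    m = max(z)  # shift by the largest exponent so exp() cannot overflow
    return -(m + math.log(sum(p * math.exp(zi - m) for zi, p in zip(z, w)))) / beta


def estimate_model(sample_next_state, S, A, n, rng):
    """Empirical transition kernel built from n generative-model calls per (s, a)."""
    counts = [[[0] * S for _ in range(A)] for _ in range(S)]
    for s in range(S):
        for a in range(A):
            for _ in range(n):
                counts[s][a][sample_next_state(s, a, rng)] += 1
    return [[[c / n for c in counts[s][a]] for a in range(A)] for s in range(S)]


def erm_q_value_iteration(P, R, gamma, beta, iters=500):
    """Iterate Q(s,a) <- R(s,a) + gamma * ERM_beta(V(s')), with V(s) = max_a Q(s,a)."""
    S, A = len(R), len(R[0])
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * erm(V, P[s][a], beta) for a in range(A)]
             for s in range(S)]
    return Q


if __name__ == "__main__":
    # Hypothetical two-state, one-action MDP with a uniformly random next state.
    rng = random.Random(0)
    P_hat = estimate_model(lambda s, a, r: r.randrange(2), S=2, A=1, n=200, rng=rng)
    R = [[1.0], [0.0]]
    for beta in (1.0, -1.0):
        print(beta, erm_q_value_iteration(P_hat, R, gamma=0.9, beta=beta))
```

In this sketch the only place samples enter is `estimate_model`; the planning step then treats the empirical kernel as if it were the true one, which is the usual model-based pattern under a generative model.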
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML) |
| Cite as: | arXiv:2506.00286 [cs.LG] (or arXiv:2506.00286v3 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2506.00286 (arXiv-issued DOI via DataCite) |
From: Mohammad Sadegh Talebi
[v1] Fri, 30 May 2025 22:27:57 UTC (42 KB)
[v2] Wed, 1 Oct 2025 09:50:45 UTC (40 KB)
[v3] Mon, 18 May 2026 21:58:29 UTC (488 KB)