
























Abstract:We study the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting in the presence of adversarially corrupted rewards. To address this problem, we develop a novel robust variant of the \(Q\)-learning algorithm and analyze it under the challenging asynchronous sampling model with time-correlated data. Despite corruption, we prove that the finite-time guarantees of our approach match existing bounds, up to an additive term that scales with the fraction of corrupted samples. We also establish an information-theoretic lower bound, revealing that our guarantees are near-optimal. Notably, our algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous \(Q\)-learning. A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of RL algorithms.
| Comments: | To appear at the 43rd International Conference on Machine Learning (ICML) |
| Subjects: | Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC) |
| Cite as: | arXiv:2509.08933 [cs.LG] |
| (or arXiv:2509.08933v2 [cs.LG] for this version) | |
| https://doi.org/10.48550/arXiv.2509.08933 arXiv-issued DOI via DataCite |
From: Aritra Mitra [view email]
[v1]
Wed, 10 Sep 2025 18:56:39 UTC (1,503 KB)
[v2]
Thu, 21 May 2026 17:37:36 UTC (3,283 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。