




























We present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with $A_{tot}$ total state-action pairs and mixing time bound $t_{mix}$ our method computes an $ε$-optimal policy with an expected $\widetilde{O}(t_{mix}^2 A_{tot} ε^{-2})$ samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a $γ$-discounted MDP with $A_{tot}$ total state-action pairs our method computes an $ε$-optimal policy with an expected $\widetilde{O}((1-γ)^{-4} A_{tot} ε^{-2})$ samples, matching the previous state-of-the-art up to a $(1-γ)^{-1}$ factor. Both methods are model-free, update state values and policies simultaneously, and run in time linear in the number of samples taken. We achieve these results through a more general stochastic mirror descent framework for solving bilinear saddle-point problems with simplex and box domains and we demonstrate the flexibility of this framework by providing further applications to constrained MDPs.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。