Optimal Sample Complexity for Average-Reward MDPs: A Span-Based Bound

Original title: Span-Based Optimal Sample Complexity for Average Reward MDPs

Authors: Matthew Zurek, Yudong Chen

The article studies the sample complexity of learning an optimal policy in average-reward Markov decision processes (MDPs) under a generative model. It establishes the bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2}\right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. This is the first bound that is minimax optimal (up to logarithmic factors) in all parameters $S$, $A$, $H$, and $\varepsilon$, improving on prior work with suboptimal parameter dependence. The result is obtained by reducing the average-reward MDP to a discounted MDP, and the paper shows that this reduction is optimal. To do so, it proves improved bounds for $\gamma$-discounted MDPs: in weakly communicating MDPs with a sufficiently large discount factor, $\varepsilon$-optimal policies can be learned with a better dependence on the effective horizon $\frac{1}{1-\gamma}$ than the cubic dependence required for general discounted MDPs. The analysis rests on new upper bounds on instance-dependent variance parameters expressed in terms of the span, rather than traditional quantities such as the mixing time or diameter; these span-based bounds are tighter and may be useful beyond this setting.
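A rough sketch of why an improved discounted bound is what makes the reduction tight (the scalings below are illustrative; the precise constants and conditions are in the paper): since the average reward satisfies $\rho \approx (1-\gamma)V_\gamma$, an $\varepsilon$ target for the average reward translates into a discounted accuracy target $\varepsilon_\gamma \asymp \frac{\varepsilon}{1-\gamma}$, and the approximation error of the discounted proxy scales like $(1-\gamma)H$, which forces the choice $\frac{1}{1-\gamma} \asymp \frac{H}{\varepsilon}$. Plugging these choices into a discounted bound of the form $\widetilde{O}\left(\frac{SAH}{(1-\gamma)^2\varepsilon_\gamma^2}\right)$ gives

$$\widetilde{O}\left(\frac{SAH}{(1-\gamma)^2\varepsilon_\gamma^2}\right) = \widetilde{O}\left(\frac{SAH}{\varepsilon^2}\right),$$

whereas the classical minimax rate $\widetilde{O}\left(\frac{SA}{(1-\gamma)^3\varepsilon_\gamma^2}\right)$ for general discounted MDPs would only yield $\widetilde{O}\left(\frac{SAH}{\varepsilon^3}\right)$ under the same choices.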

Original article: https://arxiv.org/abs/2311.13469