Provably Efficient Reinforcement Learning: From Single-Agent MDPs to Markov Games
Qiu, Shuang
2021
Abstract
Reinforcement Learning (RL) has achieved tremendous empirical success in real-world decision-making problems. Alongside these empirical achievements, the question of how to design efficient RL algorithms with provable theoretical guarantees has recently attracted increasing attention. To answer this question, this thesis proposes and analyzes novel RL algorithms for single-agent Markov Decision Processes (MDPs) and Markov games, where a Markov game is the multi-agent extension of a single-agent MDP.

This thesis covers two paradigms in RL: reward-based online RL and reward-free RL. Reward-based online RL is the standard online learning framework for RL: the agents keep interacting with the unknown environment, receive rewards, and continuously update their policies using the information obtained from the environment so far to achieve their learning goals. Reward-free RL is a different framework in which the agents first aim to thoroughly explore the unknown environment without access to any pre-specified reward; then, given an arbitrary extrinsic reward function, the agents compute the target policy via a planning algorithm using the data collected in the exploration phase.

Concretely, this thesis provides a theoretical analysis of three fundamental and challenging problems in RL: (1) online learning for constrained MDPs, (2) policy optimization for Markov games, and (3) reward-free RL with nonlinear function approximation. The first two problems are studied within the reward-based online RL framework, and the third is an important problem in the reward-free RL setting. The main contributions of this thesis in these three directions are summarized as follows.

The first contribution is a provably efficient upper confidence primal-dual algorithm for online learning in single-agent MDPs with time-varying constraints, where the transition model is unknown and the reward function is adversarial. The thesis further proves upper bounds on both the regret and the constraint violation for learning such constrained MDPs.

As the second contribution, the thesis proposes new optimistic policy optimization algorithms for two-player zero-sum Markov games with structured but unknown transitions and theoretically analyzes both players' regret bounds, generalizing recent studies on policy optimization for single-agent MDPs in a stationary environment.

The third contribution tackles reward-free RL for both single-agent MDPs and two-player zero-sum Markov games in the function approximation setting, leveraging powerful nonlinear approximators: kernel and neural function approximators. Specifically, the thesis explores the unknown environment via an optimistic variant of the value-iteration algorithm incorporating kernel and neural function approximation, and designs effective planning algorithms that are theoretically guaranteed to generate the target policies when given an arbitrary extrinsic reward function.
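For concreteness, the two performance measures referenced in the first contribution can be sketched as follows. This is a standard formalization with assumed notation, not a quotation from the thesis: K denotes the number of episodes, \pi_k the policy played in episode k, r_k and g_k the adversarial reward and time-varying constraint function in episode k, c an assumed constraint threshold, and V^{\pi}_{f}(s_1) the expected cumulative f-value of policy \pi from the initial state s_1.

\[
\mathrm{Regret}(K) \;=\; \max_{\pi}\sum_{k=1}^{K} V^{\pi}_{r_k}(s_1) \;-\; \sum_{k=1}^{K} V^{\pi_k}_{r_k}(s_1),
\qquad
\mathrm{Violation}(K) \;=\; \Big[\, cK \;-\; \sum_{k=1}^{K} V^{\pi_k}_{g_k}(s_1) \Big]_{+}.
\]

Under this kind of formalization, the upper confidence primal-dual algorithm of the first contribution is analyzed through upper bounds on both quantities.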
Subjects
Reinforcement Learning; Markov Decision Process; Markov Game; Constrained Markov Decision Process; Policy Optimization; Reward-Free Reinforcement Learning
Types
Thesis