Provably Efficient Algorithms for Safe Reinforcement Learning
Wei, Honghao
2023
Abstract
Safe reinforcement learning (RL) is an area of research focused on developing algorithms and methods that ensure the safety of RL agents during learning and decision making. The goal is to enable RL agents to interact with their environments and learn optimal decisions while avoiding actions that can lead to harmful or undesirable outcomes. This dissertation provides a comprehensive study of model-free, simulator-free reinforcement learning algorithms for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation, focusing on three settings: $(1)$ episodic CMDPs; $(2)$ infinite-horizon average-reward CMDPs; and $(3)$ non-stationary episodic CMDPs.

The first part presents the first model-free, simulator-free safe-RL algorithm with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it includes three key components: a Q-function (also called an action-value function) for the cumulative reward, a Q-function for the cumulative utility of the constraint, and a virtual queue that (over)estimates the cumulative constraint violation. At each step, Triple-Q chooses an action based on a pseudo-Q-value that combines the three quantities. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts of the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves sublinear regret. Furthermore, Triple-Q guarantees zero constraint violation, both in expectation and with high probability, when the number of episodes is sufficiently large. Finally, Triple-Q is computationally efficient: its complexity is similar to that of SARSA for unconstrained MDPs.

In Chapter III, the results are extended to infinite-horizon average-reward CMDPs. The proposed algorithm guarantees sublinear regret and zero constraint violation. Chapter IV then studies safe RL in a more challenging setting, non-stationary CMDPs, where the rewards/utilities and dynamics are time-varying and likely unknown a priori. In the non-stationary environment, the reward and utility functions and the transition kernels can vary arbitrarily over time as long as their cumulative variations do not exceed certain variation budgets. We propose the first model-free, simulator-free RL algorithms with sublinear regret and zero constraint violation for non-stationary CMDPs, in both the tabular and linear function approximation settings, with provable performance guarantees. Our regret and constraint-violation bounds in the tabular case match the corresponding best results for stationary CMDPs when the total variation budget is known. Additionally, we present a general framework for addressing the well-known challenges associated with analyzing non-stationary CMDPs without requiring prior knowledge of the variation budget, and we apply the approach to both the tabular and linear approximation settings.
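To make the Triple-Q description above concrete, the following is a minimal Python sketch of the kind of procedure the abstract describes: optimistic reward and utility Q-tables, a visit-count-dependent learning rate, a virtual queue, and action selection via a pseudo-Q-value. The class name (`TripleQSketch`), the exact learning-rate and bonus formulas, and the queue-update schedule are illustrative assumptions, not the dissertation's precise algorithm.

```python
import numpy as np


class TripleQSketch:
    """Illustrative Triple-Q-style agent for an episodic tabular CMDP (sketch only)."""

    def __init__(self, n_states, n_actions, horizon, eta=1.0):
        H = horizon
        self.H = H
        self.eta = eta                              # weight trading off reward vs. utility (assumed)
        self.Q_r = np.full((H, n_states, n_actions), float(H))  # optimistic reward Q-values
        self.Q_c = np.full((H, n_states, n_actions), float(H))  # optimistic utility Q-values
        self.counts = np.zeros((H, n_states, n_actions))        # visit counts per (h, s, a)
        self.Z = 0.0                                # virtual queue (over)estimating violation

    def act(self, h, s):
        # Pseudo-Q-value: reward Q-value plus the queue-weighted utility Q-value.
        pseudo_q = self.Q_r[h, s] + (self.Z / self.eta) * self.Q_c[h, s]
        return int(np.argmax(pseudo_q))

    def update(self, h, s, a, r, c, s_next):
        # Q-learning-style update with a learning rate that depends on the visit count.
        self.counts[h, s, a] += 1
        t = self.counts[h, s, a]
        alpha = (self.H + 1) / (self.H + t)         # visit-count-dependent learning rate
        bonus = np.sqrt(self.H ** 3 / t)            # simplified optimism bonus (assumed form)
        if h + 1 < self.H:
            a_next = self.act(h + 1, s_next)        # greedy action under the pseudo-Q-value
            v_r_next = min(self.H, self.Q_r[h + 1, s_next, a_next])
            v_c_next = min(self.H, self.Q_c[h + 1, s_next, a_next])
        else:
            v_r_next = v_c_next = 0.0
        self.Q_r[h, s, a] = (1 - alpha) * self.Q_r[h, s, a] + alpha * (r + v_r_next + bonus)
        self.Q_c[h, s, a] = (1 - alpha) * self.Q_c[h, s, a] + alpha * (c + v_c_next + bonus)

    def update_queue(self, estimated_utility, threshold, slack):
        # Virtual-queue update (e.g., once per frame of episodes): the queue grows when the
        # estimated utility falls short of the constraint threshold plus a tightening slack.
        self.Z = max(self.Z + threshold + slack - estimated_utility, 0.0)
```

The abstract also notes that the learning rates are periodically reset; in a sketch like this, that would correspond to zeroing `counts` (and re-initializing the Q-tables optimistically) at the start of each frame.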
Subjects
Safe Reinforcement Learning; Model-Free; CMDP; Zero Constraint Violation
Types
Thesis