Sample-Efficient Algorithms for Hard-Exploration Problems in Reinforcement Learning
Guo, Yijie
2022
Abstract
Reinforcement learning (RL) aims to learn optimal behaviors for agents so as to maximize cumulative rewards through trial-and-error interactions with dynamic environments. In recent years, powerful deep neural networks have driven the emergence of deep RL algorithms that have achieved impressive results in a wide range of real-world applications, from playing video games and mastering board games to controlling robots. However, sample efficiency plagues deep RL algorithms in hard-exploration problems with sparse rewards. Without heavy reward engineering to provide dense and informative signals, it is expensive or even infeasible for the agent to explore the entire state and action space and discover the scarce rewards within a limited time budget. This challenge calls for algorithms that need fewer trials to collect positive experiences and fewer samples to learn high-reward policies. This thesis addresses the exploration problem by deliberately collecting and exploiting the agent's experiences. Specifically, we present contributions in two scenarios: (i) exploiting the agent's past experiences to efficiently learn a policy that solves a single hard-exploration problem, and (ii) exploiting the agent's knowledge across tasks to efficiently learn a shared policy that simultaneously solves multiple hard-exploration problems.
First, this thesis discusses exploiting past experiences collected on one task during policy training. We introduce new off-policy RL algorithms that indirectly drive deeper exploration on the same task by imitating high-rewarding transitions. In a variety of hard-exploration problems, reproducing past good experiences facilitates the discovery of subsequent reward sources and thus boosts policy learning. While exploiting past good experiences improves sample efficiency, in some complex cases the agent might be misled by distracting or deceptive positive rewards and prematurely converge to a myopic solution. We overcome this limitation by replicating and augmenting the agent's diverse past trajectories regardless of their short-term rewards, instead of only imitating experiences with high rewards. In essence, an agent exploiting diverse past experiences is capable of exploring diverse directions of the large state space and hence has a better chance of finding the (near-)optimal solution to a hard-exploration problem.
Second, this thesis goes further and pursues a policy shared across multiple sparse-reward tasks. We discuss how exploiting experiences and knowledge from some tasks can benefit policy learning on other related tasks. We investigate an action translator with a theoretical foundation for transferring a good source policy, one with relatively high accumulated rewards, to target tasks while approximately maintaining its performance. We incorporate the action translator into a context-based meta-RL algorithm to handle multiple hard-exploration problems, so that the learned policy generalizes to related tasks with varying dynamics. In addition, we extend generalization across tasks with varying environment structures, especially in procedurally generated environments. We formulate a novel view-based intrinsic reward to maximize the agent's knowledge coverage: the agent exploits the exploration knowledge extracted from training environments and, by enlarging its view coverage, explores unseen test environments well.
Overall, our work studies how to collect and exploit experiences from environments to tackle hard-exploration problems. We develop new algorithms that significantly improve sample efficiency and generalization performance by exploiting the agent's experiences and knowledge. This line of work allows us to understand the challenges of hard-exploration problems more deeply. It also serves as an inspiration to rethink the balance between exploration and exploitation in RL.
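The first part of the thesis centers on imitating the agent's own high-rewarding past transitions. As a rough, self-contained illustration of that general idea, and not the thesis's exact algorithms, the Python/PyTorch sketch below performs a self-imitation-style update: sampled past transitions are imitated only when their observed return exceeds the current value estimate, so good experiences are reproduced while poor ones are ignored. The names sil_update, policy_net, and value_net, and the tiny linear networks in the usage stub, are illustrative assumptions.

import torch
import torch.nn.functional as F


def sil_update(policy_net, value_net, optimizer, states, actions, returns):
    """One self-imitation-style update on a batch of past transitions.

    states:  float tensor [B, obs_dim]
    actions: long tensor  [B]   (discrete actions)
    returns: float tensor [B]   (observed Monte Carlo returns)
    """
    logits = policy_net(states)                 # [B, num_actions]
    values = value_net(states).squeeze(-1)      # [B]

    # Imitate only transitions that did better than the agent currently
    # expects: clip the advantage at zero so poor experiences are ignored.
    advantage = (returns - values).clamp(min=0.0).detach()

    log_prob = F.log_softmax(logits, dim=-1).gather(
        1, actions.unsqueeze(-1)).squeeze(-1)   # log pi(a|s), [B]

    policy_loss = -(log_prob * advantage).mean()
    # Push the value estimate up toward the good observed returns.
    value_loss = 0.5 * ((returns - values).clamp(min=0.0) ** 2).mean()

    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Hypothetical usage with a fake batch standing in for replay-buffer samples.
    obs_dim, num_actions, batch = 4, 3, 32
    policy_net = torch.nn.Linear(obs_dim, num_actions)
    value_net = torch.nn.Linear(obs_dim, 1)
    optimizer = torch.optim.Adam(
        list(policy_net.parameters()) + list(value_net.parameters()), lr=1e-3)
    states = torch.randn(batch, obs_dim)
    actions = torch.randint(num_actions, (batch,))
    returns = torch.rand(batch)
    print(sil_update(policy_net, value_net, optimizer, states, actions, returns))

In this sketch the clipped advantage plays the role of a filter on past experiences; the thesis's contributions refine and extend this kind of exploitation, for example by imitating diverse past trajectories rather than only the highest-rewarding ones.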
Subjects
Reinforcement Learning
Exploration and Exploitation
Imitation Learning
Policy Generalization
Types
Thesis