Explorations of In-Context Reinforcement Learning
dc.contributor.author | Brooks, Ethan | |
dc.date.accessioned | 2024-05-22T17:29:02Z | |
dc.date.available | 2024-05-22T17:29:02Z | |
dc.date.issued | 2024 | |
dc.date.submitted | 2024 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/193452 | |
dc.description.abstract | In-Context Learning describes a form of learning that occurs when a model accumulates information in its context or memory. For example, a Long Short-Term Memory unit (LSTM) can rapidly adapt to a novel task as input/target exemplars are fed into it. While in-context learning of this kind is not a new discovery, recent work has demonstrated the capacity of large "foundation" models to acquire this ability "for free," by training on large quantities of semi-supervised data, without the sophisticated (but often unstable) meta-objectives proposed by many earlier papers #cite(<finn2017model>)#cite(<stadie2018some>)#cite(<rakelly2019efficient>). In this work we explore several algorithms that adapt in-context learning, as developed in semi-supervised methods, to the reinforcement learning (RL) setting. In particular, we explore three approaches to in-context learning of value (expected cumulative discounted reward). The first demonstrates how to implement policy iteration, a classic RL algorithm, using a pre-trained large language model (LLM). We use the LLM to generate planning rollouts and extract Monte Carlo estimates of value from them. We demonstrate the method on several small, text-based domains and present evidence that the LLM can generalize to unseen states, a key requirement of learning in non-tabular settings. The second method imports many of the ideas of the first, but trains a transformer model directly on offline RL data. We incorporate Algorithm Distillation (AD) #cite(<laskin2022context>), another method for in-context reinforcement learning that directly distills the improvement operator from data that includes behavior ranging from random to optimal. Our method combines the benefits of AD with the policy iteration method proposed in our previous work and demonstrates gains in performance and generalization.
Our third method, like the previous two, implements a form of policy iteration, but eschews Monte Carlo rollouts in favor of a new approach to value estimation: we train a network to estimate Bellman updates and iteratively feed its outputs back into itself until the estimate converges. We find that this iterative approach improves the ability of the value estimates to generalize and mitigates some of the instability of other offline methods. | |
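The iterated Bellman update described in the abstract can be sketched in a tabular setting. This is a minimal illustration, not the thesis's method: the thesis trains a network to perform the backup, whereas here a known transition model stands in for that network, and the toy chain MDP, variable names, and discount factor are all assumptions made for the example.

```python
import numpy as np

# Toy 3-state chain MDP under a fixed policy: state 0 -> 1 -> 2,
# state 2 is absorbing and yields reward 1 on every step.
P = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)  # transition matrix
R = np.array([0.0, 0.0, 1.0])           # per-state reward
gamma = 0.9                             # discount factor

def bellman_update(v):
    # One Bellman backup for the fixed policy: v <- R + gamma * P v.
    # In the thesis this operator is approximated by a trained network.
    return R + gamma * (P @ v)

# Iteratively feed the estimate back into the operator until it converges.
v = np.zeros(3)
for _ in range(500):
    v_next = bellman_update(v)
    if np.max(np.abs(v_next - v)) < 1e-8:
        break
    v = v_next
```

At convergence the estimate satisfies the fixed-point equation v = R + gamma * P v; for this chain the absorbing state's value is 1 / (1 - gamma) = 10, and the earlier states discount it accordingly.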
dc.language.iso | en_US | |
dc.subject | in-context learning | |
dc.subject | reinforcement learning | |
dc.subject | deep learning | |
dc.title | Explorations of In-Context Reinforcement Learning | |
dc.type | Thesis | |
dc.description.thesisdegreename | PhD | |
dc.description.thesisdegreediscipline | Computer Science & Engineering | |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | |
dc.contributor.committeemember | Baveja, Satinder Singh | |
dc.contributor.committeemember | Lewis, Richard L | |
dc.contributor.committeemember | Polk, Thad | |
dc.contributor.committeemember | Lee, Honglak | |
dc.contributor.committeemember | Mihalcea, Rada | |
dc.subject.hlbsecondlevel | Computer Science | |
dc.subject.hlbtoplevel | Engineering | |
dc.contributor.affiliationumcampus | Ann Arbor | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/193452/1/ethanbro_1.pdf | |
dc.identifier.doi | https://dx.doi.org/10.7302/23097 | |
dc.identifier.orcid | 0000-0003-2557-4994 | |
dc.identifier.name-orcid | Brooks, Ethan; 0000-0003-2557-4994 | en_US |
dc.working.doi | 10.7302/23097 | en |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |