
Explorations of In-Context Reinforcement Learning

dc.contributor.author: Brooks, Ethan
dc.date.accessioned: 2024-05-22T17:29:02Z
dc.date.available: 2024-05-22T17:29:02Z
dc.date.issued: 2024
dc.date.submitted: 2024
dc.identifier.uri: https://hdl.handle.net/2027.42/193452
dc.description.abstract: In-context learning describes a form of learning in which a model accumulates information in its context or memory. For example, a Long Short-Term Memory (LSTM) unit can rapidly adapt to a novel task as input/target exemplars are fed into it. While in-context learning of this kind is not a new discovery, recent work has demonstrated the capacity of large "foundation" models to acquire this ability "for free," by training on large quantities of semi-supervised data, without the sophisticated (but often unstable) meta-objectives proposed by many earlier papers #cite(<finn2017model>)#cite(<stadie2018some>)#cite(<rakelly2019efficient>). In this work we explore several algorithms that specialize this kind of semi-supervised in-context learning to the reinforcement learning (RL) setting. In particular, we explore three approaches to in-context learning of value (expected cumulative discounted reward). The first implements policy iteration, a classic RL algorithm, using a pre-trained large language model (LLM): we use the LLM to generate planning rollouts and extract Monte Carlo estimates of value from them. We demonstrate the method on several small, text-based domains and present evidence that the LLM can generalize to unseen states, a key requirement for learning in non-tabular settings. The second method imports many of the ideas of the first but trains a transformer model directly on offline RL data. It incorporates Algorithm Distillation (AD) #cite(<laskin2022context>), another method for in-context reinforcement learning, which distills the improvement operator directly from data containing behavior ranging from random to optimal. Our method combines the benefits of AD with the policy iteration approach of the first method and shows gains in both performance and generalization. The third method also implements a form of policy iteration but eschews Monte Carlo rollouts in favor of a new approach to value estimation: we train a network to estimate Bellman updates and iteratively feed its outputs back into itself until the estimate converges. We find that this iterative approach improves the ability of the value estimates to generalize and mitigates some of the instability of other offline methods.
dc.language.iso: en_US
dc.subject: in-context learning
dc.subject: reinforcement learning
dc.subject: deep learning
dc.title: Explorations of In-Context Reinforcement Learning
dc.type: Thesis
dc.description.thesisdegreename: PhD
dc.description.thesisdegreediscipline: Computer Science & Engineering
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Baveja, Satinder Singh
dc.contributor.committeemember: Lewis, Richard L
dc.contributor.committeemember: Polk, Thad
dc.contributor.committeemember: Lee, Honglak
dc.contributor.committeemember: Mihalcea, Rada
dc.subject.hlbsecondlevel: Computer Science
dc.subject.hlbtoplevel: Engineering
dc.contributor.affiliationumcampus: Ann Arbor
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/193452/1/ethanbro_1.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/23097
dc.identifier.orcid: 0000-0003-2557-4994
dc.identifier.name-orcid: Brooks, Ethan; 0000-0003-2557-4994
dc.working.doi: 10.7302/23097
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)
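
The abstract's third method trains a network to estimate Bellman updates and repeatedly feeds its outputs back into itself until the value estimate converges. The sketch below illustrates that iterate-to-convergence idea on a small random MDP in NumPy. It is not the dissertation's code: the least-squares regressor standing in for the trained network, the MDP sizes, and all hyperparameters are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 10, 3, 0.9

# Random MDP: P[s, a] is a distribution over next states, R[s, a] an expected reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # fixed policy to evaluate

def bellman_update(V):
    # Exact one-step backup: T(V)[s] = sum_a pi(a|s) * (R[s, a] + gamma * E[V(s')]).
    return (pi * (R + gamma * P @ V)).sum(axis=1)

# Stand-in for the trained estimator: a least-squares fit (with a bias feature)
# from a value vector V to noisy samples of its Bellman update T(V).
X = rng.normal(size=(500, n_states))
Y = np.stack([bellman_update(v) for v in X]) + 0.01 * rng.normal(size=(500, n_states))
X_aug = np.hstack([X, np.ones((500, 1))])
W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

def estimated_update(V):
    # Learned approximation of the Bellman update.
    return np.append(V, 1.0) @ W

# Iteratively feed the estimator's output back into itself until it converges.
V = np.zeros(n_states)
for step in range(300):
    V_next = estimated_update(V)
    if np.max(np.abs(V_next - V)) < 1e-6:
        break
    V = V_next

# Reference: many exact backups give the true value of the fixed policy.
V_true = np.zeros(n_states)
for _ in range(1000):
    V_true = bellman_update(V_true)
print(f"converged after {step} iterations; max error vs. exact value: {np.max(np.abs(V - V_true)):.4f}")

Because gamma < 1, the exact Bellman update is a contraction, so a reasonably accurate learned approximation of it also settles to a fixed point when applied to its own output; that fixed point serves as the value estimate.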

