Explorations of In-Context Reinforcement Learning
dc.contributor.author | Brooks, Ethan | |
dc.date.accessioned | 2024-05-22T17:29:02Z | |
dc.date.available | 2024-05-22T17:29:02Z | |
dc.date.issued | 2024 | |
dc.date.submitted | 2024 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/193452 | |
dc.description.abstract | In-Context Learning describes a form of learning that occurs when a model accumulates information in its context or memory. For example, a Long Short-Term Memory unit (LSTM) can rapidly adapt to a novel task as input/target exemplars are fed into it. While in-context learning of this kind is not a new discovery, recent work has demonstrated the capacity of large "foundation" models to acquire this ability "for free," by training on large quantities of semi-supervised data, without the sophisticated (but often unstable) meta-objectives proposed by many earlier papers #cite(<finn2017model>)#cite(<stadie2018some>)#cite(<rakelly2019efficient>). In this work we explore several algorithms that adapt in-context learning, as developed in semi-supervised methods, to the reinforcement learning (RL) setting. In particular, we explore three approaches to in-context learning of value (expected cumulative discounted reward). The first demonstrates how to implement policy iteration, a classic RL algorithm, using a pre-trained large language model (LLM). We use the LLM to generate planning rollouts and extract Monte Carlo estimates of value from them. We demonstrate the method on several small, text-based domains and present evidence that the LLM can generalize to unseen states, a key requirement of learning in non-tabular settings. The second method imports many of the ideas of the first, but trains a transformer model directly on offline RL data. We incorporate Algorithm Distillation (AD) #cite(<laskin2022context>), another method for in-context reinforcement learning that directly distills the improvement operator from data that includes behavior ranging from random to optimal. Our method combines the benefits of AD with the policy iteration method proposed in our previous work and demonstrates gains in performance and generalization.
Our third method, like the previous two, implements a form of policy iteration, but eschews Monte Carlo rollouts in favor of a new approach to value estimation: we train a network to estimate Bellman updates and iteratively feed its outputs back into itself until the estimate converges. We find that this iterative approach improves the ability of the value estimates to generalize and mitigates some of the instability of other offline methods. | |
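The iterated Bellman update described in the abstract can be sketched in a tabular setting. This is a minimal illustration, not the thesis's method: the thesis trains a network to perform the backup, whereas here a known transition model stands in for that network, and the toy chain MDP, variable names, and discount factor are all assumptions made for the example.

```python
import numpy as np

# Toy 3-state chain MDP under a fixed policy: state 0 -> 1 -> 2,
# state 2 is absorbing and yields reward 1 on every step.
P = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)  # transition matrix
R = np.array([0.0, 0.0, 1.0])           # per-state reward
gamma = 0.9                             # discount factor

def bellman_update(v):
    # One Bellman backup for the fixed policy: v <- R + gamma * P v.
    # In the thesis this operator is approximated by a trained network.
    return R + gamma * (P @ v)

# Iteratively feed the estimate back into the operator until it converges.
v = np.zeros(3)
for _ in range(500):
    v_next = bellman_update(v)
    if np.max(np.abs(v_next - v)) < 1e-8:
        break
    v = v_next
```

At convergence the estimate satisfies the fixed-point equation v = R + gamma * P v; for this chain the absorbing state's value is 1 / (1 - gamma) = 10, and the earlier states discount it accordingly.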
dc.language.iso | en_US | |
dc.subject | in-context learning | |
dc.subject | reinforcement learning | |
dc.subject | deep learning | |
dc.title | Explorations of In-Context Reinforcement Learning | |
dc.type | Thesis | |
dc.description.thesisdegreename | PhD | |
dc.description.thesisdegreediscipline | Computer Science & Engineering | |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | |
dc.contributor.committeemember | Baveja, Satinder Singh | |
dc.contributor.committeemember | Lewis, Richard L | |
dc.contributor.committeemember | Polk, Thad | |
dc.contributor.committeemember | Lee, Honglak | |
dc.contributor.committeemember | Mihalcea, Rada | |
dc.subject.hlbsecondlevel | Computer Science | |
dc.subject.hlbtoplevel | Engineering | |
dc.contributor.affiliationumcampus | Ann Arbor | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/193452/1/ethanbro_1.pdf | |
dc.identifier.doi | https://dx.doi.org/10.7302/23097 | |
dc.identifier.orcid | 0000-0003-2557-4994 | |
dc.identifier.name-orcid | Brooks, Ethan; 0000-0003-2557-4994 | en_US |
dc.working.doi | 10.7302/23097 | en |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |