Understanding Word Embedding Stability Across Languages and Applications
dc.contributor.author | Burdick, Laura | |
dc.date.accessioned | 2020-10-04T23:22:36Z | |
dc.date.available | NO_RESTRICTION | |
dc.date.available | 2020-10-04T23:22:36Z | |
dc.date.issued | 2020 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/162917 | |
dc.description.abstract | Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this thesis, we consider several aspects of embedding spaces, including their stability. First, we propose a definition of stability, and show that common English word embeddings are surprisingly unstable. We explore how properties of data, words, and algorithms relate to instability. We extend this work to approximately 100 world languages, considering how linguistic typology relates to stability. Additionally, we consider contextualized output embedding spaces. Using paraphrases, we explore properties and assumptions of BERT, a popular embedding algorithm. Second, we consider how stability and other word embedding properties affect tasks where embeddings are commonly used. We consider both word embeddings used as features in downstream applications and corpus-centered applications, where embeddings are used to study characteristics of language and individual writers. In addition to stability, we also consider other word embedding properties, specifically batching and curriculum learning, and how methodological choices made for these properties affect downstream tasks. Finally, we consider how knowledge of stability affects how we use word embeddings. Throughout this thesis, we discuss strategies to mitigate instability and provide analyses highlighting the strengths and weaknesses of word embeddings in different scenarios and languages. We show areas where more work is needed to improve embeddings, and we show where embeddings are already a strong tool. | |
dc.language.iso | en_US | |
dc.subject | natural language processing | |
dc.subject | word embeddings | |
dc.subject | machine learning | |
dc.subject | stability | |
dc.subject | multilingual | |
dc.subject | word semantics | |
dc.title | Understanding Word Embedding Stability Across Languages and Applications | |
dc.type | Thesis | |
dc.description.thesisdegreename | PhD | en_US |
dc.description.thesisdegreediscipline | Computer Science & Engineering | |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | |
dc.contributor.committeemember | Mihalcea, Rada | |
dc.contributor.committeemember | Jurgens, David | |
dc.contributor.committeemember | Chai, Joyce | |
dc.contributor.committeemember | Kummerfeld, Jonathan K. | |
dc.contributor.committeemember | Mimno, David | |
dc.subject.hlbsecondlevel | Computer Science | |
dc.subject.hlbtoplevel | Engineering | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/162917/1/lburdick_1.pdf | en_US |
dc.identifier.orcid | 0000-0002-9953-4592 | |
dc.identifier.name-orcid | (Wendlandt) Burdick, Laura; 0000-0002-9953-4592 | en_US |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) | |