Understanding Word Embedding Stability Across Languages and Applications

Burdick, Laura

dc.contributor.author	Burdick, Laura
dc.date.accessioned	2020-10-04T23:22:36Z
dc.date.available	NO_RESTRICTION
dc.date.available	2020-10-04T23:22:36Z
dc.date.issued	2020
dc.identifier.uri	https://hdl.handle.net/2027.42/162917
dc.description.abstract	Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this thesis, we consider several aspects of embedding spaces, including their stability. First, we propose a definition of stability, and show that common English word embeddings are surprisingly unstable. We explore how properties of data, words, and algorithms relate to instability. We extend this work to approximately 100 world languages, considering how linguistic typology relates to stability. Additionally, we consider contextualized output embedding spaces. Using paraphrases, we explore properties and assumptions of BERT, a popular embedding algorithm. Second, we consider how stability and other word embedding properties affect tasks where embeddings are commonly used. We consider both word embeddings used as features in downstream applications and corpus-centered applications, where embeddings are used to study characteristics of language and individual writers. In addition to stability, we also consider other word embedding properties, specifically batching and curriculum learning, and how methodological choices made for these properties affect downstream tasks. Finally, we consider how knowledge of stability affects how we use word embeddings. Throughout this thesis, we discuss strategies to mitigate instability and provide analyses highlighting the strengths and weaknesses of word embeddings in different scenarios and languages. We show areas where more work is needed to improve embeddings, and we show where embeddings are already a strong tool.
dc.language.iso	en_US
dc.subject	natural language processing
dc.subject	word embeddings
dc.subject	machine learning
dc.subject	stability
dc.subject	multilingual
dc.subject	word semantics
dc.title	Understanding Word Embedding Stability Across Languages and Applications
dc.type	Thesis
dc.description.thesisdegreename	PhD	en_US
dc.description.thesisdegreediscipline	Computer Science & Engineering
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember	Mihalcea, Rada
dc.contributor.committeemember	Jurgens, David
dc.contributor.committeemember	Chai, Joyce
dc.contributor.committeemember	Kummerfeld, Jonathan K.
dc.contributor.committeemember	Mimno, David
dc.subject.hlbsecondlevel	Computer Science
dc.subject.hlbtoplevel	Engineering
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/162917/1/lburdick_1.pdf	en_US
dc.identifier.orcid	0000-0002-9953-4592
dc.identifier.name-orcid	(Wendlandt) Burdick, Laura; 0000-0002-9953-4592	en_US
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)

Files in this item

Name:: lburdick_1.pdf
Size:: 1.974MB
Format:: PDF
