A Computational Account of Selected Patterns of Linguistic Variation and Change

dc.contributor.author: Zhu, Jian
dc.date.accessioned: 2022-09-06T15:57:30Z
dc.date.available: 2022-09-06T15:57:30Z
dc.date.issued: 2022
dc.date.submitted: 2022
dc.identifier.uri: https://hdl.handle.net/2027.42/174169
dc.description.abstract: Language variation and change are ubiquitous, and one aim of linguistic research is to understand synchronic variation and how it contributes to change over time. This dissertation takes a computationally intensive approach to the investigation of language variation and change, with the goals of 1) understanding the complex linguistic landscape in online communities as a result of variation and change; and 2) developing machine learning-based methods to facilitate the processing of large-scale language data in the form of both texts and speech. The current dissertation reports three case studies on selected patterns of variation and change, which span lexical, stylistic, and speech variation. Study 1 centers on the hypothesis that lexical change in online communities is partially shaped by the structure of the community’s underlying social network. To investigate the relationship between social networks and lexical change, I conducted a large-scale analysis of over 80k neologisms in 4420 online communities spanning more than a decade. Using Poisson regression and survival analysis, this study uncovers several associations between a community’s network structure and lexical change within the community. In addition to overall community size, network properties including dense connections, the lack of local clusters, and more external contacts are shown to promote lexical innovation and retention. Unlike offline communities, these topic-based communities do not experience strong lexical leveling despite increased contact but rather tend to accommodate more niche words. The analysis not only confirms the influence of social networks on lexical change but also uncovers findings specific to online communities. Study 2 takes a deep learning-based approach to studying individual stylistic variation in written texts.
The proposed neural models achieve strong performance on authorship identification for short texts and are therefore used as a proxy to extract representations of idiolectal styles. Extensive analyses were conducted to assess how idiolectal styles were encoded by the data-driven neural model. Using an analogy-based probing task, the study shows that the learned latent spaces exhibit surprising regularities that encode qualitative and quantitative shifts of idiolectal styles. Through text perturbation, I quantify the relative contributions of different linguistic elements to idiolectal variation. Furthermore, I characterize idiolects through measuring inter- and intra-author variation, showing that variation in idiolects is often both distinctive and consistent. Study 3 moves beyond textual variation and addresses a methodological bottleneck in speech analysis, that is, aligning continuous and highly variable speech signals to discrete phones. Two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment are proposed. The proposed Wav2Vec2-FS, a semi-supervised model, directly learns phone-to-audio alignment through contrastive learning and a forward sum loss and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The other model, Wav2Vec2-FC, is a frame classification model trained on forced-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that, even when transcriptions are not available, both proposed methods generate results that are very close to those of existing forced alignment tools. A phonetic aligner for Mandarin Chinese with the same method is also reported. This work presents a neural pipeline of fully automated phone-to-audio alignment to facilitate the processing of highly variable speech data.
This dissertation demonstrates that the abundance of publicly available language data and the advancement of machine learning methods can be effectively harnessed to inform linguistic theories of variation and change.
dc.language.iso: en_US
dc.subject: computational linguistics
dc.title: A Computational Account of Selected Patterns of Linguistic Variation and Change
dc.type: Thesis
dc.description.thesisdegreename: PhD (en_US)
dc.description.thesisdegreediscipline: Linguistics
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Beddor, Patrice Speeter
dc.contributor.committeemember: Jurgens, David
dc.contributor.committeemember: Queen, Robin M
dc.contributor.committeemember: Abney, Steven P
dc.contributor.committeemember: Styler, Will
dc.subject.hlbsecondlevel: Linguistics
dc.subject.hlbtoplevel: Humanities
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/174169/1/lingjzhu_1.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/5900
dc.identifier.orcid: 0000-0002-7849-1060
dc.identifier.name-orcid: Zhu, Jian; 0000-0002-7849-1060 (en_US)
dc.working.doi: 10.7302/5900 (en)
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

