A Computational Account of Selected Patterns of Linguistic Variation and Change
Zhu, Jian
2022
Abstract
Language variation and change are ubiquitous, and one aim of linguistic research is to understand synchronic variation and how it contributes to change over time. This dissertation takes a computationally intensive approach to the investigation of language variation and change, with the goals of 1) understanding the complex linguistic landscape in online communities as a result of variation and change; and 2) developing machine learning-based methods to facilitate the processing of large-scale language data in the form of both texts and speech. The current dissertation reports three case studies on selected patterns of variation and change, which span lexical, stylistic, and speech variation. Study 1 centers on the hypothesis that lexical change in online communities is partially shaped by the structure of the community’s underlying social network. To investigate the relationship between social networks and lexical change, I conducted a large-scale analysis of over 80k neologisms in 4420 online communities spanning more than a decade. Using Poisson regression and survival analysis, this study uncovers several associations between a community’s network structure and lexical change within the community. In addition to overall community size, network properties including dense connections, the lack of local clusters, and more external contacts are shown to promote lexical innovation and retention. Unlike offline communities, these topic-based communities do not experience strong lexical leveling despite increased contact but rather tend to accommodate more niche words. The analysis not only confirms the influence of social networks on lexical change but also uncovers findings specific to online communities. Study 2 takes a deep learning-based approach to studying individual stylistic variation in written texts. 
The proposed neural models achieve strong performance on authorship identification for short texts and are therefore used as a proxy to extract representations of idiolectal styles. Extensive analyses were conducted to assess how idiolectal styles were encoded by the data-driven neural model. Using an analogy-based probing task, the study shows that the learned latent spaces exhibit surprising regularities that encode qualitative and quantitative shifts of idiolectal styles. Through text perturbation, I quantify the relative contributions of different linguistic elements to idiolectal variation. Furthermore, I characterize idiolects through measuring inter- and intra-author variation, showing that variation in idiolects is often both distinctive and consistent. Study 3 moves beyond textual variation and addresses a methodological bottleneck in speech analysis, that is, aligning continuous and highly variable speech signals to discrete phones. Two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment are proposed. The proposed Wav2Vec2-FS, a semi-supervised model, directly learns phone-to-audio alignment through contrastive learning and a forward sum loss and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The other model, Wav2Vec2-FC, is a frame classification model trained on forced aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that, even when transcriptions are not available, both proposed methods generate results that are very close to those of existing forced alignment tools. A phonetic aligner for Mandarin Chinese with the same method is also reported. This work presents a neural pipeline of fully automated phone-to-audio alignment to facilitate the processing of the highly variable speech data.
This dissertation demonstrates that the abundance of publicly available language data and the advancement of machine learning methods can be effectively harnessed to inform linguistic theories of variation and change.
Subjects
computational linguistics
Types
Thesis