A Computational Account of Selected Patterns of Linguistic Variation and Change

Zhu, Jian

2022

Abstract

Language variation and change are ubiquitous, and one aim of linguistic research is to understand synchronic variation and how it contributes to change over time. This dissertation takes a computationally intensive approach to the investigation of language variation and change, with the goals of 1) understanding the complex linguistic landscape in online communities as a result of variation and change; and 2) developing machine learning-based methods to facilitate the processing of large-scale language data in the form of both texts and speech. The current dissertation reports three case studies on selected patterns of variation and change, which span lexical, stylistic, and speech variation. Study 1 centers on the hypothesis that lexical change in online communities is partially shaped by the structure of the community’s underlying social network. To investigate the relationship between social networks and lexical change, I conducted a large-scale analysis of over 80k neologisms in 4420 online communities spanning more than a decade. Using Poisson regression and survival analysis, this study uncovers several associations between a community’s network structure and lexical change within the community. In addition to overall community size, network properties including dense connections, the lack of local clusters, and more external contacts are shown to promote lexical innovation and retention. Unlike offline communities, these topic-based communities do not experience strong lexical leveling despite increased contact but rather tend to accommodate more niche words. The analysis not only confirms the influence of social networks on lexical change but also uncovers findings specific to online communities. Study 2 takes a deep learning-based approach to studying individual stylistic variation in written texts. 
The proposed neural models achieve strong performance on authorship identification for short texts and are therefore used as a proxy to extract representations of idiolectal styles. Extensive analyses were conducted to assess how idiolectal styles were encoded by the data-driven neural model. Using an analogy-based probing task, the study shows that the learned latent spaces exhibit surprising regularities that encode qualitative and quantitative shifts of idiolectal styles. Through text perturbation, I quantify the relative contributions of different linguistic elements to idiolectal variation. Furthermore, I characterize idiolects through measuring inter- and intra-author variation, showing that variation in idiolects is often both distinctive and consistent. Study 3 moves beyond textual variation and addresses a methodological bottleneck in speech analysis, that is, aligning continuous and highly variable speech signals to discrete phones. Two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment are proposed. The proposed Wav2Vec2-FS, a semi-supervised model, directly learns phone-to-audio alignment through contrastive learning and a forward sum loss and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The other model, Wav2Vec2-FC, is a frame classification model trained on force-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that, even when transcriptions are not available, both proposed methods generate results that are very close to those of existing forced alignment tools. A phonetic aligner for Mandarin Chinese built with the same method is also reported. This work presents a neural pipeline of fully automated phone-to-audio alignment to facilitate the processing of highly variable speech data. 
This dissertation demonstrates that the abundance of publicly available language data and the advancement of machine learning methods can be effectively harnessed to inform linguistic theories of variation and change.
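
As a rough illustration of the frame classification idea described for Study 3, the sketch below shows how per-frame phone labels could be collapsed into time-stamped phone segments, which is the final step of text-independent segmentation. It is not the dissertation's code. The `Segment` type, the `frames_to_segments` function, the example labels, and the 20 ms frame stride are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One phone with its start and end time in seconds (hypothetical type)."""
    phone: str
    start: float
    end: float


def frames_to_segments(frame_labels: List[str], frame_stride: float = 0.02) -> List[Segment]:
    """Merge runs of identical frame-level labels into phone segments.

    frame_labels: one predicted phone label per acoustic frame.
    frame_stride: seconds between consecutive frames (assumed value).
    """
    if frame_stride <= 0:
        raise ValueError("frame_stride must be positive")

    segments: List[Segment] = []
    run_start = 0  # index of the first frame in the current run
    for i in range(1, len(frame_labels) + 1):
        # Close the current run at the end of input or when the label changes.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[run_start]:
            segments.append(
                Segment(
                    phone=frame_labels[run_start],
                    start=run_start * frame_stride,
                    end=i * frame_stride,
                )
            )
            run_start = i
    return segments


# Hypothetical frame-level predictions for a short utterance.
example_labels = ["sil", "sil", "k", "ae", "ae", "ae", "t", "sil"]
example_segments = frames_to_segments(example_labels)
for seg in example_segments:
    print(f"{seg.phone}\t{seg.start:.2f}\t{seg.end:.2f}")
```

Tracking frame indices and converting to seconds only when a run closes keeps the boundaries of adjacent segments identical, so the output has no gaps or overlaps.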

Deep Blue DOI

https://dx.doi.org/10.7302/5900

Subjects

computational linguistics

Types

Thesis

Handle

https://hdl.handle.net/2027.42/174169

Collections

Dissertations and Theses (Ph.D. and Master's)
