Date: 22 March, 2021
 
Dataset Title: “The Lannang Corpus (LanCorp): A POS-tagged, sociolinguistic corpus containing recordings and transcriptions of Lannang speech collected from the metropolitan Manila Lannangs between 2016 and 2020”
 
Dataset Creator: Wilkinson Daniel Wong GONZALES
 
Dataset Contact: Wilkinson Daniel Wong GONZALES (wdwg@umich.edu; wdwgonzales@gmail.com)
 
Funding: The University of Michigan Department of Linguistics, Lieberthal-Rogel Center for Chinese Studies, International Institute
 
Overview:
The Lannang Corpus (LanCorp) is a sociolinguistic POS-tagged 375,000-word speech-and-text corpus of Lannang languages based on audio recordings collected in metropolitan Manila between 2016 and 2020. It hopes to furnish scholars interested in Sino-Philippine (socio)linguistics with a contemporary, multilingual corpus (i.e., Hokkien, Tagalog, English, Lánnang-uè, Mandarin) compiled from recorded oral data collected primarily from, and by, a Sino-Philippine community in metropolitan Manila: the Manila Lannangs. The publicly available corpus contains manual transcriptions (time-aligned to the audio), source language and part-of-speech tags derived using a mix of manual and computational methods, and a wide range of social metadata; it is also organized and stored systematically for easy data retrieval and (socio)linguistic analysis. Although there are existing sociolinguistic corpora, they are small in scale and were not released publicly due to lack of informant consent; LanCorp fills this gap.
 
Methodology:
The data came from sessions conducted with roughly 130 Lannang community members in metropolitan Manila. In the sessions, participants were instructed to tell a story in Lánnang-uè using the wordless book Frog, Where Are You? by Mercer Mayer, but were also encouraged to use other languages that they felt comfortable using (e.g., Filipino). The participants were then interviewed using a set of questions about community, identity, language, and education. Although the interviewer used Lánnang-uè, participants were not restricted to any particular language. Participants also provided sociolinguistic information through a survey.
 
Data from unstructured casual conversations was collected differently. A Lannang data collector reached out to five Lannang families and asked for verbal consent to record their conversations at their gatherings (e.g., restaurants, home). A microphone was then placed in an inconspicuous location at these gatherings for roughly thirty minutes. The data was collected in 2016 and submitted to the corpus compiler in 2020; it was not linked to social metadata upon the request of the participants, who also requested that the raw audio file not be released.
 
22 individuals, all fluent in Lannang languages, were trained to use ELAN and were familiarized with Lannang Orthography conventions (The Lannang Archives 2020). Upon passing the summative transcription assessment, they were assigned a portion of the audio files and were instructed to use ELAN to (1) segment the audio files into sentences, (2) identify the source language of the sentence, and (3) transcribe each sentence into text using Lannang Orthography conventions (The Lannang Archives 2020). I instructed four of these transcribers to review the files and transcriptions. All participants were paid for their time.
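
ELAN stores its time-aligned annotations as XML (.eaf), so the segmented sentences can be read back out with a standard XML parser. The following is a minimal sketch using only the Python standard library; the sample document, the tier name "transcription", and the function name are hypothetical and are not taken from the LanCorp files.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up .eaf fragment. Real ELAN files carry more header
# and tier metadata; the tier name here is hypothetical.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first sentence</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second sentence</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_aligned_annotations(eaf_text):
    """Return one dict per time-aligned annotation, with times in ms."""
    root = ET.fromstring(eaf_text)
    # Map each time slot ID to its offset in milliseconds.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append({
                "tier": tier.get("TIER_ID"),
                "start_ms": slots.get(ann.get("TIME_SLOT_REF1")),
                "end_ms": slots.get(ann.get("TIME_SLOT_REF2")),
                "text": (ann.findtext("ANNOTATION_VALUE") or "").strip(),
            })
    return rows


rows = read_aligned_annotations(SAMPLE_EAF)
for row in rows:
    print(row)
```

For a file on disk, the same function can be given the result of reading the .eaf file as text.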
 
For POS-tagging and source-language tagging, the ELAN files were first converted to spreadsheet files. From these, I extracted all Lánnang-uè sentences. Then, I tagged each word in all sentences for part-of-speech (POS) (e.g., conjunction, preposition) using a POS-tagger program I created in the Python environment (Van Rossum and Drake 2009). The program utilizes Conditional Random Fields (CRF) (Lafferty et al. 2001), a model that ‘learns’ the POS distributions from sequential data and can identify the optimal POS of a token, given the context. The CRF model used was trained using 1,085 manually annotated Lánnang-uè sentences. In 5-fold cross-validation, it has an accuracy score of 0.83 (SD = 0.005), a precision score of 0.58 (SD = 0.017), a recall score of 0.56 (SD = 0.018), and an F1 score of 0.56 (SD = 0.015).
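
To illustrate what "given the context" means for a CRF tagger: such models are usually fed one feature dictionary per token, describing the token itself and its neighbours. The sketch below is only an illustration of that input shape. The feature set, the function names, and the English placeholder tokens are assumptions; the features actually used for LanCorp are not documented here.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] (illustrative feature set only)."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": word.lower(),
        "suffix2": word[-2:],          # last two characters
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Context features: the neighbouring tokens, when they exist.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats


def sentence_features(tokens):
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Placeholder English tokens, not corpus data.
example = ["The", "frog", "jumped"]
feats = sentence_features(example)
for token, f in zip(example, feats):
    print(token, f)
```

A list of such per-sentence feature lists, paired with the manually annotated tag sequences, is the kind of input that CRF libraries (for example sklearn-crfsuite) typically accept for training.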
 
After POS tagging, I tokenized the sentences, breaking the tagged sentences down into tagged words. I then tagged each word for source language by relying on a combination of rule-based and manual tagging approaches. I used publicly available English, Tagalog, and Mandarin wordlists to help me tag English-, Tagalog-, and Mandarin-origin Lánnang-uè words. Lánnang-uè words that were not found in any of the three wordlists were preliminarily tagged as Hokkien-sourced. I hired and trained three native speakers of Lánnang-uè to go over the list and revise incorrectly tagged tokens. I also asked them to tag words that do not have a clear origin as ‘unclear’.
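
The rule-based first pass described above can be sketched as a wordlist lookup with a Hokkien fallback. This is a minimal illustration, not the actual tagging script: the miniature wordlists, the lookup order (which matters for words that appear in more than one list), and the function name are all assumptions.

```python
# Hypothetical miniature wordlists; the publicly available lists
# used for the corpus are far larger.
WORDLISTS = {
    "English": {"school", "teacher"},
    "Tagalog": {"bahay", "kasi"},
    "Mandarin": {"laoshi"},
}


def tag_source_language(word, wordlists=WORDLISTS, default="Hokkien"):
    """Return the first language whose wordlist contains the word.

    Words found in none of the lists get the default tag, mirroring the
    preliminary 'Hokkien-sourced' label that was later checked by hand.
    """
    key = word.lower()
    for language, vocabulary in wordlists.items():
        if key in vocabulary:
            return language
    return default


for token in ["School", "kasi", "laoshi", "unlistedword"]:
    print(token, "->", tag_source_language(token))
```

The output of a pass like this is only preliminary; in the corpus, native speakers then revised incorrect tags and marked unclear cases by hand.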
 
The resulting LanCorp is not only transcribed, time-aligned, and searchable, but is also tagged with source language information and part-of-speech.
 
Instrument and/or Software specifications: Zoom H6 recorder, ELAN (version 5.7)
 
Files contained here:
The LanCorp folder contains four subfolders: (1) Corpus in audio and ELAN format, (2) Corpus in text format, (3) Corpus in spreadsheet format, and (4) Sociolinguistic metadata. The first folder contains two more subfolders, containing WAV files and .eaf files respectively. The second folder contains .txt files. The third and fourth folders contain .csv files (the sociolinguistic metadata can be merged with the .csv corpus using R or other data processing software).
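
As a minimal sketch of the merge mentioned above, the following uses only the Python standard library. The column names (speaker_id, word, pos, age_group) and the sample rows are hypothetical; consult the actual .csv headers and "Metadata Description.pdf" for the real field names.

```python
import csv
import io

# Made-up stand-ins for the corpus and metadata .csv files.
CORPUS_CSV = (
    "speaker_id,word,pos\n"
    "68,frog,NOUN\n"
    "68,jumped,VERB\n"
    "12,the,DET\n"
    "99,hello,INTJ\n"
)
METADATA_CSV = (
    "speaker_id,age_group\n"
    "68,adult\n"
    "12,young\n"
)


def merge_on_speaker(corpus_text, metadata_text, key="speaker_id"):
    """Left-join corpus rows with speaker metadata on a shared key column."""
    metadata = {
        row[key]: row for row in csv.DictReader(io.StringIO(metadata_text))
    }
    merged = []
    for row in csv.DictReader(io.StringIO(corpus_text)):
        # Rows without matching metadata are kept, just without extra fields.
        extra = metadata.get(row[key], {})
        merged.append({**extra, **row})
    return merged


merged = merge_on_speaker(CORPUS_CSV, METADATA_CSV)
for row in merged:
    print(row)
```

A left join is used so that utterances with no linked social metadata are kept rather than dropped.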
 
The audio and ELAN files are labelled as follows: “uniqueuserid-contextYEAR”, where context refers to the stylistic condition in which the recording was done (i.e., CLIN = interviews, FRST = frog story narrative, PROT = unstructured conversations) and YEAR refers to the last two digits of the year of recording.
 
A similar convention is followed for each utterance line in the spreadsheets: each utterance is tagged with a <style-year-uniqueuserid-uniqueutteranceid> metadata tag. For instance, the tag <CLIN-18-68:1> indicates that the utterance was recorded as part of an interview conducted in 2018. It indicates that the utterance was produced by an individual with the identification number 68 and that the utterance has a unique identification number of 1.
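
A tag of this kind can be split into its parts with a regular expression. The sketch below follows the form of the example tag <CLIN-18-68:1> (a colon before the utterance number); it assumes numeric speaker and utterance IDs and recording years in the 2000s, and the function and field names are hypothetical.

```python
import re

# Pattern modelled on the example tag <CLIN-18-68:1>.
TAG_PATTERN = re.compile(
    r"^<(?P<style>[A-Z]+)-(?P<year>\d{2})-(?P<speaker>\d+):(?P<utterance>\d+)>$"
)

# Style codes as listed in the file-naming convention.
STYLES = {
    "CLIN": "interview",
    "FRST": "frog story narrative",
    "PROT": "unstructured conversation",
}


def parse_utterance_tag(tag):
    """Split an utterance tag into style, year, speaker ID, and utterance ID."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized utterance tag: {tag!r}")
    return {
        "style": match["style"],
        "style_label": STYLES.get(match["style"], "unknown"),
        "year": 2000 + int(match["year"]),  # assumes two-digit 20xx years
        "speaker_id": int(match["speaker"]),
        "utterance_id": int(match["utterance"]),
    }


parsed = parse_utterance_tag("<CLIN-18-68:1>")
print(parsed)
```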

POS = part of speech, dia = diacritic

Other tags/metadata are documented in file "Metadata Description.pdf".

 
Related publications:
[1] Gonzales, Wilkinson Daniel Wong. 2021. Interactions of Sinitic languages in the Philippines: Sinicization, Filipinization, and Sino-Philippine language creation. The Palgrave handbook of Chinese language studies, ed. by Zhengdao Ye. London: Palgrave-MacMillan.
[2] Gonzales, Wilkinson Daniel Wong. 2021. Filipino, Chinese, neither, or both? The Lannang identity and its relationship with language. Language & Communication 77.
[3] Gonzales, Wilkinson Daniel Wong. 2022. “Truly a Language of Our Own” A Corpus-Based, Experimental, and Variationist Account of Lánnang-uè in Manila. Ann Arbor: University of Michigan Ph.D. dissertation.
[4] Gonzales, Wilkinson Daniel Wong. 2022. Hybridization. Philippine English: Development, Structure, and Sociology of English in the Philippines, ed. by Ariane Macalinga Borlongan. London: Routledge.
[5] Gonzales, Wilkinson Daniel Wong. in preparation. Advancing Sino-Philippine (socio)linguistics using the Lannang Corpus (LanCorp) – a multilingual, POS-tagged, and audio-textual databank.
 
Consent and ethics:
Participants have given consent to release the linguistic and sociolinguistic data on the condition that their names not be made public and that the data be used only for academic/non-commercial purposes (e.g., sociolinguistic analyses, social analyses, historical analyses). The collection protocol was vetted by the University of Michigan Institutional Review Board in 2019.
 
Use and Access:
This data set is made available under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

To Cite Data:
Gonzales, Wilkinson Daniel Wong. 2022. The Lannang Corpus (LanCorp): A POS-tagged, sociolinguistic corpus containing recordings and transcriptions of Lannang speech collected from the metropolitan Manila Lannangs between 2016 and 2020. University of Michigan - Deep Blue Repository. doi: https://doi.org/10.7302/66g9-e028
 
References:
[1] ELAN (Version 5.7) [Computer software]. 2019. Nijmegen: Max Planck Institute for Psycholinguistics, The Language Archive. Retrieved from https://archive.mpi.nl/tla/elan
[2] Lafferty, John; Andrew McCallum; and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning, 282–289.
[3] Van Rossum, Guido, and Fred L. Drake. 2009. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.