Morphological Inference from Bitext for Resource-Poor Languages.
dc.contributor.author | Szymanski, Terrence D. | en_US |
dc.date.accessioned | 2012-10-12T15:24:21Z | |
dc.date.available | NO_RESTRICTION | en_US |
dc.date.available | 2012-10-12T15:24:21Z | |
dc.date.issued | 2012 | en_US |
dc.date.submitted | 2012 | en_US |
dc.identifier.uri | https://hdl.handle.net/2027.42/93843 | |
dc.description.abstract | The development of rich, multi-lingual corpora is essential for enabling new types of large-scale inquiry into the nature of language (Abney and Bird, 2010; Lewis and Xia, 2010). However, significant digital resources currently exist for only a handful of the world's languages. The present dissertation addresses this issue by introducing new techniques for creating rich corpora by enriching existing resources via automated processing. As a way of leveraging existing resources, this dissertation describes an automated method for extracting bitext (text accompanied by a translation) from bilingual documents. Digitized copies of printed books are mined for foreign-language material, using statistical methods for language identification and word alignment to identify instances of English-foreign bitext. After parsing the English text and transferring this analysis via the word alignments, the foreign word tokens are tagged with English glosses and morphosyntactic features. Tagged tokens such as these constitute the input to a new algorithm, presented in this dissertation, for performing morphology induction. Drawing on previous work on unsupervised morphology induction which uses the principle of minimum description length to drive the analysis (Goldsmith, 2001), the present algorithm uses a greedy hill-climbing search to minimize the size of a paradigm-based morphological description of the language. The algorithm simultaneously segments wordforms into their component morphemes and organizes stems and affixes into a paradigmatic structure. Because tagged tokens are used as input, the morphemes produced by this induction method are paired with meaningful morphosyntactic features, an improvement over algorithms for unsupervised morphology based on monolingual text, which treat morphemes purely as strings of letters. 
Combined, these methods for collecting and analyzing bitext data offer a pathway for the automatic creation of richly-annotated corpora for resource-poor languages, requiring minimal amounts of data and minimal manual analysis. | en_US |
dc.language.iso | en_US | en_US |
dc.subject | Computational Linguistics | en_US |
dc.subject | Morphological Inference | en_US |
dc.subject | Resource-poor Languages | en_US |
dc.title | Morphological Inference from Bitext for Resource-Poor Languages. | en_US |
dc.type | Thesis | en_US |
dc.description.thesisdegreename | PhD | en_US |
dc.description.thesisdegreediscipline | Linguistics | en_US |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | en_US |
dc.contributor.committeemember | Abney, Steven P. | en_US |
dc.contributor.committeemember | Radev, Dragomir Radkov | en_US |
dc.contributor.committeemember | Thomason, Sarah G. | en_US |
dc.contributor.committeemember | Keshet, Ezra Russell | en_US |
dc.subject.hlbsecondlevel | Social Sciences (General) | en_US |
dc.subject.hlbtoplevel | Social Sciences | en_US |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/93843/1/tdszyman_1.pdf | |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |