Morphological Inference from Bitext for Resource-Poor Languages.

dc.contributor.author: Szymanski, Terrence D. (en_US)
dc.date.accessioned: 2012-10-12T15:24:21Z
dc.date.available: NO_RESTRICTION (en_US)
dc.date.available: 2012-10-12T15:24:21Z
dc.date.issued: 2012 (en_US)
dc.date.submitted: 2012 (en_US)
dc.identifier.uri: https://hdl.handle.net/2027.42/93843
dc.description.abstract: The development of rich, multi-lingual corpora is essential for enabling new types of large-scale inquiry into the nature of language (Abney and Bird, 2010; Lewis and Xia, 2010). However, significant digital resources currently exist for only a handful of the world's languages. The present dissertation addresses this issue by introducing new techniques for creating rich corpora by enriching existing resources via automated processing. As a way of leveraging existing resources, this dissertation describes an automated method for extracting bitext (text accompanied by a translation) from bilingual documents. Digitized copies of printed books are mined for foreign-language material, using statistical methods for language identification and word alignment to identify instances of English-foreign bitext. After parsing the English text and transferring this analysis via the word alignments, the foreign word tokens are tagged with English glosses and morphosyntactic features. Tagged tokens such as these constitute the input to a new algorithm, presented in this dissertation, for performing morphology induction. Drawing on previous work on unsupervised morphology induction which uses the principle of minimum description length to drive the analysis (Goldsmith, 2001), the present algorithm uses a greedy hill-climbing search to minimize the size of a paradigm-based morphological description of the language. The algorithm simultaneously segments wordforms into their component morphemes and organizes stems and affixes into a paradigmatic structure. Because tagged tokens are used as input, the morphemes produced by this induction method are paired with meaningful morphosyntactic features, an improvement over algorithms for unsupervised morphology based on monolingual text, which treat morphemes purely as strings of letters.
Combined, these methods for collecting and analyzing bitext data offer a pathway for the automatic creation of richly-annotated corpora for resource-poor languages, requiring minimal amounts of data and minimal manual analysis. (en_US)
dc.language.iso: en_US (en_US)
dc.subject: Computational Linguistics (en_US)
dc.subject: Morphological Inference (en_US)
dc.subject: Resource-poor Languages (en_US)
dc.title: Morphological Inference from Bitext for Resource-Poor Languages. (en_US)
dc.type: Thesis (en_US)
dc.description.thesisdegreename: PhD (en_US)
dc.description.thesisdegreediscipline: Linguistics (en_US)
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies (en_US)
dc.contributor.committeemember: Abney, Steven P. (en_US)
dc.contributor.committeemember: Radev, Dragomir Radkov (en_US)
dc.contributor.committeemember: Thomason, Sarah G. (en_US)
dc.contributor.committeemember: Keshet, Ezra Russell (en_US)
dc.subject.hlbsecondlevel: Social Sciences (General) (en_US)
dc.subject.hlbtoplevel: Social Sciences (en_US)
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/93843/1/tdszyman_1.pdf
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)
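
Illustrative note: the abstract describes a greedy hill-climbing search that minimizes the size of a morphological description, with tagged tokens as input. The Python sketch below is a toy illustration of that general idea only, not the dissertation's algorithm. The function names (`description_length`, `induce`), the single stem/suffix cut per word, the character-count cost, and the example tags are all assumptions made for this example; the actual method builds a paradigm-based description that is not reproduced here.

```python
from collections import defaultdict


def description_length(splits):
    """Toy cost (an assumption): characters needed to list each distinct stem and suffix once."""
    stems = {word[:cut] for word, cut in splits.items()}
    suffixes = {word[cut:] for word, cut in splits.items()}
    return sum(map(len, stems)) + sum(map(len, suffixes))


def induce(tagged_words):
    """Greedy hill-climbing over one stem/suffix cut point per wordform.

    tagged_words is a list of (wordform, tag) pairs, where the tag stands in
    for the morphosyntactic features transferred from the English side.
    """
    words = sorted({word for word, _ in tagged_words})
    splits = {word: len(word) for word in words}  # start with every word unsegmented
    best = description_length(splits)
    while True:
        move = None
        # Try every single-word change of cut point and remember the cheapest one.
        for word in words:
            original = splits[word]
            for cut in range(1, len(word) + 1):
                splits[word] = cut
                cost = description_length(splits)
                if cost < best:
                    best, move = cost, (word, cut)
            splits[word] = original
        if move is None:  # no single change shrinks the description: local minimum
            break
        splits[move[0]] = move[1]

    # Pair each induced suffix with the tags of the tokens it occurs in.
    suffix_features = defaultdict(set)
    for word, tag in tagged_words:
        suffix_features[word[splits[word]:]].add(tag)
    segmentation = {word: (word[:cut], word[cut:]) for word, cut in splits.items()}
    return segmentation, dict(suffix_features), best


# Hypothetical English toy data; the tags are made up for illustration.
toy = [("walk", "BASE"), ("walks", "3SG"), ("walked", "PAST"),
       ("talk", "BASE"), ("talks", "3SG"), ("talked", "PAST")]
segmentation, suffix_features, cost = induce(toy)
print(segmentation["walked"], suffix_features)
```

On this toy data the search ends with every word cut after its fourth letter, so each suffix is paired with a single tag. Like any hill-climbing search, it can stop at a local minimum on less regular input.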