Morphological Inference from Bitext for Resource-Poor Languages.

Szymanski, Terrence D.

dc.contributor.author	Szymanski, Terrence D.	en_US
dc.date.accessioned	2012-10-12T15:24:21Z
dc.date.available	NO_RESTRICTION	en_US
dc.date.available	2012-10-12T15:24:21Z
dc.date.issued	2012	en_US
dc.date.submitted	2012	en_US
dc.identifier.uri	https://hdl.handle.net/2027.42/93843
dc.description.abstract	The development of rich, multi-lingual corpora is essential for enabling new types of large-scale inquiry into the nature of language (Abney and Bird, 2010; Lewis and Xia, 2010). However, significant digital resources currently exist for only a handful of the world's languages. The present dissertation addresses this issue by introducing new techniques for creating rich corpora by enriching existing resources via automated processing. As a way of leveraging existing resources, this dissertation describes an automated method for extracting bitext (text accompanied by a translation) from bilingual documents. Digitized copies of printed books are mined for foreign-language material, using statistical methods for language identification and word alignment to identify instances of English-foreign bitext. After parsing the English text and transferring this analysis via the word alignments, the foreign word tokens are tagged with English glosses and morphosyntactic features. Tagged tokens such as these constitute the input to a new algorithm, presented in this dissertation, for performing morphology induction. Drawing on previous work on unsupervised morphology induction which uses the principle of minimum description length to drive the analysis (Goldsmith, 2001), the present algorithm uses a greedy hill-climbing search to minimize the size of a paradigm-based morphological description of the language. The algorithm simultaneously segments wordforms into their component morphemes and organizes stems and affixes into a paradigmatic structure. Because tagged tokens are used as input, the morphemes produced by this induction method are paired with meaningful morphosyntactic features, an improvement over algorithms for unsupervised morphology based on monolingual text, which treat morphemes purely as strings of letters.
Combined, these methods for collecting and analyzing bitext data offer a pathway for the automatic creation of richly-annotated corpora for resource-poor languages, requiring minimal amounts of data and minimal manual analysis.	en_US
dc.language.iso	en_US	en_US
dc.subject	Computational Linguistics	en_US
dc.subject	Morphological Inference	en_US
dc.subject	Resource-poor Languages	en_US
dc.title	Morphological Inference from Bitext for Resource-Poor Languages.	en_US
dc.type	Thesis	en_US
dc.description.thesisdegreename	PhD	en_US
dc.description.thesisdegreediscipline	Linguistics	en_US
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies	en_US
dc.contributor.committeemember	Abney, Steven P.	en_US
dc.contributor.committeemember	Radev, Dragomir Radkov	en_US
dc.contributor.committeemember	Thomason, Sarah G.	en_US
dc.contributor.committeemember	Keshet, Ezra Russell	en_US
dc.subject.hlbsecondlevel	Social Sciences (General)	en_US
dc.subject.hlbtoplevel	Social Sciences	en_US
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/93843/1/tdszyman_1.pdf
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)
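
The induction step described in the abstract (a greedy hill-climbing search that shrinks a description of stems and affixes built from tagged tokens) can be illustrated with a toy sketch. This is not the dissertation's algorithm: the cost function, the suffix-only segmentation, the token format, and the names `TOKENS`, `analyse`, `description_length`, and `induce` are all assumptions made for illustration.

```python
from os.path import commonprefix

# Hypothetical tagged tokens: (wordform, English gloss, feature tag).
# English forms are used only to keep the toy data readable.
TOKENS = [
    ("walks", "walk", "3SG"),
    ("walked", "walk", "PAST"),
    ("jumps", "jump", "3SG"),
    ("jumped", "jump", "PAST"),
]


def analyse(tokens, stem_len):
    """Split every token at the stem length currently assigned to its gloss."""
    stems, affixes = set(), set()
    for form, gloss, feature in tokens:
        k = stem_len[gloss]
        stems.add((form[:k], gloss))
        affixes.add((form[k:], feature))
    return stems, affixes


def description_length(tokens, stem_len):
    """Toy cost: characters in each distinct stem and affix, plus 1 per entry."""
    stems, affixes = analyse(tokens, stem_len)
    return sum(len(s) + 1 for s, _ in stems) + sum(len(a) + 1 for a, _ in affixes)


def induce(tokens):
    """Greedy hill-climbing: take the best cost-reducing move until none is left."""
    forms = {}
    for form, gloss, _ in tokens:
        forms.setdefault(gloss, []).append(form)
    # A stem can be no longer than the prefix shared by all forms of its gloss.
    max_len = {g: len(commonprefix(fs)) for g, fs in forms.items()}
    # Start with empty stems, so each whole wordform counts as an affix.
    stem_len = {g: 0 for g in forms}
    best = description_length(tokens, stem_len)
    while True:
        best_move = None
        for g in forms:
            for k in range(max_len[g] + 1):
                candidate = {**stem_len, g: k}
                cost = description_length(tokens, candidate)
                if cost < best:
                    best, best_move = cost, candidate
        if best_move is None:
            return stem_len, best
        stem_len = best_move
```

Calling `induce(TOKENS)` returns the stem length chosen for each gloss together with the final cost, and `analyse(TOKENS, stem_len)` then yields the stems and the affixes paired with their feature tags. Because the input tokens carry glosses and features, each induced affix comes out labelled with a feature and is more than a bare string, which is the advantage the abstract claims over induction from monolingual text.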

Files in this item

Name: tdszyman_1.pdf
Size: 1.239MB
Format: PDF
