Practical Natural Language Processing for Low-Resource Languages.

King, Benjamin Philip

dc.contributor.author	King, Benjamin Philip	en_US
dc.date.accessioned	2015-09-30T14:22:34Z
dc.date.available	NO_RESTRICTION	en_US
dc.date.available	2015-09-30T14:22:34Z
dc.date.issued	2015	en_US
dc.date.submitted	2015	en_US
dc.identifier.uri	https://hdl.handle.net/2027.42/113373
dc.description.abstract	As the Internet and World Wide Web have continued to gain widespread adoption, the linguistic diversity represented on them has also been growing. Simultaneously, the field of Linguistics is facing a crisis of the opposite sort: languages are becoming extinct faster than ever before, and linguists now estimate that the world could lose more than half of its linguistic diversity by the year 2100. This is a special time for Computational Linguistics; the field has unprecedented access to a great number of low-resource languages, readily available to be studied, but needs to act quickly before political, social, and economic pressures cause these languages to disappear from the Web. Most work in Computational Linguistics and Natural Language Processing (NLP) focuses on English or other languages that have text corpora of hundreds of millions of words. In this work, we present methods for automatically building NLP tools for low-resource languages with minimal need for human annotation in these languages. We start with language identification, focusing specifically on word-level language identification, an understudied variant that is necessary for processing Web text, and develop highly accurate machine learning methods for this problem. From there we move on to the problems of part-of-speech tagging and dependency parsing. For both of these problems we extend the current state of the art in projected learning to make use of multiple high-resource source languages instead of just a single language. In both tasks, we are able to improve on the best current methods. All of these tools are practically realized in the "Minority Language Server," an online tool that brings these techniques together with low-resource language text on the Web. Starting with only a few words in a language, the Minority Language Server can automatically collect text in that language, identify its language, and tag its parts of speech. We hope that this system provides a convincing proof of concept for the automatic collection and processing of low-resource language text from the Web, one that can be realized before it is too late.	en_US
dc.language.iso	en_US	en_US
dc.subject	Natural Language Processing	en_US
dc.title	Practical Natural Language Processing for Low-Resource Languages.	en_US
dc.type	Thesis	en_US
dc.description.thesisdegreename	PhD	en_US
dc.description.thesisdegreediscipline	Computer Science and Engineering	en_US
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies	en_US
dc.contributor.committeemember	Radev, Dragomir R.	en_US
dc.contributor.committeemember	Abney, Steven P.	en_US
dc.contributor.committeemember	Keshet, Ezra Russell	en_US
dc.contributor.committeemember	Cafarella, Michael John	en_US
dc.contributor.committeemember	Bird, Steven	en_US
dc.contributor.committeemember	Mihalcea, Rada	en_US
dc.subject.hlbsecondlevel	Computer Science	en_US
dc.subject.hlbtoplevel	Engineering	en_US
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/113373/1/benking_1.pdf
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)

Files in this item

Name: benking_1.pdf
Size: 1.133MB
Format: PDF
