Abstract

Language is more than a tool for conveying information; it is used in all aspects of our lives. Yet only a small number of the world's 7,000 languages are highly resourced by human language technologies (HLT). Although African languages account for over 2,000 of these, only a few are highly resourced, meaning that a considerable amount of parallel digital data exists for them.

This thesis describes the work carried out to create a Bambara-French machine translation (MT) system, including data discovery, data preparation, model hyperparameter tuning, the development of a crowdsourcing platform for humans in the loop, vocabulary sizing, and segmentation. We present a novel approach to MT for under-resourced languages that improves the quality of the model using a paradigm called "humans in the loop." We achieved a BLEU (bilingual evaluation understudy) score of 17.5. The results confirm that MT for Bambara is viable despite our small data set. This work has the potential to help reduce language barriers between the people of Sub-Saharan Africa and the rest of the world.

Library of Congress Subject Headings

Bambara language--Translation into French; Translators (Computer programs); Translating and interpreting--Data processing; Computational linguistics; Corpora (Linguistics); Human-computer interaction

Publication Date

8-2020

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Science (MS)

Department, Program, or Center

Computer Science (GCCIS)

Advisor

Christopher M. Homan

Advisor/Committee Member

Marcos Zampieri

Advisor/Committee Member

Sarah Luger

Campus

RIT – Main Campus

Plan Codes

COMPSCI-MS
