Abstract
Predominantly oral languages (POLs) face a significant "digital divide": they are often excluded from the benefits of modern natural language processing (NLP) technologies because they lack extensive, readily available machine learning (ML) datasets. We investigate methods to overcome this data scarcity for Bambara, a Manding language spoken primarily in Mali that has a rich oral tradition but a limited digital presence. The research leverages crowdsourcing and community engagement to build high-quality, ML-ready datasets. Key contributions include methods for collecting and curating automatic speech recognition (ASR) and machine translation (MT) datasets and for creating educational resources. Our findings demonstrate that while POLs, and Bambara specifically, present unique challenges, including annotation complexity and technological barriers, crowdsourcing offers a scalable and inclusive path for their computational preservation. By integrating cultural knowledge with modern machine learning, this dissertation provides a framework for making NLP more equitable and accessible to Bambara speakers.
Publication Date
4-2026
Document Type
Dissertation
Student Type
Graduate
Degree Name
Computing and Information Sciences (Ph.D.)
Department, Program, or Center
Computing and Information Sciences Ph.D., Department of
College
Golisano College of Computing and Information Sciences
Advisor
Christopher Homan
Advisor/Committee Member
Emily Prud'hommeaux
Advisor/Committee Member
Shruti Rijhwani
Recommended Citation
Tapo, Allahsera Auguste, "NLP Crowdsourcing for Predominantly Oral Languages: The Case of Bambara" (2026). Thesis. Rochester Institute of Technology. Accessed from https://repository.rit.edu/theses/12636
Campus
RIT – Main Campus
