Abstract

Predominantly oral languages (POLs) face a significant "digital divide": they are often excluded from the benefits of modern natural language processing (NLP) technologies because they lack extensive, readily available machine learning (ML) datasets. We investigate methods to overcome this data scarcity for Bambara, a Manding language spoken primarily in Mali that has a rich oral tradition but a limited digital presence. The research leverages crowdsourcing and community engagement to build high-quality, ML-ready dataset resources. Key contributions include methods for collecting and curating automatic speech recognition (ASR) and machine translation (MT) datasets and for creating educational resources. Our findings demonstrate that while POLs, and Bambara specifically, present unique challenges, including annotation complexity and technological barriers, crowdsourcing offers a scalable and inclusive path for their computational preservation. By integrating cultural knowledge with modern machine learning, this dissertation provides a framework for making NLP more equitable and accessible to Bambara speakers.

Publication Date

4-2026

Document Type

Dissertation

Student Type

Graduate

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

Computing and Information Sciences Ph.D, Department of

College

Golisano College of Computing and Information Sciences

Advisor

Christopher Homan

Advisor/Committee Member

Emily Prud'hommeaux

Advisor/Committee Member

Shruti Rijhwani

Campus

RIT – Main Campus
