Abstract

Predominantly oral languages (POLs) face a significant "digital divide": they are often excluded from the benefits of modern natural language processing (NLP) technologies because they lack extensive, readily available machine learning (ML) datasets. We investigate methods to overcome this data scarcity for Bambara, a Manding language spoken primarily in Mali that has a rich oral tradition but a limited digital presence. The research leverages crowdsourcing and community engagement to build high-quality, ML-ready datasets. Key contributions include methods for collecting and curating automatic speech recognition (ASR) and machine translation (MT) datasets and for creating educational resources. Our findings demonstrate that while POLs, and Bambara specifically, present unique challenges, including annotation complexity and technological barriers, crowdsourcing offers a scalable and inclusive path toward their computational preservation. By integrating cultural knowledge with modern machine learning, this dissertation provides a framework for making NLP more equitable and accessible to Bambara speakers.

Library of Congress Subject Headings

Bambara language--Data processing; Crowdsourcing; Natural language processing (Computer science)

Publication Date

4-2026

Document Type

Dissertation

Student Type

Graduate

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

Computing and Information Sciences Ph.D., Department of

College

Golisano College of Computing and Information Sciences

Advisor

Christopher Homan

Advisor/Committee Member

Emily Prud'hommeaux

Advisor/Committee Member

Shruti Rijhwani

Recommended Citation

Tapo, Allahsera Auguste, "NLP Crowdsourcing for Predominantly Oral Languages: The Case of Bambara" (2026). Thesis. Rochester Institute of Technology. Accessed from https://repository.rit.edu/theses/12636

Campus

RIT – Main Campus

Plan Codes

COMPIS-PHD

NLP Crowdsourcing for Predominantly Oral Languages: The Case of Bambara
