Abstract

The advent of transformer-based models has revolutionized natural language processing, bringing remarkable improvements to tasks such as automatic speech recognition (ASR). Inspired by these advancements, this thesis explores the optimization of a transformer-based ASR model to improve transcription accuracy in educational settings, particularly for lecture content. The goal of this research is to provide real-time, high-accuracy captions that enhance accessibility for all students while offering a cost-effective solution for educators. To assess the potential of domain-specific fine-tuning, Whisper-small underwent two phases of fine-tuning. In the first phase, it was fine-tuned on carefully selected, publicly available datasets: SpeechColab's GigaSpeech-XS [39] and the AMI Meeting Corpus [14]. In the second phase, the fine-tuned model was further optimized on a self-curated dataset [16] consisting of roughly 10 hours of live lecture recordings collected and assembled by the author. Finally, a real-time captioning assistant application was developed to leverage the fine-tuned model and transcribe speech in real time with live editing capabilities. The optimized Whisper-small model was evaluated against Whisper's pretrained small, medium, and large (version 2) counterparts on clean, unseen data [15] prepared by the author. The fine-tuned model achieved a lower Word Error Rate (WER) of 4.53%, compared to 5.51% and 5.78% for Whisper-Medium and Whisper-Large-V2, respectively. These results demonstrate that fine-tuning a transformer-based ASR model on domain-specific data can significantly enhance its performance in a targeted context such as live lecture transcription. The findings of this experiment highlight the promise of transformer-based models for improving educational accessibility. By building an application tailored to live lecture settings, this research contributes to the development of adaptable, low-cost technologies that support inclusive learning environments. The success of this experiment lays the groundwork for future breakthroughs in speech recognition, aiming to make education more accessible for everyone.
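The comparison described above reduces to transcribing held-out lecture audio with each checkpoint and computing corpus-level WER. The sketch below is a minimal, illustrative example of that evaluation step only, not the thesis's actual pipeline; the fine-tuned checkpoint name `whisper-small-lectures`, the audio file names, and the reference transcripts are hypothetical placeholders, and it assumes the Hugging Face `transformers` pipeline and the `jiwer` library.

```python
# Illustrative sketch: compare WER of a fine-tuned Whisper checkpoint against
# the pretrained baseline on a few held-out (audio, reference transcript) pairs.
# "whisper-small-lectures" and the file/transcript values below are hypothetical.
from transformers import pipeline
import jiwer

def transcribe_all(model_name, audio_paths):
    """Transcribe every audio file with the given Whisper checkpoint."""
    asr = pipeline("automatic-speech-recognition", model=model_name)
    return [asr(path)["text"] for path in audio_paths]

audio_paths = ["lecture_clip_01.wav", "lecture_clip_02.wav"]   # held-out lecture clips
references = [
    "welcome to today's lecture on dynamic programming",       # ground-truth transcripts
    "recall the recurrence we derived last week",
]

for name in ["openai/whisper-small", "whisper-small-lectures"]:
    hypotheses = transcribe_all(name, audio_paths)
    wer = jiwer.wer(references, hypotheses)                     # corpus-level word error rate
    print(f"{name}: WER = {wer:.2%}")
```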

Library of Congress Subject Headings

Automatic speech recognition--Quality control; Domain-specific programming languages; Voice computing

Publication Date

1-2025

Document Type

Thesis

Student Type

Graduate

College

Golisano College of Computing and Information Sciences

Advisor

Thomas B. Kinsman

Advisor/Committee Member

Joe Geigel

Advisor/Committee Member

Jansen Orfan

Campus

RIT – Main Campus

Plan Codes

COMPSCI-MS
