Abstract
Technical logbooks are a challenging and under-explored text type in automated event identification. These texts are typically short and written in non-standard yet technical language, which poses problems for off-the-shelf NLP pipelines. Logbook datasets typically represent a domain (a technical field such as automotive) and an application (e.g., maintenance). The granularity of the issue types described in these datasets additionally leads to class imbalance, making it difficult for models to accurately predict which issue each logbook entry describes. In this research, we focus on the problem of technical issue pre-processing, clustering, and classification by considering logbook datasets from the automotive, aviation, and facility maintenance domains. We developed MaintNet, a collaborative open-source library that includes logbook datasets from various domains and a pre-processing pipeline to clean unstructured datasets. Additionally, we adapted a feedback loop strategy from computer vision for handling extreme class imbalance, which resamples the training data based on the error observed in the prediction process. We further investigated the benefits of using transfer learning from sources within the same domain (but different applications), from within the same application (but different domains), and from all available data to improve the performance of the classification models. Finally, we evaluated several data augmentation approaches, including synonym replacement, random swap, and random deletion, to address the issue of data scarcity in technical logbooks.
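As a rough illustration of two of the augmentation approaches named in the abstract, random swap and random deletion can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the function names, the parameters `n_swaps` and `p`, and the sample logbook entry are all hypothetical, and synonym replacement is left out because it would need a thesaurus or domain lexicon that is not specified here.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # Guard against an empty result so the augmented entry is never blank.
    return kept if kept else [rng.choice(tokens)]


# Hypothetical logbook entry, invented for illustration only.
entry = "engine idle rough replaced spark plug".split()
rng = random.Random(0)  # seeded so runs are repeatable

swapped = random_swap(entry, n_swaps=2, rng=rng)
deleted = random_deletion(entry, p=0.2, rng=rng)
print(" ".join(swapped))
print(" ".join(deleted))
```

Both functions return new lists and leave the input untouched, so each original entry can be augmented several times with different seeds. Keeping at least one token in `random_deletion` is a design choice for very short entries, which the abstract describes as typical of logbooks.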
Library of Congress Subject Headings
Natural language processing (Computer science); Technology--Terminology; Machine learning; Data mining
Publication Date
7-2022
Document Type
Dissertation
Student Type
Graduate
Degree Name
Computing and Information Sciences (Ph.D.)
Department, Program, or Center
Computer Science (GCCIS)
Advisor
Travis Desell
Advisor/Committee Member
Marcos Zampieri
Advisor/Committee Member
Christian Newman
Recommended Citation
Akhbardeh, Farhad, "NLP and ML Methods for Pre-processing, Clustering and Classification of Technical Logbook Datasets" (2022). Thesis. Rochester Institute of Technology. Accessed from https://repository.rit.edu/theses/11227
Campus
RIT – Main Campus
Plan Codes
COMPIS-PHD