This thesis details the development of a part-of-speech tagger that determines grammar patterns in source code identifiers. These grammar patterns help guide the naming of identifiers so that code is easier for readers to comprehend. The tagger continues the work of a previous Ensemble Tagger [62], but focuses on increasing the tagging rate while maintaining accuracy, so that the tagger is scalable. The Scalable Tagger will be trained on open source data sets, using a machine learning model and training features chosen to best meet the requirements for accuracy and tagging rate. The results of the experiment will be compared with those of the Ensemble Tagger to determine the Scalable Tagger’s efficacy.

Library of Congress Subject Headings

Natural language processing (Computer science); Software engineering; Machine learning

Publication Date


Document Type


Student Type


Degree Name

Computer Science (MS)

Department, Program, or Center

Computer Science (GCCIS)


J. Scott Hawker

Advisor/Committee Member

Mohamed Wiem Mkaouer

Advisor/Committee Member

Christian Newman


RIT – Main Campus

Plan Codes