Articles

An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags

Christian D. Newman
Michael J. Decker, Bowling Green State University
Reem S. Alsuhaibani, Kent State University
Anthony Peruma, Rochester Institute of Technology
Mohamed Wiem Mkaouer, Rochester Institute of Technology
Satyajit Mohapatra, Rochester Institute of Technology
Tejal Vishnoi, Rochester Institute of Technology
Marcos Zampieri, Rochester Institute of Technology
Timothy Sheldon, BNY Mellon
Emily Hill, Drew University

Abstract

This paper presents an ensemble part-of-speech tagging approach for source code identifiers. Ensemble tagging is a technique that uses machine learning and the output of multiple part-of-speech taggers to annotate natural language text at a higher quality than any of those taggers achieves independently. Our ensemble uses three state-of-the-art part-of-speech taggers: SWUM, POSSE, and Stanford. We study the quality of the ensemble's annotations on five different types of identifier names (function, class, attribute, parameter, and declaration statement) at the level of both individual words and full identifier names. We also study and discuss the weaknesses of our tagger to encourage further research that addresses them. Our results show that the ensemble achieves 75% accuracy at the identifier level and 84-86% accuracy at the word level. This is an increase of 17 percentage points at the identifier level over the closest independent part-of-speech tagger.
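
To make the idea of combining tagger outputs and the two evaluation levels concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not say which learning algorithm the ensemble uses, so the sketch substitutes a simple learned lookup table with a majority-vote fallback. The function names, the tag labels (N, V, NM, P), and the toy data are all hypothetical, and the three positions in each tuple merely stand in for the outputs of SWUM, POSSE, and Stanford.

```python
from collections import Counter, defaultdict


def train_combiner(examples):
    """Learn a lookup table from base-tagger outputs to a gold tag.

    examples: iterable of ((tag_a, tag_b, tag_c), gold_tag) pairs, where
    the tuple holds the tags that three base taggers gave to one word.
    This table is an assumed stand-in for a real machine-learning model.
    """
    counts = defaultdict(Counter)
    for base_tags, gold in examples:
        counts[base_tags][gold] += 1
    # For each observed combination, keep the most frequent gold tag.
    return {base: c.most_common(1)[0][0] for base, c in counts.items()}


def ensemble_tag(table, base_tags):
    """Tag one word from the three base-tagger outputs."""
    if base_tags in table:
        return table[base_tags]
    # Unseen combination: majority vote, ties go to the earliest tagger.
    votes = Counter(base_tags)
    best = max(votes.values())
    return next(tag for tag in base_tags if votes[tag] == best)


def accuracies(predicted, gold):
    """Return (word-level accuracy, identifier-level accuracy).

    predicted, gold: lists of tag sequences, one sequence per identifier.
    An identifier counts as correct only if every word tag matches.
    """
    word_hits = sum(
        p == g for ps, gs in zip(predicted, gold) for p, g in zip(ps, gs)
    )
    word_total = sum(len(gs) for gs in gold)
    ident_hits = sum(ps == gs for ps, gs in zip(predicted, gold))
    return word_hits / word_total, ident_hits / len(gold)


if __name__ == "__main__":
    # Toy training data (hypothetical): base-tagger outputs and gold tags.
    training = [
        (("V", "V", "N"), "V"),
        (("N", "NM", "N"), "NM"),
        (("N", "NM", "N"), "NM"),
        (("N", "NM", "N"), "N"),
    ]
    table = train_combiner(training)

    predicted = [["V", "N"], ["NM", "N", "N"]]
    gold = [["V", "N"], ["NM", "NM", "N"]]
    word_acc, ident_acc = accuracies(predicted, gold)
    print(f"word-level: {word_acc:.2f}, identifier-level: {ident_acc:.2f}")
```

The sketch also shows why identifier-level accuracy is the stricter measure: a single wrong word tag makes the whole identifier count as incorrect, which is consistent with the identifier-level figure in the abstract being lower than the word-level one.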

Publication Date

2021

Comments

© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Document Type

Article

Department, Program, or Center

Software Engineering (GCCIS)

Recommended Citation

C. Newman, et al., "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags" in IEEE Transactions on Software Engineering, vol. , no. 01, pp. 1-1, 5555. doi: 10.1109/TSE.2021.3098242

Campus

RIT – Main Campus

Included in

Data Science Commons, Software Engineering Commons
