Software datasets and artifacts play a crucial role in advancing automated software traceability research. They can be used by researchers in different ways to develop or validate new automated approaches. Software artifacts, other than source code and issue tracking entities, can also provide a great deal of insight into a software system and facilitate knowledge sharing and information reuse. The diversity and quality of the datasets and artifacts within a research community have a significant impact on the accuracy, generalizability, and reproducibility of the results and consequently on the usefulness and practicality of the techniques under study. Collecting and assessing the quality of such datasets are not trivial tasks and have been reported as an obstacle by many researchers in the domain of software engineering. In this dissertation, we report our empirical work that aims to automatically generate and assess the quality of such datasets. Our goal is to introduce an intelligent system that can help researchers in the domain of software traceability in obtaining high-quality “training sets”, “testing sets” or appropriate “case studies” from open source repositories based on their needs.

In the first project, we present a first-of-its-kind study that reviews and assesses the datasets used in software traceability research over the last fifteen years. This study articulates the current status of these datasets, their characteristics, and their threats to validity. Second, this dissertation introduces a Traceability-Dataset Quality Assessment (T-DQA) framework that categorizes software traceability datasets and assists researchers in selecting appropriate datasets based on the characteristics of the datasets and the context in which they will be used. Third, we present the results of an empirical study with limited scope that generates datasets using three baseline approaches for the creation of training data. These approaches are (i) Expert-Based, (ii) Automated Web-Mining, which generates training sets by automatically mining tactics' APIs from technical programming websites, and (iii) Automated Big-Data Analysis, which mines ultra-large-scale code repositories to generate training sets. We compare the trace-link creation accuracy achieved using each of these three baseline approaches and discuss the costs and benefits associated with them. Additionally, in a separate study, we investigate the impact of training set size on the accuracy of recovering trace links. Fourth, we conduct a large-scale study to identify which types of software artifacts are produced by a wide variety of open-source projects at different levels of granularity. We then propose an automated approach based on Machine Learning techniques to identify various types of software artifacts. Through a set of experiments, we report and compare the performance of these algorithms when applied to software artifacts. Finally, we conduct a study to understand how software traceability experts and practitioners evaluate the quality of their datasets. In addition, we aim to gather experts’ opinions on all quality attributes and metrics proposed by T-DQA.

Library of Congress Subject Headings

Software engineering--Data processing; Computer software--Reusability; Database management

Degree Name

Computing and Information Sciences (Ph.D.)


Mehdi Mirakhorli

Advisor/Committee Member

Venera Arnaoudova

Advisor/Committee Member

Christian Newman


RIT – Main Campus

Plan Codes