Abstract
Supervised learning algorithms rely on the availability of labeled data. Labeled data is either scarce or requires substantial human effort to produce. These two factors, along with the abundance of unlabeled data, have spurred research that exploits unlabeled data to boost supervised learning. Learning algorithms that use unlabeled data alongside a small set of labeled data are known as semi-supervised learning algorithms. Data characteristics, such as the presence of a generative model, provide the foundation for applying these algorithms. Co-training is one such algorithm; it leverages the existence of two redundant "views" of a data instance. Based on these two views, the co-training algorithm trains two classifiers using the labeled data. Because the labeled set is small, the result is a pair of weak classifiers. With the help of the unlabeled data, the two classifiers alternately boost each other to achieve a high-accuracy classifier. The conditions that co-training imposes on the data restrict its application to data that possesses a natural split of the feature set.
In this thesis we study the co-training setting and propose to overcome the above-mentioned constraint by "manufacturing" feature splits. We pose and investigate the following questions:
1. Can a feature split be constructed for a dataset such that the co-training algorithm can be applied to it?
2. If a feature split can be engineered, would splitting the features into more than two partitions give a better classifier? In essence, does moving from co-training (2 classifiers) to k-training (k classifiers) help?
3. Is there an optimal number of "views" for a dataset such that k-training leads to an optimal classifier?
The task of obtaining feature splits is approached by modeling it as a graph partitioning problem. Experiments are conducted on a range of text datasets. Results of k-training using constructed feature sets are compared with those of the expectation-maximization algorithm, which has been successful in a semi-supervised setting.
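The k-training loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the names `split_vocabulary`, `ViewNaiveBayes`, `k_train`, and `predict` are hypothetical, the base learner is a small multinomial naive Bayes, and the feature split is a simple round-robin deal of the vocabulary rather than the graph-partitioning approach the thesis investigates. Each view's classifier also adds only its single most confident unlabeled document per round, which is a simplification.

```python
import math
from collections import Counter


def split_vocabulary(vocab, k):
    """Placeholder split (assumption): deal the sorted vocabulary round-robin into k views."""
    ordered = sorted(vocab)
    return [set(ordered[i::k]) for i in range(k)]


class ViewNaiveBayes:
    """Multinomial naive Bayes that only sees the words belonging to one view."""

    def __init__(self, view):
        self.view = view

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.counts[y].update(w for w in doc if w in self.view)
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict_proba(self, doc):
        words = [w for w in doc if w in self.view]
        n = sum(self.prior.values())
        v = max(len(self.view), 1)
        logp = {}
        for c in self.classes:
            lp = math.log(self.prior[c] / n)
            for w in words:
                # Laplace (add-one) smoothing over the words of this view.
                lp += math.log((self.counts[c][w] + 1) / (self.total[c] + v))
            logp[c] = lp
        top = max(logp.values())
        unnorm = {c: math.exp(lp - top) for c, lp in logp.items()}
        z = sum(unnorm.values())
        return {c: u / z for c, u in unnorm.items()}


def k_train(labeled, unlabeled, views, rounds=10):
    """Each view's classifier in turn self-labels its most confident pool document."""
    docs = [d for d, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    models = [ViewNaiveBayes(v) for v in views]
    for _ in range(rounds):
        for m in models:
            m.fit(docs, labels)
        if not pool:
            break
        for m in models:
            if not pool:
                break
            best_i, best_label, best_p = 0, None, -1.0
            for i, doc in enumerate(pool):
                probs = m.predict_proba(doc)
                label = max(probs, key=probs.get)
                if probs[label] > best_p:
                    best_i, best_label, best_p = i, label, probs[label]
            # The newly labeled document becomes training data for all views.
            docs.append(pool.pop(best_i))
            labels.append(best_label)
    for m in models:
        m.fit(docs, labels)
    return models


def predict(models, doc):
    """Combine the views by summing their per-class posterior estimates."""
    classes = models[0].classes
    score = {c: sum(m.predict_proba(doc)[c] for m in models) for c in classes}
    return max(score, key=score.get)


# Toy data, invented for this example.
labeled = [
    (["ball", "coach", "match", "player"], "sports"),
    (["election", "law", "senate", "vote"], "politics"),
]
unlabeled = [["ball", "player"], ["law", "senate"], ["match", "coach"], ["vote", "election"]]
vocab = {w for d, _ in labeled for w in d} | {w for d in unlabeled for w in d}
views = split_vocabulary(vocab, k=2)
models = k_train(labeled, unlabeled, views)
print(predict(models, ["coach", "ball"]))
```

Setting `k=2` corresponds to ordinary co-training with two classifiers; larger values of `k` give the k-training variant that the second and third questions ask about. A split that leaves one view with no informative features for some class would weaken that view's classifier, which is one reason the choice of split matters.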
Library of Congress Subject Headings
Supervised learning (Machine learning); Text processing (Computer science); Data mining; Automatic classification; Natural language processing (Computer science)
Publication Date
2004
Document Type
Thesis
Student Type
Graduate
Degree Name
Computer Science (MS)
Department, Program, or Center
Computer Science (GCCIS)
Advisor
Ankur Teredesai
Advisor/Committee Member
Roger Gaborski
Advisor/Committee Member
Khalid Al-Kofahi
Recommended Citation
Chaoji, Vineet, "Feature Partitioning for the Co-Training Setting" (2004). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/7704
Campus
RIT – Main Campus
Comments
Physical copy available from RIT's Wallace Library at Q325.75 .C42 2004