Abstract

Supervised learning algorithms rely on availability of labeled data. Labeled data is either scarce or involves substantial human effort in the labeling process. These two factors, along with the abundance of unlabeled data, have spurred research initiatives that exploit unlabeled data to boost supervised learning. This genre of learning algorithms that utilize unlabeled data alongside a small set of labeled data are known as semi-supervised learning algorithms. Data characteristics, such as the presence of a generative model, provide the foundation for applying these learning algorithms. Co-training is one such al gorithm that leverages existence of two redundant "views" for a data instance. Based on these two views, the co-training algorithm trains two classifiers using the labeled data. The small set of labeled data results in a pair of weak classi fiers. With the help of the unlabeled data the two classifiers alternately boost each other to achieve a high-accuracy classifier. The conditions imposed by the co-training algorithm regarding the data characteristics restrict its application to data that possesses a natural split of the feature set. In this thesis we study the co-training setting and propose to overcome the above mentioned constraint by "manufacturing" feature splits. We pose and investigate the following questions: 1 . Can a feature split be constructed for a dataset such that the co-training algorithm can be applied to it? 2. If a feature split can be engineered, would splitting the features into more than two partitions give a better classifier? In essence, does moving from co-training (2 classifiers) to k-training (k-classifiers) help? 3. Is there an optimal number of "views" for a dataset such that k-training leads to an optimal classifier? The task of obtaining feature splits is approached by modeling the problem as a graph partitioning problem. Experiments are conducted on a breadth of text datasets. Results of k-training using constructed feature sets are compared with that of the expectation-maximization algorithm, which has been successful in a semi-supervised setting.

Library of Congress Subject Headings

Supervised learning (Machine learning); Text processing (Computer science); Data mining; Automatic classification; Natural language processing (Computer science)

Publication Date

2004

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Science (MS)

Department, Program, or Center

Computer Science (GCCIS)

Advisor

Ankur Teredesai

Advisor/Committee Member

Roger Gaborski

Advisor/Committee Member

Khalid Al-Kofahi

Comments

Physical copy available from RIT's Wallace Library at Q325.75 .C42 2004

Recommended Citation

Chaoji, Vineet, "Feature Partitioning for the Co-Traning Setting" (2004). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/7704

Campus

RIT – Main Campus

Download

COinS

Theses

Feature Partitioning for the Co-Traning Setting

Abstract

Library of Congress Subject Headings

Publication Date

Document Type

Student Type

Degree Name

Department, Program, or Center

Advisor

Advisor/Committee Member

Advisor/Committee Member

Comments

Recommended Citation

Campus

Search

Browse

Author Corner

RIT Links

Theses

Feature Partitioning for the Co-Traning Setting

Author

Abstract

Library of Congress Subject Headings

Publication Date

Document Type

Student Type

Degree Name

Department, Program, or Center

Advisor

Advisor/Committee Member

Advisor/Committee Member

Comments

Recommended Citation

Campus

Share

Search

Browse

Author Corner

RIT Links