Abstract
Grouping objects that are described by attributes, or clustering, is a central notion in data mining. On the other hand, similarity or relationships between the attributes themselves are equally important but relatively unexplored. Such groups of attributes are also known as directories, concept hierarchies or topics, depending on the underlying data domain. The similarities between the two problems of grouping objects and grouping attributes might suggest that traditional clustering techniques are applicable. This thesis argues that traditional clustering techniques fail to adequately capture the solution we seek. It also explores domain-independent techniques for grouping attributes. The notion of similarity between attributes, and therefore clustering in categorical datasets, has not received adequate attention. This issue has seen renewed interest in the knowledge discovery community, spurred on by the requirements of personalization of information and online search technology. The problem is broken down into (a) quantification of this notion of similarity and (b) the subsequent formation of groups, placing attributes that are similar enough in the same group based on metrics that we will attempt to derive. Both aspects of the problem are carefully studied. The thesis also analyzes existing domain-independent approaches to building distance measures, proposing and analyzing several such measures for quantifying similarity, thereby providing a foundation for future work in grouping relevant attributes. The theoretical results are supported by experiments carried out on a variety of datasets from the text-mining, web-mining, social network and transaction analysis domains. The results indicate that traditional clustering solutions are inadequate within this problem framework. They also suggest a direction for the development of distance measures for the quantification of the concept of similarity between categorical attributes.
Library of Congress Subject Headings
Data mining; Cluster analysis; Information organization
Publication Date
2005
Document Type
Thesis
Department, Program, or Center
Computer Science (GCCIS)
Advisor
Teredesai, Ankur - Chair
Advisor/Committee Member
Hemaspaandra, Edith
Advisor/Committee Member
Gaborski, Roger
Recommended Citation
Dawara, Santosh, "Grouping related attributes" (2005). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/751
Campus
RIT – Main Campus
Comments
Note: imported from RIT’s Digital Media Library running on DSpace to RIT Scholar Works. Physical copy available through RIT's The Wallace Library at: QA76.9.D343 D39 2004