The relationship between genetics and phenotype is complex and remains poorly understood, and many factors influence how genetic variation gives rise to differences in phenotype. A better understanding of the genetic underpinnings of various phenotypes can help us make important advances in testing for, preventing, treating, and curing a number of diseases and disorders.

The recent popularization of direct-to-consumer sequencing services, coupled with consumers releasing their genetic information for public use, has led to an unprecedented level of access to genetic information. Crowd-sourcing the problem of developing robust genome-wide association techniques for ever larger amounts of data is a promising trend.

This thesis explores candidate methods for mining one such public genetic data repository, openSNP, for correlated genotypes and phenotypes. Particular care is given to data clean-up and the steps required to preprocess public data for machine learning. The preprocessing methods are described so that they can be applied to other existing genetic data repositories, such as the Personal Genome Project, as well as to repositories that may become available in the future. Following data clean-up, a number of machine learning techniques are investigated, applied, and assessed for their utility in this big-data problem. No single machine learning approach was found to be sufficient: the combination of imbalanced phenotype response classes and an underdetermined system made for a difficult machine learning challenge. Additional techniques must be explored or developed to make such genome-wide association studies possible and meaningful.
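To illustrate why imbalanced response classes and an underdetermined system are difficult together, the following is a minimal sketch and not the thesis's own code. The toy phenotype labels, the participant and SNP counts, and all function names are hypothetical and chosen only for illustration. "Underdetermined" is read here as having more features (SNPs) than samples (participants).

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which does not reward ignoring rare classes."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def is_underdetermined(n_samples, n_features):
    """True when there are more features (unknowns) than samples (equations)."""
    return n_features > n_samples


# Hypothetical toy data: 95 of 100 participants report "no" for some trait.
phenotype = ["no"] * 95 + ["yes"] * 5
always_no = ["no"] * len(phenotype)

print(majority_baseline_accuracy(phenotype))    # plain accuracy of the trivial model
print(balanced_accuracy(phenotype, always_no))  # per-class view of the same model
print(is_underdetermined(n_samples=100, n_features=500_000))
```

In this toy case a model that never predicts the rare class still scores high on plain accuracy, while its balanced accuracy is no better than chance. With far more SNPs than participants, many different models can also fit the training data equally well.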

Library of Congress Subject Headings

Machine learning; Phenotype--Data processing; Human genetics--Variation--Data processing

Publication Date


Document Type


Student Type


Degree Name

Bioinformatics (MS)

Department, Program, or Center

Thomas H. Gosnell School of Life Sciences (COS)


Gary R. Skuse

Advisor/Committee Member

Ernest Fokoue

Advisor/Committee Member

Rajendra K. Raj


Physical copy available from RIT's Wallace Library at Q325.5 .H37 2016


RIT – Main Campus