Abstract
This research evaluates pattern recognition techniques on a subclass of big data where the dimensionality of the input space, p, is much larger than the number of observations, n. Seven gene-expression microarray cancer datasets, where the ratio κ = n/p is less than one, were chosen for evaluation. The statistical and computational challenges inherent in this type of high-dimensional low sample size (HDLSS) data were explored. The capability and performance of a diverse set of machine learning algorithms are presented and compared. The sparsity and collinearity of the data, together with the complexity of the algorithms studied, demanded rigorous and careful tuning of the hyperparameters and regularization parameters. This required investigating several extensions of cross-validation in order to achieve the best predictive performance.
For the techniques evaluated in this thesis, regularization or kernelization, and often both, produced lower classification error rates than randomized ensembles for all datasets used in this research. However, no single technique evaluated for classifying HDLSS microarray cancer data emerged as the universally best technique for predicting the generalization error.¹
From the empirical analysis performed in this thesis, the following fundamentals emerged as instrumental in consistently achieving lower error rates when estimating the generalization error on this HDLSS microarray cancer data:
• Thoroughly investigate and understand the data.
• Stratify during all sampling due to the uneven classes and extreme sparsity of this data.
• Perform 3 to 5 replicates of stratified cross-validation, implementing an adaptive K-fold, to determine the optimal tuning parameters.
• To estimate the generalization error in HDLSS data, replication is paramount. Replicate R=500 or R=1000 times with training and test sets of 2/3 and 1/3, respectively, to get the best generalization error estimate.
• Whenever possible, obtain an independent validation dataset.
• Seed the random number generator so that all techniques are compared on the same data splits, giving a fair and unbiased comparison.
• Define a methodology or standard set of process protocols to apply to machine learning research. This would prove very beneficial in ensuring reproducibility and would enable better comparisons among techniques.
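The sampling recommendations above (stratification, replicated 2/3 and 1/3 splits, and seeding) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the function names are hypothetical, the data are synthetic (n = 30, p = 200), and a simple nearest-centroid classifier stands in for the algorithms the thesis actually studied. The cross-validated tuning step is omitted.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, rng):
    """Split indices so each class contributes about test_fraction to the test set."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label][:]
        rng.shuffle(idx)
        # Assumes every class has at least two observations.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_centroids(X, y):
    """Stand-in classifier (not from the thesis): per-class mean vectors."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[label][j] += value
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )


def replicated_holdout_errors(X, y, replicates, test_fraction, seed):
    """Test error rate for each of `replicates` stratified train/test splits."""
    rng = random.Random(seed)  # a fixed seed makes the sequence of splits reproducible
    errors = []
    for _ in range(replicates):
        train_idx, test_idx = stratified_split(y, test_fraction, rng)
        model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict(model, X[i]) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return errors


# Synthetic HDLSS-style data (illustrative only): uneven classes, p much larger than n.
data_rng = random.Random(2015)
n_per_class = {0: 20, 1: 10}
p = 200
X, y = [], []
for label, count in n_per_class.items():
    for _ in range(count):
        row = [data_rng.gauss(0.0, 1.0) for _ in range(p)]
        if label == 1:
            for j in range(10):  # only the first 10 features carry class signal
                row[j] += 1.5
        X.append(row)
        y.append(label)

errors = replicated_holdout_errors(X, y, replicates=500, test_fraction=1 / 3, seed=42)
mean_error = sum(errors) / len(errors)
print(f"mean test error over {len(errors)} replicates: {mean_error:.3f}")
```

Reusing the same seed for every technique under comparison means each one is evaluated on an identical sequence of splits, so differences in the averaged error reflect the techniques rather than the sampling.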
_____
¹ A predominant portion of this research was published in the Serdica Journal of Computing (Volume 8, Number 2, 2014) as proceedings from the 2014 Flint International Statistical Conference at Kettering University, Michigan, USA.
Library of Congress Subject Headings
Cancer--Data processing; Machine learning; Pattern recognition systems
Publication Date
7-2015
Document Type
Thesis
Student Type
Graduate
Degree Name
Applied Statistics (MS)
Department, Program, or Center
The John D. Hromi Center for Quality and Applied Statistics (KGCOE)
Advisor
Ernest Fokoue
Advisor/Committee Member
Steven LaLonde
Advisor/Committee Member
Daniel Lawrence
Recommended Citation
Bill, Jo A., "An Empirical Analysis of Predictive Machine Learning Algorithms on High-Dimensional Microarray Cancer Data" (2015). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/8764
Campus
RIT – Main Campus
Plan Codes
APPSTAT-MS
Comments
Physical copy available from RIT's Wallace Library at RC267 .B45 2015