In genetic toxicology, the term aneugen refers to chemical or physical agents that cause chromosomes to malsegregate during cell division, resulting in altered DNA content in daughter cells. This form of chromosome damage can be detected in certain mammalian cell-based assays, but the molecular mechanism(s) responsible for aneugenic effects are not apparent from these conventional tests. The responsible molecular initiating event (MIE) is nonetheless of interest to the pharmaceutical, chemical, and agro-chemical industries, because this knowledge can help them design out such liabilities or avoid similar chemical structures altogether. This study evaluated the ability of several experimental biomarkers to identify the MIE of aneugens from the functional curves obtained from human TK6 cells exposed to fluorescent Taxol (Taxol 488) for four hours and co-treated with known aneugens over a range of concentrations.

A large functional space of classifiers was evaluated using two stages of cross-validation. First, a wide space was searched using a variety of depth-based, area under the curve (AUC) summary, and kernel methods to identify the top-performing models. The top models were then evaluated in a second stage of cross-validation to establish a mean error rate and log loss that approached their theoretical distributions.
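As an illustration of the second stage, the following is a minimal sketch of repeated k-fold cross-validation that reports a mean error rate and a mean log loss. It is not the thesis code: the synthetic curves, the nearest-centroid stand-in classifier, and the names `repeated_cv`, `centroid_proba`, and `kfold_indices` are all assumptions made for this example.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Split a shuffled range(n) into k nearly equal test folds."""
    return np.array_split(rng.permutation(n), k)

def centroid_proba(X_train, y_train, X_test, classes):
    """Stand-in classifier (assumption): softmax over negative distances to class means."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    logits = -dists
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def repeated_cv(X, y, k=5, repeats=20, seed=0):
    """Average the misclassification rate and log loss over repeated k-fold splits."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    errors, losses = [], []
    for _ in range(repeats):
        for test_idx in kfold_indices(len(y), k, rng):
            train = np.ones(len(y), dtype=bool)
            train[test_idx] = False
            proba = centroid_proba(X[train], y[train], X[test_idx], classes)
            pred = classes[proba.argmax(axis=1)]
            errors.append(np.mean(pred != y[test_idx]))
            # Probability assigned to the true class of each held-out curve.
            true_col = np.searchsorted(classes, y[test_idx])
            p_true = np.clip(proba[np.arange(len(test_idx)), true_col], 1e-15, 1.0)
            losses.append(-np.mean(np.log(p_true)))
    return float(np.mean(errors)), float(np.mean(losses))

# Synthetic two-class curves on a 20-point grid (illustrative data only).
data_rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 20)
y = np.repeat([0, 1], 30)
X = np.sin(2 * np.pi * t) + y[:, None] * 1.0 + data_rng.normal(scale=0.1, size=(60, 20))

mean_error, mean_log_loss = repeated_cv(X, y)
print(f"mean error rate: {mean_error:.3f}, mean log loss: {mean_log_loss:.3f}")
```

Repeating the split many times is one way to stabilize both estimates; the number of folds and repeats used here is arbitrary.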

The search over this large space of non-parametric and functional classifiers found that a k-nearest neighbors (KNN) model using a single neighbor on an h-modal depth calculation of the functional curve could classify the MIE of aneugens with cross-validated error rates close to zero, well below those of other approaches such as AUC summary methods and other depth-based methods. Like the KNN model, a kernel support vector machine with an ANOVAdot kernel could classify aneugens, in this case directly from the raw functional curve data without a depth-based calculation.
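The depth-based approach can be sketched as follows, assuming a depth-versus-depth construction: each curve is mapped to its h-modal depth relative to each class, and a single nearest neighbor is then used in that depth space. The Gaussian kernel, the fixed bandwidth `h`, the synthetic curves, and all function names are assumptions for illustration and may differ from what the thesis actually used.

```python
import numpy as np

def h_modal_depth(curves, reference, h):
    """h-modal depth: sum of Gaussian kernels of L2 distances to the reference curves."""
    diffs = curves[:, None, :] - reference[None, :, :]
    dists = np.sqrt(np.mean(diffs ** 2, axis=2))  # L2 distance on a uniform unit grid
    return np.exp(-0.5 * (dists / h) ** 2).sum(axis=1)

def dd_features(curves, train_curves, train_labels, classes, h):
    """One column per class: depth of each curve relative to that class's training curves."""
    return np.column_stack(
        [h_modal_depth(curves, train_curves[train_labels == c], h) for c in classes]
    )

def one_nn_predict(train_feats, train_labels, test_feats):
    """Assign each test point the label of its single nearest training point."""
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    return train_labels[d.argmin(axis=1)]

def make_curves(n_per_class, rng, n_points=20):
    """Synthetic two-class curves (illustrative only, not assay data)."""
    t = np.linspace(0.0, 1.0, n_points)
    labels = np.repeat([0, 1], n_per_class)
    noise = rng.normal(scale=0.1, size=(labels.size, n_points))
    return np.sin(2 * np.pi * t) + labels[:, None] * 1.0 + noise, labels

rng = np.random.default_rng(0)
train_curves, train_labels = make_curves(20, rng)
test_curves, test_labels = make_curves(10, rng)
classes = np.unique(train_labels)

h = 0.5  # bandwidth chosen by hand for this toy example (assumption)
train_dd = dd_features(train_curves, train_curves, train_labels, classes, h)
test_dd = dd_features(test_curves, train_curves, train_labels, classes, h)

pred = one_nn_predict(train_dd, train_labels, test_dd)
accuracy = float(np.mean(pred == test_labels))
print(f"held-out accuracy: {accuracy:.2f}")
```

In this sketch each training curve contributes to its own depth, which slightly inflates the training depths; a leave-one-out depth would avoid that at the cost of a little extra code.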

Although these models performed best, they benefit from the additional observations provided by replicate data. When the data are summarized to remove replicates, linear discriminant analysis on AUC-summarized data is the best model.
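The replicate-free alternative can be sketched in the same spirit: replicate curves are averaged, each averaged curve is reduced to a trapezoidal AUC, and a linear discriminant rule is fit to the resulting AUC vectors. The array layout (compounds, replicates, biomarkers, concentrations), the class labels "A" and "B", the synthetic responses, and the helper names are assumptions for this example only.

```python
import numpy as np

def trapezoid_auc(y, x):
    """Trapezoidal area under y(x) along the last axis."""
    return np.sum((y[..., 1:] + y[..., :-1]) / 2.0 * np.diff(x), axis=-1)

def fit_lda(X, y):
    """Linear discriminant analysis with a pooled within-class covariance."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    centered = X - means[np.searchsorted(classes, y)]
    pooled_cov = centered.T @ centered / (len(y) - len(classes))
    return classes, means, priors, np.linalg.pinv(pooled_cov)

def predict_lda(model, X):
    """Pick the class with the largest linear discriminant score."""
    classes, means, priors, prec = model
    scores = (
        X @ prec @ means.T
        - 0.5 * np.sum(means @ prec * means, axis=1)
        + np.log(priors)
    )
    return classes[scores.argmax(axis=1)]

# Synthetic layout: 12 compounds x 3 replicates x 2 biomarkers x 6 concentrations.
rng = np.random.default_rng(0)
conc = np.linspace(0.0, 1.0, 6)
labels = np.array(["A"] * 6 + ["B"] * 6)  # hypothetical MIE classes
slopes = np.where(labels[:, None] == "A", [1.0, 0.2], [0.2, 1.0])
curves = 1.0 + slopes[:, None, :, None] * conc + rng.normal(scale=0.05, size=(12, 3, 2, 6))

# Remove replicates by averaging, then summarize each biomarker curve by its AUC.
mean_curves = curves.mean(axis=1)
auc = trapezoid_auc(mean_curves, conc)  # shape (12, 2)

model = fit_lda(auc, labels)
accuracy = float(np.mean(predict_lda(model, auc) == labels))
print(f"training accuracy on AUC features: {accuracy:.2f}")
```

The accuracy printed here is measured on the training data of a toy example and says nothing about the performance reported in the study.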

This study shows that machine learning methods can use the raw functional curves from an aneugen experiment to identify the MIE with high accuracy.

Library of Congress Subject Headings

Multivariate analysis--Data processing; Classification--Mathematics; Genetic toxicology

Publication Date


Document Type


Student Type


Degree Name

Applied Statistics (MS)

Department, Program, or Center

School of Mathematical Sciences (COS)


Peter Bajorski

Advisor/Committee Member

Minh Pham

Advisor/Committee Member

Stephen Dertinger


RIT – Main Campus

Plan Codes