A primary factor in the success of machine learning is the quality of labeled training data. However, in many fields, labeled data can be costly, difficult, or even impossible to acquire. In comparison, computer simulation data can now be generated in much greater abundance and at much lower cost.

These simulation data could potentially solve the problem of data deficiency in many machine learning tasks. Nevertheless, due to model assumptions, simplifications, and possible errors, there is always a discrepancy between simulated and real data. This discrepancy needs to be addressed when transferring knowledge from simulation to real data. Furthermore, simulation data are always tied to specific settings of model parameters, many of which vary over a considerable range yet are not necessarily relevant to the machine learning task of interest. The knowledge extracted from simulation data must thus be generalizable across these parameter variations before being transferred.

In this dissertation, we address these two challenges in leveraging simulation data to overcome the shortage of labeled real data. We do so in the clinical task of localizing the origin of ventricular activation from 12-lead electrocardiograms (ECGs), where clinical ECG data with labeled sites of origin in the heart can only be obtained invasively.

By adopting the concept of domain adaptation, we address the discrepancy between simulated and clinical ECG data by learning the shift between the two domains using a large amount of simulation data and a small amount of clinical data.

By adopting the concept of domain generalization, we then address the reliance of simulated ECG data on patient-specific geometrical models by learning to generalize simulated ECG data across subjects, before transferring them to clinical data.

Evaluating on in-vivo data from patients with premature ventricular contractions (PVCs), we demonstrate the feasibility of utilizing a large number of offline simulated ECG datasets to enable prediction of the origin of arrhythmia with only a small amount of clinical ECG data from a new patient.

Library of Congress Subject Headings

Machine learning; Data mining; Computer simulation; Electrocardiography--Data processing

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

PhD Program in Computing and Information Sciences


Linwei Wang

Advisor/Committee Member

Dana Brooks

Advisor/Committee Member

Ifeoma Nwogu


RIT – Main Campus
