Eye movements help us identify when and where we are fixating. The location under fixation is a valuable source of information for decoding a person’s intent and can serve as an input modality for human-computer interaction. However, it can be difficult to maintain fixation under motion unless our eyes compensate for body movement. Humans have evolved compensatory mechanisms, mediated by the vestibulo-ocular reflex pathway, that ensure stable fixation under motion. The interaction between the vestibular and ocular systems has primarily been studied in controlled environments, with comparatively few studies during natural tasks that involve coordinated head and eye movements under unrestrained body motion. Moreover, off-the-shelf tools for analyzing gaze events perform poorly when head movements are allowed. To address these issues, we developed algorithms for gaze event classification and collected the Gaze-in-Wild (GW) dataset.
However, reliable inference of human behavior during in-the-wild activities depends heavily on the quality of gaze data extracted from eye trackers. State-of-the-art gaze estimation algorithms are easily affected by occluded eye features, askew eye camera orientation, and reflective artifacts from the environment, all factors commonly found in unrestrained experiment designs. To build robustness to reflective artifacts, our efforts helped develop RITNet, a convolutional encoder-decoder neural network that segments eye images into semantic parts such as the pupil, iris, and sclera. Well-chosen data augmentation techniques and objective functions combat reflective artifacts and helped RITNet achieve first place in OpenEDS’19, an international competition organized by Facebook Reality Labs. To induce robustness to occlusions, our efforts resulted in a novel eye image segmentation protocol, EllSeg. EllSeg demonstrates state-of-the-art pupil and iris detection despite the presence of reflective artifacts and occlusions.
While our efforts have shown promising results in developing a reliable and robust gaze feature extractor, convolutional neural networks are prone to overfitting and do not generalize well beyond the distribution of data they were optimized on. To mitigate this limitation and explore the generalization capacity of EllSeg, we acquire a wide distribution of eye images sourced from multiple publicly available datasets to develop EllSeg-Gen, a domain generalization framework for segmenting eye imagery. EllSeg-Gen proposes four tests that allow us to quantify generalization. We find that jointly training with multiple datasets improves generalization for eye images acquired outdoors. In contrast, specialized dataset-specific models are better suited for indoor domain generalization.

Library of Congress Subject Headings

Gaze--Data processing; Neural networks (Computer science); Eye tracking

Publication Date


Document Type


Student Type


Degree Name

Imaging Science (Ph.D.)

Department, Program, or Center

Chester F. Carlson Center for Imaging Science (COS)


Advisor

Gabriel J. Diaz

Advisor/Committee Member

Reynold J. Bailey

Advisor/Committee Member

Christopher Kanan


Campus

RIT – Main Campus

Plan Codes