Abstract

Obtaining large amounts of labelled data in medical imaging is hindered by issues such as privacy and cost, which limits the gains that can be achieved with Deep Neural Networks (DNNs). A large body of work has shown data augmentation to be an effective strategy for expanding training data, leading to more robust models. However, the lack of data is not always uniform across a dataset; for example, some classes may be underrepresented. This is particularly problematic because good overall performance of a model on a dataset can easily mask under-performance on underrepresented classes or instances. Naively employing data augmentation in such scenarios can further increase performance disparities. In this research, we focus on two such cases in which the overall dataset may not exhibit data scarcity but some classes or instances are underrepresented: 1) when some objects are underrepresented in image segmentation, e.g., tumours in medical images; and 2) when a particular event is underrepresented for a population subgroup, leading to spurious correlations in the data. For the first case, we propose a novel object-centric data augmentation model that learns the shape variations of the objects of interest and augments each object in place without modifying the rest of the image. For the second case (spurious correlations), we take the first step in systematically understanding the shortcomings of existing optimization-based and representation-based approaches to tackling spurious correlations in the context of medical images. Our findings show that current optimization methods that address spurious correlations by focusing on underperforming samples can be problematic when bias is not the only cause of poor performance, and that naive invariant representation learning itself suffers from spurious correlations.
We further show that using optimization in conjunction with invariant representation learning can lead to better representations that are free of features irrelevant to the task at hand. We also attempt to tackle the problem of spurious correlations in the data space by exploring counterfactual (CF) augmentation to factor out correlations from the training set. However, despite being widely used in the literature, most CF generation models struggle when exposed to highly correlated data, and strategies to improve the accuracy of CFs often come at the cost of their diversity. Finally, as future work, we propose to look beyond loss-based methods for identifying correlation-conflicting samples, because loss-based methods often cannot distinguish between noisy and correlation-conflicting samples.

Library of Congress Subject Headings

Diagnostic imaging--Data processing; Generative adversarial networks (Computer networks); Neural networks (Computer science)

Publication Date

2025

Document Type

Dissertation

Student Type

Graduate

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

Computing and Information Sciences Ph.D, Department of

College

Golisano College of Computing and Information Sciences

Advisor

Linwei Wang

Advisor/Committee Member

Pengcheng Shi

Advisor/Committee Member

Rui Li

Campus

RIT – Main Campus

Plan Codes

COMPIS-PHD
