Abstract
This dissertation explores the integration of deep learning (DL) techniques in remote sensing (RS) and computer vision (CV), with a focus on optimizing convolutional neural networks (CNNs) for enhanced detection performance and efficient computational deployment.

The first part of the research addresses the underperformance of conventional object detection methods when applied to RS data. Traditional techniques often struggle due to the small size of targets, limited training data, and the diverse modalities involved. To overcome these challenges, we introduce YOLOrs, a novel CNN designed specifically for real-time object detection in multimodal RS imagery. YOLOrs detects objects across multiple scales, predicts target orientations, and incorporates a mid-level fusion architecture that effectively handles multimodal data.

Building on the concept of multimodal data fusion, we further propose a two-phase multi-stream fusion approach that mitigates the difficulty of collecting paired multimodal data, which is often expensive and complex due to the disparate nature of sensing technologies. Our approach first trains the unimodal streams independently, followed by a joint training phase for a common multimodal decision layer. This method has been shown to outperform traditional fusion techniques in empirical tests, demonstrating its effectiveness in practical scenarios.

The second part of the dissertation addresses over-parameterization in CNNs, which often leads to excessive computational demands and storage overhead, as well as overfitting. Here, we introduce YOLOrs-lite, an adaptation of YOLOrs that stores convolutional kernels in the Tensor-Train (TT) format, significantly reducing the network’s parameters while maintaining high detection performance. This approach not only improves model efficiency but also enables real-time inference suitable for edge deployment. Additionally, we extend the TT compression technique to convolutional auto-encoders (CAEs), creating CAE-TT, which adjusts the number of parameters without altering the network architecture and is effective in both batch and online learning settings.

Finally, we explore a novel CNN compression technique based on dynamic parameter rank pruning. Using low-rank matrix approximations and novel regularization strategies, this method dynamically adjusts the ranks during training, achieving substantial reductions in model size with improved or maintained performance on several benchmark datasets.

Collectively, this research advances the field by developing innovative methods that refine DL applications in RS and CV, ensuring both high performance and efficiency in processing and deployment across diverse platforms.
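To make the mid-level fusion idea concrete, the sketch below shows the general pattern of fusing two modality branches at an intermediate feature map by channel concatenation, followed by shared layers. This is a minimal illustration under assumed layer sizes; the class name, channel counts, and branch architectures are hypothetical, and YOLOrs' actual backbone and fusion placement differ.

```python
import torch
import torch.nn as nn

class MidLevelFusion(nn.Module):
    """Two modality branches fused mid-network by channel concatenation
    (a minimal sketch; not the actual YOLOrs architecture)."""
    def __init__(self):
        super().__init__()
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.ir_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        # Shared layers operate on the fused (concatenated) feature maps
        self.shared = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())

    def forward(self, rgb, ir):
        fused = torch.cat([self.rgb_branch(rgb), self.ir_branch(ir)], dim=1)
        return self.shared(fused)

model = MidLevelFusion()
out = model(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
```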
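The two-phase multi-stream approach can likewise be sketched as independently trained unimodal streams joined by a common decision layer. The sketch below assumes a simple feature extractor per modality and freezes the streams during the joint phase; all names, dimensions, and the freezing choice are illustrative assumptions, not the dissertation's exact training protocol.

```python
import torch
import torch.nn as nn

class UnimodalStream(nn.Module):
    """Per-modality feature extractor (hypothetical architecture)."""
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim))
    def forward(self, x):
        return self.backbone(x)

class MultimodalFusion(nn.Module):
    """Common decision layer over concatenated unimodal features."""
    def __init__(self, streams, feat_dim=128, num_classes=10):
        super().__init__()
        self.streams = nn.ModuleList(streams)
        self.decision = nn.Linear(feat_dim * len(streams), num_classes)
    def forward(self, inputs):  # inputs: one tensor per modality
        feats = [s(x) for s, x in zip(self.streams, inputs)]
        return self.decision(torch.cat(feats, dim=1))

# Phase 1 (assumed): each stream is trained on its own modality with a
# temporary per-modality head, which is then discarded.
# Phase 2: freeze the streams and jointly train the decision layer.
model = MultimodalFusion([UnimodalStream(3), UnimodalStream(1)])
for s in model.streams:
    for p in s.parameters():
        p.requires_grad = False
```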
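For the TT compression used in YOLOrs-lite and CAE-TT, the sketch below stores a (C_out, C_in, K, K) convolutional kernel as four small Tensor-Train cores and contracts them back into a full kernel at forward time. The mode ordering, rank values, and initialization are assumptions for illustration; the dissertation's exact TT factorization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TTConv2d(nn.Module):
    """Conv layer whose kernel is stored as Tensor-Train cores and
    reconstructed on the fly (illustrative factorization)."""
    def __init__(self, c_in, c_out, k, ranks=(8, 8, 4), padding=1):
        super().__init__()
        r1, r2, r3 = ranks
        self.padding = padding
        # TT cores: boundary ranks are 1, internal ranks are (r1, r2, r3)
        self.g1 = nn.Parameter(torch.randn(1, c_out, r1) * 0.1)
        self.g2 = nn.Parameter(torch.randn(r1, c_in, r2) * 0.1)
        self.g3 = nn.Parameter(torch.randn(r2, k, r3) * 0.1)
        self.g4 = nn.Parameter(torch.randn(r3, k, 1) * 0.1)

    def forward(self, x):
        # Contract the cores into the full kernel W[o, i, h, w]
        w = torch.einsum('aob,bic,chd,dwe->oihw',
                         self.g1, self.g2, self.g3, self.g4)
        return F.conv2d(x, w, padding=self.padding)

layer = TTConv2d(c_in=64, c_out=128, k=3)
y = layer(torch.randn(1, 64, 32, 32))
tt_params = sum(p.numel() for p in layer.parameters())
print(tt_params, 128 * 64 * 3 * 3)  # cores vs. dense kernel size
```

With the assumed ranks, the cores hold 5,228 parameters versus 73,728 for the dense kernel, which is the kind of reduction that makes edge deployment practical.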
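Finally, dynamic rank pruning can be illustrated with a weight factored as U diag(s) V^T, where a sparsity-inducing penalty on the gate vector s drives rank components toward zero so they can be pruned during training. The L1 regularizer, threshold, and layer shown here are illustrative assumptions, not necessarily the dissertation's exact regularization strategy.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Linear layer factored as U @ diag(s) @ V^T; an L1 penalty on s
    shrinks rank gates so low-magnitude ranks can be pruned."""
    def __init__(self, in_f, out_f, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_f, rank) * 0.1)
        self.s = nn.Parameter(torch.ones(rank))   # per-rank gates
        self.V = nn.Parameter(torch.randn(in_f, rank) * 0.1)

    def forward(self, x):
        return x @ (self.V * self.s) @ self.U.t()

    def rank_penalty(self):
        return self.s.abs().sum()  # encourages a low effective rank

    @torch.no_grad()
    def prune(self, tol=1e-3):
        keep = self.s.abs() > tol  # drop near-zero rank components
        self.U = nn.Parameter(self.U[:, keep])
        self.s = nn.Parameter(self.s[keep])
        self.V = nn.Parameter(self.V[:, keep])

layer = LowRankLinear(256, 256, rank=64)
loss = layer(torch.randn(8, 256)).pow(2).mean() + 1e-3 * layer.rank_penalty()
loss.backward()  # gates receive gradient; prune() shrinks the rank later
```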
Publication Date
6-2024
Document Type
Dissertation
Student Type
Graduate
Degree Name
Imaging Science (Ph.D.)
Department, Program, or Center
Chester F. Carlson Center for Imaging Science
College
College of Science
Advisor
Eli Saber
Advisor/Committee Member
John Kerekes
Advisor/Committee Member
Panos P. Markopoulos
Recommended Citation
Sharma, Manish, "Multimodal Data Fusion and Model Compression Methods for Computer Vision" (2024). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/11879
Campus
RIT – Main Campus