Abstract
The rapid development of artificial intelligence (AI) in computer vision has garnered considerable attention across diverse research fields (e.g., image classification, image segmentation, object detection, optical flow, depth estimation). However, this progress is met with an array of challenges that hinder these models from real-world deployment. In this paper, three major challenges are discussed: I. Low Task Generalizability. Current visual models are often specifically designed for targeted tasks, constraining their generalizability across varying adaptation scenarios; II. Weak Model Interpretability. Within the paradigm of connectionism, most of these models are frequently regarded as "black-box" systems, making them challenging for humans to understand and control; III. High Computational Consumption. As the quest for superior performance persists, there is a prevailing trend toward scaling up visual models, which consequently incurs significant computational costs. Human visual intelligence, on the other hand, provides natural solutions to these challenges. I thus seek to embody and mimic the capabilities of human visual intelligence. Consequently, three primary contributions are presented in this paper in response to these challenges.

Universal Visual Learner. While current computer vision techniques provide specialized solutions for different vision tasks (e.g., optical flow, depth estimation), humans understand and explore the world through complex visual stimuli, unbound by task-specific constraints. To bridge this gap, I propose the Prototypical Transformer (ProtoFormer), a general and unified framework that addresses various motion tasks from a prototype-based perspective. ProtoFormer seamlessly integrates prototype learning with the Transformer architecture by thoughtfully incorporating motion dynamics through two innovative designs. First, Cross-Attention Prototyping identifies prototypes based on distinct motion patterns, enhancing transparency in the interpretation of motion scenes. Second, Latent Synchronization steers feature representation learning via prototypes, effectively reducing motion uncertainty.

Interpretable Visual Intelligence. Given the connectionist nature of current deep neural networks, explaining network behavior has become a critical topic. In light of this view, I first introduce DNC (Deep Nearest Centroids), a rejuvenation of the classic Nearest Centroids classifier, envisioned for large-scale visual recognition. In contrast to conventional deep models, which often overlook latent data structures, DNC employs a non-parametric, case-based reasoning approach. By utilizing sub-centroids of training samples to represent class distributions, DNC classifies by measuring the proximity of test data to these sub-centroids within the feature space. This distance-based approach provides unparalleled flexibility, allowing complete knowledge transfer across diverse recognition tasks. Moreover, DNC's inherent simplicity, combined with its intuitive decision-making process, ensures explainability when the sub-centroids are actual training images.
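To make the distance-based decision rule concrete, the following is a minimal, illustrative sketch of nearest-sub-centroid classification (my own simplification, not code from the dissertation): each class's training features are clustered into a few sub-centroids with plain k-means, and a query is assigned the class of its nearest sub-centroid. The feature dimensionality, the number of sub-centroids, and all function names here are assumptions.

```python
# Illustrative nearest-sub-centroid classifier (a simplified, DNC-style rule).
# Assumptions: features are precomputed vectors; sub-centroids come from
# ordinary k-means; the names below (fit_sub_centroids, n_sub) are hypothetical.
import numpy as np
from sklearn.cluster import KMeans


def fit_sub_centroids(features, labels, n_sub=4, seed=0):
    """Cluster each class's training features into at most `n_sub` sub-centroids."""
    centroids, centroid_labels = [], []
    for c in np.unique(labels):
        class_feats = features[labels == c]
        k = min(n_sub, len(class_feats))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(class_feats)
        centroids.append(km.cluster_centers_)
        centroid_labels.extend([c] * k)
    return np.vstack(centroids), np.array(centroid_labels)


def classify(query_feats, centroids, centroid_labels):
    """Assign each query the class of its nearest sub-centroid (L2 distance)."""
    dists = np.linalg.norm(query_feats[:, None, :] - centroids[None, :, :], axis=-1)
    return centroid_labels[dists.argmin(axis=1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two toy classes, shifted along different feature axes.
    train_x = rng.normal(size=(200, 16)) + np.repeat(np.eye(16)[:2] * 5.0, 100, axis=0)
    train_y = np.repeat([0, 1], 100)
    centroids, centroid_labels = fit_sub_centroids(train_x, train_y)
    queries = rng.normal(size=(3, 16)) + np.eye(16)[0] * 5.0
    print(classify(queries, centroids, centroid_labels))  # expected: mostly class 0
```

Because the decision reduces to distances against stored sub-centroids, changing the label set only requires re-clustering features, which is what makes this style of classifier flexible for knowledge transfer and, when sub-centroids are tied to actual training images, inspectable.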
Another promising direction for interpretable visual intelligence is to provide explicit symbols at each programming stage, enabling users to intuitively interpret and modify results. Recognizing that current approaches to image-to-image translation are generally not explainable, I propose a novel Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. The proposed DVP seamlessly integrates a condition-flexible diffusion model within the GPT architecture, orchestrating a coherent sequence of visual programs (i.e., computer vision models) for various pro-symbolic tasks, such as RoI identification, style transfer, and position manipulation. This integration facilitates transparent and controllable image translation processes.

Carbon-Efficient Visual Intelligence System. Throughout this research, heavy training burdens arise with high frequency. The human visual system, by contrast, accomplishes vision tasks efficiently and effectively at low energy cost; this observation inspires me to investigate parameter-efficient training of networks. Transformer-based models are now the prevailing choice for vision-related tasks; however, their sizes continue to grow, and fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. While parameter-efficient learning emerges as a solution, it often lags behind full fine-tuning in performance. To address this challenge, I introduce an Effective and Efficient Visual Prompt Tuning (E2VPT) mechanism. E2VPT incorporates learnable key-value prompts, enhancing the model's fine-tuning efficiency. Furthermore, a strategic prompt pruning approach maintains performance while significantly reducing parameters.

Another promising avenue toward a carbon-efficient visual intelligence system is knowledge distillation. I focus on transformer-based architectures, since they are the de facto standard for diverse vision tasks owing to their superior performance. As model sizes, especially those of transformer-based models, continue to scale up, model distillation becomes increasingly important in real-world applications, particularly on devices limited by computational resources (e.g., edge devices). However, prevailing knowledge distillation methods exhibit diminished efficacy when confronted with a large capacity gap between the teacher and the student, e.g., a 10x compression rate. I thus present Automatic Multi-step Distillation (AMD) for large-scale vision model compression. In particular, the distillation process unfolds across multiple steps: initially, the teacher is distilled into an intermediate teacher-assistant model, which is subsequently distilled further into the student. An efficient and effective optimization framework automatically identifies the optimal teacher-assistant that leads to the maximal student performance.

To sum up, by drawing inspiration from the innate capabilities of human visual intelligence, this research underscores the necessity of fostering models that are not just proficient but also versatile, interpretable, and parameter-efficient. My endeavors, ranging from the development of universal visual learners to carving paths toward carbon-efficient AI systems, manifest my commitment to driving AI research that resonates with real-world intricacies. It is my fervent hope that the foundations laid in this paper serve as a promising avenue for the AI community to continually strive for models that seamlessly bridge the gap between machine efficiency and human intuitiveness.
Library of Congress Subject Headings
Computer vision; Artificial intelligence; Neural networks (Computer science); Deep learning (Machine learning)
Publication Date
7-2-2024
Document Type
Dissertation
Student Type
Graduate
Degree Name
Imaging Science (Ph.D.)
Department, Program, or Center
Chester F. Carlson Center for Imaging Science
College
College of Science
Advisor
Dongfang Liu
Advisor/Committee Member
Ying Nian Wu
Advisor/Committee Member
Qi Yu
Recommended Citation
Han, Cheng, "Towards Human-Embodied Visual Intelligence" (2024). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/11819
Campus
RIT – Main Campus
Plan Codes
IMGS-PHD