Abstract
With deep learning, computer vision now rivals humans at object recognition and detection, opening the door to new challenges in image understanding. Among these challenges, understanding and reasoning about language-grounded visual content is of fundamental importance to advancing artificial intelligence. Recently, multiple datasets and algorithms have been created as proxy tasks toward this goal, with visual question answering (VQA) being the most widely studied: given an image and a natural language question about it, an algorithm must produce an answer. However, our survey of VQA datasets and algorithms uncovered several sources of dataset bias and sub-optimal evaluation metrics that allowed algorithms to perform well by merely exploiting superficial statistical patterns. In this dissertation, we describe new algorithms and datasets that address these issues. We developed two new datasets and evaluation metrics that enable more accurate measurement of a VQA model's abilities, and that expand VQA to include new abilities such as reading text, handling out-of-vocabulary words, and understanding data visualizations. We also created new algorithms that have advanced the state of the art in VQA, including one that surpasses human performance on two chart question answering datasets covering bar charts, line graphs, and pie charts. Finally, we provide a holistic overview of several unsolved challenges, not only in VQA but in vision-and-language research at large. Despite enormous progress, we find that robust understanding and integration of vision and language remains an elusive goal, and that much of the reported progress may be misleading due to dataset bias, superficial correlations, and flaws in standard evaluation metrics. We carefully study and categorize these issues across several vision-and-language tasks and outline possible paths toward the development of safe, robust, and trustworthy AI for language-grounded visual understanding.
Library of Congress Subject Headings
Computer vision; Machine learning; Optical pattern recognition; Computer algorithms--Evaluation; Natural language processing (Computer science); Semantic computing
Publication Date
2-24-2020
Document Type
Dissertation
Student Type
Graduate
Degree Name
Imaging Science (Ph.D.)
Department, Program, or Center
Chester F. Carlson Center for Imaging Science (COS)
Advisor
Christopher Kanan
Advisor/Committee Member
Matt Huenerfauth
Advisor/Committee Member
Nathan D. Cahill
Recommended Citation
Kafle, Kushal, "Advancing Multi-Modal Deep Learning: Towards Language-Grounded Visual Understanding" (2020). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/10357
Campus
RIT – Main Campus
Plan Codes
IMGS-PHD