Deep learning has enabled great advances in the field of natural language processing, computer vision and pattern recognition in general. Deep learning frameworks have been very successful in performing classification, object detection, segmentation and translation. Before objects can be processed, a vector representation of that object needs to be created. For example, sentences and images can be encoded with a sent2vec and image2vec function respectively in preparation for input to a machine learning framework. Neural networks are able to learn efficient vector representation of images, text, audio, videos and 3D point clouds. However, the transfer of knowledge from one modality to the other is a challenging task. In this work, we develop vector spaces that can handle data that belongs to multiple modalities at the same time. In these spaces, similar objects are tightly clustered and dissimilar objects are far away irrespective of their modality. Such a vector space can be used in retrieval of objects, searching and generation tasks. For example, given a picture of a person surfing, one can retrieve sentences or audio bites of a person surfing. We build a Multi-stage Common Vector Space (M-CVS) and Reference Vector Space (RVS) that can handle images, text, audios, videos and 3D point cloud data. Both, the M-CVS and RVS can handle the addition of a new modality without having to change the existing transforms or architecture. Our model is evaluated by performing cross modal retrieval on multiple benchmark datasets.

Library of Congress Subject Headings

Vector spaces; Machine learning; Neural networks (Computer science)

Publication Date


Document Type


Student Type


Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)


Raymond Ptucha

Advisor/Committee Member

Alexander Loui

Advisor/Committee Member

Qi Yu


RIT – Main Campus

Plan Codes