Recent research in computer vision and machine learning has demonstrated strong capabilities in detecting and recognizing objects in natural images. State-of-the-art results in the ImageNet challenges for object detection, classification, and localization report a top-5 classification error of 3.08% on the validation set, while similar classification experiments run by trained humans report an error of 5.1%. While some might argue that human accuracy is a function of training time, it can be said with confidence that automated classification models are at least as good as trained humans at classification problems. The ability of these models to analyze and describe complex images, however, is still an active area of research.
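
For readers unfamiliar with the top-5 metric referenced above, the following is a minimal sketch of how it is typically computed: a prediction counts as correct when the true label is among the five highest-scoring classes. The function name `top_k_error` and the toy scores are illustrative assumptions, not taken from the thesis.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy data: 3 samples, 6 classes (values are made up for illustration).
scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],
    [0.30, 0.25, 0.20, 0.15, 0.09, 0.01],
    [0.02, 0.03, 0.05, 0.10, 0.30, 0.50],
]
labels = [2, 5, 3]

top5 = top_k_error(scores, labels, k=5)  # only the second sample misses: 1/3
top1 = top_k_error(scores, labels, k=1)  # second and third samples miss: 2/3
```

On this toy data the top-5 error is lower than the top-1 error, because the third sample's true label is not the top prediction but does fall within the five best guesses.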

Image description is a good starting point for imparting artificial intelligence to machines by allowing them to analyze and describe complex visual scenes. This thesis introduces a generic, end-to-end trainable Fusion-based Recurrent Multi-Modal (FRMM) architecture to address multi-modal applications. FRMM allows each input modality to be independent in terms of architecture, parameters, and length of input sequences. FRMM image description models seamlessly blend convolutional neural network feature descriptors with sequential language data in a recurrent framework. In addition to introducing FRMMs, this work analyzes the impact of varying activation functions and vocabulary size. The models are trained and tested on the Flickr8k, Flickr30K, and MSCOCO datasets, demonstrating state-of-the-art description results.
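
The abstract does not specify the recurrent cell or the fusion operator, so the following is only a rough sketch of the general idea: each modality is encoded by its own recurrent branch, with its own parameters, input size, and sequence length, and the resulting states are then fused to score the next word. The plain tanh recurrence, fusion by concatenation, and all names (`RecurrentBranch`, `fuse_and_score`) and dimensions are assumptions made for illustration, not the FRMM design itself; the weights are random and untrained.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class RecurrentBranch:
    """Plain tanh RNN (assumed cell); each modality gets its own instance."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.W_in = rand_matrix(hidden_dim, input_dim, rng)
        self.W_rec = rand_matrix(hidden_dim, hidden_dim, rng)
        self.hidden_dim = hidden_dim

    def encode(self, sequence):
        """Run over a sequence of any length and return the final hidden state."""
        h = [0.0] * self.hidden_dim
        for x in sequence:
            a = matvec(self.W_in, x)
            b = matvec(self.W_rec, h)
            h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        return h


def fuse_and_score(h_image, h_text, W_out):
    """Fuse by concatenation (an assumption), then softmax over the vocabulary."""
    fused = h_image + h_text
    logits = matvec(W_out, fused)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
image_branch = RecurrentBranch(input_dim=8, hidden_dim=4, rng=rng)
text_branch = RecurrentBranch(input_dim=5, hidden_dim=3, rng=rng)
vocab_size = 6
W_out = rand_matrix(vocab_size, 4 + 3, rng)

# Stand-ins for real inputs: one CNN feature vector and three word embeddings.
image_seq = [[rng.random() for _ in range(8)]]
text_seq = [[rng.random() for _ in range(5)] for _ in range(3)]

probs = fuse_and_score(image_branch.encode(image_seq),
                       text_branch.encode(text_seq), W_out)
```

Here the image branch sees a sequence of length one while the text branch sees three steps, which illustrates how the two modalities can differ in input size, parameters, and sequence length before their states are combined.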

Library of Congress Subject Headings

Deep learning (Machine learning); Neural networks (Computer science); Computer vision; Image processing--Digital techniques

Publication Date


Document Type


Student Type


Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)


Raymond Ptucha

Advisor/Committee Member

Andreas Savakis

Advisor/Committee Member

Christopher Kanan


Physical copy available from RIT's Wallace Library at Q325.5 .O78 2016


RIT – Main Campus