Teaching computers to recognize people and objects from visual cues in images and videos is an interesting challenge. The computer vision and pattern recognition communities have already demonstrated that intelligent algorithms can detect and classify objects under difficult conditions such as varying pose, occlusion, and poor image fidelity. Recent deep learning approaches in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) are built on very large and deep convolutional neural network architectures. In 2015, such architectures surpassed human performance in top-5 validation accuracy on the ImageNet dataset (94.9% human vs. 95.06% machine), and earlier this year deep learning approaches demonstrated a remarkable 96.43% accuracy. These successes have been made possible by deep architectures such as VGG, GoogLeNet, and most recently deep residual models with as many as 152 weight layers. Training these deep models is difficult because learning millions of parameters is compute intensive. Because so many parameters are unavoidable, very small 3x3 filters are used in convolutional layers to reduce the parameter count in very deep networks. On the other hand, deep networks generalize well to other datasets and perform well even on complex datasets with fewer features or images.
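The parameter saving from small filters can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the thesis: the helper names and the channel width of 64 are assumptions. It compares three stacked 3x3 convolutional layers, which together cover a 7x7 receptive field at stride 1, against a single 7x7 layer with the same number of input and output channels.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weight count of one square convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_params(kernel_size: int, num_layers: int, channels: int) -> int:
    """Weight count of several identical layers with equal in/out channels."""
    return num_layers * conv_params(kernel_size, channels, channels)


# Assumed channel width, chosen only for illustration.
channels = 64

# Three stacked 3x3 layers see a 7x7 window of the input at stride 1.
three_3x3 = stacked_params(3, 3, channels)

# A single 7x7 layer covers the same window in one step.
one_7x7 = conv_params(7, channels, channels)

print(three_3x3, one_7x7)
```

With equal channel counts C, the stack costs 27C^2 weights against 49C^2 for the single large filter, so the ratio does not depend on the assumed width.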

This thesis proposes a robust approach for large scale visual recognition by introducing a framework that automatically analyzes the similarity between different classes in the dataset and configures a family of smaller networks that replace a single larger network. Classes that are similar are grouped together and are learnt by a smaller network. This allows one to divide and conquer the large classification problem by identifying the class category from its coarse label to its fine label, deploying two or more stages of networks. In this way the proposed framework learns the natural hierarchy and effectively uses it for the classification problem. A comprehensive analysis of the proposed methods shows that hierarchical models outperform traditional models in accuracy, require fewer computations, and contribute to expanding the ability to learn large scale visual information effectively.
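The coarse-to-fine idea can be sketched in a few lines. The abstract does not say how class similarity is measured or how groups are formed, so everything below is an assumption made for illustration: similarity is taken from a symmetrized confusion matrix, classes are linked when that similarity exceeds a threshold, and the function names, the toy matrix, and the threshold value are all hypothetical.

```python
import numpy as np


def group_classes(confusion: np.ndarray, threshold: float) -> list:
    """Give each class a coarse group id.

    Two classes are linked when their symmetrized confusion rate exceeds
    `threshold`; groups are the connected components of those links.
    """
    n = confusion.shape[0]
    # Convert counts to per-class confusion rates (each row sums to 1).
    rates = confusion / confusion.sum(axis=1, keepdims=True)
    similarity = (rates + rates.T) / 2.0
    np.fill_diagonal(similarity, 0.0)

    group = [-1] * n
    next_id = 0
    for start in range(n):
        if group[start] != -1:
            continue
        # Depth-first search over classes linked to `start`.
        group[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if group[j] == -1 and similarity[i, j] > threshold:
                    group[j] = next_id
                    stack.append(j)
        next_id += 1
    return group


def predict_coarse_to_fine(x, coarse_model, fine_models):
    """Stage 1 picks a group; stage 2 picks the fine label within it."""
    group_id = coarse_model(x)
    return fine_models[group_id](x)


# Toy confusion matrix for four hypothetical classes:
# rows are true labels, columns are predictions (cat, dog, car, truck).
confusion = np.array([
    [80, 18, 1, 1],
    [20, 78, 1, 1],
    [1, 1, 85, 13],
    [2, 0, 15, 83],
])

groups = group_classes(confusion, threshold=0.1)
print(groups)
```

In this toy case the two animal classes fall into one group and the two vehicle classes into another, so a small coarse network would only need to separate the two groups and each fine network would only need to separate two classes.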

Library of Congress Subject Headings

Computer vision; Optical pattern recognition; Machine learning; Neural networks (Computer science)

Publication Date


Document Type


Student Type


Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)


Raymond Ptucha

Advisor/Committee Member

Christopher Kanan

Advisor/Committee Member

Dhireesha Kudithipudi


Physical copy available from RIT's Wallace Library at TA1634 .C43 2016


RIT – Main Campus

Plan Codes