Pose estimation is an important and challenging task in computer vision. Hand pose estimation has drawn increasing attention during the past decade and has been used in a wide range of applications, including augmented reality, virtual reality, human-computer interaction, and action recognition. Hand pose estimation is more challenging than general human body pose estimation because of the large number of degrees of freedom and the frequent occlusion of joints. To address these challenges, we propose HandyPose, a single-pass, end-to-end trainable architecture for hand pose estimation. Adopting an encoder-decoder framework with multi-level features, our method achieves high accuracy in hand pose estimation while keeping the network's size complexity manageable and preserving its modularity. HandyPose takes a multi-scale approach to representing context by incorporating spatial information at various levels of the network to mitigate the loss of resolution due to pooling. Our multi-level waterfall architecture leverages the efficiency of progressive cascade filtering while maintaining larger fields-of-view by concatenating features from different levels of the network in the waterfall module. The decoder incorporates both the waterfall and multi-scale features to generate accurate joint heatmaps in a single stage.

Recent developments in computer vision and deep learning have brought significant progress in human pose estimation, but little of this work has been applied to vehicle pose. We also propose VehiPose, an efficient architecture for vehicle pose estimation based on a multi-scale deep learning approach that achieves high accuracy while maintaining manageable network complexity and modularity. The VehiPose architecture combines an encoder-decoder structure with a waterfall atrous convolution module for multi-scale feature representation. It incorporates contextual information across scales and localizes vehicle keypoints in an end-to-end trainable network.

HandyPose uses VehiPose as its baseline and improves performance by incorporating multi-level features from different levels of the backbone and introducing novel multi-level modules. Both architectures leverage image context more thoroughly and address the loss of spatial resolution caused by successive pooling, while maintaining the size complexity and modularity of the network and preserving spatial information at various levels of the network. Our results demonstrate state-of-the-art performance on popular datasets and show that HandyPose and VehiPose are robust and efficient architectures for hand and vehicle pose estimation.
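
The waterfall idea described in the abstract (progressive cascade filtering, with the outputs of all stages kept and concatenated) can be illustrated with a minimal sketch. This is not the thesis implementation: it works on a 1D signal in plain Python, and the function names, the kernel of ones, and the dilation rates (1, 2, 4) are all assumptions chosen only for illustration.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D atrous (dilated) convolution."""
    k = len(kernel)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Kernel taps are spaced `dilation` samples apart.
            idx = i + (j - k // 2) * dilation
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def waterfall(signal, kernel, dilations):
    """Cascade of atrous branches (hypothetical helper).

    Each branch filters the output of the previous branch, and every
    branch output is kept so that all of them can be concatenated.
    """
    branches = []
    current = signal
    for d in dilations:
        current = dilated_conv1d(current, kernel, d)
        branches.append(current)
    return branches  # stacking these corresponds to channel concatenation


def receptive_fields(kernel_size, dilations):
    """Receptive field of each branch in the cascade."""
    rf = 1
    fields = []
    for d in dilations:
        rf += (kernel_size - 1) * d
        fields.append(rf)
    return fields


def support(values):
    """Width of the nonzero region of a branch output."""
    nonzero = [i for i, v in enumerate(values) if v != 0]
    return nonzero[-1] - nonzero[0] + 1


# Assumed toy input: a single impulse in the middle of a 101-sample signal.
impulse = [0.0] * 101
impulse[50] = 1.0
outputs = waterfall(impulse, [1.0, 1.0, 1.0], [1, 2, 4])
print([support(b) for b in outputs])    # [3, 7, 15]
print(receptive_fields(3, [1, 2, 4]))   # [3, 7, 15]
```

Because each branch reuses the previous branch's output, the field of view grows with every stage without any pooling, and the concatenated result retains both the narrow and the wide views. In a real network the same pattern would presumably be built from 2D atrous convolution layers with learned weights.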

Library of Congress Subject Headings

Computer vision; Gesture recognition (Computer science); Pattern recognition systems; Neural networks (Computer science)

Publication Date


Document Type


Student Type


Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)


Andreas Savakis

Advisor/Committee Member

Alexander Loui

Advisor/Committee Member

Matthew Dye


RIT – Main Campus

Plan Codes