Video understanding has become increasingly important as surveillance, social, and informational videos weave themselves into our everyday lives. Video captioning offers a simple way to summarize, index, and search the data. Most video captioning models utilize a video encoder and captioning decoder framework. Hierarchical encoders can abstractly capture clip level temporal features to represent a video, but the clips are at fixed time steps. This thesis research introduces two models: a hierarchical model with steered captioning, and a Multi-stream Hierarchical Boundary model. The steered captioning model is the first attention model to smartly guide an attention model to appropriate locations in a video by using visual attributes. The Multi-stream Hierarchical Boundary model combines a fixed hierarchy recurrent architecture with a soft hierarchy layer by using intrinsic feature boundary cuts within a video to define clips. This thesis also introduces a novel parametric Gaussian attention which removes the restriction of soft attention techniques which require fixed length video streams. By carefully incorporating Gaussian attention in designated layers, the proposed models demonstrate state-of-the-art video captioning results on recent datasets.

Library of Congress Subject Headings

Neural networks (Computer science); Video recordings for the hearing impaired--Data processing; Video recordings--Data processing

Publication Date


Document Type


Student Type


Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)


Raymond Ptucha

Advisor/Committee Member

Nathan Cahill

Advisor/Committee Member

Dhireesha Kudithipudi


Physical copy available from RIT's Wallace Library at QA76.87 .N48 2017


RIT – Main Campus

Plan Codes