Abstract
Video understanding has become increasingly important as surveillance, social, and informational videos weave themselves into our everyday lives. Video captioning offers a simple way to summarize, index, and search this data. Most video captioning models use an encoder-decoder framework: a video encoder followed by a captioning decoder. Hierarchical encoders can abstractly capture clip-level temporal features to represent a video, but the clips are defined at fixed time steps. This thesis introduces two models: a hierarchical model with steered captioning, and a Multi-stream Hierarchical Boundary model. The steered captioning model is the first to use visual attributes to guide attention to appropriate locations in a video. The Multi-stream Hierarchical Boundary model combines a fixed-hierarchy recurrent architecture with a soft hierarchy layer, using intrinsic feature boundary cuts within a video to define clips. This thesis also introduces a novel parametric Gaussian attention, which removes the restriction of soft attention techniques that require fixed-length video streams. By carefully incorporating Gaussian attention in designated layers, the proposed models demonstrate state-of-the-art video captioning results on recent datasets.
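The sketch below is a minimal NumPy illustration of how a parametric Gaussian attention can weight frames of a variable-length video: instead of learning a separate softmax weight per frame (which ties the model to a fixed stream length), a center mu and width sigma parameterize a Gaussian over a normalized time axis. The function name, the normalized-time formulation, and the parameter names are assumptions made for illustration, not the thesis's exact formulation; in a full model mu and sigma would typically be predicted by the decoder at each caption step.

```python
import numpy as np

def gaussian_attention(features, mu, sigma):
    """Weight a variable-length sequence of frame features with a
    parametric Gaussian over normalized time.

    features : (T, D) array of per-frame feature vectors, any T
    mu       : Gaussian center in [0, 1] (normalized video position)
    sigma    : Gaussian width in normalized time units
    Returns a single (D,) context vector.
    """
    T = features.shape[0]
    # Normalized frame positions in [0, 1]; independent of video length.
    t = np.linspace(0.0, 1.0, T)
    # Unnormalized Gaussian weights centered at mu.
    w = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    w /= w.sum()  # normalize to a proper attention distribution
    return w @ features  # weighted sum over frames

# The same (mu, sigma) applies to videos of different lengths,
# which is what lifts the fixed-length restriction of soft attention.
short_clip = np.random.randn(40, 512)   # 40 frames, 512-d features
long_clip = np.random.randn(300, 512)   # 300 frames, same feature size
ctx_a = gaussian_attention(short_clip, mu=0.5, sigma=0.1)
ctx_b = gaussian_attention(long_clip, mu=0.5, sigma=0.1)
```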
Library of Congress Subject Headings
Neural networks (Computer science); Video recordings for the hearing impaired--Data processing; Video recordings--Data processing
Publication Date
4-2017
Document Type
Thesis
Student Type
Graduate
Degree Name
Computer Engineering (MS)
Department, Program, or Center
Computer Engineering (KGCOE)
Advisor
Raymond Ptucha
Advisor/Committee Member
Nathan Cahill
Advisor/Committee Member
Dhireesha Kudithipudi
Recommended Citation
Nguyen, Thang Huy, "Automatic Video Captioning using Deep Neural Network" (2017). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/9516
Campus
RIT – Main Campus
Plan Codes
CMPE-MS
Comments
Physical copy available from RIT's Wallace Library at QA76.87 .N48 2017