The optical flow task is an active research domain in the machine learning community with numerous downstream applications such as autonomous driving and action recognition. Convolutional neural networks have long served as the architectural basis for these models. However, given the recent popularity of self-attention architectures, many optical flow models now utilize a vision transformer backbone. Despite these differing backbones, consistent design elements, especially auxiliary operations, appear across models. Perhaps most apparent is the calculation and use of the cost volume. While prior work has thoroughly documented the effects of the cost volume on models with a convolutional neural network backbone, similar research does not exist for optical flow models based on other architectures, such as the vision transformer. A natural research question arises: what are the effects of utilizing a cost volume in vision transformer-based optical flow models? In this thesis, a series of experiments examines the impact of the cost volume on training time, model accuracy, average inference time, and model size. The observed results show that cost volume use increases model size and training time while improving model accuracy and reducing average inference time. These results differ from those reported for convolutional neural network-based models. With these results, researchers can weigh the benefits and drawbacks of cost volume use in vision transformer-based optical flow models.
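To make the central object of study concrete: a common form of the cost volume (e.g., the all-pairs correlation volume popularized by RAFT-style models) stores the feature similarity between every pixel in one frame and every pixel in the next. The sketch below is an illustrative NumPy implementation, not the specific construction used in this thesis; the function name, shapes, and scaling are assumptions for demonstration.

```python
import numpy as np

def cost_volume(feat1, feat2):
    """All-pairs cost volume between two feature maps.

    feat1, feat2: (H, W, C) per-pixel feature maps extracted from
    consecutive frames by a backbone (CNN or vision transformer).
    Returns a (H, W, H, W) volume where entry [i, j, k, l] is the
    scaled dot-product similarity between pixel (i, j) of frame 1
    and pixel (k, l) of frame 2.
    """
    H, W, C = feat1.shape
    f1 = feat1.reshape(H * W, C)
    f2 = feat2.reshape(H * W, C)
    # Scaled dot-product similarity between every pair of pixels.
    corr = (f1 @ f2.T) / np.sqrt(C)
    return corr.reshape(H, W, H, W)

# Tiny usage example on random features.
rng = np.random.default_rng(0)
f = rng.standard_normal((4, 4, 8)).astype(np.float32)
vol = cost_volume(f, f)
print(vol.shape)  # (4, 4, 4, 4)
```

The quartic memory footprint of this (H, W, H, W) tensor is one reason the cost volume measurably affects model size and training time, as examined in the experiments.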

Library of Congress Subject Headings

Computer vision; Motion perception (Vision); Machine learning

Publication Date


Document Type


Student Type


Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)


Advisor

Dongfang Liu

Advisor/Committee Member

Andreas Savakis

Advisor/Committee Member

Cory Merkel


Campus

RIT – Main Campus

Plan Codes