The demand to deploy deep learning models on edge devices has recently increased due to their pervasiveness, in applications ranging from healthcare to precision agriculture. However, a major challenge with current deep learning models, is their computational complexity. One approach to address this limitation is to compress the deep learning models by employing low-precision numerical formats. Such low-precision models often suffer from degraded inference or training accuracy. This lends itself to the question, which low-precision numerical format can meet the objective of high training accuracy with minimal resources? This research introduces tapered-precision numerical formats for deep learning inference and training. These formats have inherent capability to match the distribution of deep learning parameters by expressing values in unequal-magnitude spacing such that the density of values is maximum near zero and is tapered towards the maximum representable number. We develop low-precision arithmetic frameworks, that utilize tapered precision numerical formats to enhance the performance of deep learning inference and training. Further, we develop a software/hardware co-design framework to identify the right format for inference based on user-defined constraints through integer linear programming optimization. Third, novel adaptive low-precision algorithms are proposed that match the tapered-precision numerical format configuration to best represent the layerwise dynamic range and distribution of parameters within a deep learning model. Finally, a numerical analysis approach and signal-to-quantization-noise ratio equation for tapered-precision numerical formats are proposed that uses a metric to select the appropriate numerical format configuration. The efficacy of the proposed approaches is demonstrated on various benchmarks. Results assert that the accuracy and hardware cost trade-off of low-precision deep neural networks using tapered precision numerical formats outperform other well-known numerical formats, including floating point and fixed-point.

Publication Date


Document Type


Student Type


Degree Name

Electrical and Computer Engineering (Ph.D)

Department, Program, or Center

Electrical and Computer Engineering Technology


Kate Gleason College of Engineering


Dhireesha Kudithipudi

Advisor/Committee Member

Andres Kwasinski

Advisor/Committee Member

Majid Rabbani


RIT – Main Campus