Processing speeds in conventional von Neumann architectures are severely limited by memory access speeds. Read and write speeds of main memory have not scaled at the same rate as logic circuits. In addition, the large physical distance spanned by the interconnect between the processor and the memory incurs a large RC delay and power penalty, often a hundred times more than on-chip interconnects. As a result, accessing data from memory becomes a bottleneck in the overall performance of the processor. Operations such as matrix multiplication, which are used extensively in modern applications such as solving systems of equations, Convolutional Neural Networks, and image recognition, require large volumes of data to be processed. These operations are affected the most by this bottleneck.

Processing-in-Memory (PIM) is designed to overcome this bottleneck by performing repeated data-intensive operations on the same die as the memory. Doing so avoids the large delay and power penalties caused by data transfers between the processor and the memory. PIM architectures are often designed as small, simple, and efficient processing blocks so that they can be integrated into each block of the memory. This allows extreme parallelism, which makes PIM ideal for big data processing. A drawback of this design paradigm, however, is the limited flexibility in the operations that can be performed. Most PIM architectures are designed to perform application-specific functions, limiting their widespread use.

A novel PIM architecture is proposed which allows arbitrary functions to be implemented with a high degree of parallelism. The architecture is based on PIM cores, each capable of performing an arbitrary function on two 4-bit inputs. Nine PIM cores are connected together to allow more advanced functions, such as an 8-bit Multiply-Accumulate function, to be implemented. Wireless interconnects are utilized in the design to aid communication between clusters. The architecture is applied to matrix multiplication on dense and sparse matrices of 8-bit values, which are prevalent in image and video formats. An analytical model is proposed to evaluate the area, power, and timing of the PIM architecture for both dense and sparse matrices. A real-world performance evaluation is also conducted by applying the models to image/video data at a standard resolution to examine the timing and power consumption of the system. The results are compared against CPU and GPU results to evaluate the architecture against traditional implementations. The proposed architecture was found to have an execution time similar to that of a GPU implementation while requiring significantly less power.
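As an illustration only, the idea of building an 8-bit Multiply-Accumulate out of functions on two 4-bit inputs can be sketched in software. The sketch below assumes that a core behaves like a 256-entry lookup table with an output of up to 8 bits, and it performs the shifts and additions with ordinary integer arithmetic. The names `make_core`, `mul4`, `mul8`, `mac8`, and `dot` are hypothetical, and the mapping of work onto the nine cores is not specified in this abstract, so this is not the thesis design.

```python
def make_core(fn):
    """Model one core as a lookup table over two 4-bit inputs (assumption)."""
    table = [fn(a, b) for a in range(16) for b in range(16)]

    def core(a, b):
        # Index the 256-entry table with the two 4-bit operands.
        return table[(a << 4) | b]

    return core


# A core programmed as a 4-bit x 4-bit multiplier (result fits in 8 bits).
mul4 = make_core(lambda a, b: a * b)


def mul8(x, y):
    """8-bit multiply assembled from four 4-bit partial products."""
    xh, xl = x >> 4, x & 0xF
    yh, yl = y >> 4, y & 0xF
    return (
        (mul4(xh, yh) << 8)
        + ((mul4(xh, yl) + mul4(xl, yh)) << 4)
        + mul4(xl, yl)
    )


def mac8(acc, x, y):
    """Multiply-accumulate: add the product of two 8-bit values to acc."""
    return acc + mul8(x, y)


def dot(row, col):
    """One output element of a matrix product, computed with repeated MACs."""
    acc = 0
    for x, y in zip(row, col):
        acc = mac8(acc, x, y)
    return acc
```

For example, `dot([1, 2, 3], [4, 5, 6])` accumulates three 8-bit products to form one element of a matrix product. In the sketch the additions are done by the host language, whereas a hardware design would have to assign them to cores as well.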

Library of Congress Subject Headings

Computer architecture--Design; Computer storage devices; High performance processors

Publication Date


Document Type


Student Type


Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)


Amlan Ganguly

Advisor/Committee Member

Cory Merkel

Advisor/Committee Member

Mark Indovina


RIT – Main Campus

Plan Codes