Systems and methods for reducing the bandwidth needed to read the inputs
to a matrix multiply operation may improve system performance. Rather
than reading a row of a first input matrix and a column of a second input
matrix to produce an element of a product matrix, a column of the first
input matrix and a single element of the second input matrix are read to
produce a column of partial dot products of the product matrix.
Therefore, the number of input matrix elements read to produce each
product matrix element is reduced from 2N to N+1, where N is the number
of elements in a column of the product matrix.
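
The access pattern described above may be illustrated by the following minimal sketch, which is not taken from the source. The function name matmul_column_broadcast and the list-of-lists matrix representation are assumptions made for illustration only. Each inner step reads one column of the first input matrix and a single element of the second input matrix, then accumulates a column of partial dot products into the product matrix.

```python
def matmul_column_broadcast(A, B):
    """Multiply A (n x inner) by B (inner x m) using column-wise partial dot products.

    Hypothetical illustration: names and data layout are assumptions,
    not the method as claimed in the source.
    """
    n, inner, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for j in range(m):                  # one column of the product matrix at a time
        for k in range(inner):
            a_col = [A[i][k] for i in range(n)]  # read one column of A (n elements)
            b_elem = B[k][j]                     # read a single element of B
            for i in range(n):
                # Accumulate one column of partial dot products (n values)
                C[i][j] += a_col[i] * b_elem
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(matmul_column_broadcast(A, B))
```

In this sketch each inner step reads n + 1 input elements (the column a_col and the scalar b_elem) and updates n product matrix elements, whereas a conventional dot product reads a full row and a full column for every product matrix element.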