Exploiting Sparsity, Compression and In-Memory Compute in Designing Data-Intensive Accelerators
Chen, Thomas
2019
Abstract
Enabled by technology scaling, processing parallelism has been continuously increased to meet the demands of large-scale, data-intensive computations. However, the effort to increase processing parallelism is largely hindered by the von Neumann bottleneck. To achieve higher performance, domain-specific computing has become the most promising direction: it employs highly optimized datapaths, simplified control, and efficient dataflow to enable the dense integration of processing elements with optimized memory access. Many domain-specific designs have demonstrated significantly better figures of merit than general-purpose CPUs or GPUs, but the von Neumann bottleneck still limits the maximum achievable performance. Reducing the data transfer cost requires a closer integration of memory and computation, which ultimately leads to the so-called in-memory computing approach. In-memory computing repurposes the memory cell array for multiply-accumulate operations and applies both bit-line and word-line parallelism to realize large matrix computations before memory readout, eliminating the von Neumann bottleneck entirely. However, in-memory computing is inherently analog, and its limited precision and high sensitivity to noise pose major challenges.

This thesis presents two approaches to address the von Neumann bottleneck: 1) reducing the amount of data that needs to be moved, through sparsity and data compression; and 2) designing robust multi-bit in-memory compute to extend its applicability to a wider range of applications. With video input and 3D features, a video processor requires many times more memory and computation than a 2D image processor. In this work, I chose a video sequence inference processor to demonstrate sparsity-oriented optimizations using a quantized all-spiking network, where sparsity can reach the 90% level.
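The activation-compression idea can be sketched in a few lines. This is an illustrative model only (the abstract does not specify the chip's actual compressed format): a highly sparse activation map is stored as a 1-bit occupancy bitmap plus the packed nonzero values, and at roughly 90% sparsity this form is several times smaller than the dense array.

```python
import numpy as np

def compress(acts, bits_per_value=8):
    """Pack a sparse activation array into a 1-bit occupancy bitmap
    plus the nonzero values; return the parts and the size in bits."""
    bitmap = acts != 0
    nonzeros = acts[bitmap]
    size_bits = bitmap.size + nonzeros.size * bits_per_value
    return bitmap, nonzeros, size_bits

def decompress(bitmap, nonzeros):
    """Rebuild the dense activation array from its compressed form."""
    dense = np.zeros(bitmap.shape, dtype=nonzeros.dtype)
    dense[bitmap] = nonzeros
    return dense

rng = np.random.default_rng(0)
acts = rng.integers(1, 256, size=4096).astype(np.uint8)
acts[rng.random(4096) < 0.9] = 0        # roughly 90% of activations are zero

bitmap, nz, comp_bits = compress(acts)
assert np.array_equal(decompress(bitmap, nz), acts)   # lossless round trip
print(f"compression: {acts.size * 8 / comp_bits:.1f}x")
```

The scheme is lossless, so inference results are unchanged; only the storage and data-movement cost shrinks with the sparsity level.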
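For the second approach, a hedged numerical sketch of the residual idea (the names and the Jacobi-style scheme here are illustrative, not the thesis's exact algorithm): even when every matrix-vector product runs on 5-bit quantized operands, as it would in an in-memory MAC array, iterating on the residual r = b - Ax and rescaling it to the full 5-bit range each pass lets the solution converge far beyond 5-bit precision, because the absolute quantization error shrinks along with the residual.

```python
import numpy as np

def quantize(v, bits=5):
    """Symmetric per-array quantization to signed `bits`-bit levels,
    rescaled to the array's own dynamic range (the residual trick)."""
    peak = np.max(np.abs(v))
    if peak == 0:
        return v
    q = 2 ** (bits - 1) - 1
    return np.round(v / peak * q) * (peak / q)

def lowprec_solve(A_q, d_inv, r, sweeps=10):
    """Approximately solve A_q @ e = r with Jacobi sweeps in which every
    matvec sees only 5-bit operands (the role of the MAC array)."""
    r_q = quantize(r)                       # residual rescaled to full range
    e = np.zeros_like(r)
    for _ in range(sweeps):
        e = e + d_inv * (r_q - A_q @ quantize(e))
    return e

rng = np.random.default_rng(1)
n = 64
A = 4.0 * np.eye(n) + rng.normal(0.0, 0.1, (n, n))  # well-conditioned operator
b = rng.normal(size=n)
A_q = quantize(A)                  # matrix held at 5-bit precision, as in SRAM
d_inv = 1.0 / np.diag(A_q)

x = np.zeros(n)
for _ in range(40):                # outer refinement: residual shrinks each pass
    r = b - A @ x
    x = x + lowprec_solve(A_q, d_inv, r)

print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # far below 5-bit error
```

This mirrors classic mixed-precision iterative refinement; the residual transformation in the abstract is in the same spirit, recovering a high-precision PDE solution from low-precision hardware.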
Kernel compression and activation compression further reduce the memory size by 43% and 64%, respectively. High data sparsity and memory compression together yield two orders of magnitude improvement in performance and energy. The design was demonstrated in a 2.53mm² 40nm CMOS chip for video sequence inference that achieved 1.70TOPS with a power dissipation of 135mW at 0.9V and 250MHz. The results show the effective use of sparsity and data compression to loosen the von Neumann bottleneck.

It is well known that in-memory computing is limited in operand and output precision, which restricts it to binary or low-precision applications. Through an algorithmic transformation using a residual approach, I demonstrate that it is possible to map a high-precision partial differential equation (PDE) solver to low-precision 5-bit in-memory compute. To support multi-bit computation, I adopt both width and level modulation of word-line pulses. To reduce the cost and improve the speed of analog-to-digital conversion, I employ a compact array of dual-ramp single-slope (DRSS) ADCs for bit-line readout. These ideas were demonstrated in a 1.87mm² 180nm test chip made of four 320×64 multiply-accumulate (MAC) SRAMs, each supporting 128-way parallel 5b×5b MACs with 32 5-bit output ADCs and consuming 16.6mW at 200MHz. The prototype solved a 127×127 PDE grid at 56.9 GOPS. This SRAM-based in-memory compute design provides over 40× higher compute density than an equivalent ASIC, demonstrating that the von Neumann bottleneck can be removed even for applications that require higher precision. This work shows the importance of algorithm-architecture-circuit co-design in uncovering opportunities to mitigate and remove the von Neumann bottleneck. The design techniques and approaches are applicable to a wide array of applications for improving performance and efficiency.
Subjects
data-intensive accelerators; sparse coding; process in memory; precision optimization; recurrent neural network; partial differential equation
Types
Thesis