Domain-Specific Acceleration: From Efficient Vision Processing Hardware to High-Performance Quantum Computing Software

Zhang, Qirui

2024

View/Open

qiruizh_1.pdf (4MB PDF)

Abstract

With the end of Dennard scaling and the decline of Moore’s law, semiconductor technology advancements no longer deliver ‘free’ performance and efficiency gains. Domain-Specific Acceleration (DSA) is a promising remaining path for further significant improvements. This approach involves designing optimized software and hardware tailored to specific application domains. Successful DSA requires careful consideration of methodologies such as specialization, parallelism exploitation, algorithm-hardware co-design, and balancing efficiency with programmability. To extend the boundaries of DSA, especially for less extensively studied domains, this dissertation studies DSA designs for three application areas: image compression, robotic vision, and Quantum Circuit Simulation (QCS). Though these domains differ, the three designs employ a common methodology of algorithm-hardware co-optimization to reduce data movement between memory and the processing units.

First, this dissertation presents an Ultra-Low-Power (ULP) H.264/Advanced Video Coding (AVC) intra-frame image compression accelerator for event-driven Internet of Things (IoT) imaging systems. The H.264/AVC intra-frame codec is customized to compress arbitrary non-rectangular change-detected regions. Novel algorithm-hardware co-designs optimize the energy and latency of image memory accesses, reducing the overhead of neighbor macroblock accesses by 2.6× with negligible quality loss. Split control for the major processing phases exploits data dependency and pipelining, while data path micro-architecture reconfiguration reduces area and leakage. Fabricated in 40 nm, the accelerator occupies 0.32 mm² with 4 kB of SRAM and consumes only 1.21 μW at 0.6 V and 153 kHz, achieving a compression energy efficiency of 30.9 pJ/pixel. Combined with change detection, this design brings a 133× reduction in the overall energy for egressing images of change-detected regions in an event-driven IoT imaging system.
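As a rough consistency check on the accelerator figures above, the quoted power, clock frequency, and per-pixel energy can be related by simple arithmetic. The sketch below assumes all three numbers describe the same operating point, which the abstract does not state explicitly. The function and variable names are illustrative and are not taken from the dissertation.

```python
def derive_metrics(power_w, clock_hz, energy_per_pixel_j):
    """Relate average power, clock rate, and per-pixel energy.

    Assumption: all three inputs describe the same operating point.
    """
    energy_per_cycle_j = power_w / clock_hz           # joules per clock cycle
    pixels_per_second = power_w / energy_per_pixel_j  # sustained pixel rate
    cycles_per_pixel = clock_hz / pixels_per_second   # clock cycles per pixel
    return energy_per_cycle_j, pixels_per_second, cycles_per_pixel


# Figures quoted in the abstract: 1.21 uW, 153 kHz, 30.9 pJ/pixel.
e_cycle, px_rate, cyc_px = derive_metrics(1.21e-6, 153e3, 30.9e-12)
print(f"{e_cycle * 1e12:.2f} pJ/cycle, {px_rate:.0f} pixels/s, {cyc_px:.2f} cycles/pixel")
```

Under that assumption the numbers work out to roughly 7.9 pJ per clock cycle and about 3.9 clock cycles per compressed pixel.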
Second, this dissertation introduces RoboVisio, an efficient and flexible domain-specific System-on-Chip (SoC) for vision tasks in autonomous micro-robot navigation. A novel hybrid Processing Element (PE) is proposed, combining a 2D-mapping architecture for classic vision tasks with an output-channel-parallel systolic architecture for Convolutional Neural Networks (CNNs). This integration future-proofs the architecture for next-generation CNN-heavy vision algorithms and saves 40% in area and leakage without power or throughput loss compared to separate implementations. Other key features include 2 MB of magnetoresistive random-access memory for non-volatile, fully on-chip weight storage; a unified image-activation memory with block-swapping-based buffering that reduces buffer footprint by 50% and eliminates data copies for multi-frame buffering; and a combination of weight buffering and CNN loop ordering that reduces weight memory system power by 75%. Fabricated in 22 nm, RoboVisio achieves 0.22 nJ/pixel for Harris corner detection and 3.5 TOPS/W (16-bit OP) for CNNs, a 40% to 170% efficiency improvement over state-of-the-art edge machine learning SoCs using non-volatile memory.

Lastly, this dissertation examines the acceleration of QCS, a crucial computational problem for quantum computing development. Predominant approaches center on Tensor Networks (TNs), valued for better concurrency and reduced computation compared to full quantum vectors and matrices. However, even with these advantages, array-based tensors can have significant redundancy. To optimize QCS algorithms for future hardware accelerators, this dissertation presents Fast Tensor Decision Diagram (FTDD), a novel open-source software framework. FTDD leverages Tensor Decision Diagrams (TDDs) to eliminate these overheads and achieve significant speedups.
On average, FTDD delivers a 37× speedup over Google’s TensorNetwork library on redundancy-rich circuits, and 25× and 144× speedups over quantum multi-valued decision diagrams and a prior TDD implementation, respectively, on Google random quantum circuits. FTDD introduces a linear-complexity rank simplification algorithm, Tetris, and edge-centric data structures for recursive TDD operations. Additionally, FTDD explores TN contraction ordering and optimizations borrowed from binary decision diagrams.
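To illustrate why array-based representations can carry redundancy that decision diagrams remove, here is a minimal sketch of a shared decision diagram built over a state vector. It is not FTDD and not a TDD: it omits edge weights, tensor indices, and contraction, and the names `build_dd` and `count_nodes` are hypothetical. It only shows how merging identical sub-vectors through a unique table shrinks the representation.

```python
def build_dd(vec, unique):
    """Return the node id for `vec` (length must be a power of two).

    `unique` maps a node's structural key to its id, so identical
    sub-vectors are stored once and shared.
    """
    if len(vec) == 1:
        key = ("leaf", vec[0])
    else:
        half = len(vec) // 2
        low = build_dd(vec[:half], unique)    # top index bit = 0
        high = build_dd(vec[half:], unique)   # top index bit = 1
        key = ("node", len(vec), low, high)
    if key not in unique:
        unique[key] = len(unique)             # allocate a new shared node
    return unique[key]


def count_nodes(vec):
    """Number of distinct nodes (including leaves) needed to represent `vec`."""
    unique = {}
    build_dd(vec, unique)
    return len(unique)


uniform = [0.25] * 16        # amplitudes of a 4-qubit uniform superposition
distinct = list(range(16))   # a vector with no repeated structure
print(count_nodes(uniform), count_nodes(distinct))
```

The uniform vector needs one node per level plus a single leaf, while the vector with no repeated values needs a full binary tree. Practical decision diagrams for quantum simulation additionally attach normalized weights to edges so that sub-vectors differing only by a scalar factor can be shared as well.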

Deep Blue DOI

https://dx.doi.org/10.7302/24023

Subjects

Domain-Specific Architecture

H.264/AVC

Autonomous Navigation

Neural Network

Quantum Circuit Simulation

Decision Diagram

Types

Thesis

Handle

https://hdl.handle.net/2027.42/194675

Collections

Dissertations and Theses (Ph.D. and Master's)
