Acceleration Techniques of Sparse Linear Algebra on Emerging Architectures
Feng, Siying
2022
Abstract
Recent years have witnessed a tremendous surge of interest in accelerating sparse linear algebra applications. Sparse linear algebra is a fundamental building block, and often the performance bottleneck, of a wide range of applications, such as machine learning, graph processing, and scientific computing. Optimizing sparse linear algebra kernels is thus critical for the efficient computation of these workloads. The key challenge of sparse linear algebra lies in the irregular access patterns induced by the sparse nature of the data, which render the deep cache hierarchy in General-Purpose Processors (GPPs) ineffective and make sparse linear algebra applications notoriously memory intensive. This dissertation proposes multiple approaches to optimize the performance and efficiency of sparse linear algebra kernels by taking advantage of emerging architecture techniques, including hardware specialization, architecture reconfiguration, and Near-Memory Processing (NMP). Aiming for a balance among efficiency, flexibility, and programmability in architecture design, this dissertation first proposes Transmuter, a reconfigurable architecture that features massively parallel Processing Elements (PEs) and a flexible on-chip memory hierarchy that reconfigures the memory type, resource sharing, and dataflow at runtime to adapt to different applications. Transmuter demonstrates significant efficiency gains over the CPU and GPU across a diverse set of commonly used kernels while offering GPU-like programmability. More importantly, Transmuter retains high performance for sparse linear algebra kernels, achieving an energy efficiency within 4.1x of state-of-the-art functionally-equivalent Application-Specific Integrated Circuits (ASICs).
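The irregular access pattern described above can be seen in the inner loop of Sparse Matrix-Vector Multiplication (SpMV) over a Compressed Sparse Row (CSR) matrix: the indirect read through the column-index array is data-dependent, so it defeats caches and prefetchers built for regular strides. The sketch below is purely illustrative (it is not code from the dissertation or from Transmuter):

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a CSR matrix A (illustrative sketch only).

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx[j] and vals[j] give each nonzero's column and value.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            # x[col_idx[j]] is an indirect, data-dependent load:
            # this is the irregular access that makes SpMV memory-bound.
            y[i] += vals[j] * x[col_idx[j]]
    return y

# 3x3 example: A = [[1,0,2],[0,3,0],[4,0,5]], x = [1,1,1]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```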
As the algorithm mapping and hardware configuration play a crucial role in the performance of Transmuter, the next piece of this dissertation proposes the CoSPARSE framework, which guides the runtime reconfiguration of Transmuter to achieve the best performance for graph analytics workloads. During execution, CoSPARSE intelligently reconfigures to the best-performing software algorithm and hardware configuration for Transmuter based on the input characteristics. The synergistic software and hardware reconfiguration yields a net speedup of up to 2.0x over a naive baseline implementation with no software or hardware reconfiguration. Compared to a recent graph processing framework on a server-class CPU, CoSPARSE achieves an average speedup and energy efficiency improvement of 1.5x and 404.4x, respectively, across a suite of widely used graph algorithms. The dynamic algorithm reconfiguration of CoSPARSE and many other graph frameworks requires the input graph to be stored in multiple data formats to avoid runtime transposition, trading off storage for performance. As data sizes keep growing, to prevent designs like CoSPARSE from incurring expensive disk accesses when memory storage is limited, the final part of this dissertation presents MeNDA, a scalable near-memory accelerator for sparse matrix transposition and sparse merging dataflows. Because MeNDA is built on the widely applicable primitive of multi-way merge sorting, it is easily adapted to other sparse kernels such as Sparse Matrix-Vector Multiplication (SpMV). Compared to two state-of-the-art CPU implementations of sparse matrix transposition and a sparse library on a GPU, MeNDA achieves speedups of 19.1x, 12.0x, and 7.7x, respectively. Because MeNDA greatly reduces the runtime transposition overhead, integrating MeNDA can save reconfigurable graph frameworks such as CoSPARSE from storing two or more copies of the graph in main memory, at a minor power overhead of 78.6 mW per rank for commodity DRAM devices.
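The multi-way-merge view of sparse transposition mentioned above can be sketched in software: each CSR row is already a run sorted by column index, so merging all rows as (column, row, value) triples produces the nonzeros of the transpose in CSR order. This is a minimal functional sketch of the dataflow only, assuming a `heapq.merge`-style software merger in place of MeNDA's near-memory hardware merge trees:

```python
import heapq

def transpose_csr(row_ptr, col_idx, vals, n_cols):
    """Transpose a CSR matrix by multi-way merging its per-row sorted runs.

    Illustrative sketch of the merge dataflow, not MeNDA's hardware design.
    Returns (t_ptr, t_idx, t_vals): the transpose, also in CSR form.
    """
    n_rows = len(row_ptr) - 1
    # Each row's nonzeros, as (col, row, val) triples, form a sorted run.
    runs = [
        [(col_idx[j], i, vals[j]) for j in range(row_ptr[i], row_ptr[i + 1])]
        for i in range(n_rows)
    ]
    t_ptr, t_idx, t_vals = [0], [], []
    cur = 0
    # The k-way merge emits nonzeros ordered by (col, row): exactly the
    # nonzero order of the transposed matrix, one output column at a time.
    for col, row, val in heapq.merge(*runs):
        while cur < col:               # close out columns with no more nonzeros
            t_ptr.append(len(t_idx))
            cur += 1
        t_idx.append(row)
        t_vals.append(val)
    while cur < n_cols:                # close out trailing empty columns
        t_ptr.append(len(t_idx))
        cur += 1
    return t_ptr, t_idx, t_vals

# A = [[1,0,2],[0,3,0],[4,0,5]]  ->  A^T = [[1,0,4],[0,3,0],[2,0,5]]
print(transpose_csr([0, 2, 3, 5], [0, 2, 1, 0, 2],
                    [1.0, 2.0, 3.0, 4.0, 5.0], 3))
```

Swapping the merge key to accumulate partial products instead of reordering triples is, in the same spirit, how a merge-based engine also serves SpMV and other sparse merging dataflows.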
Subjects
Computer Architecture for Sparse Linear Algebra Applications; Coarse-Grained Reconfigurable Architecture; Intelligent Runtime Architecture Reconfiguration; Near-Memory Processing; Graph Analytics; Sparse Matrix Transposition; Sparse Multi-way Merge Sorting
Types
Thesis