Efficient sparse-matrix multi-vector product on GPUs

Published on Jun 11, 2018
· DOI: 10.1145/3208040.3208062
Changwan Hong (OSU: Ohio State University), Estimated H-index: 9
Aravind Sukumaran-Rajam (OSU: Ohio State University), Estimated H-index: 11
+ 7 authors, including P. Sadayappan (OSU: Ohio State University), Estimated H-index: 71
Abstract
Sparse Matrix-Vector (SpMV) and Sparse Matrix-Multivector (SpMM) products are key kernels for computational science and data science. While GPUs offer significantly higher peak performance and memory bandwidth than multicore CPUs, achieving high performance on sparse computations on GPUs is very challenging. A tremendous amount of recent research has focused on various GPU implementations of the SpMV kernel. But the multi-vector SpMM kernel has received much less attention. In this paper, we present an in-depth analysis to contrast SpMV and SpMM, and develop a new sparse-matrix representation and computation approach suited to achieving high data-movement efficiency and effective GPU parallelization of SpMM. Experimental evaluation using the entire SuiteSparse matrix suite demonstrates significant performance improvement over existing SpMM implementations from vendor libraries.
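To make the kernel contrast concrete, the following is a minimal NumPy/SciPy sketch of the SpMM operation the paper targets; it is not the paper's GPU kernel or sparse representation, and the matrix sizes and density are arbitrary.

```python
# Minimal sketch of SpMM: Y = A * X, where A is sparse (CSR) and X is a dense
# multivector with k columns. Computing all k columns in one pass lets each
# nonzero of A be reused k times, which is the data-movement advantage SpMM
# has over k independent SpMV calls.
import numpy as np
import scipy.sparse as sp

m, n, k = 1000, 800, 16                      # matrix dimensions, number of vectors
A = sp.random(m, n, density=0.01, format="csr", random_state=0)
X = np.random.rand(n, k)

Y_spmm = A @ X                               # one SpMM: nonzeros of A read once
Y_spmv = np.column_stack([A @ X[:, j] for j in range(k)])   # k SpMVs: A read k times

assert np.allclose(Y_spmm, Y_spmv)
```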
References (35)
Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and technological progression, coupled with a reluctance by manufacturers to disclose low-level details, makes it difficult for even the most proficient GPU software designers to remain up-to-date with the technological advances at a microarchitectural level. To address this dearth of public, microarchitectural-level information on the novel NVIDIA GPUs, independent researchers have resorted to microbenchmarks-based di...
Jun 14, 2017 in ICS (International Conference on Supercomputing)
#1 Markus Steinberger (MPG: Max Planck Society), H-Index: 17
#2 Rhaleb Zayer (MPG: Max Planck Society), H-Index: 14
Last: Hans-Peter Seidel (MPG: Max Planck Society), H-Index: 119 (3 authors in total)
The rising popularity of the graphics processing unit (GPU) across various numerical computing applications triggered a breakneck race to optimize key numerical kernels and in particular, the sparse matrix-vector product (SpMV). Despite great strides, most existing GPU-SpMV approaches trade off one aspect of performance against another. They either require preprocessing, exhibit inconsistent behavior, lead to execution divergence, suffer load imbalance or induce detrimental memory access pattern...
#1 Jongsoo Park (Intel), H-Index: 24
#2 Sheng Li (Intel), H-Index: 17
Last: Pradeep Dubey (Intel), H-Index: 49 (7 authors in total)
Phenomenally successful in practical inference problems, convolutional neural networks (CNN) are widely deployed in mobile devices, data centers, and even supercomputers. The number of parameters needed in CNNs, however, is often large and undesirable. Consequently, various methods have been developed to prune a CNN once it is trained. Nevertheless, the resulting CNNs offer limited benefits. While pruning the fully connected layers reduces a CNN's size considerably, it does not improve inferenc...
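As a rough illustration of why pruned networks lead to sparse-matrix kernels like SpMV/SpMM, here is a generic magnitude-pruning sketch; it is not necessarily the pruning method of this reference, and the layer size and 90% pruning ratio are arbitrary assumptions.

```python
# Generic magnitude pruning of a dense fully-connected layer: zero out the
# smallest weights, store the survivors in CSR, and the layer becomes a
# sparse-matrix product at inference time.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))          # dense fully-connected weights
threshold = np.quantile(np.abs(W), 0.9)      # keep only the largest 10% of weights
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

W_sparse = sp.csr_matrix(W_pruned)           # sparse storage after pruning
x = rng.standard_normal(256)
y = W_sparse @ x                             # inference is now an SpMV
print(f"fraction of weights kept: {W_sparse.nnz / W.size:.2%}")
```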
Aug 1, 2016 in ICPP (International Conference on Parallel Processing)
#1 Akrem Benatia (BIT: Beijing Institute of Technology), H-Index: 3
#2 Weixing Ji (BIT: Beijing Institute of Technology), H-Index: 9
Last: Feng Shi (BIT: Beijing Institute of Technology), H-Index: 10 (4 authors in total)
The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats have recently been proposed for this kernel on GPUs. Since the performance of these sparse formats varies significantly with the sparsity characteristics of the input matrix and the hardware specifications, none of them can be considered the best choice for every sparse matrix. In this paper, we address the ...
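Format-selection work of this kind typically characterizes a matrix by simple sparsity statistics before choosing a format. The feature set and helper below are an illustrative assumption, not the reference's model.

```python
# Illustrative sparsity features for a CSR matrix: row-length statistics are
# the kind of signal that separates matrices suited to ELL-like formats
# (uniform rows) from those better served by CSR-based kernels (skewed rows).
import numpy as np
import scipy.sparse as sp

def sparsity_features(A: sp.csr_matrix) -> dict:
    row_lengths = np.diff(A.indptr)          # nonzeros per row
    mean_len = row_lengths.mean()
    return {
        "rows": A.shape[0],
        "nnz": A.nnz,
        "mean_row_len": float(mean_len),
        "max_row_len": int(row_lengths.max()),
        "row_len_cv": float(row_lengths.std() / max(mean_len, 1e-12)),
    }

A = sp.random(2000, 2000, density=0.002, format="csr", random_state=1)
print(sparsity_features(A))
```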
Apr 4, 2016 in PDP (Parallel, Distributed and Network-Based Processing)
#1 Elias Konstantinidis (UoA: National and Kapodistrian University of Athens), H-Index: 6
#2 Yiannis Cotronis (UoA: National and Kapodistrian University of Athens), H-Index: 9
Modern Graphics Processing Units (GPUs) have evolved into high-performance general-purpose processors, forming an alternative to CPUs. However, programming them effectively has proven to be a challenge, not only because of the mandatory requirement of extracting massive fine-grained parallelism but also because of their performance sensitivity to memory traffic. Apart from regular memory caches, GPUs feature other types of fast memories as well, for instance scratchpads, texture caches, etc. In order to gai...
Feb 27, 2016 in PPoPP (ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
#1 Duane Merrill (Nvidia), H-Index: 10
#2 Michael Garland (Nvidia), H-Index: 55
We present a perfectly balanced, "merge-based" parallel method for computing sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format, a predominant in-memory representation for general-purpose sparse linear algebra computations. Our CsrMV performs an equitable multi-partitioning of the input dataset, ensuring that no single thread can be overwhelmed by assignment to (a) arbitrarily-long rows or (b) an arbitrarily-large numbe...
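The equal-work idea behind merge-based CsrMV can be conveyed in simplified form as follows; the actual algorithm performs a two-dimensional merge of row offsets and nonzero indices and can split individual rows across partitions, which this sketch does not reproduce, and the partition count and matrix are arbitrary.

```python
# Simplified equal-work partitioning over a CSR matrix: give each partition the
# same number of nonzeros and locate the row in which each partition starts.
import numpy as np
import scipy.sparse as sp

A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
num_parts = 8
splits = np.linspace(0, A.nnz, num_parts + 1).astype(int)   # equal nonzero shares
start_rows = np.searchsorted(A.indptr, splits[:-1], side="right") - 1

for p in range(num_parts):
    print(f"partition {p}: nonzeros [{splits[p]}, {splits[p + 1]}), "
          f"starts in row {start_rows[p]}")
```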
Feb 15, 2016 in ICLR (International Conference on Learning Representations)
#1 Song Han (Stanford University), H-Index: 45
#2 Huizi Mao (THU: Tsinghua University), H-Index: 19
Last: William J. Dally (Stanford University), H-Index: 99 (3 authors in total)
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three-stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connection...
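A rough sketch of the weight-sharing (quantization) stage of such pipelines follows; the codebook size and update loop are generic k-means assumptions, not the reference's exact procedure.

```python
# Generic weight-sharing quantization: cluster weights into a small codebook so
# each weight is stored as a short index plus a shared centroid value.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(10_000)

k = 16                                        # 4-bit codebook
centroids = np.linspace(weights.min(), weights.max(), k)
for _ in range(10):                           # a few Lloyd (k-means) iterations
    codes = np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)
    for c in range(k):
        if np.any(codes == c):
            centroids[c] = weights[codes == c].mean()

quantized = centroids[codes]                  # reconstructed weights
print("mean absolute quantization error:", np.abs(weights - quantized).mean())
```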
Dec 16, 2015 in HiPC (IEEE International Conference on High Performance Computing, Data, and Analytics)
#1 Mayank Daga (AMD: Advanced Micro Devices), H-Index: 11
#2 Joseph L. Greathouse (AMD: Advanced Micro Devices), H-Index: 15
Sparse matrix vector multiplication (SpMV) is an important linear algebra primitive. Recent research has focused on improving the performance of SpMV on GPUs when using compressed sparse row (CSR), the most frequently used matrix storage format on CPUs. Efficient CSR-based SpMV obviates the need for other GPU-specific storage formats, thereby saving runtime and storage overheads. However, existing CSR-based SpMV algorithms on GPUs perform poorly on irregular sparse matrices, limiting their usefu...
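For reference, a plain scalar CSR SpMV traversal is sketched below; in "CSR-scalar" style GPU kernels one thread handles one row, so the irregular rows discussed above dominate the running time. This is a generic illustration, not the reference's algorithm.

```python
# Reference scalar CSR SpMV: walk each row's slice of the data/indices arrays.
import numpy as np
import scipy.sparse as sp

def csr_spmv(A: sp.csr_matrix, x: np.ndarray) -> np.ndarray:
    y = np.zeros(A.shape[0])
    for row in range(A.shape[0]):
        for k in range(A.indptr[row], A.indptr[row + 1]):
            y[row] += A.data[k] * x[A.indices[k]]
    return y

A = sp.random(500, 500, density=0.01, format="csr", random_state=0)
x = np.random.rand(500)
assert np.allclose(csr_spmv(A, x), A @ x)
```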
#1 Weifeng Liu (UCPH: University of Copenhagen), H-Index: 15
#2 Brian Vinter (UCPH: University of Copenhagen), H-Index: 15
General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM implementation has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the resulting sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the r...
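A row-wise (Gustavson-style) SpGEMM sketch makes the "unknown output size" issue concrete: the nonzero structure of each output row is only known after its contributions are accumulated, which is why parallel implementations need symbolic passes or dynamic accumulators. This is a generic sequential illustration, not the reference's GPU algorithm.

```python
# Row-wise SpGEMM with a per-row hash-map accumulator; the output structure is
# discovered as the row is computed.
import numpy as np
import scipy.sparse as sp

def spgemm_rowwise(A: sp.csr_matrix, B: sp.csr_matrix) -> sp.csr_matrix:
    rows, cols, vals = [], [], []
    for i in range(A.shape[0]):
        acc = {}                              # sparse accumulator for output row i
        for ka in range(A.indptr[i], A.indptr[i + 1]):
            j, a = A.indices[ka], A.data[ka]
            for kb in range(B.indptr[j], B.indptr[j + 1]):
                col = B.indices[kb]
                acc[col] = acc.get(col, 0.0) + a * B.data[kb]
        for col, v in acc.items():
            rows.append(i); cols.append(col); vals.append(v)
    return sp.csr_matrix((vals, (rows, cols)), shape=(A.shape[0], B.shape[1]))

A = sp.random(100, 80, density=0.05, format="csr", random_state=0)
B = sp.random(80, 120, density=0.05, format="csr", random_state=1)
assert np.allclose(spgemm_rowwise(A, B).toarray(), (A @ B).toarray())
```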
Apr 12, 2015 in HiPC (IEEE International Conference on High Performance Computing, Data, and Analytics)
#2 Stanimire Tomov (UT: University of Tennessee), H-Index: 35
This paper presents a heterogeneous CPU-GPU implementation for a sparse iterative eigensolver -- the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG). For the key routine generating the Krylov search spaces via the product of a sparse matrix and a block of vectors, we propose a GPU kernel based on a modified sliced ELLPACK format. Blocking a set of vectors and processing them simultaneously accelerates the computation of a set of consecutive SpMVs significantly. Comparing the per...
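The role of SpMM in such block solvers can be sketched as follows: each sweep applies the sparse operator to a whole block of vectors, so the dominant kernel is one SpMM rather than many SpMVs. This is a generic block-iteration illustration, not the reference's sliced-ELLPACK kernel, and the sizes are arbitrary.

```python
# Generic block iteration: one SpMM per sweep generates the next block of
# search directions, followed by re-orthonormalization of the block.
import numpy as np
import scipy.sparse as sp

n, block = 2000, 8
A = sp.random(n, n, density=0.002, format="csr", random_state=0)
A = A + A.T                                   # make the operator symmetric
V = np.linalg.qr(np.random.rand(n, block))[0] # orthonormal starting block

for _ in range(5):
    W = A @ V                                 # one SpMM per sweep
    V = np.linalg.qr(W)[0]                    # re-orthonormalize the block
```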
Cited By (26)
#1 Francesco Rizzi, H-Index: 9
#2 Eric J. Parish, H-Index: 9
Last: John Tencer, H-Index: 6 (4 authors in total)
This work aims to advance computational methods for projection-based reduced order models (ROMs) of linear time-invariant (LTI) dynamical systems. For such systems, current practice relies on ROM formulations expressing the state as a rank-1 tensor (i.e., a vector), leading to computational kernels that are memory bandwidth bound and, therefore, ill-suited for scalable performance on modern many-core and hybrid computing nodes. This weakness can be particularly limiting when tackling many-query ...
#1 Guyue Huang (UCSB: University of California, Santa Barbara), H-Index: 3
Sparse matrix-vector and matrix-matrix multiplication (SpMV and SpMM) are fundamental in both conventional (graph analytics, scientific computing) and emerging (sparse DNN, GNN) domains. Workload-balancing and parallel-reduction are widely-used design principles for efficient SpMV. However, prior work fails to resolve how to implement and adaptively use the two principles for SpMV/MM. To overcome this obstacle, we first complete the implementation space with optimizations by filling three missin...
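The parallel-reduction formulation mentioned above can be sketched as a flat pass that computes one product per nonzero followed by a reduction keyed by row; on a GPU this maps to a segmented reduction, and here a bincount over row ids plays that role. This is a generic illustration, not the reference's kernels.

```python
# SpMV as flat per-nonzero products followed by a reduction keyed by row id.
import numpy as np
import scipy.sparse as sp

A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
x = np.random.rand(1000)

products = A.data * x[A.indices]              # one product per nonzero
row_ids = np.repeat(np.arange(A.shape[0]), np.diff(A.indptr))
y = np.bincount(row_ids, weights=products, minlength=A.shape[0])

assert np.allclose(y, A @ x)
```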
#1 Haodong Bian (Qinghai University), H-Index: 2
#2 Jianqiang Huang (Qinghai University), H-Index: 14
Last: Xiaoying Wang (Qinghai University), H-Index: 14 (5 authors in total)
SpMV (Sparse matrix–vector multiplication) is widely used in many fields. Improving the performance of SpMV has been the pursuit of many researchers. Parallel SpMV using multi-core processors has been a standard parallel method used by researchers. In reality, the number of non-zero elements in many sparse matrices is not evenly distributed, so parallelism without preprocessing will cause a large amount of performance loss due to uneven load. In this paper, we propose ALBUS (Absolute Lo...
Deep learning on graphs has attracted tremendous attention from the graph learning community in recent years. It has been widely used in several real-world applications such as social network analysis and recommender systems. In this paper, we introduce CogDL, an extensive toolkit for deep learning on graphs that allows researchers and developers to easily conduct experiments and build applications. It provides standard training and evaluation for the most important tasks in the graph domain, in...
Nov 2, 2020 in ICCAD (International Conference on Computer Aided Design)
#1 Ying Zhang (Stevens Institute of Technology), H-Index: 2
#2 Zhiqiang Zhao (MTU: Michigan Technological University), H-Index: 5
Last: Zhuo Feng (Stevens Institute of Technology), H-Index: 18 (3 authors in total)
Recent spectral graph sparsification techniques have shown promising performance in accelerating many numerical and graph algorithms, such as iterative methods for solving large sparse matrices, spectral partitioning of undirected graphs, vectorless verification of power/thermal grids, representation learning of large graphs, etc. However, prior spectral graph sparsification methods rely on fast Laplacian matrix solvers that are usually challenging to implement in practice. This work, for the fi...
#1 Sureyya Emre Kurt (UofU: University of Utah)
#2 Aravind Sukumaran-Rajam (WSU: Washington State University), H-Index: 1
Last: P. Sadayappan (UofU: University of Utah), H-Index: 1 (4 authors in total)
Tiling is a key technique to reduce data movement in matrix computations. While tiling is well understood and widely used for dense matrix/tensor computations, effective tiling of sparse matrix computations remains a challenging problem. This paper proposes a novel method to efficiently summarize the impact of the sparsity structure of a matrix on achievable data reuse as a one-dimensional signature, which is then used to build an analytical cost model for tile size optimization for sparse matri...
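A generic way to quantify the kind of reuse such tiling exploits is sketched below: within a horizontal panel of rows, columns touched by more than one row let the corresponding dense entries be reused once the panel is processed as a tile. The reference's actual one-dimensional signature and cost model are not reproduced here, and the panel size is arbitrary.

```python
# Count columns with intra-panel reuse for one row panel of a CSR matrix.
import numpy as np
import scipy.sparse as sp

A = sp.random(4096, 4096, density=0.002, format="csr", random_state=0)
panel_rows = 256
panel = A[:panel_rows, :].tocsc()
touches_per_col = np.diff(panel.indptr)       # rows touching each column in the panel
reused_cols = int(np.count_nonzero(touches_per_col > 1))
print(f"columns with intra-panel reuse: {reused_cols} of {A.shape[1]}")
```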
#1 Srinivas Eswar (Georgia Institute of Technology), H-Index: 3
#2 Koby Hayashi (Georgia Institute of Technology), H-Index: 5
Last: Haesun Park (Georgia Institute of Technology), H-Index: 57 (6 authors in total)
We develop the first distributed-memory parallel implementation of Symmetric Nonnegative Matrix Factorization (SymNMF), a key data analytics kernel for clustering and dimensionality reduction. Our implementation includes two different algorithms for SymNMF, which give comparable results in terms of time and accuracy. The first algorithm is a parallelization of an existing sequential approach that uses solvers for nonsymmetric NMF. The second algorithm is a novel approach based on the Gauss-Newt...
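The objective this work parallelizes can be stated as minimizing ||A - H H^T||_F^2 over nonnegative H. The projected-gradient sketch below is only to fix ideas: the step size, initialization, and iteration count are untuned assumptions, and the reference's ANLS and Gauss-Newton algorithms are more sophisticated.

```python
# SymNMF objective:  minimize  || A - H H^T ||_F^2  subject to  H >= 0,
# illustrated with a plain projected-gradient update.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 10
A = rng.random((n, n))
A = (A + A.T) / 2                              # symmetric nonnegative input
H = 2 * np.sqrt(A.mean() / k) * rng.random((n, k))   # rough-scale initialization

step = 1e-4
for _ in range(300):
    grad = 4 * (H @ (H.T @ H) - A @ H)         # gradient of ||A - H H^T||_F^2
    H = np.maximum(H - step * grad, 0.0)       # project onto H >= 0

print("residual:", np.linalg.norm(A - H @ H.T, "fro"))
```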
Sep 30, 2020 in PACT (International Conference on Parallel Architectures and Compilation Techniques)
#1 Ziheng Wang (MIT: Massachusetts Institute of Technology), H-Index: 4
In recent years, there has been a flurry of research in deep neural network pruning and compression. Early approaches prune weights individually. However, it is difficult to take advantage of the resulting unstructured sparsity patterns on modern hardware like GPUs. As a result, pruning strategies which impose sparsity structures in the weights have become more popular. However, these structured pruning approaches typically lead to higher losses in accuracy than unstructured pruning. In this pape...
#1 Chen-Ting Chao (NTHU: National Tsing Hua University), H-Index: 2
#2 Wei-Hsu Chu (NTHU: National Tsing Hua University), H-Index: 1
Last: Hsiang-Wei Sung (MediaTek), H-Index: 2 (6 authors in total)
In natural language processing (NLP), the general way to capture the meaning of a word is via word embeddings. A word embedding training model can convert words into multidimensional vectors, turning words without machine-readable "meaning" into vectors that carry "meaning". Famous word embedding training models include FastText, Word2Vec, and GloVe. They can train words into vectors, which are then used for further semantic classification. In this paper, we work on the efficient sup...
Graph Neural Networks (GNNs) have achieved significant improvements in various domains. Sparse Matrix-Matrix multiplication (SpMM) is a fundamental operator in GNNs, which performs a multiplication between a sparse matrix and a dense matrix. Accelerating SpMM on parallel hardware like GPUs can face the following challenges: From the GNN application perspective, the compatibility needs to be considered. General GNN algorithms require SpMM-like operations (e.g., pooling) between matrices, which ar...
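The aggregation step described above reduces to exactly the kernel studied in this paper; a minimal GCN-style sketch follows (generic, not the reference's implementation; the graph and feature sizes are arbitrary).

```python
# GNN neighborhood aggregation as SpMM: a row-normalized sparse adjacency matrix
# times a dense node-feature multivector.
import numpy as np
import scipy.sparse as sp

num_nodes, feat_dim = 1000, 64
A = sp.random(num_nodes, num_nodes, density=0.005, format="csr", random_state=0)
A = ((A + A.T) > 0).astype(np.float64)        # undirected, unweighted adjacency

deg = np.asarray(A.sum(axis=1)).ravel()
d_inv = np.where(deg > 0, 1.0 / deg, 0.0)
A_norm = sp.diags(d_inv) @ A                  # row-normalized adjacency (D^-1 A)

H = np.random.rand(num_nodes, feat_dim)       # node feature multivector
H_next = A_norm @ H                           # one SpMM per GNN layer
```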