Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor

Published on Mar 4, 2020 in ACM Transactions on Architecture and Code Optimization · 0.919 · DOI: 10.1145/3378176
Lijuan Jiang (CAS: Chinese Academy of Sciences), Estimated H-index: 3
Chao Yang (PKU: Peking University), Estimated H-index: 16
Wen-Jing Ma (CAS: Chinese Academy of Sciences), Estimated H-index: 2
We present a systematic methodology for optimizing batched matrix multiplications on the SW26010 many-core processor of the Sunway TaihuLight supercomputer. Five surrogate algorithms and a machine lear...
Feb 16, 2019 in PPoPP (ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
#1 Xiuhong Li (PKU: Peking University), H-Index: 7
#2 Yun Liang (PKU: Peking University), H-Index: 32
Last. Yinghan Li (SenseTime), H-Index: 1
view all 5 authors...
General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash their computing power only when the matrix size is large and there are a sufficient number of tiles and workload for ...
Jan 1, 2019 in PARCO (Parallel Computing)
#4 Stanimire Tomov (UT: University of Tennessee), H-Index: 35
Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, In...
Aug 1, 2017 in ICPP (International Conference on Parallel Processing)
#1 James Lin (SJTU: Shanghai Jiao Tong University), H-Index: 19
#2 Zhigeng Xu (SJTU: Shanghai Jiao Tong University), H-Index: 3
Last. Satoshi Matsuoka (TITech: Tokyo Institute of Technology), H-Index: 59
view all 5 authors...
The home-grown SW26010 many-core processor enabled the production of China’s first independently developed number-one ranked supercomputer – the Sunway TaihuLight. The design of the limited off-chip memory bandwidth, however, renders the SW26010 a highly memory-bound processor. To compensate for this limitation, the processor was designed with a unique hardware feature, "Register Level Communication" (RLC), to share register data among its 8 × 8 computing processing elements (CPEs) via a 2D onch...
Aug 1, 2017 in ICPP (International Conference on Parallel Processing)
#1 Lijuan Jiang (CAS: Chinese Academy of Sciences), H-Index: 3
#2 Chao Yang (CAS: Chinese Academy of Sciences), H-Index: 16
Last. Peng Zhang (CAS: Chinese Academy of Sciences), H-Index: 3
view all 9 authors...
The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general format matrix-matrix multiplication (DGEMM) kernel on the emerging SW26010 processor, which is...
Jun 14, 2017 in ICS (International Conference on Supercomputing)
#1 Ahmad Abdelfattah (UT: University of Tennessee), H-Index: 13
#2 Azzam Haidar (UT: University of Tennessee), H-Index: 22
Last. Jack Dongarra (UT: University of Tennessee), H-Index: 130
view all 4 authors...
This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations, and autotuning to exceed in performance proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is significantly important for w...
#1 Jack Dongarra (University of Manchester), H-Index: 130
#2 Sven Hammarling (University of Manchester), H-Index: 28
Last. Mawussi Zounon (University of Manchester), H-Index: 9
view all 6 authors...
A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks ...
We expose a systematic approach for developing distributed-memory parallel matrix-matrix multiplication algorithms. The journey starts with a description of how matrices are distributed to meshes of nodes (e.g., MPI processes), relates these distributions to scalable parallel implementation of matrix-vector multiplication and rank-1 update, continues on to reveal a family of matrix-matrix multiplication algorithms that view the nodes as a two-dimensional (2D) mesh, and finishes with extending th...
Nov 13, 2016 in HiPC (IEEE International Conference on High Performance Computing, Data, and Analytics)
#1 Jian Zhang (CAS: Chinese Academy of Sciences), H-Index: 43
#2 Chunbao Zhou (CAS: Chinese Academy of Sciences), H-Index: 1
Last. Zhao Liu, H-Index: 2
view all 10 authors...
Many important properties of materials such as strength, ductility, hardness and conductivity are determined by the microstructures of the material. During the formation of these microstructures, grain coarsening plays an important role. The Cahn-Hilliard equation has been applied extensively to simulate the coarsening kinetics of a two-phase microstructure. It is well accepted that the limited capabilities in conducting large scale, long time simulations constitute bottlenecks in predicting mic...
Nov 13, 2016 in HiPC (IEEE International Conference on High Performance Computing, Data, and Analytics)
#1 Chao Yang (CAS: Chinese Academy of Sciences), H-Index: 16
#2 Wei Xue (THU: Tsinghua University), H-Index: 32
Last. Weimin Zheng (THU: Tsinghua University), H-Index: 32
view all 12 authors...
An ultra-scalable fully-implicit solver is developed for stiff time-dependent problems arising from the hyperbolic conservation laws in nonhydrostatic atmospheric dynamics. In the solver, we propose a highly efficient hybrid domain-decomposed multigrid preconditioner that can greatly accelerate the convergence rate at the extreme scale. For solving the overlapped subdomain problems, a geometry-based pipelined incomplete LU factorization method is designed to further exploit the on-chip fine-grai...
#1 Jack Dongarra (UT: University of Tennessee), H-Index: 130
Cited By (3)
The depthwise separable convolution is commonly seen in convolutional neural networks (CNNs), and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target accelerating model training with large batch sizes with a large number of samples to be processed at once. Such approaches are inadequate for small-batch-sized model training and the typical scenario of model inference where the model takes...
#2 Xin Liu, H-Index: 5
#3 Yong Liu, H-Index: 4
Last. Wang Zhen, H-Index: 2
view all 10 authors...
Classical simulation of quantum computation plays a critical role in numerical studies of quantum algorithms and the validation of quantum devices. Here, we introduce SW_Qsim, a tensor-network-based quantum simulator, which is designed with a two-level parallel structure for efficient implementation on the many-core New Sunway Supercomputer. We propose a minimize-memory contraction path algorithm for rectangular quantum grids to reduce the memory overhead, and provide the memory-limited simulati...
#1 Francesco Rizzi, H-Index: 9
#2 Eric J. Parish, H-Index: 9
Last. John Tencer, H-Index: 6
view all 4 authors...
This work aims to advance computational methods for projection-based reduced order models (ROMs) of linear time-invariant (LTI) dynamical systems. For such systems, current practice relies on ROM formulations expressing the state as a rank-1 tensor (i.e., a vector), leading to computational kernels that are memory bandwidth bound and, therefore, ill-suited for scalable performance on modern many-core and hybrid computing nodes. This weakness can be particularly limiting when tackling many-query ...
#1 Suyash Bakshi (UH: University of Houston), H-Index: 1
#2 Lennart Johnsson (UH: University of Houston), H-Index: 9
Reducing energy consumption and achieving high energy efficiency in computation has become the top priority in High Performance Computing. High energy efficiency generally requires high resource utilization since energy demand for any applications and architectures is dependent on active time. We show that by using DMA the 28nm CMOS node Myriad-2 Vision Processing Unit can achieve 25 GFLOPs/W for FP32 matrix multiplication. Our main contributions are: (i) An analysis of data transfer needs for in...