# Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor

Published on Mar 4, 2020 in ACM Transactions on Architecture and Code Optimization · 0.919

· DOI: 10.1145/3378176

References (33)

Feb 16, 2019 in PPoPP (ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)

General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash their computing power only when the matrix size is large and there are a sufficient number of tiles and workload for ...
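The tiling idea summarized in this abstract can be illustrated with a minimal pure-Python sketch. It is not the cited paper's GPU implementation: the function name `tiled_matmul` and the `tile` parameter are hypothetical, and the loops run sequentially, only showing how the output is partitioned into independent tiles that a parallel implementation could assign to separate workers.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), visiting the work one tile at a time.

    Hypothetical illustration: each (i0, j0) output tile is independent of
    the others, which is the parallelism a tiled GEMM exploits.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):              # tile rows of C
        for j0 in range(0, n, tile):          # tile columns of C
            for p0 in range(0, k, tile):      # tiles along the shared dimension
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


result = tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]], tile=2)
```

When the matrices are small, there are few such tiles, which is the under-utilization problem the abstract points to.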

Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices

Jan 1, 2019 in PARCO (Parallel Computing)

Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, In...

Aug 1, 2017 in ICPP (International Conference on Parallel Processing)

The home-grown SW26010 many-core processor enabled the production of China’s first independently developed number-one ranked supercomputer – the Sunway TaihuLight. The design of the limited off-chip memory bandwidth, however, renders the SW26010 a highly memory-bound processor. To compensate for this limitation, the processor was designed with a unique hardware feature, "Register Level Communication" (RLC), to share register data among its 8 × 8 computing processing elements (CPEs) via a 2D onch...

Aug 1, 2017 in ICPP (International Conference on Parallel Processing)

The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general format matrix-matrix multiplication (DGEMM) kernel on the emerging SW26010 processor, which is...

Jun 14, 2017 in ICS (International Conference on Supercomputing)

This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations, and autotuning to exceed the performance of proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is significantly important for w...

A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems, that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks ...
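The batched pattern this abstract describes, many small independent problems behind one call, can be sketched as follows. This is a hedged illustration only: `gemm` and `batched_gemm` are hypothetical names, not the batched BLAS interface, and the sketch assumes the conventional GEMM update C = alpha * A * B + beta * C applied to each batch entry.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * a @ b + beta * c for small dense matrices (lists of rows)."""
    k, n = len(b), len(b[0])
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(len(a))
    ]


def batched_gemm(alpha, a_batch, b_batch, beta, c_batch):
    """Apply the same GEMM update to every (A, B, C) triple in the batch.

    Hypothetical sketch: the entries are independent, so a real library
    could process them in parallel; here they run one after another.
    """
    if not (len(a_batch) == len(b_batch) == len(c_batch)):
        raise ValueError("all batches must contain the same number of matrices")
    return [gemm(alpha, a, b, beta, c)
            for a, b, c in zip(a_batch, b_batch, c_batch)]


a_batch = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
b_batch = [[[1, 0], [0, 1]], [[5, 6], [7, 8]]]
c_batch = [[[1, 1], [1, 1]], [[0, 0], [0, 0]]]
out = batched_gemm(2, a_batch, b_batch, 1, c_batch)
```

Grouping the problems behind one call is what lets an implementation schedule the whole batch at once instead of paying per-call overhead for each small matrix.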

We expose a systematic approach for developing distributed-memory parallel matrix-matrix multiplication algorithms. The journey starts with a description of how matrices are distributed to meshes of nodes (e.g., MPI processes), relates these distributions to scalable parallel implementation of matrix-vector multiplication and rank-1 update, continues on to reveal a family of matrix-matrix multiplication algorithms that view the nodes as a two-dimensional (2D) mesh, and finishes with extending th...

Nov 13, 2016 in HiPC (IEEE International Conference on High Performance Computing, Data, and Analytics)

Many important properties of materials such as strength, ductility, hardness and conductivity are determined by the microstructures of the material. During the formation of these microstructures, grain coarsening plays an important role. The Cahn-Hilliard equation has been applied extensively to simulate the coarsening kinetics of a two-phase microstructure. It is well accepted that the limited capabilities in conducting large scale, long time simulations constitute bottlenecks in predicting mic...

Nov 13, 2016 in HiPC (IEEE International Conference on High Performance Computing, Data, and Analytics)

An ultra-scalable fully-implicit solver is developed for stiff time-dependent problems arising from the hyperbolic conservation laws in nonhydrostatic atmospheric dynamics. In the solver, we propose a highly efficient hybrid domain-decomposed multigrid preconditioner that can greatly accelerate the convergence rate at the extreme scale. For solving the overlapped subdomain problems, a geometry-based pipelined incomplete LU factorization method is designed to further exploit the on-chip fine-grai...

Cited By (3)

The depthwise separable convolution is commonly seen in convolutional neural networks (CNNs), and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target accelerating model training with large batch sizes with a large number of samples to be processed at once. Such approaches are inadequate for small-batch-sized model training and the typical scenario of model inference where the model takes...

Classical simulation of quantum computation plays a critical role in numerical studies of quantum algorithms and the validation of quantum devices. Here, we introduce SW_Qsim, a tensor-network-based quantum simulator, which is designed with a two-level parallel structure for efficient implementation on the many-core New Sunway Supercomputer. We propose a minimize-memory contraction path algorithm for rectangular quantum grids to reduce the memory overhead, and provide the memory-limited simulati...

This work aims to advance computational methods for projection-based reduced order models (ROMs) of linear time-invariant (LTI) dynamical systems. For such systems, current practice relies on ROM formulations expressing the state as a rank-1 tensor (i.e., a vector), leading to computational kernels that are memory bandwidth bound and, therefore, ill-suited for scalable performance on modern many-core and hybrid computing nodes. This weakness can be particularly limiting when tackling many-query ...

Reducing energy consumption and achieving high energy efficiency in computation has become the top priority in High Performance Computing. High energy efficiency generally requires high resource utilization since energy demand for any applications and architectures is dependent on active time. We show that by using DMA the 28 nm CMOS node Myriad-2 Vision Processing Unit can achieve 25 GFLOPs/W for FP32 matrix multiplication. Our main contributions are: (i) An analysis of data transfer needs for in...