A coordinated tiling and batching framework for efficient GEMM on GPUs

Published on Feb 16, 2019 in PPoPP (ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
DOI: 10.1145/3293883.3295734
Xiuhong Li (PKU: Peking University, H-index: 7), Yun Liang (PKU: Peking University, H-index: 32), + 2 authors, Yinghan Li (SenseTime, H-index: 1)
Abstract
General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash their computing power only when the matrix size is large and there are a sufficient number of tiles with enough workload per tile. However, in many real-world applications, especially in the deep learning domain, the matrix size is small. To this end, prior work proposes batched GEMM, which processes a group of small independent GEMMs together in a single CUDA kernel. However, current support for batched GEMM is still rudimentary. Tiling and batching are tightly correlated. A large tile size increases data reuse, but it decreases thread-level parallelism, which in turn shrinks the optimization space for batching. A small tile size increases thread-level parallelism and thus enlarges the optimization space for batching, but at the cost of sacrificing data reuse. In this paper, we propose a coordinated tiling and batching framework for accelerating GEMMs on GPUs. It is a two-phase framework consisting of a tiling engine and a batching engine that together perform efficient batched GEMM on GPUs. The tiling engine partitions the GEMMs into independent tiles, and the batching engine assigns the tiles to thread blocks. Moreover, we propose a general programming interface for the coordinated tiling and batching solution. Finally, experimental results on synthetic batched GEMM cases show that our framework achieves about 1.40X speedup on average over the state-of-the-art technique. In a real-world case study on GoogleNet, our framework achieves a 1.23X speedup.
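The two-phase design described in the abstract can be illustrated with a small NumPy sketch (a toy model with an assumed tile size, not the paper's actual CUDA kernel): a tiling step enumerates independent output tiles from every GEMM in the batch, and a batching step treats the flattened tile list as the unit of scheduling, the way a single kernel would assign one tile per thread block.

```python
import numpy as np

def batched_gemm_tiled(As, Bs, tile=2):
    """Toy model of coordinated tiling + batching.

    Tiling engine: split each GEMM in the batch into independent output
    tiles. Batching engine: put the tiles of all GEMMs into one flat
    work list, mimicking how a single CUDA kernel would assign one tile
    per thread block.
    """
    m, k = As[0].shape
    _, n = Bs[0].shape
    Cs = [np.zeros((m, n)) for _ in As]

    # Tiling engine: enumerate (batch index, row tile, col tile) work items.
    work = [(b, i, j)
            for b in range(len(As))
            for i in range(0, m, tile)
            for j in range(0, n, tile)]

    # Batching engine: every work item is independent, like a thread block.
    for b, i, j in work:
        Cs[b][i:i+tile, j:j+tile] = As[b][i:i+tile, :] @ Bs[b][:, j:j+tile]
    return Cs

rng = np.random.default_rng(0)
As = [rng.standard_normal((4, 3)) for _ in range(8)]
Bs = [rng.standard_normal((3, 4)) for _ in range(8)]
Cs = batched_gemm_tiled(As, Bs, tile=2)
assert all(np.allclose(C, A @ B) for A, B, C in zip(As, Bs, Cs))
```

Shrinking `tile` grows the work list (more parallelism for the batching step to exploit), but each small tile re-reads its rows of A and columns of B (less data reuse); this is the trade-off the framework coordinates.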
References (28)
#1Xiuhong Li (PKU: Peking University)H-Index: 7
#2Yun Liang (PKU: Peking University)H-Index: 32
Last. Ming Jiang (PKU: Peking University)H-Index: 23
view all 7 authors...
Low-dose X-ray computed tomography (XCT) is a popular imaging technique for visualizing the inside structure of an object non-destructively. The Model-Based Iterative Reconstruction (MBIR) method can reconstruct high-quality images, but at the cost of large computational demands. Therefore, MBIR often resorts to platforms with hardware accelerators such as GPUs to speed up the reconstruction process. For MBIR, the reconstruction process is to minimize an objective function by updating the image iteratively...
Source
#1Xiaolong Xie (PKU: Peking University)H-Index: 7
#2Yun Liang (PKU: Peking University)H-Index: 32
Last. Dongrui Fan (CAS: Chinese Academy of Sciences)H-Index: 17
view all 7 authors...
The key to high performance on GPUs lies in massive multithreading, which enables fast thread switching to hide long latencies. GPUs are equipped with a large register file to enable fast context switches. However, thread throttling techniques, designed to mitigate cache contention, lead to under-utilization of registers. Register allocation is a significant factor for performance, as it not only determines single-thread performance but also indirectly affects thread-level parallelism (TLP). In this paper, we propose C...
Source
Feb 10, 2018 in PPoPP (ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
#1Prashant Singh Rawat (OSU: Ohio State University)H-Index: 7
#2Fabrice RastelloH-Index: 16
Last. P. Sadayappan (OSU: Ohio State University)H-Index: 71
view all 6 authors...
The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are...
Source
Nov 13, 2017 in ICCAD (International Conference on Computer Aided Design)
#1Yun Liang (PKU: Peking University)H-Index: 32
#2Xiuhong Li (PKU: Peking University)H-Index: 7
Last. Xiaolong Xie (PKU: Peking University)H-Index: 7
view all 3 authors...
Graphics Processing Unit (GPU) computing has become ubiquitous in embedded systems, as evidenced by its wide adoption for various general-purpose applications. As more and more applications are accelerated by GPUs, multi-tasking scenarios start to emerge. Multi-tasking allows multiple applications to execute simultaneously on the same GPU and share its resources. This brings new challenges due to contention among the different applications for shared resources such as caches. However, the ...
Source
Oct 14, 2017 in MICRO (International Symposium on Microarchitecture)
#1Zhen Zheng (THU: Tsinghua University)H-Index: 5
#2Chanyoung Oh (SNU: Seoul National University)H-Index: 4
Last. Wenguang Chen (THU: Tsinghua University)H-Index: 27
view all 6 authors...
Pipelining is an important programming pattern, yet GPUs, designed mostly for data-level parallel execution, lack an efficient mechanism to support pipeline programming and execution. This paper provides a systematic examination of various existing pipeline execution models on GPUs, analyzing their strengths and weaknesses. To address their shortcomings, the paper then proposes three new execution models with much-improved controllability, including a hybrid model that is capable o...
Source
Jun 14, 2017 in ICS (International Conference on Supercomputing)
#1Ahmad Abdelfattah (UT: University of Tennessee)H-Index: 13
#2Azzam Haidar (UT: University of Tennessee)H-Index: 22
Last. Jack Dongarra (UT: University of Tennessee)H-Index: 130
view all 4 authors...
This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations, and autotuning to exceed the performance of proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is significantly important for w...
Source
Jun 14, 2017 in ICS (International Conference on Supercomputing)
#1Keren ZhouH-Index: 4
#2Guangming TanH-Index: 16
Last. Ninghui SunH-Index: 26
view all 5 authors...
GPUs are widely used to accelerate deep neural networks (DNNs) thanks to their high bandwidth and parallelism. But tuning the performance of DNN computations is challenging, as it requires a thorough understanding of both the underlying architectures and the algorithm implementations. Traditional research, which analyzes performance at the level of CUDA C code or PTX instructions, has not tied hardware features tightly to source code. In this paper, we present a performance analysis framework at th...
Source
#1Yun Liang (PKU: Peking University)H-Index: 32
#2Xiuhong Li (PKU: Peking University)H-Index: 7
Graphics Processing Units (GPUs) have been widely adopted as accelerators for compute-intensive applications due to their tremendous computational power and high memory bandwidth. As the complexity of applications continues to grow, each new generation of GPUs has been equipped with advanced architectural features and more resources to sustain its performance acceleration capability. Recent GPUs feature concurrent kernel execution, which is designed to improve resource utilizat...
Source
Apr 4, 2017 in ASPLOS (Architectural Support for Programming Languages and Operating Systems)
#1Ang Li (PNNL: Pacific Northwest National Laboratory)H-Index: 17
#2Shuaiwen Leon Song (PNNL: Pacific Northwest National Laboratory)H-Index: 21
Last. Henk Corporaal (TU/e: Eindhoven University of Technology)H-Index: 41
view all 6 authors...
Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly-shared L2 with long access latency; while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has been long ignored bu...
Source
Jan 26, 2017 in PPoPP (ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
#1Xiuxia Zhang (CAS: Chinese Academy of Sciences)H-Index: 3
#2Guangming Tan (CAS: Chinese Academy of Sciences)H-Index: 16
Last. Mingyu Chen (CAS: Chinese Academy of Sciences)H-Index: 19
view all 6 authors...
In this paper, we present a methodology to understand GPU microarchitectural features and improve performance for compute-intensive kernels. The methodology relies on a reverse engineering approach to crack the GPU ISA encodings in order to build a GPU assembler. An assembly microbenchmark suite correlates microarchitectural features with their performance factors to uncover instruction-level and memory hierarchy preferences. We use SGEMM as a running example to show the ways to achieve bare-met...
Source
Cited By (24)
There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The efficiency of these accelerators comes from employing optimized dataflow (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) strategies to optimize data reuse. The focus of this work is to evaluate these accelerator architectures ...
Source
The depthwise separable convolution is common in convolutional neural networks (CNNs) and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch-size model training and for the typical model-inference scenario where the model takes...
Source
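The computation saving behind depthwise separable convolutions can be shown with a minimal NumPy sketch (stride 1, no padding, channel-first layout; all names here are illustrative): a depthwise step filters each channel independently with its own spatial kernel, and a pointwise 1x1 step, which is just a small GEMM over channels, mixes the results.

```python
import numpy as np

def depthwise_separable_conv(x, dw, pw):
    """Minimal depthwise-separable convolution (stride 1, 'valid').

    x:  (C, H, W)    input feature map
    dw: (C, kh, kw)  one spatial filter per input channel (depthwise step)
    pw: (Cout, C)    1x1 pointwise filters mixing channels
    """
    C, H, W = x.shape
    _, kh, kw = dw.shape
    Ho, Wo = H - kh + 1, W - kw + 1

    # Depthwise: each channel is convolved independently with its own filter.
    mid = np.zeros((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                mid[c, i, j] = np.sum(x[c, i:i+kh, j:j+kw] * dw[c])

    # Pointwise: a 1x1 convolution is a small GEMM over the channel axis.
    return np.tensordot(pw, mid, axes=([1], [0]))  # (Cout, Ho, Wo)
```

Per output pixel, a standard convolution costs about Cout*C*kh*kw multiplies, while the separable form costs C*kh*kw + Cout*C, a reduction factor of roughly 1/Cout + 1/(kh*kw); the factorization equals a standard convolution whose kernel is the outer product of `pw` and `dw`.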
#1Weiling Yang (National University of Defense Technology)H-Index: 1
#2Jianbin Fang (National University of Defense Technology)H-Index: 13
Last. Zheng Wang (University of Leeds)H-Index: 33
view all 5 authors...
General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing. While mainstream linear algebra libraries deliver high performance on large, regular-shaped GEMMs, they are inadequate for optimizing small, irregular-shaped GEMMs, which are common in new HPC applications. Some recent works in this direction have made promising progress on x86 architectures and GPUs but still leave much room for improvement on emerging HPC hardware built upon the AR...
Source
#1Pratik Fegade (CMU: Carnegie Mellon University)H-Index: 3
#2Tianqi ChenH-Index: 29
Last. Todd C. MowryH-Index: 54
view all 4 authors...
There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks generally use techniques such as padding and masking to make the data shapes uniform and then offload the computations to optimized kernels for dense tensor algebra. Such techniques can, however, l...
#1Francesco RizziH-Index: 9
#2Eric J. ParishH-Index: 9
Last. John TencerH-Index: 6
view all 4 authors...
This work aims to advance computational methods for projection-based reduced order models (ROMs) of linear time-invariant (LTI) dynamical systems. For such systems, current practice relies on ROM formulations expressing the state as a rank-1 tensor (i.e., a vector), leading to computational kernels that are memory bandwidth bound and, therefore, ill-suited for scalable performance on modern many-core and hybrid computing nodes. This weakness can be particularly limiting when tackling many-query ...
Source
#1Ruimin Wang (SCUT: South China University of Technology)
#2Zhiwei Yang (SCUT: South China University of Technology)
Last. Lu Lu (SCUT: South China University of Technology)
view all 4 authors...
In the past few decades, general matrix multiplication (GEMM), as the basic component of the Basic Linear Algebra Subprograms (BLAS) library, has played a vital role in various fields such as machine learning, image processing, and fluid dynamics. Because these fields tend to deconstruct the problem into multiple smaller sub-problems, today's BLAS libraries have implemented batched GEMM routines to achieve high performance in this scenario. MAGMA proposes a vbatch routine to calculate batched GE...
Source
#1Cheolsun Lim (Hansung University)
#2Myungsun Kim (Hansung University)H-Index: 1
On-device DNN processing has become a common interest in the field of autonomous driving research. For better accuracy, both the number of DNN models and their complexity have increased. To respond to this properly, hardware platforms structured with multicore CPUs and DNN accelerators have been released, and the GPU is generally used as the accelerator. When multiple DNN workloads are sporadically requested, the GPU can easily be oversubscribed, leading to an unexpected perfo...
Source
Loop tiling is a key high-level transformation which is known to maximize locality in loop-intensive programs. It has been successfully applied to a number of applications including tensor contractions, iterative stencils, and machine learning. This technique has also been extended to a wide variety of computational domains and architectures. The performance achieved with this critical transformation largely depends on a set of given inputs, the tile sizes, due to the complex trade-off between lo...
Source
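The tile-size trade-off that the excerpt above refers to can be seen in a minimal loop-tiled matrix multiplication (a Python sketch with an illustrative tile size `T`; real autotuners search over `T` per architecture):

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Loop-tiled matrix multiplication.

    The i/j/k loops are blocked by tile size T so each T x T block of
    A, B, and C is reused while still hot in cache; slicing clips the
    ragged edges, so dimensions need not be multiples of T. Picking T
    is the tuning problem: too small wastes reuse, too large overflows
    the cache.
    """
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for ii in range(0, m, T):
        for jj in range(0, n, T):
            for kk in range(0, k, T):
                C[ii:ii+T, jj:jj+T] += A[ii:ii+T, kk:kk+T] @ B[kk:kk+T, jj:jj+T]
    return C

A = np.arange(20.0).reshape(5, 4)
B = np.arange(12.0).reshape(4, 3)
assert np.allclose(tiled_matmul(A, B, T=2), A @ B)
```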
#1Hadia Ahmed (LBNL: Lawrence Berkeley National Laboratory)H-Index: 6
#2David B. Williams-Young (LBNL: Lawrence Berkeley National Laboratory)H-Index: 13
Last. Chao Yang (LBNL: Lawrence Berkeley National Laboratory)H-Index: 27
view all 4 authors...
Tuning scientific code for heterogeneous computing architectures is a growing challenge. Not only do we need to tune the code for multiple architectures, but we also need to select or schedule computations on the most efficient compute variant. In this paper, we explore the tuning and performance-modeling question for one of the most time-consuming kernels in density functional theory calculations on systems with a multicore host CPU accelerated with GPUs. We show that the problem configuration dictates...
Source