NVIDIA Tensor Core Programmability, Performance & Precision

Published on May 21, 2018
DOI: 10.1109/IPDPSW.2018.00091
Stefano Markidis
Estimated H-index: 34
(KTH: Royal Institute of Technology),
Steven W. D. Chien
Estimated H-index: 7
+ 2 Authors
Jeffrey S. Vetter
Estimated H-index: 55
(ORNL: Oak Ridge National Laboratory)
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the Tensor Core, that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
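The mixed-precision trade-off described in the abstract can be illustrated with a small CPU-side emulation. The sketch below is not the authors' code and does not run on Tensor Cores: it assumes FP16 inputs with FP32 accumulation, and it assumes one plausible refinement scheme in which the rounding residuals R_A = A - A_h and R_B = B - B_h are themselves multiplied in mixed precision, so three products are computed instead of one. All function names are hypothetical. The 1.53 GHz clock used in `peak_tflops` is an assumption (the V100 boost clock), not a figure stated in the abstract.

```python
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 binary16 (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round x to the nearest IEEE 754 binary32 (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_mixed(A, B, C):
    """Emulate D = A*B + C on square matrices: FP16 inputs, FP32 accumulation."""
    n = len(A)
    Ah = [[to_half(v) for v in row] for row in A]
    Bh = [[to_half(v) for v in row] for row in B]
    D = []
    for i in range(n):
        row = []
        for j in range(n):
            acc = to_single(C[i][j])
            for k in range(n):
                # The running sum is rounded to FP32 after every accumulation.
                acc = to_single(acc + Ah[i][k] * Bh[k][j])
            row.append(acc)
        D.append(row)
    return D


def residual(M):
    """Part of M that is lost when M is rounded to FP16."""
    return [[v - to_half(v) for v in row] for row in M]


def mma_refined(A, B, C):
    """Assumed refinement: A_h*B_h + R_A*B_h + A_h*R_B + C (three MMAs)."""
    acc = mma_mixed(residual(A), B, C)    # R_A * B_h, small terms first
    acc = mma_mixed(A, residual(B), acc)  # A_h * R_B
    return mma_mixed(A, B, acc)           # A_h * B_h


def mma_double(A, B, C):
    """Double-precision reference result."""
    n = len(A)
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def max_abs_error(D, ref):
    """Largest element-wise absolute difference between D and ref."""
    return max(abs(d - r) for drow, rrow in zip(D, ref)
               for d, r in zip(drow, rrow))


def peak_tflops(tensor_cores=640, clock_ghz=1.53):
    """Peak rate if each core does one 4x4x4 MMA (64 FMAs = 128 flops) per clock."""
    flops_per_core_per_clock = 2 * 4 * 4 * 4
    return tensor_cores * flops_per_core_per_clock * clock_ghz / 1000.0


# Demo on 4x4 matrices whose entries are mostly not representable in FP16.
N = 4
A = [[1.0 / (i + k + 1) for k in range(N)] for i in range(N)]
B = [[1.0 / (k + j + 1) for j in range(N)] for k in range(N)]
C = [[0.0] * N for _ in range(N)]
REF = mma_double(A, B, C)
print("max error, mixed precision:", max_abs_error(mma_mixed(A, B, C), REF))
print("max error, with refinement:", max_abs_error(mma_refined(A, B, C), REF))
print("estimated peak (Tflops/s):", peak_tflops())
```

In this emulation the refined product costs three multiply-accumulate passes rather than one, which is the kind of "increased computation" the abstract refers to; the real hardware's internal rounding behaviour may differ from what is modeled here.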
#1 Azzam Haidar (UT: University of Tennessee), H-Index: 22
#2 Panruo Wu (UT: University of Tennessee), H-Index: 15
Last. Jack Dongarra (University of Manchester), H-Index: 130
view all 4 authors...
The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is in the high performance that can be achieved using it on today's powerful manycore GPU accelerators, e.g., the NVIDIA V100, that can pr...
#1 Cris Cecka (Nvidia), H-Index: 8
Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduc...
#1 Paulius Micikevicius (Nvidia), H-Index: 12
#2 Sharan Narang (Baidu), H-Index: 19
Last. Hao Wu, H-Index: 3
view all 11 authors...
Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase. We introduce a technique to train deep neural networks using half-precision floating-point numbers. In our technique, weights, activations and gradients are stored in IEEE half-precision format. Half-precision floating numbers have limited nume...
#1 Piotr Luszczek (UT: University of Tennessee), H-Index: 31
#2 Jakub Kurzak (UT: University of Tennessee), H-Index: 28
Last. Jack Dongarra (ORNL: Oak Ridge National Laboratory), H-Index: 130
view all 4 authors...
With NVIDIA Tegra Jetson X1 and Pascal P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers, also called half-precision arithmetic. In this talk, we will introduce the steps required to build a viable benchmark for this new arithmetic format. This will include the connections to established IEEE floating point standards and existing HPC benchmarks. The discussion will focus on performance and numerical stability issues that are important for this kind of benchmarking and how the...
Jun 24, 2017 in ISCA (International Symposium on Computer Architecture)
#1 Norman P. Jouppi (Google), H-Index: 65
#2 Cliff Young (Google), H-Index: 26
Last. Doe Hyun Yoon (Google), H-Index: 16
view all 76 authors...
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic exec...
#1 Jack Dongarra (University of Manchester), H-Index: 130
#2 Sven Hammarling (University of Manchester), H-Index: 28
Last. Mawussi Zounon (University of Manchester), H-Index: 9
view all 6 authors...
A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks ...
Jan 1, 2017 in NeurIPS (Neural Information Processing Systems)
#1 Urs Köster (University of California, Berkeley), H-Index: 10
#2 Tristan J. Webb (Warw.: University of Warwick), H-Index: 7
Last. N S Rao, H-Index: 11
view all 14 authors...
Deep neural networks are commonly developed and trained in 32-bit floating point format. Significant gains in performance and energy efficiency could be realized by training and inference in numerical formats optimized for deep learning. Despite advances in limited precision inference in recent years, training of neural networks in low bit-width remains a challenging problem. Here we present the Flexpoint data format, aiming at a complete replacement of 32-bit floating point format training and ...
Nov 13, 2016 in HiPC (IEEE International Conference on High Performance Computing, Data, and Analytics)
#1 Alexander Heinecke (Los Angeles Mission College), H-Index: 22
#2 Greg Henry (Intel), H-Index: 18
Last. Hans Pabst (Intel), H-Index: 9
view all 4 authors...
Many modern highly scalable scientific simulation packages rely on small matrix multiplications as their main computational engine. Math libraries or compilers are unlikely to provide the best possible kernel performance. To address this issue, we present a library which provides high-performance small matrix multiplications targeting all recent x86 vector instruction set extensions up to Intel AVX-512. Our evaluation proves that speed-ups of more than 10x are possible depending on the CPU and...
#1 Nicolas Offermans (KTH: Royal Institute of Technology), H-Index: 5
#2 Oana Marin (Argonne National Laboratory), H-Index: 9
Last. Elia Merzari (Argonne National Laboratory), H-Index: 18
view all 10 authors...
The present work is targeted at performing a strong scaling study of the high-order spectral element fluid dynamics solver Nek5000. Prior studies such as [5] indicated a recommendable metric for strong scalability from a theoretical viewpoint, which we test here extensively on three parallel machines with different performance characteristics and interconnect networks, namely Mira (IBM Blue Gene/Q), Beskow (Cray XC40) and Titan (Cray XK7). The test cases considered for the simulations correspond...
Aug 1, 2015 in HiPC (IEEE International Conference on High Performance Computing, Data, and Analytics)
#5 Alistair Hart (Cray), H-Index: 17
We present a case study of porting NekBone, a skeleton version of the Nek5000 code, to a parallel GPU-accelerated system. Nek5000 is a computational fluid dynamics code based on the spectral element method used for the simulation of incompressible flow. The original NekBone Fortran source code has been used as the base and enhanced by OpenACC directives. The profiling of NekBone provided an assessment of the suitability of the code for GPU systems, and indicated possible kernel optimizations. To...
Cited By 126
Heterogeneous computing systems provide high performance and energy efficiency. However, to optimally utilize such systems, solutions that distribute the work across host CPUs and accelerating devices are needed. In this paper, we present a performance and energy aware approach that combines AI planning heuristics for parameter space exploration with a machine learning model for performance and energy evaluation to determine a near-optimal system configuration. For data-parallel applications our...
#1 Nikoli Dryden (ETH Zurich), H-Index: 9
#2 Roman Böhringer (ETH Zurich), H-Index: 1
Last. Torsten Hoefler (ETH Zurich), H-Index: 55
view all 4 authors...
I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing this I/O bottleneck necessitates careful optimization, as optimal data ingestion pipelines differ between systems, and require a delicate balance between access to local storage, external filesystems, and remote nodes. We introduce NoPFS, a machine learning I/O middleware, which provides a scalable, flexible, an...
#1 Jong Hoon Shin, H-Index: 10
#2 Ali Shafiee, H-Index: 12
Last. Joseph H. Hassoun (Samsung), H-Index: 3
view all 6 authors...
This paper examines the design space trade-offs of DNN accelerators aiming to achieve competitive performance and efficiency metrics for all four combinations of dense or sparse activation/weight tensors. To do so, we systematically examine the overheads of supporting sparsity on top of an optimized dense core. These overheads are modeled based on parameters that indicate how a multiplier can borrow a nonzero operation from the neighboring multipliers or future cycles. As a result of this explo...
Modern graphics processing units (GPUs) are designed and optimized to perform highly parallel numerical calculations. This parallelism has enabled (and promises) significant advantages, both in terms of energy performance and calculation. In this document, we take stock of the different applications of mixed precision. We recall the standards currently used in the overwhelming majority of systems in terms of numerical computation. We show that the mixed precision which decreases the precision at ...
#1 Thomas Grützmacher (KIT: Karlsruhe Institute of Technology), H-Index: 5
#2 Hartwig Anzt (KIT: Karlsruhe Institute of Technology), H-Index: 16
Last. Enrique S. Quintana-Ortí (Polytechnic University of Valencia), H-Index: 36
view all 3 authors...
The roofline model not only provides a powerful tool to relate an application's performance with the specific constraints imposed by the target hardware but also offers a graphic representation of the balance between memory access cost and compute throughput. In this work, we present a strategy to break up the tight coupling between the precision format used for arithmetic operations and the storage format employed for memory operations. (At a high level, this idea is equivalent to compressing/d...
#1 Ismail Emir Yuksel (TOBB University of Economics and Technology), H-Index: 2
#2 Behzad Salami (Barcelona Supercomputing Center), H-Index: 10
Last. Adrián Cristal Kestelman (Barcelona Supercomputing Center), H-Index: 6
view all 5 authors...
On-chip memories (usually based on static RAMs, SRAMs) are crucial components for various computing devices, including heterogeneous devices such as GPUs, FPGAs, and ASICs, to achieve high performance. Modern workloads such as Deep Neural Networks (DNNs) running on these heterogeneous fabrics are highly dependent on the on-chip memory architecture for efficient acceleration. Hence, improving the energy-efficiency of such memories directly leads to an efficient system. One of the common methods to save ene...
#2 Perry Gibson, H-Index: 1
Last. David Kaeli, H-Index: 41
view all 5 authors...
Edge computing devices inherently face tight resource constraints, which is especially apparent when deploying Deep Neural Networks (DNN) with high memory and compute demands. FPGAs are commonly available in edge devices. Since these reconfigurable circuits can achieve higher throughput and lower power consumption than general purpose processors, they are especially well-suited for DNN acceleration. However, existing solutions for designing FPGA-based DNN accelerators for edge devices come with ...
#1 Francesco Rizzi, H-Index: 9
#2 Eric J. Parish, H-Index: 9
Last. John Tencer, H-Index: 6
view all 4 authors...
This work aims to advance computational methods for projection-based reduced order models (ROMs) of linear time-invariant (LTI) dynamical systems. For such systems, current practice relies on ROM formulations expressing the state as a rank-1 tensor (i.e., a vector), leading to computational kernels that are memory bandwidth bound and, therefore, ill-suited for scalable performance on modern many-core and hybrid computing nodes. This weakness can be particularly limiting when tackling many-query ...
#2 Osman Hassan (University of the Sciences), H-Index: 1
Last. Shahid Khan (RWTH Aachen University), H-Index: 3