Original paper
NVIDIA Tensor Core Programmability, Performance & Precision
Pages: 522 - 531
Published: May 1, 2018
Abstract
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called a Tensor Core, that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflop/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performance and...
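The operation described in the abstract, and the arithmetic behind the quoted peak figure, can be sketched in a few lines of Python. This is only an illustration, not the authors' code or NVIDIA's API. The helper names (to_half, to_single, mma_4x4, peak_tflops) are hypothetical. The rounding model (inputs rounded to half precision, every accumulation step rounded to single precision) is a simplifying assumption about what "mixed precision" means here. The 1.53 GHz clock is an assumed boost clock that the abstract does not state.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as nested lists.

    Assumption: A and B are rounded to half precision, and the running sum
    is rounded to single precision after every step. Real hardware may
    round differently.
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_single(c[i][j])
            for k in range(n):
                acc = to_single(acc + to_half(a[i][k]) * to_half(b[k][j]))
            d[i][j] = acc
    return d


def peak_tflops(tensor_cores: int, clock_hz: float, fma_per_clock: int = 64) -> float:
    """Theoretical peak in Tflop/s, counting each fused multiply-add as 2 flops.

    A 4x4 by 4x4 product needs 4 * 4 * 4 = 64 multiply-adds, hence the default.
    """
    return tensor_cores * fma_per_clock * 2 * clock_hz / 1e12


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]

d = mma_4x4(identity, ones, ones)   # I x ones + ones: every entry is 2.0
peak = peak_tflops(640, 1.53e9)     # 1.53 GHz is an assumed clock, not from the abstract
print(d[0], round(peak, 1))
```

Under the assumed clock, 640 Tensor Cores each completing 64 multiply-adds per cycle give roughly 125 trillion floating-point operations per second, which is consistent with the peak quoted in the abstract.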