NVIDIA Tensor Core Programmability, Performance & Precision

Pages: 522 - 531
Published: May 21, 2018
Abstract
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called Tensor Core that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and...
Paper Details
Title
NVIDIA Tensor Core Programmability, Performance & Precision
Published Date
May 21, 2018
Journal
Pages
522 - 531
Citation AnalysisPro
  • Scinapse’s Top 10 Citation Journals & Affiliations graph reveals the quality and authenticity of citations received by a paper.
  • Discover whether citations have been inflated due to self-citations, or if citations include institutional bias.