NVIDIA Tensor Core Programmability, Performance &amp; Precision

Stefano Markidis; Steven W. D. Chien; Erwin Laure; Ivy Bo Peng; Jeffrey S. Vetter

doi:https://doi.org/10.1109/ipdpsw.2018.00091

doi.org/10.1109/ipdpsw.2018.00091

Original paper

NVIDIA Tensor Core Programmability, Performance & Precision

,

,

..., Jeffrey S. Vetter

44

Pages: 522 - 531

Published: May 1, 2018

Abstract

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called Tensor Core that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performances and...

Paper Fields

Paper Details

Title

NVIDIA Tensor Core Programmability, Performance & Precision

DOI

doi.org/10.1109/ipdpsw.2018.00091

Published Date

May 1, 2018

Pages

522 - 531

Notes

To use the Note feature, you need to be logged in. Please

History

NVIDIA Tensor Core Programmability, Performance &amp; Precision

NVIDIA Tensor Core Programmability, Performance & Precision