Original paper
Fast implementation of DGEMM on Fermi GPU
Pages: 1 - 11
Published: Nov 8, 2011
Abstract
In this paper we present a thorough account of tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by performance modeling based on micro-architecture benchmarks. Our optimizations include software pipelining, use of...
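The two-level blocking mentioned in the abstract can be illustrated with a minimal CPU-side sketch in C. This is not the paper's GPU kernel: the outer TILE x TILE loops only stand in for a block staged in shared memory, and the small RB x RB accumulator array stands in for a per-thread register block. The function name `dgemm_blocked` and the sizes `TILE` and `RB` are hypothetical choices made for illustration, and the sketch assumes square row-major matrices whose dimension is a multiple of `TILE`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes for illustration; TILE must be a multiple of RB. */
enum { TILE = 16, RB = 2 };

/* Computes C += A * B for n x n row-major matrices.
 * Assumption: n is a multiple of TILE (checked by the assert below). */
void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    assert(n % TILE == 0);

    /* Outer level: walk TILE x TILE blocks (stand-in for shared-memory tiles). */
    for (size_t i0 = 0; i0 < n; i0 += TILE)
    for (size_t j0 = 0; j0 < n; j0 += TILE)
    for (size_t k0 = 0; k0 < n; k0 += TILE) {

        /* Inner level: each RB x RB patch of C is accumulated in locals
         * (stand-in for a per-thread register block). */
        for (size_t i = i0; i < i0 + TILE; i += RB)
        for (size_t j = j0; j < j0 + TILE; j += RB) {
            double acc[RB][RB] = {{0.0}};

            for (size_t k = k0; k < k0 + TILE; ++k)
                for (size_t r = 0; r < RB; ++r)
                    for (size_t c = 0; c < RB; ++c)
                        acc[r][c] += A[(i + r) * n + k] * B[k * n + (j + c)];

            /* Write the accumulated patch back once per k-block. */
            for (size_t r = 0; r < RB; ++r)
                for (size_t c = 0; c < RB; ++c)
                    C[(i + r) * n + (j + c)] += acc[r][c];
        }
    }
}
```

The point of the structure is data reuse: each element of A and B loaded into the inner patch contributes to RB results, and each tile is reused across TILE iterations before the loops move on. The block sizes here are arbitrary and are not taken from the paper.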