A coordinated tiling and batching framework for efficient GEMM on GPUs

Published on Feb 16, 2019 in PPoPP (ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
DOI: 10.1145/3293883.3295734
Xiuhong Li (PKU: Peking University, H-index: 7), Yun Liang (PKU: Peking University, H-index: 32), + 2 authors, Yinghan Li (SenseTime, H-index: 1)
Abstract
General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash their computing power only when the matrix size is large and there are a sufficient number of tiles with enough workload per tile. However, in many real-world applications, especially in the deep learning domain, the matrix size is small. To this end, prior work proposes batched GEMM, which processes a group of small independent GEMMs together in a single CUDA kernel. However, current support for batched GEMM is still rudimentary. Tiling and batching are tightly correlated. A large tile size increases data reuse, but it decreases thread-level parallelism, which in turn shrinks the optimization space for batching. A small tile size increases thread-level parallelism and thus enlarges the optimization space for batching, but at the cost of sacrificing data reuse. In this paper, we propose a coordinated tiling and batching framework for accelerating GEMMs on GPUs. It is a two-phase framework consisting of a tiling engine and a batching engine that together perform efficient batched GEMM on GPUs. The tiling engine partitions the GEMMs into independent tiles, and the batching engine assigns the tiles to thread blocks. Moreover, we propose a general programming interface for the coordinated tiling and batching solution. Finally, experimental results on synthetic batched GEMM cases show that our framework achieves about 1.40X speedup on average over the state-of-the-art technique. In a real-world case study on GoogleNet, our framework achieves a 1.23X speedup.
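The two-phase design described in the abstract can be illustrated with a small NumPy sketch (a toy model with an assumed tile size, not the paper's actual CUDA kernel): a tiling step enumerates independent output tiles from every GEMM in the batch, and a batching step treats the flattened tile list as the unit of scheduling, the way a single kernel would assign one tile per thread block.

```python
import numpy as np

def batched_gemm_tiled(As, Bs, tile=2):
    """Toy model of coordinated tiling + batching.

    Tiling engine: split each GEMM in the batch into independent output
    tiles. Batching engine: put the tiles of all GEMMs into one flat
    work list, mimicking how a single CUDA kernel would assign one tile
    per thread block.
    """
    m, k = As[0].shape
    _, n = Bs[0].shape
    Cs = [np.zeros((m, n)) for _ in As]

    # Tiling engine: enumerate (batch index, row tile, col tile) work items.
    work = [(b, i, j)
            for b in range(len(As))
            for i in range(0, m, tile)
            for j in range(0, n, tile)]

    # Batching engine: every work item is independent, like a thread block.
    for b, i, j in work:
        Cs[b][i:i+tile, j:j+tile] = As[b][i:i+tile, :] @ Bs[b][:, j:j+tile]
    return Cs

rng = np.random.default_rng(0)
As = [rng.standard_normal((4, 3)) for _ in range(8)]
Bs = [rng.standard_normal((3, 4)) for _ in range(8)]
Cs = batched_gemm_tiled(As, Bs, tile=2)
assert all(np.allclose(C, A @ B) for A, B, C in zip(As, Bs, Cs))
```

Shrinking `tile` grows the work list (more parallelism for the batching step to exploit), but each small tile re-reads its rows of A and columns of B (less data reuse); this is the trade-off the framework coordinates.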
References (28)
#1Xiuhong Li (PKU: Peking University)H-Index: 7
#2Yun Liang (PKU: Peking University)H-Index: 32
Last. Ming Jiang (PKU: Peking University)H-Index: 23
view all 7 authors...
Low-dose X-ray computed tomography (XCT) is a popular imaging technique for visualizing the inside structure of an object non-destructively. The Model-Based Iterative Reconstruction (MBIR) method can reconstruct high-quality images, but at the cost of large computational demands. Therefore, MBIR often resorts to platforms with hardware accelerators such as GPUs to speed up the reconstruction process. For MBIR, the reconstruction process is to minimize an objective function by updating the image iteratively...
Source
#1Xiaolong Xie (PKU: Peking University)H-Index: 7
#2Yun Liang (PKU: Peking University)H-Index: 32
Last. Dongrui Fan (CAS: Chinese Academy of Sciences)H-Index: 17
view all 7 authors...
The key to high performance on GPUs lies in massive multithreading, which enables fast thread switching to hide long latencies. GPUs are equipped with a large register file to enable fast context switches. However, thread throttling techniques, designed to mitigate cache contention, lead to under-utilization of registers. Register allocation is a significant factor for performance, as it not only determines single-thread performance but also indirectly affects thread-level parallelism (TLP). In this paper, we propose C...
Source
Feb 10, 2018 in PPoPP (ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
#1Prashant Singh Rawat (OSU: Ohio State University)H-Index: 7
#2Fabrice RastelloH-Index: 16
Last. P. Sadayappan (OSU: Ohio State University)H-Index: 71
view all 6 authors...
The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are...
Source
Nov 13, 2017 in ICCAD (International Conference on Computer Aided Design)
#1Yun Liang (PKU: Peking University)H-Index: 32
#2Xiuhong Li (PKU: Peking University)H-Index: 7
Last. Xiaolong Xie (PKU: Peking University)H-Index: 7
view all 3 authors...
Graphics Processing Unit (GPU) computing has become ubiquitous in embedded systems, as evidenced by its wide adoption for various general-purpose applications. As more and more applications are accelerated by GPUs, multi-tasking scenarios start to emerge. Multi-tasking allows multiple applications to execute simultaneously on the same GPU and share its resources. This brings new challenges due to contention among the different applications for shared resources such as caches. However, the ...
Source
Oct 14, 2017 in MICRO (International Symposium on Microarchitecture)
#1Zhen Zheng (THU: Tsinghua University)H-Index: 5
#2Chanyoung Oh (SNU: Seoul National University)H-Index: 4
Last. Wenguang Chen (THU: Tsinghua University)H-Index: 27
view all 6 authors...
Pipelining is an important programming pattern, yet GPUs, designed mostly for data-level parallel execution, lack an efficient mechanism to support pipeline programming and execution. This paper provides a systematic examination of various existing pipeline execution models on GPUs, analyzing their strengths and weaknesses. To address their shortcomings, the paper then proposes three new execution models with much-improved controllability, including a hybrid model that is capable o...
Source
Jun 14, 2017 in ICS (International Conference on Supercomputing)
#1Ahmad Abdelfattah (UT: University of Tennessee)H-Index: 13
#2Azzam Haidar (UT: University of Tennessee)H-Index: 22
Last. Jack Dongarra (UT: University of Tennessee)H-Index: 130
view all 4 authors...
This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations, and autotuning to exceed the performance of proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is significantly important for w...
Source
Jun 14, 2017 in ICS (International Conference on Supercomputing)
#1Keren ZhouH-Index: 4
#2Guangming TanH-Index: 16
Last. Ninghui SunH-Index: 26
view all 5 authors...
GPUs are widely used to accelerate deep neural networks (DNNs) thanks to their high bandwidth and parallelism. But tuning the performance of DNN computations is challenging, as it requires a thorough understanding of both the underlying architectures and the algorithm implementations. Traditional research, which analyzes performance at the level of CUDA C code or PTX instructions, has not tied hardware features tightly to source code. In this paper, we present a performance analysis framework at th...
Source
#1Yun Liang (PKU: Peking University)H-Index: 32
#2Xiuhong Li (PKU: Peking University)H-Index: 7
Graphics Processing Units (GPUs) have been widely adopted as accelerators for compute-intensive applications due to their tremendous computational power and high memory bandwidth. As the complexity of applications continues to grow, each new generation of GPUs has been equipped with advanced architectural features and more resources to sustain its performance acceleration capability. Recent GPUs feature concurrent kernel execution, which is designed to improve resource utilizat...
Source
Apr 4, 2017 in ASPLOS (Architectural Support for Programming Languages and Operating Systems)
#1Ang Li (PNNL: Pacific Northwest National Laboratory)H-Index: 17
#2Shuaiwen Leon Song (PNNL: Pacific Northwest National Laboratory)H-Index: 21
Last. Henk Corporaal (TU/e: Eindhoven University of Technology)H-Index: 41
view all 6 authors...
Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly-shared L2 with long access latency; while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has been long ignored bu...
Source
Jan 26, 2017 in PPoPP (ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
#1Xiuxia Zhang (CAS: Chinese Academy of Sciences)H-Index: 3
#2Guangming Tan (CAS: Chinese Academy of Sciences)H-Index: 16
Last. Mingyu Chen (CAS: Chinese Academy of Sciences)H-Index: 19
view all 6 authors...
In this paper, we present a methodology to understand GPU microarchitectural features and improve performance for compute-intensive kernels. The methodology relies on a reverse engineering approach to crack the GPU ISA encodings in order to build a GPU assembler. An assembly microbenchmark suite correlates microarchitectural features with their performance factors to uncover instruction-level and memory hierarchy preferences. We use SGEMM as a running example to show the ways to achieve bare-met...
Source
Cited By (24)
There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The efficiency of these accelerators comes from employing optimized dataflow (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) strategies to optimize data reuse. The focus of this work is to evaluate these accelerator architectures ...
Source
The depthwise separable convolution is common in convolutional neural networks (CNNs) and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch-size model training and for the typical model-inference scenario where the model takes...
Source
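The computation saving behind depthwise separable convolutions can be shown with a minimal NumPy sketch (stride 1, no padding, channel-first layout; all names here are illustrative): a depthwise step filters each channel independently with its own spatial kernel, and a pointwise 1x1 step, which is just a small GEMM over channels, mixes the results.

```python
import numpy as np

def depthwise_separable_conv(x, dw, pw):
    """Minimal depthwise-separable convolution (stride 1, 'valid').

    x:  (C, H, W)    input feature map
    dw: (C, kh, kw)  one spatial filter per input channel (depthwise step)
    pw: (Cout, C)    1x1 pointwise filters mixing channels
    """
    C, H, W = x.shape
    _, kh, kw = dw.shape
    Ho, Wo = H - kh + 1, W - kw + 1

    # Depthwise: each channel is convolved independently with its own filter.
    mid = np.zeros((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                mid[c, i, j] = np.sum(x[c, i:i+kh, j:j+kw] * dw[c])

    # Pointwise: a 1x1 convolution is a small GEMM over the channel axis.
    return np.tensordot(pw, mid, axes=([1], [0]))  # (Cout, Ho, Wo)
```

Per output pixel, a standard convolution costs about Cout*C*kh*kw multiplies, while the separable form costs C*kh*kw + Cout*C, a reduction factor of roughly 1/Cout + 1/(kh*kw); the factorization equals a standard convolution whose kernel is the outer product of `pw` and `dw`.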
#1Weiling Yang (National University of Defense Technology)H-Index: 1
#2Jianbin Fang (National University of Defense Technology)H-Index: 13
Last. Zheng Wang (University of Leeds)H-Index: 33
view all 5 authors...
General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing. While mainstream linear algebra libraries deliver high performance on large, regular-shaped GEMMs, they are inadequate for optimizing small, irregular-shaped GEMMs, which are common in new HPC applications. Some recent works in this direction have made promising progress on x86 architectures and GPUs but still leave much room for improvement on emerging HPC hardware built upon the AR...
Source
#1Pratik Fegade (CMU: Carnegie Mellon University)H-Index: 3
#2Tianqi ChenH-Index: 29
Last. Todd C. MowryH-Index: 54
view all 4 authors...
There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks generally use techniques such as padding and masking to make the data shapes uniform and then offload the computations to optimized kernels for dense tensor algebra. Such techniques can, however, l...
#1Francesco RizziH-Index: 9
#2Eric J. ParishH-Index: 9
Last. John TencerH-Index: 6
view all 4 authors...
This work aims to advance computational methods for projection-based reduced order models (ROMs) of linear time-invariant (LTI) dynamical systems. For such systems, current practice relies on ROM formulations expressing the state as a rank-1 tensor (i.e., a vector), leading to computational kernels that are memory bandwidth bound and, therefore, ill-suited for scalable performance on modern many-core and hybrid computing nodes. This weakness can be particularly limiting when tackling many-query ...
Source
#1Ruimin Wang (SCUT: South China University of Technology)
#2Zhiwei Yang (SCUT: South China University of Technology)
Last. Lu Lu (SCUT: South China University of Technology)
view all 4 authors...
In the past few decades, general matrix multiplication (GEMM), as the basic component of the Basic Linear Algebra Subprograms (BLAS) library, has played a vital role in various fields such as machine learning, image processing, and fluid dynamics. Because these fields tend to deconstruct the problem into multiple smaller sub-problems, today's BLAS libraries have implemented batched GEMM routines to achieve high performance in this scenario. MAGMA proposes a vbatch routine to calculate batched GE...
Source
#1Cheolsun Lim (Hansung University)
#2Myungsun Kim (Hansung University)H-Index: 1
On-device DNN processing has become a common interest in the field of autonomous driving research. For better accuracy, both the number of DNN models and their complexity have increased. To respond to this properly, hardware platforms structured with multicore CPUs and DNN accelerators have been released, and the GPU is generally used as the accelerator. When multiple DNN workloads are sporadically requested, the GPU can easily be oversubscribed, leading to an unexpected perfo...
Source
Loop tiling is a key high-level transformation which is known to maximize locality in loop-intensive programs. It has been successfully applied to a number of applications including tensor contractions, iterative stencils, and machine learning. This technique has also been extended to a wide variety of computational domains and architectures. The performance achieved with this critical transformation largely depends on a set of given inputs, the tile sizes, due to the complex trade-off between lo...
Source
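The tile-size trade-off that the excerpt above refers to can be seen in a minimal loop-tiled matrix multiplication (a Python sketch with an illustrative tile size `T`; real autotuners search over `T` per architecture):

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Loop-tiled matrix multiplication.

    The i/j/k loops are blocked by tile size T so each T x T block of
    A, B, and C is reused while still hot in cache; slicing clips the
    ragged edges, so dimensions need not be multiples of T. Picking T
    is the tuning problem: too small wastes reuse, too large overflows
    the cache.
    """
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for ii in range(0, m, T):
        for jj in range(0, n, T):
            for kk in range(0, k, T):
                C[ii:ii+T, jj:jj+T] += A[ii:ii+T, kk:kk+T] @ B[kk:kk+T, jj:jj+T]
    return C

A = np.arange(20.0).reshape(5, 4)
B = np.arange(12.0).reshape(4, 3)
assert np.allclose(tiled_matmul(A, B, T=2), A @ B)
```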
#1Hadia Ahmed (LBNL: Lawrence Berkeley National Laboratory)H-Index: 6
#2David B. Williams-Young (LBNL: Lawrence Berkeley National Laboratory)H-Index: 13
Last. Chao Yang (LBNL: Lawrence Berkeley National Laboratory)H-Index: 27
view all 4 authors...
Tuning scientific code for heterogeneous computing architectures is a growing challenge. Not only do we need to tune the code for multiple architectures, but we also need to select or schedule computations on the most efficient compute variant. In this paper, we explore the tuning and performance-modeling question for one of the most time-consuming kernels in density functional theory calculations on systems with a multicore host CPU accelerated with GPUs. We show that the problem configuration dictates...
Source