Performance Modeling and Prediction for Dense Linear Algebra

Published on Jun 1, 2017 in arXiv: Performance
Elmar Peise (RWTH Aachen University)
Estimated H-index: 8
Abstract
This dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow one to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.
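The prediction principle described in the abstract can be illustrated with a small sketch. All kernel cost functions and constants below are invented stand-ins, not the dissertation's fitted models: a blocked Cholesky factorization's runtime is estimated by summing per-kernel model evaluations over the algorithm's loop structure, which also yields a block-size recommendation without ever running the factorization.

```python
# Sketch: predict a blocked Cholesky's runtime from per-kernel cost models.
# The closed-form kernel models below are toy stand-ins; in practice they
# would be fitted from micro-benchmarks collected once per platform.

def t_potrf(b):
    # unblocked Cholesky of a b x b diagonal block
    return 1e-9 * b**3 / 3

def t_trsm(m, n):
    # triangular solve of an n x n triangle against an m x n panel
    return 5e-10 * m * n**2

def t_gemm(m, n, k):
    # trailing-matrix update; small k is penalized to mimic reduced
    # kernel efficiency for thin rank-k updates
    return 4e-10 * m * n * k * (1 + 64 / k)

def predict_blocked_cholesky(n, b):
    """Sum kernel-model estimates over the algorithm's loop structure."""
    total = 0.0
    for j in range(0, n, b):
        bj = min(b, n - j)
        rest = n - j - bj
        total += t_potrf(bj)
        if rest > 0:
            total += t_trsm(rest, bj)        # panel solve
            total += t_gemm(rest, rest, bj)  # symmetric update (approximated)
    return total

# Pick a near-optimal block size by evaluating the model, not the algorithm:
best_b = min(range(32, 513, 32), key=lambda b: predict_blocked_cholesky(4096, b))
```

The model trades off kernel efficiency (favoring large blocks) against the cubic cost of the unblocked diagonal factorization (favoring small blocks); evaluating it for all candidate block sizes takes microseconds, versus seconds for the factorizations themselves.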
References (79)
#1Edoardo Di Napoli (Forschungszentrum Jülich)H-Index: 10
#2Elmar Peise (RWTH Aachen University)H-Index: 8
Last. Paolo Bientinesi (RWTH Aachen University)H-Index: 20
view all 4 authors...
Abstract One of the greatest efforts of computational scientists is to translate the mathematical model describing a class of physical phenomena into large and complex codes. Many of these codes face the difficulty of implementing the mathematical operations in the model in terms of low level optimized kernels offering both performance and portability. Legacy codes suffer from the additional curse of rigid design choices based on outdated performance metrics (e.g. minimization of memory footprin...
Nov 13, 2016 in HiPC (IEEE International Conference on High Performance Computing, Data, and Analytics)
#1Jianyu Huang (University of Texas at Austin)H-Index: 7
#2Tyler M. Smith (University of Texas at Austin)H-Index: 6
Last. Robert A. van de Geijn (University of Texas at Austin)H-Index: 45
view all 4 authors...
We dispel with "street wisdom" regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inherently...
#1Rodrigo CanalesH-Index: 1
#2Elmar PeiseH-Index: 8
Last. Paolo BientinesiH-Index: 20
view all 3 authors...
Even though in recent years the scale of statistical analysis problems has increased tremendously, many statistical software tools are still limited to single-node computations. However, statistical analyses are largely based on dense linear algebra operations, which have been deeply studied, optimized and parallelized in the high-performance-computing community. To make high-performance distributed computations available for statistical analysis, and thus enable large scale statistical computat...
Sep 1, 2016 in CLUSTER (International Conference on Cluster Computing)
#1Alexandru Calotoiu (Technische Universität Darmstadt)H-Index: 9
#2David BeckinsaleH-Index: 1
Last. Felix Wolf (Technische Universität Darmstadt)H-Index: 29
view all 7 authors...
Tuning large applications requires a clever exploration of the design and configuration space. Especially on supercomputers, this space is so large that its exhaustive traversal via performance experiments becomes too expensive, if not impossible. Manually creating analytical performance models provides insights into optimization opportunities but is extremely laborious if done for applications of realistic size. If we must consider multiple performance-relevant parameters and their possible int...
#1Tze Meng Low (University of Texas at Austin)H-Index: 12
#2Francisco D. Igual (Complutense University of Madrid)H-Index: 19
view all 4 authors...
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of the matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required for the implementation of the level-3 BLAS while also advancing our understanding of how hierarch...
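The analytical-tuning idea in this entry can be sketched as follows. The cache sizes, the register block mr x nr, and the simplified capacity constraints are illustrative assumptions; the paper's actual derivation also accounts for set associativity and cache replacement behavior.

```python
# Sketch: deriving level-3 BLAS blocking parameters from cache sizes, in the
# spirit of analytical (rather than empirical) tuning of a BLIS-style gemm.

mr, nr = 8, 6                      # assumed micro-kernel register block
bytes_per_elem = 8                 # double precision
L1 = 32 * 1024                     # illustrative per-core cache capacities
L2 = 512 * 1024
L3 = 8 * 1024 * 1024

# kc: an mr x kc sliver of A plus a kc x nr sliver of B should fit in L1,
# leaving room for the mr x nr accumulator.
kc = (L1 // bytes_per_elem - mr * nr) // (mr + nr)

# mc: the packed mc x kc block of A should occupy (most of) L2.
mc = (L2 // bytes_per_elem) // kc

# nc: the packed kc x nc block of B should fit in L3.
nc = (L3 // bytes_per_elem) // kc

print(kc, mc, nc)
```

The point of the paper is precisely that such parameters need not be found by exhaustive empirical search: the layered structure of the algorithm makes them derivable from hardware characteristics.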
#1Field G. Van Zee (University of Texas at Austin)H-Index: 12
#2Tyler M. Smith (University of Texas at Austin)H-Index: 6
Last. Lee Killough (Cray)H-Index: 1
view all 12 authors...
BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show, with very little effort, how the BLIS framework yields sequential and parallel implementations that are compe...
#1Elmar Peise (RWTH Aachen University)H-Index: 8
#2Paolo Bientinesi (RWTH Aachen University)H-Index: 20
To exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve next-to-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning free, and yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, an...
#1Field G. Van Zee (University of Texas at Austin)H-Index: 12
#2Robert A. van de Geijn (University of Texas at Austin)H-Index: 45
The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels. While others have had similar insights, BLIS reduces the necessary kernels to what we believe is the simplest set that still supports...
#1Pedro Alonso (Polytechnic University of Valencia)H-Index: 14
#2Sandra CatalánH-Index: 7
view all 6 authors...
Abstract We present accurate piece-wise models for the time and energy costs of high performance implementations of both the matrix multiplication (gemm) and the triangular system solve with multiple right-hand sides (trsm) on x86 architectures. Our methodology decouples the costs due to the floating-point arithmetic/data movement occurring in the higher levels of the cache hierarchy from those of packing/data transfers between the main memory and the L2/L3 cache. A careful analytical study ...
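The decoupling described here can be sketched as a piece-wise cost model; the coefficients and the panel-repacking term below are illustrative assumptions, not the paper's fitted values.

```python
# Sketch of a piece-wise cost model for a single-core gemm: total time is
# split into an in-cache compute term and separate packing/data-transfer
# terms, each with its own (here invented, in practice fitted) coefficient.

t_flop = 1.0 / 40e9      # seconds per flop once operands are cache-resident
beta_pack_a = 0.8e-9     # s per element: pack A into an L2-resident buffer
beta_pack_b = 1.1e-9     # s per element: stream and pack B from main memory

def t_gemm(m, n, k, nc=4096):
    compute = 2.0 * m * n * k * t_flop           # 2mnk flops
    pack_a = beta_pack_a * m * k * max(1, n // nc)  # A repacked per nc-panel
    pack_b = beta_pack_b * k * n
    return compute + pack_a + pack_b
```

Because the memory-side coefficients are fitted separately from the arithmetic term, the same model form can predict how performance degrades for skinny shapes, where packing costs are not amortized by the flops.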
#1Elmar Peise (RWTH Aachen University)H-Index: 8
#2Paolo Bientinesi (RWTH Aachen University)H-Index: 20
Optimal use of computing resources requires extensive coding, tuning and benchmarking. To boost developer productivity in these time consuming tasks, we introduce the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open source environment for fast yet powerful performance experimentation with dense linear algebra kernels, algorithms, and libraries. ELAPS allows users to construct experiments to investigate how performance and efficiency vary depending on facto...
Cited By (2)
#1Francesco RizziH-Index: 9
#2Eric J. ParishH-Index: 9
Last. John TencerH-Index: 6
view all 4 authors...
This work aims to advance computational methods for projection-based reduced order models (ROMs) of linear time-invariant (LTI) dynamical systems. For such systems, current practice relies on ROM formulations expressing the state as a rank-1 tensor (i.e., a vector), leading to computational kernels that are memory bandwidth bound and, therefore, ill-suited for scalable performance on modern many-core and hybrid computing nodes. This weakness can be particularly limiting when tackling many-query ...
#1Aravind Sankaran (RWTH Aachen University)H-Index: 2
#2Paolo Bientinesi (Umeå University)H-Index: 20
In scientific computing, it is common that a mathematical expression can be computed by many different algorithms (sometimes over hundreds), each identifying a specific sequence of library calls. Although mathematically equivalent, those algorithms might exhibit significant differences in terms of performance. However in practice, due to fluctuations, there is not one algorithm that consistently performs noticeably better than the rest. For this reason, with this work we aim to identify not the ...