A compute-bound formulation of Galerkin model reduction for linear time-invariant dynamical systems

Published on Oct 1, 2021 in arXiv: Computational Physics
DOI: 10.1016/j.cma.2021.113973
Francesco Rizzi · Eric J. Parish · John Tencer · + 1 author
This work aims to advance computational methods for projection-based reduced order models (ROMs) of linear time-invariant (LTI) dynamical systems. For such systems, current practice relies on ROM formulations that express the state as a rank-1 tensor (i.e., a vector), leading to computational kernels that are memory-bandwidth bound and therefore ill-suited for scalable performance on modern many-core and hybrid computing nodes. This weakness can be particularly limiting when tackling many-query studies, where one needs to run a large number of simulations. This work introduces a reformulation of the Galerkin ROM for LTI dynamical systems, called rank-2 Galerkin, that converts the ROM problem from memory-bandwidth bound to compute bound. We present the details of the formulation and its implementation, and demonstrate its utility through numerical experiments using, as a test case, the simulation of elastic seismic shear waves in an axisymmetric domain. We quantify and analyze performance and scaling results for varying numbers of threads and problem sizes. Finally, we present an end-to-end demonstration of using the rank-2 Galerkin ROM for a Monte Carlo sampling study. We show that the rank-2 Galerkin ROM is one order of magnitude more efficient than the rank-1 Galerkin ROM (the current practice) and about 970X more efficient than the full order model, while maintaining excellent accuracy in both the mean and statistics of the field.
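The core idea can be sketched as follows. In a rank-1 formulation, each query (e.g., each Monte Carlo sample) advances its own reduced state vector with a matrix-vector product per time step; stacking the sample states as columns of a single matrix instead lets one matrix-matrix product (GEMM) advance all samples at once, trading a memory-bound kernel for a compute-bound one. This is a minimal NumPy sketch, not the paper's implementation: the reduced operator `A_hat`, the sizes, and the forward-Euler stepping are illustrative assumptions.

```python
import numpy as np

k, m, steps = 64, 128, 200        # reduced dim, number of samples, time steps (assumed sizes)
dt = 1e-3

rng = np.random.default_rng(0)
A_hat = rng.standard_normal((k, k)) / k   # stand-in for the Galerkin-projected LTI operator
X0 = rng.standard_normal((k, m))          # one reduced initial state per sample, as columns

def rank1_step(A, X, dt):
    # Rank-1 style: each sample advanced independently via its own
    # matrix-vector product (m matvecs per step, memory-bandwidth bound).
    out = np.empty_like(X)
    for j in range(X.shape[1]):
        out[:, j] = X[:, j] + dt * (A @ X[:, j])
    return out

def rank2_step(A, X, dt):
    # Rank-2 style: all m samples advanced with a single GEMM per step
    # (higher arithmetic intensity, compute bound).
    return X + dt * (A @ X)

X1, X2 = X0.copy(), X0.copy()
for _ in range(steps):
    X1 = rank1_step(A_hat, X1, dt)
    X2 = rank2_step(A_hat, X2, dt)

# Both formulations produce the same trajectories; only the kernel differs.
assert np.allclose(X1, X2)

# Monte Carlo statistics then come directly from the column ensemble:
mean_state = X2.mean(axis=1)
```

The speedup in practice comes from the GEMM's arithmetic intensity: the operator is loaded from memory once per step and reused across all m columns, whereas the looped matvec reloads it m times.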