-
Testing the Unknown: A Framework for OpenMP Testing via Random Program Generation
Authors:
Ignacio Laguna,
Patrick Chapman,
Konstantinos Parasyris,
Giorgis Georgakoudis,
Cindy Rubio-González
Abstract:
We present a randomized differential testing approach to test OpenMP implementations. In contrast to previous work that manually creates dozens of verification and validation tests, our approach is able to randomly generate thousands of tests, exposing OpenMP implementations to a wide range of program behaviors. We represent the space of possible random OpenMP tests using a grammar and implement our method as an extension of the Varity program generator. By generating 1,800 OpenMP tests, we find various performance anomalies and correctness issues when we apply it to three OpenMP implementations: GCC, Clang, and Intel. We also present several case studies that analyze the anomalies and give more details about the classes of tests that our approach creates.
Submitted 11 October, 2024;
originally announced October 2024.
-
Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs
Authors:
Anwar Hossain Zahid,
Ignacio Laguna,
Wei Le
Abstract:
As scientific codes are ported between GPU platforms, continuous testing is required to ensure numerical robustness and identify numerical differences. Compiler-induced numerical differences occur when a program is compiled and run on different GPUs, and the numerical outcomes are different for the same input. We present a study of compiler-induced numerical differences between NVIDIA and AMD GPUs. Our approach uses Varity to generate thousands of short numerical tests in CUDA and HIP, and their inputs; then, we use differential testing to check if the program produced a numerical inconsistency when run on these GPUs. We also use the HIPIFY tool to convert CUDA tests into HIP and check if there are numerical inconsistencies induced by HIPIFY. We generated more than 600,000 tests and found subtle numerical differences that come from (1) math library calls, (2) differences in floating-point precision (FP64 versus FP32), and (3) converting code with HIPIFY.
Submitted 11 October, 2024;
originally announced October 2024.
-
FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators
Authors:
Xinyi Li,
Ang Li,
Bo Fang,
Katarzyna Swirydowicz,
Ignacio Laguna,
Ganesh Gopalakrishnan
Abstract:
NVIDIA Tensor Cores and AMD Matrix Cores (together called Matrix Accelerators) are of growing interest in high-performance computing and machine learning owing to their high performance. Unfortunately, their numerical behaviors are not publicly documented, including the number of extra precision bits maintained, the accumulation order of addition, and predictable subnormal number handling during computations. This makes it impossible to reliably port codes across these differing accelerators. This paper contributes a collection of Feature Targeted Tests for Numerical Properties that help determine these features across five floating-point formats and four rounding modes, and additional tests that highlight the rounding behaviors and preservation of extra precision bits. To show the practical relevance of FTTN, we design a simple matrix-multiplication test using insights gathered from our feature tests. We executed this very simple test on five platforms, producing different answers: V100, A100, and MI250X produced 0, MI100 produced 255.875, and Hopper H100 produced 191.875. Our matrix multiplication tests employ patterns found in iterative refinement-based algorithms, highlighting the need to check for significant result variability when porting code across GPUs.
Submitted 29 February, 2024;
originally announced March 2024.
-
MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications
Authors:
Bo Fang,
Xinyi Li,
Harvey Dam,
Cheng Tan,
Siva Kumar Sastry Hari,
Timothy Tsai,
Ignacio Laguna,
Dingwen Tao,
Ganesh Gopalakrishnan,
Prashant Nair,
Kevin Barker,
Ang Li
Abstract:
Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet such demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is the support of mixed-precision enabled GEMM. For DNN models, lower-precision FP data formats and computation offer acceptable correctness but significant performance, area, and memory footprint improvements. While promising, the effect of mixed-precision computation on error resilience remains unexplored. To this end, we develop a fault injection framework that systematically injects faults into the mixed-precision computation results. We investigate how the faults affect the accuracy of machine learning applications. Based on the error resilience characteristics, we offer lightweight error detection and correction solutions that significantly improve the overall model accuracy if the models experience hardware faults. The solutions can be efficiently integrated into the accelerator's pipelines.
Submitted 9 November, 2023;
originally announced November 2023.
-
Giving RSEs a Larger Stage through the Better Scientific Software Fellowship
Authors:
William F. Godoy,
Ritu Arora,
Keith Beattie,
David E. Bernholdt,
Sarah E. Bratt,
Daniel S. Katz,
Ignacio Laguna,
Amiya K. Maji,
Addi Malviya Thakur,
Rafael M. Mudafort,
Nitin Sukhija,
Damian Rouson,
Cindy Rubio-González,
Karan Vahi
Abstract:
The Better Scientific Software Fellowship (BSSwF) was launched in 2018 to foster and promote practices, processes, and tools to improve developer productivity and software sustainability of scientific codes. BSSwF's vision is to grow the community with practitioners, leaders, mentors, and consultants to increase the visibility of scientific software production and sustainability. Over the last five years, many fellowship recipients and honorable mentions have identified as research software engineers (RSEs). This paper provides case studies from several of the program's participants to illustrate some of the diverse ways BSSwF has benefited both the RSE and scientific communities. In an environment where the contributions of RSEs are too often undervalued, we believe that programs such as BSSwF can be a valuable means to recognize and encourage community members to step outside of their regular commitments and expand on their work, collaborations and ideas for a larger audience.
Submitted 14 November, 2022; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance
Authors:
Giorgis Georgakoudis,
Luanzheng Guo,
Ignacio Laguna
Abstract:
Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limiting checkpoint retrieval to slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ against the leading MPI fault-tolerance approach, ULFM, implementing global-restart recovery, and against the typical practice of restarting an application, to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6x, or ULFM, up to 3x, and that it scales excellently as the number of MPI processes grows.
Submitted 13 February, 2021;
originally announced February 2021.
-
MATCH: An MPI Fault Tolerance Benchmark Suite
Authors:
Luanzheng Guo,
Giorgis Georgakoudis,
Konstantinos Parasyris,
Ignacio Laguna,
Dong Li
Abstract:
MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques has been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so as to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.
Submitted 13 February, 2021;
originally announced February 2021.
-
Report of the Workshop on Program Synthesis for Scientific Computing
Authors:
Hal Finkel,
Ignacio Laguna
Abstract:
Program synthesis is an active research field in academia, national labs, and industry. Yet, work directly applicable to scientific computing, while having some impressive successes, has been limited. This report reviews the relevant areas of program synthesis work for scientific computing, discusses successes to date, and outlines opportunities for future work. This report is the result of the Workshop on Program Synthesis for Scientific Computing, which was held virtually on August 4-5, 2020 (https://prog-synth-science.github.io/2020/).
Submitted 2 February, 2021;
originally announced February 2021.
-
PARIS: Predicting Application Resilience Using Machine Learning
Authors:
Luanzheng Guo,
Dong Li,
Ignacio Laguna
Abstract:
Extreme-scale scientific applications can be more vulnerable to soft errors (transient faults) as high-performance computing systems increase in scale. The common practice to evaluate the resilience to faults of an application is random fault injection, a method that can be highly time consuming. While resilience prediction modeling has been recently proposed to predict application resilience in a faster way than fault injection, it can only predict a single class of fault manifestation (SDC) and there is no evidence demonstrating that it can work on previously unseen programs, which greatly limits its re-usability. We present PARIS, a resilience prediction method that addresses the problems of existing prediction methods using machine learning. Using carefully selected features and a machine learning model, our method is able to make resilience predictions of three classes of fault manifestations (success, SDC, and interruption) as opposed to one class as in current resilience prediction modeling. The generality of our approach allows us to make predictions on new applications, i.e., previously unseen applications, giving our model broad applicability. Our evaluation on 125 programs shows that PARIS provides high prediction accuracy, 82% and 77% on average for predicting the rate of success and interruption, respectively, while the state-of-the-art resilience prediction model cannot predict them. When predicting the rate of SDC, PARIS provides much better accuracy than the state-of-the-art (38% vs. -273%). PARIS is much faster (up to 450x speedup) than the traditional method (random fault injection).
Submitted 7 December, 2018;
originally announced December 2018.
-
Multi-level analysis of compiler induced variability and performance tradeoffs
Authors:
Michael Bentley,
Ian Briggs,
Ganesh Gopalakrishnan,
Dong H. Ahn,
Ignacio Laguna,
Gregory L. Lee,
Holger E. Jones
Abstract:
Successful HPC software applications are long-lived. When ported across machines and their compilers, these applications often produce different numerical results, many of which are unacceptable. Such variability is also a concern while optimizing the code more aggressively to gain performance. Efficient tools that help locate the program units (files and functions) within which most of the variability occurs are badly needed, both to plan for code ports and to root-cause errors due to variability when they happen in the field. In this work, we offer an enhanced version of the open-source testing framework FLiT to serve these roles. Key new features of FLiT include a suite of bisection algorithms that help locate the root causes of variability. Another added feature allows an analysis of the tradeoffs between performance and the degree of variability. Our new contributions also include a collection of case studies. Results on the MFEM finite-element library include variability/performance tradeoffs, and the identification of a (hitherto unknown) abnormal level of result-variability even under mild compiler optimizations. Results from studying the Laghos proxy application include identifying a significantly divergent floating-point result-variability and successful root-causing down to the problematic function in as few as 14 program executions. Finally, in an evaluation of 4,376 controlled injections of floating-point perturbations on the LULESH proxy application, we showed that the FLiT framework has 100% precision and recall in discovering the file and function locations of the injections, all within an average of only 15 program executions.
Submitted 24 June, 2019; v1 submitted 13 November, 2018;
originally announced November 2018.
-
FlipTracker: Understanding Natural Error Resilience in HPC Applications
Authors:
Luanzheng Guo,
Dong Li,
Ignacio Laguna,
Martin Schulz
Abstract:
As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently, applications running on HPC systems need to exhibit resilience to such errors. Previous work has found that, for certain codes, this resilience can come for free, i.e., some applications are naturally resilient, but few studies have shown the code patterns (combinations or sequences of computations) that make an application naturally resilient. In this paper, we present FlipTracker, a framework designed to extract these patterns using fine-grained tracking of error propagation and resilience properties, and we use it to present a set of computation patterns that are responsible for making representative HPC applications naturally resilient to errors. This not only enables a deeper understanding of resilience properties of these codes, but also can guide future application designs towards patterns with natural resilience.
Submitted 5 September, 2018;
originally announced September 2018.
-
Report of the HPC Correctness Summit, Jan 25-26, 2017, Washington, DC
Authors:
Ganesh Gopalakrishnan,
Paul D. Hovland,
Costin Iancu,
Sriram Krishnamoorthy,
Ignacio Laguna,
Richard A. Lethin,
Koushik Sen,
Stephen F. Siegel,
Armando Solar-Lezama
Abstract:
Maintaining leadership in HPC requires the ability to support simulations at large scales and fidelity. In this study, we detail one of the most significant productivity challenges in achieving this goal, namely the increasing proclivity to bugs, especially in the face of growing hardware and software heterogeneity and sheer system scale. We identify key areas where timely new research must be proactively begun to address these challenges, and create new correctness tools that must ideally play a significant role even while ramping up toward exascale. We close with the proposal for a two-day workshop in which the problems identified in this report can be more broadly discussed, and specific plans to launch these new research thrusts identified.
Submitted 21 May, 2017;
originally announced May 2017.
-
The DESI Experiment Part II: Instrument Design
Authors:
DESI Collaboration,
Amir Aghamousa,
Jessica Aguilar,
Steve Ahlen,
Shadab Alam,
Lori E. Allen,
Carlos Allende Prieto,
James Annis,
Stephen Bailey,
Christophe Balland,
Otger Ballester,
Charles Baltay,
Lucas Beaufore,
Chris Bebek,
Timothy C. Beers,
Eric F. Bell,
José Luis Bernal,
Robert Besuner,
Florian Beutler,
Chris Blake,
Hannes Bleuler,
Michael Blomqvist,
Robert Blum,
Adam S. Bolton,
Cesar Briceno
, et al. (268 additional authors not shown)
Abstract:
DESI (Dark Energy Spectroscopic Instrument) is a Stage IV ground-based dark energy experiment that will study baryon acoustic oscillations and the growth of structure through redshift-space distortions with a wide-area galaxy and quasar redshift survey. The DESI instrument is a robotically-actuated, fiber-fed spectrograph capable of taking up to 5,000 simultaneous spectra over a wavelength range from 360 nm to 980 nm. The fibers feed ten three-arm spectrographs with resolution $R = \lambda/\Delta\lambda$ between 2000 and 5500, depending on wavelength. The DESI instrument will be used to conduct a five-year survey designed to cover 14,000 deg$^2$. This powerful instrument will be installed at prime focus on the 4-m Mayall telescope at Kitt Peak, Arizona, along with a new optical corrector, which will provide a three-degree diameter field of view. The DESI collaboration will also deliver a spectroscopic pipeline and data management system to reduce and archive all data for eventual public use.
Submitted 13 December, 2016; v1 submitted 31 October, 2016;
originally announced November 2016.
-
The DESI Experiment Part I: Science, Targeting, and Survey Design
Authors:
DESI Collaboration,
Amir Aghamousa,
Jessica Aguilar,
Steve Ahlen,
Shadab Alam,
Lori E. Allen,
Carlos Allende Prieto,
James Annis,
Stephen Bailey,
Christophe Balland,
Otger Ballester,
Charles Baltay,
Lucas Beaufore,
Chris Bebek,
Timothy C. Beers,
Eric F. Bell,
José Luis Bernal,
Robert Besuner,
Florian Beutler,
Chris Blake,
Hannes Bleuler,
Michael Blomqvist,
Robert Blum,
Adam S. Bolton,
Cesar Briceno
, et al. (268 additional authors not shown)
Abstract:
DESI (Dark Energy Spectroscopic Instrument) is a Stage IV ground-based dark energy experiment that will study baryon acoustic oscillations (BAO) and the growth of structure through redshift-space distortions with a wide-area galaxy and quasar redshift survey. To trace the underlying dark matter distribution, spectroscopic targets will be selected in four classes from imaging data. We will measure luminous red galaxies up to $z=1.0$. To probe the Universe out to even higher redshift, DESI will target bright [O II] emission line galaxies up to $z=1.7$. Quasars will be targeted both as direct tracers of the underlying dark matter distribution and, at higher redshifts ($2.1 < z < 3.5$), for the Ly-$\alpha$ forest absorption features in their spectra, which will be used to trace the distribution of neutral hydrogen. When moonlight prevents efficient observations of the faint targets of the baseline survey, DESI will conduct a magnitude-limited Bright Galaxy Survey comprising approximately 10 million galaxies with a median $z\approx 0.2$. In total, more than 30 million galaxy and quasar redshifts will be obtained to measure the BAO feature and determine the matter power spectrum, including redshift space distortions.
Submitted 13 December, 2016; v1 submitted 31 October, 2016;
originally announced November 2016.
-
Development of the photomultiplier tube readout system for the first Large-Sized Telescope of the Cherenkov Telescope Array
Authors:
Shu Masuda,
Yusuke Konno,
Juan Abel Barrio,
Oscar Blanch Bigas,
Carlos Delgado,
Lluís Freixas Coromina,
Shuichi Gunji,
Daniela Hadasch,
Kenichiro Hatanaka,
Masahiro Ikeno,
Jose Maria Illa Laguna,
Yusuke Inome,
Kazuma Ishio,
Hideaki Katagiri,
Hidetoshi Kubo,
Gustavo Martínez,
Daniel Mazin,
Daisuke Nakajima,
Takeshi Nakamori,
Hideyuki Ohoka,
Riccardo Paoletti,
Stefan Ritt,
Andrea Rugliancich,
Takayuki Saito,
Karl-Heinz Sulanke
, et al. (9 additional authors not shown)
Abstract:
The Cherenkov Telescope Array (CTA) is the next generation ground-based very high energy gamma-ray observatory. The Large-Sized Telescope (LST) of CTA targets 20 GeV to 1 TeV gamma rays and has 1855 photomultiplier tubes (PMTs) installed in the focal plane camera. With the 23 m mirror dish, the night sky background (NSB) rate amounts to several hundred MHz per pixel. In order to record clean images of gamma-ray showers with minimal NSB contamination, a fast sampling of the signal waveform is required so that the signal integration time can be as short as the Cherenkov light flash duration (a few ns). We have developed a readout board which samples waveforms of seven PMTs per board at a GHz rate. Since a GHz FADC has a high power consumption, leading to large heat dissipation, we adopted the analog memory ASIC "DRS4". The sampler has 1024 capacitors per channel and can sample the waveform at a GHz rate. Four channels of a chip are cascaded to obtain deeper sampling depth with 4096 capacitors. After a trigger is generated in a mezzanine on the board, the waveform stored in the capacitor array is subsequently digitized with a low speed (33 MHz) ADC and transferred via the FPGA-based Gigabit Ethernet to a data acquisition system. Both a low power consumption (2.64 W per channel) and high speed sampling with a bandwidth of >300 MHz have been achieved. In addition, in order to increase the dynamic range of the readout we adopted a two-gain system covering 0.2 to 2000 photoelectrons in total. We finalized the board design for the first LST and proceeded to mass production. The performance of the produced boards is being checked with a series of quality control (QC) tests. We report the readout board specifications and QC results.
Submitted 1 September, 2015;
originally announced September 2015.