-
URSA: The Universal Research and Scientific Agent
Authors:
Michael Grosskopf,
Russell Bent,
Rahul Somasundaram,
Isaac Michaud,
Arthur Lui,
Nathan DeBardeleben,
Earl Lawrence
Abstract:
Large language models (LLMs) have moved far beyond their initial form as simple chatbots, now carrying out complex reasoning, planning, writing, coding, and research tasks. These skills overlap significantly with those that human scientists use day-to-day to solve complex problems that drive the cutting edge of research. Using LLMs in "agentic" AI has the potential to revolutionize modern science and remove bottlenecks to progress. In this work, we present URSA, a scientific agent ecosystem for accelerating research tasks. URSA consists of a set of modular agents and tools, including coupling to advanced physics simulation codes, that can be combined to address scientific problems of varied complexity and impact. This work highlights the architecture of URSA, as well as examples that highlight the potential of the system.
Submitted 27 June, 2025;
originally announced June 2025.
-
Resiliency in Numerical Algorithm Design for Extreme Scale Simulations
Authors:
Emmanuel Agullo,
Mirco Altenbernd,
Hartwig Anzt,
Leonardo Bautista-Gomez,
Tommaso Benacchio,
Luca Bonaventura,
Hans-Joachim Bungartz,
Sanjay Chatterjee,
Florina M. Ciorba,
Nathan DeBardeleben,
Daniel Drzisga,
Sebastian Eibl,
Christian Engelmann,
Wilfried N. Gansterer,
Luc Giraud,
Dominik Goeddeke,
Marco Heisig,
Fabienne Jezequel,
Nils Kohl,
Xiaoye Sherry Li,
Romain Lion,
Miriam Mehl,
Paul Mycek,
Michael Obersteiner,
Enrique S. Quintana-Orti,
et al. (11 additional authors not shown)
Abstract:
This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations", held March 1-6, 2020 at Schloss Dagstuhl, which was attended by all the authors.
Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated.
More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors.
The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge.
Submitted 26 October, 2020;
originally announced October 2020.
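The scaling argument in the abstract above (checkpoint cost versus mean time between failures) can be illustrated with a small calculation. This is a minimal sketch and is not taken from the report: it assumes Young's first-order approximation for the checkpoint interval, which is only reasonable when the checkpoint cost is much smaller than the MTBF. The function names and all numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation: interval ~ sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_overhead_fraction(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints at that interval."""
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / (tau + checkpoint_cost_s)


def can_make_progress(recovery_time_s: float, mtbf_s: float) -> bool:
    """Crude criterion (an assumption): recovery must fit inside one MTBF."""
    return mtbf_s > recovery_time_s


if __name__ == "__main__":
    # Hypothetical scenario: a 10-minute checkpoint, MTBF shrinking with scale.
    for mtbf_hours in (24.0, 1.0, 0.1):
        mtbf_s = mtbf_hours * 3600.0
        print(
            f"MTBF {mtbf_hours:5.1f} h: "
            f"interval {young_interval(600.0, mtbf_s):8.1f} s, "
            f"overhead {checkpoint_overhead_fraction(600.0, mtbf_s):.2f}, "
            f"progress possible: {can_make_progress(600.0, mtbf_s)}"
        )
```

As the assumed MTBF approaches the checkpoint or recovery cost, the overhead fraction grows and the crude progress criterion eventually fails, which is the regime the abstract warns about.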
-
TensorFI: A Flexible Fault Injection Framework for TensorFlow Applications
Authors:
Zitao Chen,
Niranjhana Narayanan,
Bo Fang,
Guanpeng Li,
Karthik Pattabiraman,
Nathan DeBardeleben
Abstract:
As machine learning (ML) has seen increasing adoption in safety-critical domains (e.g., autonomous vehicles), the reliability of ML systems has also grown in importance. While prior studies have proposed techniques to enable efficient error-resilience techniques (e.g., selective instruction duplication), a fundamental requirement for realizing these techniques is a detailed understanding of the application's resilience.
In this work, we present TensorFI, a high-level fault injection (FI) framework for TensorFlow-based applications. TensorFI is able to inject both hardware and software faults in general TensorFlow programs. TensorFI is a configurable FI tool that is flexible, easy to use, and portable. It can be integrated into existing TensorFlow programs to assess their resilience for different fault types (e.g., faults in particular operators). We use TensorFI to evaluate the resilience of 12 ML programs, including DNNs used in the autonomous vehicle domain. Our tool is publicly available at https://github.com/DependableSystemsLab/TensorFI.
Submitted 3 April, 2020;
originally announced April 2020.
-
Failure Analysis and Quantification for Contemporary and Future Supercomputers
Authors:
Li Tan,
Nathan DeBardeleben
Abstract:
Large-scale computing systems today are assembled from numerous computing units to provide the massive computational capability needed to solve problems at scale, which makes failures common events in supercomputing scenarios. Considering the demanding resilience requirements of supercomputers today, we present a quantitative study on fine-grained failure modeling for contemporary and future large-scale computing systems. We integrate various types of failures from different system hierarchical levels and system components, and summarize the overall system failure rates formally. Given that the system-wide failure rate must nowadays be capped below a threshold value for reliability and cost-efficiency purposes, we quantitatively discuss different scenarios of system resilience, and analyze the impact of resilience to different error types on the variation of system failure rates and on the correlation of hierarchical failure rates. Moreover, we formalize and showcase the resilience efficiency of failure-bounded supercomputers today.
Submitted 5 November, 2019;
originally announced November 2019.
-
Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications
Authors:
Li Tan,
Marc Charest,
Nathan DeBardeleben,
Qiang Guan,
Ben Bergen
Abstract:
The persistently growing resilience concerns of large-scale computing systems today require not only generic fault tolerance approaches, but also application-level resilience, due to demanding efficiency and various domain-specific requirements. Scientific applications within a particular domain generally comply with domain conservation laws, which can be leveraged as an error detection criterion to study the resilience of this domain of applications sharing similar program characteristics. However, it is challenging to achieve application resilience: (a) how to identify the invariants of a given domain of applications, knowing the conservation laws, and (b) how to utilize the invariants to efficiently detect and recover from failures in application runs.
In this work, we target several continuum dynamics software packages, FleCSALE [1] and CODY [2] (with intrinsic invariants during computation), study their resilience to soft errors online (injected using an open-source fault injector), and investigate the opportunities for non-intrusive and lightweight failure recovery (checksum-based invariant checking). We propose a checksum-retry approach to achieve our goals, and experimental results on a virtualized platform with extensive fault injection campaigns demonstrate the effectiveness and efficiency of the proposed approach.
Submitted 5 November, 2019;
originally announced November 2019.
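The checksum-retry idea described in the abstract above can be sketched on a toy problem. This is an illustrative assumption, not the authors' implementation, and it does not use FleCSALE, CODY, or the fault injector from the paper: the "application" is a one-dimensional periodic diffusion step whose total sum is conserved, the soft error is a single bit flip in one value, and all function names are hypothetical.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE-754 double representation flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def diffuse(u, alpha=0.1):
    """One periodic diffusion step; the update conserves sum(u) up to rounding."""
    n = len(u)
    return [u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]) for i in range(n)]


def step_with_retry(u, fault=None, tol=1e-9, max_retries=3):
    """Advance one step; recompute from the saved state if the checksum drifts.

    `fault` is an optional (index, bit) pair injected on the first attempt only,
    modelling a transient soft error. Returns (new_state, retries_used).
    """
    expected = sum(u)  # conserved quantity used as the checksum
    for attempt in range(max_retries + 1):
        new_u = diffuse(u)
        if fault is not None and attempt == 0:
            index, bit = fault
            new_u[index] = flip_bit(new_u[index], bit)
        if abs(sum(new_u) - expected) <= tol * max(1.0, abs(expected)):
            return new_u, attempt
    raise RuntimeError("invariant still violated after retries")


if __name__ == "__main__":
    state = [1.0, 2.0, 3.0, 4.0]
    _, retries = step_with_retry(state, fault=(1, 51))
    print("retries needed:", retries)
```

In this sketch a flip in a high-order bit changes the sum enough to trigger a retry, while a flip in a low-order mantissa bit can fall below the tolerance and pass undetected, so the tolerance sets the trade-off between false alarms from rounding and missed errors.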
-
Characterization and Comparison of Application Resilience for Serial and Parallel Executions
Authors:
Kai Wu,
Qiang Guan,
Nathan DeBardeleben,
Dong Li
Abstract:
Soft errors in exascale applications are a challenging problem in modern HPC. To quantify an application's resilience and vulnerability, HPC users widely adopt application-level fault injection. However, this is not easy, since users need to inject a large number of faults to ensure statistical significance, especially for parallel versions of a program. Normally, parallel execution is more complex and requires more hardware resources than serial execution. It is therefore essential to be able to predict the error rate of a parallel application from its corresponding serial version. In this poster, we characterize fault patterns in serial and parallel executions. We find, first, that serial and parallel executions share common fault sources. Second, parallel execution also has some unique fault sources compared with serial execution. Those unique fault sources are important for understanding the difference in fault patterns between serial and parallel executions.
Submitted 3 August, 2018;
originally announced August 2018.