
Showing 1–6 of 6 results for author: DeBardeleben, N

  1. arXiv:2506.22653  [pdf, ps, other]

    cs.AI

    URSA: The Universal Research and Scientific Agent

    Authors: Michael Grosskopf, Russell Bent, Rahul Somasundaram, Isaac Michaud, Arthur Lui, Nathan Debardeleben, Earl Lawrence

    Abstract: Large language models (LLMs) have moved far beyond their initial form as simple chatbots, now carrying out complex reasoning, planning, writing, coding, and research tasks. These skills overlap significantly with those that human scientists use day-to-day to solve complex problems that drive the cutting edge of research. Using LLMs in "agentic" AI has the potential to revolutionize modern science…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: 31 pages, 9 figures

  2. arXiv:2010.13342  [pdf, other]

    cs.DC

    Resiliency in Numerical Algorithm Design for Extreme Scale Simulations

    Authors: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik Goeddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Orti, et al. (11 additional authors not shown)

    Abstract: This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by all the authors. Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to backgr…

    Submitted 26 October, 2020; originally announced October 2020.

    Comments: 45 pages, 3 figures, submitted to The International Journal of High Performance Computing Applications

    ACM Class: D.4.5; G.4; G.1; D.4.4

  3. arXiv:2004.01743  [pdf, other]

    cs.DC cs.CV cs.LG stat.ML

    TensorFI: A Flexible Fault Injection Framework for TensorFlow Applications

    Authors: Zitao Chen, Niranjhana Narayanan, Bo Fang, Guanpeng Li, Karthik Pattabiraman, Nathan DeBardeleben

    Abstract: As machine learning (ML) has seen increasing adoption in safety-critical domains (e.g., autonomous vehicles), the reliability of ML systems has also grown in importance. While prior studies have proposed techniques to enable efficient error-resilience techniques (e.g., selective instruction duplication), a fundamental requirement for realizing these techniques is a detailed understanding of the ap…

    Submitted 3 April, 2020; originally announced April 2020.

    Comments: A preliminary version of this work was published in a workshop

  4. arXiv:1911.02118  [pdf, ps, other]

    cs.DC

    Failure Analysis and Quantification for Contemporary and Future Supercomputers

    Authors: Li Tan, Nathan DeBardeleben

    Abstract: Large-scale computing systems today are assembled from numerous computing units to provide the massive computational capability needed to solve problems at scale, which makes failures common events in supercomputing scenarios. Considering the demanding resilience requirements of supercomputers today, we present a quantitative study on fine-grained failure modeling for contemporary and future large-scale comp…

    Submitted 5 November, 2019; originally announced November 2019.

    Comments: 20 pages

    MSC Class: 68M15; 68M20; 68N20

  5. arXiv:1911.02114  [pdf, ps, other]

    cs.DC

    Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications

    Authors: Li Tan, Marc Charest, Nathan DeBardeleben, Qiang Guan, Ben Bergen

    Abstract: The persistently growing resilience concerns of large-scale computing systems today require not only generic fault tolerance approaches, but also application-level resilience, due to demanding efficiency and various domain-specific requirements. Scientific applications within a particular domain generally comply with domain conservation laws, which can be leveraged as an error detection criterion…

    Submitted 5 November, 2019; originally announced November 2019.

    Comments: 18 pages

    MSC Class: 68M15; 68M20; 68N20

  6. arXiv:1808.01093  [pdf, other]

    cs.DC

    Characterization and Comparison of Application Resilience for Serial and Parallel Executions

    Authors: Kai Wu, Qiang Guan, Nathan DeBardeleben, Dong Li

    Abstract: Soft errors in exascale applications are a challenging problem in modern HPC. To quantify an application's resilience and vulnerability, application-level fault injection is widely adopted by HPC users. However, it is not easy, since users need to inject a large number of faults to ensure statistical significance, especially for parallel versions of a program. Normally, parallel execution…

    Submitted 3 August, 2018; originally announced August 2018.

    Comments: 2 pages

    Report number: LA-UR-17-26470