Showing 1–12 of 12 results for author: Hukerikar, S

  1. arXiv:2410.18124  [pdf, other]

    cs.DC

    Optimal Checkpoint Interval with Availability as an Objective Function

    Authors: Nirmal Raj Saxena, Saurabh Hukerikar, Mikolaj Blaz, Swapna Raj

    Abstract: We present a simplified derivation of the optimal checkpoint interval in Young_1974 [1]. The optimal checkpoint interval derivation in [1] is based on minimizing the total lost time as an objective-function. Lost time is a function of checkpoint interval, checkpoint save time, and average failure time. This simplified derivation yields lost-time-optimal that is identical to the one derived in [1].…

    Submitted 15 October, 2024; originally announced October 2024.

    Comments: 10 pages, 5 figures

  2. Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing

    Authors: Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

    Abstract: Resiliency is the ability of large-scale high-performance computing (HPC) applications to gracefully handle errors, and recover from failures. In this paper, we propose a pattern-based approach to constructing resilience solutions that handle multiple error modes. Using resilience patterns, we evaluate the performance and reliability characteristics of detection, containment and mitigation techniq…

    Submitted 22 February, 2018; originally announced February 2018.

    Comments: 2018 ACM/SPEC International Conference on Performance Engineering (ICPE '18) Berlin, Germany

  3. arXiv:1801.04523  [pdf, other]

    cs.DC

    Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery

    Authors: Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

    Abstract: Efficient utilization of today's high-performance computing (HPC) systems with complex hardware and software components requires that the HPC applications are designed to tolerate process failures at runtime. With low mean time to failure (MTTF) of current and future HPC systems, long running simulations on these systems require capabilities for gracefully handling process failures by the applicat…

    Submitted 14 January, 2018; originally announced January 2018.

    Comments: 26th Euromicro International Conference on Parallel, Distributed and network-based Processing (PDP 2018)

  4. A Pattern Language for High-Performance Computing Resilience

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: High-performance computing systems (HPC) provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance of the order of quadrillion floating-point arithmetic calculations per second by aggregating the power of millions of compute, memory, networking and storage components. With the rapidly growing scale and co…

    Submitted 30 October, 2017; v1 submitted 25 October, 2017; originally announced October 2017.

    Comments: Proceedings of the 22nd European Conference on Pattern Languages of Programs

  5. arXiv:1710.02627  [pdf, ps, other]

    cs.DC

    Pattern-based Modeling of High-Performance Computing Resilience

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: With the growing scale and complexity of high-performance computing (HPC) systems, resilience solutions that ensure continuity of service despite frequent errors and component failures must be methodically designed to balance the reliability requirements with the overheads to performance and power. Design patterns enable a structured approach to the development of resilience solutions, providing h…

    Submitted 6 October, 2017; originally announced October 2017.

    Comments: International European Conference on Parallel and Distributed Computing (Euro-Par) 2017 Workshops

  6. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. While the HPC community has developed various resilience solutions, the solution space remains fragmented. There are no formal methods and metrics to integrate the various HPC resilience techniques into composite solutions, nor are there methods to holistically evaluate the adequacy and efficacy of…

    Submitted 23 August, 2017; originally announced August 2017.

    Comments: Supercomputing Frontiers and Innovations. arXiv admin note: text overlap with arXiv:1611.02717

  7. arXiv:1708.06884  [pdf, other]

    cs.DC cs.DB

    Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

    Authors: Byung H. Park, Saurabh Hukerikar, Ryan Adamson, Christian Engelmann

    Abstract: Today's high-performance computing (HPC) systems are heavily instrumented, generating logs containing information about abnormal events, such as critical conditions, faults, errors and failures, system resource utilization, and about the resource usage of user applications. These logs, once fully analyzed and correlated, can produce detailed information about the system health, root causes of fail…

    Submitted 23 August, 2017; originally announced August 2017.

    Comments: IEEE Cluster 2017 at Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications

  8. arXiv:1611.02823  [pdf, other]

    cs.DC cs.PL cs.SE

    Language Support for Reliable Memory Regions

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: The path to exascale computational capabilities in high-performance computing (HPC) systems is challenged by the inadequacy of present software technologies to adapt to the rapid evolution of architectures of supercomputing systems. The constraints of power have driven system designs to include increasingly heterogeneous architectures and diverse memory technologies and interfaces. Future systems…

    Submitted 23 November, 2016; v1 submitted 9 November, 2016; originally announced November 2016.

    Comments: The 29th International Workshop on Languages and Compilers for Parallel Computing

  9. arXiv:1611.02717  [pdf, other]

    cs.DC cs.SE

    Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: In this document, we develop a structured approach to the management of HPC resilience based on the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. The catalog of resilience design patterns provides d…

    Submitted 28 December, 2016; v1 submitted 8 November, 2016; originally announced November 2016.

    Comments: Oak Ridge National Laboratory Technical Report version 1.0

    Report number: ORNL/TM-2016/687

  10. arXiv:1610.08494  [pdf, other]

    cs.DC

    Havens: Explicit Reliable Memory Regions for HPC Applications

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage the presence of high memory fa…

    Submitted 26 October, 2016; originally announced October 2016.

    Comments: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16), September 2016, Waltham, MA, USA

  11. arXiv:1610.01728  [pdf, other]

    cs.DC

    RedThreads: An Interface for Application-level Fault Detection/Correction through Adaptive Redundant Multithreading

    Authors: Saurabh Hukerikar, Keita Teranishi, Pedro C. Diniz, Robert F. Lucas

    Abstract: In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We bel…

    Submitted 17 January, 2017; v1 submitted 6 October, 2016; originally announced October 2016.

    Comments: Submitted to Journal

  12. arXiv:1605.01994  [pdf, other]

    cs.DC

    Rolex: Resilience-Oriented Language Extensions for Extreme-Scale Systems

    Authors: Saurabh Hukerikar, Robert F. Lucas

    Abstract: Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that will be less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half-a-century have enjoyed an execution model that assumed correct behavior by the underlying computing system. The mean…

    Submitted 23 May, 2016; v1 submitted 6 May, 2016; originally announced May 2016.