
Showing 1–11 of 11 results for author: Engelmann, C

  1. arXiv:2207.09708  [pdf, other]

    cs.MA cs.AI cs.SE

    RV4JaCa -- Runtime Verification for Multi-Agent Systems

    Authors: Debora C. Engelmann, Angelo Ferrando, Alison R. Panisson, Davide Ancona, Rafael H. Bordini, Viviana Mascardi

    Abstract: This paper presents a Runtime Verification (RV) approach for Multi-Agent Systems (MAS) using the JaCaMo framework. Our objective is to bring a layer of security to the MAS. This layer is capable of controlling events during the execution of the system without needing a specific implementation in the behaviour of each agent to recognise the events. MAS have been used in the context of hybrid intell…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: In Proceedings AREA 2022, arXiv:2207.09058

    Journal ref: EPTCS 362, 2022, pp. 23-36

  2. arXiv:2010.13342  [pdf, other]

    cs.DC

    Resiliency in Numerical Algorithm Design for Extreme Scale Simulations

    Authors: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik Goeddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Orti, et al. (11 additional authors not shown)

    Abstract: This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by all the authors. Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to backgr…

    Submitted 26 October, 2020; originally announced October 2020.

    Comments: 45 pages, 3 figures, submitted to The International Journal of High Performance Computing Applications

    ACM Class: D.4.5; G.4; G.1; D.4.4

  3. Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing

    Authors: Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

    Abstract: Resiliency is the ability of large-scale high-performance computing (HPC) applications to gracefully handle errors, and recover from failures. In this paper, we propose a pattern-based approach to constructing resilience solutions that handle multiple error modes. Using resilience patterns, we evaluate the performance and reliability characteristics of detection, containment and mitigation techniq…

    Submitted 22 February, 2018; originally announced February 2018.

    Comments: 2018 ACM/SPEC International Conference on Performance Engineering (ICPE '18) Berlin, Germany

  4. arXiv:1801.04523  [pdf, other]

    cs.DC

    Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery

    Authors: Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

    Abstract: Efficient utilization of today's high-performance computing (HPC) systems with complex hardware and software components requires that the HPC applications are designed to tolerate process failures at runtime. With low mean time to failure (MTTF) of current and future HPC systems, long running simulations on these systems require capabilities for gracefully handling process failures by the applicat…

    Submitted 14 January, 2018; originally announced January 2018.

    Comments: 26th Euromicro International Conference on Parallel, Distributed and network-based Processing (PDP 2018)

  5. A Pattern Language for High-Performance Computing Resilience

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: High-performance computing systems (HPC) provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance of the order of quadrillion floating-point arithmetic calculations per second by aggregating the power of millions of compute, memory, networking and storage components. With the rapidly growing scale and co…

    Submitted 30 October, 2017; v1 submitted 25 October, 2017; originally announced October 2017.

    Comments: Proceedings of the 22nd European Conference on Pattern Languages of Programs

  6. arXiv:1710.02627  [pdf, ps, other]

    cs.DC

    Pattern-based Modeling of High-Performance Computing Resilience

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: With the growing scale and complexity of high-performance computing (HPC) systems, resilience solutions that ensure continuity of service despite frequent errors and component failures must be methodically designed to balance the reliability requirements with the overheads to performance and power. Design patterns enable a structured approach to the development of resilience solutions, providing h…

    Submitted 6 October, 2017; originally announced October 2017.

    Comments: International European Conference on Parallel and Distributed Computing (Euro-Par) 2017 Workshops

  7. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. While the HPC community has developed various resilience solutions, the solution space remains fragmented. There are no formal methods and metrics to integrate the various HPC resilience techniques into composite solutions, nor are there methods to holistically evaluate the adequacy and efficacy of…

    Submitted 23 August, 2017; originally announced August 2017.

    Comments: Supercomputing Frontiers and Innovations. arXiv admin note: text overlap with arXiv:1611.02717

  8. arXiv:1708.06884  [pdf, other]

    cs.DC cs.DB

    Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

    Authors: Byung H. Park, Saurabh Hukerikar, Ryan Adamson, Christian Engelmann

    Abstract: Today's high-performance computing (HPC) systems are heavily instrumented, generating logs containing information about abnormal events, such as critical conditions, faults, errors and failures, system resource utilization, and about the resource usage of user applications. These logs, once fully analyzed and correlated, can produce detailed information about the system health, root causes of fail…

    Submitted 23 August, 2017; originally announced August 2017.

    Comments: IEEE Cluster 2017 at Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications

  9. arXiv:1611.02823  [pdf, other]

    cs.DC cs.PL cs.SE

    Language Support for Reliable Memory Regions

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: The path to exascale computational capabilities in high-performance computing (HPC) systems is challenged by the inadequacy of present software technologies to adapt to the rapid evolution of architectures of supercomputing systems. The constraints of power have driven system designs to include increasingly heterogeneous architectures and diverse memory technologies and interfaces. Future systems…

    Submitted 23 November, 2016; v1 submitted 9 November, 2016; originally announced November 2016.

    Comments: The 29th International Workshop on Languages and Compilers for Parallel Computing

  10. arXiv:1611.02717  [pdf, other]

    cs.DC cs.SE

    Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: In this document, we develop a structured approach to the management of HPC resilience based on the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. The catalog of resilience design patterns provides d…

    Submitted 28 December, 2016; v1 submitted 8 November, 2016; originally announced November 2016.

    Comments: Oak Ridge National Laboratory Technical Report version 1.0

    Report number: ORNL/TM-2016/687

  11. arXiv:1610.08494  [pdf, other]

    cs.DC

    Havens: Explicit Reliable Memory Regions for HPC Applications

    Authors: Saurabh Hukerikar, Christian Engelmann

    Abstract: Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage the presence of high memory fa…

    Submitted 26 October, 2016; originally announced October 2016.

    Comments: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16), September 2016, Waltham, MA, USA