Showing 1–4 of 4 results for author: Rexachs, D

  1. arXiv:2409.02214

    cs.DC

    Checkpoint and Restart: An Energy Consumption Characterization in Clusters

    Authors: Marina Moran, Javier Balladini, Dolores Rexachs, Emilio Luque

    Abstract: The fault tolerance method currently used in High Performance Computing (HPC) is the rollback-recovery method by using checkpoints. This, like any other fault tolerance method, adds an additional energy consumption to that of the execution of the application. The objective of this work is to determine the factors that affect the energy consumption of the computing nodes on homogeneous cluster, whe…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: 15 pages, 20 figures

  2. Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

    Authors: Marina Moran, Javier Balladini, Dolores Rexachs, Enzo Rucci

    Abstract: Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints prevent all processes from re-executing wh…

    Submitted 14 November, 2023; v1 submitted 10 November, 2023; originally announced November 2023.

    Comments: This is the accepted version of the manuscript that was sent to review to Journal of Parallel and Distributed Computing (ISSN 1096-0848). arXiv admin note: text overlap with arXiv:2012.11396

  3. Towards Management of Energy Consumption in HPC Systems with Fault Tolerance

    Authors: Marina Morán, Javier Balladini, Dolores Rexachs, Enzo Rucci

    Abstract: High-performance computing continues to increase its computing power and energy efficiency. However, energy consumption continues to rise and finding ways to limit and/or decrease it is a crucial point in current research. For high-performance MPI applications, there are rollback recovery based fault tolerance methods, such as uncoordinated checkpoints. These methods allow only some processes to g…

    Submitted 17 August, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

    Comments: This is the author version of the manuscript that was accepted for publication in 2020 IEEE Biennial Congress of Argentina (ARGENCON) (ISBN 978-1-7281-5957-7/20)

  4. Soft Errors Detection and Automatic Recovery based on Replication combined with different Levels of Checkpointing

    Authors: Diego Montezanti, Enzo Rucci, Armando De Giusti, Marcelo Naiouf, Dolores Rexachs, Emilio Luque

    Abstract: Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process re…

    Submitted 27 July, 2020; v1 submitted 16 July, 2020; originally announced July 2020.

    Comments: 26 pages, 3 figures (1 and 2 with subfigures), 4 tables, sent to review to Future Generation Computer Systems

    Journal ref: FGCS Volume 113, December 2020, Pages 240-254, ISSN 0167-739X