-
On the Combination of Silent Error Detection and Checkpointing
Authors:
Guillaume Aupy,
Anne Benoit,
Thomas Hérault,
Yves Robert,
Frédéric Vivien,
Dounia Zaidouni
Abstract:
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrarily to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delays following a probability distribution (typically, an Exponential dis…
▽ More
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrarily to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delays following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
△ Less
Submitted 31 October, 2013;
originally announced October 2013.
-
Checkpointing strategies with prediction windows
Authors:
Guillaume Aupy,
Yves Robert,
Frédéric Vivien,
Dounia Zaidouni
Abstract:
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We suppose that the fault-prediction system provides prediction windows instead of exact predictions, which dramatically complicates the analysis of the checkpointing strategies. We propose a new approach based upon two periodic modes, a regular mode outside prediction windows, and a proactive mode inside…
▽ More
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We suppose that the fault-prediction system provides prediction windows instead of exact predictions, which dramatically complicates the analysis of the checkpointing strategies. We propose a new approach based upon two periodic modes, a regular mode outside prediction windows, and a proactive mode inside prediction windows, whenever the size of these windows is large enough. We are able to compute the best period for any size of the prediction windows, thereby deriving the scheduling strategy that minimizes platform waste. In addition, the results of this analytical evaluation are nicely corroborated by a comprehensive set of simulations, which demonstrate the validity of the model and the accuracy of the approach.
△ Less
Submitted 19 February, 2013;
originally announced February 2013.
-
Checkpointing algorithms and fault prediction
Authors:
Guillaume Aupy,
Yves Robert,
Frédéric Vivien,
Dounia Zaidouni
Abstract:
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide an optimal algorithm to decide when to take predictions into account, and we derive the optimal value of the checkpointin…
▽ More
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide an optimal algorithm to decide when to take predictions into account, and we derive the optimal value of the checkpointing period. These results allow to analytically assess the key parameters that impact the performance of fault predictors at very large scale.
△ Less
Submitted 3 December, 2013; v1 submitted 15 February, 2013;
originally announced February 2013.
-
Impact of fault prediction on checkpointing strategies
Authors:
Guillaume Aupy,
Yves Robert,
Frédéric Vivien,
Dounia Zaidouni
Abstract:
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical analysis of Young and Daly in the presence of a fault prediction system, which is characterized by its recall and its precision, and which provides either exact or window-based time predictions. We succeed in deriving the optimal value of the checkpointing period (thereby minimizing…
▽ More
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical analysis of Young and Daly in the presence of a fault prediction system, which is characterized by its recall and its precision, and which provides either exact or window-based time predictions. We succeed in deriving the optimal value of the checkpointing period (thereby minimizing the waste of resource usage due to checkpoint overhead) in all scenarios. These results allow to analytically assess the key parameters that impact the performance of fault predictors at very large scale. In addition, the results of this analytical evaluation are nicely corroborated by a comprehensive set of simulations, thereby demonstrating the validity of the model and the accuracy of the results.
△ Less
Submitted 9 October, 2012; v1 submitted 30 July, 2012;
originally announced July 2012.