Search | arXiv e-print repository

Evaluating Different Fault Injection Abstractions on the Assessment of DNN SW Hardening Strategies

Authors: Giuseppe Esposito, Juan David Guerrero-Balaguera, Josie Esteban Rodriguez Condia, Matteo Sonza Reorda

Abstract: The reliability of Neural Networks has gained significant attention, prompting efforts to develop SW-based hardening techniques for safety-critical scenarios. However, evaluating hardening techniques using application-level fault injection (FI) strategies, which are commonly hardware-agnostic, may yield misleading results. This study for the first time compares two FI approaches (at the applicatio… ▽ More The reliability of Neural Networks has gained significant attention, prompting efforts to develop SW-based hardening techniques for safety-critical scenarios. However, evaluating hardening techniques using application-level fault injection (FI) strategies, which are commonly hardware-agnostic, may yield misleading results. This study for the first time compares two FI approaches (at the application level (APP) and instruction level (ISA)) to evaluate deep neural network SW hardening strategies. Results show that injecting permanent faults at ISA (a more detailed abstraction level than APP) changes completely the ranking of SW hardening techniques, in terms of both reliability and accuracy. These results highlight the relevance of using an adequate analysis abstraction for evaluating such techniques. △ Less

Submitted 11 December, 2024; originally announced December 2024.

Comments: 6 pages, 2 figures, 1 table, Asian Test Symposium

Journal ref: Asian Test Symposium 2024

arXiv:2306.10856 [pdf, other]

Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units

Authors: Juan-David Guerrero-Balaguera, Josie E. Rodriguez Condia, Fernando F. dos Santos, Matteo Sonza, Paolo Rech

Abstract: Graphics Processing Units (GPUs) are over-stressed to accelerate High-Performance Computing applications and are used to accelerate Deep Neural Networks in several domains where they have a life expectancy of many years. These conditions expose the GPUs hardware to (premature) aging, causing permanent faults to arise after the usual end-of-manufacturing test. Techniques to assess the impact of per… ▽ More Graphics Processing Units (GPUs) are over-stressed to accelerate High-Performance Computing applications and are used to accelerate Deep Neural Networks in several domains where they have a life expectancy of many years. These conditions expose the GPUs hardware to (premature) aging, causing permanent faults to arise after the usual end-of-manufacturing test. Techniques to assess the impact of permanent faults in GPUs are then strongly required, thus allowing to estimate the reliability risk and to possibly mitigate it. In this paper, we present a method to evaluate the effects of permanent faults affecting the GPU scheduler and control units, which are the most peculiar and stressed resources, along with the first figures that allow quantifying these effects. We characterize over 5.83x10^5 permanent fault effects in the scheduler and controllers of a gate-level GPU model. Then, we map the observed error categories in software by instrumenting the code of 13 applications and two convolutional neural networks, injecting more than 1.65x10^5 permanent errors. Our two-level fault injection strategy reduces the evaluation time from hundreds of years of gate-level evaluation to hundreds of hours.We found that faults in the GPU parallelism management units can modify the opcode, the addresses, and the status of thread(s) and warp(s). The large majority (up to 99%) of these hardware permanent errors impacts the running software execution. Errors affecting the instruction operation or resource management hang the code, while 45% of errors in the parallelism management or control-flow induce silent data corruptions. △ Less

Submitted 2 October, 2023; v1 submitted 19 June, 2023; originally announced June 2023.

Journal ref: ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, ACM; IEEE, Nov 2023, Denver, United States

arXiv:2211.13094 [pdf, other]

Characterizing a Neutron-Induced Fault Model for Deep Neural Networks

Authors: Fernando Fernandes dos Santos, Angeliki Kritikakou, Josie Esteban Rodriguez Condia, Juan David Guerrero Balaguera, Matteo Sonza Reorda, Olivier Sentieys, Paolo Rech

Abstract: The reliability evaluation of Deep Neural Networks (DNNs) executed on Graphic Processing Units (GPUs) is a challenging problem since the hardware architecture is highly complex and the software frameworks are composed of many layers of abstraction. While software-level fault injection is a common and fast way to evaluate the reliability of complex applications, it may produce unrealistic results s… ▽ More The reliability evaluation of Deep Neural Networks (DNNs) executed on Graphic Processing Units (GPUs) is a challenging problem since the hardware architecture is highly complex and the software frameworks are composed of many layers of abstraction. While software-level fault injection is a common and fast way to evaluate the reliability of complex applications, it may produce unrealistic results since it has limited access to the hardware resources and the adopted fault models may be too naive (i.e., single and double bit flip). Contrarily, physical fault injection with neutron beam provides realistic error rates but lacks fault propagation visibility. This paper proposes a characterization of the DNN fault model combining both neutron beam experiments and fault injection at software level. We exposed GPUs running General Matrix Multiplication (GEMM) and DNNs to beam neutrons to measure their error rate. On DNNs, we observe that the percentage of critical errors can be up to 61%, and show that ECC is ineffective in reducing critical errors. We then performed a complementary software-level fault injection, using fault models derived from RTL simulations. Our results show that by injecting complex fault models, the YOLOv3 misdetection rate is validated to be very close to the rate measured with beam experiments, which is 8.66x higher than the one measured with fault injection using only single-bit flips. △ Less

Submitted 23 November, 2022; originally announced November 2022.

arXiv:2109.00958 [pdf]

A Novel Compaction Approach for SBST Test Programs

Authors: Juan-David Guerrero-Balaguera, Josie E. Rodriguez Condia, Matteo Sonza Reorda

Abstract: In-field test of processor-based devices is a must when considering safety-critical systems (e.g., in robotics, aerospace, and automotive applications). During in-field testing, different solutions can be adopted, depending on the specific constraints of each scenario. In the last years, Self-Test Libraries (STLs) developed by IP or semiconductor companies became widely adopted. Given the strict c… ▽ More In-field test of processor-based devices is a must when considering safety-critical systems (e.g., in robotics, aerospace, and automotive applications). During in-field testing, different solutions can be adopted, depending on the specific constraints of each scenario. In the last years, Self-Test Libraries (STLs) developed by IP or semiconductor companies became widely adopted. Given the strict constraints of in-field test, the size and time duration of a STL is a crucial parameter. This work introduces a novel approach to compress functional test programs belonging to an STL. The proposed approach is based on analyzing (via logic simulation) the interaction between the micro-architectural operation performed by each instruction and its capacity to propagate fault effects on any observable output, reducing the required fault simulations to only one. The proposed compaction strategy was validated by resorting to a RISC-V processor and several test programs stemming from diverse generation strategies. Results showed that the proposed compaction approach can reduce the length of test programs by up to 93.9% and their duration by up to 95%, with minimal effect on fault coverage. △ Less

Submitted 8 September, 2021; v1 submitted 2 September, 2021; originally announced September 2021.

Comments: Paper accepted to be presented in The 30th IEEE Asian Test Symposium (ATS 2021) November 22 - 25, 2021, Japan. to be published in the IEEE xplorer after the presentation in the event

Showing 1–4 of 4 results for author: Condia, J E R