Abstract
Soft errors threaten to disrupt supercomputing scaling. Fault injection is a key technique to understand the impact of faults on scientific applications. However, injecting faults in parallel applications has been prohibitively slow, inaccurate and hard to implement. In this paper, we present SAFIRE, the first fast and accurate fault injection framework for parallel, multi-threaded applications. SAFIRE uses novel compiler instrumentation and code generation techniques to achieve high accuracy and high speed. Using SAFIRE, we show that fault manifestations can be significantly different depending on whether they happen in the application itself or in the parallel runtime system. In our experimental evaluation on 15 HPC parallel programs, we show that SAFIRE is multiple factors faster and equally accurate in comparison with state-of-the-art dynamic binary instrumentation tools for fault injection.
Original language | English |
---|---|
Title of host publication | Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 890-899 |
Number of pages | 10 |
ISBN (Electronic) | 9781728112466 |
DOIs | |
Publication status | Published - 01 May 2019 |
Event | 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019 - Rio de Janeiro, Brazil. Duration: 20 May 2019 → 24 May 2019 |
Conference
Conference | 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019 |
---|---|
Country | Brazil |
City | Rio de Janeiro |
Period | 20/05/2019 → 24/05/2019 |
ASJC Scopus subject areas
- Computer Networks and Communications
- Hardware and Architecture
- Information Systems and Management
Fingerprint: research topics of 'SaFirE: Scalable and accurate fault injection for parallel multithreaded applications'.
Projects
- R6551CSC: Open TransPREcision COMPuting
  Woods, R., Karakonstantis, G. & Vandierendonck, H.
  03/11/2016 → …
  Project: Research
- R1485CSC: SERT: Scale-free, Energy-Aware and Resilient Adaptation of CSE Applications to Mega-Core Systems
  Nikolopoulos, D., Scott, S., Vandierendonck, H. & de Supinski, B.
  13/11/2014 → 30/09/2018
  Project: Research