SaFirE: Scalable and accurate fault injection for parallel multithreaded applications

Giorgis Georgakoudis, Ignacio Laguna, Hans Vandierendonck, Dimitrios S. Nikolopoulos, Martin Schulz

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

1 Citation (Scopus)

Abstract

Soft errors threaten to disrupt supercomputing scaling. Fault injection is a key technique to understand the impact of faults on scientific applications. However, injecting faults in parallel applications has been prohibitively slow, inaccurate and hard to implement. In this paper, we present SAFIRE, the first fast and accurate fault injection framework for parallel, multi-threaded applications. SAFIRE uses novel compiler instrumentation and code generation techniques to achieve high accuracy and high speed. Using SAFIRE, we show that fault manifestations can be significantly different depending on whether they happen in the application itself or in the parallel runtime system. In our experimental evaluation on 15 HPC parallel programs, we show that SAFIRE is multiple factors faster and equally accurate in comparison with state-of-the-art dynamic binary instrumentation tools for fault injection.

Original language: English
Title of host publication: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 890-899
Number of pages: 10
ISBN (Electronic): 9781728112466
DOI: https://doi.org/10.1109/IPDPS.2019.00097
Publication status: Published - 01 May 2019
Event: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019 - Rio de Janeiro, Brazil
Duration: 20 May 2019 to 24 May 2019

Conference

Conference: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019
Country: Brazil
City: Rio de Janeiro
Period: 20/05/2019 to 24/05/2019

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems and Management

Cite this

Georgakoudis, G., Laguna, I., Vandierendonck, H., Nikolopoulos, D. S., & Schulz, M. (2019). SaFirE: Scalable and accurate fault injection for parallel multithreaded applications. In Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019 (pp. 890-899). [8820954] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IPDPS.2019.00097