Energy-efficient localised rollback via data flow analysis and frequency scaling

Kiril Dichev, Kirk Cameron, Dimitrios S. Nikolopoulos

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Citations (Scopus)

Abstract

Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpointing and a global rollback to recover. In recent years, techniques that reduce the number of processes rolling back have been implemented via message logging. However, log-based approaches have weaknesses: they depend on complex modifications within an MPI implementation, and a full restart may be required in the general case. To address the limitations of all log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data flow rollback (DFR), a fundamentally different approach that relies on analysis of the data flow of iterative codes and on the well-known concept of data flow graphs. We demonstrate the benefits of DFR for an MPI stencil code by localising rollback, and then reduce energy consumption by 10-12% on idling nodes via frequency scaling. We also provide large-scale estimates for the energy savings of DFR compared to global rollback, which for stencil codes increase as n² for a process count n.
Original language: English
Title of host publication: EuroMPI 2018 - Proceedings of the 25th European MPI Users' Group Meeting
Number of pages: 11
ISBN (Electronic): 9781450364928
Publication status: Published - 23 Sept 2018
Event: Proceedings of the 25th European MPI Users' Group Meeting - Barcelona, Spain
Duration: 23 Sept 2018 to 26 Sept 2018
https://eurompi2018.bsc.es

Conference

Conference: Proceedings of the 25th European MPI Users' Group Meeting
Country/Territory: Spain
City: Barcelona
Period: 23/09/2018 to 26/09/2018

Keywords

  • Checkpoint/Restart
  • Data Flow
  • Discrete-Event Simulator
  • Energy Efficiency
  • Fault Tolerance
  • Frequency Scaling
  • MPI
  • Stencil Applications

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software
