Power Log'n'Roll: Power-Efficient Localized Rollback for MPI Applications Using Message Logging Protocols

Kiril Dichev*, De Sensi Daniele, Dimitrios S. Nikolopoulos, Kirk Cameron, Ivor Spence

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

In fault tolerance for parallel and distributed systems, message logging protocols have played a prominent role over the last three decades. Such protocols enable local rollback to provide recovery from fail-stop errors. Global rollback techniques can be straightforward to implement but at times lead to slower recovery than local rollback. Local rollback is more complicated but can offer faster recovery times. In this work, we study the power and energy efficiency implications of global and local rollback. We propose a power-efficient version of local rollback to reduce power consumption for non-critical, blocked processes, using Dynamic Voltage and Frequency Scaling (DVFS) and clock modulation (CM). Our results for 3 different MPI codes on 2 parallel systems show that power-efficient local rollback reduces CPU energy waste by up to 50% during the recovery phase, compared to existing global and local rollback techniques, without introducing significant overheads. Furthermore, we show that savings manifest for all blocked processes, which grow linearly with the process count. We estimate that for settings with high recovery overheads the total energy waste of parallel codes is reduced with the proposed local rollback.
Original language: English
Pages (from-to): 1276-1288
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 33
Issue number: 6
Early online date: 27 Aug 2021
Publication status: Published - 01 Jun 2022

ASJC Scopus subject areas

  • General Computer Science
