TY - GEN
T1 - Energy-efficient localised rollback via data flow analysis and frequency scaling
AU - Dichev, Kiril
AU - Cameron, Kirk
AU - Nikolopoulos, Dimitrios S.
PY - 2018/9/23
Y1 - 2018/9/23
N2 - Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpointing and a global rollback to recover. In recent years, techniques that reduce the number of processes rolling back have been implemented via message logging. However, log-based approaches have weaknesses: they depend on complex modifications within an MPI implementation, and a full restart may still be required in the general case. To address the limitations of all log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data flow rollback (DFR), a fundamentally different approach that relies on analysis of the data flow of iterative codes and on the well-known concept of data flow graphs. We demonstrate the benefits of DFR for an MPI stencil code by localising rollback, and then reduce energy consumption by 10-12% on idling nodes via frequency scaling. We also provide large-scale estimates of the energy savings of DFR compared to global rollback, which for stencil codes increase as n squared for a process count n.
KW - Checkpoint/Restart
KW - Data Flow
KW - Discrete-Event Simulator
KW - Energy Efficiency
KW - Fault Tolerance
KW - Frequency Scaling
KW - MPI
KW - Stencil Applications
UR - http://www.scopus.com/inward/record.url?scp=85055422719&partnerID=8YFLogxK
U2 - 10.1145/3236367.3236379
DO - 10.1145/3236367.3236379
M3 - Conference contribution
SN - 9781450364928
BT - EuroMPI 2018 - Proceedings of the 25th European MPI Users' Group Meeting
T2 - Proceedings of the 25th European MPI Users' Group Meeting
Y2 - 23 September 2018 through 26 September 2018
ER -