Dependency-Aware Rollback and Checkpoint-Restart for Distributed Task-Based Runtimes

Kiril Dichev, Herbert Jordan, Konstantinos Tovletoglou, Thomas Heller, Dimitrios Nikolopoulos, Georgios Karakonstantis, Charles Gillan

Research output: Contribution to conference › Paper

Abstract

With the increase in compute nodes in large compute platforms, a proportional increase in node failures will follow. Many application-based checkpoint/restart (C/R) techniques have been proposed for MPI applications to target the reduced mean time between failures. However, rollback as part of the recovery remains a dominant cost even in highly optimised MPI applications employing C/R techniques. Continuing execution past a checkpoint (that is, reducing rollback) is possible in message-passing runtimes, but extremely complex to design and implement. Our work focuses on task-based runtimes, where task dependencies are explicit and message passing is implicit. We see an opportunity for reducing rollback for such runtimes: we explore task dependencies in the rollback, which we call dependency-aware rollback. We also design a new C/R technique, which is influenced by recursive decomposition of tasks, and combine it with dependency-aware rollback. We expect the dependency-aware rollback to cancel and recompute fewer tasks in the presence of node failures. We describe, implement and validate the proposed protocol in a simulator, which confirms these expectations. In addition, we consistently observe faster overall execution time for dependency-aware rollback in the presence of faults, despite the fact that reduced task cancellation does not guarantee reduced overall execution time.
Original language: English
Number of pages: 10
Publication status: Published - 29 May 2017

