Diagnosis of Performance Faults in Large-Scale MPI Applications via Probabilistic Progress-Dependence Inference

Ignacio Laguna, Dong H. Ahn, Bronis R. De Supinski, Saurabh Bagchi, Todd Gamblin

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about failure root causes. Further, most debuggers significantly slow down program execution and run sluggishly with massively parallel applications. This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated. Our technique combines scalable runtime analysis with static analysis to determine the least-progressed task(s) and to identify the code lines at which the failure arose. We present a novel algorithm that probabilistically infers progress dependence among MPI tasks using a globally constructed Markov model that represents tasks' control-flow behavior. In comparison to previous work, our algorithm infers the least-progressed task more precisely. We combine this technique with static backward slicing analysis, further isolating the code responsible for the current state. A blind study demonstrates that our technique isolates the root cause of a concurrency bug in a molecular dynamics simulation, which only manifests itself at 7,996 tasks or more. We extensively evaluate fault coverage of our technique via fault injections in 10 HPC benchmarks and show that our analysis takes less than a few seconds on thousands of parallel tasks.

Original language: English
Article number: 6803050
Pages (from-to): 1280-1289
Number of pages: 10
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 26
Issue number: 5
Early online date: 21 Apr 2014
DOIs
Publication status: Published - 01 May 2015

Keywords

  • Distributed debugging
  • MPI
  • parallel applications
  • progress dependence

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics
