News Archives


[Colloquium] On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-based Fault Tolerance

August 24, 2012

Watch Colloquium: M4V file (330 MB)

  • Date: Friday, August 24, 2012 
  • Time: 12:00 pm — 12:50 pm 
  • Place: Centennial Engineering Center 1041

Dewan Ibtesham
Department of Computer Science, University of New Mexico

The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future-generation systems. Optimizations that reduce checkpoint overheads are therefore necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latencies and storage overheads. Leveraging a simple model for checkpoint compression viability, we show that: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme-scale systems; (2) checkpoint compression viability scales with checkpoint size; (3) the choice between user-level and system-level checkpoints has little impact on checkpoint compression viability; and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact that checkpoint compression might have on future-generation extreme-scale systems.
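The abstract does not spell out the viability model, so the following is only an illustrative sketch of what such a model might look like, not the speaker's formulation. It assumes compression is worthwhile when the time to compress a checkpoint plus the time to commit the smaller result is less than the time to commit the uncompressed checkpoint. The function name, the parameter names, and the definition of the compression factor (the fraction of bytes removed) are all assumptions made for this example.

```python
def compression_is_viable(checkpoint_bytes, compression_factor,
                          compress_rate, commit_rate):
    """Return True if compressing before committing is faster overall.

    Hypothetical model (an assumption, not taken from the talk):
      checkpoint_bytes   -- size of the uncompressed checkpoint, in bytes
      compression_factor -- fraction of bytes removed, 1 - compressed/original
      compress_rate      -- compression throughput, in bytes per second
      commit_rate        -- throughput to stable storage, in bytes per second
    """
    # Time to write the checkpoint as-is.
    t_uncompressed = checkpoint_bytes / commit_rate

    # Time to compress, then write the smaller result.
    compressed_bytes = checkpoint_bytes * (1.0 - compression_factor)
    t_compressed = (checkpoint_bytes / compress_rate
                    + compressed_bytes / commit_rate)

    return t_compressed < t_uncompressed


MB = 1_000_000

# Slow storage (10 MB/s): 20 s to compress + 30 s to write beats 100 s.
slow_storage = compression_is_viable(1000 * MB, 0.7, 50 * MB, 10 * MB)

# Fast storage (1000 MB/s): compression time dominates, so it does not pay off.
fast_storage = compression_is_viable(1000 * MB, 0.7, 50 * MB, 1000 * MB)
```

Under these assumed numbers, compression helps when the commit path is slow relative to the compressor and hurts when storage is fast. A real analysis would also need to account for effects this sketch ignores, such as contention among many processes writing at once.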

 

Bio: Dewan Ibtesham is a third-year PhD student advised by Professor Dorian Arnold in the UNM Department of Computer Science. He received his bachelor's degree in Computer Science and Engineering from BUET (Bangladesh University of Engineering and Technology). After working for two and a half years in the software industry, he moved to the U.S. and began graduate school in fall 2009. His research interests are in high performance computing and large-scale distributed systems, in particular making HPC systems fault tolerant and reliable so that users can take advantage of their full potential.