[Colloquium] On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-based Fault Tolerance
August 24, 2012
Watch Colloquium:
M4V file (330 MB)
- Date: Friday, August 24, 2012
- Time: 12:00 pm — 12:50 pm
- Place: Centennial Engineering Center 1041
Dewan Ibtesham
Department of Computer Science, University of New Mexico
The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future-generation systems. Optimizations that reduce checkpoint overheads are therefore necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latencies and storage overheads. Leveraging a simple model for checkpoint compression viability, we show that: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme-scale systems; (2) checkpoint compression viability scales with checkpoint size; (3) the choice between user-level and system-level checkpoints has little impact on checkpoint compression viability; and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact that checkpoint compression might have on future-generation extreme-scale systems.
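The abstract does not spell out its viability model, so the following is only a minimal sketch of one plausible formulation, not necessarily the model presented in the talk. It assumes compression is worthwhile when the time to compress a checkpoint plus the time to commit the smaller result is less than the time to commit the uncompressed checkpoint. The function name, parameter names, and example numbers are all hypothetical.

```python
def compression_is_viable(checkpoint_bytes, compression_factor,
                          compress_rate, commit_rate):
    """Return True if compressing before committing is faster overall.

    Assumed (hypothetical) definitions:
      checkpoint_bytes   -- size of the uncompressed checkpoint, in bytes
      compression_factor -- fraction of the size removed by compression,
                            i.e. 1 - compressed_size / uncompressed_size
      compress_rate      -- compression throughput, in bytes per second
      commit_rate        -- throughput to stable storage, in bytes per second
    """
    # Time to write the checkpoint as-is.
    uncompressed_time = checkpoint_bytes / commit_rate

    # Time to compress the checkpoint, then write the smaller result.
    compressed_bytes = checkpoint_bytes * (1.0 - compression_factor)
    compressed_time = (checkpoint_bytes / compress_rate
                       + compressed_bytes / commit_rate)

    return compressed_time < uncompressed_time


# Illustrative numbers only: a 1 GB checkpoint that shrinks by 70%.
fast_compressor = compression_is_viable(1e9, 0.7, 100e6, 50e6)
slow_compressor = compression_is_viable(1e9, 0.7, 50e6, 50e6)
```

Under these assumptions, the comparison favors compression only when the compressor is fast enough relative to the storage path that the bytes it saves outweigh the time it spends.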
Bio: Dewan Ibtesham is a third-year PhD student advised by Professor Dorian Arnold in the UNM Department of Computer Science. He received his bachelor's degree in Computer Science and Engineering from BUET (Bangladesh University of Engineering and Technology). After working for two and a half years in the software industry, he moved to the U.S. and started graduate school in fall 2009. His research interests are in high performance computing and large-scale distributed systems, in particular making HPC systems fault tolerant and reliable so that users can take full advantage of their potential.