[Colloquium] Fault-Tolerance for Extreme Scale Systems-A Systems Level Perspective
May 2, 2013
Watch Colloquium:
AVI file (910 MB)
- Date: Thursday, May 2, 2013
- Time: 11:00 am — 12:30 pm
- Place: Mechanical Engineering 218
Kurt Ferreira
Sandia National Laboratories
Achieving the next three orders of magnitude in performance, to move from petascale to exascale computing, will require significant advancements in several fundamental areas. Recent reports from the U.S. Department of Energy place resilience as one of these challenges. The resilience challenge is cross-cutting and will likely require advancements in multiple layers of the systems software stack of these extreme-scale systems, from the OS to the application. In this talk, I will summarize current work at Sandia National Laboratories to address this important challenge. I will characterize the challenge in the context of extreme-scale capability computing, outline current approaches and their benefits, and point out unexplored areas where more work is needed.
Bio: A senior member of Sandia’s technical staff, Kurt Ferreira is an expert on system software and resilience/fault-tolerance methods for large-scale, massively parallel, distributed-memory scientific computing systems. Kurt has designed and developed many innovative, high-performance, and resilient implementations of low-level system software for a number of HPC platforms at Sandia National Laboratories. His research interests include the design and construction of operating systems for massively parallel processing machines and innovative application- and system-level fault-tolerance mechanisms for HPC.