[Colloquium] Fault-Tolerance for Extreme Scale Systems-A Systems Level Perspective

May 2, 2013

Watch Colloquium:

AVI file (910 MB)

Date: Thursday, May 2, 2013
Time: 11:00 am – 12:30 pm
Place: Mechanical Engineering 218

Kurt Ferreira
Sandia National Laboratories

Achieving the next three orders of magnitude in performance, to move from petascale to exascale computing, will require significant advances in several fundamental areas. Recent reports from the U.S. Department of Energy identify resilience as one of these challenges. This resilience challenge is cross-cutting and will likely require advances in multiple layers of the system software stack of these extreme-scale systems, from the OS to the application. In this talk, I will summarize current work at Sandia National Laboratories to address this important challenge. I will characterize this challenge in the context of extreme-scale capability computing, outline current approaches and their benefits, and point out unexplored areas where more work is needed.

Bio: A senior member of Sandia’s technical staff, Kurt Ferreira is an expert on system software and resilience/fault-tolerance methods for large-scale, massively parallel, distributed-memory scientific computing systems. Kurt has designed and developed many innovative, high-performance, and resilient implementations of low-level system software for a number of HPC platforms at Sandia National Laboratories. His research interests include the design and construction of operating systems for massively parallel processing machines and innovative application- and system-level fault-tolerance mechanisms for HPC.
