Modeling coordinated checkpointing for large-scale supercomputers

Title: Modeling coordinated checkpointing for large-scale supercomputers
Publication Type: Conference Paper
Year of Publication: 2005
Authors: Wang, L., K. Pattabiraman, Z. Kalbarczyk, R. K. Iyer, L. Votta, C. Vick, and A. Wood
Conference Name: Dependable Systems and Networks, 2005. DSN 2005. Proceedings. International Conference on
Pagination: 812-821
Date Published: June
Keywords: checkpointing, coordinated checkpointing protocol, failure analysis, large-scale supercomputers, parallel machines, performance evaluation, reliability, scalability, stochastic activity network, system performance

Current supercomputing systems consisting of thousands of nodes cannot meet the demands of emerging high-performance scientific applications. As a result, a new generation of supercomputing systems consisting of hundreds of thousands of nodes is being proposed. However, these systems are likely to experience far more frequent failures than today's systems, and such failures must be tackled effectively. Coordinated checkpointing is a common technique to deal with failures in supercomputers. This paper presents a model of a coordinated checkpointing protocol for large-scale supercomputers, and studies its scalability by considering both the coordination overhead and the effect of failures. Unlike most of the existing checkpointing models, the proposed model takes into account failures during checkpointing and recovery, as well as correlated failures. Stochastic activity networks (SANs) are used to model the system, and the model is simulated to study the scalability, reliability, and performance of the system.
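
The effect described in the abstract, that failures can strike during checkpointing and recovery as well as during computation, can be illustrated with a small Monte Carlo sketch. This is not the paper's stochastic activity network model: it assumes independent, exponentially distributed node failures (so correlated failures are not captured), a fixed checkpoint cost that stands in for coordination overhead, and a full restart of recovery whenever a failure interrupts it. All function names, parameters, and numbers below are hypothetical and chosen only for illustration.

```python
import random


def simulate_job(work, interval, ckpt_cost, recovery_cost, node_mtbf, n_nodes, rng):
    """Return the wall-clock time needed to complete `work` hours of computation.

    Assumption (not from the paper): node failures are independent and
    exponential, so the whole system fails at rate n_nodes / node_mtbf.
    """
    rate = n_nodes / node_mtbf
    clock = 0.0
    saved = 0.0            # work protected by the last completed checkpoint
    needs_recovery = False
    while saved < work:
        if needs_recovery:
            time_to_failure = rng.expovariate(rate)
            if time_to_failure < recovery_cost:
                # Failure during recovery: lose the time spent and retry.
                clock += time_to_failure
                continue
            clock += recovery_cost
            needs_recovery = False
        segment = min(interval, work - saved)
        phase = segment + ckpt_cost  # compute, then write a checkpoint
        time_to_failure = rng.expovariate(rate)
        if time_to_failure < phase:
            # Failure during computation or checkpointing: the segment is
            # lost and the job rolls back to the last completed checkpoint.
            clock += time_to_failure
            needs_recovery = True
            continue
        clock += phase
        saved += segment
    return clock


def mean_completion_time(n_nodes, runs=200, seed=0):
    """Average completion time over `runs` trials with illustrative settings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        total += simulate_job(
            work=100.0, interval=10.0, ckpt_cost=1.0, recovery_cost=5.0,
            node_mtbf=10_000.0, n_nodes=n_nodes, rng=rng,
        )
    return total / runs


if __name__ == "__main__":
    for nodes in (10, 100, 1000):
        print(nodes, round(mean_completion_time(nodes), 1))
```

Under these assumptions, the failure-free completion time is fixed by the work and checkpoint cost, and the mean completion time grows as the node count rises, because a larger system fails more often and more segments, checkpoints, and recoveries are interrupted.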

