NSCI: SI2-SSE: An Extensible Model to Support Scalable Checkpoint-Restart for DMTCP across Multiple Disciplines

2018-04-24T01:28:24Z (GMT) by Gene Cooperman
DMTCP (Distributed MultiThreaded CheckPointing) is a widely used package for transparent checkpoint-restart. Checkpoint-restart saves the state of a running process to disk and later restarts the process (possibly on a new computer) from where it left off. DMTCP has grown from a monolithic package into a highly adaptable one that supports HPC (e.g., MPI), GPUs, and high-performance networks, as well as applications in areas such as cyber-security, EDA, science, and engineering.