NSCI: SI2-SSE: An Extensible Model to Support Scalable Checkpoint-Restart for DMTCP across Multiple Disciplines
DMTCP (Distributed MultiThreaded CheckPointing) is a widely used package for transparent checkpoint-restart. Checkpoint-restart saves the state of a running process to disk and later restarts the process, possibly on a different computer, from where it left off. DMTCP has grown from a monolithic package into a highly adaptable one that supports HPC (e.g., MPI), GPUs, and high-performance networks, as well as applications in cyber-security, EDA, science, and engineering.
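The save-then-resume idea behind checkpoint-restart can be illustrated with a small sketch. Note that this is application-level checkpointing, in which the program explicitly saves and reloads its own state; it is NOT how DMTCP works, since DMTCP checkpoints unmodified processes transparently. The function name `run_counter`, the checkpoint file format (Python `pickle`), and the `stop_after` parameter used to simulate an interruption are all hypothetical, chosen only for illustration.

```python
import os
import pickle
import tempfile


def run_counter(ckpt_path, total_steps, stop_after=None):
    """Hypothetical example: sum 0..total_steps-1, checkpointing after each step.

    If a checkpoint file exists, resume from it instead of starting over.
    stop_after (hypothetical) simulates the process being killed after that
    many steps of the current run; the last checkpoint stays on disk.
    """
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)  # restart: reload the saved state
    else:
        state = {"step": 0, "acc": 0}  # fresh start

    steps_done = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_done == stop_after:
            return None  # simulated interruption
        state["acc"] += state["step"]
        state["step"] += 1
        steps_done += 1
        # Checkpoint: write to a temporary file, then atomically replace the
        # previous checkpoint so a crash mid-write never leaves a torn file.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, ckpt_path)
    return state["acc"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state.ckpt")
        run_counter(path, 10, stop_after=4)  # "killed" after steps 0..3
        print(run_counter(path, 10))         # resumes at step 4, prints 45
```

The second call does not recompute the first four steps; it reloads them from the checkpoint and continues, which is the behavior checkpoint-restart provides. A transparent system such as DMTCP aims to give the same result for the whole process image without requiring the application to contain any of this save/load logic.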