A checkpoint of a parallel program is taken in order to provide a
consistent state of the program in the event the program is to be
restarted. Each process of the parallel program is responsible for taking
its own checkpoint; however, the timing of when each process takes its
checkpoint is the responsibility of a coordinating process.
During checkpointing, various data is written to a checkpoint file.
This data includes, for instance, in-transit message data, a data section,
file offsets, signal state, executable information, stack contents and
register contents. The checkpoint file can be stored in either local or
global storage. When it is stored in global storage, migration of the
program is facilitated. When a parallel program is to be restarted, each
process of the program initiates its own restart. The restart logic
restores the process to its state at the time the checkpoint was taken.
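
The division of responsibility described above can be illustrated with a minimal single-machine sketch. It is not the actual implementation: the names `Process`, `Coordinator`, `take_checkpoint`, `restart` and the JSON file layout are hypothetical, and only a small part of the state listed above (a data section, in-transit messages, file offsets) is modeled. Stack contents, register contents and signal state are omitted. The coordinator only decides when checkpoints happen; each process writes and later reads its own file.

```python
import json
import os
import tempfile


class Process:
    """Hypothetical stand-in for one process of a parallel program."""

    def __init__(self, rank, checkpoint_dir):
        self.rank = rank
        self.checkpoint_dir = checkpoint_dir
        self.data = {"counter": 0}   # stands in for the data section
        self.in_transit = []         # messages received but not yet consumed
        self.file_offsets = {}       # open file name -> current offset

    def _path(self):
        return os.path.join(self.checkpoint_dir, f"ckpt_rank{self.rank}.json")

    def take_checkpoint(self):
        """Write this process's own state to its own checkpoint file."""
        state = {
            "data": self.data,
            "in_transit": self.in_transit,
            "file_offsets": self.file_offsets,
        }
        tmp = self._path() + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        # Rename into place so a partially written file is never picked up.
        os.replace(tmp, self._path())

    def restart(self):
        """Restore this process's state from its own checkpoint file."""
        with open(self._path()) as f:
            state = json.load(f)
        self.data = state["data"]
        self.in_transit = state["in_transit"]
        self.file_offsets = state["file_offsets"]


class Coordinator:
    """Controls the timing of checkpoints; does not write them itself."""

    def __init__(self, processes):
        self.processes = processes

    def checkpoint_all(self):
        # In a real system this would be a message to each process;
        # here it is a direct call for simplicity.
        for p in self.processes:
            p.take_checkpoint()


def demo():
    # A temporary directory stands in for local or global storage.
    with tempfile.TemporaryDirectory() as storage:
        procs = [Process(rank, storage) for rank in range(3)]
        for p in procs:
            p.data["counter"] = 10 + p.rank
            p.in_transit.append(f"msg-for-{p.rank}")

        Coordinator(procs).checkpoint_all()

        # Simulate a failure by clobbering the in-memory state.
        for p in procs:
            p.data["counter"] = -1
            p.in_transit.clear()

        # Each process initiates its own restart.
        for p in procs:
            p.restart()
        return [(p.data["counter"], p.in_transit) for p in procs]
```

Calling `demo()` returns the state of each process after restart, which matches the state at the time of the checkpoint. Pointing `checkpoint_dir` at storage visible to every node is what would allow a process to be restarted on a different machine in this sketch.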