A distributed system for creating a checkpoint for a plurality of
processes running on the distributed system. The distributed system
includes a plurality of compute nodes with an operating system executing
on each compute node. A checkpoint library resides at the user level on
each of the compute nodes, and the checkpoint library is transparent to
the operating system residing on the same compute node and to the other
compute nodes. Each checkpoint library uses a windowed message logging
protocol to checkpoint the distributed system. Processes
participating in a distributed computation on the distributed system may
be migrated from one compute node to another in the distributed system
by using the checkpoint library to re-map hardware addresses.
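
The abstract does not spell out how the windowed logging protocol works, so the following is only an illustrative sketch of one common reading: a sender keeps a copy of each message until the receiver acknowledges it, the window bounds how many unacknowledged messages may be logged at once, and whatever remains in the log can be resent if the receiver restarts from a checkpoint. The class name `WindowedMessageLog`, its methods, the sequence numbering, and the cumulative acknowledgement scheme are all assumptions made for illustration and are not taken from the source.

```python
from collections import OrderedDict


class WindowedMessageLog:
    """Hypothetical sender-side log of messages not yet acknowledged."""

    def __init__(self, window_size):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self.window_size = window_size
        self.next_seq = 0
        self.pending = OrderedDict()  # sequence number -> payload, oldest first

    def can_send(self):
        # The window limits how many unacknowledged messages are logged.
        return len(self.pending) < self.window_size

    def record_send(self, payload):
        # Log a copy of the message before it leaves the node.
        if not self.can_send():
            raise RuntimeError("window full: wait for an acknowledgement")
        seq = self.next_seq
        self.pending[seq] = payload
        self.next_seq += 1
        return seq

    def acknowledge(self, seq):
        # Cumulative acknowledgement (assumed): drop every entry numbered <= seq.
        for logged in [s for s in self.pending if s <= seq]:
            del self.pending[logged]

    def replay(self):
        # Messages to resend if the receiver restarts from a checkpoint.
        return list(self.pending.items())


log = WindowedMessageLog(window_size=2)
log.record_send(b"a")   # sequence number 0
log.record_send(b"b")   # sequence number 1, window is now full
log.acknowledge(0)      # receiver confirms message 0, freeing one slot
log.record_send(b"c")   # sequence number 2
print(log.replay())     # [(1, b'b'), (2, b'c')]
```

Under these assumptions, the bounded window keeps the log small enough to hold in the user-level library's memory, at the cost of stalling the sender when acknowledgements lag.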