A method, apparatus, and program product checkpoint an application in a
parallel computing system of the type that includes a plurality of hybrid
nodes. Each hybrid node includes a host element and a plurality of
accelerator elements. Each host element may include at least one
multithreaded processor, and each accelerator element may include at
least one multi-element processor. In a first hybrid node from among the
plurality of hybrid nodes, checkpointing the application includes
executing at least a portion of the application in the host element,
configuring and executing at least one computation kernel in at least one
accelerator element, and, in response to receiving a command to
checkpoint the application, checkpointing the host element separately
from the at least one accelerator element upon which the at least one
computation kernel is executing.
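
The separation of host and accelerator checkpoints described above can be illustrated with a minimal sketch. It is a toy in-memory model, not the claimed implementation. All names (HostElement, AcceleratorElement, HybridNode, handle_checkpoint_command) and the dictionary-based state are hypothetical, and real checkpointing of processor and kernel state would be far more involved.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HostElement:
    """Hypothetical host element holding the host-side application state."""
    app_state: dict = field(default_factory=lambda: {"iteration": 0})

    def run_step(self) -> None:
        # Stand-in for executing a portion of the application on the host.
        self.app_state["iteration"] += 1

    def checkpoint(self) -> dict:
        # Snapshot host state only; accelerator state is not touched here.
        return copy.deepcopy(self.app_state)


@dataclass
class AcceleratorElement:
    """Hypothetical accelerator element that may run one computation kernel."""
    accel_id: int
    kernel_name: Optional[str] = None
    kernel_state: dict = field(default_factory=dict)

    def configure_kernel(self, name: str, data: List[int]) -> None:
        # Configure a kernel with its input data before execution.
        self.kernel_name = name
        self.kernel_state = {"data": list(data), "step": 0}

    def run_step(self) -> None:
        # Toy kernel: add 1 to every element and count the step.
        self.kernel_state["data"] = [x + 1 for x in self.kernel_state["data"]]
        self.kernel_state["step"] += 1

    def checkpoint(self) -> dict:
        # Snapshot this accelerator's kernel state independently of the host.
        return {
            "accel_id": self.accel_id,
            "kernel_name": self.kernel_name,
            "kernel_state": copy.deepcopy(self.kernel_state),
        }


@dataclass
class HybridNode:
    """Hypothetical hybrid node: one host element, several accelerators."""
    host: HostElement
    accelerators: List[AcceleratorElement]


def handle_checkpoint_command(node: HybridNode) -> dict:
    """On a checkpoint command, checkpoint the host separately from each
    accelerator element on which a kernel has been configured."""
    host_ckpt = node.host.checkpoint()
    accel_ckpts = [
        accel.checkpoint()
        for accel in node.accelerators
        if accel.kernel_name is not None  # skip accelerators with no kernel
    ]
    # The two parts are kept as distinct records so that each could be
    # stored or restored on its own.
    return {"host": host_ckpt, "accelerators": accel_ckpts}
```

Keeping the host record and the accelerator records distinct in the returned structure mirrors the idea that the two kinds of element are checkpointed separately. An accelerator with no configured kernel contributes no record in this sketch.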