In one embodiment, a method for fault tolerance and recovery in a
high-performance computing (HPC) system includes monitoring a currently
running node in an HPC system including multiple nodes. A fabric couples
the multiple nodes to each other and couples the multiple nodes to
storage that is accessible to each of the multiple nodes and capable of
storing multiple hosts, each executable at any of the multiple nodes. The
method includes, if a fault occurs at the currently running node,
discontinuing operation of the currently running node and booting the
host at a free node in the HPC system from the storage.
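
The recovery step described above can be illustrated with a minimal sketch. It is not the claimed implementation: the names `Node`, `Cluster`, `boot`, `find_free_node`, and `recover`, the `faulted` flag standing in for the monitoring mechanism, and the in-memory set standing in for the shared storage are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass
class Node:
    """Hypothetical model of one node in the HPC system."""
    name: str
    host: Optional[str] = None  # host booted at this node; None means the node is free
    faulted: bool = False       # assumed to be set by some monitoring mechanism


class Cluster:
    """Hypothetical model of nodes that all reach the same shared storage."""

    def __init__(self, node_names: Iterable[str], stored_hosts: Iterable[str]):
        self.nodes: Dict[str, Node] = {n: Node(n) for n in node_names}
        # Stand-in for storage reachable from every node over the fabric.
        self.storage: Set[str] = set(stored_hosts)

    def boot(self, node_name: str, host: str) -> None:
        """Boot a stored host at a node that is free and not faulted."""
        if host not in self.storage:
            raise KeyError(f"host {host!r} is not in shared storage")
        node = self.nodes[node_name]
        if node.faulted or node.host is not None:
            raise RuntimeError(f"node {node_name!r} is not free")
        node.host = host

    def find_free_node(self) -> Optional[Node]:
        """Return the first node that is neither faulted nor running a host."""
        for node in self.nodes.values():
            if not node.faulted and node.host is None:
                return node
        return None

    def recover(self, node_name: str) -> Optional[str]:
        """If the node has faulted, stop it and reboot its host elsewhere.

        Returns the name of the node that now runs the host, or None if
        no recovery was needed.
        """
        failed = self.nodes[node_name]
        if not failed.faulted or failed.host is None:
            return None
        host = failed.host
        failed.host = None  # discontinue operation of the faulted node
        free = self.find_free_node()
        if free is None:
            raise RuntimeError("no free node available for recovery")
        self.boot(free.name, host)  # boot the same host from shared storage
        return free.name
```

Because every host resides in storage reachable from every node, the sketch only needs to carry the host's name from the faulted node to the free one; no node-local state is copied.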