Fault tolerance and recovery in a high-performance computing (HPC) system

In one embodiment, a method for fault tolerance and recovery in a high-performance computing (HPC) system includes monitoring a currently running node in an HPC system that includes multiple nodes. A fabric couples the multiple nodes to each other and to storage that is accessible to each of the multiple nodes and capable of storing multiple hosts, each executable at any of the multiple nodes. The method includes, if a fault occurs at the currently running node, discontinuing operation of the currently running node and booting the host that node was executing at a free node in the HPC system from the storage.
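The monitor, discontinue, and reboot steps described above can be illustrated with a minimal sketch. It is not the claimed implementation: the names `Node`, `Cluster`, `find_free_node`, `boot`, and `monitor_once` are hypothetical, the shared storage is modeled as a plain set of host names, and fault detection is reduced to a boolean flag on each node.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Node:
    """Hypothetical model of one HPC node."""
    node_id: str
    running: bool = False
    faulty: bool = False
    host: Optional[str] = None  # name of the host currently booted here


@dataclass
class Cluster:
    """Hypothetical model: nodes plus storage reachable from every node."""
    nodes: Dict[str, Node]
    shared_storage: Set[str]  # host names stored on the shared storage

    def find_free_node(self) -> Optional[Node]:
        # A free node is one that is neither running a host nor faulty.
        for node in self.nodes.values():
            if not node.running and not node.faulty:
                return node
        return None

    def boot(self, node: Node, host: str) -> None:
        # Any node may boot any stored host, since storage is shared.
        if host not in self.shared_storage:
            raise KeyError(f"host {host!r} is not in shared storage")
        node.host = host
        node.running = True


def monitor_once(cluster: Cluster) -> Dict[str, Optional[str]]:
    """Run one monitoring pass.

    Returns a mapping from each displaced host to the id of the node it
    was rebooted on, or None if no free node was available.
    """
    recovered: Dict[str, Optional[str]] = {}
    for node in list(cluster.nodes.values()):
        if node.running and node.faulty:
            host = node.host
            # Discontinue operation of the faulty node.
            node.running = False
            node.host = None
            # Boot the same host at a free node from shared storage.
            target = cluster.find_free_node()
            if target is None:
                recovered[host] = None
                continue
            cluster.boot(target, host)
            recovered[host] = target.node_id
    return recovered
```

In this sketch the faulty node is never reused, because `find_free_node` skips any node whose `faulty` flag is set. A real system would also need an actual fault-detection mechanism and a policy for the case where no free node exists, neither of which is specified in the abstract.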



