Program products for performing checkpoint/restart of a parallel program

A checkpoint of a parallel program is taken to provide a consistent state from which the program can be restarted. Each process of the parallel program is responsible for taking its own checkpoint; however, the timing of when each process takes its checkpoint is the responsibility of a coordinating process. During checkpointing, various data are written to a checkpoint file. These data include, for instance, in-transit message data, a data section, file offsets, signal state, executable information, stack contents and register contents. The checkpoint file can be stored in either local or global storage. Storing it in global storage facilitates migration of the program. When a parallel program is to be restarted, each process of the program initiates its own restart. The restart logic restores the process to the state it was in when the checkpoint was taken.
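The division of labour described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names `Worker` and `Coordinator`, the JSON file layout, and the use of a temporary directory as a stand-in for global storage are all assumptions made for illustration, and the processes are simulated as plain objects in a single interpreter. Signal state, executable information, stack contents and register contents are omitted because they cannot be captured portably at this level.

```python
import json
import tempfile
from pathlib import Path


class Worker:
    """One process of the (simulated) parallel program. Hypothetical name."""

    def __init__(self, rank, checkpoint_dir):
        self.rank = rank
        # One checkpoint file per process (file naming is an assumption).
        self.path = Path(checkpoint_dir) / f"ckpt_rank{rank}.json"
        self.data = {"counter": 0}   # stands in for the data section
        self.in_transit = []         # messages received but not yet consumed
        self.file_offsets = {}       # file name -> current offset

    def step(self):
        # One unit of application work.
        self.data["counter"] += 1

    def take_checkpoint(self):
        # Each process writes its own state to its own checkpoint file.
        state = {
            "data": self.data,
            "in_transit": self.in_transit,
            "file_offsets": self.file_offsets,
        }
        self.path.write_text(json.dumps(state))

    def restart(self):
        # Each process initiates its own restart from its own file.
        state = json.loads(self.path.read_text())
        self.data = state["data"]
        self.in_transit = state["in_transit"]
        self.file_offsets = state["file_offsets"]


class Coordinator:
    """Decides when checkpoints are taken; never writes worker state itself."""

    def __init__(self, workers):
        self.workers = workers

    def request_checkpoint(self):
        # All workers checkpoint at the same logical point in the run.
        for worker in self.workers:
            worker.take_checkpoint()


def demo():
    # The temporary directory stands in for global (shared) storage.
    with tempfile.TemporaryDirectory() as shared_dir:
        workers = [Worker(rank, shared_dir) for rank in range(3)]
        coordinator = Coordinator(workers)

        for worker in workers:
            worker.step()
            worker.in_transit.append(f"msg-for-{worker.rank}")

        coordinator.request_checkpoint()

        # Work done after the checkpoint is discarded by the restart.
        for worker in workers:
            worker.step()
        for worker in workers:
            worker.restart()

        return [(w.data["counter"], w.in_transit) for w in workers]
```

Calling `demo()` returns each worker's counter and pending messages as they stood when the coordinator requested the checkpoint. Because every worker reads only its own file from the shared directory, the same files could in principle be read by a worker started on a different machine, which is the sense in which global storage eases migration.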
