An apparatus and method passively determine when a job in a clustered computing
environment is dead. Each node in the cluster has a cluster engine for communicating
between each job on the node and jobs on other nodes. A protocol is defined that
includes one or more acknowledge (ACK) rounds, and that only performs local processing
between ACK rounds. The protocol is executed by jobs that are members of a defined
group. Each job in the group has one or more work threads that execute the protocol.
In addition, each job has a main thread that communicates between the job and jobs
on other nodes (through the cluster engine), routes appropriate messages from the
cluster engine to a work thread, and signals to the cluster engine when a fault
occurs when the work thread executes the protocol. By assuring that a dead job
is reported to other members of the group, liveness information for group members
can be monitored without the overhead associated with active liveness checking.