Methods and apparatus perform fault isolation in multiple node computing
systems using commutative error detection values (for example,
checksums) to identify and isolate faulty nodes. When information
associated with a reproducible portion of a computer program is injected
into a network by a node, a commutative error detection value is
calculated. At intervals, node fault detection apparatus associated with
the multiple node computer system retrieves the commutative error detection
values associated with the node and stores them in memory. When the
computer program is executed again by the multiple node computer system,
new commutative error detection values are created and stored in memory.
The node fault detection apparatus identifies faulty nodes by comparing,
for each node, the commutative error detection values that the node
generated for the same reproducible portions of the computer program in
different runs. A difference between the values indicates a possibly
faulty node.
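
The comparison described above can be illustrated with a minimal sketch. It is not the claimed implementation: the 32-bit width, the use of a sum of per-packet CRC-32 values as the commutative error detection value, and the names `commutative_checksum` and `find_suspect_nodes` are all assumptions made for illustration. The sum is commutative, so the order in which a node injects packets does not change the value.

```python
import zlib
from typing import Dict, Iterable, List

MASK = 0xFFFFFFFF  # assumed 32-bit checksum width


def commutative_checksum(packets: Iterable[bytes]) -> int:
    """Add per-packet CRC-32 values modulo 2**32.

    Addition is commutative, so the result does not depend on the
    order in which the packets were injected into the network.
    """
    total = 0
    for packet in packets:
        total = (total + zlib.crc32(packet)) & MASK
    return total


def find_suspect_nodes(
    run_a: Dict[int, List[int]], run_b: Dict[int, List[int]]
) -> List[int]:
    """Return the IDs of nodes whose stored values differ between runs.

    Each dict maps a (hypothetical) node ID to the list of checksums
    retrieved at successive intervals during one run of the program.
    """
    suspects = []
    for node in sorted(run_a.keys() & run_b.keys()):
        if run_a[node] != run_b[node]:
            suspects.append(node)
    return suspects


# Two runs of the same reproducible portion; node 1 injects a
# corrupted packet in the second run.
first_run = {
    0: [commutative_checksum([b"alpha", b"beta"])],
    1: [commutative_checksum([b"gamma", b"delta"])],
}
second_run = {
    0: [commutative_checksum([b"beta", b"alpha"])],      # reordered only
    1: [commutative_checksum([b"gamma", b"delt\x00"])],  # corrupted
}
print(find_suspect_nodes(first_run, second_run))
```

In this sketch node 0 is not flagged even though its packets were injected in a different order, while node 1 is flagged because one of its packets changed between runs.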