A cluster or group of cooperating systems may implement failure chain
detection and recovery. The group may include multiple nodes, and each
node may include a group management services (GMS) module. The GMS module
may in turn include a group communications mechanism to detect cluster
membership events. Each GMS module may maintain an identically ordered view of the
current group membership. When a member of the group fails, the member
that joined the group immediately after the failed member, according to
respective join times, may be selected to perform recovery operations for
the failed member. If a group member fails while performing recovery
operations for another failed member, the next member in the group
(according to respective join times) may be selected to perform recovery
for the newly failed member and may also perform recovery operations for
the originally failed member.
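
The selection rule described above can be illustrated with a minimal sketch. It is not the implementation behind this description, only one way the join-order rule might be modeled. The names GroupView, join, fail, and pending are hypothetical. The sketch also assumes that selection wraps around to the earliest-joined member when the most recently joined member fails, which the text above does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GroupView:
    """Hypothetical membership view, ordered by join time (earliest first)."""

    members: List[str] = field(default_factory=list)
    # Maps each recovering member to the failed members it is responsible for.
    pending: Dict[str, List[str]] = field(default_factory=dict)

    def join(self, name: str) -> None:
        # A new member is appended, so list order reflects join order.
        self.members.append(name)

    def fail(self, name: str) -> Optional[str]:
        """Remove a failed member and return the member selected to recover it."""
        idx = self.members.index(name)
        self.members.pop(idx)
        # Recoveries the failed member had not finished form the failure chain.
        inherited = self.pending.pop(name, [])
        if not self.members:
            return None  # no live member remains to perform recovery
        # After the pop, position idx holds the member that joined next.
        # Assumption: wrap to the earliest member if the newest one failed.
        recoverer = self.members[idx % len(self.members)]
        # The recoverer takes over the inherited recoveries plus the new one.
        self.pending.setdefault(recoverer, []).extend(inherited + [name])
        return recoverer


view = GroupView()
for node in ["A", "B", "C", "D"]:
    view.join(node)

first = view.fail("B")   # "C" joined right after "B"
second = view.fail("C")  # "C" fails mid-recovery, so "D" takes over both
print(first, second, view.pending)  # C D {'D': ['B', 'C']}
```

Because every GMS module is described as holding an identically ordered view, each surviving node could evaluate the same rule locally and reach the same choice of recoverer without a separate election step.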