Improved methods for providing fault tolerance in a distributed computer
system use an explicit, delayed acknowledgement protocol in which an
acknowledgement message is sent to a workflow-requesting entity, such as
a load manager and/or a requesting client, only upon completion of a
workflow. The system includes workflow engines operating as a distributed
queue group to load-balance processing requests from clients. The system
also has a certified messaging capability that guarantees delivery of any
message sent by a certified message sender by maintaining a persistent
record of the message until an acknowledgement message is received back
from the certified message receiver. If a hardware or software failure
occurs during execution of a workflow, no acknowledgement is returned and
the workflow is reassigned to a different workflow engine. Improved
fault-tolerant computers and
computer networks are also described.
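
The delayed acknowledgement behaviour summarized above can be illustrated with a minimal, single-process sketch. It is not the described implementation: the names WorkflowEngine, CertifiedSender, EngineFailure, ledger, and submit are hypothetical, an in-memory dictionary stands in for the persistent record, and a simple rotation over a list stands in for the distributed queue group.

```python
from dataclasses import dataclass, field


class EngineFailure(Exception):
    """Simulated hardware or software failure during workflow execution."""


@dataclass
class WorkflowEngine:
    """Hypothetical member of the queue group."""
    name: str
    healthy: bool = True

    def execute(self, request: str) -> str:
        # A failed engine never finishes, so it never acknowledges.
        if not self.healthy:
            raise EngineFailure(f"{self.name} failed while running {request!r}")
        return f"{request} completed by {self.name}"


@dataclass
class CertifiedSender:
    """Hypothetical sender that keeps a record of each unacknowledged message."""
    engines: list
    ledger: dict = field(default_factory=dict)  # assumption: stand-in for persistent storage
    next_id: int = 0

    def submit(self, request: str) -> str:
        msg_id = self.next_id
        self.next_id += 1
        self.ledger[msg_id] = request  # record is kept until an acknowledgement arrives

        # Rotate the starting engine per message, then try the others on failure.
        for offset in range(len(self.engines)):
            engine = self.engines[(msg_id + offset) % len(self.engines)]
            try:
                result = engine.execute(request)
            except EngineFailure:
                continue  # no acknowledgement: reassign to a different engine
            # Delayed acknowledgement: the record is cleared only after completion.
            del self.ledger[msg_id]
            return result

        # Nothing completed; the request stays in the ledger for a later retry.
        raise RuntimeError("no engine completed the workflow")


if __name__ == "__main__":
    sender = CertifiedSender([WorkflowEngine("engine-a", healthy=False),
                              WorkflowEngine("engine-b")])
    print(sender.submit("order-1"))
    print(sender.ledger)
```

In this sketch the record is removed only after an engine reports completion, so a failure partway through a workflow leaves the request recorded and eligible for reassignment. A real system would persist the record durably and detect failures through timeouts or heartbeats rather than exceptions.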