The method of the present invention is useful in a computer system including
at least two server nodes, each of which can execute clustered server software.
The clustered server software executes a method for monitoring failure situations to reduce downtime.
The method includes the step of detecting an event causing one of the failure situations,
and then determining whether the event affects one of the server nodes. If
the event does affect one of the server nodes, the method then
determines whether the event exceeds a threshold value. If the event
exceeds the threshold value, the method executes a proactive failover. If the event
is not specific to a cluster node, but indicates an impending or actual failure
of the cluster software, the method identifies and initiates an appropriate action
to fix the condition or provide a workaround (if available) that will preempt an
impending failure of the cluster system or enable a restart of the failed
cluster software.
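
The decision flow summarized above can be illustrated with a minimal sketch. It is not the claimed implementation: the names `Event`, `Action`, and `handle_event`, the numeric `severity` field, and the choice to keep monitoring when an event stays at or below the threshold are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Action(Enum):
    """Hypothetical outcomes of evaluating one detected event."""
    MONITOR = "monitor"                        # assumed default: keep watching
    PROACTIVE_FAILOVER = "proactive_failover"  # move work off the affected node
    REMEDIATE = "remediate"                    # fix, work around, or restart


@dataclass
class Event:
    """Hypothetical event record; field names are not from the source text."""
    name: str
    severity: float                            # assumed numeric measure
    node: Optional[str] = None                 # affected server node, if any
    threatens_cluster_software: bool = False   # impending or actual failure


def handle_event(event: Event, nodes: Set[str], threshold: float) -> Action:
    """Choose an action for a detected event, following the described flow."""
    # Step 1: does the event affect one of the server nodes?
    if event.node is not None and event.node in nodes:
        # Step 2: does the event exceed the threshold value?
        if event.severity > threshold:
            return Action.PROACTIVE_FAILOVER
        # Assumption: below-threshold node events are only monitored.
        return Action.MONITOR
    # Step 3: not node-specific, but the cluster software itself is at risk.
    if event.threatens_cluster_software:
        return Action.REMEDIATE
    return Action.MONITOR


if __name__ == "__main__":
    cluster_nodes = {"node-a", "node-b"}
    disk_event = Event("disk_errors", severity=0.9, node="node-a")
    print(handle_event(disk_event, cluster_nodes, threshold=0.75))
```

In this sketch the threshold comparison is applied only to node-specific events, mirroring the order of the steps in the text; how severity is measured and which remedial action is selected are left open, as they are in the summary.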