Errors occurring in computing clusters and other computing systems can
impact system performance. Each error has an error type, and each error
type has a base cost that estimates the importance of correcting the error. Each
error type also has a confidence indicating the level of agreement
between those who fix the errors and those who assigned the base cost. An
error type's actual cost is produced using the base cost and confidence.
An error cascade map contains estimates of the likelihood that one error
type will cause another. An error type that causes other error types has a
cascade cost.
When an error type is detected, a repair order can be generated, depending
on the cost involved. Repairs are then performed. Feedback mechanisms and
correlations can be used to update the confidences and the error cascade
map.
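
The flow described above can be illustrated with a minimal Python sketch. The abstract does not specify any formulas, so the following are assumptions made only for illustration: the actual cost is taken as the base cost multiplied by the confidence, the cascade cost is taken as the expected actual cost of the directly caused error types, a repair order is generated when the combined cost meets a threshold, and confidence is updated with a simple moving average. All names (ErrorType, cascade_map, update_confidence, the example error types, and the threshold) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorType:
    name: str
    base_cost: float   # estimated importance of correcting the error
    confidence: float  # agreement level, assumed to lie in [0, 1]


def actual_cost(et: ErrorType) -> float:
    # Assumption: actual cost is the base cost scaled by the confidence.
    return et.base_cost * et.confidence


def cascade_cost(name: str, types: dict, cascade_map: dict) -> float:
    # Assumption: sum, over directly caused error types, of the estimated
    # probability of causing that type times its actual cost.
    caused = cascade_map.get(name, {})
    return sum(p * actual_cost(types[child]) for child, p in caused.items())


def should_generate_repair_order(name, types, cascade_map, threshold) -> bool:
    # Assumption: a repair order is generated when actual cost plus
    # cascade cost meets or exceeds a fixed threshold.
    total = actual_cost(types[name]) + cascade_cost(name, types, cascade_map)
    return total >= threshold


def update_confidence(et: ErrorType, agreed: bool, rate: float = 0.1) -> None:
    # Assumption: feedback moves confidence toward 1.0 when the repairer
    # agrees with the base cost and toward 0.0 otherwise.
    target = 1.0 if agreed else 0.0
    et.confidence += rate * (target - et.confidence)


# Hypothetical example data.
types = {
    "disk_failure": ErrorType("disk_failure", base_cost=10.0, confidence=0.5),
    "node_offline": ErrorType("node_offline", base_cost=8.0, confidence=0.5),
}
# cascade_map[a][b] is the estimate that error type a will cause error type b.
cascade_map = {"disk_failure": {"node_offline": 0.25}}

order_disk = should_generate_repair_order("disk_failure", types, cascade_map, 5.5)
order_node = should_generate_repair_order("node_offline", types, cascade_map, 5.5)
```

In this sketch, the hypothetical disk_failure type crosses the threshold only because its cascade cost is added to its actual cost, which is the kind of effect the cascade map is meant to capture. A real system might weight or combine these quantities quite differently.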