Systems and methods establish groups among numerous indications of failure
in order to infer a cause of failure common to each group. In one
implementation, a system computes the groups such that each group has the
maximum likelihood of resulting from a common failure. Indications of
failure are grouped by probability, even when a group's inferred cause of
failure is not directly observable in the system. In one implementation,
a set of related matrices organizes the numerous health indications
received from each of numerous autonomous systems connected to the
Internet. A correlational matrix links input (failure symptoms) and
output (known or unknown root causes) through probability-based
hypothetical groupings of the failure indications. The matrices are
iteratively refined according to self-consistency and parsimony metrics
to yield the most likely groupings of indications and the most likely
causes of failure.
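
The grouping idea summarized above can be illustrated with a minimal sketch. It is not the claimed implementation; it only shows one way that a likelihood term and a parsimony term might jointly drive the grouping of failure indications around causes that are never observed directly. Everything named here is an assumption made for illustration: the constants NOISE and ACTIVATION_COST, the functions group_score and infer_groups, the majority-vote estimate of each hidden cause, and the greedy merge loop that stands in for iterative refinement.

```python
import math
from itertools import combinations

NOISE = 0.05           # assumed chance that an indication disagrees with its cause
ACTIVATION_COST = 2.0  # assumed parsimony penalty per hypothesized cause activation


def group_score(members, obs):
    """Log-likelihood of one hypothetical group minus its parsimony penalty.

    obs is a list of time steps; each time step is a list of 0/1 failure
    indications. The group's hidden cause is never observed: its state at
    each time step is estimated by majority vote of the member indications.
    """
    score = 0.0
    for row in obs:
        failing = sum(row[i] for i in members)
        healthy = len(members) - failing
        active = failing > healthy            # estimated hidden-cause state
        agree = failing if active else healthy
        disagree = len(members) - agree
        score += agree * math.log(1 - NOISE) + disagree * math.log(NOISE)
        if active:
            score -= ACTIVATION_COST          # each postulated failure costs something
    return score


def infer_groups(obs):
    """Greedily merge groups while a merge improves the total score."""
    groups = [frozenset([i]) for i in range(len(obs[0]))]
    while True:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(groups, 2):
            gain = (group_score(a | b, obs)
                    - group_score(a, obs) - group_score(b, obs))
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break                             # no merge helps; grouping is stable
        a, b = best_pair
        groups = [g for g in groups if g not in (a, b)] + [a | b]
    return sorted(sorted(g) for g in groups)


# Hypothetical data: indications 0-2 fail together, as do indications 3-4.
observations = [
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
print(infer_groups(observations))  # [[0, 1, 2], [3, 4]]
```

In this sketch, merging indications that fail together removes redundant cause activations and so raises the score, while merging indications that fail at different times forces costly disagreements and lowers it. The two recovered groups each correspond to one inferred, unobserved cause.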