A two-phase method to perform root-cause analysis over an
enterprise-specific fault model is described. In the first phase, an
up-stream analysis is performed (beginning at a node generating an alarm
event) to identify one or more nodes that may be in failure. In the
second phase, a down-stream analysis is performed to identify those nodes
in the enterprise whose operational condition is impacted by the previously
determined failed nodes. Nodes identified as failed as a result of the
up-stream analysis may be reported to a user as failed. Nodes identified
as impacted as a result of the down-stream analysis may be reported to a
user as impacted and, beneficially, any failure alarms associated with
those impacted nodes may be masked. Up-stream (phase 1) analysis is
driven by inference policies associated with various nodes in the
enterprise's fault model. An inference policy is a rule, or set of rules,
for inferring the status or condition of a fault model node based on the
status or condition of the node's immediately down-stream neighboring
nodes. Similarly, down-stream (phase 2) analysis is driven by impact
policies associated with various nodes in the enterprise's fault model.
An impact policy is a rule, or set of rules, for assessing the impact on
a fault model node based on the status or condition of the node's
immediately up-stream neighboring nodes.
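
The two phases described above can be illustrated with a minimal Python sketch. It is not the described method itself, only one possible reading of it under stated assumptions. The topology, the node names, the status strings ("ok", "failed", "impacted"), the default policies `all_failed` and `any_degraded`, and the rule that an examined node is reported as failed when none of its up-stream neighbors is inferred to have failed are all hypothetical choices made for illustration.

```python
from collections import defaultdict, deque

# Hypothetical topology: each pair is (up-stream node, down-stream node).
EDGES = [
    ("router", "switch_a"), ("router", "switch_b"),
    ("switch_a", "server1"), ("switch_a", "server2"),
    ("switch_b", "server3"),
]

def build(edges):
    """Return (up, down) adjacency maps for the fault model."""
    up, down = defaultdict(list), defaultdict(list)
    for parent, child in edges:
        down[parent].append(child)
        up[child].append(parent)
    return up, down

def all_failed(statuses):
    """Assumed default inference policy: every down-stream neighbor failed."""
    return bool(statuses) and all(s == "failed" for s in statuses)

def any_degraded(statuses):
    """Assumed default impact policy: some up-stream neighbor failed or impacted."""
    return any(s in ("failed", "impacted") for s in statuses)

def upstream_analysis(up, down, alarms, start, inference=None):
    """Phase 1: walk up-stream from the alarming node; return inferred failed nodes."""
    inference = inference or {}  # optional per-node policy overrides
    status = defaultdict(lambda: "ok", {n: "failed" for n in alarms})
    failed, seen, queue = set(), {start}, deque([start])
    while queue:
        node = queue.popleft()
        parent_inferred = False
        for parent in up[node]:
            policy = inference.get(parent, all_failed)
            # The policy sees only the parent's immediately down-stream neighbors.
            if policy([status[c] for c in down[parent]]):
                status[parent] = "failed"
                parent_inferred = True
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        if not parent_inferred:
            # Assumption: with no failed up-stream neighbor, this node is the failure.
            failed.add(node)
    return failed

def downstream_analysis(up, down, failed, impact=None):
    """Phase 2: walk down-stream from the failed nodes; return impacted nodes."""
    impact = impact or {}  # optional per-node policy overrides
    status = defaultdict(lambda: "ok", {n: "failed" for n in failed})
    impacted, queue = set(), deque(failed)
    while queue:
        node = queue.popleft()
        for child in down[node]:
            if child in failed or child in impacted:
                continue
            policy = impact.get(child, any_degraded)
            # The policy sees only the child's immediately up-stream neighbors.
            if policy([status[p] for p in up[child]]):
                status[child] = "impacted"
                impacted.add(child)
                queue.append(child)
    return impacted

up, down = build(EDGES)
alarms = {"server1", "server2"}           # nodes currently raising alarm events
failed = upstream_analysis(up, down, alarms, start="server1")
impacted = downstream_analysis(up, down, failed)
masked = alarms & impacted                # alarms that may be masked
print(sorted(failed))    # ['switch_a']
print(sorted(impacted))  # ['server1', 'server2']
```

In this hypothetical run both servers below `switch_a` are alarming, so the assumed inference policy marks `switch_a` as failed. The walk stops at `router` because `switch_b` is still healthy. Phase 2 then marks the two servers as impacted, and their alarms are candidates for masking.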