Data center activity traces form a corpus used for machine learning. The
data in the corpus are putatively normal but may be tainted with latent
anomalies. There is a statistical likelihood that the corpus represents
predominantly legitimate activity, and this likelihood is exploited to
allow a targeted examination of only the data that may represent
anomalous activity. The corpus is separated into clusters having members
with similar features. The clusters having the fewest members are
identified, as these clusters represent potentially anomalous activities.
These clusters are evaluated to determine whether they represent actual
anomalous activities. The data from the clusters representing actual
anomalous activities are excluded from the corpus. As a result, machine
learning on the cleaned corpus is more effective and the trained system
performs better, since latent anomalies are not mistaken for normal activity.
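
The steps described above (cluster the corpus, pick out the smallest clusters, evaluate them, and exclude the confirmed anomalies) can be illustrated with a minimal Python sketch. It is not the method of the source, which names no clustering algorithm, feature representation, or evaluation procedure. The sketch assumes that traces are already reduced to numeric feature tuples, swaps in a simple greedy "leader" clustering with a distance radius, treats any cluster holding at most a chosen fraction of the corpus as a candidate, and takes the evaluation step as a caller-supplied function. The names `leader_cluster`, `clean_corpus`, `radius`, `max_fraction`, and `is_anomalous` are hypothetical, as are the sample data.

```python
from math import dist


def leader_cluster(points, radius):
    """Greedy clustering (an assumption of this sketch): each point joins the
    first cluster whose leader lies within `radius`, else it starts a new one.
    Returns a list of clusters, each a list of indices into `points`."""
    leaders = []
    clusters = []
    for i, p in enumerate(points):
        for c, leader in enumerate(leaders):
            if dist(p, leader) <= radius:
                clusters[c].append(i)
                break
        else:
            # No existing leader is close enough: p becomes a new leader.
            leaders.append(p)
            clusters.append([i])
    return clusters


def clean_corpus(points, radius, max_fraction, is_anomalous):
    """Drop the members of small clusters that the evaluator confirms as anomalous."""
    clusters = leader_cluster(points, radius)
    # Hypothetical rule for "fewest members": at most this share of the corpus.
    limit = max_fraction * len(points)
    excluded = set()
    for members in clusters:
        if len(members) <= limit and is_anomalous([points[i] for i in members]):
            excluded.update(members)
    # Large clusters are never evaluated; they are presumed legitimate.
    return [p for i, p in enumerate(points) if i not in excluded]


# Made-up feature vectors: five similar traces and two isolated ones.
corpus = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (1.0, 0.8),
          (9.0, 9.5), (5.0, 5.0)]

# Stand-in evaluator; in practice this might be an analyst or a rule engine.
def looks_anomalous(members):
    return all(max(p) > 8 for p in members)

cleaned = clean_corpus(corpus, radius=1.0, max_fraction=0.2,
                       is_anomalous=looks_anomalous)
print(cleaned)
```

In this sketch both isolated traces land in single-member clusters and are evaluated, but only the one the stand-in evaluator confirms is removed. The other stays in the corpus, reflecting that small clusters are only potentially anomalous until evaluated.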