Techniques for correcting miscategorized features excerpted from web pages
are provided. For each of several categories and each of several pages on a
particular web site, a feature may be excerpted from that page
and associated with that page in relation to that category. Often, many
of the "high confidence" features that have been associated with the same
category are found to be associated with similar characteristics
regardless of the pages from which those features were excerpted. Thus, a
set of category characteristics that are commonly associated with
the "high confidence" features in a particular category may be
determined. For each page, a candidate feature that is associated with
the set of category characteristics may be identified in that page. If,
in relation to the particular category, a feature other than the
candidate feature is associated with that page, then that other feature
may be replaced by the candidate feature.
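
The correction described above can be illustrated with a minimal Python sketch for a single category. It is not the claimed implementation: the `Feature` type, the numeric confidence scores, the 0.8 confidence threshold, and the "shared by more than half of the high-confidence features" rule are all assumptions made for illustration, since the text does not say how confidence is measured or how the category characteristics are derived.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical excerpted feature: its text plus observed characteristics."""
    text: str
    characteristics: frozenset  # e.g. {("tag", "h1"), ("bold", True)}


def category_characteristics(assigned, threshold=0.8, min_share=0.5):
    """Return characteristics shared by more than min_share of the
    high-confidence features (threshold and min_share are assumed values)."""
    confident = [f for f, conf in assigned.values() if conf >= threshold]
    counts = Counter(c for f in confident for c in f.characteristics)
    return frozenset(c for c, n in counts.items() if n > min_share * len(confident))


def correct(assigned, candidates, threshold=0.8):
    """For each page, keep or replace the feature assigned for this category.

    assigned:   page -> (Feature, confidence)
    candidates: page -> list of Features that could be excerpted from the page
    """
    traits = category_characteristics(assigned, threshold)
    corrected = {}
    for page, (feature, _conf) in assigned.items():
        # Candidate feature: first one exhibiting every category characteristic.
        match = next(
            (c for c in candidates[page] if traits and traits <= c.characteristics),
            None,
        )
        # Replace the assigned feature only when a candidate was found.
        corrected[page] = match if match is not None else feature
    return corrected


# Usage with made-up data for a hypothetical "title" category.
heading = frozenset({("tag", "h1"), ("bold", True)})
title_a = Feature("Acme Widget", heading)
title_b = Feature("Acme Gadget", heading)
wrong_c = Feature("Free shipping!", frozenset({("tag", "p")}))
right_c = Feature("Acme Gizmo", heading)

assigned = {"a": (title_a, 0.95), "b": (title_b, 0.90), "c": (wrong_c, 0.30)}
candidates = {"a": [title_a], "b": [title_b], "c": [wrong_c, right_c]}
result = correct(assigned, candidates)
```

In this made-up example, pages `a` and `b` supply the high-confidence features, so the derived category characteristics are the `h1` tag and bold styling. Page `c` was assigned a feature lacking those characteristics, and the sketch replaces it with the candidate on that page that has them.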