Techniques are disclosed for detecting web pages with duplicate content.
In one embodiment, a set of shingles is computed for each page of a group
of pages. An aggregate set of shingles is determined based on the sets of
shingles computed for the group of pages. A first subset of the
aggregate set is determined by selecting the shingles whose frequencies
in the aggregate set exceed a specified threshold. A modified set of
shingles is generated for each page of the
group of pages by removing, from the set of shingles for that page, any
shingle included in the first subset. One or more duplicate pages in the
group of pages are determined based at least in part on the modified sets
of shingles generated for the group of pages.
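
By way of illustration only, the steps summarized above may be sketched in
Python as follows. The sketch is not the disclosed implementation. It
assumes word-level shingles of length k, measures a shingle's frequency as
the number of pages in the group that contain it, and compares modified
sets using Jaccard similarity. The function names, the parameters k,
freq_threshold, and sim_threshold, and their default values are all
hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles for one page (word-level is an assumption)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}


def frequent_shingles(shingle_sets, freq_threshold):
    """Aggregate all per-page sets and select shingles whose frequency exceeds the threshold.

    Frequency here is the number of pages containing the shingle (an assumption).
    """
    counts = Counter()
    for page_set in shingle_sets:
        counts.update(page_set)
    return {sh for sh, n in counts.items() if n > freq_threshold}


def jaccard(a, b):
    """Jaccard similarity; two empty sets are treated as not comparable (0.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicates(pages, k=3, freq_threshold=2, sim_threshold=0.8):
    """Return sorted page-name pairs whose modified shingle sets are similar enough."""
    sets = {name: shingles(text, k) for name, text in pages.items()}
    # First subset: shingles that are too common across the group (e.g. boilerplate).
    common = frequent_shingles(sets.values(), freq_threshold)
    # Modified set for each page: its shingles minus the first subset.
    modified = {name: s - common for name, s in sets.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(modified), 2)
        if jaccard(modified[a], modified[b]) >= sim_threshold
    ]
```

In this sketch, shingles that occur on many pages of the group, such as
shared navigation text, are removed before comparison, so that similarity
is driven by the content particular to each page rather than by text the
pages have in common.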