The web page clustering techniques described herein are URL clustering and
page clustering, in which clustering algorithms group pages that are
structurally similar. Regarding URL clustering, because
similarly structured pages have similar patterns in their URLs, grouping
similar URL patterns will group structurally similar pages. Embodiments
of URL clustering may involve: (a) URL normalization and (b) URL
variation computation. Regarding page clustering, page feature-based
techniques further cluster any given set of homogeneous clusters, reducing
the number of clusters based on the underlying page code. Embodiments of
page clustering may reduce the number of clusters based on the tag
probabilities and the tag sequence, utilizing an Approximate Nearest
Neighborhood (ANN) graph along with evaluation of intra-cluster and
inter-cluster compactness.
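
As an illustration of the URL clustering idea only, the following is a minimal Python sketch that groups URLs by a normalized pattern. The function names (normalize_url, cluster_urls) and the normalization rules (lowercasing the host, replacing all-digit path segments with a placeholder, keeping only query parameter names) are assumptions made for this example and are not taken from the embodiments; the sketch does not attempt to reproduce the URL variation computation.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a coarse pattern (hypothetical rules, for illustration)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Replace all-digit path segments (e.g. item IDs) with a placeholder.
    segments = [s for s in parts.path.split("/") if s]
    segments = ["<num>" if s.isdigit() else s for s in segments]
    # Keep only the query parameter names, sorted, and drop their values.
    keys = sorted(p.split("=", 1)[0] for p in parts.query.split("&") if p)
    pattern = host + "/" + "/".join(segments)
    if keys:
        pattern += "?" + "&".join(keys)
    return pattern


def cluster_urls(urls):
    """Group URLs whose normalized patterns are identical."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[normalize_url(url)].append(url)
    return dict(clusters)


if __name__ == "__main__":
    sample = [
        "http://Example.com/product/123?ref=a",
        "http://example.com/product/456?ref=b",
        "http://example.com/about",
    ]
    for pattern, members in cluster_urls(sample).items():
        print(pattern, members)
```

Under these assumed rules, the two product URLs fall into one cluster and the remaining URL into another. A practical system would likely need richer normalization than this sketch provides.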