Different URLs that actually reference the same web page or other web
resource are detected, and that information is used to download only one
instance of each web page or web resource from a web site. All web pages
or web resources downloaded from a web server are compared to identify
those that are substantially identical. Once substantially identical web
pages or web resources with different URLs are found, the URLs are
analyzed to determine which portions of the URL are essential for
identifying a particular web page or web resource, and which portions are
irrelevant.
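By way of illustration, the following sketch groups downloaded pages into equivalence classes by a content fingerprint; any fingerprint shared by more than one URL marks a set of duplicate downloads. The whitespace normalization and the helper names (`fingerprint`, `build_equivalence_classes`) are assumptions made for the example, not details taken from the description above.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(html):
    """Hash the page content after collapsing whitespace, so trivially
    different renderings of the same page still hash identically."""
    normalized = re.sub(r"\s+", " ", html).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_equivalence_classes(pages):
    """Map each content fingerprint to the URLs that produced it; any
    fingerprint with more than one URL marks a set of duplicates."""
    classes = defaultdict(list)
    for url, html in pages.items():
        classes[fingerprint(html)].append(url)
    return {fp: urls for fp, urls in classes.items() if len(urls) > 1}
```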
Once this analysis has been performed for each set of substantially
identical web pages or web resources (each such set is also referred to
herein as an "equivalence class"), the resulting per-equivalence-class
rules are generalized to trans-equivalence-class rules. There are thus
two rule-learning steps: in step (1), it is learned, for each equivalence
class, which portions of the URLs in that class are relevant for
selecting the page and which are not; in step (2), the
per-equivalence-class rules constructed in step (1) are generalized to
rules that cover many equivalence classes. Both steps are sketched below.
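A minimal sketch of step (1), under the assumption that the varying URL portions are query parameters: every URL within one equivalence class returns the same page, so any parameter that changes value (or is sometimes absent) across those URLs cannot be essential to selecting the page. The function name and the parameter-only focus are illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlparse


def irrelevant_params(class_urls):
    """Return the query parameters that vary across the URLs of one
    equivalence class; such parameters did not affect which page was
    served and are therefore not essential for selecting it."""
    param_maps = [dict(parse_qsl(urlparse(u).query)) for u in class_urls]
    names = set().union(*param_maps)
    irrelevant = set()
    for name in names:
        # None marks URLs on which the parameter was absent entirely.
        values = {pm.get(name) for pm in param_maps}
        if len(values) > 1:
            irrelevant.add(name)
    return irrelevant
```

Applied to a class containing `/item?id=7&sid=abc` and `/item?id=7&sid=xyz`, this returns `{"sid"}`: `id` was constant and so may be essential, while `sid` varied without changing the page.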
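For step (2), one plausible generalization policy, assumed here purely for illustration, is to promote a parameter to a trans-equivalence-class rule only when it was found irrelevant in every class in which it occurred, and occurred in at least a minimum number of classes:

```python
def generalize_rules(per_class_findings, min_support=3):
    """per_class_findings is a list of (params_present, params_irrelevant)
    pairs, one pair per equivalence class. A parameter is promoted to a
    trans-equivalence-class rule only if it was irrelevant everywhere it
    appeared and appeared in at least min_support classes."""
    support = {}
    vetoed = set()
    for present, irrelevant in per_class_findings:
        for name in present:
            if name in irrelevant:
                support[name] = support.get(name, 0) + 1
            else:
                vetoed.add(name)  # essential in this class: never strip it
    return {name for name, count in support.items()
            if count >= min_support and name not in vetoed}
```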
Once a rule is determined, it is applied to the corresponding class of
web pages or web resources to check for errors. If no errors are found,
the rule is activated and is thereafter used by the web crawler during
future crawling to avoid downloading duplicative web pages or web
resources.
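Finally, a sketch of validation and activation, assuming the learned rule takes the form "strip these query parameters." One natural reading of "errors," adopted here as an assumption, is that the rule maps URLs from two different equivalence classes onto the same canonical form, conflating genuinely distinct pages; only a rule free of such collisions is activated and consulted by the crawler.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def canonicalize(url, strip):
    """Rewrite a URL with the irrelevant parameters removed and the
    remaining parameters sorted into a stable order."""
    parts = urlparse(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in strip)
    return urlunparse(parts._replace(query=urlencode(kept)))


def rule_is_safe(classes, strip):
    """A rule passes validation only if no canonical form is shared by
    URLs drawn from two different equivalence classes."""
    owner = {}
    for class_id, urls in classes.items():
        for url in urls:
            canon = canonicalize(url, strip)
            if owner.setdefault(canon, class_id) != class_id:
                return False  # rule conflates two distinct pages: an error
    return True


seen = set()  # canonical forms of everything fetched so far


def should_fetch(url, strip):
    """Once the rule is activated, the crawler consults it before each
    download and skips URLs whose canonical form was already fetched."""
    canon = canonicalize(url, strip)
    if canon in seen:
        return False
    seen.add(canon)
    return True
```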