Duplicate document detection in a web crawler system

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the identified set of documents is merged into information identifying a new set of documents. Duplicate documents are included in or excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
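The flow described above can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: "same content" is approximated by an exact SHA-256 fingerprint of the text, the query-independent metric is a hypothetical numeric `score` supplied with each document, set membership is capped by a hypothetical `max_set_size`, and the "predefined conditions" for choosing the representative are taken to be "highest score, ties broken by URL". The names `Doc` and `DuplicateIndex` are likewise invented here.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Doc:
    url: str
    content: str
    score: float  # hypothetical query-independent metric (assumption)


@dataclass
class DuplicateIndex:
    max_set_size: int = 3  # hypothetical cap on duplicates kept per set
    # fingerprint -> list of (url, score), best-ranked first
    sets: dict = field(default_factory=dict)

    @staticmethod
    def fingerprint(content: str) -> str:
        # Assumption: exact-match fingerprint stands in for "same content".
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def add(self, doc: Doc):
        key = self.fingerprint(doc.content)

        # 1. Identify the existing set (if any) sharing this content.
        existing = self.sets.get(key, [])

        # 2. Merge the new document's identifying information into it.
        merged = dict(existing)
        merged[doc.url] = doc.score

        # 3. Include or exclude duplicates based on the metric alone:
        #    rank by descending score, then URL, and keep the top entries.
        ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
        kept = ranked[: self.max_set_size]
        self.sets[key] = kept

        # 4. Assumed condition: the best-ranked member is the representative.
        representative = kept[0][0]
        return representative, [url for url, _ in kept]


index = DuplicateIndex(max_set_size=2)
index.add(Doc("http://a.example/page", "same text", 0.4))
index.add(Doc("http://b.example/copy", "same text", 0.9))
rep, members = index.add(Doc("http://c.example/mirror", "same text", 0.1))
print(rep, members)
```

In this usage, the third document has the lowest score and is excluded from the set, while the highest-scoring copy is reported as the representative. A production crawler would likely need a near-duplicate fingerprint and persistent storage, neither of which is addressed here.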


