Method for duplicate detection on web-scale data in supercomputing environments

A method for duplicate detection on web-scale data in a supercomputing environment having a plurality of nodes includes: computing a hash of at least one document in a computer system to generate data packets and a fixed-size tuple of information from the at least one document, the hash serving as a document fingerprint; distributing the data packets across the plurality of nodes; applying localized detection techniques to the data packets on each node to remove data packet duplicates; redistributing the data packets across the nodes based on the document fingerprint; reapplying the localized detection techniques on each node to the redistributed packets to remove exact data packet duplicates; and performing a global merge of the results of the localized detection techniques in a distributed fashion.
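The steps in the abstract can be illustrated with a minimal single-process sketch. This is not the patented implementation: the nodes are simulated as in-memory lists, SHA-256 is an assumed choice of hash, modulo partitioning is an assumed redistribution rule, and the names `Packet`, `make_packet`, `local_dedup`, and `find_unique` are hypothetical.

```python
import hashlib
from typing import NamedTuple


class Packet(NamedTuple):
    """Fixed-size tuple describing one document (hypothetical layout)."""
    fingerprint: int  # 64-bit value derived from the document hash
    doc_id: int       # position of the document in the input
    length: int       # document length in characters


def make_packet(doc_id: int, text: str) -> Packet:
    # Assumption: SHA-256 truncated to 8 bytes stands in for the document hash.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return Packet(int.from_bytes(digest[:8], "big"), doc_id, len(text))


def local_dedup(packets: list[Packet]) -> list[Packet]:
    """Keep the lowest-doc_id packet for each fingerprint seen on one node."""
    seen: set[int] = set()
    kept: list[Packet] = []
    for p in sorted(packets, key=lambda p: p.doc_id):
        if p.fingerprint not in seen:
            seen.add(p.fingerprint)
            kept.append(p)
    return kept


def find_unique(docs: list[str], num_nodes: int = 4) -> list[int]:
    """Return the doc_ids that survive duplicate removal."""
    packets = [make_packet(i, text) for i, text in enumerate(docs)]

    # Distribute packets across the simulated nodes (round-robin).
    nodes = [packets[i::num_nodes] for i in range(num_nodes)]

    # First local pass: remove duplicates that happen to share a node.
    nodes = [local_dedup(node) for node in nodes]

    # Redistribute by fingerprint so equal fingerprints meet on one node.
    shuffled: list[list[Packet]] = [[] for _ in range(num_nodes)]
    for node in nodes:
        for p in node:
            shuffled[p.fingerprint % num_nodes].append(p)

    # Second local pass: remove the remaining exact duplicates.
    nodes = [local_dedup(node) for node in shuffled]

    # Global merge: combine the per-node results.
    return sorted(p.doc_id for node in nodes for p in node)
```

Calling `find_unique(["a", "b", "a", "c", "b", "a"])` keeps the first occurrence of each distinct document. In a real cluster the redistribution step would be an all-to-all exchange between nodes rather than list appends, and the first local pass serves mainly to reduce the volume of data sent in that exchange.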
