A method for duplicate detection on web-scale data in a supercomputing
environment includes computing a hash of at least one document in a
computer system to generate data packets from the at least one document
and to generate a fixed-size tuple of information from the at least one
document, distributing the data packets to each node of a plurality of
nodes, applying localized detection techniques to data packets on each
node of the plurality of nodes to remove data packet duplicates,
redistributing the data packets to each node of the plurality of nodes
based on a document fingerprint, reapplying the localized detection
techniques on each node to the redistributed packets to remove exact data
packet duplicates, and performing a global merge of results of the
localized detection techniques in a distributed fashion.
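
The sequence of steps recited above may be easier to follow as a small single-process sketch. The following Python is illustrative only and is not the claimed implementation: nodes are simulated as in-memory lists, SHA-256 stands in for the unspecified hash, the initial distribution is assumed to be round-robin, the "fixed size tuple" is assumed to be a (fingerprint, document id) pair, and every function name is hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    # Assumption: SHA-256 of the document text serves as the fingerprint.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_packet(doc_id: str, text: str) -> tuple:
    # Assumption: the tuple holds exactly two fields, (fingerprint, doc_id).
    return (fingerprint(text), doc_id)


def local_dedup(packets: list) -> list:
    # Localized detection on one node: keep the first packet per fingerprint.
    seen = {}
    for fp, doc_id in packets:
        seen.setdefault(fp, doc_id)
    return list(seen.items())


def distribute_round_robin(packets: list, num_nodes: int) -> list:
    # Initial distribution (assumed round-robin), ignoring packet content.
    nodes = [[] for _ in range(num_nodes)]
    for i, packet in enumerate(packets):
        nodes[i % num_nodes].append(packet)
    return nodes


def redistribute_by_fingerprint(nodes: list, num_nodes: int) -> list:
    # Route each packet by its fingerprint so that identical documents
    # are guaranteed to land on the same node.
    out = [[] for _ in range(num_nodes)]
    for node in nodes:
        for fp, doc_id in node:
            out[int(fp, 16) % num_nodes].append((fp, doc_id))
    return out


def deduplicate(docs: list, num_nodes: int = 4) -> list:
    packets = [make_packet(doc_id, text) for doc_id, text in docs]
    nodes = distribute_round_robin(packets, num_nodes)
    nodes = [local_dedup(n) for n in nodes]            # first local pass
    nodes = redistribute_by_fingerprint(nodes, num_nodes)
    nodes = [local_dedup(n) for n in nodes]            # second local pass
    # Global merge: fingerprints are now partitioned across nodes, so
    # concatenating the per-node results cannot reintroduce duplicates.
    merged = [packet for n in nodes for packet in n]
    return sorted(doc_id for _, doc_id in merged)


docs = [("a", "hello"), ("b", "world"), ("c", "hello"),
        ("d", "hello"), ("e", "world")]
survivors = deduplicate(docs, num_nodes=2)
```

In this sketch the first local pass only reduces the volume of packets that must be moved between nodes; it is the fingerprint-based redistribution that makes the second pass sufficient to remove every exact duplicate. In a real supercomputing environment the redistribution would presumably be an all-to-all exchange between processes rather than a list copy.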