A scale-out supercomputing environment includes a plurality of
interconnected nodes arranged in a three-dimensional cubic grid and
configured to perform a method of duplicate detection. The method
includes at least: computing a fingerprint of at least one document in the
supercomputing environment to generate data packets and a fixed-size tuple
of information from the at least one document; distributing the data
packets to each node of the plurality of nodes to ensure that all elements
of the fixed-size tuple fit into memory of the plurality of nodes; applying
localized detection techniques to the data packets on each node of the
plurality of nodes to remove data packet duplicates; redistributing the
data packets to each node of the plurality of nodes based on the document
fingerprint; and performing a global merge of results of the localized
detection techniques.
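
The sequence of steps above can be sketched in a single process. This is a minimal illustration under stated assumptions, not the claimed implementation: nodes are simulated as Python lists, the fingerprint is assumed to be a SHA-1 hash of whitespace-normalized, lowercased text, the fixed-size tuple is assumed to be (fingerprint, document id, length), and every function name and the `side` parameter are hypothetical.

```python
import hashlib

def fingerprint(text):
    # Assumed fingerprint: SHA-1 of lowercased, whitespace-normalized text.
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def make_packet(doc_id, text):
    # Assumed fixed-size tuple: (fingerprint, document id, length).
    return (fingerprint(text), doc_id, len(text))

def distribute(packets, num_nodes):
    # Round-robin placement so each simulated node holds a bounded share.
    nodes = [[] for _ in range(num_nodes)]
    for i, packet in enumerate(packets):
        nodes[i % num_nodes].append(packet)
    return nodes

def local_dedup(node_packets):
    # Localized detection: keep one packet per fingerprint on this node,
    # choosing the smallest document id so the result is deterministic.
    best = {}
    for packet in node_packets:
        fp = packet[0]
        if fp not in best or packet[1] < best[fp][1]:
            best[fp] = packet
    return list(best.values())

def redistribute(nodes):
    # Route every packet to the node selected by its fingerprint, so equal
    # fingerprints always meet on the same node.
    num_nodes = len(nodes)
    routed = [[] for _ in range(num_nodes)]
    for node_packets in nodes:
        for packet in node_packets:
            routed[int(packet[0], 16) % num_nodes].append(packet)
    return routed

def detect_duplicates(docs, side=2):
    # side**3 simulated nodes stand in for a cubic grid of side x side x side.
    num_nodes = side ** 3
    packets = [make_packet(doc_id, text) for doc_id, text in docs.items()]
    nodes = [local_dedup(n) for n in distribute(packets, num_nodes)]
    nodes = [local_dedup(n) for n in redistribute(nodes)]
    # Global merge: concatenate the per-node survivors.
    return sorted(packet[1] for n in nodes for packet in n)

if __name__ == "__main__":
    sample = {
        "a": "Hello  world",
        "b": "hello world",
        "c": "Other text",
        "d": "HELLO WORLD",
    }
    print(detect_duplicates(sample))
```

In this sketch, documents "a", "b", and "d" share a fingerprint after normalization, so only the one with the smallest id survives alongside "c". Because redistribution places equal fingerprints on the same node, the final merge is a plain concatenation; a real deployment would replace the list shuffling with inter-node communication.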