Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments

Disclosed is a computer-assisted method for finding duplicate or near-duplicate documents or text spans within a document collection by using high-discriminability text fragments. Distinctive features of the documents or text spans are identified. For each pair of documents or text spans that has at least one distinctive feature in common, the distinctive features of the two members are compared to determine whether they are duplicates or near-duplicates. An apparatus for performing this computer-assisted method is also disclosed.
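
The two-stage flow described above (identify distinctive features, then compare only those pairs that share at least one) can be illustrated with a minimal sketch. The abstract does not say how distinctive features are selected or how they are compared, so everything specific here is an assumption made for illustration: features are taken to be word trigrams that occur in at most `max_df` documents, comparison is taken to be Jaccard similarity over the feature sets, and the names `shingles`, `distinctive_features`, `find_near_duplicates`, `max_df`, and `threshold` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Return the set of lowercased word n-grams in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def distinctive_features(docs, n=3, max_df=2):
    """Assumed notion of 'distinctive': n-grams found in at most max_df documents."""
    grams = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    doc_freq = defaultdict(int)
    for gram_set in grams.values():
        for gram in gram_set:
            doc_freq[gram] += 1
    return {
        doc_id: {g for g in gram_set if doc_freq[g] <= max_df}
        for doc_id, gram_set in grams.items()
    }


def find_near_duplicates(docs, n=3, max_df=2, threshold=0.5):
    """Return (id_a, id_b, similarity) for pairs at or above threshold."""
    features = distinctive_features(docs, n, max_df)

    # Inverted index: distinctive feature -> documents containing it.
    index = defaultdict(set)
    for doc_id, feats in features.items():
        for feat in feats:
            index[feat].add(doc_id)

    # Candidate pairs are only those sharing at least one distinctive feature.
    candidates = set()
    for doc_ids in index.values():
        for a, b in combinations(sorted(doc_ids), 2):
            candidates.add((a, b))

    # Compare the full distinctive-feature sets of each candidate pair.
    results = []
    for a, b in sorted(candidates):
        fa, fb = features[a], features[b]
        similarity = len(fa & fb) / len(fa | fb)  # Jaccard (an assumption)
        if similarity >= threshold:
            results.append((a, b, similarity))
    return results


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river shore",
    "c": "stock prices fell sharply on monday after the earnings report",
}
pairs = find_near_duplicates(docs)
```

With this sample input, documents `a` and `b` are reported as a near-duplicate pair, while `c` shares no distinctive feature with either and is never compared against them. Restricting comparison to pairs that share a feature is what avoids examining every pair in the collection.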
