A process for finding similar data records within a set of data records. A
database table or tables provide a number of data records from which one
or more canonical data records are identified. Tokens are identified
within the data records and classified according to attribute field. A
similarity score is assigned to data records in relation to other data
records based on a similarity between tokens of the data records. Data
records whose similarity score with respect to each other is greater than
a threshold form one or more groups of data records. The records, or
tuples, form the nodes of a graph in which each edge between two nodes
represents the similarity score between those records of a group. Within each group a
canonical record is identified based on the similarity of data records to
each other within the group.
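
The steps above can be illustrated with a minimal Python sketch. The abstract does not name a similarity measure, a grouping rule, or a canonical-selection rule, so everything specific here is an assumption made for illustration: tokens are whitespace-split words tagged with their attribute field, similarity is Jaccard overlap of those tagged tokens, groups are the connected components of the thresholded graph, and the canonical record is the one with the highest total similarity to the other members of its group. The function names and sample records are hypothetical.

```python
from itertools import combinations


def tokenize(record):
    """Return the record's tokens, each tagged with its attribute field."""
    return {(field, tok)
            for field, value in record.items()
            for tok in str(value).lower().split()}


def similarity(a, b):
    """Jaccard overlap of field-tagged tokens (an assumed measure)."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def group_records(records, threshold):
    """Build the similarity graph and return (groups, edges).

    Records are nodes (by index); an edge is kept only when the pair's
    score exceeds the threshold. Groups are taken to be connected
    components, which is an assumption, not something the text specifies.
    """
    n = len(records)
    edges = {}
    adjacent = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        score = similarity(records[i], records[j])
        if score > threshold:
            edges[(i, j)] = score
            adjacent[i].add(j)
            adjacent[j].add(i)

    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over one component
            node = stack.pop()
            component.append(node)
            for neighbour in adjacent[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(component))
    return groups, edges


def canonical(group, records):
    """Pick the member most similar, in total, to the rest of its group."""
    return max(group, key=lambda i: sum(similarity(records[i], records[j])
                                        for j in group if j != i))


# Hypothetical sample data, for illustration only.
records = [
    {"name": "John Smith", "city": "Seattle"},
    {"name": "John A Smith", "city": "Seattle"},
    {"name": "J Smith", "city": "Seattle"},
    {"name": "Mary Jones", "city": "Boston"},
]
groups, edges = group_records(records, threshold=0.45)
print(groups)                                    # [[0, 1, 2], [3]]
print([canonical(g, records) for g in groups])   # [0, 3]
```

Connected components let two records share a group through a common neighbour even when their own pairwise score falls below the threshold (records 1 and 2 above score 0.4 but are both linked to record 0). A stricter grouping rule, such as requiring every pair in a group to exceed the threshold, would be an equally plausible reading.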