A method, computer program and system for optimizing similarity string
filtering are disclosed. A first data string comprising one or more data
characters and a second data string comprising one or more data
characters are selected. At least one of a defined set of shapes is
applied to the first data string to generate one or more patterns
associated with the first data string. At least one of the defined set of
shapes is applied to the second data string to generate one or more
patterns associated with the second data string. The one or more patterns
associated with the first data string are compared with the one or more
patterns associated with the second data string to determine if one or
more matching patterns exist. The first data string and the second data
string are linked if one or more matching patterns exist.
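
The steps above can be illustrated with a minimal sketch. The abstract does not define what a shape is, so this example assumes, purely for illustration, that a shape is a tuple of character offsets (in the manner of a gapped q-gram) slid across the string, and that a pattern is the pair of a shape and the characters it picks out. The names `patterns`, `linked` and `SHAPES` are hypothetical and do not come from the disclosure.

```python
from typing import Iterable, Set, Tuple

# ASSUMPTION: a "shape" is modeled as a tuple of character offsets.
# The disclosure does not specify this representation.
Shape = Tuple[int, ...]
Pattern = Tuple[Shape, str]


def patterns(data: str, shapes: Iterable[Shape]) -> Set[Pattern]:
    """Apply each shape at every position of `data` and collect the patterns."""
    found: Set[Pattern] = set()
    for shape in shapes:
        span = max(shape) + 1  # number of characters the shape covers
        for start in range(len(data) - span + 1):
            picked = "".join(data[start + offset] for offset in shape)
            # Tag each pattern with its shape so that only patterns
            # produced by the same shape can match each other.
            found.add((shape, picked))
    return found


def linked(first: str, second: str, shapes: Iterable[Shape]) -> bool:
    """Return True if the two strings share at least one pattern."""
    shape_list = list(shapes)
    return bool(patterns(first, shape_list) & patterns(second, shape_list))


# Hypothetical shape set: one contiguous shape and one with a gap.
SHAPES = [(0, 1, 2), (0, 2, 3)]

if __name__ == "__main__":
    print(linked("similarity", "similarly", SHAPES))
    print(linked("abc", "xyz", SHAPES))
```

Under these assumptions the comparison reduces to a set intersection, so two strings are linked as soon as any single shape yields the same characters in both. A string shorter than a shape's span simply produces no patterns for that shape.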