Text classification has become an important aspect of information
technology. Present text classification techniques range from simple text
matching to more complex clustering methods. Clustering describes a
process of discovering structure in a collection of characters. The
invention automatically analyzes a text string and either updates an
existing cluster or creates a new cluster. To that end, the invention may
use a character n-gram matching process in addition to other
heuristic-based clustering techniques. In the character n-gram matching
process, each text string is first normalized using several heuristics.
It is then divided into a set of overlapping character n-grams, where n
is the number of adjacent characters in each n-gram. If the commonality
between the text
string and the existing cluster members satisfies a pre-defined
threshold, the text string is added to the cluster. If, on the other
hand, the commonality does not satisfy the pre-defined threshold, a new
cluster may be created. Each cluster may have a selected topic name. The
topic name allows whole clusters to be compared in much the same way as
individual text strings, and merged when a predetermined level of
commonality exists between the subject clusters. The topic name also may be used as a
suggested alternative to the text string. In this instance, the topic
name of the cluster to which the text string was added may be output
as an alternative to the text string.
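
The character n-gram matching process described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the normalization heuristics (lowercasing, stripping punctuation, collapsing whitespace), the Dice coefficient used as the commonality measure, the threshold value of 0.5, the choice of the first member as the topic name, and all function and class names are hypothetical and do not appear in the description.

```python
import re


def normalize(text):
    """Illustrative heuristics only: lowercase, drop punctuation, collapse spaces."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams of the normalized string."""
    s = normalize(text)
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def commonality(a, b):
    """Dice coefficient of two n-gram sets (an assumed commonality measure)."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


class Cluster:
    def __init__(self, text):
        self.members = [text]
        self.topic = text  # assumption: the first member serves as the topic name


def assign(text, clusters, threshold=0.5, n=3):
    """Add text to the best-matching cluster, or create a new cluster."""
    grams = char_ngrams(text, n)
    best, best_score = None, 0.0
    for cluster in clusters:
        score = max(commonality(grams, char_ngrams(m, n)) for m in cluster.members)
        if score > best_score:
            best, best_score = cluster, score
    if best is not None and best_score >= threshold:
        best.members.append(text)
        return best
    new_cluster = Cluster(text)
    clusters.append(new_cluster)
    return new_cluster


def merge_similar(clusters, threshold=0.5, n=3):
    """Merge clusters whose topic names satisfy the commonality threshold."""
    merged = []
    for cluster in clusters:
        for kept in merged:
            score = commonality(char_ngrams(cluster.topic, n),
                                char_ngrams(kept.topic, n))
            if score >= threshold:
                kept.members.extend(cluster.members)
                break
        else:
            merged.append(cluster)
    return merged
```

In this sketch, the cluster returned by `assign` carries the topic name, so `assign(text, clusters).topic` may serve as the suggested alternative to the text string. `merge_similar` compares whole clusters by applying the same n-gram comparison to their topic names.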