A method and apparatus for categorizing an item based on Record Linkage Theory
are disclosed. A related method and apparatus for assigning a confidence level to
the categorization process are also disclosed. In one aspect, the item to be categorized
is parsed into at least one token. At least one category that contains the token
in a training set is identified. A weight is calculated for each token with respect
to a first category. The token weights are combined to determine the total weight of the
first category. The weighting process is repeated for each relevant category. Where
one of a plurality of threshold values is met or exceeded, the item may be automatically
assigned to the category with the highest total weight. The combination of threshold
values may be selected based on the confidence level associated with that combination
of threshold values. The total weights for the relevant categories, optionally ordered,
may be presented.
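
The steps summarized above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the abstract does not give the weight formula, so the sketch assumes a Fellegi-Sunter style agreement weight, log2(m/u), where m is the smoothed share of training items inside a category that contain the token and u is the same share outside it. The function names, the 0.5 smoothing constant, the two thresholds (`min_weight`, `min_margin`), and the sample training data are all hypothetical.

```python
import math


def tokenize(item):
    """Parse an item description into lowercase tokens (assumed: whitespace split)."""
    return item.lower().split()


def train(training_set):
    """Count, per category, how many training items contain each token."""
    token_counts = {}    # category -> {token: number of items containing it}
    category_sizes = {}  # category -> number of training items
    for text, category in training_set:
        category_sizes[category] = category_sizes.get(category, 0) + 1
        counts = token_counts.setdefault(category, {})
        for token in set(tokenize(text)):
            counts[token] = counts.get(token, 0) + 1
    return token_counts, category_sizes


def token_weight(token, category, token_counts, category_sizes):
    """Assumed agreement weight log2(m/u) with add-0.5 smoothing."""
    n_in = category_sizes[category]
    n_out = sum(category_sizes.values()) - n_in
    hits_in = token_counts[category].get(token, 0)
    hits_out = sum(counts.get(token, 0)
                   for cat, counts in token_counts.items() if cat != category)
    m = (hits_in + 0.5) / (n_in + 1.0)    # P(token | item in category)
    u = (hits_out + 0.5) / (n_out + 1.0)  # P(token | item not in category)
    return math.log2(m / u)


def categorize(item, model, min_weight=1.0, min_margin=1.0):
    """Return (auto-assigned category or None, categories ranked by total weight)."""
    token_counts, category_sizes = model
    tokens = set(tokenize(item))
    # Relevant categories: those whose training items contain at least one token.
    relevant = [c for c, counts in token_counts.items()
                if any(t in counts for t in tokens)]
    # Total weight of a category: sum of its token weights.
    totals = {c: sum(token_weight(t, c, token_counts, category_sizes)
                     for t in tokens)
              for c in relevant}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None, ranked
    best, best_weight = ranked[0]
    margin = best_weight - ranked[1][1] if len(ranked) > 1 else math.inf
    # Assign automatically when either hypothetical threshold is met or exceeded.
    if best_weight >= min_weight or margin >= min_margin:
        return best, ranked
    return None, ranked


# Hypothetical training data, for illustration only.
TRAINING = [
    ("red cotton shirt", "apparel"),
    ("blue denim jeans", "apparel"),
    ("steel claw hammer", "tools"),
    ("cordless power drill", "tools"),
    ("red steel toolbox", "tools"),
]
model = train(TRAINING)
assigned, ranking = categorize("red cotton shirt", model)
print(assigned, ranking)
```

In this sketch the thresholds are plain parameters. Choosing a combination of threshold values from its associated confidence level, as the abstract describes, would require measuring how often each combination yields a correct automatic assignment on held-out data, which is not shown here.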