A system and method determine numerical representations for categorical
data fields by taking advantage of the redundancy of the data records to
allow automatic discovery of an order of the categories. A categorical
data field is recoded by creating a separate table for each numerical
data field occurring in the data records. The separate tables are sorted
according to the numerical values of the respective data fields. Each
category is recoded as the average sort position of its occurrences in
a given sorted table, yielding one recoding table per numerical data
field. The standard deviation of the numerical codes assigned to the
categories is calculated
for each of the separate recoding tables. The recoding table with the
maximum standard deviation is selected as the recoding table to perform
the recoding of the categories contained in the respective categorical
data field of the data records. A plausibility check is performed for the
selected recoding table by excluding the numerical data field on which
the respective table was sorted and recreating the recoding table from
the data records. The resulting recoding table and the original recoding
table are compared. Similarity between the two tables indicates a high
level of confidence that the originally selected recoding table is
optimal.
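
The steps above can be sketched in Python as a minimal illustration, not
as the claimed implementation. Several details are assumptions made only
for this example: records are dictionaries, the sort position is the
zero-based index in the sorted table, the population standard deviation
is used, and two recoding tables count as "similar" when they place the
categories in the same order. The function names and the sample fields
(size, weight, volume, price) are hypothetical.

```python
from statistics import mean, pstdev


def recoding_table(records, cat_field, num_field):
    """Map each category to the mean sort position of its records."""
    # Sort one table by the chosen numerical field (ties keep input order).
    ordered = sorted(records, key=lambda r: r[num_field])
    positions = {}
    for pos, rec in enumerate(ordered):
        positions.setdefault(rec[cat_field], []).append(pos)
    return {cat: mean(p) for cat, p in positions.items()}


def select_table(records, cat_field, num_fields):
    """Return the (field, table) pair whose codes spread the most."""
    tables = {f: recoding_table(records, cat_field, f) for f in num_fields}
    # Assumption: spread is the population standard deviation of the codes.
    best = max(tables, key=lambda f: pstdev(list(tables[f].values())))
    return best, tables[best]


def plausibility_check(records, cat_field, num_fields):
    """Reselect without the winning field and compare category orders.

    Needs at least two numerical fields, otherwise nothing remains
    after the exclusion.
    """
    best, table = select_table(records, cat_field, num_fields)
    remaining = [f for f in num_fields if f != best]
    _, recheck = select_table(records, cat_field, remaining)

    def order(t):
        # Categories listed from lowest to highest code.
        return sorted(t, key=t.get)

    # Assumption: "similar" means both tables induce the same order.
    return best, table, order(table) == order(recheck)


# Hypothetical sample data: "size" is the categorical field to recode.
records = [
    {"size": "S", "weight": 1.0, "volume": 10, "price": 5.0},
    {"size": "S", "weight": 1.2, "volume": 12, "price": 9.0},
    {"size": "M", "weight": 2.0, "volume": 20, "price": 4.0},
    {"size": "M", "weight": 2.5, "volume": 25, "price": 8.0},
    {"size": "L", "weight": 4.0, "volume": 40, "price": 6.0},
    {"size": "L", "weight": 4.5, "volume": 45, "price": 7.0},
]

best, table, consistent = plausibility_check(
    records, "size", ["weight", "volume", "price"]
)
```

In this sample, weight separates the three sizes cleanly, so its recoding
table is selected with codes S = 0.5, M = 2.5 and L = 4.5. Rebuilding the
table without weight selects volume, which orders the categories the
same way, so the check passes. Price interleaves the sizes and yields
codes with a much smaller spread, so it is never selected.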