A method for identifying meaningless data and generating a natural
language statistical model that can reject meaningless input. The method
can include identifying, from a set of training data, unigrams that are
individually meaningless. At least a portion of the unigrams identified as
being meaningless can be assigned to a first n-gram class. The method
can also include identifying bigrams that are entirely composed of
meaningless unigrams and determining whether the identified bigrams are
individually meaningless. At least a portion of the bigrams identified as
being individually meaningless can be assigned to the first n-gram class.
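
The steps above can be illustrated with a minimal Python sketch. The abstract does not state how a unigram or bigram is judged meaningless, so this sketch assumes two hypothetical inputs: a set of unigrams known to be meaningful and a set of bigrams known to be meaningful. The function name `build_first_class` and all sample data are illustrative assumptions, not part of the described method. For simplicity the sketch assigns every meaningless n-gram to the first class, whereas the method requires only at least a portion.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def build_first_class(
    sentences: Iterable[List[str]],
    meaningful_unigrams: Set[str],              # assumption: supplied externally
    meaningful_bigrams: Set[Tuple[str, str]],   # assumption: supplied externally
) -> Set[NGram]:
    """Collect meaningless unigrams and bigrams into one n-gram class."""
    sentences = [list(tokens) for tokens in sentences]

    # Step 1: unigrams that are individually meaningless.
    meaningless_unigrams = {
        token
        for tokens in sentences
        for token in tokens
        if token not in meaningful_unigrams
    }

    # Step 2: bigrams composed entirely of meaningless unigrams.
    candidate_bigrams = {
        (first, second)
        for tokens in sentences
        for first, second in zip(tokens, tokens[1:])
        if first in meaningless_unigrams and second in meaningless_unigrams
    }

    # Step 3: keep only the candidates that are themselves meaningless.
    meaningless_bigrams = candidate_bigrams - meaningful_bigrams

    # Step 4: assign both groups to the first n-gram class.
    return {(token,) for token in meaningless_unigrams} | meaningless_bigrams


# Hypothetical training data for a banking dialogue system.
training = [
    ["um", "how", "much", "money"],
    ["i", "want", "balance"],
]
first_class = build_first_class(
    training,
    meaningful_unigrams={"money", "balance"},
    meaningful_bigrams={("how", "much")},
)
```

In this sample, "how" and "much" are each treated as meaningless on their own, but the bigram ("how", "much") is assumed to carry meaning, so it is excluded from the first class even though it is composed entirely of meaningless unigrams.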