Disclosed are methods and computer program products for automatically
identifying and compensating for stop words in a text processing system.
This automatic stop word compensation allows operations such as
querying an abstract mathematical space built from all words in all
texts, while compensating for any skew that including the stop words
may have introduced into that space.
Documents are represented by document vectors in the abstract
mathematical space. To compensate for stop words, a weight function is
applied to a predetermined component of the document vectors associated
with one or more frequently occurring words contained in the documents. The weight
function may be applied dynamically during query processing.
Alternatively, the weight function may be applied statically to all
document vectors.
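
The two application modes described above can be illustrated with a minimal sketch. It is not the disclosed implementation. It assumes, purely for illustration, that the predetermined component is index 0, that the weight function is a simple scalar multiplier, and that queries are ranked by cosine similarity. The names `apply_weight`, `query_dynamic`, and `reweight_static` are hypothetical.

```python
import math

def apply_weight(vec, component, weight):
    """Return a copy of vec with one component scaled by weight."""
    out = list(vec)
    out[component] *= weight
    return out

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def query_dynamic(query, docs, component=0, weight=0.0):
    """Dynamic mode: weight is applied at query time; stored vectors are untouched."""
    q = apply_weight(query, component, weight)
    return [cosine(q, apply_weight(d, component, weight)) for d in docs]

def reweight_static(docs, component=0, weight=0.0):
    """Static mode: return reweighted copies of all document vectors, computed once."""
    return [apply_weight(d, component, weight) for d in docs]

# Assumed toy data: component 0 is dominated by frequently occurring words.
docs = [[9.0, 1.0, 0.0], [9.0, 0.0, 1.0]]
query = [9.0, 1.0, 0.0]

raw = [cosine(query, d) for d in docs]        # skewed: both documents look similar
dyn = query_dynamic(query, docs)              # compensated at query time
static_docs = reweight_static(docs)           # compensated once, ahead of time
stat = [cosine(apply_weight(query, 0, 0.0), d) for d in static_docs]
```

In this toy data the uncompensated scores are nearly identical for both documents because the shared large component dominates. With that component down-weighted, only the first document matches the query. The dynamic and static modes yield the same scores here and differ only in when the weight is applied.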