A multi-lingual indexing and search system is presented that performs
tokenization and stemming in a manner that is independent of whether
index entries and search terms appear as words in a dictionary. The
system includes a tokenizer that separates a string of text into
individual word tokens, and eliminates predetermined types of tokens from
further processing. The system also includes a stemmer that reduces words
to grammatical stems by removing known word endings associated with the
various languages to be supported. The stemmer removes known word endings
from the word tokens without any effort to guarantee that the remaining
stem is contained in a dictionary. In an embodiment, the stemmer only
removes those word endings which are associated with nouns. The system
further includes an indexer that stores the stems in an index.
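
The tokenizer, stemmer, and indexer described above can be illustrated with a minimal sketch. It is not the claimed implementation: the ending lists, the minimum stem length, the choice of which token types to discard (purely numeric and single-character tokens), and all names such as NOUN_ENDINGS, tokenize, stem, and Index are assumptions made for illustration. The point it shows is that endings are stripped by suffix match alone, with no check that the resulting stem is a dictionary word, and that index entries and search terms pass through the same steps.

```python
import re
from collections import defaultdict

# Hypothetical noun endings per supported language (illustrative only).
NOUN_ENDINGS = {
    "en": ["ies", "es", "s"],
    "de": ["en", "er", "e", "n", "s"],
}
# Merge all languages and try the longest endings first.
ALL_ENDINGS = sorted(
    {e for endings in NOUN_ENDINGS.values() for e in endings},
    key=lambda e: (-len(e), e),
)
MIN_STEM_LEN = 3  # assumed guard against stripping very short words


def tokenize(text):
    """Split text into lowercase word tokens and drop excluded token types."""
    tokens = re.findall(r"\w+", text.lower())
    # Assumed "predetermined types": purely numeric and one-character tokens.
    return [t for t in tokens if not t.isdigit() and len(t) > 1]


def stem(token):
    """Strip the first matching known ending; no dictionary lookup is done."""
    for ending in ALL_ENDINGS:
        if token.endswith(ending) and len(token) - len(ending) >= MIN_STEM_LEN:
            return token[: -len(ending)]
    return token


class Index:
    """Maps each stem to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in tokenize(text):
            self.postings[stem(token)].add(doc_id)

    def search(self, query):
        # Query terms are tokenized and stemmed exactly like index entries.
        stems = [stem(t) for t in tokenize(query)]
        if not stems:
            return set()
        result = set(self.postings.get(stems[0], set()))
        for s in stems[1:]:
            result &= self.postings.get(s, set())
        return result


index = Index()
index.add(1, "Two Katzen and 3 zorbs")
index.add(2, "A box of widgets")
print(index.search("Katze"))
print(index.search("boxes widget"))
```

In this sketch "Katzen" and "Katze" both reduce to "katz", which is not a dictionary word, yet the two forms still match because the index entry and the search term are reduced in the same way. The made-up word "zorbs" is indexed and found under "zorb" for the same reason.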