The present invention provides a computer-readable medium and system for
selecting a set of n-grams for indexing string data in a database
management system (DBMS).
Aspects of the invention include providing a set of candidate n-grams,
each n-gram comprising a sequence of characters; identifying sample
queries having character strings containing the candidate n-grams; and
based on the set of candidate n-grams, the sample queries, database
records, and an n-gram space constraint, automatically selecting a
minimal set of n-grams from the set of candidate n-grams that satisfies
the space constraint and minimizes the number of false hits that would
result if the sample queries were executed against the database records.
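
The selection step can be illustrated with a minimal sketch. It is not the claimed method: it assumes a simple greedy heuristic, models the space constraint as a maximum number of n-grams (`budget`), and takes a false hit to be a record that the chosen n-grams cannot rule out for a query even though the record does not contain the query string. The function name `select_ngrams` and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_ngrams(
    candidates: Sequence[str],
    queries: Sequence[str],
    records: Sequence[str],
    budget: int,
) -> Tuple[List[str], int]:
    """Greedy sketch: pick at most `budget` n-grams; return (chosen, false_hits)."""
    ids = range(len(records))
    # Records that genuinely contain each sample query string.
    true_hits = {q: {i for i in ids if q in records[i]} for q in queries}
    # Posting list per candidate n-gram: the records that contain it.
    postings = {g: {i for i in ids if g in records[i]} for g in candidates}
    # Records the index cannot yet rule out for each query (initially all).
    surviving = {q: set(ids) for q in queries}
    chosen: List[str] = []

    while len(chosen) < budget:  # assumed form of the space constraint
        best, best_gain = None, 0
        for g in candidates:
            if g in chosen:
                continue
            # An n-gram only filters queries that contain it. Every record it
            # removes lacks the n-gram, so it cannot be a true match.
            gain = sum(len(surviving[q] - postings[g]) for q in queries if g in q)
            if gain > best_gain:
                best, best_gain = g, gain
        if best is None:
            break  # no remaining n-gram removes a false hit; keep the set small
        chosen.append(best)
        for q in queries:
            if best in q:
                surviving[q] &= postings[best]

    false_hits = sum(len(surviving[q] - true_hits[q]) for q in queries)
    return chosen, false_hits


records = ["apple pie", "apple tart", "grape juice", "pineapple", "maple syrup"]
chosen, remaining = select_ngrams(
    ["ap", "pl", "gr", "le"], ["apple", "grape"], records, budget=2
)
print(chosen, remaining)  # ['gr', 'pl'] 1
```

In this example "ap" is never selected because it occurs in every record and so filters nothing, while "gr" removes all four false hits for the query "grape". A greedy pass does not guarantee the true minimum, so an exact variant would need to search over candidate subsets.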