A computer-implemented system and method are disclosed for retrieving
documents using context-dependent probabilistic modeling of words and
documents. The present invention uses multiple overlapping vectors to
represent each document. One vector is centered on each word in
the document and includes that word's local environment. The vectors are used to
build probability models that are used to predict related
documents and related keywords. The results of the statistical analysis
are used for retrieving an indexed document, for extracting features from
a document, or for finding a word within a document. The statistical
analysis is also used to evaluate the probability that keywords appearing
in the document are related and to build a vocabulary of keywords that
are generally found together. The results of the analysis are
stored in a repository. Searches of the data repository produce a list of
related documents and a list of related terms. The user may select from
the list of documents and/or from the list of related terms to refine the
search and retrieve those documents that meet the user's search goal
with a minimum of extraneous data.
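
The overlapping-vector representation and the two search outputs described
above may be illustrated by the following minimal sketch. It is not the
disclosed implementation: the window radius, the whitespace tokenizer, the
additive (Laplace) smoothing, the naive-Bayes-style scoring, and all names
(context_windows, ContextModel, rank_documents, related_terms) are
assumptions introduced only for illustration.

```python
from collections import Counter, defaultdict
import math


def context_windows(tokens, radius=2):
    """Yield (center, neighbors) for every token; adjacent windows overlap."""
    for i, center in enumerate(tokens):
        lo = max(0, i - radius)
        neighbors = tokens[lo:i] + tokens[i + 1:i + 1 + radius]
        yield center, neighbors


class ContextModel:
    """Toy repository of window statistics (hypothetical design)."""

    def __init__(self, radius=2, alpha=1.0):
        self.radius = radius          # assumed window half-width
        self.alpha = alpha            # assumed additive smoothing constant
        self.doc_counts = defaultdict(Counter)  # doc_id -> word -> count over windows
        self.cooccur = defaultdict(Counter)     # center word -> neighbor -> count
        self.vocab = set()

    def add_document(self, doc_id, text):
        tokens = text.lower().split()
        self.vocab.update(tokens)
        for center, neighbors in context_windows(tokens, self.radius):
            # Every word of every window contributes to the document model.
            self.doc_counts[doc_id].update([center] + neighbors)
            for n in neighbors:
                self.cooccur[center][n] += 1

    def rank_documents(self, query):
        """Return doc ids ordered by smoothed log-likelihood of the query words."""
        words = query.lower().split()
        v = len(self.vocab)
        scores = {}
        for doc_id, counts in self.doc_counts.items():
            total = sum(counts.values())
            scores[doc_id] = sum(
                math.log((counts[w] + self.alpha) / (total + self.alpha * v))
                for w in words
            )
        return sorted(scores, key=scores.get, reverse=True)

    def related_terms(self, word, k=3):
        """Return up to k (term, P(term | word in same window)) pairs."""
        neighbors = self.cooccur[word.lower()]
        total = sum(neighbors.values())
        return [(t, c / total) for t, c in neighbors.most_common(k)]


model = ContextModel(radius=1)
model.add_document("d1", "the cat sat on the mat")
model.add_document("d2", "stock prices fell on the market")
ranked = model.rank_documents("cat mat")
related = model.related_terms("cat")
```

In this sketch a search yields both a ranked list of documents (ranked) and
a list of related terms (related), either of which a user could feed back
into a further query to refine the search.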