A language model is formed and/or improved based on data from a large
collection of documents, such as web data. The collection of documents is
queried using queries that are formed from the language model. The
language model is subsequently improved using the information thus
obtained. The improved language model is in turn used to form improved
queries. As data is
received from the collection of documents, it is compared to a rejection
model that describes what rejected documents typically look like. Any
document that matches the rejection model is rejected. The documents that
remain
are characterized to determine whether they add information to the
language model, whether they are relevant, and whether they should be
independently rejected. Rejected documents are used to update the
rejection model; accepted documents are used to update the language
model. Each iteration improves the language model, and the documents may
be analyzed again using the improved language model.
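
The loop described above can be sketched roughly as follows. This is a
minimal illustration under stated assumptions, not the method as
claimed: unigram counts stand in for both the language model and the
rejection model, the collection is an in-memory list of strings, and
every name and threshold (refine, form_query, reject_at, relevant_at)
is hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, for illustration)."""
    return text.lower().split()


def form_query(language_model, k=5):
    """Form a query from the k most frequent terms in the language model."""
    return [term for term, _ in language_model.most_common(k)]


def search(collection, query):
    """Return documents that contain at least one query term."""
    terms = set(query)
    return [doc for doc in collection if terms & set(tokenize(doc))]


def overlap(tokens, model):
    """Fraction of tokens that the given model has already seen."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in model) / len(tokens)


def refine(collection, seed_text, reject_text,
           iterations=3, reject_at=0.5, relevant_at=0.2):
    # Both models are plain term counts here (hypothetical stand-ins).
    language_model = Counter(tokenize(seed_text))
    rejection_model = Counter(tokenize(reject_text))
    accepted, rejected = [], []

    for _ in range(iterations):
        # The current language model forms the query for this iteration.
        query = form_query(language_model)
        for doc in search(collection, query):
            if doc in accepted or doc in rejected:
                continue
            tokens = tokenize(doc)

            # Step 1: compare the document to the rejection model.
            if overlap(tokens, rejection_model) >= reject_at:
                rejected.append(doc)
                rejection_model.update(tokens)
                continue

            # Step 2: characterize the documents that remain.
            relevant = overlap(tokens, language_model) >= relevant_at
            adds_info = any(t not in language_model for t in tokens)

            if not relevant:
                # Independently rejected: updates the rejection model.
                rejected.append(doc)
                rejection_model.update(tokens)
            elif adds_info:
                # Accepted: updates the language model.
                accepted.append(doc)
                language_model.update(tokens)
            # Relevant documents that add nothing new are left
            # unclassified and may be analyzed again in a later pass.

    return language_model, rejection_model, accepted, rejected
```

In this sketch each pass forms its query from the language model as
updated by the previous pass, so terms learned from accepted documents
can pull in documents that the seed query alone would not have
retrieved.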