A document (or multiple documents) is analyzed to identify the entities of
interest it contains. This is accomplished by constructing n-gram models
(for example, bi-gram models) that correspond to different kinds of text
entities, such as chemistry-related words and generic English words. The
models can be constructed from training text selected to reflect a
particular kind of text entity. The document is tokenized, and the tokens
are run against the models to determine, for each token, which kind of
text entity is most likely to be associated with that token. The entities
of interest in the document can then be annotated accordingly.
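To make the token-classification step concrete, the following Python sketch illustrates one way the approach could look, assuming character-level bi-gram models with add-one smoothing. The model names, training strings, vocabulary size, and the annotate helper are illustrative assumptions rather than the specific method described above.

```python
import math
import re
from collections import defaultdict


def train_bigram_model(training_text):
    """Build character bi-gram counts from training text for one entity kind."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in training_text.lower().split():
        padded = f"^{word}$"  # mark word boundaries
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts


def log_likelihood(token, counts, vocab_size=60):
    """Average per-bigram log probability of a token under one model."""
    padded = f"^{token.lower()}$"
    total, n = 0.0, 0
    for a, b in zip(padded, padded[1:]):
        seen = counts[a]
        # Add-one smoothing so unseen bigrams get a small nonzero probability.
        total += math.log((seen[b] + 1) / (sum(seen.values()) + vocab_size))
        n += 1
    return total / max(n, 1)


# Illustrative (assumed) training text: one sample reflecting chemistry-related
# words, one reflecting generic English words.
models = {
    "chemistry": train_bigram_model(
        "methyl ethyl benzene chloride hydroxide sulfate acetate nitrate"
    ),
    "english": train_bigram_model(
        "the quick brown fox jumps over a lazy dog and runs away quickly"
    ),
}


def annotate(document):
    """Tokenize the document and label each token with its most likely model."""
    tokens = re.findall(r"\w+", document)
    return [
        (tok, max(models, key=lambda name: log_likelihood(tok, models[name])))
        for tok in tokens
    ]


if __name__ == "__main__":
    for token, label in annotate("Add the ethyl acetate to the flask slowly"):
        print(f"{token:10s} -> {label}")
```

In this sketch, each token is scored against every model and labeled with the kind of text entity whose model assigns it the highest likelihood; the resulting labels stand in for the annotations applied to the entities of interest.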