Coding free-text documents, especially in medicine, has become an urgent
priority as electronic medical records (EMRs) mature and the need to
exchange data between EMRs becomes more acute. However, only a few
automated coding systems exist, and they can only code a small portion of
the free text against a limited number of codes. The precision of these
systems is low, and code quality is not measured. The present invention
discloses a process and system that implement semantic coding against
standard lexicon(s) with high precision. The standard lexicon can come
from a number of different sources, but is usually developed by a
standards body. The system is semi-automated to enable medical coders or
others to process free-text documents at a rapid rate and with high
precision. The system performs the steps of segmenting a document,
flagging the need for corrections, validating the document against a data
type definition, and looking up both the semantics and standard codes
which correspond to the document's sentences. The coder has the option to
intervene at any step in the process to fix mistakes made by the system.
A knowledge base, consisting of propositions, represents the semantic
knowledge in the domain. When sentences with unknown semantics are
discovered, they can be easily added to the knowledge base. The
propositions in the knowledge base are associated with codes in the
standard lexicon. The quality of each match is rated by a professional
who understands the knowledge domain. The system uses this information to
perform high-precision coding and measure the quality of the match.
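
As an illustration only, the segmenting and lookup steps and the role of the knowledge base might be sketched in Python as follows. The abstract does not specify data structures, segmentation rules, a matching method, or a rating scale, so everything named here is an assumption made for the example: the Proposition and KnowledgeBase classes, the normalize, segment, and code_document functions, the exact match on normalized text, the 1-to-5 quality rating, and the placeholder codes such as "CODE-001", which do not come from any real lexicon.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Proposition:
    """Hypothetical shape of one knowledge-base entry."""
    text: str           # wording of the proposition as it was entered
    code: str           # placeholder code, not taken from a real lexicon
    match_quality: int  # reviewer-assigned rating; the 1-5 scale is assumed


def normalize(sentence: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", sentence.lower())
    return " ".join(cleaned.split())


@dataclass
class KnowledgeBase:
    """Maps normalized sentences to propositions (assumed exact-match lookup)."""
    propositions: Dict[str, Proposition] = field(default_factory=dict)

    def add(self, sentence: str, code: str, match_quality: int) -> None:
        key = normalize(sentence)
        self.propositions[key] = Proposition(sentence, code, match_quality)

    def lookup(self, sentence: str) -> Optional[Proposition]:
        return self.propositions.get(normalize(sentence))


def segment(document: str) -> List[str]:
    """Naive split after ., !, or ?; a stand-in for real segmentation."""
    parts = re.split(r"(?<=[.!?])\s+", document)
    return [p.strip() for p in parts if p.strip()]


def code_document(
    document: str, kb: KnowledgeBase
) -> Tuple[List[Tuple[str, Proposition]], List[str]]:
    """Return (coded, unknown): matched sentences with their proposition,
    and unmatched sentences left for a human coder to review."""
    coded: List[Tuple[str, Proposition]] = []
    unknown: List[str] = []
    for sentence in segment(document):
        prop = kb.lookup(sentence)
        if prop is None:
            unknown.append(sentence)
        else:
            coded.append((sentence, prop))
    return coded, unknown


kb = KnowledgeBase()
kb.add("The lungs are clear.", "CODE-001", 5)
coded, unknown = code_document("The lungs are clear. Heart size is normal.", kb)
# A coder could now review `unknown` and add each sentence to the knowledge base.
```

In this sketch, a sentence with unknown semantics is returned in the second list instead of being coded, which mirrors the semi-automated design: a coder reviews it, and once it is added with a code and a quality rating, later documents containing the same sentence are coded without intervention.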