Entity disambiguation resolves which names, words, or phrases in text
correspond to distinct persons, organizations, locations, or other
entities in the context of an entire corpus. The invention is based
largely on language-independent algorithms. Thus, it is applicable not
only to unstructured text in arbitrary human languages, but also to
semi-structured data, such as citation databases and wire transfer
transaction records, in which named entities can be disambiguated for the
purpose of detecting money-laundering activity. The system uses multiple
types of context as evidence for determining whether two mentions
correspond to the same entity, and it automatically learns the weight of
evidence of each context item from corpus statistics. The invention uses
multiple search keys to efficiently find pairs of mentions that
correspond to the same entity, while skipping billions of unnecessary
comparisons, yielding a system with very high throughput that can be
applied to truly massive data.
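
The two mechanisms described above, search keys that limit which mention pairs are compared and context weights learned from corpus statistics, can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the invention: the toy mentions, the key scheme in `search_keys`, the IDF-style formula in `context_weights`, and the function names themselves.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical toy mentions: a surface name plus a set of context items.
mentions = [
    {"id": 0, "name": "John Smith", "context": {"acme corp", "boston"}},
    {"id": 1, "name": "J. Smith", "context": {"acme corp", "boston", "usa"}},
    {"id": 2, "name": "John Smith", "context": {"globex", "usa"}},
    {"id": 3, "name": "Mary Jones", "context": {"boston", "usa"}},
]


def search_keys(name):
    """Assumed key scheme: the full normalized name, plus surname and first initial."""
    tokens = name.lower().replace(".", "").split()
    return {" ".join(tokens), f"{tokens[-1]}|{tokens[0][0]}"}


def candidate_pairs(mentions):
    """Index mentions under every search key; only mentions sharing a key are paired."""
    index = defaultdict(set)
    for m in mentions:
        for key in search_keys(m["name"]):
            index[key].add(m["id"])
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def context_weights(mentions):
    """Assumed IDF-style weight: context items seen in fewer mentions weigh more."""
    n = len(mentions)
    doc_freq = defaultdict(int)
    for m in mentions:
        for item in m["context"]:
            doc_freq[item] += 1
    return {item: math.log(n / count) for item, count in doc_freq.items()}


def score(a, b, weights):
    """Sum the weights of the context items that two mentions share."""
    return sum(weights[item] for item in a["context"] & b["context"])


weights = context_weights(mentions)
pairs = candidate_pairs(mentions)
scores = {(i, j): score(mentions[i], mentions[j], weights) for i, j in pairs}

for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

In this toy corpus, only three of the six possible mention pairs are ever scored, because "Mary Jones" shares no search key with the other mentions. Among the scored pairs, mentions 0 and 1 share the rarest context item and receive the highest score. A real system would also need a decision threshold and a way to merge matched pairs into entities, neither of which is shown here.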