Text-based documents are compared by lexically normalising each word of
the text of a first document (104) to form a first normalised
representation. A vector representation of the first document is built
(206) from the first normalised representation. Each word of the text of
a second document (110) is lexically normalised to form a second
normalised representation. A vector representation of the second document
is built (204) from the second normalised representation. The vector
representations are compared (210) for alignment to produce a score (218)
of the similarity of the second document to the first document.
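
The comparison described above may be illustrated by the following minimal sketch. It rests on assumptions that the text does not state: lexical normalisation is taken to mean lower-casing each word and discarding punctuation, the vector representation is taken to be a term-frequency count, and alignment is measured as the cosine of the angle between the two vectors. The function names (normalise, build_vector, alignment_score, compare_documents) are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def normalise(text):
    """Assumed lexical normalisation: lower-case the text and keep only
    alphanumeric word tokens. A fuller scheme might also stem or lemmatise."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vector(words):
    """Assumed vector representation: a sparse term-frequency vector
    mapping each normalised word to its number of occurrences."""
    return Counter(words)


def alignment_score(vec_a, vec_b):
    """Assumed alignment measure: cosine similarity of the two vectors.
    Returns 0.0 when either vector is empty."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_documents(first_text, second_text):
    """Normalise both documents, build their vectors, and score the
    similarity of the second document to the first."""
    first_vector = build_vector(normalise(first_text))
    second_vector = build_vector(normalise(second_text))
    return alignment_score(first_vector, second_vector)
```

Under these assumptions, two documents that differ only in capitalisation or punctuation score close to 1, and two documents that share no words score 0.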