Various technologies and techniques are disclosed that improve the
identification of related content. An article for which matching content
is to be identified is received or selected. The raw text of the article is
analyzed to reduce the raw text to a core set of words, and the results
are stored in a document feature vector array. The formatted text of the
article is analyzed and vector array scores are updated based on the
formatting. Anchor text words for documents that link to the article are
added to the vector array. Articles linking to and from the particular
article are identified and added to the vector array as appropriate.
Transformations are performed, such as to adjust the vector scores based
on how common or generic the words are. Vector arrays are created for
other potentially related documents. The vector arrays are compared to
determine how closely the corresponding documents are related to one another.
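
The steps above can be illustrated with a minimal sketch. It is not the disclosed implementation: the stop-word list, the weights for formatted and anchor-text words, the inverse-document-frequency style adjustment, and the use of cosine similarity for the comparison are all assumptions chosen for illustration, and every function and parameter name is hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the source does not say how the
# core set of words is selected.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "for", "on"}


def tokenize(text):
    """Lowercase the text and drop stop words, leaving a core set of words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]


def build_vector(raw_text, emphasized=(), anchor_texts=(),
                 emphasis_bonus=1.0, anchor_weight=1.5):
    """Build a word -> score mapping for one document.

    emphasized: phrases that appear in headings, bold text, etc. (assumed input).
    anchor_texts: anchor text of links pointing at the document (assumed input).
    The bonus and weight values are arbitrary illustrative choices.
    """
    vector = Counter()
    for word in tokenize(raw_text):
        vector[word] += 1.0
    # Raise the score of words that appear in formatted text.
    for phrase in emphasized:
        for word in tokenize(phrase):
            vector[word] += emphasis_bonus
    # Add words taken from the anchor text of inbound links.
    for phrase in anchor_texts:
        for word in tokenize(phrase):
            vector[word] += anchor_weight
    return dict(vector)


def downweight_common(vectors):
    """Scale each score by a smoothed inverse document frequency (an assumption),
    so that words found in many documents count for less."""
    n = len(vectors)
    doc_freq = Counter(word for vec in vectors for word in vec)
    return [
        {w: s * (math.log((1 + n) / (1 + doc_freq[w])) + 1.0)
         for w, s in vec.items()}
        for vec in vectors
    ]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(score * b[word] for word, score in a.items() if word in b)
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In this sketch, a vector is built for the selected article and for each candidate document, all vectors are passed through downweight_common together, and the candidates are then ranked by their cosine score against the article. Linked articles, which the abstract also adds to the vector array, are left out because the source does not say how they are scored.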