Systems and methods for determining the topic structure of a document
including text utilize a Probabilistic Latent Semantic Analysis (PLSA)
model and select segmentation points based on similarity values between
pairs of adjacent text blocks. PLSA forms a framework for both text
segmentation and topic identification. The use of PLSA provides an
improved representation for the sparse information in a text block, such
as a sentence or a sequence of sentences. Topic characterization of each
text segment is derived from PLSA parameters that relate words to
"topics" (the latent variables in the PLSA model) and "topics" to text
segments. A system executing the method exhibits significant performance
improvement. Once determined, the topic structure of a document may be
employed for document retrieval and/or document summarization.
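
The selection of segmentation points from similarity values between adjacent text blocks can be illustrated with a minimal sketch. It assumes that a fitted PLSA model has already produced a topic distribution P(z | block) for each text block; the values in BLOCK_TOPICS are hypothetical toy numbers, not results. The use of cosine similarity, the rule "local minimum below a threshold", the threshold of 0.5, and the function names are all assumptions made for illustration and are not specified by the text above.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def adjacent_similarities(block_topics):
    """Similarity value for each pair of adjacent text blocks."""
    return [
        cosine(block_topics[i], block_topics[i + 1])
        for i in range(len(block_topics) - 1)
    ]


def segmentation_points(sims, threshold=0.5):
    """Indices of blocks that start a new segment.

    Assumed rule: place a boundary where the similarity is a local
    minimum and falls below `threshold`.
    """
    points = []
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else float("inf")
        right = sims[i + 1] if i + 1 < len(sims) else float("inf")
        if s < threshold and s <= left and s <= right:
            points.append(i + 1)
    return points


def dominant_topic(block_topics, start, end):
    """Topic with the highest average probability over blocks [start, end)."""
    k = len(block_topics[0])
    avg = [
        sum(b[z] for b in block_topics[start:end]) / (end - start)
        for z in range(k)
    ]
    return max(range(k), key=avg.__getitem__)


# Hypothetical P(z | block) for five text blocks and two latent topics.
BLOCK_TOPICS = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.85, 0.15],
    [0.10, 0.90],
    [0.20, 0.80],
]

sims = adjacent_similarities(BLOCK_TOPICS)
boundaries = segmentation_points(sims)
print(boundaries)  # [3]: a new segment starts at block index 3
print(dominant_topic(BLOCK_TOPICS, 0, 3), dominant_topic(BLOCK_TOPICS, 3, 5))
```

In this toy input the similarity drops sharply between the third and fourth blocks, so a single segmentation point is placed there, and each resulting segment is characterized by its dominant latent topic.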