The invention provides a text segmentation apparatus comprising means for
analyzing an electronic text to determine a likelihood of segmentation
point for each sentence end in the text based on a coherent unit, and
means for segmenting the text into text segments based on the likelihood
of segmentation point. When the size of any of the resulting text segments
exceeds a threshold value determined from the specified segment size, the
apparatus is programmed to further segment that text segment at the
position having the best likelihood of segmentation point within it. In
particular, the apparatus determines
the similarity between the text parts contained in a pair of windows
set up on the left and right sides of each sentence end position in
the text so as to obtain similarity curves. Then, the apparatus
determines the likelihood of segmentation point for each sentence end
point based on the obtained similarity curves. The apparatus segments the
text at the point having the best likelihood of segmentation point, then
at the point having the second-best likelihood of segmentation point, and
so on, until the size of every text segment is approximately equal to the
specified segment size.
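
The procedure summarized above can be illustrated with a minimal sketch. It is not the claimed implementation: the abstract does not state how similarity is computed, how the likelihood is derived from the similarity curves, or how the threshold follows from the specified size. The sketch assumes cosine similarity over word counts, a likelihood equal to one minus that similarity, a single window width, segment size measured in sentences, and a threshold of `slack * target_size`. The names `cosine`, `boundary_likelihoods`, `segment`, `window`, and `slack` are hypothetical.

```python
import math
import re
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def boundary_likelihoods(sentences, window=2):
    """Score the boundary after sentence i, for i in 0..len(sentences)-2.

    Assumption: likelihood = 1 - similarity of the left and right windows,
    so a drop in the similarity curve marks a likely segmentation point.
    """
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    scores = []
    for i in range(len(bags) - 1):
        # Left window ends at sentence i; right window starts at sentence i + 1.
        left = sum(bags[max(0, i - window + 1): i + 1], Counter())
        right = sum(bags[i + 1: i + 1 + window], Counter())
        scores.append(1.0 - cosine(left, right))
    return scores


def segment(sentences, target_size, slack=1.5, window=2):
    """Return (start, end) sentence ranges, end exclusive.

    Any segment larger than the threshold is split at its best-scoring
    interior boundary, repeatedly, until no oversized segment remains.
    """
    scores = boundary_likelihoods(sentences, window)
    threshold = slack * target_size  # assumed rule for deriving the threshold
    segments = [(0, len(sentences))]
    while True:
        oversized = [(s, e) for s, e in segments
                     if e - s > threshold and e - s > 1]
        if not oversized:
            return segments
        s, e = oversized[0]
        # Boundary i lies after sentence i, so interior candidates are s..e-2.
        cut = max(range(s, e - 1), key=lambda i: scores[i])
        idx = segments.index((s, e))
        segments[idx: idx + 1] = [(s, cut + 1), (cut + 1, e)]
```

Calling `segment(sentences, target_size=3)` on a list of sentence strings returns index ranges that can be used to slice the list. Splitting the oversized segments one at a time at their own best interior boundary is one simple way to realize the "best, then second best, and so on" ordering described above.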