A method organizes semi-structured data into a taxonomy using
Tag-Separated (TS) clustering. The method comprises retrieving documents
that include the semi-structured data. The semi-structured data
comprises structured data, which includes structured data fields and
tags, and unstructured data. The method selects a structured attribute
type, which includes any of a categorical attribute, a numerical
attribute, and a tag associated with annotated text, and an unstructured
attribute type, which includes a text attribute. The method clusters the
semi-structured data
from the retrieved documents into a plurality of clusters based on the
selected structured attribute type and the selected unstructured
attribute type. For a categorical attribute, each category corresponds to
a single cluster. For a numerical attribute, a clustering algorithm
clusters numerical data projected onto a range of the numerical
attribute. For an annotated text attribute, a monothetic clustering
algorithm clusters annotated text data according to tags associated with
a vocabulary for the annotated text data.
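
The three per-attribute rules above can be illustrated with a minimal
Python sketch. It is an illustration under stated assumptions, not the
claimed implementation: the record layout, the field names (category,
price, tags), and all function names are hypothetical. The numerical
step uses a simple largest-gap split as a stand-in, since the text does
not name a specific numerical clustering algorithm, and the tag step
assumes one (possibly overlapping) cluster per vocabulary tag as one
plausible reading of monothetic clustering. Clustering on the
unstructured text attribute is not shown.

```python
from collections import defaultdict


def cluster_categorical(records, field):
    """Categorical attribute: each category value becomes one cluster."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[field]].append(rec)
    return dict(clusters)


def cluster_numerical(records, field, k=2):
    """Numerical attribute: project records onto the attribute's range
    and cut at the k-1 largest gaps (assumed stand-in algorithm)."""
    ordered = sorted(records, key=lambda r: r[field])
    if k <= 1 or len(ordered) <= 1:
        return [ordered]
    # (gap size, index of the record just before the gap)
    gaps = [(ordered[i + 1][field] - ordered[i][field], i)
            for i in range(len(ordered) - 1)]
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    clusters, start = [], 0
    for i in cuts:
        clusters.append(ordered[start:i + 1])
        start = i + 1
    clusters.append(ordered[start:])
    return clusters


def cluster_by_tag(records, field, vocabulary):
    """Annotated text: membership is decided by a single feature (one
    tag from the vocabulary), so a record may fall into several clusters."""
    clusters = {tag: [] for tag in vocabulary}
    for rec in records:
        for tag in rec[field]:
            if tag in clusters:
                clusters[tag].append(rec)
    # Drop vocabulary tags that matched no record.
    return {tag: members for tag, members in clusters.items() if members}


# Hypothetical semi-structured records extracted from retrieved documents.
docs = [
    {"id": 1, "category": "hardware", "price": 10.0, "tags": {"disk"}},
    {"id": 2, "category": "software", "price": 12.5, "tags": {"disk", "driver"}},
    {"id": 3, "category": "hardware", "price": 95.0, "tags": {"driver"}},
]

by_category = cluster_categorical(docs, "category")
by_price = cluster_numerical(docs, "price", k=2)
by_tag = cluster_by_tag(docs, "tags", {"disk", "driver"})
```

With these sample records, the categorical step yields one cluster per
category value, the numerical step separates the two low prices from the
high one, and the tag step places record 2 in both tag clusters.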