A method and apparatus are provided for converting a document in a first
format essentially comprising a flat layout structure into a structured
document in a hierarchical form in accordance with predetermined
attributes identified from the input format. The process comprises
fragmenting the input document into a plurality of document content
elements in accordance with a predetermined set of document attributes
identifiable from the input document format. The content elements are
clustered into selective sets having similar document attributes. The
clustered sets are validated with reference to textual properties and
organizational content common to documents in the collection. The
clustered sets are then categorized into predetermined categories
comprising structured elements of the structured document format, and the
document content elements are organized by hierarchical dependency from
the predetermined categories, wherein the organized document elements
comprise the desired structured document format.
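
The steps summarized above (clustering, validating, categorizing, and organizing) can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes fragmenting has already produced content elements, that font size and boldness are the only document attributes, that "short text with no final period" serves as the validating textual property, and that headings are ranked by font size. All names (Fragment, Node, cluster, is_heading_like, categorize, organize, convert) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """One content element from the flat input (assumed already fragmented)."""
    text: str
    font_size: float
    bold: bool


@dataclass
class Node:
    """One element of the hierarchical output."""
    category: str
    text: str
    children: list = field(default_factory=list)


def cluster(fragments):
    """Group fragments whose layout attributes are identical."""
    clusters = defaultdict(list)
    for frag in fragments:
        clusters[(frag.font_size, frag.bold)].append(frag)
    return dict(clusters)


def is_heading_like(members, max_words=8):
    """Validation stand-in: every member is short and has no final period."""
    return all(
        len(m.text.split()) <= max_words and not m.text.rstrip().endswith(".")
        for m in members
    )


def categorize(clusters):
    """Map each attribute key to a (category, depth) pair."""
    heading_keys = sorted(
        (key for key, members in clusters.items() if is_heading_like(members)),
        key=lambda key: (-key[0], not key[1]),  # larger font first, bold first
    )
    levels = {key: (f"heading{i + 1}", i + 1) for i, key in enumerate(heading_keys)}
    body_depth = len(heading_keys) + 1
    for key in clusters:
        # Clusters that fail validation are treated as body text.
        levels.setdefault(key, ("paragraph", body_depth))
    return levels


def organize(fragments, levels):
    """Build the tree: each element attaches to the nearest shallower element."""
    root = Node("document", "")
    stack = [(0, root)]  # (depth, node) path from the root to the last element
    for frag in fragments:
        category, depth = levels[(frag.font_size, frag.bold)]
        while stack[-1][0] >= depth:
            stack.pop()
        node = Node(category, frag.text)
        stack[-1][1].children.append(node)
        stack.append((depth, node))
    return root


def convert(fragments):
    """Run clustering, validation/categorization, and organization in order."""
    return organize(fragments, categorize(cluster(fragments)))
```

Calling convert on a reading-order list of Fragment objects returns a Node tree in which each heading contains the lower-ranked headings and paragraphs that follow it. A real system would draw on richer attributes and collection-wide statistics for the validation step.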