Techniques for clustering user sessions using multi-modal information,
including proximal cue information, are provided. The topology, content and usage
of a document collection or web site are determined. User paths are then identified
using longest repeating subsequence techniques. An information need feature vector
is determined for each significant user path. In addition, other feature vectors and
proximal cue vectors are determined for each document or web page in the significant
path. The other feature vectors include a content feature vector, a uniform
resource locator (URL) feature vector, an inlink feature vector and an outlink feature
vector, among others. The feature vectors and the proximal cue vectors are combined
into a multi-modal vector that represents a user profile for each significant user
path. The multi-modal vectors are clustered using a type of multi-modal clustering
such as K-Means or Wavefront clustering.
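
The combination and clustering steps described above can be illustrated with a minimal sketch. It assumes each significant user path already has one vector per modality. It also assumes the multi-modal vector is formed by L2-normalizing each modality, scaling it by a weight and concatenating the results. The modality names, the weights, the toy data and the deterministic centroid initialization are all hypothetical choices made for illustration, not details taken from the description. Only plain K-Means is shown; Wavefront clustering is not.

```python
import math

# Hypothetical modality order; every profile must supply these keys.
MODALITIES = ("content", "url", "proximal_cue")


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_modalities(profile, weights):
    """Concatenate the weighted, normalized per-modality vectors of one path."""
    combined = []
    for name in MODALITIES:
        w = weights.get(name, 1.0)
        combined.extend(w * x for x in l2_normalize(profile[name]))
    return combined


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=20):
    """Plain K-Means; the first k vectors seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy per-path vectors (invented values, one dict per significant user path).
profiles = [
    {"content": [3, 0, 0], "url": [1, 0], "proximal_cue": [2, 0]},
    {"content": [0, 0, 4], "url": [0, 1], "proximal_cue": [0, 1]},
    {"content": [2, 1, 0], "url": [1, 0], "proximal_cue": [1, 0]},
    {"content": [0, 1, 3], "url": [0, 1], "proximal_cue": [0, 2]},
]
weights = {"content": 1.0, "url": 0.5, "proximal_cue": 1.0}

vectors = [combine_modalities(p, weights) for p in profiles]
labels, centroids = k_means(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Normalizing each modality before weighting keeps a long vector, such as a content vector over a large vocabulary, from dominating the distance computation purely because of its scale. The weights then give explicit control over how much each modality contributes.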