Input set indexing for set-similarity lookups. The architecture provides
input to an indexing process that enables more efficient lookups for
large data sets (e.g., disk-based) without requiring a full scan of the
input. A new index structure is provided, the output of which is exact,
rather than approximate. The similarity of two sets is specified using a
similarity function that maps two sets to a numeric value that represents
similarity of the two sets. Threshold-based lookups are addressed where
two sets are considered similar if the numeric similarity score is above
a threshold. The structure efficiently identifies all input sets within a
distance k (e.g., a hamming distance) of the query set. Additional
information in the form of frequency of elements (the number of input
sets in which an element occurs) is used to improve index performance.