The invention relates to methods for clustering gene and protein
sequences. In particular, it involves generation of networks of sequences
where the interconnections are based upon a measure of similarity. The
invention also provides methods of optimizing the networks by re-wiring
them based upon the overlap between the nearest neighbors of given pairs
of nodes. The invention further provides methods of
identifying clusters of sequences within the networks and the optimized
networks based upon the topology of the network. The clusters identified
represent groups of sequences that are related by function and/or
evolution. The invention has particular applicability in the annotation
of sequences in databases and in the identification of functional
homologs. Such homologs can be useful as novel therapeutic and diagnostic
targets where they belong to a cluster or family that contains a known
sequence, such as a diagnostic sequence, antigen or other therapeutic
target.
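
The three steps summarized above (building a similarity network, re-wiring it by nearest-neighbor overlap, and reading clusters off the topology) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the text: the function names, the use of a fixed similarity threshold to create edges, the Jaccard index over closed neighborhoods as the overlap measure, the `min_overlap` cutoff, and the use of connected components as the cluster definition. The toy scores are invented.

```python
from itertools import combinations


def build_network(similarity, threshold):
    """Connect two sequences when their similarity score meets the threshold.

    `similarity` maps (seq_a, seq_b) pairs to a score; the threshold rule is
    an assumption, not a method stated in the text.
    """
    nodes = set()
    for a, b in similarity:
        nodes.update((a, b))
    neighbors = {n: set() for n in nodes}
    for (a, b), score in similarity.items():
        if score >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def neighbor_overlap(neighbors, a, b):
    """Jaccard overlap of the closed neighborhoods of a and b (assumed measure)."""
    na = neighbors[a] | {a}
    nb = neighbors[b] | {b}
    return len(na & nb) / len(na | nb)


def rewire(neighbors, min_overlap):
    """Keep or add an edge only where the pair's neighborhoods overlap enough."""
    rewired = {n: set() for n in neighbors}
    for a, b in combinations(sorted(neighbors), 2):
        if neighbor_overlap(neighbors, a, b) >= min_overlap:
            rewired[a].add(b)
            rewired[b].add(a)
    return rewired


def clusters(neighbors):
    """Connected components, used here as the simplest topology-based clusters."""
    seen, result = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        result.append(component)
    return result


# Toy scores (invented): two tight triples joined by one weaker link, C-D.
toy_scores = {
    ("A", "B"): 0.90, ("A", "C"): 0.80, ("B", "C"): 0.85,
    ("C", "D"): 0.60,
    ("D", "E"): 0.90, ("D", "F"): 0.90, ("E", "F"): 0.95,
}
network = build_network(toy_scores, threshold=0.5)
optimized = rewire(network, min_overlap=0.5)
print(clusters(network))    # clusters before re-wiring
print(clusters(optimized))  # clusters after re-wiring
```

In this toy input the single C-D link joins all six sequences into one component before re-wiring. Because C and D share few neighbors, the overlap rule drops that link, and the two triples separate into their own clusters. Other overlap measures or cluster definitions could be substituted without changing the overall structure of the sketch.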