Described is a storage reports duplicate file detector that operates by
receiving file records during a first scan of file system metadata. The
detector computes a hash value from attributes in each record, and maintains
that hash value in association with information that indicates whether the
hash value corresponds to more than one file. In one implementation, the
information is the amount of space wasted by duplication. The
information is used to determine which hash values correspond to groups
of potentially duplicate files, and to eliminate non-duplicates. A second
scan locates file information for each of the potentially duplicate
files, and that file information is then used to determine which groups of
potentially duplicate files contain actual duplicates.
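
The two-scan flow described above can be illustrated with a minimal Python sketch. It is not the described implementation: the `FileRecord` fields, the choice of name, size and modification time as the hashed attributes, the CRC-32 hash, and the `buckets` parameter are all assumptions made for illustration. The first pass keeps only a per-hash tally of wasted space; the second pass re-reads the records and confirms candidates by comparing the attributes themselves.

```python
import zlib
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple


class FileRecord(NamedTuple):
    # Hypothetical metadata record; the field names are illustrative.
    path: str
    name: str
    size: int
    mtime: int


def attr_key(rec: FileRecord) -> tuple:
    # Attributes assumed here to identify a potential duplicate.
    return (rec.name.lower(), rec.size, rec.mtime)


def attr_hash(rec: FileRecord, buckets: int) -> int:
    # Hash of the attributes, folded into a fixed number of buckets.
    return zlib.crc32(repr(attr_key(rec)).encode()) % buckets


def find_duplicates(scan: Callable[[], Iterable[FileRecord]],
                    buckets: int = 1 << 16) -> list:
    # First scan: per hash value, accumulate the space wasted by every
    # file after the first one seen. Zero means at most one file, or
    # only zero-length files, which waste no space.
    wasted = {}
    for rec in scan():
        h = attr_hash(rec, buckets)
        wasted[h] = wasted[h] + rec.size if h in wasted else 0

    # Hash values with wasted space mark groups of potential duplicates.
    candidates = {h for h, w in wasted.items() if w > 0}

    # Second scan: collect file information only for candidates, then
    # confirm by grouping on the real attributes, not the hash.
    groups = defaultdict(list)
    for rec in scan():
        if attr_hash(rec, buckets) in candidates:
            groups[attr_key(rec)].append(rec.path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


records = [
    FileRecord("a/report.txt", "report.txt", 100, 1),
    FileRecord("b/report.txt", "report.txt", 100, 1),
    FileRecord("c/other.txt", "other.txt", 50, 2),
]
duplicates = find_duplicates(lambda: iter(records))
```

Because the second pass compares the attributes rather than the hash, a hash collision in the first pass only enlarges the candidate set; it does not produce a false duplicate group. This sketch treats matching attributes as confirmation, whereas a real detector might also compare file contents.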