Comment Re:CRC (Score 0) 440
Other commenters have observed your big-O inefficiency here. I had the same initial approach in my script http://gnosis.cx/bin/find-duplicate-contents (that I've mentioned in another post), but changed it when I realized it took hundreds of times as long to run as the correct hash-and-sort approach.
The problem is that if you have many files of the same size, you need to do a pairwise comparison of every pair from among them. This multiplies the operations enormously in what I found to be typical cases on my filesystem. For example, if you have 50 files that all are *exactly* 1GB in size, you need to do "50 choose 2" comparisons, i.e. 1225 of them. If the files wind up differing in the first few bytes (few meaning even first few megabytes), that's still possibly cheaper than hashing the whole files. However, if genuine duplicates exist in many of those cases--which is more typical--you wind up having to read through the entire identical content many times.
In contrast, if when you find same-sized files you commit to making exactly one read of each of them (but indeed, the entirety of them), you can store an MD5 or other hash of the whole thing, and then just compare those short MD5 sums. In actual testing and refinement, I have found this to be a hell of a lot cheaper (as in tens or hundreds of times cheaper in my actual test cases). Of course, the actual answer depends on what files of what sizes you actually have, and it is easy to construct artificial cases in which my approach (I didn't invent it, of course) loses. But not in my real-life experience.