Well yes, this is a Linux tool, but I was still quite pleased with its results on 800k files. It took a while, but it did finish.
It's basically a shell script doing what others have suggested: sort by size, then checksum only files of the same size. It lives at /usr/share/fslint/fslint/findup
find dUPlicate files.
Usage: findup [[[-t [-m|-d]] | [--summary]] [-r] [-f] path(s) ...]
If no path(s) specified then the current directory is assumed.
When -m is specified any found duplicates will be merged (using hardlinks).
When -d is specified any found duplicates will be deleted (leaving just 1).
When -t is specified, only report what -m or -d would do.
When --summary is specified change output format to include file sizes.
You can also pipe this summary format to /usr/share/fslint/fslint/fstool/dupwaste
to get a total of the wastage due to duplicates.
As it's a single command line with dozens of pipes, the stages run concurrently, so it can use multiple cores if needed.
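The size-then-checksum idea can be sketched in a few lines of shell. This is a simplified illustration, not findup itself; it assumes GNU find and coreutils, and like findup it mishandles file names containing newlines:

```shell
# find_dups DIR: print groups of identical files under DIR,
# blank-line separated (simplified sketch of the findup approach).
find_dups() {
    find "$1" -type f ! -empty -printf '%s\t%p\n' |  # size<TAB>path
    sort -n |                                        # equal sizes become adjacent
    awk -F'\t' '
        { size[NR] = $1; path[NR] = $2; seen[$1]++ }
        END { for (i = 1; i <= NR; i++)              # keep only sizes seen 2+ times
                  if (seen[size[i]] > 1) print path[i] }' |
    xargs -r -d '\n' md5sum |                        # checksum candidates only
    sort |                                           # equal checksums become adjacent
    uniq --check-chars=32 --all-repeated=separate    # group matching md5 sums
}
```

The point of the two-stage filter is that checksumming is expensive, so only files that share a size (a cheap stat) ever get read.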
Some text from the source:
will show duplicate files in the specified directories
(and their subdirectories), in the format:
or if the --summary option is specified:
2 * 2048 file1 file2
3 * 1024 file3 file4 file5
Where the number is the disk usage in bytes of each of the
duplicate files on that line, and all duplicate files are
shown on the same line.
Output is ordered by largest disk usage first and
then by the number of duplicate files.
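If you just want the total, the summary lines are easy to add up yourself. A hedged sketch of the arithmetic (one copy of each file must be kept, so each "N * SIZE file..." line wastes (N - 1) * SIZE bytes; dupwaste's exact output format may differ):

```shell
# dup_waste: read findup --summary lines ("N * SIZE file...") on stdin
# and print total recoverable bytes, keeping one copy of each group.
dup_waste() {
    awk '{ waste += ($1 - 1) * $3 } END { print waste + 0 }'
}

# e.g. /usr/share/fslint/fslint/findup --summary . | dup_waste
```

For the two example lines above this prints 4096: (2 - 1) * 2048 plus (3 - 1) * 1024.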
I compared this to every equivalent util I could find (as of Nov 2000)
and it's (by far) the fastest, has the most functionality (thanks to
find), and has no (known) bugs. In my opinion fdupes is the next best, but
it's slower (even though written in C) and has a bug where hard links
in different directories are sometimes reported as duplicates.
This script requires uniq > V2.0.21 (part of GNU textutils|coreutils)
dir/file names containing \n are ignored
undefined operation for dir/file names containing \1
sparse files are not treated differently.
Don't specify params to find that affect the output (e.g. -printf).
zero length files are ignored.
symbolic links are ignored.
path1 & path2 can be files &/or directories
And the code has optimizations like this one:
sort -k2,2n -k3,3n | #NB sort inodes so md5sum does less seeking all over disk
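That comment refers to sorting candidate files by inode before checksumming, so md5sum reads them in roughly on-disk order instead of seeking back and forth. A standalone sketch of the same trick (my illustration, not findup's code; it assumes GNU find's -printf with %D/%i and paths without newlines):

```shell
# checksum_in_disk_order DIR: md5sum every file under DIR, visiting
# them in device/inode order to keep disk reads roughly sequential.
checksum_in_disk_order() {
    find "$1" -type f -printf '%D %i %p\n' |  # device inode path
    sort -k1,1n -k2,2n |                      # sort by device, then inode
    cut -d' ' -f3- |                          # strip back down to the path
    xargs -r -d '\n' md5sum
}
```

On spinning disks this ordering matters a lot more than on SSDs, since inode order usually correlates with physical block order.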