I can tell you how I have done something similar on Mac OS X, using only built-in tools and features and very simple bash scripts. You are using Windows, so you will have to swap in the matching Windows tools (.bat files instead of bash, etc.) and may even need to install a few things. Even if you don't use it, it may be of interest to other Mac users.
Here it goes:
First, save this very crude bash script into a file (sorry, I'm not a bash programmer):
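#!/bin/bash
function navigate_directory {
    # Remember where we started so we can return there even when "$1"
    # is a multi-level path (a plain "cd .." would end up elsewhere).
    local startDir=$PWD
    cd "$1" || return
    local anItem md5cs
    for anItem in *
    do
        if [ -d "$anItem" ]
        then
            echo "$level$anItem"
            level=$level"."
            navigate_directory "$anItem"
            level=${level:1}
        elif [ -f "$anItem" ] && [ "`mdls -name md5cs -raw "$anItem"`" = "(null)" ]
        then
            # No checksum stored yet: compute it and save it as an
            # extended attribute.
            md5cs=`md5 -q "$anItem"`
            xattr -w com.apple.metadata:md5cs $md5cs "$anItem"
        fi
    done
    cd "$startDir"
}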
export level="."
# "$@" keeps each argument separate; "$*" would join them into one string.
for anItem in "$@"
do
    echo "$anItem"
    navigate_directory "$anItem"
done
All that script does is crawl through all the directories in the input, and for each file it calculates the MD5 checksum (hint: md5cs=`md5 -q "$anItem"` ). Then it uses xattr to save the MD5 checksum as an extended attribute that can be searched using Spotlight (you would need to use the equivalent search feature in Windows 7).
Because you want it to be searchable through Spotlight the "legal" way to do this is by creating your own little application that "registers" the attribute in the system. But that is waaaaaay too much work for something that you don't plan to use a week from now, so just cheat and register it as an Apple metadata attribute: xattr -w com.apple.metadata:md5cs $md5cs "$anItem"
(if this makes you uncomfortable you can later delete the attributes using a similar function)
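A minimal sketch of such a cleanup, assuming the attribute name used by the script above. remove_md5_attrs is a made-up name, and I have not run this on a large tree, so try it on a small directory first.

```shell
#!/bin/bash
# Hypothetical cleanup helper: remove the md5cs attribute from every regular
# file under "$1". Assumes the attribute was written as
# com.apple.metadata:md5cs, as in the indexing script.
remove_md5_attrs() {
    find "$1" -type f -print0 |
    while IFS= read -r -d '' path; do
        # xattr -d fails when the attribute is absent; ignore that case.
        xattr -d com.apple.metadata:md5cs "$path" 2>/dev/null || true
    done
}

# Usage (Mac OS X): remove_md5_attrs ~/Documents
```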
To index everything, run the script from the base directory of your filesystem (not sure how to do that in Windows, you may have to run it on every drive), or just run on the directories that have your files (it's pointless to index the system files). The time it will take depends on the number and size of the files you have. Given your 4.2 million files in 4.9 TB it should take a day or so given your fast hardware.
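As a rough sanity check on that estimate: hashing is usually limited by how fast the disks can be read, so the time is roughly total bytes divided by read throughput. The 100 MB/s figure below is an assumption, not a measurement of your hardware, and the per-file overhead for millions of small files comes on top of it.

```shell
#!/bin/bash
# Rough estimate: whole hours needed to read (and hash) a given amount of data.
# $1 = total size in bytes, $2 = assumed sustained read speed in MB/s.
estimate_hours() {
    local bytes=$1 mb_per_s=$2
    echo $(( bytes / (mb_per_s * 1000000) / 3600 ))
}

# 4.9 TB at an assumed 100 MB/s:
estimate_hours 4900000000000 100   # prints 13
```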
At this point, if you do a Spotlight search for the MD5 checksum of a file, you will almost immediately get a list of all its dupes. (If you don't, you may need to rebuild the Spotlight index on every drive, for example by turning indexing off and then back on with mdutil -i off and mdutil -i on, or by erasing the index with mdutil -E. I don't think it's necessary, but YMMV.)
Now copy this other bash script. Note how it is very similar to the above one.
#!/bin/bash
function get_md5_for_file {
    # Return the MD5 checksum of a file, stored in the md5cs
    # extended attribute.
    local md5cs
    md5cs=`mdls -raw -name md5cs "$1"`
    if [ "$md5cs" = "(null)" ]
    then
        # If the attribute hasn't been created, do it now for future queries
        md5cs=`md5 -q "$1"`
        xattr -w com.apple.metadata:md5cs $md5cs "$1"
    fi
    echo "$md5cs"
}
function navigate_directory {
    # Remember the starting directory; a plain "cd .." would be wrong
    # when "$1" is a multi-level path.
    local startDir=$PWD
    cd "$1" || return
    local anItem md5cs numDupes
    for anItem in *
    do
        if [ -d "$anItem" ]
        then
            navigate_directory "$anItem"
        elif [ -f "$anItem" ]
        then
            # If $anItem is a file, get the md5 checksum, search for it
            # using mdfind, and count the number of hits.
            md5cs=`get_md5_for_file "$anItem"`
            numDupes=`mdfind -count "md5cs == '$md5cs'"`
            if [ "$numDupes" -gt 1 ]
            then
                # The file has at least one dupe.
                # List them all in the results file, one path per line.
                echo "$md5cs $numDupes $anItem" >> ~/search_results.txt
                mdfind "md5cs == '$md5cs'" | sed 's/^/  /' >> ~/search_results.txt
            else
                # This is optional: all non-dupes are listed in the
                # uniques file.
                echo "$md5cs $anItem" >> ~/search_uniques.txt
            fi
        fi
    done
    cd "$startDir"
}
if [ $# -eq 1 ] && [ -f "$1" ]
then
    # If the input is a normal file, get its MD5 checksum and launch a
    # Spotlight search for it in a Finder window.
    md5cs=`get_md5_for_file "$1"`
    echo "$md5cs" | pbcopy
    osascript /path/to/script/start_find.scpt
    pbpaste
    echo "$@"
else
    for anItem in "$@"
    do
        #echo $anItem
        navigate_directory "$anItem"
    done
fi
Sorry for the sparse comments and the crudeness of the code. The second script takes one or more directories (or several files) as input and crawls through them, checking whether each file has one or more dupes anywhere in the system (not only in the selected dirs). Files with dupes are listed in ~/search_results.txt, and the rest go to ~/search_uniques.txt. To use it, pass it the directory (or directories) you want to check for dupes.
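If your system has no Spotlight equivalent, the same grouping can be sketched from a plain list of checksums. This assumes input lines of the form checksum, tab, path (a format chosen here for illustration, not something the scripts above produce), and report_dupes is a made-up name. Paths containing tabs or newlines would confuse it.

```shell
#!/bin/bash
# Hypothetical helper: read "checksum<TAB>path" lines on stdin and print every
# group of paths that share a checksum, preceded by the checksum and the count.
report_dupes() {
    sort | awk -F '\t' '
        # Print the group collected so far if it has more than one member.
        function emit() {
            if (n > 1) {
                print prev, n
                for (i = 1; i <= n; i++) print "    " paths[i]
            }
        }
        $1 != prev { emit(); n = 0; prev = $1 }
        { paths[++n] = $2 }
        END { emit() }
    '
}

# Tiny demonstration with made-up checksums:
printf 'aaa\t/x/one\nbbb\t/x/two\naaa\t/y/one copy\n' | report_dupes
```

The demonstration prints the checksum aaa with a count of 2, followed by its two paths; the unique file under bbb is not listed.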
It has a bonus: if the input is a single file, it will perform a Spotlight search for its dupes in a Finder window, but for that you need to save this AppleScript (it's the start_find.scpt called by osascript above):
tell application "Finder" to activate
tell application "System Events"
    click menu item "New Finder Window" of ((process "Finder")'s (menu bar 1)'s (menu bar item "File")'s (menu "File"))
end tell
tell application "System Events"
    click menu item "Find" of ((process "Finder")'s (menu bar 1)'s (menu bar item "File")'s (menu "File"))
end tell
tell application "System Events"
    click menu item "Paste" of ((process "Finder")'s (menu bar 1)'s (menu bar item "Edit")'s (menu "Edit"))
end tell
Another bonus: you can save the second script as a service made with Automator. Remove the #!/bin/bash line and paste the rest into a "Run Shell Script" action (shell: /bin/bash, pass input: as arguments, since the script reads its input from "$@"), and set the service to receive selected files or folders in Finder.app. Then you can select the target files/dirs in the Finder and run the service from the contextual menu (right click or control-click).
Caveats:
I have tested this only in Snow Leopard.
MD5 does not guarantee that all hits are really duplicates. To guard against the very unlikely event of a collision (two different files that happen to have the same MD5), you may want to do additional checks (size, etc.), or just verify the supposed dupes manually before nuking them.
And finally, again, this technique uses several features exclusive to Mac OS X, so you will need to adapt it to your system's matching features.
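For the collision caveat, a byte-for-byte comparison is a cheap final check before deleting anything. A minimal sketch using the standard cmp tool; same_contents is a name made up here.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if both files have identical contents.
# cmp -s compares byte by byte and prints nothing; it also fails when the
# files differ in size.
same_contents() {
    cmp -s "$1" "$2"
}

# Usage: same_contents original.dat suspected_dupe.dat && echo "identical"
```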