Data Storage

Submission: How do I De-Dup a system with 4.2 million files?

jamiedolan writes: I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. Lots of files of all types, from digital photos & scans to HD video files (also web site backups mixed in, which are the cause of such a high number of files). In more recent times I've organized files in a reasonable folder system and have an active / automated backup system. The problem is that I know I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that later got consolidated onto a single larger drive), chewing up space. I tried running a free de-dup program, but it ran for a week straight and was still "processing" when I finally gave up on it. I have a fast system, an i7 at 2.8 GHz with 16 GB of RAM, but I currently have 4.9 TB of data with a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old sloppy filing (folder) system. I do need to keep the data; nuking it is not a viable option. Thanks. Jamie Dolan

  • In my opinion, patience is your tool of choice. The issue with the program you tried is that you won't know whether it works well before you use it, so you might want to look up how it runs to make sure it will sort your files as you wish. The way I would do it (will, actually, because I'm heading towards the same problem) is to write a simple script that calculates checksums for all files and then finds dupes based on those checksums. I would choose this approach for two reasons: 1. The computer c
    • Found an easy solution at http://elonen.iki.fi/code/misc-notes/remove-duplicate-files/

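      OUTF=rem-duplicates.sh
      echo "#! /bin/sh" > "$OUTF"
      # Hash every file, sort by hash, keep only groups with a repeated hash,
      # then write each path as a commented-out "#rm" line for manual review.
      find "$@" -type f -exec md5sum {} \; |
        sort --key=1,32 |
        uniq -w 32 -d --all-repeated=separate |
        sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> "$OUTF"
      chmod a+x "$OUTF"
      ls -l "$OUTF"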

      This solution is based on the md5 checksum, which is probably the fastest and least reliable of the checksums; sha256 or sha512 would be better but will requir
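
One way to cut down the hashing work on a collection this size, sketched here under assumptions rather than taken from either comment above: compare file sizes first and checksum only the files whose size matches at least one other file, since files of different sizes cannot be duplicates. The sketch uses sha256sum in place of md5sum, as the comment suggests. The function name list_dupes is made up for illustration, and the script assumes GNU find, sort, awk, xargs, uniq and sha256sum (for example under Cygwin or from a Linux live system) and file names without newlines or backslashes.

```shell
#!/bin/sh
# list_dupes DIR...  (hypothetical helper name)
# Prints groups of files with identical SHA-256 digests, one
# "digest  path" line per file, groups separated by a blank line.
# Assumes GNU tools and paths without newlines or backslashes.
list_dupes() {
    # 1. Emit "size<TAB>path" for every regular file and sort by size.
    find "$@" -type f -printf '%s\t%p\n' |
    sort -n |
    # 2. Keep only paths whose size is shared with a neighbouring line.
    awk -F '\t' '
        { path = substr($0, length($1) + 2) }
        NR > 1 && $1 == prev_size {
            if (!printed) print prev_path
            print path
            printed = 1
        }
        NR == 1 || $1 != prev_size { printed = 0 }
        { prev_size = $1; prev_path = path }
    ' |
    # 3. Hash only those candidates; -r skips the run if there are none.
    tr '\n' '\0' |
    xargs -0 -r sha256sum |
    # 4. Group lines by the 64-character hex digest and keep repeats.
    sort |
    uniq -w 64 --all-repeated=separate
}
```

Called as `list_dupes /path/to/drive1 /path/to/drive2 > dupes.txt` (placeholder paths), it only lists candidate groups for review; nothing is deleted.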
