Data Storage

Submission: How do I De-Dup a system with 4.2 million files?

jamiedolan writes: I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. There are lots of files of all types, from digital photos and scans to HD video files (plus web site backups mixed in, which are the cause of such a high number of files). More recently I've organized files in a reasonable folder system and set up an active, automated backup system. The problem is that I know I have many old files that have been duplicated multiple times across my drives, chewing up space (many came from quick backups of important data to an external drive that later got consolidated onto a single larger drive). I tried running a free de-dup program, but it ran for a week straight and was still "processing" when I finally gave up on it. I have a fast system, an i7 at 2.8 GHz with 16 GB of RAM, but currently have 4.9 TB of data in a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old, sloppy folder system. I do need to keep the data; nuking it is not a viable option. Thanks. Jamie Dolan
This discussion was created for logged-in users only, but now has been archived. No new comments can be posted.

  • In my opinion, patience is your tool of choice. The issue with the program you tried is that you can't know whether it works well before you use it, so you might want to look up how it runs to make sure it will sort your files the way you want. What I would do (and actually will, because I'm heading towards the same problem) is write a simple script that calculates checksums for all files and then finds dupes based on those checksums. I would choose this approach for two reasons: 1. The computer c
    • Found an easy solution at []:

      echo "#! /bin/sh" > $OUTF; find "$@" -type f -exec md5sum {} \; | sort --key=1,32 | uniq -w 32 -d --all-repeated=separate | sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF; chmod a+x $OUTF; ls -l $OUTF

      This solution is based on the md5 checksum, which is probably the fastest and least reliable of the checksums; sha256 or sha512 would be better but will requir
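
The checksum approach described in the comments above can be sketched roughly as follows. This is only a sketch under stated assumptions, not the script the commenters linked to. It assumes GNU find, awk, xargs, sort, uniq and sha256sum are available (for example on Linux, or under Cygwin on Windows), and that file names contain no newlines or backslashes. The function name list_dupes is made up for illustration. It swaps md5sum for sha256sum and adds one step that is not in the one-liner: files are compared by size first, so only files that share a size with at least one other file get hashed. On 4.9 TB of data that may save a lot of reading, though how much depends on the data.

```shell
#!/bin/sh
# Sketch only. Assumptions: GNU find/awk/xargs/sort/uniq/sha256sum,
# and file names without newlines or backslashes.

# list_dupes DIR...   (hypothetical helper name)
# Prints groups of identical files as "<sha256>  <path>" lines,
# with one blank line between groups. Nothing is deleted.
list_dupes() {
    # 1. Emit "<size><TAB><path>" for every regular file.
    find "$@" -type f -printf '%s\t%p\n' |
    # 2. Keep only paths whose size occurs more than once;
    #    a file with a unique size cannot have a duplicate.
    awk -F '\t' '
        { size[NR] = $1; path[NR] = substr($0, length($1) + 2); count[$1]++ }
        END { for (i = 1; i <= NR; i++) if (count[size[i]] > 1) print path[i] }
    ' |
    # 3. Hash the remaining candidates, one path per input line.
    xargs -r -d '\n' sha256sum -- |
    # 4. Sort by digest, then keep only lines whose first 64
    #    characters (the hex digest) are repeated.
    sort -k 1,1 |
    uniq -w 64 --all-repeated=separate
}
```

Calling something like `list_dupes /path/to/drive1 /path/to/drive2 > dupes.txt` (paths hypothetical) would write the groups to a file for review before anything is removed. Note that empty files all share one digest and hard links to the same file count as copies, so both will show up as groups.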
