This is a very fun programming task!
Since the job will be totally limited by disk I/O, the language you choose doesn't really matter, as long as you make sure that you never read any file more than once:
1) Recursively scan all disks/directories, saving just the file name and size plus a pointer to the directory it was found in. (All three steps are sketched in code after this list.)
If you have multiple physical disks, you can run this in parallel, with one task/thread per disk.
2) Sort the list by file size.
3) For each file size with multiple entries do:
3a) Check how many files share the size and how large they are:
3a1) Just two files: Read them both in parallel, using a block size of 1 MB or more to avoid extra disk seeks, and compare directly. Exit on the first difference, of course!
3a2) Three or more files: Read them all interleaved, still using a 1 MB+ block size. For each block, calculate a CRC32 or secure hash and compare these at the end of each block iteration. When a single file differs from all the rest, it is unique and can be dropped from the group.
When two or more files are still equal to each other but differ from the majority of the group, recurse into a new invocation of the comparison function to check the smaller group, then go on with the rest upon return.
It should be obvious that the comparison function needs to accept an array of open file handles/descriptors plus an offset to start comparing at, which makes it easy to call recursively to check the tails of a sub-group!
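Since the language really doesn't matter, here is a minimal sketch of steps 1-3 in Python. All the names (scan, compare_group, find_duplicates) are my own, SHA-1 stands in for the "CRC32 or secure hash", and the two-file case (3a1) goes through the same hashing loop instead of a direct compare, just to keep the sketch short:

    import os
    import hashlib
    from collections import defaultdict

    BLOCK_SIZE = 1024 * 1024  # 1 MB+ blocks, to amortize seek cost

    def scan(roots):
        """Step 1: recursively collect (size, path) for every file."""
        files = []
        for root in roots:
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    try:
                        files.append((os.path.getsize(path), path))
                    except OSError:
                        pass  # unreadable entry: just skip it
        return files

    def compare_group(handles, offset, size):
        """Step 3: interleaved block reads over same-size files.

        All files in `handles` are known identical up to `offset`; returns
        the sub-groups that are identical over the whole file."""
        while offset < size:
            buckets = defaultdict(list)
            for f in handles:
                f.seek(offset)
                buckets[hashlib.sha1(f.read(BLOCK_SIZE)).digest()].append(f)
            offset += BLOCK_SIZE
            if len(buckets) == 1:
                continue  # still all equal: go on to the next block
            # The group split: recurse to check the tail of each sub-group.
            result = []
            for sub in buckets.values():
                if len(sub) > 1:  # a singleton is unique, drop it
                    result.extend(compare_group(sub, offset, size))
            return result
        return [handles]  # end of file reached: these are true duplicates

    def find_duplicates(roots):
        files = sorted(scan(roots))  # step 2: sort by size
        dupes = []
        i = 0
        while i < len(files):
            size = files[i][0]
            j = i
            while j < len(files) and files[j][0] == size:
                j += 1
            if j - i > 1:  # step 3: only sizes with multiple entries
                handles = [open(path, 'rb') for _size, path in files[i:j]]
                for group in compare_group(handles, 0, size):
                    dupes.append([f.name for f in group])
                for f in handles:
                    f.close()
            i = j
        return dupes

Calling find_duplicates(['/disk1', '/disk2']) then returns the groups of paths whose contents are byte-for-byte identical; the per-disk parallel scanning and the two-file fast path are left out to keep the sketch short.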
(A possible problem can occur if you have _very_ many files of the same size: the operating system could run out of file handles for simultaneously open files! In that case I'd fall back on passing in file paths instead of open handles and take the hit of re-opening each file for each block to be read. I would also increase the block size significantly, into the 10-100 MB range, so that everything except big ISOs and the like would be read in a single access. The same approach is probably optimal for files smaller than the minimum block size anyway, since those are read in a single access either way.)
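A minimal sketch of that fallback, assuming compare_group from above is reworked to track paths and call a helper like this (read_block is my own name) instead of holding handles open:

    def read_block(path, offset, block_size=64 * 1024 * 1024):
        """Re-open the file for every block so only one descriptor is ever
        live; the 64 MB default (in the 10-100 MB range suggested above)
        amortizes the extra open() and seek per read."""
        with open(path, 'rb') as f:
            f.seek(offset)
            return f.read(block_size)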
This algorithm should be able to do what you need in significantly less time than you'd need to just read everything once. I'd estimate about 50 MB/s effective reading speed, so even in the worst case, where everything is on a single disk (4.9 TB? Not very likely!) and every single file size has multiple entries that only differ in the last byte, you would need 4.9 TB / 50 MB/s ≈ 100 K seconds, or a little more than a day. My guess is you should easily finish overnight!
Terje