Comment: Re:Dedup is just a marketing word.... (Score 3, Informative) 306
For our production systems it depends 100% on the actual amount of duplicated data, since bulk data reads are needed to verify the duplication. The number of passes is almost irrelevant because they primarily scan meta-data N times, not bulk data (duplicated bulk data only has to be verified once).
The meta-data can be scanned much more quickly than the duplicated bulk data can be verified, because the meta-data is laid out on the physical disk fairly optimally for the B-Tree scan the de-dup code issues. So meta-data can be read from the hard disk at 40 MBytes/sec even without the use of an SSD to cache it. Of course, with DFly's swapcache and the meta-data cached on the SSD, that scan runs at 200-300 MBytes/sec.
But in contrast, the bulk reads used to validate the duplicate data just aren't going to be laid out linearly on the disk. There's a lot of skipping around... so the more actual duplicate data we have the larger the percentage of the disk's surface we have to read to verify it.
This is an area which I could further optimize in HAMMER's dedup code. Currently I do not sort the bulk data block numbers when running the data verification pass. Not only that, but I am scanning a sorted CRC list, so the bulk data offsets are going to be seriously unsorted. Sorting them would definitely improve performance, probably quite a bit, but it would still not be anywhere near the 40 MBytes/sec the meta-data scan can achieve off the platter. It would not be a whole lot of programming, probably a day's work. It currently isn't at the top of my list though.
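To make the ordering problem concrete, here is a rough Python sketch (not HAMMER's actual code; the record layout, the function names and the toy offsets are all made up for illustration). It takes (crc, offset) pairs as a meta-data scan might produce them, lists the blocks that need a verification read in CRC order, and compares the total head travel against reading the same blocks sorted by offset. Summed offset deltas are only a crude stand-in for real seek cost.

```python
from collections import defaultdict

def candidate_offsets_crc_order(records):
    """Return offsets of blocks sharing a CRC with another block, in CRC order.

    records: iterable of (crc, data_offset) pairs (hypothetical layout).
    """
    groups = defaultdict(list)
    for crc, offset in records:
        groups[crc].append(offset)
    order = []
    for crc in sorted(groups):
        # Only blocks whose CRC collides with another block need a bulk read.
        if len(groups[crc]) > 1:
            order.extend(groups[crc])
    return order

def seek_distance(offsets):
    """Crude cost model: sum of absolute offset deltas between successive reads."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))

# Toy data: two duplicate candidates (CRC 0x10 and 0x20), one unique block (0x30).
records = [(0x10, 900), (0x20, 100), (0x10, 50), (0x20, 800), (0x30, 400)]

crc_order = candidate_offsets_crc_order(records)
offset_order = sorted(crc_order)

print(crc_order, seek_distance(crc_order))
print(offset_order, seek_distance(offset_order))
```

Even on this tiny example the offset-sorted order roughly halves the distance travelled. A real implementation would still have to bring the two blocks of each candidate pair together for comparison, which is part of why sorting helps but does not turn the verification pass into a linear read.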
What this means, in summary (and even with semi-sorting of the bulk data blocks), is that one can use a bounded amount of RAM without really affecting the efficiency of the off-line de-duplication.
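A minimal sketch of how a bounded-RAM, multi-pass scan could look, again in Python and again not taken from HAMMER. Splitting the CRC space into equal ranges, one range per pass, is my assumption for the illustration, as are the names dedup_candidates and scan_metadata. Each pass rescans all of the meta-data but only keeps the CRCs that fall in its range, so the in-memory table shrinks as the pass count grows while the set of duplicate candidates found stays the same.

```python
from collections import defaultdict

def dedup_candidates(scan_metadata, passes, crc_bits=32):
    """Yield (crc, offsets) for CRCs seen more than once, using `passes` scans.

    scan_metadata: callable returning a fresh iterator of (crc, offset) pairs
                   (hypothetical stand-in for the B-Tree meta-data scan).
    """
    span = 1 << crc_bits
    for i in range(passes):
        # This pass only tracks CRCs in [lo, hi); everything else is skipped.
        lo = span * i // passes
        hi = span * (i + 1) // passes
        table = defaultdict(list)
        for crc, offset in scan_metadata():
            if lo <= crc < hi:
                table[crc].append(offset)
        for crc, offsets in table.items():
            if len(offsets) > 1:
                yield crc, offsets

# Toy meta-data with 8-bit CRCs so the ranges are easy to follow.
records = [(0x10, 900), (0x20, 100), (0x10, 50),
           (0xF0, 800), (0xF0, 400), (0x80, 7)]

def scan():
    return iter(records)

one_pass = sorted(dedup_candidates(scan, passes=1, crc_bits=8))
four_pass = sorted(dedup_candidates(scan, passes=4, crc_bits=8))
print(one_pass == four_pass)
```

The extra passes only cost more meta-data scans, which are the cheap part. The bulk verification reads are the same regardless of how many passes were used to find the candidates.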
-Matt