First of all, one of the most commonly duplicated blocks is the NUL block, that is, a block of data where all bits are 0, corresponding to unused space or space that was used and then zeroed.
If you have a virtual machine on a fresh 30GB disk with 10GB actually in use, you have at least 20GB that could be freed up by dedup.
But that is a "bug" in the storage management of virtualization environments. If blocks are not used, they should not be allocated. Allocating them and then saying "but we can reduce the space they use" sounds like a hack at best. It's a workaround rather than a solution.
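To put a number on the NUL-block point, here is a minimal sketch that counts all-zero blocks in an image. The function name `count_zero_blocks` and the 4096-byte block size are my own assumptions for illustration, not anything a particular dedup product or hypervisor uses.

```python
from typing import Tuple

def count_zero_blocks(data: bytes, block_size: int = 4096) -> Tuple[int, int]:
    """Return (zero_blocks, total_blocks) for fixed-size blocks of data."""
    zero = bytes(block_size)  # reference block of all-zero bytes
    zeros = total = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        total += 1
        # The last block may be short, so compare against a matching prefix.
        if block == zero[:len(block)]:
            zeros += 1
    return zeros, total

# Hypothetical tiny "disk": one used block followed by two never-written blocks.
image = b"\x01" * 4096 + bytes(8192)
zeros, total = count_zero_blocks(image)
print(f"{zeros} of {total} blocks are all zero")
```

Whether those blocks get collapsed by dedup or never allocated in the first place (thin provisioning, sparse files) is exactly the argument above; the count is the same either way.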
Second, if you have multiple VMs on a dedup store, many of the OS files will be duplicates.
Even on a single system, many system binaries and libraries will contain duplicate blocks.
Of course multiple binaries statically linked against the same libraries will have dups.
But certain files in the OS also share a common structure, with similarities so great that they will contain duplicate blocks.
If you optimize your OS install, you will not gain much by deduplicating all those OS files. If your infrastructure is so small that you can afford to be wasteful and get away with default installs, then why bother with the complexity and costs of deduplication at all?
Then if the system actually contains user data, there is probably duplication within the data.
For example, mail stores will commonly have many duplicates.
But to how many people out there does this actually apply? There may be quite a few corporate environments where this is the case, but many, if not most, people will probably have fairly unique data in their VMs. Deduplication doesn't help much in that case.
One user sent an e-mail message to 300 people in your organization -- guess what, that message is going to be in 300 mailboxes.
If users store files on the system, they will commonly make multiple copies of their own files.
E.g. mydocument-draft1.doc, mydocument-draft2.doc, mydocument-draft3.doc
Can MS Word files be large enough to matter? Yes, if you get enough of them.
Besides, they have a common structure that is the same for almost all MS Word files. Even documents whose text is not at all similar are likely to have some duplicate blocks, which you have just accepted in the past -- it's supposed to be a very small amount of space per file, but in reality a small amount of waste multiplied by thousands of files adds up.
Again, this only applies to very specific environments. The 40GB MySQL db of customer A and the 60GB db of customer B don't really share any data at all, and the data-to-OS-files ratio is probably in the ballpark of 100:1, so I see very little gain.
Just because data seems to be all different doesn't mean dedup won't help with storage usage.
I don't doubt there are infrastructures that will benefit hugely from this, but I think those infrastructures are the minority. I just see all this undifferentiated hype about how this will reduce people's storage troubles when it really only applies to a (relatively) small group of people out there. If you do the deduplication at a big-chunk level, you'll get little overhead but won't find many duplicates. If you do the deduplication at a more fine-grained level, you'll find more duplicates but incur more overhead for the deduplication process.
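The chunk-size trade-off is easy to see in a toy example. This is only a rough sketch assuming fixed-size blocks and SHA-256 fingerprints; the `dedup_stats` helper and the sample data are made up for illustration, and real dedup stores may use variable-size chunking or other hashes.

```python
import hashlib
from typing import Tuple

def dedup_stats(data: bytes, block_size: int) -> Tuple[int, int]:
    """Return (total_blocks, unique_blocks) for fixed-size blocks of data."""
    seen = set()  # fingerprints of blocks already stored
    total = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        seen.add(hashlib.sha256(block).digest())
        total += 1
    return total, len(seen)

# Hypothetical data laid out as A, B, A, C in 4 KiB pieces.
a, b, c = (bytes([value]) * 4096 for value in (1, 2, 3))
data = a + b + a + c

for size in (4096, 8192):
    total, unique = dedup_stats(data, size)
    print(f"block size {size}: {total} blocks, {unique} unique")
```

With 4 KiB blocks the repeated A is found (4 blocks, 3 unique), but it costs four fingerprints to track. With 8 KiB blocks there are only two fingerprints, and the chunks AB and AC no longer match, so nothing is saved.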
and I have no idea when IPv6 support will land
Latest changelog entry from the Fedora nfs-utils package:
* Thu Jan 21 2010 Steve Dickson 1.2.1-13
- mount.nfs: Configuration file parser ignoring options
- mount.nfs: Set the default family for lookups based on defaultproto= setting
- Enabled ipv6
Damn! Party pooper!
Now, now... Before we break out the pitchforks and torches...
"People should have access to the data which you have about them. There should be a process for them to challenge any inaccuracies." -- Arthur Miller