Comment: I've wanted deduplication for a long time! (Score 4, Interesting)

And now even the next version of Windows Server will ship with integrated data deduplication! Linux devs had better get working on similar features. I still can't figure out how NTFS has supported compressing files and folders for years while Linux hasn't.

That NTFS deduplication is really interesting, actually. It isn't licensed-in technology but comes straight from Microsoft Research, and it has some clever aspects to it.

Some technical details about the deduplication process:

Microsoft Research spent two years experimenting with algorithms to find the “cheapest” one in terms of overhead. A chunk size is selected for each data set, typically between 32 KB and 128 KB, though smaller chunks can be created as well; Microsoft claims the average is about 80 KB for most real-world workloads. The system scans all the data looking for fingerprint-based split points and selects the “best” ones on the fly for each file.
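
To make the chunking idea concrete, here is a minimal sketch of generic content-defined chunking in Python. This is not Microsoft's actual algorithm; the rolling hash, the mask, and the size limits are placeholder assumptions chosen only to match the 32 KB–128 KB range mentioned above.

```python
# Sketch of content-defined chunking (illustrative, NOT Microsoft's algorithm).
# A rolling hash picks "natural" split points, so a local edit only changes
# the chunks around it instead of shifting every later chunk boundary.

MIN_CHUNK = 32 * 1024        # never split before 32 KB
MAX_CHUNK = 128 * 1024       # always split by 128 KB
MASK      = (1 << 13) - 1    # placeholder: ~1 eligible split point per 8 KB

def chunk_boundaries(data: bytes):
    """Yield (start, end) byte offsets of content-defined chunks."""
    start, h = 0, 0
    for i, b in enumerate(data):
        # Toy rolling hash; a real implementation would use something like
        # Rabin fingerprinting over a sliding window.
        h = ((h << 1) + b) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
            yield (start, i + 1)
            start, h = i + 1, 0
    if start < len(data):
        yield (start, len(data))
```

Because split points depend on the data itself rather than fixed offsets, inserting a few bytes near the start of a file leaves most of the later chunks (and their hashes) unchanged, which is what makes the dedupe effective.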

After data is deduplicated, the chunks are compressed and stored in a special “chunk store” within NTFS. This lives in the System Volume Information folder at the root of the volume, so dedupe operates per volume. The whole setup is self-describing, so a deduplicated NTFS volume can be read by another server without any external data.
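
Conceptually the chunk store behaves like a content-addressed index: each unique chunk is identified by a strong hash, stored once, and files are rewritten as lists of chunk references. The toy Python class below illustrates that idea only; it is not the on-disk NTFS format, and the class and method names are made up.

```python
import hashlib

class ChunkStore:
    """Toy content-addressed chunk store (illustrative only)."""

    def __init__(self):
        self.chunks = {}    # sha256 hex digest -> chunk bytes (stored once)
        self.refcount = {}  # sha256 hex digest -> number of references

    def add(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        if key not in self.chunks:
            self.chunks[key] = chunk            # new unique chunk
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key

    def dedup_file(self, data: bytes, boundaries) -> list:
        """Replace a file's contents with a list of chunk references."""
        return [self.add(data[s:e]) for s, e in boundaries]

    def read(self, refs) -> bytes:
        """Reassemble the original file from its chunk references."""
        return b"".join(self.chunks[r] for r in refs)
```

Two identical 10 GB VHDs fed through dedup_file would end up sharing nearly all of their chunk references, which is where the space savings come from.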

There is some redundancy in the system as well. Any chunk that is referenced more than a configurable number of times (100 by default) is kept in a second location. All deduplicated data is checksummed and proactively repaired, and the same is done for the metadata. The deduplication service includes a scrubbing job as well as a file system optimization task to keep everything running smoothly.
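
The “popular chunk” rule is easy to picture: once a chunk's reference count crosses the threshold (100 by default), a second copy is written, because losing that single chunk would otherwise damage a hundred or more files at once. A rough sketch of that policy, continuing the toy ChunkStore above (the threshold constant and function name are my own, not Microsoft's):

```python
HOT_CHUNK_THRESHOLD = 100  # default mentioned above; configurable in reality

def mirror_hot_chunks(store, mirror):
    """Keep a second copy of any chunk referenced more than the threshold.

    `store` is the toy ChunkStore above; `mirror` is any dict-like secondary
    location standing in for the real redundant copy area.
    """
    for key, count in store.refcount.items():
        if count > HOT_CHUNK_THRESHOLD and key not in mirror:
            mirror[key] = store.chunks[key]
```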

Windows 8 deduplication cooperates with other parts of the operating system. The Windows caching layer is dedupe-aware, which should greatly improve overall performance. Windows 8 also includes a new “express” compression library that Microsoft says is “20 times faster”. Files that are already compressed are not re-compressed: based on filetype, zip files, Office 2007+ documents, and the like skip the compression step and are only deduped.
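
The “don't re-compress what is already compressed” rule amounts to a filetype check before the per-chunk compression step; deduplication still applies either way. A hedged sketch, with an extension list that is purely illustrative rather than Microsoft's actual exclusion policy:

```python
import zlib

# Illustrative exclusion list only; the real policy is Microsoft's own.
ALREADY_COMPRESSED = {".zip", ".docx", ".xlsx", ".pptx", ".jpg", ".png"}

def store_chunk_data(chunk: bytes, file_ext: str) -> bytes:
    """Compress a chunk unless it came from an already-compressed file type."""
    if file_ext.lower() in ALREADY_COMPRESSED:
        return chunk              # dedupe only, skip compression
    return zlib.compress(chunk)   # stand-in for the faster "express" library
```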

New writes are not deduped – this is a post-process technology. The data deduplication service can be scheduled or can run in “background mode” and wait for idle time. I/O impact at read time ranges from “none to 2x” depending on the workload. Opening a file needs less than 3% extra I/O and can even be faster if the chunks are cached. Copying a single large file (e.g. a 10 GB VHD) can be somewhat slower because dedupe adds extra disk seeks, but multiple concurrent copies that share data can actually run faster.

The most interesting thing is that Microsoft Research says it has almost no impact on performance. So when are we going to see Linux equivalents? Linux is falling behind on these new technologies.
