These days I mostly use rsync to do hardlinked backups, which works reasonably well but has some shortcomings. For example, backups from multiple machines don't get their duplicate files hardlinked to each other, and files that are only mostly similar can't be hardlinked at all, such as files that grow over time like logs. More specifically, we have some database files that grow with yearly detail information, where everything before the newly added records is identical, so backups eat up gigs of space every day when maybe a few megs have actually changed.
Initially I liked the way BackupPC handled the situation by pooling and compressing all the files, with duplicate files from different backups automatically linked together. So I wrote a little script that duplicated the core of that functionality, hardlinking duplicate files together regardless of file stat info, running on top of fusecompress to get the compression too. The main problem is the time it takes to crawl thousands and thousands of files and relink them. On top of that, rsync won't use those deduplicated files as hardlink targets in the next backup if the stat info doesn't match (mtime/owner/etc.), so the next backup contains fresh copies of files that have to be re-hardlinked by crawling everything again. Plus you still don't get any elimination of partial-file redundancy.
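The relinking pass itself is nothing fancy; roughly something like this (a from-memory sketch, not the actual script, with made-up names): walk the backup tree, hash file contents, and hardlink anything whose contents match an earlier file, ignoring stat info entirely.

    import os, hashlib

    def file_hash(path, bufsize=1024 * 1024):
        # Hash file contents only; mtime/owner/perms are deliberately ignored
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(bufsize)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()

    def relink_duplicates(root):
        seen = {}  # content hash -> first path seen with that content
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                digest = file_hash(path)
                if digest in seen and not os.path.samefile(seen[digest], path):
                    # Same contents as an earlier file: swap this copy for a
                    # hardlink to the pooled one (link to a temp name, then rename)
                    tmp = path + '.dedup-tmp'
                    os.link(seen[digest], tmp)
                    os.rename(tmp, path)
                elif digest not in seen:
                    seen[digest] = path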
So I looked around some more for a system that would let you compress out redundant blocks, and the closest thing I could find is squashfs, but it's read-only. Which sucks, because we occasionally need to purge daily local backups to make room for newer ones. We keep the last 6 months of daily backups available on a server and do daily offsite backups from that, so once a month we delete the oldest month's backups from the local backup server. With squashfs you'd have to recreate the whole squash archive to do that, which would suck for a terabyte archive with millions of files in it.
At this point I knew what features I wanted but couldn't find anything that did them yet, so I went ahead and wrote a fuse daemon in python that handles block-level deduplication and compression at the same time. I'm still playing around with it and testing different storage ideas; it's available in git if anyone wants to take a look. You can get it by doing:
git clone http://git.hoopajoo.net/projects/fusearchive.git fusearchive
(note the above command might be mangled because of the auto-linking in slashdot, there should be no [hoopajoo.net] in the actual clone command)
Currently it uses a storage directory with two subdirectories, store/ and tree/. Inside tree/ are files that each contain a hash identifying the block list for that file's contents, so two identical files only consume the size of a hash on disk plus inodes. The hash points to the block that holds the file's data block list, which is itself a list of hashes of the data blocks. That way any files with identical blocks (on a block boundary) only spend the size of a hash on the redundant blocks. Blocks are currently 5M, which can be tuned, and they're compressed with zlib. So a bunch of small files get the benefit of compression and whole-file deduplication, while a large growing file will at most use up one extra block of data plus the hash info for the rest of the file.
So far this seems to be working pretty well. The biggest issue I have is tracking block references so a block can be freed once no file references it anymore. It works fine currently, but since each block contains its own reference counter, a crash could leave the ref counts incorrect, and unfortunately I can't think of a better, more atomic way to handle that. The other big drawback is speed: it's about 1/3 the speed of native file copying, and from profiling the code, 80-90% of the time seems to be spent passing fuse messages in the main fuse-python library, with a little time taken up by zlib and the actual file writes.
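To make the store/ and tree/ layout a bit more concrete, here's a stripped-down sketch of the write path the way I described it above: split the file into 5M blocks, compress each into store/ under its hash, store the block list itself as a block, and point the tree/ entry at the block list's hash. The function names and block-list format are made up for illustration and don't match the real fusearchive on-disk format:

    import os, zlib, hashlib

    BLOCK_SIZE = 5 * 1024 * 1024  # 5M blocks, tunable

    def put_block(store, data):
        # Store a compressed block under the hash of its (uncompressed) data;
        # identical blocks from any file land on the same path only once
        digest = hashlib.sha1(data).hexdigest()
        path = os.path.join(store, digest)
        if not os.path.exists(path):
            with open(path, 'wb') as f:
                f.write(zlib.compress(data))
        return digest

    def archive_file(storage, src, name):
        store = os.path.join(storage, 'store')
        tree = os.path.join(storage, 'tree')
        for d in (store, tree):
            if not os.path.isdir(d):
                os.makedirs(d)
        # Split the source into blocks and store each one by hash
        block_hashes = []
        with open(src, 'rb') as f:
            while True:
                chunk = f.read(BLOCK_SIZE)
                if not chunk:
                    break
                block_hashes.append(put_block(store, chunk))
        # The block list is itself stored as a block; the tree/ entry only
        # holds the hash of that list, so two identical files cost one
        # hash + inode each
        blocklist_hash = put_block(store, '\n'.join(block_hashes).encode())
        with open(os.path.join(tree, name), 'w') as f:
            f.write(blocklist_hash)

Reading back just reverses that: look up the tree/ entry, fetch and decompress the block-list block, then decompress each data block in order.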
If I could get something like that from a native filesystem that also supported journaling, so you didn't have the refcount mess, that would be pretty sweet. Plus I wouldn't have to waste time developing and supporting it :p