Re: FreeNAS (Score 1)
I'm not familiar with zfs either, but I suppose this would be a basic function of zfs interfaces, wouldn't it?
Only if it is implemented as such. It probably isn't.
Suppose you have a file that was modified once (i.e., there are two versions), on a filesystem that has seen 3 billion modifications in total, with a snapshot taken after each one. You then have 3 billion tree roots, and the leaf node for that file in each tree points to one of its two versions.
The smart way to search the trees is to compare all the roots and eliminate any that aren't different versions (there won't be any at this level, since a new root is only created for a new snapshot). Then you descend one level in each tree and eliminate everything that isn't a different version, which will eliminate something like 99% of the records at this level: if you snapshot on every single-file change, most nodes are shared between adjacent trees except the ones on the path to the one file that changed. You keep descending, eliminating shared records this way. In the end you'll have a bunch of linked lists from root to leaf, each showing the one file that changed in its snapshot (a tree with only one node per level is a linked list). Then you search the leaf nodes for the file of interest and generate a linear history for that one file. If you are targeting a particular file, you can stop your search early on most of those trees: any time you discard the parent of the desired leaf, the whole tree is out, since a branch that diverges from the file's path can never contain a change to it.
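Here's a rough Python sketch of that pruning pass over two adjacent snapshot roots. The Node class and all the names are made up for illustration; in a real COW filesystem, "same object" would be "same block pointer on disk":

    # Hypothetical in-memory stand-in for a COW snapshot tree; in the
    # real filesystem, "a is b" would be "same block pointer".
    class Node:
        def __init__(self, name, children=None, version=None):
            self.name = name
            self.children = children or {}  # name -> Node
            self.version = version          # leaf payload: a file version id

    def changed_paths(old, new, path=""):
        """Yield paths that differ between two snapshot trees, pruning
        every subtree the snapshots share."""
        if old is new:
            return  # shared record: nothing below here changed
        old_kids = old.children if old else {}
        new_kids = new.children if new else {}
        if not old_kids and not new_kids:
            yield path  # a leaf (file) whose version differs
            return
        for name in sorted(set(old_kids) | set(new_kids)):
            yield from changed_paths(old_kids.get(name),
                                     new_kids.get(name),
                                     path + "/" + name)

Run that over each adjacent pair of roots and the identity check discards every shared subtree on first touch, so each pass does work proportional to the one changed path, not to the size of the tree.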
That is still a lot of work, because there's no linkage between file versions in the data structures. A filesystem optimized for file versioning would store, with each file version, a link to the previous version, so that the history of a single file could be traversed with only one seek per version.
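As a sketch of that design (the FileVersion record and its fields are hypothetical, not anything ZFS actually stores):

    # Hypothetical on-disk record: each file version points at the
    # previous one, so a file's history is a linked list.
    class FileVersion:
        def __init__(self, data_block, prev=None):
            self.data_block = data_block
            self.prev = prev  # previous FileVersion, or None for the oldest

    def history(latest):
        """Walk newest-to-oldest: one hop (one seek on disk) per
        version, no matter how many snapshots exist."""
        v = latest
        while v is not None:
            yield v
            v = v.prev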
However, doing this from userspace is going to be even less efficient if the filesystem doesn't expose its internals. From userspace you have no visibility into which directories under a snapshot are a shared record in the filesystem, so tracing the history of a single file requires descending to that file in every single tree and doing a full comparison, which is far more operations than the optimal algorithm needs. Tracing the history of the entire tree requires comparing every file in every snapshot, an enormous number of operations even in RAM, let alone on disk.
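For concreteness, here's what the naive userspace approach looks like against ZFS's .zfs/snapshot directory (this sketch assumes snapshot names happen to sort chronologically; the hashing is forced on us because shared records aren't visible from up here):

    import hashlib
    import os

    def file_history(dataset_root, rel_path):
        """Naive userspace history: open the file in every snapshot and
        hash it, because userspace can't see which records are shared."""
        snapdir = os.path.join(dataset_root, ".zfs", "snapshot")
        versions = []
        last_digest = None
        for snap in sorted(os.listdir(snapdir)):  # assumes names sort by time
            candidate = os.path.join(snapdir, snap, rel_path)
            if not os.path.exists(candidate):
                continue
            with open(candidate, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest != last_digest:  # content changed: a new version
                versions.append((snap, digest))
            last_digest = digest
        return versions

Every snapshot costs at least a stat, and every one where the file exists costs a full read and hash: that's the per-snapshot blow-up, versus one seek per actual version in the linked design above.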
If you've ever tried to do a git log on an individual file in a very large git repository, you might have noticed this problem, and I believe git does things in the optimal way. (Git repositories are very similar to COW filesystems.) If you stuck your entire hard drive into a git repository and committed on every change, you'd quickly run into this problem.
To do this well really requires designing the feature into the filesystem, and doing it in a COW filesystem is especially challenging: if you link file versions across snapshots so that histories are easy to traverse, then removing a snapshot means cleaning up the now-broken links at the leaf level.
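Sticking with the hypothetical FileVersion chain from earlier, destroying a snapshot means splicing its version out of every affected file's chain, something like:

    def remove_version(latest, doomed):
        """Splice one version out of a file's history chain when the
        snapshot that owned it is destroyed."""
        if latest is doomed:
            return latest.prev  # the chain gets a new head
        v = latest
        while v.prev is not None:
            if v.prev is doomed:
                v.prev = doomed.prev  # bypass the deleted version
                break
            v = v.prev
        return latest

With only backward links, finding the doomed version's successor means walking from the newest version down, and a snapshot destroy has to do that for every file that changed in it. That leaf-level fix-up is exactly the cleanup cost described above.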