TTK Ciar's Journal: Time to PAUSE, and DFS Progress (of sorts)

I got a PAUSE account!

As of today I am "TTKCIAR" at pause.perl.org, a full-fledged member of the open source perl community, and capable of contributing software to CPAN, the central repository of perl modules, for others to download and use! Exciting day!

I have a bunch of code to contribute, too ... but there's one problem: most of it totally falls short of PAUSE's standards for contributed code.

Look at the KS module, for instance. KS is a repository of handy functions which I have accumulated over the last nine years. They're useful and mature functions, but scarcely documented and need to be broken out into category-specific submodules. It's not "the way" to have functions for converting Brinell Hardness units to Vickers Hardness units rubbing elbows with file-locking functions and networking functions and string manipulation functions, all in the same module. The unit conversion functions need to go into modules in the "Physics" namespace, and the network functions need to go into modules in the "Net" namespace, etc. The documentation needs to be brought up to PAUSE standards as well.
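
Just to make that concrete, here's a rough sketch of what one broken-out, documented submodule might look like. The package name, function name, and interface below are invented for illustration (the real split may look quite different); only the flock() idiom itself is standard perl.

    # Hypothetical example of a KS file-locking helper broken out into its
    # own namespaced module, with POD documentation added.
    package My::File::Locking;

    use strict;
    use warnings;
    use Fcntl qw(:flock);
    use Exporter 'import';
    our @EXPORT_OK = qw(with_locked_file);

    =head2 with_locked_file($path, $coderef)

    Opens $path read/write, takes an exclusive lock, runs $coderef with the
    filehandle, then unlocks and closes.  Returns whatever $coderef returned.

    =cut

    sub with_locked_file {
        my ($path, $code) = @_;
        open my $fh, '+<', $path or die "open $path: $!";
        flock($fh, LOCK_EX)      or die "flock $path: $!";
        my @result = $code->($fh);
        flock($fh, LOCK_UN);
        close $fh;
        return wantarray ? @result : $result[0];
    }

    1;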

It's work I knew I needed to do, but it was easy to put it off as long as I didn't have a PAUSE account. But now that I do, there's no more putting it off! I just need to find time to do it.

The first module I publish might be a relatively young one, a concurrency library called Dopkit. It's something I've been wanting to write for years, but I just finished writing and debugging it yesterday. There are many concurrency modules in CPAN already, but most of them impose considerable programming overhead and require the programmer to wrap their head around how concurrency works. Those are reasonable things to ask, but I've often thought it would be nice if it were trivially easy for the programmer to make loop iterations run in parallel, without giving up the familiar loop constructs. Dopkit ("Do Parallel Kit") provides functions that look and act like the familiar perl loop syntax -- do, for, foreach, and while -- and chops the loop into parts which execute concurrently on however many cores the machine has. The idea is to put very few demands on the programmer, who needs only to load the module, create a dopkit object, and then use dop(), forp(), foreachp(), and whilefilep() where they'd normally use do, for, foreach, and while(defined(<SOMEFILE>)). There are some limitations to the implementation, so the programmer can't use Dopkit *everywhere* they'd normally use a loop, but within its limitations Dopkit is an easy and powerful way to get code running on multiple cores fast.
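
For a sense of what that kind of interface enables, here is a minimal fork-based sketch of a foreachp()-style helper. This is not Dopkit's implementation or API -- the worker-count argument and callback interface below are assumptions for illustration -- but it shows the general "chop the loop across processes" idea.

    #!/usr/bin/perl
    # Minimal sketch of a fork-based parallel foreach.  NOT Dopkit's code --
    # just one way a foreachp()-style helper could split a list across workers.
    use strict;
    use warnings;

    sub foreachp_sketch {
        my ($nworkers, $code, @items) = @_;
        my @pids;
        for my $w (0 .. $nworkers - 1) {
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {
                # Child: handle every $nworkers-th item, starting at offset $w.
                for (my $i = $w; $i < @items; $i += $nworkers) {
                    $code->($items[$i]);
                }
                exit 0;
            }
            push @pids, $pid;
        }
        waitpid($_, 0) for @pids;    # parent waits for all workers to finish
    }

    # Example: square twenty numbers on four workers.  (Children can't write
    # back to the parent's variables, so results are just printed here.)
    foreachp_sketch(4, sub { my $n = shift; print $n * $n, "\n" }, 1 .. 20);

A real implementation also has to get results back to the parent (via pipes, shared memory, or temporary files) and cope with loop bodies that mutate shared state, which is presumably where Dopkit's documented limitations come in.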

Dopkit suffers from the same documentation deficit as KS, but at least it's already "categorized" -- as soon as I can get the documentation written, it should be published as Parallel::Dopkit. KS will take significant refactoring.

Most of the perl in my codecloset is embarrassingly primitive (I wrote most of it in my early days of perl, before I was very proficient with it), but there are a few other modules on my priority list to get into shape and publish. My "dy" utility has proven a tremendously useful tool over the years, but is in desperate need of rewriting. It started out life as a tiny throwaway script, and grew features organically without any design or regard for maintainability. I've been rewriting its functionality in two modules, FileID and DY (which should probably get renamed to something more verbose). When they're done, the "dy" utility itself should be trivially implementable as a light wrapper around these two modules. Another tool I use almost every day is "select", which is also in need of being rewritten as a module. I haven't started that one yet.

In other news I stopped dorking around with FUSE and linux drivers, and dug into the guts of my distributed filesystem project. Instead of worrying about how to make it available at the OS level for now, I've simply written a perl module for abstracting the perl filesystem API. As long as my applications use the methods from my FS::AnyFS module instead of perl's standard file functions, transitioning them from using the OS's "native" filesystems to the distributed filesystem should be seamless. This is only an interim measure. I want to make the DFS a full-fledged operating-system-level filesystem eventually, but right now that's getting in the way of development. Writing a linux filesystem driver will come later. Right now I'm quite pleased to be spending my time on the code which actually stores and retrieves file data.
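
To illustrate the abstraction-layer idea (the class and method names below are placeholders I made up, not FS::AnyFS's actual interface): application code only ever talks to a filesystem object, so pointing it at a distributed backend later should mean changing a single line.

    # Illustrative only: a trivial "native" backend wrapping perl's builtins.
    package My::FS::Native;
    use strict;
    use warnings;

    sub new { return bless {}, shift }

    sub slurp {
        my ($self, $path) = @_;
        open my $fh, '<', $path or die "open $path: $!";
        local $/;                      # slurp mode: read the whole file
        my $data = <$fh>;
        close $fh;
        return $data;
    }

    sub spew {
        my ($self, $path, $data) = @_;
        open my $fh, '>', $path or die "open $path: $!";
        print {$fh} $data;
        close $fh or die "close $path: $!";
        return 1;
    }

    package main;
    # Application code only touches $fs; swapping in a distributed backend
    # later should mean changing the constructor line alone.
    my $fs = My::FS::Native->new();
    $fs->spew('/tmp/anyfs_demo.txt', "hello\n");
    print $fs->slurp('/tmp/anyfs_demo.txt');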

Questions posted by other slashdot users focussed my attention on how I expect to distinguish my DFS from the various other distributed filesystem projects out there (like BTRFS and HadoopFS). I want it to do a few core things that others do not:

(1) I want it to utilize the Reed-Solomon algorithm so it can provide RAID5- and RAID6-like functionality. This will produce a data cluster which could lose any two or three (or however many the system administrators specify) servers without losing the ability to serve data, and without the need to store all data in triplicate or quadruplicate (a quick storage-overhead comparison follows this list). BTRFS only provides RAID0, RAID1, and RAID10 style redundancy -- if you want the ability to lose two BTRFS servers without losing the ability to serve all your data, all data has to be stored in triplicate. That is not a limitation I'm willing to tolerate. Similarly, the other distributed filesystems have "special" nodes which the entire cluster depends on. These special servers represent SPOFs -- "Single Points Of Failure". If the "master" server goes down, the entire distributed filesystem is unusable. Avoiding SPOFs is a mostly-solved problem. For many applications (such as database and web servers), IPVS and Keepalived provide both load-balancing and rapid failover capability. There's no reason not to have similar rapid failover for the "special" nodes in a distributed filesystem.

(2) I want the filesystem to be continuously available. Adding storage, replacing hardware, or allocating storage between multiple filesystem instances should not require interruption of service. This is a necessary feature if the filesystem is to be used for mission-critical applications expected to stay running 24x7. Fortunately I've done a lot of this sort of thing, and haven't needed to strain thus far to achieve it. (On a related note, I still chuckle at the memory of Brewster calling me in the middle of the night from Amsterdam in a near-panic, following The PetaBox's first deployment. The system kept connecting to The Archive's cluster in San Francisco and keeping itself in sync, and nothing Brewster could do would make it stop. The data cluster's software interpreted all of his attempts to turn the service off as "system failures", which it promptly auto-corrected and restored. It was a simple matter to tell the service to stop, but Brewster has a thing against documentation.)

(3) I want the filesystem to perform well with large numbers of small files. This is the hard part for filesystems in general, and it's something I've struggled with for years on production systems. None of the existing filesystems handle large sets of very small files very well, and most distributed storage schemes, such as RAID5, do not address the problem (and in some ways compound it -- as RAID5 arrays get larger, the minimum data that must be read/written for any operation also gets larger). In my experience, most real-life production systems have to deal with large numbers of small files. Just running stat() on a few million files is a disgustingly resource-intensive exercise. RAID1 helps, but the CPU quickly becomes the bottleneck. One of my strongest motivations for developing my own filesystem is to address this problem; I don't want to be struggling with it for the next ten years. I am tackling the problem in three ways: First, filesystem metadata is replicated across multiple nodes, for concurrent read-access. Second, filesystem metadata is stored in a much more compact format than the traditional inode permits. Many file attributes are inherited from the directory, and attribute data is only stored on a per-file basis when it differs from the directory's. This should improve utilization of main and cache memories. Third, the filesystem API provides low-level methods for performing operations on files in batches, and implementations of standard filesystem functions (such as stat()) could take advantage of these to provide superior performance. For instance, when stat() is called to return information about a file, the filesystem could provide that information for many of the files in the same directory. This information would be cached in the calling process's memory space by the library implementing stat() (with mechanisms in place for invalidating that cache should the filesystem metadata change), and subsequent calls to stat() would return locally cached information when possible (a rough sketch of this scheme appears below). This wouldn't help in all situations, but it would help when the calling application is trying to stat() all of the files in a directory hierarchy -- a common case where high performance would be appreciated.
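
To put numbers on the redundancy claim in (1): with Reed-Solomon style erasure coding, k data shards plus m parity shards survive the loss of any m shards, at a raw-storage cost of (k+m)/k. The 10+2 layout below is just an example I picked for illustration, not a parameter of the design.

    # Storage overhead: hypothetical 10+2 erasure-coded layout vs. triplication.
    my ($k, $m) = (10, 2);     # data shards, parity shards (example values)
    printf "10+2 erasure coding: %.2fx raw storage, survives %d lost servers\n",
           ($k + $m) / $k, $m;                  # 1.20x, 2 servers
    printf "triplication:        %.2fx raw storage, survives %d lost servers\n",
           3, 2;                                # 3.00x, 2 servers

And for point (3), here is a rough sketch of the sibling-caching stat() idea. dfs_stat_dir() stands in for a hypothetical low-level batch call the DFS API might provide; in this sketch it just falls back to perl's own stat() so the code runs anywhere.

    # Rough sketch of batched stat() with a per-directory metadata cache.
    use strict;
    use warnings;
    use File::Basename qw(dirname basename);

    my %stat_cache;    # directory path => { filename => [stat fields] }

    sub cached_stat {
        my ($path) = @_;
        my ($dir, $name) = (dirname($path), basename($path));
        # One batch call populates metadata for every file in the directory.
        $stat_cache{$dir} //= dfs_stat_dir($dir);
        return $stat_cache{$dir}{$name};
    }

    # Would be called when the DFS signals that a directory's metadata changed.
    sub invalidate_dir { delete $stat_cache{ $_[0] } }

    # Stand-in for the hypothetical batch API: stats each entry individually.
    sub dfs_stat_dir {
        my ($dir) = @_;
        opendir my $dh, $dir or die "opendir $dir: $!";
        my %meta = map  { $_ => [ stat("$dir/$_") ] }
                   grep { $_ ne '.' and $_ ne '..' } readdir $dh;
        closedir $dh;
        return \%meta;
    }

    # Example: a second cached_stat() call in the same directory would be
    # served from the local cache without touching the filesystem again.
    my $size = (cached_stat('/etc/hostname') // [])->[7];
    print defined $size ? "size: $size bytes\n" : "no such file\n";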

I don't know how long it will take to implement such a system. What work I've already done is satisfying, but it just scratches the surface of what needs to be done, and I can barely find time to refactor and comment my perl modules, much less spend hard hours on design work! But I'll keep at it until it's done or until the industry comes up with something which renders it moot.
