TTK Ciar's Journal: Grim Grumbles on Goods, Glance, Goons, and Goals

Whew, it's been a while since I wrote in here, eh? Well, life's been pretty full.

At The Internet Archive, I've been trying to get my latest grand project, an "Item Tracking System", off the ground and debugged for real-life deployment. It's been a difficult one. The basic idea itself is simple and fairly straightforward, but as with any project at The Archive, the quality of sheer quantity casts its own peculiar shadow over matters, demanding unusual approaches to the system's design.

At The Archive, we have two kinds of archives: Web, and Collections. The Web archive consists of thousands of .ARC, .DAT, and .CDX formatted files spread across several hundred nodes in the data clusters. Every two months or so, roughly 50 TB of additional (compressed) files are added to the collection, but otherwise they remain fairly static. The Web archives are Brad Tofel's specialty, and he is the de facto top authority on the technology used at The Archive to collect, analyze, and manipulate them.

The Collections archives are a bit less uniform. In fact, they're downright chaotic. They consist of about 120,000 "items" of various types, where an "item" is defined as a globally unique name, a directory referenced by that name, and a bunch of data files in that directory. The items are contributed to The Archive by our partner institutions and by the artists who created the content (and therefore have the right to make that content available via The Archive). There have been some attempts, lately, to define an "item" more rigorously than that, adding well-formed metadata to each item in the form of .xml files in the item directory, but all in all the item format has been highly fluid and changes rapidly. The data content also changes, since the owners of these items sometimes update them with more information, or with newly reformatted versions. Some items are music, others are "texts" (scanned or transcribed books), while others are movies, software, radio broadcasts, etc. They are spread across three data clusters, which keep each other in sync via an OAI feed which seldom works.

As the Data Repository department's designated programmer, and the technical lead of our QA efforts, I've had to deal a lot with the Collections archive, and am close to the problems we run into trying to keep the various instances of items in sync, moving them between servers (which we should never do, but do all the time anyway), handling item failure modes, etc. I became increasingly frustrated by the lack of a central database which enumerates all items, and got tired of compiling cluster-wide manifests by hand when cluster contents needed to be compared, so I made up a list of all the problems we deal with every day which such a database would solve and used it as my justification for building an Item Tracking System.

Tracey is officially the only programmer in charge of developing "infrastructure" for the Collections archive (Brewster believes in keeping a logically unified system by keeping it all in one person's head, which is "different", but totally his call to make), so this Item Tracking System is necessarily something of a skunkworks project. It will never be considered an authoritative source of information, but it should be useful to many people within the organization anyway, and can also be used to sanity-check some of the other (very buggy) systems we use to track and manipulate Collection items.

Anyway, the Item-Tracking System is very close to being finished. I am mostly squishing bugs at this point, and finding ways to make it hammer the database less severely. It uses a schema which is exquisitely suited to a distributed design, with data columns duplicated across many servers and data rows distributed between them, and I've designed such a system before, but for now I'm keeping it very simple in the interests of expediting development and deployment, and using just one database on a central server. Implementation improvements can come later. Projects at The Archive either get deployed very quickly, or they die, and I do not want this project to die. It promises to be too useful for too many people (including myself). I have deployed it across parts of our infrastructure a few times now, leaving it on for a while to observe bugs and then shutting it down to rewrite things, but I hope to turn it on and leave it on sometime next week. The longer it runs, the more changes it can catch, which will hopefully give us more insight into some of the unsolved bugs in our data cluster infrastructure. Also, there are people in the organization who want to use it for their own purposes, but are holding off until it is fully deployed and stable.

A few interesting bits of code have come out of this project which have wider application. One is an "idserver", which is a better-performing tool for acquiring unique identifiers for globally-scoped labels in a distributed environment. One conventional way to "atomically" acquire a globally unique identifier is to perform an SQL "insert" into a table of labels, let the database system allocate an identifier to it, and then "select" the inserted row back to the remote system so it can find out what the identifier is. It is important that assigning the identifier be atomic, else race conditions could result in two "versions" of the unique label having two different identifiers. The SQL insert/select method achieves this, but its overhead cost is fairly high -- two database transactions per label. The Item Tracking System needs to manage millions of unique labels (one for each item, and one for each file in each item), and when I started it up for the first time, the database was completely deluged with unique-id transactions.

The "idserver" is a more optimized way of performing the same role. It achieves atomicity of operation by being single-threaded, running accept() on one network connection at a time, assigning a new id if none exists, and responding with a tuple of the form [isnew, id, label], where "isnew" is 0 if the label already had a unique id or 1 if the idserver assigned an id during this transaction, "id" is a numeric id, and "label" is the label identified thereby. Keeping it brutally single-threaded avoids race conditions, and also made it fairly easy to write and debug (thanks, btw, to Brad Tofel for the idea, and to Odo for optimization tips). The idserver also achieves better performance by accepting a list of labels per transaction, rather than being limited to one label-id assignment per transaction. This allowed me to batch up my labels (perhaps 200 filepaths for some item), establish the TCP connection, send them over, and get back a list showing all of their ids and which ones were new (so that the client can choose, for instance, between using an "update" or an "insert" to store newly discovered information about an item or a file). With the SQL-based solution, this would have required 400 separate transactions (though only one TCP connection).

If necessary, I can also get an N-fold reduction in idserver load by running N idservers on N different machines, with each idserver being responsible for a different subset of all possible labels (perhaps taking the md5 checksum of a label, modulo N, and using the remainder as the index of the authoritative idserver). Each idserver could assign identifiers from different ranges of numeric values, (2**64)/N apart, and thus maintain global uniqueness of ids without any need for communication between idservers.
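For concreteness, here is roughly what the batched client side of that looks like. The wire format here (tab-separated labels in, one line of colon-separated isnew:id:label tuples back), the host, and the port are invented for the sketch rather than being the idserver's real protocol; shard_for() just illustrates the md5-modulo-N idea:

    #!/usr/bin/perl
    # Sketch of an idserver client: one TCP transaction per batch of labels.
    # Wire format, host, and port are invented for this example.
    use strict;
    use warnings;
    use IO::Socket::INET;
    use Digest::MD5 qw(md5_hex);

    sub get_ids {
        my ($host, $port, @labels) = @_;
        my $sock = IO::Socket::INET->new(
            PeerAddr => $host,
            PeerPort => $port,
            Proto    => 'tcp',
        ) or die "cannot reach idserver: $!";

        # Send the whole batch in one request.
        print $sock join("\t", @labels), "\n";

        # Reply: one line of tab-separated "isnew:id:label" tuples.
        my $reply = <$sock>;
        close $sock;
        chomp $reply;
        return map { [ split /:/, $_, 3 ] } split /\t/, $reply;
    }

    # Which of N idservers is authoritative for a label: md5 of the
    # label, modulo N.
    sub shard_for {
        my ($label, $n) = @_;
        return hex(substr(md5_hex($label), 0, 8)) % $n;
    }

    # Register all of an item's filepaths in one transaction, then choose
    # between an "insert" (new id) and an "update" (existing id) per file.
    my @tuples = get_ids('idserver.example.org', 7777, @ARGV);
    for my $t (@tuples) {
        my ($isnew, $id, $label) = @$t;
        print $isnew ? "insert: $id $label\n" : "update: $id $label\n";
    }

Two hundred filepaths still cost one connection and one round trip this way, which is where the win over per-label insert/select comes from.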

I have some other projects which could take advantage of idserver. It will be a pleasure to drop it in place and see improvements in performance. I'll be open-sourcing idserver soon; I've been organizing a bunch of my tools into a cohesive collection, categorizing them, and writing documentation for them. They'll appear on Sourceforge when I'm ready, and idserver will be among them.

Another interesting technology is a "universal database" interface, which I developed in reaction to the annoying way perl's DBI module for interfacing with different database systems manages its system-specific components. The idea behind DBI is beautiful -- it provides a simple API for interfacing with SQL-based databases in general. If you write your code using DBI (and if you only use generic SQL), you don't need to care which database system is being used: MySQL, mSQL, PostgreSQL, Oracle, whatever. You can switch the database your servers run from MySQL to Oracle, and your perl will still "just work" after changing only the data source ("DSN") parameter of DBI's connect() function (which can trivially be made config-file-driven). DBI handles the different protocols these databases speak by allowing for many DBD modules, where each DBD is compiled against the client-side libraries of the target database system and presents DBI with a uniform interface. There is a DBD::mysql, a DBD::Pg, a DBD::Oracle, and so on. It's lovely.
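To make the "just work" claim concrete, here is the sort of code that survives a back-end swap untouched; the DSN, credentials, and table are made up for the example:

    #!/usr/bin/perl
    # The portable part of DBI: only the DSN decides which DBD (and thus
    # which database system) gets used.  DSN, credentials, and table are
    # placeholders.
    use strict;
    use warnings;
    use DBI;

    my $dsn  = 'DBI:mysql:database=itemtracker;host=db.example.org';
    # my $dsn = 'DBI:Pg:dbname=itemtracker;host=db.example.org';  # same code, different back end

    my $dbh = DBI->connect($dsn, 'tracker', 'secret',
                           { RaiseError => 1, AutoCommit => 1 });

    my $sth = $dbh->prepare('SELECT id, name FROM items WHERE name = ?');
    $sth->execute('some-item-name');
    while (my ($id, $name) = $sth->fetchrow_array) {
        print "$id\t$name\n";
    }
    $dbh->disconnect;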

Unfortunately, it also falls apart when an organization has several different (and incompatible) versions of the same database systems running on different servers. The Archive currently uses at least three different versions of MySQL and two different versions of PostgreSQL for various things, so different versions of the client-side libraries are installed on different nodes, and my Item Tracking System needs to be able to run everywhere. These different databases are "owned" and maintained by different people, and it would be something of a major political effort to get everyone to change to a single version of MySQL and a single version of PostgreSQL (even if that were advisable; I'm personally against forcing people into homogeneity, preferring to see diverse solutions to similar problems, and it would be a waste of everyone's time and effort, and might disrupt existing services). So suddenly using DBI to interface with these different systems gets a bit more complex. If it were just a matter of making all of the Item Tracker's client-side daemons able to communicate with the one central database I'm currently using, it wouldn't be that bad. I'd "just" have to compile my chosen version of MySQL on all the different platforms in use, compile a special version of the DBD::mysql module, put it in its own special place, and use perl's "use lib" directive to make DBI ignore the system's usual DBD::mysql and use the special one. But it's more complex than that, because I designed the system with an eye towards distributing it into a hierarchy of databases (so that the European data cluster could write to a local database, which gets synchronized with the one in America periodically, rather than having to conduct every little transaction across the ocean), and because I want to be able to switch databases as needed without having to rebuild all of the client-side software everywhere. My initial deployment of the System ran on MySQL 4.0.13, which proved unstable. I'm currently giving MySQL 4.1.12 a whirl, and it seems stable so far, but I don't fully trust it yet. I anticipate trying MySQL 3.23.41 (which I *know* is stable, but doesn't support the full SQL feature set) or PostgreSQL 7.4.7, perhaps not in that order. I also anticipate running into this problem again.
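For the record, that rejected "special DBD in a special place" workaround would look roughly like this; the library path, DSN, and credentials are placeholders:

    # Make perl find a privately built DBD::mysql before the system one.
    # Path, DSN, and credentials below are placeholders.
    use lib '/usr/local/itemtracker/dbd-mysql-4.1/lib';   # searched ahead of the system @INC
    use DBI;
    my $dbh = DBI->connect('DBI:mysql:database=itemtracker;host=central-db.example.org',
                           'tracker', 'secret', { RaiseError => 1 });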

Rather than trying to figure out a way to contort DBI into using the right DBDs depending on the database version, I wrote my own UniversalDB module, which obviates the problem by using its own simple TCP-based protocol to communicate with a "universaldb" process which runs on the same server as the database system and uses DBI with the local version of the DBDs to communicate with that system. Right now UniversalDB exports its own DBI-like API, but when I have some breathing space I want to rewrite it as a DBD so it can plug into DBI itself rather than working around it. Fortunately, I wrote the Item Tracking System from the beginning using my own do-everything-right function, dbdo(), which wrapped the necessary DBI calls and added some other features, like debug logging and reconnecting when the TCP connection unexpectedly closes. All I had to do was insert a check at the top of dbdo(): if the $UDB flag (essentially a global variable) is set, it passes the SQL query to UniversalDB's do() function; if not, it runs the pre-existing DBI-using code.
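The dispatch looks roughly like this; UniversalDB's connect() and do() here paraphrase the description above rather than quoting the module's real interface, and the host, port, DSN, and credentials are placeholders:

    package ItemTracker;
    use strict;
    use warnings;
    use DBI;

    our $UDB  = 0;     # global flag: route queries through UniversalDB
    our $DBH;          # cached DBI handle
    our $UDBH;         # cached UniversalDB handle

    # dbdo(): run one SQL statement through whichever back end is selected.
    sub dbdo {
        my ($sql, @bind) = @_;
        warn "SQL: $sql\n" if $ENV{ITEMTRACKER_DEBUG};    # debug logging

        if ($UDB) {
            # Hand the query to the universaldb process over its TCP protocol.
            require UniversalDB;
            $UDBH ||= UniversalDB->connect('db.example.org', 7778);
            return $UDBH->do($sql, @bind);
        }

        # Plain DBI path, reconnecting if the connection went away.
        unless ($DBH && $DBH->ping) {
            $DBH = DBI->connect('DBI:mysql:database=itemtracker;host=db.example.org',
                                'tracker', 'secret', { RaiseError => 1, AutoCommit => 1 });
        }
        return $DBH->do($sql, undef, @bind);
    }

    1;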

I have great plans for expanding UniversalDB's functionality eventually, but right now it is enough that I can simply export UniversalDB.pm to the data clusters, run "universaldb" on the database servers, and my code will "just work" everywhere. One of the things I want to do is give UniversalDB some config-file-driven logic: rules for recognizing certain database names as distributed databases, for factoring their data rows between servers, and for fanning SQL queries out over multiple TCP connections. Then the client-side code need not even know whether the database it is using is a conventional single-system database or a distributed system. It will all "just work". I love systems that "just work".
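A hypothetical sketch of what those routing rules might look like (the names, config shape, and checksum rule are all invented for illustration):

    # Invented per-database routing rules: a database with a "shards" list
    # is distributed, rows are factored on a key column, and a query either
    # goes to the shard owning that key or fans out to every shard over
    # separate TCP connections.
    my %dbconf = (
        qalogs      => { host => 'db0.example.org:7778' },      # conventional database
        itemtracker => {
            shard_on => 'item_name',                            # column used to factor rows
            shards   => [ 'db0.example.org:7778', 'db1.example.org:7778' ],
        },
    );

    sub hosts_for {
        my ($dbname, $key) = @_;
        my $c = $dbconf{$dbname} or die "unknown database $dbname";
        return ($c->{host}) unless $c->{shards};          # single-system database
        return @{ $c->{shards} } unless defined $key;     # no key: query every shard
        return $c->{shards}[ unpack('%32C*', $key) % @{ $c->{shards} } ];
    }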

I'm eager to get the Item Tracker behind me, though, not only because I want to use it to squish some persistently annoying problems, but because I want to finish off a couple of other projects before they get stale. One such project is Glance, which I've talked about before. It's about 80% ready for the big time, but for the last few months I've just been running an older version of it on a fraction of The Archive's data cluster. Still, even that older version of the code (my new code is unstable) has been useful for detecting and diagnosing a variety of problems. Another project is qq, my remote-parallel-command execution tool, which has been increasingly used and relied on for petabox-related tasks, not only at The Archive but also at Capricorn Technologies, which sells the hardware that the Petabox runs on. But qq suffers from serious shortcomings, so I've been recoding it to use a UDP-based protocol and to borrow a few tricks from gexec to make it better. Also, a long time ago I promised the VP of Operations at The Archive that I would write a "datacenter control panel". I've been keeping notes about it, and it's always been in my head, so I'm figuring out its organization, underlying data structures, and algorithms, but I've written relatively little code so far. I intend to make good on my promise; it will just take a little time.

Hrm .. there were lots of other topics I wanted to touch on (I've been neglecting this journal considerably) but that's enough for tonight. More on "Goods, Glance, Goons, and Goals" later.

BTW, cobalt's doing much better.

-- TTK

Comments:
  • Hey, you hadn't told me about the universal DB thingie. I've been playing around with both libpq and ECPG recently. In theory, embedded SQL in C should work for any DB that has a proper preprocessor (like ecpg). Some similar kind of abstraction for perl would fuckin' rock.

    My workstation has Postgres 8.0.3 running on it. If you want to use that for extra testing, feel free.
    • Hey, you hadn't told me about the universal DB thingie.

      Yeah .. I should try to communicate better at work. I tend to only provide information on request, until I'm ready to dump a lot of information about some project on a web page or wiki or email or something (oh hey, like a slashdot journal!) :-)

      I've been playing around with both libpq and ECPG recently. In theory, embedded SQL in C should work for any DB that has a proper preprocessor (like ecpg).

      Cool! I last played with SQL embedded in C wh

