A Night With Jacob Appelbaum
Last Friday night at the Archive, co-worker Gordon Mohr brought an interesting individual
up to my workspace -- Jacob Appelbaum, a blogger who had just gotten back to the United States
after spending several months in Iraq and then New Orleans, shooting pictures and video in the
wake of Operation Iraqi Freedom and of Hurricane Katrina.
He was in a bit of a bind -- he needed to upload dozens of gigabytes of footage to a hosting
service in time to give a presentation on the material the following morning. So we used the
high-bandwidth connection here at Archive HQ to pull his one-of-a-kind footage off his travel
hard drive and fast-track it into the .us data cluster. It has been organized into five data
"items" (basic chunks of data handled by the Archive's software):
jacob_appelbaum_Iraq_Video
jacob_appelbaum_iraq
jacob_appelbaum_turkey
jacob_appelbaum_New_Orleans
jacob_appelbaum_Houston
We stayed at the office late into the night, talking about everything from his experiences in
Iraq to SSH ciphers to the impact of H.R. Giger on the fields of art, science fiction, and
transhumanism. He is quite the renaissance geek, and speaks very eloquently on a wide range of
topics. It was midnight by the time I got home, but we got all his footage hosted. It was a
very satisfying experience -- this is exactly the sort of content that The Archive exists to
archive, and why I work here.
Books Books Books!
One of the more interesting projects going on here at The Archive is the Scribe book-scanning
robot, currently in beta. It is a device which allows for the rapid and efficient scanning
of books into high-quality images. It associates metadata with each page it scans (describing
the page number and page type -- Contents, Title page, Text, Illustration, etc.) and processes the
images into a number of formats, including jpeg, djvu (which is like pdf, but better in every
way), text, and xml. The plan is to deploy Scribes in libraries across the world so that rare
and/or interesting books can be archived in perpetuity with a minimum of human effort.
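For my own notes, a per-page record conceptually looks something like the following Perl
structure (the field names here are just my shorthand, not the Scribe's actual schema):
# Hypothetical sketch of a per-page metadata record. The field names are
# my own shorthand for note-taking, not the Scribe's actual schema.
my $page = {
    leaf        => 37,               # physical leaf number, in scanning order
    page_number => '29',             # printed page number, if one is visible
    page_type   => 'Illustration',   # Contents, Title page, Text, Illustration, etc
    image       => 'leaf0037.jpg',   # the high-quality image captured for this leaf
};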
I have used the Scribe prototype to scan a few books, which I keep on my laptop so I can read
or reference them abroad. It has been extremely enjoyable to be able to read them when out and
about, and handy to be able to search them for rapid reference (one of the books is a text on
material engineering, another on automotive technology). I've been giving the Scribe developers
feedback on what works well, what could work better, and what features would be handy to have.
They seem to appreciate it.
Today before coming to work I wandered my bookshelf, pulling about 5000 pages' worth of books
I would like to also have on my laptop for quick reference and/or pleasure reading. My wife
thinks this is a great application, and gave me one of her more prized books to be scanned too.
In my spare time (what little of it there is), I've been working on software which ingests the
various data and metadata the Scribe generates and produces from it an HTML document for the book.
This would be more useful to me than the djvu or text versions, but there are a number of technical
problems to overcome before it works adequately; the Scribe's OCR is error-prone, and I am trying
to come up with heuristics for correcting misrecognized text. Also, it cannot yet reliably pick out
just the illustration from a page of mixed illustration and text for inclusion in the document.
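To give a flavor of the kind of heuristic I have in mind (this is a toy sketch, not the actual
correction code), one cheap pass is to try common character-confusion substitutions on any word
that fails a dictionary lookup:
# Toy sketch of an OCR-correction heuristic -- illustrative only, not the real code.
# Try common character confusions on words a dictionary check rejects.
my %confusions = ( 'rn' => 'm', 'vv' => 'w', '0' => 'O', '1' => 'l' );
sub try_corrections {
    my ($word, $is_known) = @_;    # $is_known is a dictionary-lookup callback
    return $word if $is_known->($word);
    for my $bad (keys %confusions) {
        (my $candidate = $word) =~ s/\Q$bad\E/$confusions{$bad}/g;
        return $candidate if $is_known->($candidate);
    }
    return $word;                  # no luck; keep the OCR's guess
}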
The fonts and layouts will be different from what's actually in the book, which is not an issue
for me. I'm pretty happy with plain ASCII text. But the audience to which the Scribe is being
pitched (librarians, historians, and archivists) cares a lot about preserving not only the fonts
and layout, but also the coloration of the pages, including blank paper, inked letters, and
illustrations. So this software is just for me; I doubt anyone else involved with the Scribe
project will be interested.
(ObCopyrightDisclaimer: The Archive stores in its data clusters only books which are
out of copyright, or books whose intellectual property owners have given permission for
their archival. The books I have scanned for my laptop are for personal use only, which is
covered under the fair use provisions of American copyright law.)
ItemTracker and UniversalDB
My ItemTracker project continues to evolve. The idserver became more trouble than it
was worth: as the database continued to grow, serving out unique ids ceased to
be the performance bottleneck. I chopped out the idserver and replaced it with equivalent SQL
transactions, reducing the complexity of the project considerably.
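The replacement is nothing fancy. Sketched from memory (this is not the literal ItemTracker code,
and the connection parameters are placeholders), it leans on an auto_increment column to hand out
ids the way the idserver used to:
# Sketch of the idserver replacement -- not the literal ItemTracker code.
# Connection parameters and column names are placeholders; an auto_increment
# id column on the items table hands out unique item ids.
use DBI;
my $dbh = DBI->connect('DBI:mysql:database=fester;host=localhost',
                       'fester', 'secret', { RaiseError => 1 });
my $item_name = 'jacob_appelbaum_iraq';
$dbh->do('INSERT INTO items (name) VALUES (?)', undef, $item_name);
my $item_id = $dbh->last_insert_id(undef, undef, 'items', 'id');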
As the range of data items being monitored by ItemTracker expanded to cover the entire .sf
and .us clusters, the performance of the database diminished. I tweaked the table indexes and
database parameters as best I could, but some of my SQL queries now take several minutes
to complete. It was clear that the system would not scale to cover all three of our
existing clusters (.sf, .us, and .eu), much less keep up as those clusters continue to expand. So I
am taking the step of distributing the database.
All of the ItemTracker code currently accesses the database through my UniversalDB module,
which makes it the obvious place to implement the distribution code as an abstraction layer. My
goals are modest to begin with -- each table's columns will be duplicated across all nodes in the
"virtual" database, with different nodes storing different rows. Since everything in the database
ultimately relates to some item, the item rows are trivially segmented by item id. UniversalDB
parses each INSERT statement for its item id and sends it only to the appropriate node for storage
(and INSERTs to tables without an item id column are duplicated onto all nodes). All other
statements (UPDATE, DELETE, SELECT, etc) are sent to each node in parallel, and the returned
data is concatenated on the client's side. This will not work for some joins, but works
perfectly for all of the SQL statements currently used by the ItemTracker system. In the future,
if/when I have time (or perhaps if someone else takes it upon themselves to do the work), more
work can be done to see if non-INSERT statements can be sent to only some subset of all data
nodes, but for now this should be plenty to get ItemTracker back on track (and boy am I getting
impatient!).
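In rough Perl, the routing decision amounts to something like this (a simplified sketch of the
idea, not UniversalDB's actual internals -- extract_item_id() and the per-node handles are
hypothetical stand-ins, and the real code maps ids to nodes through configured chunk ranges rather
than a bare modulo):
# Simplified sketch of the routing idea -- not UniversalDB's actual internals.
# extract_item_id() and the per-node handles are hypothetical stand-ins.
sub route_statement {
    my ($self, $sql) = @_;
    my @nodes = @{ $self->{nodes} };                 # one handle per real database
    if ($sql =~ /^\s*INSERT\b/i) {
        my $item_id = $self->extract_item_id($sql);  # parse out the item id, if any
        if (defined $item_id) {
            return $nodes[ $item_id % @nodes ]->do($sql);   # store on just one node
        }
        return map { $_->do($sql) } @nodes;          # no item id column: duplicate everywhere
    }
    # UPDATE / DELETE / SELECT / etc: fan out to every node, merge results client-side
    return map { $_->query($sql) } @nodes;
}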
UniversalDB's distribution function is not a one-trick pony; I'm sure I'll have uses for it in
other applications in the future, and someone else might find it handy too once it's published.
So I'm trying to do a good job of making its configuration useful in the general case -- at least
as much as I reasonably can. Its simple segmentation scheme is most useful for distributing
databases whose tables all have some common key, and much less useful outside such cases.
The abstraction is important -- I don't want to touch any of the existing ItemTracker code,
now that it's stable. Right now ItemTracker just opens a database by host and database name, and
spits SQL statements at it.
To make this "just work" with the distributed database, UniversalDB needs to know that a database
name corresponds to a "virtual" database, and not a physical database. It reads its configuration
file to know what nodes (servers) make up the database pool, and to know how to divide the tables
between them (e.g., "split fester items on id by mod 16" tells it how the table is segmented, and
"chunk fester items 12-15 ia401293.archive.org rfester" tells it that when fester.items.id modulo
16 falls in the range 12 through 15, it should send the INSERT to node ia401293.archive.org, database
"rfester" for storage). The object passed back by UniversalDB.pm's "new" method has a flag set
which indicates that it represents an interface to a virtual database, and also keeps a list of
other UniversalDB objects which each represent an interface to a real database, one for each node
in the database pool. When the user passes an INSERT to the virtual database UniversalDB object,
it passes the statement to the appropriate real database object. Other statements are passed to
each real database object in asynchronous mode, and then the virtual database object polls each
of them cyclically for data.
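Putting those two directive types together, a four-node pool for the fester database might be
configured roughly like this (ia401293.archive.org is the node from the example above; the other
hostnames are invented purely for illustration):
split fester items on id by mod 16
chunk fester items 0-3 ia401290.archive.org rfester
chunk fester items 4-7 ia401291.archive.org rfester
chunk fester items 8-11 ia401292.archive.org rfester
chunk fester items 12-15 ia401293.archive.org rfester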
Thus the details of the distribution are hidden from the "user", as long as there is an
"administrator" writing the configuration files. From the user's perspective, it's just an SQL
database. But in reality data is being stored and processed on a bunch of different machines.
It should work well -- the distributed database system I worked with at Flying Crocodile in the
late 1990s was similar, albeit much more sophisticated. Their system scaled to over 400 nodes,
with an emphasis on absolute performance. I'll be happy if mine scales to eight or sixteen,
with an emphasis on absolute stability and fault-tolerance. Several seconds (up to a minute or
so) of latency per high-level transaction is quite acceptable for my application, but it must be
robust or I will not deploy it. Silent data loss is intolerable.
UnixAdminBot, WAAG, and QQ
My unixadminbot project is looking good. Necessity is the best impetus for development, and
I have needed unixadminbot running on some of my systems to handle basic things like reporting
the local system's configuration back to a central location and watching various daemons to make
sure they stay up (rsync, ftp, mysql, and http). I have ambitious plans for expanding unixadminbot's
functionality, but with great force of self-control I am limiting myself to just that minimum
level of functionality until it is stable and complete at that level. I will release that as
its first beta, fork the codebase into stable and development versions, and check bugfixes into
both while expanding the functionality of the development version.
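The daemon-watching piece is about as simple as it sounds; conceptually it boils down to something
like the sketch below, where the pidfile paths and init scripts are illustrative guesses rather
than unixadminbot's actual configuration:
# Stripped-down sketch of the daemon-watching idea -- the paths and init
# scripts are guesses, not unixadminbot's actual configuration.
my %daemons = (
    mysql => { pidfile => '/var/run/mysqld/mysqld.pid', restart => '/etc/init.d/mysql start' },
    httpd => { pidfile => '/var/run/httpd.pid',         restart => '/etc/init.d/httpd start' },
);
for my $name (keys %daemons) {
    my $d = $daemons{$name};
    my $alive = 0;
    if (open my $fh, '<', $d->{pidfile}) {
        chomp(my $pid = <$fh>);
        close $fh;
        $alive = 1 if $pid && kill(0, $pid);   # signal 0 just tests for existence
    }
    unless ($alive) {
        warn "$name appears to be down; restarting\n";
        system($d->{restart});
    }
}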
WAAG (WAN-At-A-Glance, the project formerly known as Glance) is one of the projects which is
getting folded into unixadminbot. I've rewritten much of the old codebase and written some new
code as perl modules, and they will eventually be used by unixadminbot to provide the WAAG
system with its periodic reports (quick rehash: the object is to monitor a nontrivial cluster in
a manner similar to Nagios, but to make the user interface and underlying architecture scale
better than Nagios -- a sysadmin should be able to deploy it across thousands of nodes without
taxing the system, and know how healthy his cluster is by simply glancing at a webpage, without
having to touch the scrollbar. Right now Nagios is straining to monitor The Archive's data
cluster, and the "summarized" table of pending problems reaches several hundred rows in length.)
My "first stab" at writing WAAG has been running on a subset of our data cluster for a few
months now, and even though it only has a fraction of the functionality I want it to have, it
proves useful every day at exposing problems in the cluster. When a server is having issues,
I know it at a glance because everything (not just the summary of problems, but all statuses)
fits on the screen, even using a nice fat easy-to-read font. It's not that Nagios is incapable
of detecting these same problems, but sometimes it takes a while because it falls behind in
scheduling checks, and then the new problems are lost amidst the other problems in its display.
With WAAG, new problems jump right out and are noticeable. This means if a server is heading
for swap-death, I can catch it in time to ssh in and kill apache, or add another few gigabytes
of swapfile, or whatever. This crude implementation has been useful in teaching me how the
"real" implementation should be done. Since the configuration and scheduling functions of the
WAAG daemon and unixadminbot are so similar, and it is in my interest to have unixadminbot
running on The Archive's cluster, I am folding the "real" implementation of WAAG into
unixadminbot.
The "qq" project started life as a quick-and-dirty parallel remote execution tool, and
various people have come to rely on it. It really needs to be better, though. I've been
writing code to make it better, but have run into some of the limitations of perl (especially
the "early" versions, if you can call 5.6.1 early), which handles signals and threads ineptly
and does not give me all the control I need over memory management. As a result, the "qq2"
I've been working on is getting close to having the functionality I need, but is horribly
unstable and apt to hog a lot of memory depending on the circumstances of its use. As much as I
hate to do this to some of the people suffering under the original qq's limitations, I'm going
to have to rewrite the client half of the system in C, as "rr" (Remote Run). I think the server
half can remain in perl, as it neither relies on signals nor suffers from memory management issues,
so I am rolling that aspect of the project into unixadminbot (which is perl) as well. But writing
the client part in C will allow me to manage memory directly, use signals reliably, and use
standard pthreads with some expectation of good behavior.
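For anyone curious what the quick-and-dirty original amounts to, the basic shape of a fork-based
parallel remote runner is roughly this (an illustrative toy, not qq itself):
# Illustrative toy of a fork-based parallel remote runner -- not qq itself.
# Usage: parallel-run.pl 'some command' host1 host2 host3 ...
my $command = shift @ARGV;
my @hosts   = @ARGV;
my %kids;
for my $host (@hosts) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        exec 'ssh', $host, $command;           # child: run the command remotely
        exit 1;                                # only reached if exec fails
    }
    $kids{$pid} = $host;
}
while (%kids) {
    my $pid = waitpid(-1, 0);
    last if $pid <= 0;
    printf "%s exited with status %d\n", delete($kids{$pid}), $? >> 8;
}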
Normally I hate gargantuan monolithic do-everything projects, preferring lots of little
simple specialized tools, but rolling these projects into unixadminbot just seems like the right
thing to do, from both a technical and political perspective. With appropriate and disciplined
use of modules, I should be able to keep its complexity manageable and avoid the instability
which complexity brings. Also, nobody who runs unixadminbot will be stuck with running all of
the services unixadminbot can provide; by default unixadminbot will do nothing but sleep until
it sees a configuration file which tells it what services to provide. If people just want it
to act as an rr daemon, it will do that. If people just want it to act as a WAAG daemon, it will
do that too. If it is only told to report the system's configuration to the mothership on system
boot, it will limit itself to only that. But personally and professionally, I intend to use
unixadminbot to the limits of its capability to manage, configure, and run hassle-free,
high-availability computer clusters. When a cluster "just works", and adding new servers is as
easy as plugging them in and turning them on, then the real work can begin. I intend to
make unixadminbot handle all that crap so I won't have to. Now if only I had a mechanical robot
that could swap out faulty hard drives ... :-)
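To make the configuration-driven idea concrete, here is a purely hypothetical example of such a
file for a node that should only watch a couple of daemons and phone home at boot (the directive
names are invented; I haven't settled on the real syntax yet):
# Hypothetical unixadminbot configuration -- directive names invented for illustration.
report_config_on_boot yes
watch_daemon          mysql
watch_daemon          httpd
waag_daemon           no
rr_daemon             no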
The DR Codebase Documentation Project
In theory, everything developed here at The Archive is open-source, but in practice the
responsibility for packaging and publishing code devolves to individual engineers who have to
do it on their own time. A few of the tools I've developed have made their way out to third
party users (like qq, dy, sizerate, and doublecheck), and I would like to push out more.
There are about a dozen tools in my bin directory which I developed, some of them quite small,
which might be of wider interest. Most of them rely on a subset of the perl modules developed
by The Archive's Data Repository department (mostly by me and Brad), specifically: DR.pm,
DR::IDClient.pm, DR::Poster.pm, and DR::UniversalDB.pm (I might be adding ItemTracker.pm at some
point in the future, but not for now). There are several
other modules in DR, but I generally do not use them, and I am leaving those to Brad to package
and push out if he deems them worth his time.
Aside from the issue of documentation (some of these tools are already documented, but others
are not at all), there is also the issue of where these perl modules should live. Right now,
The Archive's legacy cluster locates them in /ia, the Petabox locates them in /petabox/sw/modules,
the systems used by Hardpoint Intelligence locate them in /usr/cluster/modules, and some oddball
systems at The Archive which aren't really integrated into any cluster have them stashed away
in /root/bin/modules, /home/search/bin/modules, /home/bill/bin/modules, or /home/ttk/bin/modules
(depending on which accounts I can access on those systems). Thus, if I want
a tool or daemon to "just work" on any of these systems without tweaking the code, I put something
like this at the top of the script:
#!/usr/bin/perl
use lib '/home/ttk/bin/modules';
use lib '/home/search/bin/modules';
use lib '/root/bin/modules';
use lib '/home/bill/bin/modules';
use lib '/usr/cluster/modules';
use lib '/petabox/sw/modules';
use lib '/ia';
use DR;
use DR::Poster;
Yuck! Obviously it would be very nice if these could go in some standard place.
After poking around some on CPAN, it looks like the right thing to do is to rename the
packages to go into Cluster::InternetArchive::*, and then use the CPAN interactive install
tool and/or Makefile.PL to install these modules into their expected place. I do not have
administrative control over the Petabox, so that might be an ongoing political issue, but it
shouldn't be too hard to get this done on all of the other systems. Then writing a script
to work everywhere would be much cleaner:
#!/usr/bin/perl
use Cluster::InternetArchive::DR;
use Cluster::InternetArchive::DR::Poster;
Looking at the perlnewmod man page, turning a perl module into an installable package doesn't
look hard at all. I just need to type it up and push it onto SourceForge. Finding the time will
be the hardest part (then again, I find the time to write journal entries once a month or so, and
this shouldn't be much harder).
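For reference, the Makefile.PL for one of these modules would be nearly trivial -- something along
these lines, where the module layout and the descriptive fields are placeholders:
# Minimal Makefile.PL sketch -- module layout and descriptive fields are placeholders.
use ExtUtils::MakeMaker;
WriteMakefile(
    NAME         => 'Cluster::InternetArchive::DR',
    VERSION_FROM => 'lib/Cluster/InternetArchive/DR.pm',
    ABSTRACT     => 'Data Repository utility modules from the Internet Archive',
    AUTHOR       => 'TTK',
);
# Then the usual: perl Makefile.PL && make && make test && make install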
-- TTK