bollacker - Slashdot User

Comment We are the Internet Archive (Score 1) 199

by bollacker on Friday February 25, 2000 @10:14AM (#1248879) Attached to: On Preservation of Digital Information

Given our organization's mandate, I thought I should throw in my $.02.
Although still ramping up and learning how to make things work, we are
trying to ARCHIVE THE ENTIRE INTERNET FOREVER. Crawling or other
forms of collection are used to download the information, and we store
everything on hard drives. We plan to have about 100 TB of HTML,
images, Usenet, streaming media, etc. within two years, and we have
some collections that reach back to 1996.

Currently, we do no backups of the hard drives, because given their
low failure rate (about 1% in our history), it's less lossy overall
to use that space for new data rather than redundancy. By the time we
reach equilibrium with the Internet so that our download rate
approaches the information generation rate of the Internet, we'll have
some sort of backup mechanism in place. Probably software RAID of
some form.
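
To put rough numbers on that trade-off, here is a back-of-the-envelope
sketch in Python. It is not the Archive's actual calculation. It takes
the 100 TB and 1% figures from the text and assumes, purely for
illustration, that drives fail independently, that a failed drive loses
everything on it, and that "redundancy" means keeping full extra copies.
The function name and its parameters are hypothetical.

```python
def expected_preserved_tb(capacity_tb, failure_rate, copies):
    """Expected unique data (TB) that survives, under simplifying assumptions."""
    # Assumption: every byte is stored `copies` times, so the amount of
    # unique data that fits shrinks by that factor.
    unique_tb = capacity_tb / copies
    # Assumption: drives fail independently, and data is lost only when
    # every copy of it fails.
    p_loss = failure_rate ** copies
    return unique_tb * (1 - p_loss)


no_backup = expected_preserved_tb(100, 0.01, copies=1)
mirrored = expected_preserved_tb(100, 0.01, copies=2)
print(f"no backup: {no_backup:.3f} TB, mirrored: {mirrored:.3f} TB")
```

Under these assumptions the unmirrored store preserves roughly 99 TB of
unique data while the mirrored one preserves roughly 50 TB, which is the
sense in which skipping backups can be "less lossy overall" while the
collection is still growing.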

As time passes, we will copy data to new media; because it is on
disk, this will be much easier than if it were on tape or printed. I
have a vision that in the long run, we may be able to use something
like an Intermemory (intermemory.org) to create a distributed
filesystem that is the storage analog to distributed.net. In an
intermemory, folks donate storage space, so that collectively, a huge
amount of capacity is available. A lot of redundancy is used so that
earthquakes, floods, govt. coups, and massive hardware failures are
still unlikely to result in data loss. As folks' PCs fail or are
upgraded, they simply plug in the new storage unit (hard drive,
holographic, etc.) and their part of the intermemory is reconstructed
(like RAID 5).
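
The "like RAID 5" reconstruction can be illustrated with single-parity
XOR. This is only a minimal Python sketch of that idea; it does not
claim to be the scheme an intermemory actually uses, and the block
contents and the xor_blocks helper are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three equal-sized data blocks, each imagined to live on a different
# donor's machine (hypothetical contents).
data = [b"html", b"imgs", b"news"]

# A fourth machine holds the parity block: the XOR of all data blocks.
parity = xor_blocks(data)

# Suppose the machine holding data[1] fails. XOR-ing the surviving
# blocks with the parity block yields the missing block again.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[1]
```

Single parity survives the loss of any one block per group. Tolerating
the larger disasters listed above would need more redundancy than this
sketch provides.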

There have also been comments about how to handle (index/search/browse)
so much data if it is all archived. This is an area of active
exploration in which we are working with research groups and others.
Generally, we've found that working with flat ASCII files and Perl
scripts is one of the few approaches that scales up to terabytes of
information on reasonably priced hardware.
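
As a rough idea of why flat files scale, a script can stream them one
line at a time and never hold more than a single record in memory. The
sketch below is in Python rather than Perl, and the tab-separated record
layout (URL, capture date, size in bytes) is an assumption made for
illustration, not the Archive's real format.

```python
import io
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(lines):
    """Count records per host while reading the input one line at a time."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Hypothetical record layout: url <TAB> capture date <TAB> size in bytes
        url, _date, _size = line.split("\t")
        counts[urlsplit(url).hostname] += 1
    return counts


# A tiny in-memory stand-in for a flat file. An open file object can be
# passed in the same way, since it also yields one line at a time.
sample = io.StringIO(
    "http://example.com/a.html\t19961105\t2048\n"
    "http://example.com/b.gif\t19970212\t512\n"
    "http://example.org/index.html\t19990630\t4096\n"
)

hosts = count_hosts(sample)
print(hosts.most_common())
```

Memory use here grows with the number of distinct hosts, not with the
size of the file, so the same loop works on inputs far larger than RAM.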

From a fanciful perspective, I see us eventually being something like
the "Library Institute" of David Brin's books, or being the digital
analog to the Library of Alexandria. As we are a non-profit, access to
our archives is freely available (see archive.org) and we
welcome a broad range of users. If you are interested in
seeing a large-scale implementation of archiving heterogeneous digital
information, check us out. As a shameless plug, we are also looking
to hire developers and researchers. What we develop is open source
and we encourage its dissemination.

Kurt Bollacker
Technical Director, The Internet Archive. (www.archive.org)
