Interview with Brewster Kahle 196
Netmonger writes "A fascinating interview with the man behind The Wayback Machine. Some specs from the article: 'It's 150-odd standard PC cases, with four drives in each... Over 100 terabytes... As plain text in book form, that'd be over 3000 miles of shelf space.' All I can say is... Wow!"
How many (Score:4, Funny)
According to... (Score:4, Informative)
-Cyc
Re:How many (Score:2)
5.6603773584905660377358490566038 LOCs
"That's X Pages!" analogies are silly. (Score:2, Interesting)
There are at least two problems with such analogies:
1) People use them to comment on the marvelous efficiency of technology - but in reality, it's only a comment on the hideous inefficiency of print. It doesn't say much at all about technology. It might be useful to convince people to digitize/OCR their printed matter - but is anyone *not* doing this? Even the Library of Congress is scanning its texts now.
2) In this case it's a particularly bad analogy, because it assumes that all data is printed as hex. Example: images, which are obviously a huge, huge chunk of the Wayback archive. Virtually all website images are small enough to print on a printed page at full resolution. But consider a 500x500-pixel image, at 16 bits (2 bytes per pixel, 2 chars to represent each byte)... that's 1,000,000 characters, or 1,000 pages!
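The arithmetic in point 2 can be checked with a short sketch. The 1,000-characters-per-page figure is an assumption implied by the comment's own numbers (1,000,000 characters becoming 1,000 pages); it is not stated in the article.

```python
# Rough check of the "image printed as hex" estimate.
# CHARS_PER_PAGE is an assumption inferred from the comment, not a quoted figure.
HEX_CHARS_PER_BYTE = 2
CHARS_PER_PAGE = 1_000


def hex_pages(width, height, bytes_per_pixel, chars_per_page=CHARS_PER_PAGE):
    """Return (characters, pages) needed to print an image as hex digits."""
    chars = width * height * bytes_per_pixel * HEX_CHARS_PER_BYTE
    pages = chars // chars_per_page
    return chars, pages


chars, pages = hex_pages(500, 500, 2)
print(f"{chars:,} characters, about {pages:,} pages")
```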
Basically the analogy is good for wildly inflating some numbers to stun the 0.00001% of the population that doesn't already realize these things.
- David Stein
Re:"That's X Pages!" analogies are silly. (Score:2)
Print is inefficient because it's a waste of physical space. In terms of information density, it's hard to come up with a more inefficient use of space than setting 12 point text. I can't speak for Mr. Clavin, but I'd use something more efficient like, oh, say, a hard drive with compressed text and images.
Radical, yes I know... But we gotta move out of the 19th century some time. The 21st is as good a time as any.
Re:How many (Score:2)
Re:How many (Score:2)
Let's use standard units here, people!
YEAH! What is it in FPS?
Re:How many (Score:2)
Re:How many....answer (Score:2, Funny)
Wow! (Score:1)
Re:Wow! (Score:3, Funny)
A lot of internet information is crap... (Score:2, Interesting)
Re:A lot of internet information is crap... (Score:5, Interesting)
For example, a couple of centuries ago, old household accounts would have been considered valueless. But today's historians find a wealth of social data in them - what did people eat? How much did they get paid? Did families tend to enter service together? How often did servants get new clothes?
Disc space is cheap. Keep everything, let future historians sort it out.
Re:A lot of internet information is crap... (Score:2, Insightful)
Ah, the legacies we'll leave... based on YOUR e-mail, what will YOU be remembered as?
Re:A lot of internet information is crap... (Score:2)
There. Now this will get archived in a few weeks, and will cancel out your worries. Move along
Re:A lot of internet information is crap... (Score:4, Insightful)
Doing historical research is fun b/c you get to get your hands dirty (literally). I spent 6 hours a day for three weeks researching crime rates in Toledo, OH during Prohibition (before, during, and after), and b/c the books were all handwritten and so old, my hands turned black for days at a time...
It would have been MUCH easier if all the information had been sorted and easy to find. I guess that would make future historians' jobs easier, but what fun would that be?
Just my worthless
Re:A lot of internet information is crap... (Score:3, Insightful)
I think it's more that it will be different people. Understanding most of history is constrained by the lack of data about that time. Our age is precisely the opposite. We try to save EVERYTHING we can possibly afford--because we know that crap will be valuable to many people later on. For the next century's historians it will be about data sampling and extracting the gold nuggets from all the crap we have saved.
It will be the folks who built Google, not the current type of folks.
That said, it's better to have too much than too little.
Re:A lot of internet information is crap... (Score:2)
I work w/ records that are 30+ years old. They are on microfiche (sometimes really bad copies).
They will have the tools and they will know how to use them, believe me.
Re:A lot of internet information is crap... (Score:1, Redundant)
Picking and choosing what goes into the archive does not solve the purpose of the archive in any way.
Like all those crappy old buildings... (Score:3, Insightful)
And of course, you're going to decide what is "good" and what "isn't?" He is providing the resource for, among other things, scholarly researchers. Of what use is the data if it has been hand edited according to one person's aesthetics or another's?
Indeed, your comment reminds me of one that was heard quite often, shortly before beautiful and irreplaceable old buildings were razed to make way for a new strip mall, or, in downtown Chicago, a couple of new government buildings whose architectural style is best described as "Federal Drab." Preserving as much as possible is a good thing, because none of us can tell what will be valuable, and what will not, in another 20 or 30 years, and no one's aesthetic should be dictating such a decision to entire generations to come.
Re:A lot of internet information is crap... (Score:2)
But I went to the wayback machine and checked it out. It was cool to see how far I've come in my web design skills. Now I can show my friends the very first web site I ever made, and crap or not, it's there on the wayback machine and brings with it a lot of nostalgia. That's just the way sentimental crap goes: no matter how ugly or whatever, just the fact that you can go back and look at it makes it "cool".
Re:A lot of internet information is crap... (Score:2)
Who's to say what's crap and what's not? (Score:1)
Re:A lot of internet information is crap... (Score:2)
One man's crap is another's holy grail,
what you think is interesting for another may be pale,
For example this limerick for you is shit,
But I would be happy if it gets in the archive bit.
Re:A lot of internet information is crap... (Score:2)
Re:A lot of internet information is crap... (Score:1, Funny)
Fuck shit ass hell damn poo
A cock and nutsack
And boobs and asscrack
Fuckery duckery doo
Re:A lot of internet information is crap... (Score:2, Insightful)
In other words, perspective and context are a huge part of determining value and meaning. At some point these annoying popup ads may be important for someone studying the evolution of advertising on the net. In fact, popups, or the frequency or timing of them, might be something that's missing from the archive.
Most of the culture is invisible to most of us most of the time. The things we take for granted are the most ingrained into us, and possibly the most interesting to someone after the culture has changed.
Re:A lot of internet information is crap... (Score:4, Funny)
I mean, hell, forget pr0n, just imagine the blackmail value for the kids of 2020, to be able to dig up pictures of their parents on amihotornot.
Re:A lot of internet information is crap... (Score:3)
Identifying "good stuff" is very hard and certainly not something that can be automated. Furthermore, "good stuff" is in the eye of the beholder. Perhaps Jane's web page dedicated to her kittens is useless to almost everyone in the world. However, to Jane's great-great granddaughter who hasn't been born yet, it might provide a fascinating look into her own past. A historian a hundred years from now analyzing the first twenty years of the web would certainly want to know that porn and popups were so pervasive.
What a great way... (Score:2, Funny)
Transient Moments (Score:5, Interesting)
e.g. the Ded Kitty picture we put up when Napster shut down at the start of September: it was only there for a few hours, but it will be lost.
Of course, some of the more interesting transient events are websites that are hacked, but there exist dedicated archives for this kind of event, so you can relive the hilarity of RIAA.org being repeatedly defaced.
Re:Transient Moments (Score:1)
Maybe we can help them to get this info for us? (Score:2, Interesting)
Robots.txt - That was how the RIAA was hacked (Score:3, Interesting)
http://www.zone-h.org/en/news/read/id=894/
Not just Transient (Score:1)
Re:Transient Moments (Score:2)
I also wonder how well it does with Flash or other multimedia. I don't care as much about it not crawling commercial sites. There are much bigger copyright issues with that. Plus, let's be honest, most of those sites are just porn anyway. I don't think we need a historic archive of porn sites.
My big question though is whether they back up their data regularly. After all, even hard drives wear out...
stupid Joe Six-Pack metaphors (Score:4, Funny)
"As plain text in book form, that'd be over 3000 miles of shelf space.."
Huh? How about "If all data was spoken at once, it would be as loud as 674 jet engines!" Or "If this archive were a planet, it would be as large as Jupiter!"
Re:stupid Joe Six-Pack metaphors (Score:1)
Re:stupid Joe Six-Pack metaphors (Score:1, Funny)
What is that, Pacman's system of measuring storage?
Re:stupid Joe Six-Pack metaphors (Score:2)
Would the Large Type Edition be [cue Mr. Evil voice]:
One Million Miles!?!
Re:stupid Joe Six-Pack metaphors (Score:2, Funny)
Re:stupid Joe Six-Pack metaphors (Score:1)
Wait wait (Score:2, Insightful)
Move over Borges (Score:2, Interesting)
The universe (which others call the Library) is composed of an indefinite and perhaps infinite number of hexagonal galleries, with vast air shafts between, surrounded by very low railings.
Looks like he wasn't too far off...
Well, maybe not...
Is this thing backed up? (Score:3, Funny)
Re:Is this thing backed up? (Score:1, Informative)
Kahle? (Score:3, Funny)
On a related note, look up the Long Now Foundation (Score:3, Interesting)
100 Terabytes! (Score:3, Informative)
Re:100 Terabytes! (Score:3, Informative)
Obviously they're using IDE drives. Modern ones. And they must have replaced almost everything at once -- there could be a mixture of 200 GB and 120 GB drives, but it would have to be mostly 200 GB drives.
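A back-of-the-envelope sketch of that reasoning, using the summary's round figures (150 cases, four drives each, 100 TB). Treating 1 TB as 1,000 GB and the totals as exact are assumptions; the article only says "150-odd" and "over 100 terabytes".

```python
# Estimate the average drive size and the 200 GB / 120 GB mix.
# Round figures from the summary; exact totals are an assumption.
CASES = 150
DRIVES_PER_CASE = 4
TOTAL_GB = 100 * 1_000  # 100 TB, decimal


def large_drive_count(total_gb, n_drives, small_gb, large_gb):
    """Solve large_gb*x + small_gb*(n_drives - x) = total_gb for x."""
    return (total_gb - small_gb * n_drives) / (large_gb - small_gb)


n_drives = CASES * DRIVES_PER_CASE
average_gb = TOTAL_GB / n_drives
n_large = large_drive_count(TOTAL_GB, n_drives, small_gb=120, large_gb=200)

print(f"{n_drives} drives, average {average_gb:.1f} GB each")
print(f"about {n_large:.0f} of them would need to be 200 GB drives")
```

Under these assumptions a bit more than half the drives would have to be the 200 GB model, and more than that if the archive is really "over" 100 TB.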
Pretty neat, but still doesn't hold a candle to Google's massive setup :)
(Google must have a *team* of people whose sole job is finding failed computers/drives and replacing them :)
Re:100 Terabytes! (Score:2, Funny)
Re:100 Terabytes! (Score:1)
Wait, do they mean 100 trillion bytes, or 100 * 2**40 bytes? That's how these sneaky hard drive manufacturers get you!
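For what it's worth, the gap between the two readings is easy to compute. This is only a sketch of the two definitions; the article doesn't say which one is meant.

```python
# Compare 100 "terabytes" under the decimal and binary definitions.
DECIMAL_TB = 10**12  # what drive manufacturers count
BINARY_TB = 2**40    # what most operating systems report

decimal_total = 100 * DECIMAL_TB
binary_total = 100 * BINARY_TB
gap = binary_total - decimal_total
gap_percent = 100 * gap / decimal_total

print(f"decimal: {decimal_total:,} bytes")
print(f"binary:  {binary_total:,} bytes")
print(f"gap:     {gap:,} bytes ({gap_percent:.2f}% of the decimal figure)")
```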
Re:100 Terabytes! (Score:2)
Re:100 Terabytes! (Score:2)
Plus they have 3 separate facilities, I'm assuming each is a "complete set".
I personally have two 120 GB drives, which is way more than I need, but hey, I got a deal on 'em, so I bought two so one could be a backup (I put it in a separate machine, albeit at the same location). rsync is a wonderful tool.
Re:100 Terabytes! (Score:3, Funny)
Obligatory (Score:1, Offtopic)
I don't understand terabytes.... (Score:4, Funny)
I don't understand terabytes or the shelf space analogy...
I need to know how many bananas.
nbfn
Re:I don't understand terabytes.... (Score:2)
Re:I don't understand terabytes.... (Score:2)
Re:I don't understand terabytes.... (Score:4, Funny)
So, how many bananas would it take to feed all the monkeys needed to store the data? Monkeys aren't that smart, so let's approximate that each monkey can hold 4k worth of data.
100 TB = 100 * 1024 * 1024 * 1024 KB = 107374182400 KB
107374182400 KB / 4 = 26843545600 monkeys
Now we'd want redundancy, so let's have triplicate monkeys for all our data, in case one dies, or runs away, or simply forgets.
26843545600 * 3 = 80530636800 monkeys
But now we want to figure out how many bananas they're gonna eat; let's say 5 bananas a day per monkey?
80530636800 * 5 = 402653184000 bananas to feed all monkeys per day
402653184000 * 365 = 146968412160000 bananas to feed all monkeys per year
146,968,412,160,000 or 146 trillion bananas per year, which is probably just slightly over the national debt.
Overall, I think your method of using bananas to store all this data is quite ridiculous. The latency and dataloss would be unbearable. Plus think of all the poop these monkeys would create, and you'd NEVER be able to get PETA off your back.
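For anyone who wants to rerun the monkey math with different assumptions, here is a small sketch. It uses the same figures as the post (4 KB per monkey, triplicate redundancy, 5 bananas a day) and reads 1 TB as 1024^3 KB; the function name is just for illustration.

```python
# Reproduce the banana budget for a monkey-based storage array.
KB_PER_TB = 1024**3  # binary terabyte, as in the post


def banana_budget(terabytes, kb_per_monkey=4, replicas=3,
                  bananas_per_day=5, days_per_year=365):
    """Return (monkeys, bananas_per_day, bananas_per_year)."""
    total_kb = terabytes * KB_PER_TB
    monkeys = (total_kb // kb_per_monkey) * replicas
    per_day = monkeys * bananas_per_day
    per_year = per_day * days_per_year
    return monkeys, per_day, per_year


monkeys, per_day, per_year = banana_budget(100)
print(f"{monkeys:,} monkeys")
print(f"{per_day:,} bananas per day")
print(f"{per_year:,} bananas per year")
```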
Re:I don't understand terabytes.... (Score:2)
Although they have a much larger footprint than monkeys, I'm told elephants have better data retention characteristics and the peanut has a much smaller form factor than the banana.
Wayback technology (Score:5, Informative)
Sounds good... (Score:2, Insightful)
Oh, I forget that honor is dead on the internet.
Re:Sounds good... (Score:2)
Scientology-censored web history (Score:2)
A site I run (sniggle.net [sniggle.net] - formerly found at syntac.net [syntac.net]) was removed from the wayback machine when the church of Scientology complained about an image of L. Ron Hubbard on one of the site's pages.
Now, not only all of the pages on my site, but all of the pages at syntac.net have vanished from the wayback machine.
Oh yeah, and they can't be found at Bibliotheca Alexandrina either, so that's no solution.
Brewster's going to have to turn down his rhetoric about the wayback machine a bit until he gets the resources to fight back. Otherwise people might get the impression that he really is keeping the history of the web, even the parts of the web that entities like the church of Scientology don't like, alive.
Silly Me! (Score:3, Funny)
But seriously, unless you know about this project, and the fact that you can ask to remove data from the archives (though there's no reference as to how to actually do it), it means that your Internet past can haunt you forever.
Or at least until simultaneous attacks occur on Cairo and San Francisco...
OH CRAP!!!! (Score:1)
Damn! Now I'm really interested in how to remove stuff from their archive!
Another site, with pics (Score:5, Informative)
In the interest of full disclosure, I wrote it, so be gentle.
Re:Another site, with pics (Score:1)
Picture of a Picture (Score:4, Funny)
Re:Picture of a Picture (Score:2)
The first *working* link from http://archive.org
October 11th 1997 [archive.org]
Notice how it says The Archive will provide historians, researchers, scholars, and others access to this vast collection of data (reaching ten terabytes), and ensure the longevity of this information.
Oh how the times have changed.
BTW: Considering the importance of the archive, be gentle! Slashdotting archive.org == bad!
See also (Score:4, Informative)
Re:See also (Score:3, Informative)
Feed magazine interview [archive.org], back from the grave...
Odd, no copyright questions (Score:5, Insightful)
Surely they must know they're treading on untested legal ground. All it might take is one offended copyright holder to bring the whole thing to its knees. Basing it in a country other than the USA might have been smarter, then, given the existence of laws like the DMCA which could serve to shut the site down.
Re:Odd, no copyright questions (Score:3, Interesting)
Re:Odd, no copyright questions (Score:2)
Does this mean that he has the Godfather send Guido around to take them out, or does he merely mean that he removes their data from the data base? For the sake of future social scientists, I sort of hope it's the first choice.
Re:Odd, no copyright questions (Score:2)
"Brewster says you want out. He also says nobody goes against the archive..."
Re:Odd, no copyright questions (Score:3, Interesting)
"It is easier to ask for forgiveness than permission."
Have you been to the site? (Score:2)
10 March 2001
The Internet Archive respects the intellectual property rights and other proprietary rights of others. The Internet Archive may, in appropriate circumstances and at its discretion, remove certain content or disable access to content that appears to infringe the copyright or other intellectual property rights of others. If you believe that your copyright has been violated by material available through the Internet Archive, please provide the Internet Archive Copyright Agent with the following information:
Identification of the copyrighted work that you claim has been infringed;
An exact description of where the material about which you complain is located within the Internet Archive collections;
Your address, telephone number, and email address;
A statement by you that you have a good-faith belief that the disputed use is not authorized by the copyright owner, its agent, or the law;
A statement by you, made under penalty of perjury, that the above information in your notice is accurate and that you are the owner of the copyright interest involved or are authorized to act on behalf of that owner; and
Your electronic or physical signature.
True story and a small thanks.... (Score:3, Interesting)
Why only four? (Score:4, Insightful)
With a simple $10 PCI IDE card (per additional 4), you could have gotten at *least* 8 drives, possibly as many as 16, per case. Granted, not many cases will let you *mount* that many, but I would expect paying a few bucks extra for the IDE cards and a better case would save quite a bit of money (and physical space) by halving or quartering the number of PCs you need ($100 extra to save $1500 per $2000, not counting the drives themselves?).
88lf of machines vs 22lf. One requires an entire room; the other would fit on a standard-sized 3-or-4-tier storage rack. Of course, speaking of racks (of a different sort)... What on earth made you go with an array of standard PCs rather than a RAID-in-a-rack?
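A quick sketch of the "halving or quartering" claim, using the summary's 150 cases with four drives each. It only counts cases; whether a given case, power supply, or IDE bus can actually handle that many drives is a separate question and not modeled here.

```python
import math

# How many PCs are needed for the same drive count at higher density?
TOTAL_DRIVES = 150 * 4  # from the article summary


def cases_needed(total_drives, drives_per_case):
    """Number of cases required, rounding up for a partly filled case."""
    return math.ceil(total_drives / drives_per_case)


for per_case in (4, 8, 16):
    print(f"{per_case:>2} drives per case -> {cases_needed(TOTAL_DRIVES, per_case)} cases")
```

The 88lf vs 22lf comparison corresponds to the 16-drive case, i.e. a quarter of the machines.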
Re:Why only four? (Score:1)
Re:Why only four? (Score:2)
Re:Why only four? (Score:5, Informative)
I know, I have a fileserver at home that has this exact problem, but I don't care if my fileserver is slow so it's not a problem.
Why 150 PCs? (Score:2)
Something like this [ebay.com].
Would give you far better disk performance and scalability than trying to add another 200 PCs with IDE disks.
Archive architecture (Score:3, Informative)
The Archive's first storage device (circa 1996) was a large StorageTek tape robot with a multi-gigabyte disk cache to handle user requests for archived pages. As drives and processors became cheaper, it became more interesting to use them instead of tape. The cost penalty of using drives over tape is only 2x - 3x, with the enormous win of increased bandwidth and decreased latency (when the request queue for the bot got large, the wait time for a page could be 16 hours. With disk, it's a fraction of a second).
The first hard-drive based Archive storage used multiple 4U and 5U 12-20 drive Linux/FreeBSD boxes with ~80G IDE drives and Promise cards.
Drive density is greater now - you can get 200G IDE drives and 320G IDEs are on the way, so you can use regular PCs as opposed to custom or niche-market (rackable server) boxes.
--Pat / zippy@cs.brandeis.edu
Fascinating? (Score:1)
Ya But... (Score:2)
Netscape Behind? (Score:1)
http://web.archive.org/web/19971221012817/http://
Vaguely uncomfortable (Score:1, Interesting)
If there is a way to permanently erase pages from the archive, I would be a little less worried. But I can never tell if they let you delete stuff, or just "block" it. "Blocking" is crap; we all know what that will be worth if somebody really wants the info someday and knows the Archive has it.
Re:Vaguely uncomfortable (Score:3, Interesting)
If you put something on the web, you have put it up for the world to see. The whole point of putting information on the web is making that information available to lots of people.
What the Internet Archive is doing is no different than libraries storing old copies of newspapers and magazines. With an increasing amount of things being published online, we need an archive of those things.
Years from now archives of web pages will be quite useful for those doing research on the events of today.
Say you are a student in the year 2050 and are doing a report on the "history of the web." Wouldn't it be nice to have copies of the web pages from the 1990s to show what the "early" web looked like?
Typical Question... (Score:1)
"Yeah, but can it make coffee?"
Response being:
Of course it can! But since it's the Wayback Machine, it's yesterday's coffee... old, cold, and slightly burnt (but when you gotta, you gotta)...
Underfunded. (Score:1)
Vannevar Bush (Score:3, Informative)
Those who find this subject interesting, but who may not be familiar with Vannevar Bush's work, might want to read the paper to which Brewster Kahle refers [theatlantic.com].
You mean Mr. Kahle is Jay Ward's father? (Score:2)
The Wayback machine is a lie (Score:5, Insightful)
I have also personally run a website which contained fairly controversial material (based on this story [projectcensored.org]) that I saw listed on their website and then removed shortly thereafter. Tell me, why would a service like this ever have occasion to remove material once it's been archived, especially if there are *NO* copyright issues and the webmaster of the archived site never asked them to remove it?
The answer is simple: the powers-that-be saw how dangerous it was to make all this information available to anyone on demand so they took control. It would be a great service were it allowed to operate unfettered, but the reality is quite different.
And I'm the first to mention this here so far? You should all be modded down -1 for naiveté.
Re:The Wayback machine is a lie (Score:2, Informative)
And I'm the first to mention this here so far? You should all be modded down -1 for naiveté.
Hm. And yet the WayBack Machine has the Project Censored page here [archive.org], and even the AlterNet story [archive.org] linked therein. Ah, but yes, it must be a conspiracy by the Big Eye In The Pyramid -- someone call Hagbard Celine [rawilson.com]. Fnord.
-1, Delusional.
Any bets.... (Score:5, Interesting)
I have a feeling we are either going to have to become way more forgiving, or we're going to be stuck with only faceless boring types with no opinions as our leaders (no wisecracks, it could be much worse than it is now).
Great archive (Score:2)
With quality website snapshots like this, I can see how it will be a great resource for future historians!
Oooh! Localhost! (Score:2)
The most interesting of these is the one from October 19th 2000. [archive.org] See for yourself!