How the Wayback Machine Works

Posted by timothy on Wednesday January 23, 2002 @09:13AM from the very-big-hard-drive dept.

tregoweth writes: "O'Reilly has an interview with Brewster Kahle about how The Internet Archive's Wayback Machine works, with lots of juicy details about how the biggest database ever built works."

This discussion has been archived. No new comments can be posted.

Google? (Score:4, Interesting)

by kenneth_martens ( 320269 ) writes: on Wednesday January 23, 2002 @09:25AM (#2887393)

It's an interesting idea, but the real problem is not storing the 100 TB of data, it's figuring out how to search through it to find what you're looking for. Now, apparently they write a lot of their own software, but it might be better if they could team up with Google and have Google index their sites on a special database. We'd have www.google.com for regular searches, and wayback.google.com for the Wayback Machine's sites.

Something else I found interesting: according to the article, they "use as much open source software as [they] can." That makes sense when they've got between 300 and 400 computers, and with the number growing all the time. Licensing all those with a non-open OS would be quite expensive.

- Noooooooooo !!! (Score:5, Funny)
  
  by morzel ( 62033 ) writes: on Wednesday January 23, 2002 @09:53AM (#2887489)
  
  Please please please please do _NOT_ Google it... It was embarrassing enough when Google acquired Dejanews, and put the old Usenet archives online. :-)
  I just visited some sites that I had hoped had disappeared completely from cyberspace. The only defense I've got now is the old cryptic URLs of these monstrosities... Indexing that database would be a disaster, especially with an unusual name like mine...
  (Yes, I was stupid enough to use my real name ;-)
  Damn you, wayback :p
  
  - Re:Noooooooooo !!! (Score:2, Informative)
    
    by ichimunki ( 194887 ) writes:
    
    Important to note that they will allow you to "opt out" by using a robots.txt file (not sure what you do if the domain is no longer available).
    
    Funny part is, they may not have to allow this, except out of courtesy. Apparently libraries such as this can get away with all kinds of stuff that, if done by private individuals with any kind of profit motive, would normally constitute serious copyright violations. (see http://www.loc.gov/copyright/circs/circ21.pdf for information).
  - The shame of one's past (Score:1)
    
    by sunhou ( 238795 ) writes:
    
    You know, for sites like the Wayback Machine, Google Groups, etc., it would be nice to have another option. Rather than contacting them and saying "please remove this old embarrassing crap of mine from your database," it would be great if one could instead tell them "please suppress this stuff for the next 60 years" (and have the request carry forward to whoever inherits the project). That way, people (historians) could still see all my old stupid stuff, but not until I'm dead or too senile to care.
- Re:Google? (Score:1)
  
  by osolemirnix ( 107029 ) writes:
  
  Even better, they should use Google to figure out what to cache in the first place.
  The way it works right now, it's more like a mind-numbingly dumb brute-force cache (I found some of my own irrelevant web pages from years ago, for example).
  While there may be arguments in favour of saving everything and storage may be cheap, why not use Google's ratings to save more relevant/interesting info and ignore the crap...
  - Re:Google? (Score:1)
    
    by tregoweth ( 13591 ) writes:
    
    In future years, it may be the crap that people are interested in. For example, researchers looking at old newspapers get a lot of useful information from advertisements, but they wouldn't seem particularly useful at first glance.
- Re:Google? (Score:1)
  
  by Toliaro ( 411158 ) writes:
  
  Yes, searching it is definitely an issue, but I think storing it is also a significant issue. Why did they have to put something important and cool in a proven earthquake area?
  
  At least the last time we lost the most important collection of knowledge in human history, some human intervention was required--the burning of the library at Alexandria.
  
  We're collecting a lot of knowledge, but we seem to be forgetting some relevant history. (At least we're avoiding the other California disasters: wildfires, pestilence, mud slides...)
Successfully crashed (Score:3, Funny)

by SilentChris ( 452960 ) writes: on Wednesday January 23, 2002 @09:27AM (#2887401) Homepage

Ok, we have successfully Slashdotted the Wayback Machine. Screw history! :) Let's move on to bigger and better things.

Not very way back! (Score:1)

by webword ( 82711 ) writes:

Wayback Slashdot [archive.org] ...only goes back to 2000? Seems kind of lame when you consider that my little web site goes back to 1998 [archive.org].
- Re:Not very way back! (Score:3, Informative)
  
  by tom.allender ( 217176 ) writes:
  
  Wayback Slashdot ...only goes back to 2000?
  
  Wayback slashdot.org [archive.org] goes back to 1997...
  - Re:Not very way back! (Score:1)
    
    by sunhou ( 238795 ) writes:
    
    Great, now you're slashdotting pages from the past. Even more ironic, you're slashdotting slashdot itself from 5 years ago. This is why many civilizations ban time travel.
  - Re:Not very way back! (Score:2)
    
    by PurpleBob ( 63566 ) writes:
    
    You'll notice that Slashdot's logo used to be really big [archive.org]...
  - - Re:Not very way back! (Score:1)
      
      by tom.allender ( 217176 ) writes:
      
      The images are references to slashdot itself. The files have been moved since '97, I expect. This thing's gonna generate a lot of 404s for everyone.
      - my favorite wayback slashdot story so far... (Score:2, Interesting)
        
        by CrudPuppy ( 33870 ) writes:
        
        "So IBM announces a 25 gig hard drive... does the world need this yet? Unless this is in a RAID, would you really want to trust 25 gigs on a single drive? What would you use this for? 400+ hours of MP3s comes to mind... "
        
        mind you, this was only a couple short years ago, and now I'm writing this from a PC with three 80 giggers.
        
        i thought we geeks were supposed to have more foresight than this? *grin*
        
        Storage sizes (Score:1)
        
        by shamino0 ( 551710 ) writes:
        
        The humungous size of storage, and the exponential growth rate of drives has never ceased to amaze me.
        
        In 1980, I was using a TRS-80 with a cassette tape interface. Not a lot of storage, unreliable and quite slow.
        
        In 1983, I started using an Apple-II with its 140K floppy drive. I went through all of high school keeping everything I ever wrote on only 20 disks.
        
        In 1987, the computers in college had 20M hard drives. One machine I had access to had a 40M drive, and there was almost never a shortage of space.
        
        In 1988, I got my first 1.44M floppy drive, and found that it took a really long time to fill them. (I was working with 360K floppies until then.)
        
        In 1989, I got my first hard drive - an 80M model, which blew away all the machines that the school was providing in the labs.
        
        In 1991, I got my first 1G drive and couldn't imagine ever filling it. Until I started getting OSs like Windows and apps like Office.
        
        Today, 40G drives are pretty much generic standard issue, and 100G drives aren't terribly expensive either.
        
        Today, I can't imagine needing that much storage (Right now, my home system has about 5G of stuff (on 11G worth of media), which includes three operating systems and an installation of MS Office.) But I'm sure the need for it will arise soon, and even bigger drives will become cheap and popular.
        It's the nature of things. Engineering continually makes stuff smaller and cheaper, and data always grows to consume all available space.
        Anyway, to keep this somewhat on-topic, it doesn't surprise me that archive.org was able to build a 100TB server farm. Today, you can get a 160GB drive for $275 (according to a listing on PriceWatch [pricewatch.com]). 100TB is 625 of these drives, which would cost about $172,000. (Of course, it would really cost less, because 625 drives would qualify for a rather large bulk-purchase discount.)
- Try this instead.. (Score:4, Interesting)
  
  by CptnHarlock ( 136449 ) writes: on Wednesday January 23, 2002 @09:37AM (#2887433) Homepage
  
  http://web.archive.org/web/*/http://slashdot.org [archive.org]
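
The link above follows a simple pattern that shows up elsewhere in this thread too: the archive's prefix, a `*` where a capture date would otherwise go, and then the original address. A minimal sketch of that pattern, assuming nothing beyond what the link itself shows (the function name `wayback_index_url` is made up for illustration and is not an Internet Archive API):

```python
# Hypothetical helper: builds the "all captures" listing URL by following
# the pattern visible in the link above. Not an official API.
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_index_url(original_url: str) -> str:
    """Return the wildcard listing URL for original_url."""
    return f"{WAYBACK_PREFIX}*/{original_url}"


if __name__ == "__main__":
    print(wayback_index_url("http://slashdot.org"))
```

The address is pasted in whole, scheme included, exactly as in the posted link.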
  
Ewwwww! (Score:3, Funny)

by NeoTron ( 6020 ) writes: <kevin&scarygliders,net> on Wednesday January 23, 2002 @09:34AM (#2887423) Homepage

And I thought I'd erased all my old embarrassing HTTP handiwork....until I discovered my old website nicely archived - bleargh!

Ah hell, may as well keep it there - it's even got my old web-based Curriculum Vitae on it too - perhaps in some way I've now been "immortalised"?? :)

I've not touched HTML ever since those first abortive attempts I made 5 years ago, cause I realise now that I'm pretty crap at it - I'll stick to Unix admin, what I know best ;)

Re: (Score:2, Funny)

by account_deleted ( 4530225 ) writes:

Comment removed based on user account deletion
- Retro Topics (Score:1)
  
  by netsharc ( 195805 ) writes:
  
  Haha, look at the retro topics of those days.. Netscape, Kernel 2.1.74 (hey, why a dev-kernel?), the MS-trials...
Interesting Thoughts (Score:3, Insightful)

by nurightshu ( 517038 ) writes: <rightshu@cox.net> on Wednesday January 23, 2002 @09:35AM (#2887427) Homepage Journal

I was glad to see the interviewee was brutally honest about free software -- both its benefits and its drawbacks. Discussions among my friends usually degenerate into holy wars, with all of us spouting cliches at one another until we storm off in huffs.
Free software can save the world, I think. We just need to realize that it needs a lot more work to get there.

They haven't got http://web.archive.org/ (Score:5, Funny)

by Rentar ( 168939 ) writes: on Wednesday January 23, 2002 @09:43AM (#2887449)

They don't seem to think the history of their site would be interesting: http://web.archive.org/web/*/http://web.archive.org/ [archive.org] redirects you to their index.html! boring!

Now, that would really be a test for their apps. Same as if Google indexed www.google.com (entirely).

- Re:They haven't got http://web.archive.org/ (Score:2)
  
  by Restil ( 31903 ) writes:
  
  Ahh.. but google can't index google. Check the robots.txt on google. They block their search directory. :)
  
  -Restil
  - Re:They haven't got http://web.archive.org/ (Score:1)
    
    by fo0bar ( 261207 ) writes:
    
    They must be ignoring their own rules then... Not only are they indexing themselves [google.com], you can visit a cached version of their home page [google.com].
    - Re:They haven't got http://web.archive.org/ (Score:2)
      
      by Restil ( 31903 ) writes:
      
      They CAN index their home page. They just can't index their search page. Which just so happens to be most of Google.
      
      -Restil
    - Re:They haven't got http://web.archive.org/ (Score:1)
      
      by Tupper ( 1211 ) writes:
      
      A Google search on pages related to google.com turns up Yahoo, Lycos, Northern Light, etc. Also CNN and Britannica. And Slashdot!
      The world wide web is still very strange after all these years.
- Re:They haven't got http://web.archive.org/ (Score:1)
  
  by Lev_Arris ( 60782 ) writes:
  
  Well, if you consider the wayback machine as a time machine, if it were indexing itself it would probably create a rift in the time/space continuum, right?
  
  Erm.. I need a cup of coffee ;)
Quite a lofty goal... (Score:3, Insightful)

by NOT-2-QUICK ( 114909 ) writes: on Wednesday January 23, 2002 @09:44AM (#2887452) Homepage

As per the article, Brewster Kahle states that:

"The idea is to build a library of everything, and the opportunity is to build a great library that offers universal access to all of human knowledge."

Not only does this sound like a rather far-fetched plot from an old Star Trek episode, but it also seems to be a physical and theoretical impossibility. Even if adequate storage space did exist for such a task (a 100 TB database would be but a small start), I do not foresee any type of technology that could ever adequately capture new data at a sufficient speed to harness that which is human innovation and creativity.

It is a nice thought, however, and I certainly wish him all the best in his pursuits...

- Re:Quite a lofty goal... (Score:2, Interesting)
  
  by limber ( 545551 ) writes:
  
  Kahle's idea is actually quite reminiscent of Vannevar Bush's seminal 1945 description in The Atlantic Monthly of the memex [theatlantic.com], a device that would "give man access to and command over the inherited knowledge of the ages".
  
  The frequency with which this article (the Bush article, that is) has been cited in hypertext research attests to its importance.
Not the biggest DB (Score:5, Informative)

by costas ( 38724 ) writes: on Wednesday January 23, 2002 @09:44AM (#2887454) Homepage

100 TBs do not make the biggest DB ever. I am personally working on a 60-70TB ERP system that's also writeable; I am sure there are bigger systems out there (e.g. Wal-Mart's or GM's ERP systems come to mind).

A read-only DB containing highly-compressible text does not really make for a very challenging datamine. Just because it's on and about the Web and sexier than a stodgy ERP system should not make you overlook the real technology.

- Re:Not the biggest DB (Score:2, Informative)
  
  by limber ( 545551 ) writes:
  
  I am sure there are bigger systems out there (e.g. Wal-Mart's or GM's ERP systems come to mind)
  
  Just to nitpick, in the interview Mr. Kahle does explicitly mention that the database is in fact bigger than Walmart's. No mention is made of GM's, however.
  
  "It's larger than Walmart's, American Express', the IRS. It's the largest database ever built. "
  he says. Whether the claim is credible is a different matter.
  - Re:Not the biggest DB (Score:3, Interesting)
    
    by limber ( 545551 ) writes:
    
    As a side example to this discussion of 'what constitutes a large database', the NOAA's National Climatic Data Center [noaa.gov] maintains a database of about a petabyte of digital climatological data. The Center takes in about a quarter of a terabyte of data *daily*.
  - Re:Not the biggest DB (Score:3, Insightful)
    
    by costas ( 38724 ) writes:
    
    I find the claim dubious. Bigger than what kind of database? Wal-Mart is famous for tracking every single little thing about their supply chain. Most grocers or hypermart chains do the same. I can easily see, say, A&P or Tesco or Carrefour having multi-TB DBs, even petabyte DBs.
    
    Also, size is not the only thing that defines a database installation: the number of simultaneous users or concurrent transactions, read or write access, the ability to roll back, and quality of service standards are way more important in my book (and also for most big companies). Part of the reason DBs in that size range are rare is exactly that current technology does not scale up to those levels while maintaining rollbacks, read-write and fast user response.
    
    I like the Wayback Machine, but to compare it to a proper database is ludicrous. EMC or Veritas will give you much more for their 100TBs of storage than 400 x86 PCs... instant backups for one, and a way larger MTBF.
- Deep Thought etc. (Score:1)
  
  by delphi125 ( 544730 ) writes:
  
  I'm not sure how much storage Deep Thought had, but certainly the computer it was privileged to design has more than any database on this planet - by definition, since that database is simply a subsystem! Even more impressively it compresses down to just two (base 13) nibbles: 42.
  - Base 13? (Score:1)
    
    by vrt3 ( 62368 ) writes:
    
    Where does the base 13 come from? Did I miss something?
    Perhaps it doesn't show, but I'm genuinely interested to find out.
    - Re:Base 13? (Score:2)
      
      by Amazing Quantum Man ( 458715 ) writes:
      
      What do you get when you multiply nine by six? 42.
      
      It works in base 13.
    - - Re:If you're still confused (Score:1)
        
        by vrt3 ( 62368 ) writes:
        
        I knew that, but I can't remember having read anything about base 13.
        
        Re:If you're still confused (Score:2)
        
        by PurpleBob ( 63566 ) writes:
        
        Don't worry. That's because the base 13 thing is a load of crap made up by people who read that part of "Restaurant at the End of the Universe" and managed to miss the point entirely.
- Talk to the US government (Score:3, Informative)
  
  by Remus Shepherd ( 32833 ) writes:
  
  You're right, the Wayback machine is not the largest collection of data -- not even the largest collection online. I work with the USGS's catalog of satellite data. They have over 300 terabytes of satellite imagery, and the collection is growing at a rate of about 1 terabyte per day.
  
  The USGS collection comprises multiple instruments, but Landsat 7 is a big one, contributing about 100 terabytes that's searchable online. [usgs.gov]
  
  Perhaps 'Largest TEXT Database' would be a better description of the Wayback Machine?
  - Re:Talk to the US government (Score:1)
    
    by p3d0 ( 42270 ) writes:
    
    Wow. How do you even deliver 1TB/day to a database? That's more than enough to saturate a 100Mbps network, 24 hours/day. And how do you index it, so it can be usefully retrieved? And how do you store it?
    
    That is a truly astounding amount of data. It's like receiving Google's entire database every week, several times over.
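
The 1 TB/day versus 100 Mbps comparison above can be checked with a few lines of arithmetic. This is only a sketch: it ignores protocol overhead, assumes the data arrives evenly over 24 hours, and the function name `sustained_mbps` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_mbps(bytes_per_day: float) -> float:
    """Average rate, in megabits per second, needed to move bytes_per_day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    decimal_tb = 10**12  # 1 TB counted as 10^12 bytes
    binary_tb = 2**40    # 1 TB counted as 2^40 bytes
    print(f"1 TB/day (decimal): {sustained_mbps(decimal_tb):.1f} Mbps")
    print(f"1 TB/day (binary):  {sustained_mbps(binary_tb):.1f} Mbps")
    print(f"0.25 TB/day:        {sustained_mbps(decimal_tb / 4):.1f} Mbps")
```

With decimal terabytes the average comes to roughly 93 Mbps, and with binary terabytes roughly 102 Mbps, so a 100 Mbps link would be at or past its limit before any overhead is counted. The quarter-terabyte-per-day figure quoted for NOAA earlier in the thread works out to about 23 Mbps.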
Just Network Programming? (Score:2, Interesting)

by shic ( 309152 ) writes:

So... would I be right in thinking that the "Wayback Machine" is an (admittedly large scale) exercise in network database programming (of the style popular pre Codd and his relational model?) I am tempted to question if this is indeed the biggest database ever built - I suppose it depends upon definitions - but to my mind a database should be general purpose... whereas it appears to me that this project is basically a large-scale single index.

I also wonder if it would be appropriate to call this the largest project of its kind - for example - while Google stores less data, I suspect it supports a higher query rate... how exactly do you intend to measure scale... if it is in terms of computing power is it relevant that Google already have thousands of Linux server nodes?

That said - I think it is an exciting project in its own right. I hope and expect this offering to become a significant information resource in years to come.
What? (Score:2, Funny)

by Ezubaric ( 464724 ) writes:

I thought it was going to tell us how Mr. Peabody's wayback machine worked. You know, like the flux capacitor diagrams that made everything clear . . .
Pretty amazing ... (Score:4, Funny)

by CDWert ( 450988 ) writes: on Wednesday January 23, 2002 @09:53AM (#2887487) Homepage

I'd say it's pretty amazing. I was actually able to retrieve content I thought lost years ago.

My sites go back to '95, and yep, they're archived starting in '96. This is too cool.

I wonder how many of the government docs that were pulled post-Sept 11 are still on this?

A really funny note is that it seems like all the p0rn is intact starting in '96. Gotta archive the porn.

But seriously, I was unaware of this. I'm gonna use this thing like hell as a sales tool if nothing else. It's also great for finding certain content that's been pulled.

sigh (Score:2, Insightful)

by __aahlyu4518 ( 74832 ) writes:

This world needs more people like that... driven to make this world a better place... and having fun doing it, being proud of what they're trying to accomplish... This interview sent shivers down my spine... These are the kind of interviews that inspire people... It makes me think about humanity a little less sceptically... There's still hope.
good article, lesson on human spirit (Score:2, Insightful)

by f00zbll ( 526151 ) writes:

The article was good with all its warts and gems. I take it more as a testament to the human spirit. I seriously doubt it's the largest database, though it might be the largest publicly accessible database. I'm sure the NSA could easily dwarf their database considering how much data they collect from around the world every day.
The Cost of a Terabyte (Score:3, Interesting)

by wayn3 ( 147985 ) writes: on Wednesday January 23, 2002 @10:01AM (#2887512)

You buy from EMC a terabyte for maybe $300,000. That's just the storage for 1 TB. We can buy 100 TBs with 250 CPUs to work on it, all on a high-speed switch with redundancy built in.
Interesting quote. Mr. Kahle addresses something I've been wondering for a while -- are storage area networks really worth it? Or is he ignoring the costs of maintenance and manpower to keep these things afloat?
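
For a rough sense of the gap, here is a back-of-the-envelope comparison that uses only figures quoted in this thread: the $300,000 per terabyte EMC price from the Kahle quote, and the $275 per 160GB drive from the PriceWatch listing mentioned in the "Storage sizes" post. It is a sketch, not a real cost model: the drive total leaves out CPUs, switches, redundancy, power and staff, and the function names are made up for illustration.

```python
import math

EMC_DOLLARS_PER_TB = 300_000  # from the Kahle quote
DRIVE_GB = 160                # from the PriceWatch listing cited in the thread
DRIVE_DOLLARS = 275
ARCHIVE_TB = 100


def drives_needed(total_tb: int, drive_gb: int) -> int:
    """Drives required to hold total_tb, counting 1 TB as 1000 GB."""
    return math.ceil(total_tb * 1000 / drive_gb)


def raw_drive_cost(total_tb: int, drive_gb: int, drive_dollars: int) -> int:
    """Cost of the bare drives only (no servers, network or labour)."""
    return drives_needed(total_tb, drive_gb) * drive_dollars


if __name__ == "__main__":
    emc_total = EMC_DOLLARS_PER_TB * ARCHIVE_TB
    drive_total = raw_drive_cost(ARCHIVE_TB, DRIVE_GB, DRIVE_DOLLARS)
    print(f"EMC at list price: ${emc_total:,}")
    print(f"Bare drives:       ${drive_total:,}")
    print(f"Ratio:             {emc_total / drive_total:.0f}x")
```

On those numbers, 100 TB comes to $30,000,000 from EMC against $171,875 for 625 bare drives. The difference says nothing about maintenance and manpower, which is exactly the open question here.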

Copyright infringement (Score:3, Redundant)

by Karma Star ( 549944 ) writes: on Wednesday January 23, 2002 @10:02AM (#2887514) Journal

Seeing that they cache webpages from other sites, I wonder how long it will take before another company sues them?

Also, I wonder what their criteria will be for "submissions"? 1 month? 1 year?

- Re:Copyright infringement (Score:2, Informative)
  
  by madfgurtbn ( 321041 ) writes:
  
  Google, and others, also cache a lot of content. If a web provider doesn't want their stuff cached on The WayBack, all they have to do is include the no bots code in their html.
- Re:Copyright infringement (Score:3, Informative)
  
  by pjones ( 10800 ) writes:
  
  Child! Child! They do not sue you right away -- and they can't. First they send you a cease-and-desist order and you evaluate their claim.
  
  But Brewster answers your question in the interview himself on the second page:
  
  Koman: What about the question of rights? I just wrote about Lawrence Lessig's book on intellectual property. Surely the publishers and the television networks and the record companies aren't willing to let you keep a copy of all of their stuff?
  
  Kahle: All we collect for the Web archive are sites that are publicly accessible for free, and if there's any indication from the site owner that they don't want it in the archive, we take it out. If there's a robot exclusion, it's removed from the Wayback Machine. Over the years, people would notice these things in their logs and would say, what are you doing? And we'd explain what we're doing -- building this archive and donating a copy to the Library of Congress, etc., etc., and 90% of the time they say, "Oh, that's cool, you're crazy, but go ahead." About 10% of the time, they'd say, "I don't want any part of it," and we instruct them on how to use a robot exclusion and they're taken out of history. That seems to work for everybody at this point. People are really excited about this future that we're building together.
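
The "robot exclusion" mentioned in the interview refers to a robots.txt file at the root of a site. A minimal sketch of how a well-behaved crawler would read one, using Python's standard `urllib.robotparser`. The user-agent name `ia_archiver` is an assumption here (it is the name commonly associated with the archive's crawler, but check the archive's own removal instructions), and the sample file and `example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shuts out one named crawler entirely and
# keeps every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets the named agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "http://example.com/index.html"
    print(can_fetch("ia_archiver", page))    # blocked by the first rule
    print(can_fetch("SomeOtherBot", page))   # allowed by the catch-all rule
    print(can_fetch("SomeOtherBot", "http://example.com/private/x.html"))
```

Nothing enforces the file; it only works because crawlers choose to honour it, which is the courtesy the thread is discussing.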
Distributed Computing solution... (Score:3, Insightful)

by Tazzy531 ( 456079 ) writes: on Wednesday January 23, 2002 @10:02AM (#2887516) Homepage

The interview talked a little about throwing more machines on when the demand deems necessary. I wonder if it is possible to do this over the internet? I mean, I'm seeing something along the lines of SETI, where millions of people worldwide donate their unused processor power. Would it be possible to distribute the searches to remote computers over the internet in real time?

- Re:Distributed Computing solution... (Score:2)
  
  by anonymous loser ( 58627 ) writes:
  
  Have you used gnutella?
  
  Same thing.
Government Removed Site still Available (Score:4, Informative)

by Tazzy531 ( 456079 ) writes: on Wednesday January 23, 2002 @10:14AM (#2887571) Homepage

A number of you have asked whether the websites taken down since 9/11 are available on archive.org. The answer is yes. One example is:

DC Air National Guard on Archive [archive.org]

Same Page - 404 [af.mil]

One of the conspiracy websites that I have read was saying that combat airplanes, normally on 24 hour alert, at this base should have and could have prevented the plane from entering the restricted airspace in DC. They were saying that this site was removed because it provided evidence that somebody dropped the ball.

- Re:Government Removed Site still Available (Score:1)
  
  by aengblom ( 123492 ) writes:
  
  Living a five minute walk from the Pentagon I sort of followed the "WTF WHERE WERE THE JETS" stories. I was wondering myself. Turns out the Jets were launched and sent towards New York as air controllers were searching for additional planes. Then they found one...heading for Washington. The Jets arrived within range about 5-10 minutes after the first Jet crashed into the Pentagon.
  - Re:Government Removed Site still Available (Score:2)
    
    by Tazzy531 ( 456079 ) writes:
    
    I am in no way saying that what you heard might not be true, but read this analysis [emperors-clothes.com] of it. I don't totally agree with their analysis, but it sure does make you think. According to them, no jets were scrambled until after the fact.
    
    Make your own decision...
  - Re:Government Removed Site still Available (Score:1)
    
    by xah ( 448501 ) writes:
    
    It was the ol' duck and roll, the ol' feint and lunge, the ol' slice and jab. I don't know if they planned it that way, but that's how it worked. It was a strategy error that a typical Risk player would make. They should have kept two planes in the launch area, and sent just two to NYC. They forgot to cover home base. It's like they sent the catcher to third base, or brought the safety on a blitz.
"Are you violating Copyright Laws?" (Score:2)

by DoorFrame ( 22108 ) writes:

" Question:
Are you violating copyright laws?
About the Internet Archive

No. Like your local library's collections, our collections consist of publicly available documents. Furthermore, our Web collection (the Wayback Machine) includes only pages that were available at no cost and without passwords or special privileges. And if they wish, the authors of Internet documents can remove their documents from the Wayback Machine at http://www.archive.org/internet/remove.html [archive.org]."

I don't really think that they're necessarily right about this. I'm glad they've got the archive up, and I think it's dandy, but it seems like the copying and reposting of others' materials is a suspect practice. This will end up in court as soon as something that someone removed from their own webspace reappears, historically accurate, here. I'd guess some libel suits will be the first...
- Re:"Are you violating Copyright Laws?" (Score:2)
  
  by Tazzy531 ( 456079 ) writes:
  
  I agree. This does walk the gray area of legality.
  
  For example, I took the Term of Use Agreement [cnn.com] from CNN.com:
  
  (B) CNN Interactive contains copyrighted material, trademarks and other proprietary information, including, but not limited to, text, software, photos, video, graphics, music and sound, and the entire contents of CNN Interactive are copyrighted as a collective work under the United States copyright laws. CNN owns a copyright in the selection, coordination, arrangement and enhancement of such content, as well as in the content original to it.
  
  Subscriber may not modify, publish, transmit, participate in the transfer or sale, create derivative works, or in any way exploit, any of the content, in whole or in part. Subscriber may download copyrighted material for Subscriber's personal use only. Except as otherwise expressly permitted under copyright law, no copying, redistribution, retransmission, publication or commercial exploitation of downloaded material will be permitted without the express permission of CNN and the copyright owner. In the event of any permitted copying, redistribution or publication of copyrighted material, no changes in or deletion of author attribution, trademark legend or copyright notice shall be made. Subscriber acknowledges that it does not acquire any ownership rights by downloading copyrighted material.
  
  I'm sure CNN isn't the only site that has this type of Policy.
  
  If they ever do get sued and are held liable for these copyrighted materials, it would be a major loss to the global community. The internet is a part of our history and has a history of its own.
  - Re:"Are you violating Copyright Laws?" (Score:2)
    
    by arkanes ( 521690 ) writes:
    
    They can stick all the policies they want on it, but I doubt that they have any legal force. Normal copyright laws and exclusions would apply, and, since there's no agreement or license, regardless of what they stick in the small print, nothing extra applies. In fact, as I read more closely, it's right there - "Except as otherwise expressly permitted under copyright law".
    To my knowledge (limited), publicly available archives have always fallen under fair use.
    That's not to say that someone won't sue, but, considering that they seem to be more than willing to pull things at owner request, I doubt they'll ever end up in court. And if they do, they'll probably win.
  - Re:"Are you violating Copyright Laws?" (Score:1)
    
    by brain159 ( 113897 ) writes:
    
    That's a policy in English for human beings to interpret. If they want to stop browsing automatons from caching/indexing, the Robots Exclusion Standard is the common and almost always adhered-to standard (except by spambots, of course).
- Re:"Are you violating Copyright Laws?" (Score:2)
  
  by EricEldred ( 175470 ) writes:
  
  In order to resolve the copyright issues, and in order to preserve the public domain on the Internet, the Internet Archive has filed a friend of the court brief in our court case against the Sonny Bono Copyright Term Extension Act. See http://openlaw.org/eldredvreno [harvard.edu]
  
  As of today, the Supreme Court hasn't decided what to do with the appeal. Stay tuned to openlaw.
- Re:"Are you violating Copyright Laws?" (Score:1)
  
  by ichimunki ( 194887 ) writes:
  
  I'm pretty sure they're not far off. I spent some time this morning reading documents at www.copyright.gov and I got the impression that under most circumstances, libraries could do all kinds of things that regular people couldn't.
  
  I also note that archive.org is owned by Alexa Internet, and according to alexa.com: " © 1996-2001, Alexa Internet, Inc. Service provided by Alexa Internet. A wholly owned subsidiary of Amazon.com. " So what we really need to be concerned with is that this whole thing will be patented and we'll have to get a license from Amazon to have Linux clusters or online databases.
- Re:"Are you violating Copyright Laws?" (Score:1)
  
  by shamino0 ( 551710 ) writes:
  
  Maybe technically, but as long as they are willing to quickly pull a site from the archive, and work with the site owner to prevent future content from being archived, I can't see how it will ever end up in court.
  Lawyers are expensive. Nobody is going to file suit without first sending a "please remove my data from your archive" letter.
  And if some site owner is completely stupid and sues first, they can simply pull the pages from the archive, then show up in court and say "we deleted their pages as soon as we were made aware of the problem", and the judge will dismiss the entire case.
Link to various database sizes (Score:3, Informative)

by rkgmd ( 538603 ) writes: on Wednesday January 23, 2002 @10:16AM (#2887594)

http://znet.net/~schester/facts/database_sizes.html Apparently, Walmart's is 24TB, and the entire www index as of 1999 was only 6TB.

Hardly the biggest. (Score:2, Informative)

by dmd ( 404 ) writes:

The biggest database ever? 100TB? Hardly.

I worked at a large pharmaceutical company for two years (known internally as the Squid), and supported a 380TB protein interaction database (Oracle) and a 260TB SAP-backend database (Informix + custom).

Certainly Wayback's database is large, and certainly it holds far more varied information and appeals to a far larger audience, but by no means is it the biggest. I'm sure there are databases that made the ones I worked on look puny by comparison.
Useful resource (Score:2)

by prototype ( 242023 ) writes:

Even though this has been available since Oct, it's the first I've seen of it. I think it's a great resource. Long dead sites that are no longer there now can be found for historical purposes. The interesting thing is that the links on the page are also updated to link to the archived versions. What I found it useful for was building a history page of what my site looked like over the years. Lots of great uses for this so hope it stays up!

liB
Operating system (Score:2)

by johnburton ( 21870 ) writes:

So the article says in one place that they wrote their own operating system, and in another that they use linux (or BSD, I forget which).

So which is it?
- Re:Operating system (Score:2, Informative)
  
  by madfgurtbn ( 321041 ) writes:
  
  They use different OS's for different purposes within the system. The so-called OS they wrote is described in the article. It's a collection of tools for controlling their parallel computer, which is a collection of many inexpensive computers running the BSD and Linux OS's you talk about.
  
  The interviewer is the one who describes it as an OS. The interviewee explains that the real breakthrough is that with their tools an ordinary programmer can operate in a parallel computing environment-- you don't need a specialist in parallel computing anymore. Which leads to the conclusion that relatively small institutions on relatively small budgets can build enormously powerful computers with massive storage.
DBMS and model? (Score:3, Interesting)

by leandrod ( 17766 ) writes: <lNO@SPAMdutras.org> on Wednesday January 23, 2002 @10:44AM (#2887740) Homepage Journal

But what is the DBMS? Is the database relational? How was it modelled?

Biggest ever? I don't think so! (Score:4, Funny)

by Proud Geek ( 260376 ) writes: on Wednesday January 23, 2002 @10:50AM (#2887764) Homepage Journal

I once worked on a site with a 25 year old database that was much larger.

The ancient magnetic storage took up several warehouses. Beat that, for biggest database ever!

Interesting thought process (Score:3, Interesting)

by cheese_wallet ( 88279 ) writes: on Wednesday January 23, 2002 @10:57AM (#2887799) Journal

Pretty decent read, but one thing they said got me thinking a little bit.

They said that at Thinking Machines they built a super fast computer, but it required a new way of thinking about things in order to program it. And then they called this a mistake, because they couldn't attract any customers.

This seems like a real problem that would lead to technological stagnation. At least from a market place point of view.

It is kind of similar to a company making games off of pre-existing engines, like quake, instead of some new non-quake compatible engine.

Or everybody making x86 compatible CPUs.

It also seems that when a company does come up with some new way of doing things, they get burned, and it is the second generation of companies that pick up the torch that make the money. So nobody wants to be that first company, they are all waiting for someone else to break the ground.

Maybe the only people/companies that come up with new stuff are the ones that are insanely rich, and won't get hurt by doing something new, or the insanely poor who have nothing to lose anyway.

I can't help thinking that this clustering boom going on is just like what 3dfx was trying to do. The difference right now is that clustering actually *does* outperform the super fast single chip. I wonder when technological advances will change this fact.

- Re:Interesting thought process (Score:1)
  
  by sunhou ( 238795 ) writes:
  
  They said that at Thinking Machines they built a super fast computer, but it required a new way of thinking about things in order to program it.
  
  That caught my eye too, for a couple of reasons. First, I think in a way, Thinking Machines was a victim of its own success. At first, people thought they were nuts building computers with 65,536 processors in them. Then, when people realized hey, that's actually pretty handy, suddenly there was pretty intense competition for a relatively small market.
  
  Second, what was a "new way of thinking" in the late 80's and 90's when Thinking Machines was doing their stuff, may now have become the norm (or at least "a norm"). A lot more people now think about parallel processing. So you're right, the first guy to do something may very well get burned, and after burning, can watch everyone else succeed.
  
  Third, I think Brewster's sound bite about TMC (Thinking Machines Corp) came off sounding like he was selling them short. I used TMC's Connection Machines, along with parallel machines from other companies (one called MasPar comes to mind). The TMC software was pretty slick in its day. One really great feature it had, which others lacked, was that in the beginning of your program, you specified a "virtual processor geometry", i.e. you told the system "I want to run this program on an array of 1000x2000 processors". The software would then figure out, oh, this machine only has 65536 processors, not 2 million, and it would map the 2 million virtual processors onto the 65536 physical processors, and the programmer didn't have to worry about doing that mapping. It sounds simple, but again, I used other big parallel machines at the time which couldn't do it, which made things a pain (writing a chunk of code in every program to do that mapping myself). You could take your code which requested 2 million virtual processors, and run it on Connection Machines with different numbers of processors, and it would Just Work.
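
The virtual-processor idea described above can be illustrated with a small sketch. This is not TMC's actual software or algorithm; the function names and the simple block-assignment scheme are assumptions made purely for illustration. It maps a 1000x2000 virtual grid onto 65536 physical processors so that each physical processor simulates a contiguous run of virtual ones.

```python
import math

def make_mapping(rows, cols, physical):
    """Return (vp_per_proc, owner) for a rows x cols virtual grid.

    Hypothetical block mapping: virtual processors are numbered in
    row-major order and handed out in contiguous chunks.
    """
    total = rows * cols
    # Chunk size: how many virtual processors each physical one simulates.
    vp_per_proc = math.ceil(total / physical)

    def owner(r, c):
        """Physical processor index that simulates virtual processor (r, c)."""
        if not (0 <= r < rows and 0 <= c < cols):
            raise IndexError("virtual processor outside the grid")
        return (r * cols + c) // vp_per_proc

    return vp_per_proc, owner

vp_per_proc, owner = make_mapping(1000, 2000, 65536)
print(vp_per_proc)                      # virtual processors per physical one
print(owner(0, 0), owner(999, 1999))    # first and last virtual processor
```

Changing only the `physical` argument re-targets the same virtual geometry at a machine of a different size, which is the convenience the comment describes.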
Another interesting site linked to in the article (Score:1)

by dietz ( 553239 ) writes:

This site [televisionarchive.org] has archived television from all over the world on September 11th-September 18th.

I've been pretty jaded and unpatriotic/anti-war about the whole thing, but I can't help but admit I still get creeped out watching the footage.
You know what is SAD? (Score:3, Funny)

by dood ( 11062 ) writes: on Wednesday January 23, 2002 @11:20AM (#2887915)

Slashdot looks the exact same it did 5 years ago!

WHEN is this site going to be updated? Forget the wayback machine, if I want ancient web history I visit slashdot.

--Dood

- Re:You know what is SAD? (Score:1)
  
  by qslack ( 239825 ) writes:
  
  Rob (CmdrTaco) has always said that if you come up with a better design, email him and he'll put it up if he thinks it's worthy. So instead of complaining, go redesign Slashdot yourself!
  
  It would sure be a cool thing to be able to say you designed Slashdot. :)
Wisdom in his words.. (Score:3, Interesting)

by grub ( 11606 ) writes: <slashdot@grub.net> on Wednesday January 23, 2002 @11:32AM (#2887994) Homepage Journal

From the article:
How the archive works is just with stacks and stacks of computers running Solaris on x86, FreeBSD, and Linux, all of which have serious flaws, so we need to use different operating systems for different functions.

The man puts bias aside and uses various OSs in areas in which each performs well. A real, tangible project like this is worth more than any amount of drooling zealotry.

Slashdotted. (Score:1)

by TheCrunch ( 179188 ) writes:

Arg. Where's Google's cache? Oh.. wait.. nevermind.
200 transactions/second? (Score:4, Insightful)

by selan ( 234261 ) writes: on Wednesday January 23, 2002 @11:37AM (#2888027) Journal

Having so few transactions for a database of this size probably helps them run without needing large expensive machines. Many VLDBs [vldb.org] support thousands of transactions per second. I found a list here [wintercorp.com] of top ten winners of a very large database scalability contest. The winner for peak performance was something like 20,000+ TPS.

omg, a beowulf article (Score:1)

by netsharc ( 195805 ) writes:

imagine what beowulf joke we can put...
Ok, nice, but what data do we throw away? (Score:2)

by joshv ( 13017 ) writes:

This brings to mind a serious question. What data, if any, do we throw away? With ever expanding storage capacities it's getting easier and easier to just keep stuff, than to sit down and figure out what you want to throw away.

In the past media degeneration and obsolescence over time have made the decisions for us. But going forward we will have massive distributed, redundant data stores, with geographically remote backups. The data isn't going to go away unless we tell it to.

Freenet addresses this problem by culling the less popular data (not actively, but as a end result of its caching policies) - but this has the unfortunate effect that important data can get lost. Not a desirable behavior for corporate data.

-josh
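
The effect described above, where unpopular data disappears as a side effect of caching rather than by deliberate deletion, can be seen in a generic least-recently-used store. This is a minimal sketch and not Freenet's actual datastore code; the class name, the keys, and the capacity of 2 are made up for illustration.

```python
from collections import OrderedDict

class LRUStore:
    """Tiny least-recently-used store (illustrative, not Freenet's code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # a request marks the key as recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # silently drop the coldest entry

store = LRUStore(capacity=2)
store.put("popular", "cat pictures")
store.put("important", "payroll records")
store.get("popular")               # popular data keeps getting requested
store.put("new", "today's news")   # store is full, so something must go
print(sorted(store.items))
```

Nothing here ever decides that the payroll records are worthless; they vanish only because nobody asked for them recently, which is the behavior the comment calls undesirable for corporate data.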
- Re:Ok, nice, but what data do we throw away? (Score:2, Interesting)
  
  by delta0 ( 181139 ) writes:
  
  : In the past media degeneration and obsolescence
  : over time have made the decisions for us. But
  : going forward we will have massive distributed,
  : redundant data stores, with geographically
  : remote backups. The data isn't going to go away
  : unless we tell it to.
  
  GOOD! And there is something *wrong* with this??
  Seriously -- the Internet should have been designed like this from the start. Don't ever throw away, simply classify and organize. In a throw away world, we already destroy too much to afford losing what's on the Internet.
  
  Given, there is lots of junk and what some might consider noise, but... (as the saying goes) some people's junk is others' treasure.
Listing can be removed from the archive. (Score:1)

by SpookyR ( 115189 ) writes:

With an appropriate robots.txt [mjifc.com] file, a site's listing can be stopped from showing up [archive.org] on the Wayback archive. Interesting.
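
As a rough illustration of how such an exclusion works, the sketch below parses a sample robots.txt with Python's standard `urllib.robotparser`. The `ia_archiver` user-agent token is the one commonly associated with the Alexa/Internet Archive crawler; treat that token, the example.com URL, and the rules themselves as assumptions rather than current guidance from the Archive.

```python
from urllib.robotparser import RobotFileParser

# Sample rules (assumed for illustration): block one crawler entirely,
# allow every other crawler everywhere.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

url = "http://www.example.com/index.html"
print(parser.can_fetch("ia_archiver", url))   # the archiving crawler
print(parser.can_fetch("SomeOtherBot", url))  # any other well-behaved crawler
```

This only shows what a compliant crawler would conclude from the file; as noted elsewhere in the thread, the standard is voluntary and depends on the crawler honoring it.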
they are blocking on request :( (Score:1)

by dickens ( 31040 ) writes:

I was kinda bummed to see that dec.com and digital.com yielded:

Blocked Site Error.

Per the request of the site owner, http://www.dec.com is no longer available in the Wayback Machine. Try another request or click here to see if the page is available, live, on the Web.
http://www.dec.com
Mheh. (Score:2)

by Mr. Neutron ( 3115 ) writes:

Some things are just plain funny. [archive.org]
sweet... where do I donate? (Score:1)

by delta0 ( 181139 ) writes:

I hope people help them out, they have already brought me back to some cool stuff.

This is a noble cause.
100 TB of information (Score:1)

by clickety6 ( 141178 ) writes:

That must be like 99.99 TB of pr0n!!!
- Re:100 TB of information (Score:1)
  
  by delta0 ( 181139 ) writes:
  
  Well regardless, I love the fact that they keep binaries such as hard to find .tgz's and shelved projects. It's amazing how some things just seem to fall off the net, and then are almost impossible to locate.
  
  I so hope that they don't have to contend with any legal troubles which might interfere with the initiative. If so, they should move the archive to an island, move the servers to Switzerland, or a boat somewhere at sea, where they can have legal immunity. This is too great a cause to have tangled up in a brainless legal mess.
Slashdot Slashdotted? (Score:2)

by NitsujTPU ( 19263 ) writes:

I can't seem to get the slashdot pages archived to load.
the misanthropic bitch (Score:3, Interesting)

by joshuaos ( 243047 ) writes: <(moc.tnemodeerf) (ta) (soroboruo)> on Wednesday January 23, 2002 @02:07PM (#2889074) Journal

I spent the summer on the road, and when I settled down for the cold months, I was quite sad to see that the Misanthropic Bitch [shutdown.com] appears to have vanished. This made me very sad. Today, when I read this article, I was delighted to find that all of dear bitch's articles are archived.
I think this is a fabulous project, and I hope it does well. However, I think that the notion of such a centralized database will begin to become unrealistic. I think peer to peer projects are the future, and I can see a day far in the future when the database layer comes down and inhabits the filesystem layer and all the databases on the internet can talk to each other, and in a sense, the net becomes a giant database that anyone can contribute to.
Cheers, Joshua

Their movie archive has "Hired!" (Score:3, Informative)

by for(;;); ( 21766 ) writes: on Wednesday January 23, 2002 @03:00PM (#2889477)

Hot damn! Their movie archive has a downloadable version of the short they showed on MST3K prior to "'Manos:' The Hands of Fate."

"Ma'am, did you realize that Chevrolet has an important plan for your life?"

Isn't this illegal? (Score:3, Insightful)

by russ-smith ( 126998 ) writes: on Wednesday January 23, 2002 @03:11PM (#2889554) Homepage

The majority of information being collected by Archive.org is covered by copyright law. It is up to Archive.org to get permission before they republish the information. If you look at the Archive web site they run banner ads for the Alexa toolbar. This Alexa service provides marketing information somewhat similar to the Nielsen ratings for TV. Archive.org has received complaints about their service, contrary to the statements made in the published article. Archive.org has refused to respond in any meaningful way to these issues. Archive.org is trying to put the burden on the publisher to determine that the Archive is publishing it, find it within the Archive web site, and then provide them a notarized statement. See their FAQs at
http://www.archive.org/exec/faqsidos/about/faqs.html?index=2 [archive.org] and
http://www.archive.org/exec/faqsidos/about/faqs.html?index=26 [archive.org]
The claims made in these FAQs are just not consistent with the law. Are they going to repost everything that was available on Napster?
Their algorithm also has some problems: some domains that are redirected fool it into associating content with a site that was never actually associated with that content. Trying to find copyrighted works would be a nightmare. Archive.org has refused to respond to any of these issues and, in fact, is lying about it if the quotes in the article are factual.
Russ Smith

OT IDSoftware (Score:1)

by dominic7 ( 70356 ) writes:

slightly off topic but check out ID Webpage circa Dec 1997 and then look at it today. The work that must have gone into updating that baby (;
It's naive to think this is the largest database (Score:1)

by tuxlove ( 316502 ) writes:

The US federal gov't has larger databases than this. They just don't talk about them too much.

BTW, looks like the Wayback Machine has been slashdotted.
Google (Score:1)

by arealperson ( 140859 ) writes:

I wonder if Google indexes this site. Think of the ramifications on finding the relevant information. Paradoxes galore!!
beowolf? (Score:1)

by aierwin ( 445651 ) writes:

wow!
just imagine a beow.......

ah. nevermind...
forgotten flavor... (Score:1)

by cyberfraud ( 95382 ) writes:

well, look at that... it has tons and tons of lost and forgotten porn... time to buy that new 160 gigabyte hard drive!
Doesn't work in Konqueror? (Score:2)

by sandler ( 9145 ) writes:

Has anyone tried this site in Konqueror? There is a floating link to the home page on top of the results that says search in progress, but never goes away?
- Nor Opera 6.0 (Score:1)
  
  by PhilMills ( 209855 ) writes:
  
  Same symptoms. I had to start up IE5.5 for the first time this month to use archive.org.
easy one (Score:2)

by geekoid ( 135745 ) writes:

index it, split the database horizontally, each split on a different machine.
I would have to look at the database to tell you how big each split would need to be, and how much power the machine housing each split would need, as well as other details.
depending on certain variables, I might even consider splitting it across a cluster of some sort. but it's hard to say without a real look at the structure.
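
A minimal sketch of the horizontal split suggested above: each record goes to one of several machines chosen by hashing its key. The shard count, the host names, and the choice of a URL as the key are assumptions for illustration; nothing in the article says the Archive partitions its data this way.

```python
import hashlib

# Hypothetical machine names; a real deployment would load these from config.
SHARDS = ["node-00", "node-01", "node-02", "node-03"]

def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the key, so a given key always maps to one machine."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

def partition(keys, shards=SHARDS):
    """Group keys by the shard that would store them."""
    placement = {name: [] for name in shards}
    for key in keys:
        placement[shard_for(key, shards)].append(key)
    return placement

urls = ["slashdot.org/", "archive.org/", "google.com/", "oreilly.com/"]
for name, stored in partition(urls).items():
    print(name, stored)
```

Simple modulo hashing like this reshuffles most keys whenever the number of machines changes, which is one of the "other details" that would need a real look at the data before choosing a scheme.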
