The Internet

Snapshotting the Whole Internet? 142

Anonymous Coward writes "CNN is running a story about a company that is saving periodic 'snapshot' archives of the whole www (or as much as they can) for historical purposes. Interestingly they say that although they might have considered saving everything except ads, they didn't throw away the ads because historians claim that ads give a better "glimpse of what life was like" in the past. I wonder what legal ramifications will arise for possessing such archives of the "whole web" as snapshots-in-time. Thoughts of DeCSS, CPHack, MS Kerberos' click-wrap license, I.P. "ownership" of collected databases cross my mind."
This discussion has been archived. No new comments can be posted.

  • Yes, we need the ads.
    That way, when we're outfitted with clothes for backward time travel, we won't look like Austin Powers.
  • hm. my hosts file is preventing me from easily reading the whole CNN article, but here's an article about a possibly different company doing the same thing, dating back to 97:
    http://Slate.msn.com/webhead/97-02-27/webhead.asp [msn.com], the website itself is http://www.archive.org/ [archive.org].

    The related Xerox project I think is merely affiliated with Archive.org, actually, and is currently called the Internet Ecologies [xerox.com] project
  • by BadlandZ ( 1725 ) on Saturday July 08, 2000 @08:33AM (#949045) Journal
    "The way we're able to pull this off is by having robots that go around and contact every Web server around the world periodically, and download each page -- each image -- off of every one of those sites"

    Why do I just NOT believe this? In order to do this, they would not even be able to search for httpd at every IP, because it would only grab one host, and virtual hosts would be skipped.

    They would actually have to search every registered domain name, dig up all its host names, and then search each one for an httpd, and I don't believe they are actually going to do that (can someone point out to me that they are?).

    One of the biggest problems on the web right now is the lack of organization of information... "You want to know what? Oh, it's on the web, no doubt, but WHERE!" As brilliant as search engines are, they still don't know where everything is, and you frequently will miss what you REALLY want.

    So along come these people, and they make an even larger claim than searching through _all_ of the web, but they say they will take a snapshot of ALL of it?

    /me looks at hit logs for home Linux box running apache on modem... Hmm... Don't see anything... been up weeks... I'll keep looking ;-)

    At best, they will get the most popular sites, and try to leave it at that.
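The virtual-host objection above can be sketched in a few lines. With name-based virtual hosting, the server picks which site to serve from the HTTP `Host` header, so a crawler that only knows an IP address can ask for just one (default) site. This is a minimal illustration, not anything the archive is known to run: the helper `build_request`, the host names, and the address `192.0.2.10` (a documentation-only range) are all made up for the example.

```python
def build_request(hostname: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request for one named site."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {hostname}",   # this header selects the virtual host
        "Connection: close",
        "",                    # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# Hypothetical setup: two sites served from the same IP address.
SHARED_IP = "192.0.2.10"
sites = ["www.example.com", "www.example.org"]

# Same destination address, but a different request per host name.
requests = {name: build_request(name) for name in sites}
```

Both requests would go to the same address, and only the `Host` line tells the server which site is wanted. That is why a crawler needs a list of host names (from links or domain records) rather than a sweep of IP addresses.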

  • This will become difficult in the future as the web becomes a distributed dynamic infrastructure. It's easy to capture static pages, but not so easy to capture dynamic computing resources.

  • Lynx Information

    Lynx

    Lynx is a text browser for the World Wide Web. Lynx 2.8.3 runs on
    Un*x, VMS, Windows 95/98/NT but not 3.1 or 3.11, on DOS (386 or
    higher) and OS/2 EMX. The current developmental version is also
    available for testing. Ports to Mac are in beta test.

    * How to get Lynx, and much more information, is available at Lynx
    links.
    * Many user questions are answered in the online help provided with
    Lynx. Press the '?' key to find this help.
    * If you are encountering difficulty with Lynx you may write to
    lynx-dev@sig.net. Be as detailed as you can about the URL where
    you were on the Web when you had trouble, what you did, what Lynx
    version you have (try '=' key), and what OS you have. If you are
    using an older version, you may well need to upgrade.


    Maintained by lynxdev@browser.org.

    Commands: Use arrow keys to move, '?' for help, 'q' to quit. Arrow keys: Up and Down to move. Right to follow a link; Left to go back.
    H)elp O)ptions P)rint G)o M)ain screen Q)uit /=search [delete]=history list
  • You can tell that all they are trying to do is get the largest archive of free porn :)
    -
  • .. the most important part in the history
    of the Internet, namely Al Gore?
    In 50 years time, perhaps people will look back
    and say: "remember son, that if it hadn't been
    for Al Gore, nothing of this would have been created".

    I'm bored...shoot me..
  • By the time they finish taking a snapshot of the net half the sites would have changed, and btw what frequency is the best?
    The Royal Swedish Library decided on every six months as a reasonable compromise between completeness and what was practical. (And they only archive Swedish sites and some sites related to Sweden.) Apart from gathering, they've also asked for copies of sites from before they started, so I know they have at least one as it looked in 1994. They're quite determined they won't filter anything, as they're still upset that someone around a hundred years ago decided to clean out "uninteresting" posters from the archives... (and nowadays they always get more underground storage space when it's needed and there's an offsite backup site for everything on paper produced today) http://kulturarw3.kb.se/
  • Just realised, everything we say here would be archived too. We could be making ourselves look bad if it turns out to be a good thing and we are bagging it this whole time.

    And what about all those lame arse web pages you've ever made? I know I've still got some bad ones flying around there somewhere :)

    Tigris
  • "And bring me a hard copy of the Internet so I can do some serious surfing." - Pointy Haired Boss
    Dilbert comic strip, 6 June 1999
  • Then again, anything before ms touches it is better.
    ___
  • Omigawd! This is actually for real??

    I thought someone had managed to troll CNN with the "My boss asked me to back-up the Internet for him" gag.

    Wouldn't be the first time they 'reported' on a hoax.
  • by Da_Monk ( 88392 ) on Saturday July 08, 2000 @10:49AM (#949055)
    i parsed this as
    "Slashdotting the whole internet"

    help me....

  • Now after Microsoft goes on their Jihad against warezers trying to take down all illegal copies of their programs, you can consult these archives to find some choice applications.
  • I can just see the ad campaign for Whataburger in 2005...

    --Joe
    --
  • by Anonymous Coward
    So since they are currently only archiving ASCII text your point is irrelevant, isn't it? They aren't demonstrating the development of communications technology at all, merely the content on web pages.

    Hmm.. Interesting. That's not the way the Headline News story said it. It does destroy the ability to get certain information out of the web past 1998 as many sites are image dependent. You can still see the increase of use of certain media technologies by looking at the HTML that includes them and in-line scripting.

    <OFFTOPIC>
    Thanks to the efforts of liberals and atheists everywhere people, like myself, who hold with decent Christian ethics are oppressed and considered to be backward and out of touch with the modern world. Your post seems to indicate that you also believe in the Lord and follow the Bible, so I don't see how you can disagree with what I'm saying. The temptations of Evil are many in today's world, and if I choose to try and spread the Word, why is that considered to be such a bad thing?

    Well, neither hard-core liberals nor hard-core conservatives truly uphold Christian values completely. I often take issue with the fact that the Christian Coalition places fiscal politics in its platform. Supporting Big Business has nothing to do with Jesus's teachings, IMHO.

    Similarly not all of what you're saying has anything to do with the Word of God, and attempting to justify yourself by falsely claiming those views are supported is detestable. I feel free to disagree with you because I feel that my faith in God does not mean that I should believe against the archiving of the Internet for historical purposes.

    Plus, I fully support attempts to spread the Word of God. However, you must realize that actions often speak far louder than words. This is the core reason behind most anti-Christian sites. From what I can tell, the most virulent are written by people who grew up in Christian households, but were burned as children by hypocritical Christians, Pharisees, if you will. I have seen plenty of people who attempt to shield all their poor attitudes and bullying tendencies behind their faith, which is not how Jesus intended for us to live.

    In many cases, these are what I call "Style-Over-Substance" Christians. Those who feel that going through the motions is all that is needed rather than holding Christ in your heart. Christ is willing to forgive a stumble or two on the path, but simply mimicking his steps without having his word in your heart is an empty and fruitless act. People who pray when there's an accident or a storm but not when they are under temptation are often the same people who shout at others on corners rather than trying to befriend them and lead through their lifestyle. This is what bothers me.

    The attitude you've shown here, and in other posts I read from your "User Info" link almost leads one to believe that you're deliberately trying to give Christians a bad name. Almost.

    Lastly, I'll quote a verse that I too often fail to follow properly: "Judge not, lest ye be judged yourself." Condemnation of sinners is not an option. Guidance is what we are supposed to provide.
    </OFFTOPIC>

    That's what American history textbooks do, and they are hardly the most unbiased texts in the world are they? I can't remember where, but there's a book about how bad American history teaching and books are.

    The fact is that the majority of political and social change in early America, in Europe, and all around the world, for that matter, was done by a particular elite social class or two. My point is that that doesn't make the time periods and literature from those eras historically valueless.

    I won't even try to go into the actual quality of elementary and high school American history education. I honestly have to wonder about that whole George Washington and the cherry tree story, too. Oddly, no book seems to mention that he was elected to his state legislature due in no small part to the illegal but common practice of "treating," or bribing voters with alcohol. However, the record of exactly how much alcohol and of what kinds he brought is available. (Gotta love listening to Paul Harvey, sometimes.)

    It might make me anal, but not a troll.

    Sorry, in my experience, the people who spend the most time dickering about formatting and spelling are those that spend the least time in critical thinking about the points made. Yes, I do regret not deleting that last hypocritical jab about spelling.
  • That was me. I just forgot that after previewing, you lose your login if you don't have cookies on.
  • See the problem I see is that capturing the whole Internet's state would in itself trigger enough firewalls and write enough log lines that there would be more data created than actually recorded.

    Can you imagine the bot downloading a firewall log that somehow became public and is being written to faster than the bot can record? And since it's recording that log itself, it's causing an infinite loop of new log lines being generated :)

    DoS on the bot?
    Free Porn! [ispep.cx] or Laugh [ispep.cx]
  • A large portion of the Web is dynamically generated. Isn't it very hard to snapshot it automatically without human interaction?

    -jfedor
  • by Raunchola ( 129755 ) on Saturday July 08, 2000 @08:38AM (#949062)
    Joey: Wow Grandpa, this was how the Internet was back in your day?

    Grandpa: Yep Joey, that's right, this archive shows how the Internet looked back when I was a youngster like yourself.

    Joey: Wow! You know Grandpa, it's too bad what the Internet has become now. Just look at this...porno, warez kiddies, AOLers, and ads everywhere! Talk about a downward spiral.

    Grandpa: Joey?

    Joey: Yeah Grandpa?

    Grandpa: You're still looking at the archive.

    --
  • They'd have to do more than that... They'd also need to preserve copies of browsers that render the pages as they do today, and also keep a machine or two (or develop emulators) to run them on..... Maybe even keep a server or two + some network hardware.
  • This is a really excellent idea, I agree.
  • We intend to provide access to researchers, scholars, historians, etc. Research proposals can be submitted here [archive.org].

    That, to me at least, begs the question of why do people need to apply for access at all? Let everybody have a crack at it ... after all, for many of them, it is their data you archived.

    --------------------

  • Not last time I checked

    You get channels enough to support 4 drives standard on most MBs. A Promise Ultra33, Ultra66, or IDE RAID card will get you channels for another four drives at the cost of 2 IRQs. That's about as far as you can go unless you happen to somehow magically have another 2 IRQs left for another card, in which case you may be able to get away with adding another 4 drives for a total of 12.

    I think he either accidentally added a 0 (as in 2 75gig drives) or dropped a point (as in a 20.75 gig drive)


    --
  • Then again, anything after ms touches it is better.

    I bet I'll get flamed for this, and the flames will get +5's too... I love the double standards of Slashdot.
  • by Money__ ( 87045 ) on Saturday July 08, 2000 @08:16AM (#949068)
    . . the old internet before Microsoft.
    ___
  • Not everything is for money. This is a non profit organization, as you can see here [slashdot.org].
  • You know, even if there are only 6 billion people in the world, most people I know have 5 or 6 different homepages (unmaintained, of course) and companies may have hundreds of sites, so it doesn't have to be that every 6th person in the world has a website.
  • 1024 terabytes = 1 petabyte, 1024 petabytes = 1 exabyte

    (Old joke: what's the difference between a novice geek and a real geek? A novice geek thinks there are 1000 bytes in a kilobyte, while a real geek thinks there are 1024 meters in a kilometer. Haha, LAUGH god damnit)
  • <i>And as such, it's still in a state where teething problems overwhelm content.</i>

    Absolutely... however, the fact that these snapshots are taking place during the teething period makes them far more interesting. Periods of change offer far more insight and interesting data reflective of a given society than studying its finished product ever has/will.

    <i>Research carried out by both my consultancy group and others all indicate that the majority of people able to use the web are white, middle-class and certainly in higher than average tax brackets...</i>

    Two comments here... First, are you throwing out the entire archive because this particular demographic is not important or something? Or maybe because it's not a politically correct one it shouldn't be considered relevant for societal inquiry? Second, throughout history, recorded history has almost always been about the more affluent classes... A Dark Ages serf really doesn't have the resources to commission an expose on his life. However, there are long and well-practiced methods of studying one class to determine how another lived. Historians aren't so ignorant that if all the information they have is on white middle class suburbia, they assume that white middle class suburbia is all that existed. Don't edit or discard information because <b>you</b> don't think it's relevant. That type of historical auditing forever cripples the studies of future generations. Additionally, enough random people from very different demographics are also up and running on the internet that a lot of their information is also recorded. Historical studies, unlike political ones (and your points are all political, none historical), are not an exercise in polling the statistical data of the given information in order to appeal to the politically correct, or the political majority at any given moment. You are being far too arrogant yourself and way underestimating historians who haven't even been born yet.

    <i>Anyway, any study that attempts to categorise how we live at the moment using the web is doomed to be prejudiced and incomplete...</i>

    Any study based on any single source is always doomed to be incorrect. However, there are no absolutely correct histories anywhere, which is why there is always on-going historical study.

    You take far too much upon yourself, and far too much credit from those who do this for passion and for a living.

    -T

  • Have they also thought about archiving some of the more influential (developer) mailing lists?
    Yes, there are websites that carry archives of some mailing lists - will those be archived also?
    I'm not talking about high-volume, low quality, newbie-filled lists but the developer-oriented lists could prove to be interesting ...
  • True enough, but hardware limitations prevent providing completely open access at this time. Most of our hardware resources are devoted to simply storing the data; we can't support millions of users as well. (yet) We're a small operation. Also, recall that we are storing snapshots over time, and standard HTTP protocol has nothing in it to let a user specify "I want the index.html from June 1996, please". So access to our collections is not performed through any standard mechanism at present.
  • Sigh. Actually, what we recommend is robot exclusion. If you have those in your WWW pages, they won't be crawled and will never make it in to the Internet Archive to begin with. We do honor requests to have particular pages removed after the fact. -Megan, interning at the Internet Archive
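For anyone wondering what robot exclusion looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the crawler name `example-archiver` are invented for the example. They are assumptions, not the archive's actual user-agent or anyone's real rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep one (made-up) archiving robot out of the
# whole site, and keep every other robot out of /private/ only.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str) -> bool:
    """Return True if the rules above let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


archiver_ok = allowed("example-archiver", "http://www.example.com/index.html")
other_ok = allowed("otherbot", "http://www.example.com/index.html")
other_private_ok = allowed("otherbot", "http://www.example.com/private/notes.html")
```

A well-behaved crawler runs a check like `allowed(...)` before every fetch. Note that this is purely voluntary on the robot's side: nothing in the file actually blocks access.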
  • Every megabyte they add to their archives will cause the web to grow by another megabyte!

    (Note: I know that they won't be stupid enough to recursively mirror themselves, but the thought was kind of funny)
  • by paled ( 22916 ) on Saturday July 08, 2000 @08:18AM (#949077)
    Sounds like an excuse for having the worlds largest pr0n collection 8^)
  • by ballestra ( 118297 ) on Saturday July 08, 2000 @08:41AM (#949078) Homepage
    Future historians will no doubt be fascinated by the "ancient mystery" race of trolls. They developed their own hybrid latin/arabic based alphabet, but often reverted to more primitive picture-based communication. The figurehead of their religion was a deity named Natalie Portman, who they worshipped with sacramental rituals involving hot breakfast cereals. Who these people were, and how they really lived, will sadly remain a mystery.
  • Your estimates of hundreds of terabytes are way off. They say on their site [archive.org] that the whole archive including web pages from 1996-Current, and a limited amount of FTP and Usenet is only ~14TB. That's not bad.
  • Well, I'm glad somebody is finally doing this. Now if anyone ever needs to prove that something happened or existed at a certain time, it could definitely be proved (there must certainly be web sites about everything that is happening). In a court case about something such as Kerberos (sp?), they could get an original, unchanged version of the document that existed before the case was brought to trial. This way, changes the company made during or right before the trial will have no effect.

    Josh
  • by Rhys Dyfrgi ( 64793 ) on Saturday July 08, 2000 @08:43AM (#949081)
    Most people seem to have not found the homepage of the project (not surprising, as I saw no link on the CNN story.) The project is at http://www.archive.org [archive.org]. There are 3 archives there; the web, from 1996 to now, taking 13.8 TB. FTP, in 1996, taking .05 TB. And Usenet, from '96 to '98, at 0.592TB. All this space info is from the front page of the site.

    There is info on the side on how the archive is accessed, created, who pays for it, everything. Read it before you hit that post button another time.
    ---

  • It's not a question of converting anything into a legal document: it's a useful source of additional evidence, though, if the owners let it be used that way. Short of new legislation, though, they're under no positive duty to help someone trying to sue a site owner.

    In practice (and this goes for more or less all of the legal world) you can get evidence of what was at a given site by traditional methods with only a little bit of technology. It's not outside the rules of evidence for a witness to stand up in court and say, on oath, that he saw such-and-such on a site one day.

    Indeed, that's how nearly all documentary evidence goes in unless both sides agree it's kosher: someone has to stand up, swear in as a witness and say on oath that a particular document came into existence in the form it is in before the court on the date written at the top, etc..

    Producing a hard-copy of the offending page with someone to swear he downloaded it on the date shown at the bottom and printed it out would be quite enough, without having to go pester someone who wasn't a party to the action.

    And this is before you get discovery of the other side's HDD and run some undeleting software on it. That's been done in a couple of decided cases and certainly the UK courts are happy with it provided the party doing it wheels a geek in as expert witness (and you guys are among the cheapest expert witnesses going: sharpen up, OK?) to reassure the judge there's no jiggery-pokery involved. I've got a case on at the moment where we've done just that.

  • gaaa...neat...now thats something I've GOT to try ;)
    --
  • let me try and clear this up for everyone: JE is a troll.

    But what's worse is his strategy of arguing against humanity's overall increase in knowledge, and having many of the lurkers take this stuff seriously. Those lurkers in turn become the grousing bitches in the workplace, the fux0rs that make it impossible to achieve greatness. Daikatana is a perfect example of not only inept management, but a bunch of nasty grunts that regurgitate base cynicism, by way of sucking up this sort of troll born tripe.

    Why would you discourage the archiving of such a valuable chunk of data? Besides just eliciting responses... In the final analysis, there is no reason not to support (or at least stand clear of) these archive.org folks, is there?

    In short, this is the crank-assed behavior that hobbles the entire high tech world. This is not a well formed criticism or a burst of insight, this is a wanky know-it-all piece of crap designed to spark a useless debate. The problem is, that is more damaging than anything else going on on the web today.

  • I'm sorry, but as much of a fan of the web as I am, I really wouldn't consider it to be something worthy of archival in the state that it is at the moment.

    Excellent troll. You are to be commended, sir.

  • Doesn't Google's cache already do this? They save a lot of a site already.
  • It is not a snapshot of our culture as a whole. I agree. But we are a part of the larger culture. The data (if truly it is being collected) will be significant. I would suggest that there are many more of us than there are of the royal family. I would like to think we have more impact. In terms of tracking growing trends we are perhaps even useful. If the spread of our cultural normal curve is widening, that would imply the extremes become more worthy of notice, yes?
  • How is this feasible in terms of bandwidth and storage? AltaVista runs on dual OC-48s and some horrendously expensive Alpha cluster(?) or somesuch, and even though they're only categorizing pages (and not storing them), they claim to only have surveyed 30% of all sites.

    What kind of technology in terms of storage media and bandwidth is needed for a company to actually mirror all available websites, even when we take robots.txt, password protected sites, and things like database-driven sites into consideration?

    It sure seems like a lot of diskspace and bandwidth to me!
  • Sure, this'll be a useful reference for future generations, won't it? I'm sorry, but as much of a fan of the web as I am, I really wouldn't consider it to be something worthy of archival in the state that it is at the moment. Why? Well, because currently the web is still in the transitional period between the days of ARPAnet and purely academic use and acceptance as a medium through which the general public can communicate. And as such, it's still in a state where teething problems overwhelm content.

    The "teething problems" are what makes it historically interesting! I think historians of the future will be much more interested in looking at the development of the web through trial and error than at the finished product.

    Anyway, any study that attempts to categorise how we live at the moment using the web is doomed to be prejudiced and incomplete. Until everyone is online and has equal access, this is just another arrogant study attempting to categorise who is worth enough to be able to use the net.

    It's not a study, it's an archive. The purpose of this project is to collect data, not to analyze it or place any sort of value judgement on it. Yes, the contents of today's web are reflective of the current online population. But it's rather naive to assume that future historians won't be smart enough to realize that and account for it. They'll know that the vast majority of today's world is still unwired, just as modern historians know, for example, that the vast majority of people in the 18th century did not have access to a printing press.

  • Remember 1994? Monica Lewinsky was just another intern. peecees were still 16 Bit. Linux was 1.0 [memalpha.cx] and a guy named Jim Clark started Netscape Communications. [wired.com]
    ___
  • by Jon Erikson ( 198204 ) on Saturday July 08, 2000 @08:54AM (#949092)

    The "teething problems" are what makes it historically interesting! I think historians of the future will be much more interested in looking at the development of the web through trial and error than at the finished product.

    Well I suppose that the sheer amount of perversion and degradation available on the net at this point in history will provide a lot of interest to future historians, so in that context sure it'll be "historically interesting"!

    But, pornography aside, what is there of real historical value on the net? Sure there are any number of mindless geocities homepages full of drivel about people's pets, but sifting through this would drive anyone mad and there are a lot more "insightful" sources already available about today's culture.

    Unfortunately the web as it stands at the moment shows the worst side of humanity rather than its best side - historians looking through terabytes of things like the anarchists cookbook, virulent anti-Christian diatribes, terrorist manifestos and race hate sites will hardly pick up a balanced view of society will they?!

    It's not a study, it's an archive. The purpose of this project is to collect data, not to analyze it or place any sort of value judgement on it.

    But unless it will be used as the basis for future studies then this project is a waste of time, so I don't think you have a valid point here.



    ---
    Jon E. Erikson
  • Alright, chances are the growth of the Internet (and particularly the web) will eventually be recognized as one of the major sociological developments of the late 20th century. Someday, we'll want to look back at the roots of the revolution and trace its development through the present day. However, that may well be an impossibility: we still have Gutenberg Bibles, but the original Mosaic/Netscape site is already a dim memory.

    The other day, I was browsing through the Hotwired [hotwired.com] archives. Basically, they have everything to come out of the Wired family of publications for the past five or six years, and that's great: it's fun and oddly fascinating to read an article from early '96 and hear about this fantastic new "push" technology. But, being a Wired venture, many of the stories are gaudily hyperlinked, and very, very few (if any) of those external pages are still extant. Entire dimensions of these stories are already lost to the ages.

    There are a lot of obvious difficulties in archiving the web, but it's something that probably should be done. I really think not too far down the line, we'll look back and regret that a lot of what we take for granted today wasn't preserved.

    -jay
  • A copyright lasts the lifetime of the author plus an extra hundred years or so.

    "Sensitive Information?" They clearly can't archive stuff the site doesn't have open to the public -- documents that require authentication or some kind of POST form submission are most-likely out of its reach.

    As far as displaying copyrighted information they'd need to license it from whomever holds the copyright to distribute or display it.

    Archiving it seems to be 'fair use' just as much as printing, saving pages from a site for later perusal are, but the second they want to sell it or display it publicly they've got copyright issues.
  • by Anonymous Coward on Saturday July 08, 2000 @09:02AM (#949095)
    Given our recent exposure, I thought I'd make a few comments since journalists tend to miss important details.

    Our website is at http://www.archive.org.

    We are *NOT* a company. We are a non-profit organization making our archives freely available to researchers, scholars, historians, etc. A for-profit company may not be the right model to ensure the long-term longevity of the collections. We only archive publicly available information on the Internet.

    We currently have about 17TB of Web pages and images on disk. We've also got about 6TB of older stuff on tape that we are migrating to disk. We're growing at about 3-4TB/month. We are not yet getting Usenet or streaming media because of labor limitations. Anyone wanna come work for us?

    We buy storage PCs with twenty 75GB IDE hard drives, two 667MHz CPUs and 512MB RAM. We run Linux, but are migrating to FreeBSD because of the 2GB file size barrier.

    Access currently requires a bit of UNIX skill. There is no browser interface to our collections. You'll need to be able to write your own search software, as the only index we have right now is a URL index. If you want access, you'll need to fill out a form at http://www.archive.org/proposal.html.

    Kurt Bollacker
    Technical Director, Internet Archive

    kurt@archive.org -- www.archive.org
    P.O. BOX 29244, San Francisco, CA 94129
    vox: 415-561-6796 -- fax: 415-561-6768
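As a rough back-of-the-envelope check on the figures in the post above (twenty 75GB drives per PC, about 17TB on disk plus 6TB on tape, growth of 3-4TB/month), the arithmetic can be sketched as follows. The assumptions here are mine, not the archive's: decimal units (1TB = 1000GB), every drive fully usable, and no allowance for redundancy or filesystem overhead.

```python
import math

# Figures quoted in the post.
DRIVES_PER_PC = 20
DRIVE_GB = 75
ON_DISK_TB = 17
ON_TAPE_TB = 6
GROWTH_TB_PER_MONTH = (3, 4)

# Assumption: 1 TB = 1000 GB and all raw capacity is usable.
pc_capacity_tb = DRIVES_PER_PC * DRIVE_GB / 1000


def machines_needed(terabytes: float) -> int:
    """Whole storage PCs required to hold the given amount of data."""
    return math.ceil(terabytes / pc_capacity_tb)


current_machines = machines_needed(ON_DISK_TB + ON_TAPE_TB)
new_machines_per_month = tuple(machines_needed(tb) for tb in GROWTH_TB_PER_MONTH)

print(f"Capacity per PC: {pc_capacity_tb} TB")
print(f"PCs to hold disk + tape data: {current_machines}")
print(f"New PCs per month: {new_machines_per_month[0]} to {new_machines_per_month[1]}")
```

Under those assumptions each PC holds 1.5TB, so the whole collection fits on a rack or two of machines, with a couple more added each month.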

  • So this is just a big daily backup of web sites. Sweet.
    From: Joe User <juser@anywhere.net>
    To: webmaster@internetarchive.com
    Subject: Need file restored

    Hi. I accidentally blew away my resume at http://www.geocities.com/somedir/somepage/resume.html and didn't have a backup.

    Can you email me a copy? Thanks! I'm so glad I don't have to buy my own tape backup!

    :-)
  • You're right; god dammit. I plead two mitigating circumstances. First, I'm British, and we just avoid all that sort of confusion by simply having 1760 yards in a mile. Second, I'm an anthropologist, and was just slavishly quoting the Nature article. However, the confusion seems to be in other places too, for instance in the web site of Yotta Yotta NetStorage at www.yottayotta.com [yottayotta.com] (company motto: "Put a Lotta Yottayotta in Your Life," and there's also a company theme song). Also, somebody who uses the base 10 rather than base 2 definitions of these terms has placed calculations of how many bytes there are in different things here [caltech.edu]... I was just thinking that although 2^10 is quite close to 10^3 (2.4% off, right), 2^80 is quite a way off from 10^24 (about 21% I think). At some point, the two series must get completely out of step.
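The drift between the base-2 and base-10 readings of these prefixes is easy to check. Here is a small sketch that compares 2^(10n) with 10^(3n) for each prefix and then looks for the step at which the binary value is at least double the decimal one. The function names are my own, and extending the series past "yotta" is purely hypothetical, since no further prefixes are named here.

```python
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def divergence(n: int) -> float:
    """Relative excess of 2**(10*n) over 10**(3*n), e.g. 0.024 for kilo."""
    return 2 ** (10 * n) / 10 ** (3 * n) - 1


def first_doubling() -> int:
    """Smallest n at which the binary value is at least twice the decimal one."""
    n = 1
    # Exact integer comparison, so no floating-point rounding is involved.
    while 2 ** (10 * n) < 2 * 10 ** (3 * n):
        n += 1
    return n


for n, name in enumerate(PREFIXES, start=1):
    print(f"{name:>5}: binary is {divergence(n):.1%} larger than decimal")

print(f"Binary is at least double decimal from step n = {first_doubling()}")
```

Each step multiplies the gap by 1.024, so the two series separate slowly but without limit.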
  • Another group, the private company Alexa, has also engaged in internet archiving. A couple of years ago Alexa donated an archive to the Library of Congress [alexa.com]. It was written up in Brill's Content [brillscontent.com] last November (article text not online, alas).

    (Alexa's normal business involves a browser plug-in that is What's Related on steroids [alexa.com].)
    ----
  • Jon Erickson spewed:
    [much repetition deleted] ... any study that attempts to categorise how we live at the moment using the web is doomed to be prejudiced and incomplete. Until everyone is online and has equal access, this is just another arrogant study attempting to categorise who is worth enough to be able to use the net.

    So what you're saying is that the ONLY records worthwhile to historians are those that reflect "everyone"? Just who the hell are you to tell future historians what will and won't be useful to them? Do you understand history at all?

    The importance of any given historical document is often not found in the document itself, but in that document's context.

    My father managed a professional local historical society and museum while I was a kid. The museum was a high Italianate mansion in the upper Midwest built by a lawyer who got rich on land speculation. You are correct that the biggest danger facing future generations is to look at a museum such as a rich lawyer's house and believe This is the way people lived in 1870. In fact, only about 1% of the people lived that way (servants, nice furniture from Europe, that sort of thing). To help counter this one of the first things my father had done was to take custody of a "fortuitously" threatened building, one of the first in town, which was more typical of your average family.

    By themselves, neither building is representative. Even together they fail to represent everyone in this particular community. But by putting them together in context you can illuminate things that would otherwise be much more difficult to see.

    What a snapshot of the web ca. 1996 shows is most definitely a subset of the larger society. But you can't say to me that it doesn't say anything useful. You yourself note that the people, ideas, and connections that it shows are a particular and identifiable subset of the larger society. Realize then: That, in and of itself, is useful to historians. Even more so is the randomness and complexity of the archive, because having a human select what future generations of historians will find useful or interesting is a dubious proposition.

    If you're waiting for the day when "everyone is online" to start recording our digital history, you'll have to hold your breath a very long time.
    ----
  • Would you like to see the PC industry before Microsoft?
  • I don't know if this has been touched on yet, but it seems to me that the archive would make an ideal 'memory' for an AI to grow in to.

    During the AI's maturation cycle, one could simply provide the data raw and let the pattern matching and cross indexing fly.

    The archive is mixed with every aspect of the human mind, the art and music, the filth and evil, the self-righteous religious fury, the meek and mild post-hippy sentiment, the piercing precision of scientific method and the wacky good fun of pseudo-science madness. The full spectrum of humanity is presented here, from autistic to genius, idiotic to brilliant. There are plenty of wholesome sites, child-friendly basic learning resources, museums and FAQs, on and on.

    I propose that the archive be used as the base for intuition in a full scale AI project, with the actual info rarely making its way up the stack to consciousness, instead providing the hints and connections necessary for a model of human memory. Explicit data filtering will be required for the prototype and experimental versions, but a truly 'human' AI would need to understand humanity on some pretty deep levels if we were to expect it to relate to us, and have sympathy/empathy for us regardless of our shortcomings.

    Maybe this process is already happening under the guise of the internet search engines!!! There is an awful lot of pressure and incentive to provide relevant responses... what better way to accomplish this, than to have some entity think about the list of matches and understand human requests like a human?


    :)Fudboy

    I guess I'm just a Fudboy, looking for that real Transmeta...
  • I certainly won't argue your statements about WHO accesses the 'net or whether or not it is in a transitional period. I think you are fairly right on the money. (However, it is possible to see how more and more different types of people are finding their way online.)

    That does not mean that this is a bad time to archive the net. In fact, I would say it is a great time to do it.

    Considering the (rather sci-fi) assumption that the 'net (in whatever form) will become a more integral part of everyone's life eventually, I think it might be interesting to see the steps of that evolution. It would be like looking at those drawings representing the evolutionary stages of the human race. 100 years from now they can look at the 'snapshot' and say, "So that's how it was. Horny buncha capitalists weren't they? Glad the 'net ain't like that any more. New Mail! Damn spammers!" Heh.

    Perhaps it's too bad that these snapshots did not start sooner. It would have been nice to see the real baby pictures. "ARPAWha? Who used this?" ;)


    \//
  • Marketing being what it is, you can bet the first hard drive/tape/holocube storage system that manages to cram 10^12 bytes into itself is going to be advertised as storing a "terabyte", and etc...

    Well, actually, the first that manages 5x10^11th will claim it does - "with compression".
  • Oh sheesh... It was only until I got to your comment that I realized it didn't say that. Thanks for the heads-up.
  • Ever read old ads in old magazines? They can be pretty funny.

    "Buy the new 1958 Edsel! Look at the great styling -- what a wonderful car!" I love all the safety features they put in 1950's cars including padded dash boards and OPTIONAL seat belts.

    Ads can be a great way to show what life was like. Everyday things that we forget, like what housework was like in the 30s, are saved in the ads of the day.
  • A one-armed friend who has a job at Sears in sales has DSL, 5 computers, and can't drive.

    Plenty of people buy several CDs of music each month.

    They can afford DSL.
  • This still won't bring back paranoia.com...
  • I wish I hadn't just used my last moderator point.
    <BR>
    <BR>
    <I>Sure, this'll be a useful reference for future ...</I>

    You're partially correct. Yes, the web is in a "teething" phase. Yup, glitches do tend to override content, in some places. But it's still worth archiving. How will future generations answer the questions: How exactly did the Internet start out? What did it look like?

    With these snapshots, of course.

    <I>The trouble ...</I>

    I feel this is pretty much completely correct.

    <I>So, given this ...</I>

    I don't think it is getting a snapshot of how society is today. But is that really the point? Or, rather, is that really what these archives will be used for in the future? I doubt it. But this archive will provide a great deal of information about the growth of communications techology, and how it will change our lives.

    <I>expressing views that people consider "outdated" or "primitive", even they are held by many others<I>

    If a million people say something that is factually wrong, it's still wrong. Got that?

    <I> Anyway, any study that attempts to categorise how we live at the moment using the web is doomed to be prejudiced and incomplete. Until everyone is online and has equal access, this is just another arrogant study attempting to categorise who is worth enough to be able to use the net.</I>

    Yeah, so what? Does that mean they should stop? It's still their time and money. And this archive WILL provide an amazing picture of the Internet when it was still relatively young(teething, as it were).

    Dave
  • by Anonymous Coward
    For a moment I thought the headline read 'Slashdotting the Entire Internet'

  • A large portion of the Web is dynamically generated. Isn't it very hard to snapshot it automatically without human interaction?

    I doubt that would be included with the archive... Robots just eat the text don't they? Like Slashdot allows robots, but they read the .shtml and just get it as text instead of the dynamic threaded page... Hrmmm


  • some people like to think of their replies,
    either you go for a first post or write
    something lost in trollsville.
    What the heck, I'm not doin' too well,
    the letters are worn off his keyboard
    .UNFAIR UNFAIR

    This is old stuff, there was supposed
    to be time mirrors [ftp] that allowed you
    to look at adds & features from six months
    behind.

    I've been trying to get interest up in a
    swap page for pics of match books, shopping
    bags, beer coasters from around the world.
    Thats half the fun of travel.
    ^ ^ ^
    Just as laser surgery can
    improve your sight
    a MICROWAVE LASER can
    blur it.

    THIS POST HAS BEEN CENSORED,
    l 'GOT THE SCREENSHOTS TO PROVE IT
  • Aww man, every time anything happens that lets /. stick IP, patent, or copyright violation on a story, they sure as heck do. I could go through the list of /. stories and like every 5th one would probably have one of those in it. It's, in a word, getting old. Yes I know we need to be aware of the issues but come on folks :(

  • But, pornography aside, what is there of real historical value on the net? Sure there are any number of mindless geocities homepages full of drivel about people's pets, but sifitng through this would drive anyone mad and there are a lot more "insightful" sources already available about today's culture.

    Unfortunately, our knowledge of history already has numerous gaps where things were, at the time, thought not to have any value. Leave the interpretation of value to future historians, and meanwhile, let's not make decisions about what isn't important enough to save.

    Unfortunately the web as it stands at the moment shows the worst side of humanity rather than its best side - historians looking through terabytes of things like the anarchists cookbook, virulent anti-Christian diabtribes, terrorist manifestos and race hate sites will hardly pick up a balanced view of society will they?!

    I am amazed if you think racism and hatred content is the overwhelming majority of the Internet. But that is completely irrelevant; even if this is true, embarrassment about it is no good reason to censor historical records.

    But unless it will be used as the basis for future studies then this project is a waste of time, so I don't think you have a valid point here.

    Why do you think it won't? Why do you presume this information is, and always will be, useless trash? It is only by archiving without bias or censorship that there can ever be an accurate historical record. Archive it all and let the historians sort it out.

  • Until everyone is online and has equal access, this is just another arrogant study attempting to categorise who is worth enough to be able to use the net.

    What is the proposed solution to this "problem"? I can see it now, the constitutional amendment that gives everyone "an equal right to information access". Behold the new world of computer welfare. Wouldn't that just be great? The government would give people computers and internet access.

    The downside to this is apparent. Do you think the taxpayer is going to want their money going to download porn or information that doesn't serve society's interests? The government will of course limit the access that people on computer welfare can have, and then in effect, the government will have complete control over this medium. The price of "equality" in this sense is going to be all our freedom. Is it worth it?
    -----------------------------
  • That's a lot of porno... Lots of child porn too, I assume.
  • I've written an essay, "In Search of Webs Past," [techreview.com] on the topic of archiving the Web. It's in the current issue of Technology Review. [techreview.com] One of the things I point out is that the Internet Archive (because of some of the legal problems mentioned in the original post, and additional issues) is not going to be the sole solution for archiving the Web. I do see such archiving as very necessary from several standpoints besides the "slice of life" historical one - media history and criticism being one of them. Those with an interest in these aspects of the topic might wish to take a look at my essay, and continue the discussion either here or at the Tech Review forum [techreview.com] about the essay.

    -Nick M.
  • You say, Well, because currently the web is still in the transitional period between the days of ARPAnet and purely academic use and acceptance as a medium through which the general public can communicate. And as such, it's still in a state where teething problems overwhelm content.

    Well, that might be all well and true. But does that change the fact that millions upon millions of people are using the internet so heavily? That's the basic problem I see with your whole argument. You argue that it's still not ALL that common, eg: due to money constraints, and you say that most web users are in above average tax brackets. You argue that the study will create divisions in our society, the timeless clash between the "have"s and "have-nots". But that's not the point at all!! You are missing the entire point!! MANY people are using the internet in our society RIGHT NOW. The fact that it has become such an important tool and resource for so many people makes it worthy of documenting historically. Why are celebrities documented so much in the media? What percent of the population do they represent? Like 0.01% or something? Do we exclude WWII because it only happened to white Europeans? Surely Europeans don't make up the majority of the world's population; does that make it not worth recording?

    Of course not. Your argument is inherently flawed, and misses the point completely. The fact of the matter is that the Web is an extremely important factor in a LOT of people's lives right now. No technology has ever had such an overwhelming impact on a society before, EVER. As such, it certainly does deserve to be documented. I hope you see where you went wrong.
  • Thank you Quecojones.

    The net is how a lot of people live. It's an extraordinarily useful thing that millions use. The fact that not everyone uses it makes it more interesting, because it hasn't yet become something we all take for granted, like electric power or running water - or television or radio, for that matter. Once that happens, it will be very interesting to look back on how it got that way. Future historians will thank the people doing this, even if Erikson doesn't.

    sulli

  • Now the question "Is this the latest version of the internet?" will actually be valid.
  • (Old joke: what's the difference between a novice geek and a real geek? A novice geek thinks there's 1000 bytes in a kilobyte, while a real geek thinks there's 1024 meters in a kilometer. Haha, LAUGH god damnit)
    LOL!!!

    You mean there's only 1,000m? I already know I'm gonna use this one a lot.
  • So given this, how does taking a snapshot of the web give a view of how society is at the moment?

    I don't know about you, but I've seen lots of First World War propaganda that had absolutely nothing to do with what was really going on, but it gave me an excellent impression of what society was like about then.

    The key thing is that I know it was propaganda, just as the key thing with this is that historians (and others) will know that today's web doesn't represent everyone.


    ===
  • AFAIK the net has millions if not a billion sites...
    Y'know, 76.459 percent of all statistics are made up on the spot. [E-mail me directly for my source for that statistic. Get it?] A billion sites would be one for every six humans on the planet. Think about that for a second. Then ponder how outrageously lucky and fortunate you are that all the right accidents of fate fell in your favor-- so that you can post wildly overblown guesses as to the ubiquity of Internet across the human race.

    As a species, we have a long way to go. The first step on that journey is the acknowledgement that many of us that read this are *incredibly blessed* [or lucky, or whatever floats your theist or atheist boat] and that we have an obligation to help those who aren't as fortunate.

    But I troll^h^h^h^h^h digress.... (Can't SlashDot allow <strikeout>?)
  • by mikpos ( 2397 ) on Saturday July 08, 2000 @09:19AM (#949124) Homepage
    So you're saying that because the web isn't perfect and it doesn't reflect the general society, it won't be useful to historians? If you ask me, this would make it more interesting, not less. This transition will have an extremely short lifespan (probably under 20 years in length), so the more data the better (for the historians).

    And, FYI, just because the Royal Family doesn't reflect English society, it does not mean that historians don't find them interesting.
  • by Valdrax ( 32670 ) on Saturday July 08, 2000 @09:44AM (#949125)
    Well I suppose that the sheer amount of perversion and degredation available on the net at this point in history will provide a lot of interest to future historians, so in that context sure it'll be "historically interesting"!

    You just don't get it, do you? Should historians gloss over the Holocaust, the Reconstruction, and the Dark Ages simply because they were "icky?" Sometimes the darker elements of society are the most worth examining in a historical context. The whole point of the saying that those who don't study history are doomed to repeat it isn't that you should study only the good points and avoid the bad ones.

    But, pornography aside, what is there of real historical value on the net? Sure there are any number of mindless geocities homepages full of drivel about people's pets, but sifitng through this would drive anyone mad and there are a lot more "insightful" sources already available about today's culture.

    Do you think it's not just as frustrating to shuffle through archives of old 19th century newspapers to find ads and articles about the medicine of the day? The point that the man speaking for the Internet Archive was making is that this is not a study of only the famous. With these archives at hand, you can study the transition from the early days of research papers to the rise of pornography and personal websites to the current days of e-commerce to whatever major social trend the web next holds. An archive of the web shows how society has adapted to the format. You can see what issues were hot enough to spur crops of websites only to fade away in the span of a year or two.

    Face the music: the majority of humanity isn't putting out "insightful" commentary. Ignoring the common man is a mistake that many historians simply can't avoid, because there's nothing available about him. All the "mindless" Geocities sites give an insight into the kind of people that use them.

    Unfortunately the web as it stands at the moment shows the worst side of humanity rather than its best side - historians looking through terabytes of things like the anarchists cookbook, virulent anti-Christian diabtribes, terrorist manifestos and race hate sites will hardly pick up a balanced view of society will they?!

    Sounds like you're the one with the hardly balanced view of society if you honestly think that is what the majority of the web is. The fact is that the majority of the web currently is commercial sites and those "mindless" Geocities sites you like to talk down about. Though there are some bad elements on the web, it's also worth historical note that the web led to the coming out of many of these fringe groups. The very anarchy and rebellion of the web is of major historical interest, and the web is becoming one of the more important socio-economic influences of the turn of the century, at least in America.

    But unless it will be used as the basis for future studies then this project is a waste of time, so I don't think you have a valid point here.

    Ah, but it will be. Say in 30 years you want to do some research on the Y2K hysteria of the turn of the century. While there will be plenty of books to read through, a major factor in spreading the word about Y2K was the Web. However, these web sites are already mostly gone from the Web today. Fortunately, the Internet Archive may have already preserved them for future study.

    Would you like to study the rise of Linux or of the web itself? Many of the early web pages about the topics could provide priceless research. Hell, even if you really object to the large amount of pornography, the booming porn industry on the web was a major driving factor in advances in e-commerce. It would also be valuable in studying the "warez" counter-culture of today.

    Plus, like it or not, it's not for you to say. This is being done by a privately funded group. If you really feel so strongly that the web is worthless and should absolutely not be archived for historical purposes, then go torch the place. While you're at it, go ahead and start burning those libraries that hold material about history you object to. Otherwise, your choices are "shut up" and "like it."
  • I collect software and giveaways from ISP's, especially Win 3.1 software, tshirts, ads, old magazine articles, etc.

    My fascination is with the number of ISPs that emerged in the past five years and how many of them disappeared, especially the high-profile, nationally advertised companies - gnn, pipeline, etc.

    I would like to see some of the archived sites from 1996 for different ISPs, with their service offerings and pricing.

    Call me weird, but I find it fascinating. When do we get access?

  • by kurtbollacker ( 208996 ) on Saturday July 08, 2000 @09:51AM (#949129)
    We buy PC's with 20 75GB IDE hard drives, paying about $11/GB for storage. Pretty cheap these days. We've calculated that the growth of the Web and the growth of disk drives tend to track pretty closely, so the cost of keeping up with the Internet will mean a relatively constant spending rate.

    Kurt Bollacker,
    Technical Director, Internet Archive
    kurt@archive.org
    www.archive.org
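A back-of-the-envelope check of the storage figures quoted in this thread (twenty 75GB drives per PC, about $11/GB, roughly 17TB on disk, growth of 3-4TB/month). This is only a sketch under stated assumptions: decimal units (1TB = 1000GB), no redundancy or filesystem overhead, and the $11/GB figure taken to cover the whole box. The function names are hypothetical and do not describe the Archive's actual tooling.

```python
import math

DRIVES_PER_BOX = 20   # "20 75GB IDE hard drives"
GB_PER_DRIVE = 75
COST_PER_GB = 11      # USD, "about $11/GB" (assumed to cover the whole PC)


def box_capacity_gb() -> int:
    """Raw capacity of one storage PC, in GB."""
    return DRIVES_PER_BOX * GB_PER_DRIVE


def box_cost_usd() -> int:
    """Approximate cost of one storage PC at the quoted price per GB."""
    return box_capacity_gb() * COST_PER_GB


def boxes_needed(terabytes: float) -> int:
    """Whole boxes needed to hold the given amount of data.

    Assumes 1 TB = 1000 GB and ignores redundancy and overhead.
    """
    return math.ceil(terabytes * 1000 / box_capacity_gb())


print("capacity per box (GB):", box_capacity_gb())
print("cost per box (USD):", box_cost_usd())
print("boxes for 17 TB:", boxes_needed(17))
print("new boxes per month at 4 TB/month:", boxes_needed(4))
```

On these assumptions the monthly growth works out to two or three new boxes, which fits the "relatively constant spending rate" described above as long as disk prices keep falling in step with the growth of the Web.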
  • wasn't Xerox PARC already doing this? well, capturing all the text, storing terabytes at a time? I can't seem to find the link currently, unfortunately.
  • Archive.org is the nonprofit arm of Alexa, both having been started by Brewster Kahle. Archive.org still gets its data from Alexa's web crawler, but I'm not sure how much connection they retain since Amazon.com bought Alexa.
  • Where are they going to get the capacity from? AFAIK the net has millions if not a billion sites and to store such a large number of sites would take a horribly large number of RAID arrays. Also the net is dynamic and always changing. By the time they finish taking a snapshot of the net half the sites would have changed, and btw what frequency is the best? This seems like an Augean Stables kind of task.
  • I didn't read the CNN story closely enough. Apparently it's the same group, after all, but they are now operating under a different name for the nonprofit archive activity (as opposed to the commercial search-engine activity).
    ----
  • When do we get to see the archive?

    I heard there was a version distributed over millions of servers worldwide. :P

  • How deep does this archiving go? Are they going to store every single page and image of every single website?

    The goal is to get all publicly available Web pages and their images. Technical and labor limitations prevent this from happening as yet, but we are working on it.

    How much storage space is required for the whole web?

    No one knows for sure, but the best estimates I've seen put it at about 20-30TB.

    What software/OSes are they using for this project?

    We've got our own software running on Linux, although we are migrating to FreeBSD because of the 2GB file size limit. As a shameless plug, we are hiring!!

    When do we get to see the archive?

    We intend to provide access to researchers, scholars, historians, etc. Research proposals can be submitted here [archive.org].

    Kurt Bollacker
    Technical Director, Internet Archive
    kurt@archive.org

  • by Valdrax ( 32670 ) on Saturday July 08, 2000 @10:05AM (#949152)
    Well, what else is it going to be used for? Your suggestion that it be used as a reference on "the growth of communications techology" is rediculous - the growth of hate material and pornography on the web has no correlation with the growth of communications technology at all. This project is not getting a snapshot of web technology, it is getting a snapshot of web content, something entirely different.

    Much of the content of the web relates to the growth of communications technology. You are limiting yourself severely if you are only thinking of the raw bandwidth connections. The growth in use of non-textual content, multimedia, and scripting languages and applets are all advances in communications technology. Not to mention, the radical growth of the internet, in terms of number of sites and content on sites, helps document that. Plus, your view of what the content of the web actually is is rather stilted. Even if it wasn't, the vast amount of negative material is worth studying in its own right.

    Please demonstrate how believing in God and decent Christian morality is "factually wrong". I doubt you can.

    <offtopicrant>
    Please show how your condescending and arrogant attitude reflects a life led by Christ. It's jerks like you who give the rest of us a bad name. Did he anywhere indicate that he was talking about Christianity? Anywhere? Or was he just responding to your blanket assertion that you were persecuted for your views, which in no place specified Christianity?

    Instead, you assumed he did, or were attempting to sidetrack the issue to make yourself look like the oppressed religious minority. This kind of behavior is what disgusts others. When people look at me, they see someone like you -- an arrogant, bigoted ass who sees the entire world as filled with evil sinners out to get them. It makes me sick.
    </offtopicrant>

    My point is, if you read my post, that this is not a good thing given the exlusive access to the net by a certain portion of society. Would you consider how a society lived through records of its nobility?

    Um.. Let me think. YES!! That's how historians have had to do it for ages. Should we ignore early American politics because it too was primarily run by white, middle-aged landowners? You must attempt to look at all elements of a society to see how the framework fits together. If the web is a rich WASP playground, like you assert, then it's worth studying why exactly this is. Just because one particular class was behind a major force of societal change doesn't make it not worth studying. This kind of PC "1984" style of thought would have us ignore all of our history for the goal of deluding ourselves into thinking we're perfect. Well, we're not. Get over it, and start trying to figure out why. Just because America was led by white landowners early on doesn't mean we should've ignored history to know where we are now. Similarly, we should not ignore the current Internet, so that future generations will know how they got to where they will be.

    P.S. Sort your HTML tags out.

    Finally, the prima facie evidence of a troll: someone who picks at the formatting rather than the content of the person they disagree with. The guy obviously accidentally selected the "Extrans" option, which I too have accidentally done in the past, thinking that it would do the opposite. Besides, you should really preview and check your spelling before being so harsh. Talk about the pot calling the kettle black.
  • I wish I hadn't just used my last moderator point.

    Why? Because you, like so many /. moderators, moderate anything you disagree with down? IMHO that is the biggest simgle problem with /. at the moment, with troll metamoderation the second.

    I don't think it is getting a snapshot of how society is today. But is that really the point? Or, rather, is that really what these archives will be used for in the future?

    Well, what else is it going to be used for? Your suggestion that it be used as a reference on "the growth of communications techology" is rediculous - the growth of hate material and pornography on the web has no correlation with the growth of communications technology at all. This project is not getting a snapshot of web technology, it is getting a snapshot of web content, something entirely different.

    If a million people say something that is factually wrong, it's still wrong. Got that?

    Please demonstrate how believing in God and decent Christian morality is "factually wrong". I doubt you can.

    Yeah, so what? Does that mean they should stop? It's still their time and money. And this archive WILL provide an amazing picture of the Internet when it was still relatively young(teething, as it were).

    My point is, if you read my post, that this is not a good thing given the exlusive access to the net by a certain portion of society. Would you consider how a society lived through records of its nobility?

    P.S. Sort your HTML tags out.



    ---
    Jon E. Erikson
  • Remember 1994? Monica Lewinski was just another intern. peecees were still 16 Bit.

    Huh?! Maybe you're thinking of 1984. By 1994, everyone and their dog had already been 32 bit for a few generations.


    ---
  • "Archiving the net is like washing toilet paper!"
  • Well, of course it's not perfect. In the news story itself, they mention that sites can block the archive entirely by blocking 'bots. Also, I doubt every time it hits MP3.com it downloads every single song. However, the attempt is being made, and from the sound of the size of their archives, they're doing a darned good job of it. Also, these snapshots obviously can't be done in just a week or less. I'm guessing each snapshot takes months. Bandwidth plus the time spent writing to tape archive limits how fast they can do it. They probably miss all the little dynamic IP sites too. Oh well, at least an attempt is being made. As I pointed out in another post, hopefully they've managed to capture a bit of the Y2K hysteria before all the sites were pulled down in embarrassment.
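For readers wondering what "blocking 'bots" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the helper `allowed` are invented for illustration. The user-agent string `ia_archiver` is an assumption about what the archive's crawler calls itself, not something stated in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler entirely,
# and keep every other robot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given user agent may fetch the URL under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The named crawler is blocked everywhere; other robots only under /private/.
print(allowed("ia_archiver", "http://example.com/index.html"))
print(allowed("otherbot", "http://example.com/index.html"))
print(allowed("otherbot", "http://example.com/private/notes.html"))
```

Note that this is purely advisory: it only keeps out robots that choose to read and honor the file.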
  • by Jon Erikson ( 198204 ) on Saturday July 08, 2000 @08:30AM (#949164)

    Sure, this'll be a useful reference for future generations, won't it? I'm sorry, but as much of a fan of the web as I am, I really wouldn't consider it to be something worthy of archival in the state that it is at the moment. Why? Well, because currently the web is still in the transitional period between the days of ARPAnet and purely academic use and acceptance as a medium through which the general public can communicate. And as such, it's still in a state where teething problems overwhelm content.

    The trouble with the web is that although it is supposedly accessable to anyone with a phone line and a PC, the harsh reality is that cost and communications infrastructure have meant that only those of a certain socioeconomic group are currently able to use the web, and this group is mainly comprised of the priviliged, a group which most /.ers fall into by dint of their jobs or backgrounds. Research carried out my both my consultancy group and others all indicate that the majority of people able to use the web are white, middle-class and certainly in higher than average tax brackets.

    So given this, how does taking a snapshot of the web give a view of how society is at the moment? It doesn't, any more than looking at the Royal family of England gives a picture of what England is like (despite what some Americans seem to think). The views that are expressed on the web are those of a privileged class who do not have to suffer the effects of current liberal free-market policies and the increasing divide between the rich and the poor.

    No, this exercise will be a "Who's Who" of society, showing only those who are rich enough to be able to afford net access. The majority of people, unable to benefit from the web, will be left by this study as an underclass, something which I view as incredibly wrong and an example of the undeniable arrogance that most people on the net display towards those that are perceived as their inferiors. Indeed, I have suffered the same myself here on this forum for expressing views that people consider "outdated" or "primitive", even though they are held by many others.

    Anyway, any study that attempts to categorise how we live at the moment using the web is doomed to be prejudiced and incomplete. Until everyone is online and has equal access, this is just another arrogant study attempting to categorise who is worth enough to be able to use the net.



    ---
    Jon E. Erikson
  • Collecting snapshots of the internet reminds me of the Wim Wenders movie Lisbon Story. It's about a director who decides to videotape everything he can, leaving operating cameras in parks, walking around with a camera all the time, etc. He would afterwards destroy the tapes (or lock them somewhere, I don't quite remember) to keep people from seeing those particular images. Some idealistic freedom for images, not to be ever seen.

    I don't think a huge unfiltered database with everybody's day-to-day life has a lot of historical meaning. And I think that's what a snapshot of internet is like.

    Who knows, maybe these guys are just planning to burn everything, afterwards...

    -- Aber
  • According to this article [nature.com] in Nature [nature.com], 1000 terabytes = 1 petabyte, and 1000 petabytes = 1 exabyte. The article notes that as ever larger and more complex scientific experiments produce ever larger quantities of data, there was briefly a possibility that we would run out of words to describe the amount of data produced. Consider that while the Library of Congress contains less than 12 terabytes, the Large Hadron Collider at the European Laboratory for Particle Physics (CERN) is expected to produce 100 petabytes (i.e., 100,000 terabytes) of stored data in 15 years... Anyway, moving on up, after exabyte, they (whoever they are) started naming things from the back of the alphabet. Thus 1000 exabytes = 1 zettabyte, and 1000 zettabytes = 1 yottabyte. Although the article does not say, perhaps the term for 1000 yottabytes will skip over 'x,' as we already have exabyte, and go to 'wottabyte'? I like the sound of that :) ...
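For anyone who wants to check the unit ladder above, here is a minimal Python sketch. It assumes the decimal steps quoted from the article (each prefix is 1000 times the previous one); the names `PREFIXES` and `to_terabytes` are made up for this illustration.

```python
# Decimal prefix ladder as quoted above: each step is a factor of 1000.
# PREFIXES and to_terabytes are hypothetical names for this sketch.
PREFIXES = ["tera", "peta", "exa", "zetta", "yotta"]


def to_terabytes(amount, prefix):
    """Convert `amount` of <prefix>bytes into terabytes."""
    steps_above_tera = PREFIXES.index(prefix)
    return amount * 1000 ** steps_above_tera


# The LHC figure from the comment: 100 petabytes expressed in terabytes.
lhc_terabytes = to_terabytes(100, "peta")
print(lhc_terabytes)
```

Under these assumptions, the 100 petabytes mentioned for the LHC comes out to 100,000 terabytes, matching the figure in the comment.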
  • by TheFrood ( 163934 ) on Saturday July 08, 2000 @08:31AM (#949180) Homepage Journal
    This is really a cool idea. Up 'til now, we've taken it for granted that our media would last long enough for historians to make use of it in the future. With the web, you can't assume that's the case, so it's good that someone's taking it upon themselves to archive the web.

    But I want to know more:

    How deep does this archiving go? Are they going to store every single page and image of every single website?

    How much storage space is required for the whole web? Wild guess: A recent /. story put the number of Apache-served websites at about 10 million. Since Apache has roughly two-thirds of the market, that makes the total number of web sites about 15 million. If sites average, say, 10 MB in size (wild guess), then it looks like 150 terabytes would be enough to store the entire web.

    What software/OSes are they using for this project?

    When do we get to see the archive?
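The storage guess above can be written out as a short Python sketch. Every input is the comment's own wild guess (10 million Apache sites, a two-thirds market share, 10 megabytes per site, decimal units); the variable and function names are hypothetical and only there for illustration.

```python
from fractions import Fraction

# All inputs are wild guesses from the comment above, not measurements.
APACHE_SITES = 10_000_000          # Apache-served sites per the /. story
APACHE_SHARE = Fraction(2, 3)      # Apache's rough share of all web servers
AVG_SITE_BYTES = 10 * 10**6        # assumed average site size: 10 MB (decimal)
BYTES_PER_TERABYTE = 10**12


def estimate_total_sites(known_sites, market_share):
    """Scale a known site count up by the market share it represents."""
    return known_sites / market_share


def estimate_storage_tb(total_sites, avg_site_bytes):
    """Total storage in terabytes if every site is stored once."""
    return total_sites * avg_site_bytes / BYTES_PER_TERABYTE


total_sites = estimate_total_sites(APACHE_SITES, APACHE_SHARE)
storage_tb = estimate_storage_tb(total_sites, AVG_SITE_BYTES)
print(f"sites: {int(total_sites):,}  storage: {int(storage_tb)} TB")
```

Changing `AVG_SITE_BYTES` shows how sensitive the answer is to the per-site guess: the total scales linearly with it, so a 100 MB average would mean 1,500 terabytes instead of 150.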
  • by Chairboy ( 88841 ) on Saturday July 08, 2000 @08:32AM (#949183) Homepage
    You know, Google is doing something similar. They have copies of websites cached from the last time they crawled them.

    On topic, though, doesn't this threaten to change accountability for people who post commitments on their sites that they cannot meet? In the past, these people could change their website at will, and since there wasn't a physical copy, there was no evidence of the previous commitment.

    I anticipate this also being used in court. Think about it: if someone sues for libel, the evidence could be available in the 'snapshot' archive. This converts their project into a legal document, and means that the company doing this net archiving could be in danger of contempt hearings if they don't take extraordinary measures to ensure the integrity of the data.

    I don't know, this sounds like an awfully big responsibility, and I hope this company has a good bunch of lawyers.
