The Internet

Linkguard To Cure Broken Links? 74

sean dreilinger writes: "Here's a BBC writeup of the company Linkguard, which hopes to monitor hyperlink stability via their 40-terabyte database and notify web authors when links are broken." This is a different effort than this one. Still, 40 terabytes?
This discussion has been archived. No new comments can be posted.

Linkguard To Cure Broken Links?

Comments Filter:
  • by Anonymous Coward
    I think it is clear now that everything posted by Signal 11 should be moderated down as flamebait. His posts always draw flames, therefore it is flamebait.
  • just add an ErrorDocument 404 notify.php3 to your apache config, and then e-mail the administrator of the link (and the referer ... ;)

    You might get quite some amount of e-mail .. but .. that "urges" you to fix the problem ;)
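    Concretely, the Apache side of that suggestion is one directive; the notify.php3 script that reads the Referer and sends the mail is assumed here, not shown:

```apache
# Hand every 404 to a script that can log the Referer header and
# (carefully rate-limited!) e-mail the administrator of the linking page.
ErrorDocument 404 /notify.php3
```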

    Samba Information HQ
  • There are already companies out there that are doing this for free, and when they find a broken link they send an email to let you know.

    This is nothing new really...

    Nathaniel P. Wilkerson
    NPS Internet Solutions, LLC []
  • by British ( 51765 )
    I'm betting most of the broken links that need to be fixed are Geocities pages. I've never visited a Geocities page without at least one 404 on it.
  • There's a company, Seven Twentyfour [], which provides this same service, though not for free. I've been using them, on a lower-cost one-report-per-week service, for several months, and they do a nice job. They offer a 30-day trial with no obligation to buy, so if you have a recently moved web site, you could do the trial on the old website to find all the old links that got broken by the move.

    If all you're worried about is moved pages within your server, looking at the error log from the web server is pretty good.

  • To make things more interesting, Netscape 4.x has (IMHO) a broken implementation of the Referer field: it reports whatever URL the user happened to be on, not necessarily the URL they linked from.

    If you click the above link to my home page, my server is supposed to see that you came from this article on Slashdot. That's all well and good.

    If you type a random URL - say, - into your address bar now, should Yahoo see that you came from Slashdot? I'd call that an invasion of privacy. Netscape sends that information.


  • How about a brand new protocol for locating documents on httpds? If a page has a link to then the client checks loc:// for the actual location of baz.html! The client tells the location server:
    WHERE /baz.html

    If baz.html is in the same place ( the location server will return:
    100 /baz.html

    If it's been moved to elsewhere on the server (say, /baz/1.html) the location server will say,
    101 /baz/1.html

    If it's moved to another server (this time then the location server returns:

    And if it's been removed for good, then:
    103 KILLED

    If the file never existed, then the location server will say:

    If the server encounters a problem then it should return:
    Or something else, if it knows what it did wrong.

    Maybe, also, there should be another client command, SEARCH, to find any/all occurrences of a file name, like:
    SEARCH bar.html

    To this the server might reply:
    401 /bar.html
    401 /foo/bar.html
    401 /baz/bar.html

    And a directory list, too... Client says:
    LIST /

    And the server says:
    406 /index.html
    406 /style.css
    406 /cgi-bin/
    406 /products.html
    406 /downloads/
    406 /people.html
    406 /bar.html
    406 /baz.html
    406 /links.html
    406 /menu.html
    407 LIST ENDS

    I have begun to write this all up. Anyone who wants to help, visit my web site, find my e-mail address, and tell me you wish to help with this protocol.
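    Since the reply grammar above is simple, a client-side parser is only a few lines. A sketch (the loc:// protocol is purely hypothetical, and only the reply codes shown above are used):

```python
def parse_reply(line):
    """Interpret one reply line from the (imaginary) location server."""
    code, _, rest = line.strip().partition(" ")
    meanings = {
        "100": "same-location",     # 100 /baz.html : still where it was
        "101": "moved-on-server",   # 101 /baz/1.html : relocated locally
        "103": "removed-for-good",  # 103 KILLED
    }
    return meanings.get(code, "unknown"), rest or None

print(parse_reply("101 /baz/1.html"))
print(parse_reply("103 KILLED"))
```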

  • by Parsec ( 1702 )

    I just have cron set up to do a daily 404 report for the last seven days with Analog, and read it occasionally.


  • 5 years ago, I'd've been like, "Wow!", but now, in a time in which Seagate is selling a 70 gig SCSI drive and where people I know have a terabyte of storage _at home_, 40 of them in a company seems like nothing.

    - A.P.

    "One World, one Web, one Program" - Microsoft promotional ad

  • I'm not really familiar with the workings of the web, but why can't HTML code be devised that looks ahead to where the links lead after the rest of the page has finished loading, and returns a status code on how the page is working? If it's 404, then the browser can change the link text to a predefined color so the user knows not to click on it. Any ideas?
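    A sketch of what such a look-ahead could do, with the network check stubbed out so only the flow is shown (the color scheme and URLs are made up):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def color_links(html, status_of):
    """Return {url: color}; links answering 404 get the 'dead' color.
    status_of stands in for a real background HEAD request."""
    collector = LinkCollector()
    collector.feed(html)
    return {url: ("gray" if status_of(url) == 404 else "blue")
            for url in collector.links}

page = '<p><a href="http://a.example/">ok</a> <a href="http://b.example/">gone</a></p>'
colors = color_links(page, lambda u: 404 if "b.example" in u else 200)
print(colors)
```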
  • You don't seem to understand what this company does at all. But that's OK, you got 14th post. Good job!
  • How about a brand new protocol for locating documents on httpds? If a page has a link to then the client checks loc:// for the actual location of baz.html!
    Uniform Resource Locators are actually quite flexible. URLs exist in an abstract namespace which does not necessarily have to map exactly to filesystem space. mod_rewrite [], a powerful URI-to-filename mapping system using regular expressions, can be used instead of your protocol described above. Want to rewrite URLs in the form /Language/~Realname/.../File to /u/Username/.../File.Language?

    RewriteLog /anywhere/rewrite.log
    RewriteMap real-to-user txt:/anywhere/map.real-to-host
    RewriteRule ^/([^/]+)/~([^/]+)/(.*)$ /u/${real-to-user:$2|nobody}/$3.$1

    Also, the HTTP Location or Refresh header can be used to easily redirect an existing location to another. There is no need for a document location protocol.

    And if it's been removed for good, then:

    103 KILLED
    I somewhat disagree with this. After all, a URL could be bookmarked, linked, or be referred to in another way. Once a URI is created, it should exist forever. Freenet [] is an interesting distributed Internet-like network where documents can be uploaded and, since the files do not reside on a central server, exist as long as there is a demand for the file.

    Searching should, in my opinion, be higher level and not in the protocol. CGI can easily be used instead.

    And a directory list, too... Client says: LIST /

    This really shouldn't be necessary. Links should be able to get to all the public documents on the web server. HTTP is not FTP.

    If you feel a new feature should be added to HTTP, suggest it on the ietf-http-wg [] working group mailing list and it might be accepted in HTTP 1.2.

  • Someone please give him an extra point for the 404 message; it's a classic. I'm saving that just for future reference.
  • You could make all your external links point to a Perl script that either spewed the right page at the browser, or made a log entry if the link was down. All your links would look like this: <a href=" ://">click here</a> Of course the downside is that the beauty of simple hyperlinks is lost.

    The regular .sig season will resume in the fall. Here are some re-runs:
  • Someone could create a plugin to do this for you -- when you start Netscape or IE, it checks each of your Favorites/Bookmarks. It does this in the background, so it doesn't interrupt what you want to do. If it finds a broken link, it moves it into a Broken Links folder and highlights the folder red so the next time you pull down the Bookmarks/Favorites menu, you'll see that something broke.

    This is the least intrusive behavior -- it doesn't need to popup to let you know, because it won't be of concern to you until you actually want to go somewhere else. And by moving them into another folder, it gives you a chance to review them and find out where the right link is, if it's something you still care about. I'd pay money to see something like this developed.
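    A sketch of the sorting logic such a plugin might use (names invented; a stub stands in for the real network check):

```python
def sort_bookmarks(bookmarks, is_alive):
    """Move dead links into a 'Broken Links' folder, in the spirit of the
    suggestion above: nothing is deleted, just set aside for review."""
    kept, broken = {}, {}
    for title, url in bookmarks.items():
        (kept if is_alive(url) else broken)[title] = url
    if broken:
        kept["Broken Links"] = broken   # the highlighted folder
    return kept

# Stub checker standing in for a real background HEAD request:
alive = {"http://slashdot.org/": True, "http://gone.example/": False}
result = sort_bookmarks(
    {"Slashdot": "http://slashdot.org/", "Old page": "http://gone.example/"},
    lambda u: alive[u],
)
print(result)
```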

  • With a simple cronjob and Perl's wonderful LWP module package, not to mention the other implementations of tracking web pages, any relatively smart administrator should already be doing this.

    I once made such a short Perl script to check the links on my own web pages: []

  • roughly 41,943,040 megabytes

    i believe that would be 40 GB not TB

    $mrp=~s/mrp/elite god/g;
  • by Anonymous Coward
    It's funny that you're shouting "fag" at Sig11, esp. considering how I've got my cock buried up to the hilt in your tight brown-rose as I type this, AC.

    Well, it's a funny world, I guess.

  • Another use for a database like this is to warn webmasters. You would go to the company to tell everyone that links to you that your site is moving before it actually does. You type in the URL you want to tell people about. It does a cross reference and returns all the other sites that link to it. Now, either it would return you the addresses or it could send them an e-mail by itself. This way people won't get the "We've moved" page. This would be good because even if you did move you may have the "We've moved" page linking (and/or redirecting) the person to the new site. The page exists, but the link is outdated.

  • Will linkguard be monitoring these various searchengines/portals/whatever? Seems to me many of the 'broken links' I find are actually links from places like Yahoo, Lycos, Hotbot, etc. Perhaps if these companies did periodic link checking, many 404s would be eliminated from the web.
  • I think that the most efficient place to do something like this would be at the search engine level. When you are at a company's page, they should have their own software to do that, not some central 40 TB database; that's just a waste. But each search engine company could put something along these lines into place (i.e. check a link when someone uses it: is it good? then good. Is the link bad? then put it in a test-again-later queue).

    at least, that's what I think.
  • I don't see how Google can store web pages locally. Most pages are copyrighted so it wouldn't work out legally.
  • http://www. soft&hl=en []

    Need I say more?

    --Remove SPAM from my address to mail me
  • Sorry, File Not Found
    404's vanish for good
    File still not found
  • First of all, it would be a lot easier to log all outgoing 404s from your server and notify the system admin when they occur. It sounds to me like these guys were just looking for the most expensive way to do this and, consequently, get the most VC money. Hell, if I told a VC that I was going to write a shell script to record outgoing 404s, I wouldn't even be able to buy a new Porsche!

    For those of you who want to try, here's a starting point:

    grep "File does not exist:" error_log > 404s_log
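    Extending that starting point slightly (a sketch; the log format is assumed and sample lines stand in for a real error_log):

```shell
# Demo input -- in real use this is your Apache error_log:
cat > error_log <<'EOF'
[Sun Jun 18 10:00:00 2000] [error] [client 1.2.3.4] File does not exist: /www/old.html
[Sun Jun 18 10:05:00 2000] [error] [client 1.2.3.4] File does not exist: /www/old.html
[Sun Jun 18 10:07:00 2000] [error] [client 5.6.7.8] script not found: /cgi-bin/x
EOF
# Count the missing paths, worst offenders first:
grep "File does not exist:" error_log \
  | awk '{print $NF}' | sort | uniq -c | sort -rn > 404s_log
# mail -s "404 report" webmaster@example.com < 404s_log  # uncomment to notify
cat 404s_log
```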


  • by babbage ( 61057 ) <> on Sunday June 18, 2000 @03:25PM (#994252) Homepage Journal
    This idea looks okay and all, but is it necessary? It seems to me that the simplest solution would just be to have well-behaved site maintenance -- mainly by making liberal use of things like server redirects & aliases, which should take care of 85% of the problem or something. If I rename a document on my site, I add a redirect so that the old name still works; if I delete a document, I consider redirecting to a Google search for similar documents -- at least it doesn't leave the user completely lost.

    Yeah, these methods are kind of a pain in the ass, but they're only worse if this new plan can do no better. But look at it -- they want every single link out there to be rewritten to their spec. Who does that help? Having millions of web authors rewrite documents to filter into this (really small []) database can't possibly be easier than having the much smaller set of web admins add server redirects whenever they notice more than a handful of 404 errors on the same document.
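    The redirect habits described above are one Apache directive each; a sketch (hostnames and paths invented):

```apache
# Renamed document: keep the old URL working.
Redirect permanent /old-name.html http://www.example.com/new-name.html
# Deleted document: hand the reader a search instead of a dead end.
Redirect temp /deleted.html http://www.google.com/search?q=deleted+document
```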

  • Two that come to mind are Dreamweaver and Frontpage. They'll check your entire site for you and display broken links.
  • I agree that this is something that a webmaster should deal with on his own site, individually. It's like printing company business cards with telephone numbers that don't exist on them, it gives an unprofessional feel to the whole site. However, on complex, dynamic and interactive user-based pages, this might not always be possible, unless multiple Webmasters are constantly monitoring the site. So I think that while a webmaster should be obliged to look after broken links - and other aspects of website care, in some cases tools like this will be beneficial.

  • As the BBC article states, Linkguard now knows that there are on average 52 links per page on the world wide web.

    Checking that number of links could well slow down page load times, especially if the pages in question are on another server. Yes you're only doing a HEAD rather than a full page fetch but it still takes time. And users can get rather irritated at slow-loading pages...

  • I use what I hope is a more or less useful [] 404 page on my site (useful in that it links to Google; better still would be linking a search for that document, but I haven't had a chance to try that).

    But, I think that this one [] is much more fun, in a clever little funny on IE funny on Lynx kinda way....

  • Where did common sense go???

    It's not even close to using Robust Hyperlinks (nobody wants to use them or understand them). The web is created by the lowest common denominator.

    This is a perfectly valid approach given _people_are_lazy_. It HAS to be done by a third party or it will never be done.

    I can't believe some of the ridiculous comments...
    "any good programmer"...etc. Most webpages are not even given a second thought after being created by an everyday joe who struggles to grasp HTML or, more importantly, DOESN'T CARE. Think it through.

    Often wrong but never in doubt.
    I am Jack9.
  • It's pretty impressive in terms of organizing a (relational) database, be it in terms of reasonable response time or maintenance.

    Just look into the backup aspects, which are certainly not trivial: Let's say you have a fiber channel link to the backup sub system and the database in question is a good backup citizen and handles 80 Gbyte per hour.

    Believe me, that's darn good throughput and rarely achieved in the real world. Go calculate.
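    Taking up the invitation with the numbers above (40 TB of data, 80 GB/hour to the backup subsystem; binary units assumed):

```python
# Back-of-the-envelope backup window for the Linkguard database.
database_gb = 40 * 1024              # 40 TB expressed in GB
throughput_gb_per_hour = 80          # the "darn good" figure above
hours = database_gb / throughput_gb_per_hour
print(hours, hours / 24)             # one full pass: 512 hours, over 21 days
```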

    What makes me shudder most about the story is the (Err, yessir; you know we had this incredible stoopid .COM biznes model idea, collected data for a few months and - sheesh I tell ya boy - were we stunned that we suddenly sat on 40 terabytes of data...) approach of database engineering.

    I've seen a lot of outrageously dumb approaches in database design and engineering. But those blokes really deserve a top slot in the list.

  • by Seumas ( 6865 ) on Sunday June 18, 2000 @03:52PM (#994259)
    Eventually, companies will purchase advertising space on your 404's. User runs into a non-existant page and, suddenly, they're confronted with the picture of a big juicy Whopper and a coupon to print out and take to Burger King.

    Or better, Linkguard will work with Netscape and Microsoft to have the browsers automatically redirect you to companies who have paid money to have 404's intercepted and -- instead of redirecting you to the original site as the designer intended, will steal you away to some big corporate website.

    "Jeeze, every time I run into a 404, I wind up at!"


  • There's tons of apps that do this already - WebTrends being an example, for instance. Probably shareware apps too. It's just basically a spidering thang.

    Paying a service to do it when I can buy an app, schedule it to run overnight, and have reports generated in the morning, strikes me as silly.

  • As long as we're collecting 404s, here's [] mine.

    I agree that fun 404s have become a nice amusement on the web. At least they avoid the two biggest problems with standard ones: telling people to contact the sysadmin, especially on a many-user machine, and telling people they must've typed something wrong, when people almost never type URLs.

  • by Abigail ( 70184 ) on Monday June 19, 2000 @06:23AM (#994262) Homepage
    It shouldn't be too difficult to write a short script that would check all the links on one's web site periodically.

    That will only check outbound links. That's not the problem that's being solved. The problem is checking for inbound links, that is, links on other people's websites to your website. That isn't easily solved with a short script.

    -- Abigail

  • How about a brand new protocol for locating documents on httpds?

    Most of the suggested ideas for this "protocol" are already part of the HTTP protocol.

    WHERE /baz.html

    Not at all needed. That is basically what HTTP does. The URL name space is just a name space. The only relationship between a URL and a file is whatever is dictated by local policy.

    100 /baz.html

    That's basically the 200 HTTP status code.

    101 /baz/1.html

    HTTP doesn't make a needless difference between moved to the same server or a different server. It does however make a difference between moved permanently and moved temporarily. Status codes 301 and 302.

    103 KILLED

    Status code 410.


    Status code 404.


    That's the 5xx category of status codes. There are also the 4xx status codes, if the problem is with the request itself.

    Maybe, also, there should be another client command, SEARCH, to find any/all occurances of a file name, like: SEARCH bar.html

    That doesn't make sense to put in a protocol, as URLs do *not* point to files. An HTTP server *might* map it to a file, but that's outside the domain of the URL name space. Furthermore, since the URL name space is infinite, the result of such a search command could be an infinite list as well.
    However, it isn't hard to add such functionality to your HTTP server. For instance, the server can be instructed to do a search when encountering the request for /SEARCH/bar.html.

    And a directory list, too... Client says: LIST /

    Again, that doesn't fit in the current standards for the same reason. But note that many HTTP servers have this feature already.

    I have begun to write this all up. Anyone who wants to help, visit my web site, find my e-mail address, and tell me you wish to help with this protocol.

    I strongly suggest you don't trouble yourself with making the effort. You start from the wrong idea, that URLs map to files, and most of the requested functionality has already been taken care of in the HTTP standard.

    Ref: RFC 2068 []

    -- Abigail
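    To make the mapping concrete, here is a minimal sketch (not anyone's production code; the paths are invented) of an HTTP server answering with the status codes that stand in for the proposed WHERE replies, using only the standard library:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class LocationHandler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        if self.path == "/baz.html":      # moved on the server (cf. "101")
            self.send_response(301)
            self.send_header("Location", "/baz/1.html")
            self.end_headers()
        elif self.path == "/old.html":    # removed for good (cf. "103 KILLED")
            self.send_response(410)
            self.end_headers()
        else:                             # never existed
            self.send_response(404)
            self.end_headers()
    def log_message(self, *args):         # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), LocationHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def status_of(path):
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    conn.request("HEAD", path)
    status = conn.getresponse().status
    conn.close()
    return status

statuses = {p: status_of(p) for p in ("/baz.html", "/old.html", "/nope.html")}
print(statuses)
server.shutdown()
```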

  • []. Seems fairly simple actually; they just have massive hdds. Gits.
  • Can they cure this link [does.not.exist]?
  • by Peter Dyck ( 201979 ) on Sunday June 18, 2000 @10:39AM (#994266)
    This is a completely wrong approach.

    It's like name resolving using only a single DNS server.

    It shouldn't be too difficult to write a short script that would check all the links on one's web site periodically. Using a 40 terabyte database to effect this is insanely ineffective.

  • by prac_regex ( 142250 ) on Sunday June 18, 2000 @10:39AM (#994267)
    With a simple cronjob and Perl's wonderful LWP [] module package, not to mention the other implementations of tracking web pages, any relatively smart administrator should already be doing this. It comes down to this: programmers are lazy and that is good, but is this just too lazy? phooey. Maybe this should be done as an apache module .. hrmm... maybe I should write that one.. mod_url_validator
    <Location />
    Add-handler Check-Links
    or something like that... no, I don't like it. too much overhead. well at least my first offer works, because I use it.
  • Sounds like a service that many site-scanning search engine companies could provide. Of course, those facilities which scan more often would provide faster service to the subscribing webmasters -- we've all seen out-of-date search results.
  • by bgalehouse ( 182357 ) on Sunday June 18, 2000 @10:37AM (#994269)
    This service would be much easier for somebody like google or altavista to provide. If a company starts making money at this, I'd expect the search engine boys to come in and offer to do the same thing for less.

    Seriously, don't they already have the database?

  • by Signal 11 ( 7608 ) on Sunday June 18, 2000 @10:40AM (#994270)
    Okay, three problems with tracking broken links..
    • Dynamically generated pages
    • server / operator error
    • I like my cool 404 errors.

    There are two problems that this thing could never catch with dynamically generated pages. One is the famous "missing include" problem, which usually looks something like this in the middle of the page: [Unable to process directive] - not a 404, as the page renders, but definitely a Bad Thing. The second problem is that of 404's which appear and disappear at random - like doing a sitewide search & replace across a hundred include files - that has a tendency to lock files, producing share violations, which in turn result in 404's. So the database can't have 100% integrity.

    The other problem is the rapid turnover on the web - millions of pages are appearing and disappearing every day. Those quantum people thought virtual particles were odd - try tracking down the same piece of information you found in a search engine 2 weeks ago!

    The second problem is operator/server/network errors. I've seen misconfigured proxies that mangle the URL and produce 404's when the page is there.. I've seen people make typos in the URL field of their browser (and then report it to me!), hell.. I've seen the 'net itself eat a few pages. All of this increases entropy in the database.

    Finally.. I like seeing the occasional cool 404 error. Take this one, from my server:

    Once upon a midnight dreary, while I websurfed, weak and weary,
    Over many a strange and spurious website of 'hot chicks galore',
    While I clicked my fav'rite bookmark, suddenly there came a warning,
    And my heart was filled with mourning, mourning for my dear amour.
    "'Tis not possible," I muttered, "give me back my cheap hardcore!"
    Quoth the server, "404".
  • Not a bad idea. What happens if the server is temporarily down, however? You don't want everyone taking down their links if it is just down for a day or so because of technical difficulties. This service would get annoying if someone had a lot of links on their site and a few were randomly down on that particular day.
  • They are just trying to create a market for themselves. Trying to keep tabs on the whole internet just in case a page moves every now and again is a silly idea. The right approach is for each site author to use cron to check the links every now and again. As soon as a page moves, update your link.

  • by cperciva ( 102828 ) on Sunday June 18, 2000 @10:42AM (#994273) Homepage
    (from RFC 1945):
    10.13 Referer

    The Referer request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained. This allows a server to generate lists of back-links to resources for interest, logging, optimized caching, etc. It also allows obsolete or mistyped links to be traced for maintenance. The Referer field must not be sent if the Request-URI was obtained from a source that does not have its own URI, such as input from the user keyboard.

    If you want to make sure that you don't break any links when you move your website, all you have to do is consult your HTTP logs, pull out all the lines starting "Referer:", and remove the duplicates.
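    A sketch of that log-mining step, assuming the common "combined" log format where the referer is the second quoted field (sample lines stand in for a real access log):

```shell
# Demo input -- in real use this is your HTTP access log:
cat > access_log <<'EOF'
1.2.3.4 - - [18/Jun/2000:10:00:00 +0000] "GET /page.html HTTP/1.0" 200 512 "http://slashdot.org/article.pl" "Mozilla/4.7"
5.6.7.8 - - [18/Jun/2000:10:01:00 +0000] "GET /page.html HTTP/1.0" 200 512 "http://slashdot.org/article.pl" "Mozilla/4.7"
9.9.9.9 - - [18/Jun/2000:10:02:00 +0000] "GET /other.html HTTP/1.0" 200 512 "-" "Mozilla/4.7"
EOF
# Pull out the referer field, drop empty ones, remove duplicates:
awk -F'"' '$4 != "-" {print $4}' access_log | sort -u > referers.txt
cat referers.txt
```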
  • by BJH ( 11355 ) on Sunday June 18, 2000 @10:43AM (#994274)

    I like that bit about cataloging pages with a five-word "lexical" signature based on words that appear mainly only on that page. How are they going to deal with the 5,000,000,000 web pages that contain only the word "porn"? ;)
  • Couldn't someone just rewrite one of those e-mail collecting spambots to do this kind of work, then e-mail to the page admin if his links were broken?

    The problem with this sort of stuff is that the people who'd pay big bucks for it usually hire web admins smart enough to take care of it themselves.

  • *cough* ; )

    This is a government conspiracy. I'm surprised none of you (especially Signal 11) picked up on it right away.

    Any webmaster worth his weight in HTML can use LWP or even a simple GUI-based Xenu (freeware linkchecker) to check on the current status of links on their site, and elsewhere.

    The only obvious benefit to something like Linkguard is for the government to keep track of you. You have 20% dead links on your site? "Bad webmaster -- BAD!"

    Next thing you know, your name is published in the paper, your wife leaves you, your house is foreclosed, and your children are taken away from you and put into foster care, with a family who does know how to maintain their links.

  • It's a really bad idea to have huge 404 pages. You could've had the same result (or better) if you just had a nice blank page with that message. It's even worse to redirect the 404 to another page, because I often try to find what I was looking for by going up the tree, and editing the URL becomes painful when I have to retype or repaste it every time because the site redirected me.

    Now that I got that off my chest, here are a few more amusing, yet not annoying, 404 pages:


  • by Animats ( 122034 ) on Sunday June 18, 2000 @04:38PM (#994278) Homepage
    Dumb journalism. This isn't a breakthrough.
    • There's a free service [] that already does this. They do it even if you don't ask them to, then send you spam telling you about broken links on your site.
    • And there's Alexa [], which really does archive the Web so that you can find old pages.
    • Personally, I like the link checker in Dreamweaver. [] It's very well integrated with the site maintenance tools.
    Probably the biggest source of bad links is unmaintained "favorite links" lists. That's something that needs a simple tool. If the major free-web-page sites provided something, that would probably cut the number of dead links substantially.
  • But what about broken links to other sites?
  • I'm sure I remember something in the HTML 4.0 spec that enables one to tell a browser to pre-load linked pages, but I've never seen a browser that actually implemented it.

    Abashed the Devil stood,
    And felt how awful goodness is
  • Hyper-G.

    Cooperative networked multimedia via ad hoc networked independent locally caching nodes that make up a distributed database system.

    There are a few good books on it out there for those who are interested, and source code is freely available.

    Unfortunately the "minimally functional/maximally stupid fragilely linked file server" solution of HTTPD got too established too quickly and Hyper-G couldn't penetrate.

    Once again, better technology proves not to be the deciding point in the market.
  • It seems that I am in the wrong. But does anyone know how to get FrontPage extensions working with Apache 1.3.12 on Windows 98? That's why the LIST existed in this idea, and the actual purpose behind this.

  • Here's a key line from the article:

    Eventually Linkguard is planning to use discrete software programs called agents to watch links and tell the webmasters of any affected sites when they are updated or changed.

    By "agents" they mean "bots", I suppose.

    Now, if it takes 40 terabytes (roughly 41,943,040 megabytes, I believe) to document all the links on the web, how much more space will be needed to keep contact info on all those links? Plus, how efficient will these agents be? I'm not so hot on the idea of bots constantly poking around my lil' Network, checking that all my links are okay.

    And will these bots follow the robots.txt rules? I know plenty of sites which revoke all robots, so the "agents" would be useless anyway... Nice idea, but sounds a bit invasive.

    Plus this line below:

    If the destination page disappears, search engines that can use these signatures would try to find the relevant signature and relocate the page.

    Oh, so now you're relying on search engines to get the links right... hm...

    I'll stick to manually checking them myself, thankyouverymuch.


  • Using a 40 terabyte database to effect this is insanely ineffective.

    Ineffective? I doubt it. I'm sure it'll work just fine. Inefficient, maybe.

  • Try that when people are behind a corporate firewall/proxy that eats the referer. Many systems do not respect the standards.. hence the need to build a parse tree and go through your site looking for bad links.
  • In a recent New Scientist article (sorry, can't source it better than that) I read that Google has "4000 linux computers each with 80gb of diskspace". Well, that works out to be 312tb and that's only to index a small portion of the web - 20% is it? (as I understand it, Google stores the whole webpage in order to serve a cached version on demand). So, how could 40tb be enough? Even assuming that this new company compressed the data to a higher degree than Google (which needs to serve pages fast), this just couldn't be enough to be useful.

    Gimmick or badly planned...whichever.

    --Remove SPAM from my address to mail me
  • I agree, a total waste of money and bandwidth.

    I mean, seriously, link checking should be the responsibility of the page admin. And as you say, it isn't that hard either. If you're so lazy you can't maintain your own site, then you deserve broken links.
  • by QBasic_Dude ( 196998 ) on Sunday June 18, 2000 @10:49AM (#994288) Homepage
    There are no reasons at all in theory for people to change URIs (or stop maintaining documents), but millions of reasons in practice.
    Tim Berners-Lee, inventor of the World Wide Web, wrote about this in a page titled Cool URIs don't Change []. Many web authors don't realize file name extensions can be removed [] from the URI space. Pages Must Live Forever (Alertbox Nov. 1998) [] is another document about the same issue.

    The Network Working Group is working on a replacement for URLs -- Uniform Resource Names. URNs [] are intended to serve as persistent, location-independent resource identifiers and are designed to make it easy to map other namespaces (which share the properties of URNs) into URN-space.

  • Simple, you take out everything that isn't in a

    Oh my god, Bear is driving! How can this be?
  • If you take out everything but the <a href="">'s it doesn't take up nearly as much space.

    Oh my god, Bear is driving! How can this be?
  • Just how does this `company' plan to make money?

    Will they email the companies saying "at least one of your links is down; send us a cheque for x00000 pounds and we'll tell you which"?

    Just how?
  • Think of all the times you _had_ to take down a web page because it had misinformation, because it broke a copyright or some other reason... I have already had this problem with google, I think it'll be even worse w/ this service.
  • Nope

    1 gig = 1,000 meg.
    40 gig = 40,000 meg.

    1 tb = 1,000 gig

    so 40*1000*1000
    40 tb = 40,000,000 megabyte

    Of course this is with 1 kb equaling 1000 bytes, not 1024. So add two terabytes roughly.
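    The disagreement above is just decimal versus binary units; a quick check of both readings of "40 TB":

```python
decimal_mb = 40 * 1000 * 1000   # 40 TB with 1 TB = 1,000,000 MB
binary_mb = 40 * 1024 * 1024    # 40 TB with 1 TB = 1,048,576 MB
print(decimal_mb, binary_mb, binary_mb - decimal_mb)  # gap is roughly 1.9 TB
```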
  • To me the article seemed to stress not broken links among your own page, but broken links from other people's pages to your own, thus causing you to lose out on visitors coming to your site from others.

    However, this seems like it could also be done on the local side, by logging the http-referer so you can keep track of any pages that a lot of your visitors seem to be coming from and then notifying them if/when you change your URL's.
  • Compression compression and more compression...
  • This WOULD be an ideal case for mobile agents ... but I seriously doubt that they use such technology ..

    (for those who are not that familiar: mobile agents are programs that "hop" from server to server and perform their tasks at the current location. It would be ideal for distributed indexing / spidering / checking whatever; unfortunately there is not much infrastructure around that allows execution of foreign java programs .... ;)

    Samba Information HQ
  • There's also things like PURL [], but they haven't really caught on.
  • Try that when people are behind a corporate firewall/proxy that eats the referer.

    Yes, but there will be enough people who aren't behind firewalls that you can pick out all the important incoming links anyway. You only need one person to traverse an incoming link without a referer-eater in order to get the link source into your logs.
  • Even with 40 terabytes, I don't see how it can find every possible link. Well, I appreciate them trying; I've been informed before about a broken link on my page by an automated bot, and I did fix it. So, I'm all for it, even though I can't see very much of a point. Well, we'll see.
  • This does not help the problem of people bookmarking a page, or recording it some other out of band mechanism (email).

In less than a century, computers will be making substantial progress on ... the overriding problem of war and peace. -- James Slagle