Linkguard To Cure Broken Links?
sean dreilinger writes: "Here's a BBC writeup of the company Linkguard, which hopes to monitor hyperlink stability via their 40-terabyte database and notify web authors when links are broken." This is a different effort from this one. Still, 40 terabytes?
Re:But.. but... (Score:1)
Easy way to check for broken links ... (Score:2)
You might get quite a lot of e-mail
Samba Information HQ
Already being done (Score:1)
This is nothing new really...
Nathaniel P. Wilkerson
NPS Internet Solutions, LLC
www.npsis.com [npsis.com]
hm (Score:2)
Another Service, 7-24 (Score:1)
If all you're worried about is moved pages within your server, looking at the error log from the web server is pretty good.
Re:HTTP "Referer" header (Score:1)
If you click the above link to my home page, my server is supposed to see that you came from this article on Slashdot. That's all well and good.
If you type a random URL - say, http://www.yahoo.com/ - into your address bar now, should Yahoo see that you came from Slashdot? I'd call that an invasion of privacy. Netscape sends that information.
--
Re:URIs don't change: people change them (Score:1)
WHERE /baz.html
If baz.html is in the same place (foo.bar.org/baz.html) the location server will return:
100 /baz.html
If it's been moved elsewhere on the same server (say, to /baz/1.html) the location server will return:
101 /baz/1.html
If it's moved to another server (this time foo.baz.org/bar.html) then the location server returns:
102 http://foo.baz.org/bar.html
And if it's been removed for good, then:
103 KILLED
If the file never existed, then the location server will say:
200 DOESNT EXIST
If the server encounters a problem then it should return:
300 SERVER ERROR
Or something else, if it knows what it did wrong.
Maybe, also, there should be another client command, SEARCH, to find any/all occurrences of a file name, like:
SEARCH bar.html
To this the server might reply:
400 SEARCH BEGINS
401
401
401
402 SEARCH ENDS
And a directory list, too... Client says:
LIST /
And the server says:
405 LIST BEGINS
406
406
406
406
406
406
406
406
406
406
407 LIST ENDS
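To make the idea concrete, here's a rough Perl sketch of how a client might talk to such a location server. The port number and the line-based wire format are pure assumptions layered on the commands above:

    #!/usr/bin/perl -w
    # Hypothetical client for the proposed location protocol.
    # Assumes a plain line-based exchange; port 1096 is made up.
    use strict;
    use IO::Socket::INET;

    my ($host, $path) = ('foo.bar.org', '/baz.html');

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 1096,
        Proto    => 'tcp',
    ) or die "can't connect to $host: $!\n";

    print $sock "WHERE $path\r\n";
    my $reply = <$sock>;    # e.g. "100 /baz.html" or "102 http://foo.baz.org/bar.html"
    close $sock;

    defined $reply or die "no reply from server\n";
    $reply =~ s/\s+$//;
    my ($code, $rest) = split ' ', $reply, 2;
    if    ($code == 100) { print "still at $rest\n" }
    elsif ($code == 101) { print "moved on the same host: $rest\n" }
    elsif ($code == 102) { print "moved to $rest\n" }
    elsif ($code == 103) { print "gone for good\n" }
    elsif ($code == 200) { print "never existed\n" }
    else                 { print "server said: $reply\n" }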
I have begun to write this all up. Anyone who wants to help, visit my web site, find my e-mail address, and tell me you wish to help with this protocol.
Analog (Score:1)
I just have cron set up to do a daily 404 report for the last seven days with Analog, and read it occasionally.
Duh!
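If you'd rather roll your own than install Analog, a minimal stand-in might look like the sketch below. It assumes a common/combined-format access log at an invented path, and weekly log rotation so the current file covers roughly the last seven days:

    #!/usr/bin/perl -w
    # Rough weekly 404 report: count 404s per requested URL.
    use strict;

    my $log = '/var/log/apache/access_log';   # example path; adjust to your server
    my %count;

    open my $fh, '<', $log or die "can't read $log: $!\n";
    while (<$fh>) {
        # common log format: ... "GET /path HTTP/1.0" 404 209
        next unless /"(?:GET|HEAD|POST) (\S+) [^"]*" 404 /;
        $count{$1}++;
    }
    close $fh;

    for my $url (sort { $count{$b} <=> $count{$a} } keys %count) {
        printf "%6d  %s\n", $count{$url}, $url;
    }

Run it from cron just before the logs rotate and mail yourself the output, and you get roughly what Analog's failure report gives you.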
40 terabytes no longer impressive, emmett (Score:1)
- A.P.
--
"One World, one Web, one Program" - Microsoft promotional ad
404 ahead. (Score:1)
web, but why can't HTML code be devised that looks ahead to where the links lead after the rest of the page has finished loading, and returns a status code on how each page is working? If it's a 404, then the browser can change the link text to a predefined color so the user knows not to click on it. Any ideas?
Re:But.. but... (Score:1)
Re:URIs don't change: people change them (Score:1)
# (sketch -- the log path and the substitution are illustrative)
RewriteLog  /anywhere/rewrite.log
RewriteMap  real-to-user  txt:/anywhere/map.real-to-host
RewriteRule ^/([^/]+)/~([^/]+)/(.*)$  /$1/~${real-to-user:$2|$2}/$3
Also, the HTTP URI or Refresh header can be used to easily redirect an existing location to another. There is no need for a document location protocol.
I somewhat disagree with this. After all, a URL could be bookmarked, linked, or referred to in some other way. Once a URI is created, it should exist forever. Freenet [sourceforge.net] is an interesting distributed Internet-like network where documents can be uploaded and, since the files do not reside on a central server, exist as long as there is demand for them. Searching should, in my opinion, be higher level and not in the protocol. CGI can easily be used instead.
This really shouldn't be necessary. Links should be able to get to all the public documents on the web server. HTTP is not FTP.
If you feel a new feature should be added to HTTP, suggest it on the ietf-http-wg [w3.org] working group mailing list and it might be accepted in HTTP 1.2.
Re:But.. but... (Score:1)
Re:Easy way to check for broken links ... (Score:1)
You could make all your external links point to a Perl script that either spewed the right page at the browser, or made a log entry if the link was down. All your links would look like this: <a href="http://foo.bar.com/cgi-bin/safe_link.pl?http://www.externalwebsite.com/">click here</a> Of course the downside is that the beauty of simple hyperlinks is lost.
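A rough, untested sketch of what that safe_link.pl might look like, using LWP; the log path is made up and the target URL is assumed to arrive unescaped in the query string, as in the link above:

    #!/usr/bin/perl -w
    # Bounce the visitor to the target if it answers; otherwise log the
    # failure and apologize. Pure sketch of the idea described above.
    use strict;
    use LWP::UserAgent;

    my $target = $ENV{QUERY_STRING}
        or die "no target URL given\n";

    my $ua  = LWP::UserAgent->new(timeout => 10);
    my $res = $ua->head($target);

    if ($res->is_success) {
        print "Location: $target\r\n\r\n";               # plain CGI redirect
    } else {
        open my $log, '>>', '/var/log/dead_links.log';   # example path
        print $log scalar(localtime), " $target ", $res->status_line, "\n";
        close $log;
        print "Content-type: text/plain\r\n\r\n";
        print "Sorry, $target seems to be down right now.\n";
    }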
Re:Big deal. (Score:1)
This is the least intrusive behavior -- it doesn't need to pop up to let you know, because it won't be of concern to you until you actually want to go somewhere else. And by moving them into another folder, it gives you a chance to review them and find out where the right link is, if it's something you still care about. I'd pay money to see something like this developed.
--
Re:This is already possible and for free (Score:2)
I once made such a short Perl script to check the links on my own web pages: http://www.iki.fi/kaip/linkkuri.html [helsinki.fi]
Re: (Score:1)
rear thruster (Score:1)
Well, it's a funny world, I guess.
tell webmasters, before you move (Score:1)
Yahoo, excite, etc. (Score:1)
Search Engines (Score:1)
At least, that's what I think.
.mincus
Re:40tb not enough (Score:1)
Cached pages (Score:2)
Need I say more?
--Remove SPAM from my address to mail me
404 No More (Score:1)
This is silly. (Score:2)
For those of you who want to try, here's a starting point:
grep "File does not exist:" error_log > 404s_log
;)
Misguided? (Score:3)
Yeah, these methods are kind of a pain in the ass, but they're only worse if this new plan can do no better. But look at it -- they want every single link out there to be rewritten to their spec. Who does that help? Millions of web authors out there trying to rewrite documents to filter into this (really small [slashdot.org]) database can't possibly be easier than having the much smaller set of web admins adding server redirects whenever they notice more than a handful of 404 errors on the same document.
Most site management tools do this already (Score:1)
Re:This is already possible and for free (Score:1)
I agree that this is something that a webmaster should deal with on his own site, individually. It's like printing company business cards with telephone numbers that don't exist on them; it gives an unprofessional feel to the whole site. However, on complex, dynamic and interactive user-based pages, this might not always be possible, unless multiple webmasters are constantly monitoring the site. So I think that while a webmaster should be obliged to look after broken links - and other aspects of website care - in some cases tools like this will be beneficial.
Re:404 ahead. (Score:1)
Checking that number of links could well slow down page load times, especially if the pages in question are on another server. Yes, you're only doing a HEAD rather than a full page fetch, but it still takes time. And users can get rather irritated at slow-loading pages...
Re:But.. but... (Score:2)
But, I think that this one [slab.org] is much more fun, in a clever little funny on IE funny on Lynx kinda way....
Well duh. (Score:1)
It's not even close to using Robust Hyperlinks (nobody wants to use them or understand them). The web is created by the lowest common denominator.
This is a perfectly valid approach given _people_are_lazy_. It HAS to be done by a third party or it will never be done.
I can't believe some of the ridiculous comments...
"any good programmer"...etc. Most webpages are not even given a second thought after being created by an everyday joe who struggles to grasp HTML or, more importantly, DOESN'T CARE. Think it through.
Often wrong but never in doubt.
I am Jack9.
Not trivial in terms of data management (Score:1)
Just look into the backup aspects, which are certainly not trivial: Let's say you have a fiber channel link to the backup sub system and the database in question is a good backup citizen and handles 80 Gbyte per hour.
Believe me, that's darn good throughput and rarely achieved in the real world. Go calculate: 40,000 Gbyte at 80 Gbyte per hour is 500 hours, roughly three weeks for a single full backup.
What makes me shudder most about the story is the (Err, yessir; you know we had this incredible stoopid .COM biznes model idea, collected data for a few months and - sheesh I tell ya boy - were we stunned that we suddenly sat on 40 terabytes of data...) approach to database engineering.
I've seen a lot of outrageously dumb approaches in database design and engineering. But those blokes really deserve a top slot in the list.
404 Commercials (Score:3)
Or better, Linkguard will work with Netscape and Microsoft to have the browsers automatically redirect you to companies who have paid money to have 404's intercepted and -- instead of redirecting you to the original site as the designer intended -- steal you away to some big corporate website.
"Jeeze, every time I run into a 404, I wind up at eBay.com!"
---
icq:2057699
seumas.com
Re:Wrong approach! (Score:1)
Paying a service to do it when I can buy an app, schedule it to run overnight, and have reports generated in the morning, strikes me as silly.
Re:But.. but... (Score:1)
I agree that fun 404s have become a nice amusement on the web. At least they avoid the two biggest problems with standard ones: telling people to contact the sysadmin, especially on a many-user machine, and telling people they must've typed something wrong, when people almost never type URLs.
Re:Wrong approach! (Score:3)
That will only check outbound links. That's not the problem that's being solved. The problem is checking inbound links. That is, links on other people's websites to your website. That isn't easily solved with a short script.
-- Abigail
Re:URIs don't change: people change them (Score:2)
Most of the suggested ideas for this "protocol" are already part of the HTTP protocol.
WHERE /baz.html
Not at all needed. That is basically what HTTP does. The URL name space is just a name space. The only relationship between a URL and a file is whatever is dictated by local policy.
100 /baz.html
That's basically the 200 HTTP status code.
101 /baz/1.html
102 http://foo.baz.org/bar.html
HTTP doesn't make a needless distinction between a move to the same server and a move to a different server. It does, however, distinguish moved permanently from moved temporarily: status codes 301 and 302.
103 KILLED
Status code 410.
200 DOESNT EXIST
Status code 404.
300 SERVER ERROR
That's the 5xx category of status codes. There are also the 4xx status codes, if the problem is with the request itself.
Maybe, also, there should be another client command, SEARCH, to find any/all occurrences of a file name, like: SEARCH bar.html
That doesn't make sense to put in a protocol, as URLs do *not* point to files. An HTTP server *might* map it to a file, but that's outside the domain of the URL name space. Furthermore, since the URL name space is infinite, the result of such a search command could be an infinite list as well.
However, it isn't hard to add such functionality to your HTTP server. For instance, the server can be instructed to do a search when encountering a request for /SEARCH/bar.html.
And a directory list, too... Client says: LIST /
Again, that doesn't fit in the current standards for the same reason. But note that many HTTP servers have this feature already.
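Something like this (untested; the URL is just an example) shows how much of the proposed protocol HTTP already gives you -- it's just a matter of looking at the status code:

    #!/usr/bin/perl -w
    # Ask for a document with a plain HEAD and report what the existing
    # HTTP status codes already say about moved or gone documents.
    use strict;
    use LWP::UserAgent;
    use HTTP::Request;

    my $url = shift || 'http://foo.bar.org/baz.html';    # example URL

    my $ua  = LWP::UserAgent->new;
    # simple_request() doesn't follow redirects, so we see the raw answer
    my $res = $ua->simple_request(HTTP::Request->new(HEAD => $url));

    my $code = $res->code;
    if    ($code == 200)                 { print "still there\n" }
    elsif ($code == 301 || $code == 302) { print "moved to ", $res->header('Location'), "\n" }
    elsif ($code == 410)                 { print "gone for good\n" }
    elsif ($code == 404)                 { print "no such document\n" }
    else                                 { print "server said: ", $res->status_line, "\n" }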
I have begun to write this all up. Anyone who wants to help, visit my web site, find my e-mail address, and tell me you wish to help with this protocol.
I strongly suggest you not trouble yourself with the effort. You start from the wrong idea, that URLs map to files, and most of the requested functionality has already been taken care of in the HTTP standard.
Ref: RFC 2068 [pasteur.fr]
-- Abigail
Link? (Score:1)
A Test (Score:1)
Wrong approach! (Score:3)
It's like name resolving using only a single DNS server.
It shouldn't be too difficult to write a short script that would check all the links on one's web site periodically. Using a 40 terabyte database to effect this is insanely ineffective.
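For outbound links, at least, such a script really is short. Here's a rough sketch using the standard libwww-perl pieces; the file glob is just an example and you'd run it from cron:

    #!/usr/bin/perl -w
    # Walk some local HTML files, pull out the <a href> targets, and HEAD
    # each external one; print whatever doesn't answer.
    use strict;
    use HTML::LinkExtor;
    use LWP::UserAgent;

    my @files = glob('/home/*/public_html/*.html');   # example location
    my $ua    = LWP::UserAgent->new(timeout => 15);
    my %seen;

    for my $file (@files) {
        my $p = HTML::LinkExtor->new;
        $p->parse_file($file);
        for my $link ($p->links) {
            my ($tag, %attr) = @$link;
            next unless $tag eq 'a' and defined $attr{href};
            my $url = $attr{href};
            next unless $url =~ m!^http://!;          # skip relative/mailto links
            next if $seen{$url}++;
            my $res = $ua->head($url);
            print "$file: $url (", $res->status_line, ")\n" unless $res->is_success;
        }
    }

As Abigail points out elsewhere in this thread, though, this says nothing about inbound links - who links to you - which is the harder half of the problem.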
This is already possible and for free (Score:4)
<Location
Add-handler Check-Links
</Location>
or something like that... No, I don't like it: too much overhead. Well, at least my first suggestion works, because I use it.
Spin-off (Score:1)
Why an independent effort? (Score:3)
Seriously, don't they already have the database?
But.. but... (Score:3)
There are two problems this thing could never catch with dynamically generated pages. One is the famous "missing include" problem, which usually looks something like this in the middle of the page: [Unable to process directive] - not a 404, as the page renders, but definitely a Bad Thing. The second is 404's that appear and disappear at random - like doing a sitewide search & replace across a hundred include files - which has a tendency to lock files, producing share violations, which in turn result in 404's. So the database can't have 100% integrity.
The other problem is the rapid turnover on the web - millions of pages are appearing and disappearing every day. Those quantum people thought virtual particles were odd - try tracking down the same piece of information you found in a search engine 2 weeks ago!
Then there are operator/server/network errors. I've seen misconfigured proxies that mangle the URL and produce 404's when the page is there.. I've seen people make typos in the URL field of their browser (and then report it to me!), hell.. I've seen the 'net itself eat a few pages. All of this increases entropy in the database.
Finally.. I like seeing the occasional cool 404 error. Take this one, from my server:
server down? (Score:2)
Wrong approach to a non-existent problem (Score:2)
They are just trying to create a market for themselves. Trying to keep tabs on the whole internet just in case a page moves every now and again is a silly idea. The right approach is for each site author to use cron to check the links every now and again. As soon as a page moves, update your link.
HTTP "Referer" header (Score:3)
If you want to make sure that you don't break any links when you move your website, all you have to do is consult your HTTP logs, pull out all the lines starting "Referer:", and remove the duplicates.
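In Perl that's only a few lines. The sketch below assumes a combined-format access log (the referer is the second-to-last quoted field) rather than a separate referer log; the path is an example:

    #!/usr/bin/perl -w
    # Print the unique referring pages found in a combined-format access log.
    use strict;

    my %seen;
    open my $fh, '<', '/var/log/apache/access_log' or die "can't read log: $!\n";
    while (<$fh>) {
        # combined format ends with: ... status bytes "referer" "user-agent"
        next unless /" \d{3} \S+ "([^"]*)" "[^"]*"\s*$/;
        my $ref = $1;
        next if $ref eq '-' or $ref eq '';
        print "$ref\n" unless $seen{$ref}++;
    }
    close $fh;

Anything on that list is a page whose owner you'd want to warn (or set up a redirect for) before you move things around.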
Hah! (Score:4)
I like that bit about cataloging pages with a five-word "lexical" signature based on words that appear mainly only on that page. How are they going to deal with the 5,000,000,000 web pages that contain only the word "porn"?
Re:Wrong approach! (Score:1)
The problem with this sort of stuff is that the people who'd pay big bucks for it usually hire web admins smart enough to take care of it themselves.
-jpowers
Government Conspiracy (Score:2)
This is a government conspiracy. I'm surprised none of you (especially Signal 11) picked up on it right away.
Any webmaster worth his weight in HTML can use LWP or even a simple GUI-based Xenu (freeware linkchecker) to check on the current status of links on their site, and elsewhere.
The only obvious benefit to something like Linkguard is for the government to keep track of you. You have 20% dead links on your site? "Bad webmaster -- BAD!"
Next thing you know, your name is published in the paper, your wife leaves you, your house is foreclosed, and your children are taken away from you and put into foster care, with a family who does know how to maintain their links.
---
icq:2057699
seumas.com
Re:But.. but... (Score:1)
Now that I got that off my chest, here are a few more amusing, yet not annoying, 404 pages:
--
Big deal. (Score:3)
Re:Easy way to check for broken links ... (Score:1)
Re:404 ahead. (Score:1)
Abashed the Devil stood,
And felt how awful goodness is
Another existing better solution... (Score:1)
Cooperative networked multimedia via ad hoc networked independent locally caching nodes that make up a distributed database system.
There are a few good books on it out there for those who are interested, and source code is freely available.
Unfortunately the "minimally functional/maximally stupid fragilely linked file server" solution of HTTPD got too established too quickly and HyperG couldn't penetrate.
Once again, better technology proves not to be the deciding point in the market.
Hrm... Good points made... General reply (Score:1)
Agents? No thanks. (Score:2)
Eventually Linkguard is planning to use discrete software programs called agents to watch links and tell the webmasters of any affected sites when they are updated or changed.
By "agents" they mean "bots", I suppose.
Now, if it takes 40 terabytes (roughly 41,943,040 megabytes, I believe) to document all the links on the web, how much more space will be needed to keep contact info on all those links? Plus, how efficient will these agents be? I'm not so hot on the idea of bots constantly poking around my lil' Network, checking that all my links are okay.
And will these bots follow the robots.txt rules? I know plenty of sites which revoke all robots, so the "agents" would be useless anyway... Nice idea, but sounds a bit invasive.
Plus this line below:
If the destination page disappears, search engines that can use these signatures would try to find the relevant signature and relocate the page.
Oh, so now you're relying on search engines to get the links right... hm...I'll stick to manually checking them myself, thankyouverymuch.
---
Re:Wrong approach! (Score:2)
Ineffective? I doubt it. I'm sure it'll work just fine. Inefficient, maybe.
Re:HTTP "Referer" header (Score:1)
40tb not enough (Score:2)
Gimmick or badly planned...whichever.
--Remove SPAM from my address to mail me
Re:Wrong approach! (Score:1)
I mean, seriously, link checking should be the responsibility of the page admin. And as you say, it isn't that hard either. If you're so lazy you can't maintain your own site, then you deserve broken links.
URIs don't change: people change them (Score:4)
The Network Working Group is working on a replacement for URLs -- Uniform Resource Names. URNs [isi.edu] are intended to serve as persistent, location-independent resource identifiers (something along the lines of urn:isbn:0451450523, which names a book without saying where a copy lives) and are designed to make it easy to map other namespaces (which share the properties of URNs) into URN-space.
Re:40tb not enough (Score:1)
----
Oh my god, Bear is driving! How can this be?
Re:40tb not enough (Score:1)
----
Oh my god, Bear is driving! How can this be?
Money? (Score:2)
Will they email the companies saying "at least one of your links is down; send us a cheque for x00000 pounds and we'll tell you which"?
Just how?
But what if you want the page removed. (Score:1)
Re:actually (Score:1)
1 gig = 1,000 meg.
40 gig = 40,000 meg.
1 tb = 1,000 gig
so 40*1000*1000
40 tb = 40,000,000 megabyte
Of course this is with 1 kb equaling 1000 bytes, not 1024. So add two terabytes, roughly.
Missed the point? (Score:2)
However, this seems like it could also be done on the local side, by logging the HTTP referer so you can keep track of any pages that a lot of your visitors seem to be coming from, and then notifying them if/when you change your URLs.
Re:Sure, Sure... (Score:1)
Re:Agents? No thanks. (Score:1)
(For those who are not that familiar: mobile agents are programs that "hop" from server to server, and perform their tasks at the current location. It would be ideal for distributed indexing / spidering / checking / whatever; unfortunately there is not much infrastructure around that allows execution of foreign Java programs.)
Samba Information HQ
Re:URIs don't change: people change them (Score:2)
Re:HTTP "Referer" header (Score:1)
Yes, but there will be enough people who aren't behind firewalls that you can pick out all the important incoming links anyway. You only need one person to traverse an incoming link without a referer-eater in order to get the link source into your logs.
Sure, Sure... (Score:1)
Re:HTTP "Referer" header (Score:1)