Broken Links No More?
johndoejersey writes "Students in England have developed a tool which could bring the end to broken links. Peridot, developed by UK intern students at IBM, scans company weblinks and replaces outdated information with other relevant documents and links. IBM have already filed 2 patents for the project. The students said Peridot could protect companies by spotting links to sites that have been removed, or which point to wholly unsuitable content. 'Peridot could lead to a world where there are no more broken links,' James Bell, computer science student at the University of Warwick, told BBC News Online. Here is another story on it." See also the BBC story.
Can someone say "Bad Idea Jeans"? (Score:5, Interesting)
First, replacing links. This is quite a bad idea. Here's why, with an example.
In general, we can all agree that the technology behind Google is pretty impressive. It has its own "More Pages Like This" feature, which we can assume is at least somewhat similar to this one. Complex content analysis among billions of pages, to determine which are similar and which are different.
So, suppose we had a link to Major League Baseball, www.mlb.com [mlb.com] on our page. And suppose, for whatever reason, that their site went away (perhaps a few more players' strikes?).
Well, what does Google suggest as a replacement? Check it out here [google.com].
First the National Football League (NFL), then the National Basketball Association (NBA), and then the National Hockey League (NHL). Followed by the ESPN sports network, and NASCAR racing.
Obviously, if we wanted to link to a site about baseball, all of those (other than ESPN) are really entirely irrelevant.
But if we wanted to link to a site about professional sports organizations, all of those (other than ESPN) are QUITE relevant.
Can this software know our intent?
Hardly.
You really have to question the ability of machines to select relevant links.
The situation is this: If someone goes to the trouble to manually create links in the first place, those should not be automatically changed to other sites that some computer program thinks may be related. Links shouldn't be inserted automatically; if someone needs more information on something you haven't linked to, they can use a search engine. And then your company isn't liable to look idiotic by linking to irrelevant sites.
Now, the other aspect of this product.
Removing dead or changed links is quite another matter. Automated removal of links is a great idea and quite useful. For example, consider when someone's domain name expires and it is taken over by a porn site. It'd be great to have a program that automatically removes links to it from your site. Like this tool, this could be based on a percentage of changed content--if the content changes significantly, remove the link quickly and automatically. If the content changes by some intermediate amount, flag the link as needing review by the webmaster.
But in both of those cases, the software should present the webmaster with a list of such questionable links, those it has removed from the site temporarily, and then allow the webmaster to select replacement links.
Manually. With relevance.
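For what it's worth, the percentage-of-change rule described above could be sketched roughly like this in Python. The thresholds, the word-level comparison and the function names are all assumptions made up for illustration; nothing here reflects how Peridot actually measures change.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; the post gives no actual numbers.
REMOVE_ABOVE = 0.6  # more than 60% changed: pull the link right away
REVIEW_ABOVE = 0.2  # between 20% and 60% changed: ask the webmaster


def changed_fraction(old_text, new_text):
    """Fraction of the page that differs, from 0.0 (identical) to 1.0."""
    old_words, new_words = old_text.split(), new_text.split()
    return 1.0 - SequenceMatcher(None, old_words, new_words).ratio()


def classify_link(old_text, new_text):
    """Decide what to do with a link whose target content has changed."""
    change = changed_fraction(old_text, new_text)
    if change > REMOVE_ABOVE:
        return "remove"
    if change > REVIEW_ABOVE:
        return "review"
    return "keep"
```

Either way, both "remove" and "review" results would still end up on the list shown to the webmaster, who picks any replacement by hand.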
Re:Can someone say "Bad Idea Jeans"? (Score:5, Insightful)
Try this. [google.com]
Re:Can someone say "Bad Idea Jeans"? (Score:2)
Re:Can someone say "Bad Idea Jeans"? (Score:4, Interesting)
While this might seem like a LOT for google to be doing on the backend, I would have to think that a majority of the public ends up visiting the same 5-10% of the internet each day (number pulled from my ass, but an educated guess at least).
Re:Can someone say "Bad Idea Jeans"? (Score:3, Funny)
Re:Can someone say "Bad Idea Jeans"? (Score:2)
Here's another proposal: a tool that finds linked redirects and updates the link to the new URL. Even then, requiring individual approval for each change seems like a sensible precaution.
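A rough Python sketch of that redirect-following tool, with the per-change approval built in. The function names and the `responses` mapping (link to status code and Location header) are hypothetical; a real tool would fill that mapping by actually requesting each link.

```python
from urllib.parse import urljoin


def propose_redirect_update(link, status, location):
    """Return the URL a permanently redirected link should become, or None."""
    if status in (301, 308) and location:
        new_url = urljoin(link, location)  # the Location header may be relative
        if new_url != link:
            return new_url
    return None


def apply_approved(links, responses, approve):
    """Rewrite links only when a human approves each individual change.

    responses maps link -> (status, location); approve(old, new) returns a bool.
    """
    updated = []
    for link in links:
        status, location = responses.get(link, (None, None))
        target = propose_redirect_update(link, status, location)
        updated.append(target if target and approve(link, target) else link)
    return updated
```

Temporary redirects (302, 307) are deliberately left alone here, since the old URL is supposed to stay valid in that case.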
Although, there would be something elegant about seeing all the old goatse.cx links in the /. archives change to wherever the new site is now.
Re:Can someone say "Bad Idea Jeans"? (Score:2)
Regardless, it would be a simple adjustment to have the tool replace only links to the same site.
Re:Can someone say "Bad Idea Jeans"? (Score:4, Interesting)
The reason why they want to replace the links automatically is because some webmasters have to manage thousands of pages and don't want to press the 'ok' button every time the system detects a change.
Re:Can someone say "Bad Idea Jeans"? (Score:4, Insightful)
"product", this looks like a technology a webmaster would use on their own site. It also gives them the option to accept the suggestion or not. This could be really good for corporations with large intranet sites, where webmasters leave, documents constantly get moved, etc.
I think had the original poster read the article they wouldn't have gone off half-cocked. IBM must also be somewhat confident that this is new technology or else they wouldn't have filed two patents for it.
Re:Can someone say "Bad Idea Jeans"? (Score:3, Insightful)
Yep. Because Major League Baseball has strong conceptual similarity to several other concepts: the game of baseball, professional sports, American culture, and others. Granted some are more specific than others, but that's a pretty tough judgment call that depends on the context in which the original link occurred. If I link to mlb.com from a site about baseball, then it means something diff
Re:Can someone say "Bad Idea Jeans"? (Score:2, Funny)
Re:Can someone say "Bad Idea Jeans"? (Score:2)
Re:Can someone say "Bad Idea Jeans"? (Score:2)
Google worked perfectly, isn't it obvious? MLB is not a sport; it is the corporation that is related to a sport, that controls the major professional players, but it is *not* the sport itself. You wanted to find similar items, and so Google brought up
Re:Can someone say "Bad Idea Jeans"? (Score:3, Interesting)
Basically when a page is indexed by a search engine such as google, the first step is to create a document vector from the document based on the repetition of words (terms) and how common these words are (i.e. a list of TFIDF values -- term frequency * inverse document frequency).
Anyway, this document vector is what is compared against by the search engine to find matches (which is how google can return results in 0.14 s
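To make the TFIDF idea concrete, here is a minimal Python sketch. It assumes plain whitespace tokenization and the textbook weighting tf * log(N / df), then compares vectors by cosine similarity; it is only an illustration of the technique, not a claim about what google really runs.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Note that a term appearing in every document gets weight zero, which is exactly why very common words don't help tell pages apart.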
Re:Can someone say "Bad Idea Jeans"? (Score:5, Funny)
Re:Can someone say "Bad Idea Jeans"? (Score:2)
Re:Another patent on something stupid (Score:2)
Great (Score:5, Insightful)
Hang on. On similar lines, I've a great idea. Suppose I type a nonexistent hostname into my browser. Wouldn't it be good if the DNS server just gave me its best guess instead of an error message? Or some kind of Site Finding search engine. That'd be even better than
Re:Great (Score:5, Insightful)
I want things to be LESS tolerant of mistakes, not more. this is why the web is so fucked up. when people can get away with absolute shit, why produce anything better than shit?
Re:Great (Score:3, Funny)
"Hey guys, we have grass roots support, check out slashdot!"
Re:Great (Score:3, Informative)
Re:Great (Score:2)
check your browser settings, i seem to remember IE having this ability on its own. and i know i've seen a plugin for moz somewhere.
Perhaps you are thinking of Internet Keywords: [mozilla.org]
I liked this one: (Score:5, Funny)
Does this app take the form of a paper clip? Because that would be a great idea!
scary enough... (Score:3, Informative)
Clippy indeed, must be a slow news day,
- RLJ
Semantic Web? (Score:5, Insightful)
Re:Semantic Web? (Score:2)
Parent poster is exactly right. The semantic web is designed for exactly this kind of thing, and would drastically reduce the amount of computing power needed to do it.
For a good discussion of the semantic web, and why we need to get going and build it, read the relevant chapter in The Unfinished Revolution [harpercollins.com] by Michael Dertouzos. I didn't quite understand what Tim Berners-Lee was getting at when he described the semantic web in Weaving the Web [w3.org]. Dertouzos explains it better, I think.
I had an idea
well (Score:5, Funny)
Re:well (Score:2)
Re: (Score:2)
No more broken bookmarks... (Score:3, Insightful)
Re:No more broken bookmarks... (Score:2)
How many bookmarks do you actually use after a year? Most stuff I bookmark is irrelevant long before it gets a chance to be broken.
Re:No more broken bookmarks... (Score:3, Insightful)
Re:No more broken bookmarks... (Score:2)
Re: (Score:2)
What if the page is deleted, not changed (Score:4, Insightful)
Re:What if the page is deleted, not changed (Score:2)
Re:What if the page is deleted, not changed (Score:4, Informative)
That's what the 410 Gone [w3.org] HTTP status code is for. If only admins would use it more...
Changed is worse in some situations (Score:3, Interesting)
You can imagine how much the staff enjoy the content on the new page... and the IT Security folks especially as the proxy was suddenly giving them lots of nice warnings about workers' viewing inappropriate conduct (
Take this with a grain of salt (Score:5, Insightful)
Re:Take this with a grain of salt (Score:3, Informative)
Re:Take this with a grain of salt (Score:3, Informative)
Not Entirely New (Score:4, Informative)
Suppose you have broken link http://somesite.com/foo/bar.html, some sites return a list of search results from within 'somesite.com' matching 'foo' or 'bar'. Quite clever, and much more useful than a plain old 'page not found' error.
This just takes that one step further by doing the searching at the referring end instead.
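The site-side trick of turning a dead URL into search terms is easy to sketch in Python. The function name and the splitting rules (drop the file extension, split on anything non-alphanumeric) are assumptions for illustration only.

```python
import re
from urllib.parse import urlparse


def search_terms_from_url(url):
    """Split a broken URL into (host, [search terms taken from the path])."""
    parsed = urlparse(url)
    terms = []
    for segment in parsed.path.split("/"):
        # Drop a trailing file extension such as ".html" before splitting.
        stem = segment.rsplit(".", 1)[0] if "." in segment else segment
        terms.extend(t for t in re.split(r"[^A-Za-z0-9]+", stem) if t)
    return parsed.netloc, terms
```

For the example above this gives the host 'somesite.com' and the terms 'foo' and 'bar', which could then be fed to a site-restricted search.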
The Slashdot use? (Score:3, Interesting)
I decided it'd be too hard for software to decide whether a change was significant. I wonder how this software does it - presumably, you can change the threshold?
More info... (Score:2, Funny)
worrying (Score:5, Insightful)
Only time will tell, I suppose.
Re:worrying (Score:3, Insightful)
The Real Worry (Score:2)
No, the big worry for me is PATENTS. What the hell are they patenting? What is the Big Idea here that deserves a patent? This is scary stuff. What, do we have to find prior art for every stupid idea someone decides to patent? The answer is "yes." We are all out of business if we let this continue. Support the EFF! Kill this stup
Re:The Real Worry (Score:2)
I personally think there's no singular mind at work on it, it's just one IBMer trying to get a patent listed on their resume and their manager trying to look important.
Re:The Real Worry (Score:2)
I personally think there's no singular mind at work on it, it's just one IBMer trying to get a patent listed on their resume and their manager trying to look important.
I actually think it's funny that people will brag about how many patents their division or company received in the last year. After seeing the kinds of crap that get patents over the last 10 years or so, I'm not likely to be impressed, regardless of their numbers. In fact, a high number is more likely to be indicative of a large number o
Can I just have a web site that, you know, works? (Score:2, Insightful)
From bad layout, to missing options, to obscure names for common links, it seems that people are actively trying to hide crap from the end user, making their website utterly worthless.
Can we devise a tool that fixes this problem first?
Re:Can I just have a web site that, you know, work (Score:3)
I don't mind being modded down, but seriously, I am at a loss here.
Well that sounds perfectly dreadful (Score:5, Insightful)
Sounds like a recipe for utter disaster in the worst case, and a source of mildly embarrassing incidents at best.
How about this algorithm just report dead links to a human instead of trying too hard to be clever?
This sounds like someone had to come up with a final project, and settled on this one.
yawn (Score:2, Insightful)
The whole BBC News Technology section reminds me of the 'Tomorrow's World' program when it was in full swing, saying how everything could be 'the next big thing' and that we'd likely see it on shop shelves and in every home 'in a year'. Why do these people never learn that so much of this is just press release bullshi
Re:yawn (Score:2)
Listen, let me explain this in simple terms: BBC News caters to a wide audience made up of mainly lay people and, as such, it pitches its articles accordingly. It's not New Scientist, Nature, The Lancet or whatever academic publication that's on your reading list and it doesn't pretend to be. It doesn't try to blind its readers with science because its readers aren't al
Re:yawn (Score:2)
Re:yawn (Score:2)
grr
And... ? (Score:5, Insightful)
The only parts that seemed worthwhile are replacing the links automatically, and testing if links are relevant.
I'm not so sure I'd trust a computer to do those things though. I'd much rather have the links flagged and checked by a human.
Re:And... ? (Score:4, Informative)
It wouldn't take long to write a script to find all the broken links on a page.
Just use Xenu's Link Checker [snafu.de].
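If you did want to roll your own, the script really is short. Below is a minimal Python sketch using only the standard library. The `get_status` callable is a hypothetical hook that returns an HTTP status code (or None when the host is unreachable); it is passed in so the network part can be swapped out or faked.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def find_broken_links(html, base_url, get_status):
    """Return (url, status) for every http(s) link that fails to load."""
    collector = LinkCollector()
    collector.feed(html)
    broken = []
    for href in collector.links:
        url = urljoin(base_url, href)  # resolve relative links
        if not url.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript: and similar schemes
        status = get_status(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken
```

In real use `get_status` could wrap something like `urllib.request.urlopen` with error handling; a dedicated checker such as Xenu also deals with crawling, retries and reporting, which this sketch does not.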
Re:And... ? (Score:3, Insightful)
CMS (Score:5, Insightful)
The only kind of people who'd go out of their way to use this software probably already use some sort of CMS.
It will work, but that isn't good, here is why (Score:5, Insightful)
If document X moves, and the link is invalid, a search for the link might actually find document X, and therefore, you have your benefit, and you would have saved a 404.
However - if a document becomes deprecated and deleted, then how can you assume the link is valid?
Or indeed, if the document has no relevant substitute.
A genealogy providing a link to another William Wallace wouldn't be good news if the original page went missing.
A better system is automated 404 alerting to the web server's administrator.
A bad link gets hit, bam, what document, from where. You can work things out intelligently, not automatically.
I think this is silly, perhaps grasping at straws. I see no reason why we would replace all our links with Google 'I'm Feeling Lucky' searches, so why do something like this?
This is the essence of what they have, and all they have done is clouded the search IP field (which is important) with 2 more patents, again increasing costs and endangering open source innovation, the true innovative playing field.
Of course, I could be wrong.
Re:It will work, but that isn't good, here is why (Score:2)
Does anyone know the numbers of those patents? I'd actually be interested in reading them... to see if what my company is developing right now will be infringing...
Re:It will work, but that isn't good, here is why (Score:2)
Re:It will work, but that isn't good, here is why (Score:3, Informative)
If document X moves and the link is invalid, you should be serving an HTTP 301 Moved Permanently, and well-behaved user agents will update their bookmarks and well-behaved content management systems will update their code. If document X is gone, you should be serving an HTTP 410 Gone.
Ideally, 404 is supposed to mean that the web server has never heard of the file in question before, but in the real world...
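On the client side, a well-behaved link maintainer could map those status codes to actions along these lines. This is only a sketch: the action names and the function are hypothetical, and treating 404 as "needs human review" rather than "remove" is a judgment call, given how loosely 404 gets used in the real world.

```python
def link_action(status, location=None):
    """Map an HTTP status code to what a link maintainer should do."""
    if status == 301 and location:
        return ("update", location)  # moved permanently: point at the new URL
    if status == 410:
        return ("remove", None)      # gone on purpose: drop the link
    if status == 404:
        return ("review", None)      # ambiguous in practice: ask a human
    return ("keep", None)
```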
Re:It will work, but that isn't good, here is why (Score:3, Interesting)
I, personally, hate dead links with a p
Obligatory RTFA (Score:4, Insightful)
not everything that happens in the world is an attempt by big brother to steer internet traffic to verisign or microsoft.
Hey micheal RTFA (Score:2)
Broken Links No More? "Students in England have developed a tool which could bring the end to broken links. Peridot............Peridot could lead to a world where there are no more broken links,'"
I'll troll to hell for this, but I could care less, and I have no problems standing up for what I say. This is terribly irresponsible journalism. No fucking where in the summary does it mention intranet or corporate websites. A world would be pretty global, would it not? Again, the hea
Not New at All (Score:4, Funny)
Slashdotted (Score:4, Funny)
Damn you!!
And purple hatstands
In other developments.... (Score:3, Interesting)
Each webserver will return a redirect to a google cache lookup for itself if the server load gets too high.
1: Stupid idea
2: Patent
3: Wait 'til someone nudges at your generously worded patent
4: happily license this unrelated technology to keep their VC peeps in the green.
Re:In other developments.... (Score:2)
Simple solution (Score:3, Insightful)
Where script.pl parses the wanted URL and asks an indexing engine to find the most relevant page associated with the query...
Re:Simple solution (Score:2)
That would be very useful if I could persuade everyone I link to to do it. However, since I can't, a solution that runs on the server where the links reside, not the linked content, is much more useful.
Vulnerability? (Score:3, Informative)
Prior Art (Score:2, Insightful)
Rather than finding new relevant links... (Score:2, Insightful)
This would not work with large web sites, but if it is just a link to a how-to guide or something small like that this would work.
How can you tell changed link? (Score:2)
No thanks (Score:3, Insightful)
Sort of a "cannot find hello.jpg, click here to go back to the main page".
My point being, if the document I'm looking for is not there, I want to know it's not there. I don't want to read something else, thinking it's what I meant to read.
Usually when I'm googling around and clicking stuff I'm looking for the answer to some coding or computer related problem. I don't want to click on a link for "configuring Samba 3.0 with AD support", and wind up on a "Configuring Samba 2.2 with LDAP" and waste my time following bad advice.
Re:No thanks (Score:2)
Exactly. If I put up a web page with links to other sites, I want people to email me saying "hey this link is broken".
Does a computer understand satire? If I've linked to a satirical, subtle, pro-abortion piece, is the algorithm smart enough to know that, or will it relink to an anti-abortion site? If I link to a serious anti-war speech mad
German readers... (Score:5, Informative)
http://www-ai.cs.uni-dortmund.de/DOKUMENTE/malzahn_2003a.pdf [uni-dortmund.de]
Basically, the thesis evaluates different methods to build a kind of "finger-print" of a page. The finger print is used to find the page with google if it is gone, or has changed significantly.
The internet wayback machine was used to learn to distinguish pages that have disappeared from pages that change slightly over time.
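One simple way to build such a finger-print is to keep the few terms that are frequent on the page but rare elsewhere, then use them as the search query. The Python sketch below assumes a precomputed document-frequency table (`doc_freq`) and a corpus size (`total_docs`), both hypothetical inputs; it is not one of the methods evaluated in the thesis, just an illustration of the general idea.

```python
import math
from collections import Counter


def fingerprint(page_text, doc_freq, total_docs, size=5):
    """Return the `size` most distinctive terms of a page."""
    tf = Counter(page_text.lower().split())

    def score(term):
        # Frequent on this page, rare in the background corpus.
        return tf[term] * math.log(total_docs / (1 + doc_freq.get(term, 0)))

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(tf, key=lambda term: (-score(term), term))
    return ranked[:size]
```

Joining the returned terms with spaces gives a query that should rank the moved page highly, provided the page text itself has not changed much.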
implications (Score:3, Funny)
Wonder where it would send me if www.hotmail.com were down?
*shudder* [hotmale.com]
(disclaimer: no, I didn't actually look to see what's on that site)
A better solution from the BOFH (Score:3, Insightful)
SED? (Score:3, Informative)
Patents are ridiculous... (Score:2)
I weep for the future of technology if this is what it's gonna come down to.
Instead (Score:3, Informative)
More patents for IBM (Score:2)
More evidence that IBM isn't really committed to an open/free philosophy.
Oh boy... (Score:2)
been there, done that. (Score:5, Informative)
javascript:Qr=document.URL;if(Qr=='about:blank'
Now when I click on a link that isn't there, I select my Archive search button and it shows me the Wayback Machine's history of that link. Of course it works only if the url hasn't been modified by the server. If it has it's another couple steps (copy link, ^T, archive search, paste url in pop-up dialog)
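The bookmarklet above is cut off, so the following Python sketch is not a reconstruction of it, only the same idea under stated assumptions: build the Wayback Machine history URL for a page, and ask for a URL when the current one is blank. The function name and the `ask` parameter are hypothetical.

```python
def wayback_history_url(current_url, ask=input):
    """Build the Wayback Machine capture-history URL for a page."""
    if not current_url or current_url == "about:blank":
        # Nothing useful in the address bar: prompt for the dead link instead.
        current_url = ask("URL to look up: ")
    return "https://web.archive.org/web/*/" + current_url.strip()
```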
Google "I'm Feeling Lucky" (Score:4, Interesting)
e.g:
http://www.google.com/search?hl=en&ie=UTF-8&q=New
And voila, your site will take you to the most popular site related to news for nerds, automagically. If slashdot died one day, another site would take its place in the google rankings. FF.
I prefer my system (Score:2)
Firefox extension (Score:4, Insightful)
On a less related note, I've long been disappointed that some 300 series status codes in HTTP are so under-exploited, both by clients (e.g. automated bookmark management) and people running web sites.
If anybody other than myself changes the links on my site (Score:2)
Comment removed (Score:3)
Xanadu (Score:2)
If broken links are a problem, maybe the html/http pair would be better shaped more according to the original Xanadu project. http://xanadu.com/ [xanadu.com]
Coining a nickname for this technology... (Score:3, Funny)
network down (Score:4, Insightful)
Soon the target network would be back up, but all your links would be lost and randomly changed to something less useful. Good Invention!
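One guard against exactly that failure mode would be to require several failed checks in a row before a link counts as dead. A minimal Python sketch, with a made-up class name and a hypothetical default of three consecutive failures:

```python
from collections import defaultdict


class DeadLinkTracker:
    """Only declare a link dead after several consecutive failed checks."""

    def __init__(self, failures_needed=3):
        self.failures_needed = failures_needed
        self.failures = defaultdict(int)  # url -> consecutive failure count

    def record(self, url, reachable):
        """Record one check and return 'ok', 'retry' or 'dead'."""
        if reachable:
            self.failures[url] = 0  # any success resets the streak
            return "ok"
        self.failures[url] += 1
        if self.failures[url] >= self.failures_needed:
            return "dead"
        return "retry"
```

Spacing the checks hours or days apart would matter as much as the count, since a single outage can easily swallow three quick retries.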
A better idea (Score:2, Insightful)
archive.org [archive.org]
or maybe google cache.
Then of course it has to be smart enough to know it did that and replace the links back with the originals if they come online.
Sometimes "broken links" can recover.
countdown to the first virus... (Score:2)
I believe in dot.com IPOs too (Score:2)
predecessor: robust hyperlinks (Score:3, Informative)
As far as I know, they haven't done any more recent work on this and the software is only available via archive.org [archive.org].
A paper [dlib.org]
I gather that the IBM effort is different in significant respects, but it certainly employs ideas from Phelps & Wilensky.