Millions of Pages Google Hijacked using ODP Feed
Posted by
CmdrTaco
on Wed Mar 23, 2005 10:36 AM
from the well-this-isn't-going-well dept.
The Real Nick W writes "Threadwatch reports that millions of pages are being Google Hijacked using the 302 redirect exploit and the ODP's RDF dump. The problem has been around for a couple of years and is just recently starting to make major headlines. By using the Open Directory's data dump of around 4 million sites, and 302'ing each of those sites, the havoc being wreaked on the Google database could have catastrophic effects for both Google and the websites involved."
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
OMG!!! (Score:1, Funny)
OMG!!! Slashdot's been hijacked too!
Ugh. This is so not true. (Score:2, Informative)
(http://www.google.com/webmasters/)
(Yes, I am GoogleGuy.)
Re:Ugh. This is so not true. (Score:5, Funny)
Re:Ugh. This is so not true. (Score:5, Funny)
(http://theari.com/)
Re:Ugh. This is so not true. (Score:5, Informative)
(http://www.google.com/webmasters/)
Here's the skinny on "302 hijacking" from my point of view, and why you pretty much only hear about it on search engine optimizer sites and webmaster forums. When you see two copies of a url or site (or you see redirects from one site to another), you have to choose a canonical url. There are lots of ways to make that choice, but it often boils down to wanting to choose the url with the most reputation. PageRank is a pretty good proxy for reputation, and incorporating PageRank into the decision for the canonical url helps to choose the right url.
A lot of sites that try to spam search engine indices get caught, and their PageRank goes lower and lower as their reputation suffers. We do a very good job of picking canonical urls for normal sites; sites with their PageRank going toward zero are more likely to have a different canonical url picked, though, and to a webmaster I understand that it can look like "hijacking" even though the base cause is usually your reputation declining. For a long time, it was hard to get anyone to report canonicalization problems, because the site that got "hijacked" would be free-cheap-texas-holdem-plus-viagra-and-payday-lo
But even though I suspected that this issue affected very few sites, we still wanted to collect feedback to see how big of a problem it was, and to see if we could improve our url canonicalization. So starting a while ago, we offered a way to report "302 hijacking" to Google; I mentioned the method on several webmaster forums. You contact user support and use the keyword "canonicalpage" in your report. Then I created a little mailing list with some engineers on it, and user support passes on emails that meet the criteria to the mailing list.
So how many reports has all this work (including posting multiple times on lots of webmaster boards to request data) gotten me? The last time I checked, it was under 30. Not a million pages. Not even a hundred reports. Under 30. Don't get me wrong, we're still looking at how we can do better: one engineer proposed a way that might help these sites, and he's got a testset of sites that would be affected by changes in how we canonicalized urls. A few of us have been looking through it to see if we can improve things, but please know that this is not a wildfire issue that will result in the web melting down.
As a side note, I'm getting a little tired of debunking the source of this story (NickW at threadwatch). For example, he claimed that Google had removed Greg Duffy from Google's index. When I pointed out that he was making an assertion of fact without evidence, he started out revising the story by sprinkling in words like "appears" and eventually pulled the story at http://www.threadwatch.org/node/1822 off his front page. But given that this is the third link to NickW's site from Slashdot in the last couple weeks, I'm guessing that he's tasted the Slashdot effect and wants more.
Re:Ugh. This is so not true. (Score:5, Insightful)
(http://www.dynamoo.com/)
Well shucks GG, not every webmaster is glued to WMW and other forums.. and even if they were, the signal/noise ratio on this topic is so low that you probably couldn't find the information even if you were looking. It's hardly an obvious reporting mechanism. Although posting it on /. should help some, so that's appreciated. Thanks.
But look - what we have here are a whole bunch of webmasters who have been nuked off the face of the earth by 302 redirects and just don't have the technical knowledge to try and fix it. Mom and Pop stores, hobbyists, nonprofits etc etc. These people are just gonna get pasted.. they'll just be wondering why they don't get any visitors any more.
This is a HUGELY serious problem - and it's getting worse all the time as more and more people deliberately try to exploit the 302 bug. I've been hit by this bug myself, and let me tell you that unless you know EXACTLY what to look for you'd be stuffed - all you'd see is your traffic flatlining.
The key issue here - and it's the kind of issue that will really, really hit the headlines when it's exploited - is redirection. Sure, I can use a 302 and send Googlebot to the correct page.. so first of all I basically 0wn the content of that page, not the publisher. *Then* I insert an exploit into the 302 redirect.. and hey presto, I've 0wned hundreds of thousands if not millions of computers. *That's* going to make unpleasant reading for Google when it hits the headlines - "Use Google and Get Owned". Nasty.
Kindly extract your head from wherever it is (Score:5, Informative)
(Last Journal: Thursday March 31 2005, @11:31AM)
What it needs is a rapid and satisfactory answer or Google will find themselves at the receiving end of more angst than they even know is possible.
A concrete example. My company's web site has been in existence since 1995. So we have pretty good page ranking. Our main page has one phrase, very distinct, unique.
When I search for this phrase (in quotes), Google reports hundreds of matches. These sites (except our own) do not contain the phrase but are sites that sell traffic boosting.
The 302 problem is real.
Incidentally, I just spent 15 minutes at Google.com looking for a way to report the problem. Where is that mention of "canonicalpage"? In the bottom shelf of a filing cabinet, behind a locked door that says "beware of the tiger"?
I'm not surprised you got only 30 reports. What I am surprised at is that you appear to speak for Google yet have such an inane response to what is a real (and for many people, a terrifying) problem.
Re:OK, an example (Score:4, Informative)
(http://www.google.com/webmasters/)
- for the search imatix [google.com] I see you at number one.
- for the search "Strategic solutions for a complex world" [google.com] I see you at number one.
- for the search allinurl:imatix.com [google.com], that search (and its sister operator inurl:) only looks for the words in the url. So it's perfectly fine to show results like "real-imatix.com/" because they contain the word imatix. These results are not hijacking results--this is expected behavior for inurl and allinurl.
Hope this helps,
GoogleGuy
Re:Ugh. This is so not true. (Score:5, Informative)
But even though I suspected that this issue affected very few sites, we still wanted to collect feedback to see how big of a problem it was, and to see if we could improve our url canonicalization. So starting a while ago, we offered a way to report "302 hijacking" to Google; I mentioned the method on several webmaster forums. You contact user support and use the keyword "canonicalpage" in your report.
I'm sorry, but this is a flat-out lie. If you are the GoogleGuy, then there were 1000+ post threads on WebmasterWorld where people were begging you for input, and you essentially disappeared. I think I might remember seeing one post from you about this "canonicalurl" on a short, almost unrelated thread. You certainly didn't make it clear where to send problem reports, at least not on any of the threads that people were actually reading.
The fact is, this is a huge problem, and has totally fucked a lot of legitimate site rankings. I honestly believe Google was doing everything in their power to ignore the problem up until now, hoping that it was just a figment of people's imagination, or worse, that it would help increase advertising revenue. And now that it's turning out to be a PR disaster for you, you're in damage control mode.
I run one of the sites that was affected by the 302 bug. I sent a message to Google about it, and got a canned response essentially telling me there was nothing wrong. I read through no less than 10 threads on WebmasterWorld about this, many with hundreds or even thousands of posts. I saw maybe, maybe, two or three from GoogleGuy. Where were you? Did you somehow miss those threads that spanned 80+ pages??? Why weren't you posting on those threads about this "canonicalurl" thing?
Luckily there was only one site 302-ing me, and they were doing it by accident and were happy to remove me from their directory. Now I'm back up at the top of the rankings. But I know it's going to be nowhere near as easy for many of the thousands of people who are still affected by this.
Seriously, that you would come on here and try to discredit someone for bringing attention to a very big problem with Google is pretty distasteful. To me it indicates either a cover-up or having your head buried firmly in the sand. Either way, it doesn't bode well for the future of Google. Instead of flaming people now that the problem is getting mainstream press, why not try and actually fix things?
Re:Ugh. This is so not true. (Score:5, Interesting)
(http://www.pobox.com/~meta/ | Last Journal: Sunday February 29 2004, @09:19AM)
Google has login accounts, so let logged-in users have a link saying "report spam site". Track who files the most reliable reports, and if a few of those people all agree that a site is spam, nuke its pagerank.
See how OpenRatings does reliability calculations for more info. Or buy them
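The reputation-weighted reporting idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the thresholds, the `history` structure, and the function names are all hypothetical, and nothing here reflects how Google or OpenRatings actually compute reliability.

```python
from collections import defaultdict

# Hypothetical thresholds, chosen only for illustration.
TRUSTED_ACCURACY = 0.9   # share of a user's past reports that were confirmed
MIN_PAST_REPORTS = 10    # ignore users with too little history
AGREEMENT_NEEDED = 3     # distinct trusted reporters who must flag the same site


def is_trusted(confirmed, total):
    """A reporter is trusted if enough of their past reports were confirmed."""
    return total >= MIN_PAST_REPORTS and confirmed / total >= TRUSTED_ACCURACY


def sites_to_demote(reports, history):
    """reports: iterable of (user, site); history: user -> (confirmed, total)."""
    trusted_flags = defaultdict(set)
    for user, site in reports:
        confirmed, total = history.get(user, (0, 0))
        if is_trusted(confirmed, total):
            trusted_flags[site].add(user)  # a set, so repeat reports count once
    return {site for site, users in trusted_flags.items()
            if len(users) >= AGREEMENT_NEEDED}
```

Counting distinct trusted users rather than raw reports means one account cannot push a site over the threshold by reporting it repeatedly.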
Re:Ugh. This is so not true. (Score:4, Insightful)
As an alternative, I'd love a cookie-based version of this where you could click "ignore all results from this domain". After a couple of weeks you'd get rid of most of them on your personal browser. Make the lists sharable even. All the pagerank wannabes can do is start from scratch with new URLs.
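A minimal sketch of such a personal ignore list, assuming the blocked domains have already been read out of the cookie into a set. The function name and the choice to block subdomains as well are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def filter_results(result_urls, ignored_domains):
    """Drop results whose host is an ignored domain or a subdomain of one."""
    kept = []
    for url in result_urls:
        host = (urlsplit(url).hostname or "").lower()
        blocked = any(host == d or host.endswith("." + d)
                      for d in ignored_domains)
        if not blocked:
            kept.append(url)
    return kept
```

Matching on `"." + d` rather than a bare suffix keeps an unrelated host such as `notspam.example` from being caught by an ignore entry for `spam.example`.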
OK, I'll bite ... (Score:4, Insightful)
However, if this is Google's PR method, I think you are kind of asking for it! In the absence of information, the internet community will speculate until the cows come home. I'm not saying it's right, I'm just saying that's reality. Even though I said on my site that I thought Google didn't do anything underhanded I bet a lot of people were still not convinced. Google can do a little better than this, and although you have been fairly nice to me (thanks) this response is a little flamebaity for PR. Please understand that I mean no offense, it's just constructive criticism. Even if everything you say is true, a representative of the company should always at least attempt to sugar coat something like your last paragraph.
Also, on a more personal note, maybe Google should embrace the people that are involved [clsc.net] in researching [gregduffy.com] these problems instead of using this broken communications policy. I know that in my case I contacted you guys 5 *months* ago about the Google Print problem I described and never got any followup except for my t-shirt (which I really like). I have some great ideas about possible solutions to the problem I described, and as far as I can see Google has not fixed the root of the problem. When are you guys going to contact me?
-Greg Duffy
Re:Ugh. This is so not true. (Score:4, Informative)
(http://www.google.com/webmasters/)
Robot.txt (Score:3, Insightful)
Re:Robot.txt (Score:5, Informative)
No, it's not about redirecting the user... (Score:5, Informative)
(Last Journal: Thursday March 31 2005, @11:31AM)
For instance: I have a site with excellent page ranking. Now a new site will set up, and do a 302 to my site. Google now gives this new site my page ranking. When the new site is indexed, it removes the 302 redirection.
When you search for my site, you now find these new sites instead. There is no redirection when you click on a link, but the "cached text" that Google shows is wrong.
Basically this technique allows people to get high page rankings without earning them. It's very widespread - I counted over 60 such parasites for my company's web site (which has excellent page ranking).
Re:Robot.txt (Score:5, Informative)
(http://pluggo.net/)
Site A can return a 302 HTTP redirect to site B when Googlebot crawls their site. The googlebot will then index site B as site A. Site A could have no affiliation whatsoever with Site B; people could be clicking on SesameStreet.com and get AsianHookers.com, etc.
I do think the figure of millions of pages being hijacked is a little steep, though.
Re:Robot.txt (Score:5, Insightful)
(http://www.ilikepuffynipples.com/)
Why? It can be completely automated. A million is no harder than four.
Re:Robot.txt (Score:5, Informative)
(http://www.ilikepuffynipples.com/)
This isn't about fooling people, it's about fooling a flawed technology to get false listings in the search engine results pages. It's about getting a lot of traffic. Yes, some people will be really pissed off when they get redirected to an affiliate program or something of the sort, but some small percentage of people will buy. If the cost to bring in a million visitors is minuscule because you're stealing search engine placement, and you get 50 people to sign up to something that pays you $50 a person, then you're up $2500 minus your hosting costs.
$2500 to someone in Malaysia is a lot of dough for a little coding... they could work for $200/mo in some kind of outsourcing plan or make a year's wages in their spare time. What do you think they're going to do?
Re:Robot.txt (Score:5, Informative)
Aside from a filter on Google's end to resolve this, it would be nice if the practice of using 302 redirects also included a means of confirming the setup on the site being redirected to. If the site actually hosting the data does not in some way confirm the redirection, either through a tag in the header of the HTML or perhaps in a third, predictably placed file (much like a robots.txt file), the redirect would simply not be honored. Of course, this would first require the standard to be rewritten, and then would require people to actually abide by it.
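The confirmation idea above could look something like the sketch below. No such standard exists: the `redirects.txt` file, its one-URL-per-line format, and both function names are invented here purely to illustrate the proposal.

```python
def parse_allowed_sources(redirects_txt):
    """Parse a hypothetical 'redirects.txt': one allowed source URL per line."""
    allowed = set()
    for line in redirects_txt.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comment lines
            allowed.add(line)
    return allowed


def url_to_index(source_url, target_url, target_redirects_txt):
    """Return the URL a crawler would file the fetched content under."""
    if source_url in parse_allowed_sources(target_redirects_txt):
        return source_url   # target confirmed the temporary redirect
    return target_url       # unconfirmed: credit the site that hosts the content
```

Under this scheme an unconfirmed 302 from a stranger's site would fall back to the hosting site's own URL, while a site redirecting within its own pages could still list those sources and keep the normal 302 behavior.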
Re:RTFA (Score:5, Insightful)
(http://dotfuturemanifesto.blogspot.com/)
The article is confused and badly written. It does not explain the exploit being used ever. So stop dumping on people. It is not at all surprising that people don't get what is going on when the description is crud.
What is really going on has nothing to do with 302, or at least very little. What these people are doing is to set up fake web sites using content filched from genuine Web sites. This allows (or is believed to allow) them to climb the google rankings.
I don't see why someone would use a 302 response when they can just copy the entire content unless there is some sort of bug in Google's pagerank that is not being explained. Copying the entire content is much simpler.
So what the attacker does is to set up their site so that when the googlebot comes round it publishes some legitimate content, then when other folk follow the site from a google search they get pages infested with spyware or the like.
This would certainly explain the number of times I have done a Google search and ended up at an idiotic 'search site' that does nothing for me.
Re:RTFA (Score:5, Informative)
(http://127.0.0.1/)
No, the way it works is with the 302, but only for the googlebot.
For this to work the scammer has to give the 302 only to the googlebot, all other browsers need to get the content of the scammer's page. If you google for "cheapest car insurance" (IIRC) you can find an example of this. Change your User Agent accordingly and click on the top Google link, you'll end up at another site. Change back to Mozilla and you'll get the scammer's site.
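The manual check described above (request the same URL as Googlebot and as a normal browser, then compare) can be sketched as below. The `fetch` argument is a hypothetical helper you would supply yourself: it takes a URL and a User-Agent string, does not follow redirects, and returns a `(status, location)` pair. The browser User-Agent value is just a placeholder.

```python
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0"  # placeholder for any ordinary browser string


def looks_cloaked(fetch, url):
    """True if Googlebot gets a 302 while a browser gets a different answer."""
    bot_response = fetch(url, GOOGLEBOT_UA)
    browser_response = fetch(url, BROWSER_UA)
    bot_status = bot_response[0]
    return bot_status == 302 and bot_response != browser_response
```

A site that redirects every visitor the same way is not flagged; only the case where the crawler is singled out for the 302 is. A scammer who keys on Googlebot's IP addresses rather than its User-Agent would not be caught by this check.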
Re:Robot.txt (Score:5, Funny)
(http://home.roadrunner.com/~jmattclark | Last Journal: Saturday October 27, @12:08PM)
couldn't you have made that a link so I can just click on it?
Re:Robot.txt (Score:5, Informative)
(http://slashdot.org/)
A 302 is a "temporary redirect". Basically, it says that the content normally lives at the URL you requested but that, just this once, you should look at this other URL for the content. Google's response to a 302 is actually very reasonable. I suppose the best thing they could do is just not follow 302s.
A 301 is a permanent redirect, indicating that the page isn't at the original URL and that all future requests should be made to the new one. I don't know what Googlebot does in this case but I assume it discards the original URL, which is what the standard recommends.
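As a rough sketch of the distinction just described, here is how a spec-following crawler might decide which URL to file a page under. This illustrates the 301/302 semantics only; it is an assumption about sensible crawler behavior, not Googlebot's actual logic, and `index_key` is a made-up name.

```python
def index_key(requested_url, status, location):
    """Pick the URL that fetched content should be indexed under."""
    if status == 301 and location:
        return location        # permanent move: the new URL replaces the old one
    if status == 302 and location:
        return requested_url   # temporary: keep the URL that was asked for
    return requested_url       # no redirect: the page is indexed as requested
```

The 302 branch is the one the hijack leans on: the content comes from `location`, but the index entry keeps `requested_url`.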
Re:Robot.txt (Score:5, Informative)
(Last Journal: Friday May 06 2005, @07:02PM)
(Sorry for dumbing down my post so much, too much experience explaining things to my grandmother)
I've had it with Google! (Score:5, Funny)
*duck*
Re:Google Cookie last until 2038! (Score:5, Funny)
Easy to prosecute, hmmm? (Score:5, Interesting)
(Last Journal: Friday May 05 2006, @11:53PM)
site exists with behavior dependent on browser name being GoogleBot or not. The replacement site will generally have some way of making money, which can be tracked via financial transactions.
Re:Easy to prosecute, hmmm? (Score:5, Insightful)
Law of the Internet (Score:5, Insightful)
(http://geexology.org/ | Last Journal: Tuesday October 11 2005, @07:25PM)
302 (Score:5, Informative)
Re:302 (Score:5, Informative)
(http://thesmithfam.org/blog/)
Re-re-explained (Score:5, Informative)
(http://www.snowplow.org/martin/)
302 redirections are temporary redirections - the idea is that a 302 is supposed to be used when someone needs to be redirected to a new page, but should still use the original URL if they want to come back later. As an example, the page http://purl.oclc.org/OCLC/PURL/CONTRIBUTORS [oclc.org] performs a 302 redirect to http://purl.oclc.org/docs/contributors.html [oclc.org]. This means that although your web browser needs to go to some other URL for the content at the moment, they really should remember the first url as the permanent one.
Contrast this with what happens when your browser visits http://snowplow.org/martin [snowplow.org] - you get sent a 301 redirect to http://snowplow.org/martin/ [snowplow.org]. (Note the extra slash) In this case, the server is saying "the url with the slash on the end is the real location, and you should not try to come back here without the final slash in the future."
Ideally, if every web browser behaved according to spec., bookmarks (remember bookmarks?) would get automatically updated to the new URL when you selected them and the redirect was a 301 redirect. However, for a 302 redirect, the bookmark would stay as is.
302 redirects can be very useful when you want to set up a hierarchy of "logical" URLs that will permanently point to the correct location. 301 redirects are useful when you're obsoleting an old URL and wish people to go and use the new URL from now on.
Okay, so how does this relate to google? Well, let's suppose that you have a great site on fruitbats. I can set up http://www.example.com/topics/fruitbats to be a 302-style redirect to your site, essentially saying "The information at http://www.example.com/topics/fruitbats is temporarily being hosted by http://www.yoursite.com/". Now, when google spiders pages it will see that, go retrieve the text from your page, and then index it under http://www.example.com/topics/fruitbats, since after all I just gave a temporary (302) redirect.
But it gets worse, because a final part of google's indexing process is to compare pages for identical text, and throw out all but one of the URLs. Apparently this stage has nothing to go on other than the text and the recorded URLs, and so your URL stands a fifty-fifty chance of being thrown out.
Except that I've not just redirected http://www.example.com/topics/fruitbats to your site, but also http://www.example.com/topics/fruitbat, http://www.example.com/topics/fruit_bat, and http://www.example.com/topics/fruit_bats. Now your lone URL doesn't stand much of a chance of being the one kept by the "throw out duplicates" processor, does it?
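To put numbers on the odds described above, here is a tiny sketch. It assumes the duplicate-removal step picks one surviving URL uniformly at random, which is this thread's simplification rather than anything known about Google's process (GoogleGuy's comment says PageRank feeds into the choice).

```python
from fractions import Fraction


def survival_chance(hijacking_urls):
    """Chance the original URL is the one kept, if the pick is uniform at random."""
    return Fraction(1, hijacking_urls + 1)
```

With one redirecting URL the original has a 1/2 chance of surviving, matching the "fifty-fifty" above; with the four fruitbat URLs it drops to 1/5.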
In a sense, of course, there's little google can do to prevent this, because even if they weighted 302-redirects lower in their "throw out duplicates" stage, I could always just go snag a copy of your website each time googlebot visits, in essence doing the redirection myself. (How? Just search the apache mod_rewrite guide [apache.org] for "Dynamic Mirror") However, doing it through 302 redirects means that google pays for the bandwidth to go get your page, not me. (Not that this is necessarily a significant amount of bandwidth, since we're only talking about basic google here and not images. Depending on the revenue you get by misdirecting google queries it might be economical)
Of course, for this to really work, I'd need a list of websites sorted by category to build up my redirect db. But wait! The ODP feed provides exactly that.
I am a little bit wary of doi
Re:302 (Score:5, Interesting)
Although, they could probably still figure out it's google by their IP, but it's a step in the right direction.
Re:302 (Score:5, Informative)
Re:But what's the point? (Score:4, Informative)
301 redirects (Score:3, Interesting)
I noticed in my logs that search engines have repeatedly requested the 301 pages, but often don't follow the links to the new pages. And when searched with google, the pages still show up with the old urls. Should I be using 302 redirects instead?
Wrong (Score:5, Informative)
(http://www.ilikepuffynipples.com/)
This is why the "302 hack" works. If the redirect is only supposed to be temporary, the search engine keeps the URL of the 302 as the URL for the document, but indexes the content of the page to which the redirect is directed.
301 is what you should be using to point the SEs to your new pages if you've moved them. The behavior is supposed to be for the SEs to replace the old URL in their index with the new one, and furthermore count all links to the 301ed URL as being towards the new one. I don't know why it's not working for the grandparent poster, but it's the way that the functionality is "advertised" for Google and Yahoo, and it should work.
Why? (Score:2, Insightful)
(http://www.voidone.com/)
"Oh! Look! Something beautiful! Something impressive! I must destroy it!"
pah. feeling jaded today, i guess.
Do what I'm going to do... (Score:4, Insightful)