Slashdot
Follow Up on Google Favoring Yahoo
Posted by
CmdrTaco
on Thu Sep 14, 2000 01:44 PM
from the heres-the-skinny-dept dept.
After yesterday's story about Google favoring Yahoo links, I got word from Sergey Brin of Google. He says the tested site ranked so poorly because a robots.txt file prevented Google's crawler from fully indexing it. The robots.txt file has since disappeared, and the next index should show a change in the rankings.
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.

How Google indexed even the excluded parts (Score:3)
If robots.txt was there, how did Google index the site at all (instead of just poorly)?
<O
( \
XGNOME vs. KDE: the game! [8m.com]
Don't know what robots.txt is? (Score:3)
Morons, all of 'em. (Score:5)
Google hears about it via Slashdot, and in less than 24 hours, the real reason is revealed.
Kinda makes me wonder at humanity, when we're all so locked into our own little shells that we occupy ourselves trying to prove something that five minutes of talking could solve. Sort of like how most Americans never say hello to their neighbor, and can live next to them for years without ever exchanging niceties.
Robot Exclusion Protocol (Score:3)
The robot exclusion protocol (http://info.webcrawler.com/mak/projects/robots/norobots.html [webcrawler.com]) is a way for websites to tell robots what they shouldn't be crawling. When a robot wants to crawl http://foo.bar.com/ it will first fetch http://foo.bar.com/robots.txt. If that file does NOT exist, that is taken to mean implicit permission to crawl anything it can find on that site. If it does exist, then the patterns contained in it are used to restrict what portions of that site are crawled. Every site has its own robots.txt (or lack thereof). To look at Yahoo's robots.txt, just point your browser to http://www.yahoo.com/robots.txt [yahoo.com].
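The check described above can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the robots.txt body, the /private/ rule and the bot name "ExampleBot" are assumptions made up for the example, and the file is parsed from a string rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the /private/ rule is an assumption
# chosen for illustration, not any real site's file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://foo.bar.com/index.html",
                "http://foo.bar.com/private/notes.html"):
        print(url, may_crawl(ROBOTS_TXT, "ExampleBot", url))
```

A polite crawler would run a check like this before every request to a host and skip any URL for which it returns False.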
If a site has a robots.txt that is telling the robots not to crawl, they have no business yelling at search engines when their pages don't show up.
ok now say you're sorry everyone (Score:3)
----------
Geeks make mistakes to!
Re:robots.txt ? (Score:5)
--
Reasonable expectation of privacy? (Score:4)
As for searching beyond the request of robots.txt files and _really aggressively_ searching, that strikes me as being something of a different issue. It seems to me that robots.txt is more of a practical and protective issue than it is one of privacy. It's more of a request not to bother you than it is a request for privacy, at least in my opinion. Also, failure to adequately process and obey robots.txt can easily be the fault of programming error or ignorance, not necessarily a willful or particularly unreasonable act--one need not necessarily take special measures to circumvent its intention.
This is not to say that I can't sympathize with parties that get hammered by such spiders, but I don't believe the privacy argument per se holds any water. I see legitimate complaints on both sides of the issue. For instance, let's say you're a software company and you find a LINKED and self-proclaimed warez page, but the hosting site doesn't allow spidering. Is that still so criminal? Even if the desire is to simply catalogue and document all of it?
Partial retraction from MedWebPlus (Score:5)
http://www.lib.uiowa.edu/hardin/md/notes7a.html [uiowa.edu]
Explanation why robots.txt file affects ordering (Score:4)
It's actually pretty simple, really. The reason the site in question would have plummeted is that as Google is updating its stats, it probably makes some allowances for screwups and inability to reach a given site. However, after a time, the fact that Google was not allowed to search the page must have some sort of impact, and probably an exponential one. "OK, not here, probably a screw up, but we can't verify the search terms will be there" happened at the beginning and eventually, as it aged out of relevance, it became "Well, lots of people think this page is good, but it's just not there!" from Google's perspective.
That makes sense.
Now, we know Google weights other sites by the weight of the sites that link to them. As the original directory started sliding, anything it linked to started sliding as well. Which means Yahoo! fills the void. Particularly in such a specialized example where your likelihood of getting a good match is based on a few key sites.
--
Ben Kosse
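The rank-propagation argument in the comment above can be illustrated with a toy link graph. This is a simplified PageRank-style power iteration, not Google's actual ranking code; the page names, the damping factor of 0.85 and the iteration count are all assumptions for the sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank: links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graphs: three pages vouch for a directory, which links to a
# small site. In the second graph two of those inbound links are gone.
before = {"fan1": ["directory"], "fan2": ["directory"], "fan3": ["directory"],
          "directory": ["small_site"], "small_site": []}
after = {"fan1": ["directory"], "fan2": [], "fan3": [],
         "directory": ["small_site"], "small_site": []}

ranks_before = pagerank(before)
ranks_after = pagerank(after)

if __name__ == "__main__":
    for page in ("directory", "small_site"):
        print(page, round(ranks_before[page], 3), round(ranks_after[page], 3))
```

In this toy model the small site never changes, yet its score drops once the directory that links to it loses standing, which is the effect the comment describes.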
Re:robots.txt (Score:3)
The Implications being... (Score:3)
After all, if there was no "crime" to complain about, and any "damage" was done by themselves to themselves, this never merited one story let alone two.
Since no lawyers were involved, it's not a case where "the lawyers won" (as is often seen in big, bloody trials); instead, it could be said that "the journalists won," as they got a bunch of blather out of no real story.
Re:So what about yahoo? (Score:3)
Re:robots.txt (Score:3)
See this link [slashdot.org] for more information.
--
what good is a robots.txt nowadays... (Score:4)
User-agent: *
Disallow:
I continue to receive spidering from companies such as NetCurrents and Cyveillance because it is easy to ignore robots.txt. Rude, yes -- but easy. It is also easy for me to deny access via Apache, but bots from the above-mentioned companies continue aggressive spidering.
It seems that standards (such as those for robots.txt) are useless, particularly for companies who spider the Net in search of copyright/trademark violations.
Granted, some companies have an interest in policing their products, but when do they go too far? Wouldn't deliberate/aggressive spidering into areas of my site where I have instituted restrictions/blocking constitute some sort of invasion of privacy? If a government entity is doing the spidering, wouldn't a search warrant be required?
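For reference, here is how a compliant parser reads a two-line record like the one quoted at the top of this comment. Under the exclusion protocol an empty Disallow value places no restriction, whereas "Disallow: /" excludes the whole site. The sketch uses Python's standard urllib.robotparser; the host and the bot name "ExampleBot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def site_is_open(robots_txt, url="http://www.example.com/any/page.html"):
    """Return True if the robots.txt text lets a generic bot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", url)


# Empty value: nothing is disallowed, so every path may be crawled.
EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"
# A single slash: every path on the site is off limits.
SLASH_DISALLOW = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    print("Disallow:   ->", site_is_open(EMPTY_DISALLOW))
    print("Disallow: / ->", site_is_open(SLASH_DISALLOW))
```

Either way the file is only a request. As the comment notes, a spider that chooses to ignore it has to be stopped at the server.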