Microsoft Businesses Google The Internet

Is Microsoft Crawling Google? 480

triplecoil writes "Jason Dowdell over at WebProNews has written a piece questioning a tactic Microsoft might be using to beef up its new search engine. He thinks they might be dipping into Google's results to supplement its own. Dowdell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own."
  • by garcia ( 6573 ) * on Thursday November 11, 2004 @03:37PM (#10790801)
    Has anyone out there seen similar behavior on their own sites? Please comment with your qualitative/objective data if so.

    Sure, I see crawlers on my site all the time, sometimes hitting the same URL over and over again. Do I understand their repetitive behavior? No. Do I care what they are doing? No, as long as they are obeying my robots.txt.

    I have complained before about MSNbot ignoring changes to robots.txt while Google happily changed its habbits (I can't find the link sorry). My recent fighting with Googlebot has come to a head when I had to disallow them access to my gallery completely because they refused to honor anything except Disallow: /. I had to go so far as to point Googlebot at my robots.txt and tell it to remove all the previous links. It was rather annoying dealing with support via email from Googlebot as they have apparently taken on the stance of "we don't care but you should put meta tags in all your files so that we don't index those pages." Umm, you are crawling MY site for YOUR profit, you do as I say, not the other way around.

    Do I care if MSNbot is crawling Google and then finding sites and links to search? No as it's none of OUR concern. What is OUR concern is our own robots.txt and how the spiders interact with our sites through that file. Let Google deal with Microsoft/MSNbot if that's what needs to be done but don't concern yourself with it otherwise.
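As an illustration of what "obeying robots.txt" means in practice, here is a minimal sketch of the check a well-behaved crawler could run before requesting a page. It uses Python's standard `urllib.robotparser`; the robots.txt contents, the paths, and the helper names `build_parser` and `allowed` are made up for this example and are not taken from any site or bot discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of /gallery/,
# and keep every other crawler out of /private/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /gallery/

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the parsed rules."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent in ("Googlebot", "msnbot"):
        for path in ("/gallery/photo1.jpg", "/private/notes.html", "/index.html"):
            print(agent, path, allowed(rules, agent, path))
```

Nothing in the file enforces anything: the crawler has to choose to run a check like this and respect the answer, which is the point of contention in this thread.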
    • by TheAmazingBob ( 801587 ) <JRogers@TWNCommunications.Net> on Thursday November 11, 2004 @03:43PM (#10790898) Homepage
      "Google happily changed its habbits..."

      Google is Catholic?
    • Comment removed (Score:4, Insightful)

      by account_deleted ( 4530225 ) on Thursday November 11, 2004 @03:45PM (#10790921)
      Comment removed based on user account deletion
      • Now if they are DoSing you then you have a valid complaint but robots.txt is just there as a friendly suggestion.

        Crawling a gallery of images (and all image property links as well) all day for several days might be considered "DoSing"; I consider it rude.

        You're right, they don't have to obey the robots.txt but they should when they say they will.
      • by mollymoo ( 202721 ) on Thursday November 11, 2004 @03:59PM (#10791111) Journal
        No offense dude, but you are the one who put the site out there publicly. Now if they are DoSing you then you have a valid complaint but robots.txt is just there as a friendly suggestion.

        There's more to it than that. Google caches your pages and makes that cache of your copyright material available. Arguably if you have used your robots.txt file to tell it not to index (and therefore cache) your pages and it still does they are breaching copyright. OK, the Google cache is the world's largest breach of copyright anyway, but if you have told its spider not to index and it does regardless, that's a different ballgame.

        Putting it out there on the web does not give anyone the right to do with it as they please.

        • by liquidsin ( 398151 ) on Thursday November 11, 2004 @04:13PM (#10791307) Homepage
          Hmmm...let's call "robots.txt" a "copyright control device" in that it states who may and may not have access to my copyrighted images directory. I'd bet a DMCA suit or two for circumventing your copyright control device would get them to pay attention...
        • by ad0gg ( 594412 ) on Thursday November 11, 2004 @05:33PM (#10792199)
          If you don't want your site indexed or cached by Google, go here and follow the directions.

          Remove yourself from google [google.com]

          "Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, your webmaster must first insert the appropriate meta tags into the page's HTML code. "
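The "meta tags" in the quoted note are presumably the robots meta tags (for example `noindex` or `noarchive`). As a rough sketch of how an indexer could look for them, the snippet below collects the directives from a page using Python's standard `html.parser`. The class and function names are hypothetical, and this is not a description of how Google's own system works.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="NOINDEX, noarchive"></head></html>'
    print(sorted(robots_directives(page)))
```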

          • by mollymoo ( 202721 ) on Thursday November 11, 2004 @07:41PM (#10793597) Journal
            If you don't want your site indexed or cached by Google, go here and follow the directions.

            I shouldn't need to go and fill out some form for every search engine to protect my rights. One accepted standard way to say "do not index this" should be sufficient. This is an automated system. There is an accepted automated method to stop crawlers indexing your site (robots.txt). If they (Google or anyone else) take your copyrighted content and reproduce it automatically when their automatic system could have automatically respected your explicitly stated and legally protected rights, they are knowingly committing a flagrant copyright violation.

      • If you don't want people (or bots) viewing it then password protect it or take it off the public interweb.

        Interweb? Is that the same as the 'Information superhighway'?
    • Sure, I see crawlers on my site all the time, sometimes hitting the same URL over and over again. Do I understand their repetitive behavior? No.

      Google gives a partial answer to this on their GoogleBot page [google.com]:

      In general, Googlebot should only download one copy of each file from your site during a given crawl. Occasionally the crawler is stopped and restarted, and it may recrawl pages that it has recently retrieved. These recrawls should happen infrequently.

      If they're playing around with new indexing alg

    • As far as I see, MSNBot is behaving itself whilst Googlebot is hungriest - (much as I hate to stick up for Microsoft).

      Robot                        Hits   Bandwidth   Last visit
      Googlebot (Google)             74   945.51 KB   11 Nov 2004 - 03:02
      Netcraft Web Server Survey     13   0           10 Nov 2004 - 23:48
      Mirago                          6   76.44 KB    02 Nov 2004 - 04:13
      MSNBot                          6   76.44 KB    05 Nov 2004 - 05:58

      It's interesting that Mirago and MSNBot have taken exactly the same bandwidth in the same amount of visits. Are MS innov^H^H^H^H^H buying new technology again?

      Bob
  • by Anonymous Coward on Thursday November 11, 2004 @03:37PM (#10790807)
    All Google has to do is run some unusual queries through MSN, check their logs, find the IP addresses and block them.
  • by winkydink ( 650484 ) * <sv.dude@gmail.com> on Thursday November 11, 2004 @03:38PM (#10790824) Homepage Journal
    If so, they have legal remedies.

    If not, it's called doing business and gaining an advantage any legitimate way that you can.

    I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.

    • Does it violate Google's Terms of Service? If so, they have legal remedies.
      If not, it's called doing business and gaining an advantage any legitimate way that you can.
      I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.


      If I copy your work and take credit for it, does it violate your terms of service? If so, you have legal remedies. If not, it's called doing business and gaining an advantage any legitimat
      • In the case of a listing of pages on the internet, my guess is that it would be considered akin to the data in the phone book, which was recently ruled not subject to protection by copyright.

        But, I am not a judge. Or a lawyer. And I expect that if Google litigated here, they would be setting precedent.

      • If you copy his work without permission, you've already committed copyright infringement -- so yes, you violate the TOS by default.

        Comparing this to the MS/Google situation is not the same so the grandparent post still stands.
    • by TheRaven64 ( 641858 ) on Thursday November 11, 2004 @03:58PM (#10791102) Journal
      Do Google's terms of service have any legal standing? Click-through EULAs don't in many jurisdictions, and I don't remember ever even seeing Google's ToS, let alone agreeing to them.
    • by nick13245 ( 681899 ) on Thursday November 11, 2004 @04:22PM (#10791413)
      Yes it does.
      From Google's Privacy Center (http://www.google.com/terms_of_service.html):

      Personal Use Only

      The Google Services are made available for your personal, non-commercial use only. You may not use the Google Services to sell a product or service, or to increase traffic to your Web site for commercial reasons, such as advertising sales. You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not "meta-search" Google. If you want to make commercial use of the Google Services, you must enter into an agreement with Google to do so in advance. Please contact us for more information.
  • Yea, and (Score:5, Funny)

    by BrianGa ( 536442 ) on Thursday November 11, 2004 @03:38PM (#10790830)
    The new search engine's name will be Mooglesoft.
    • by Moth7 ( 699815 )
      They're taking on Square Enix too? o.0
    • Spork or foon? (Score:2, Offtopic)

      by 3770 ( 560838 )

      So, what name do you favor for the combined fork and spoon utensil?

      Spork or foon?
    • Re:Yea, and (Score:5, Funny)

      by MooseByte ( 751829 ) on Thursday November 11, 2004 @03:46PM (#10790927)

      "The new search engine's name will be Mooglesoft."

      Which will subsequently be sued by SCOogle, the latest startup from The Canopy Group, after announcing they purchased the rights to the Internet in a complex transaction which is documented in a briefcase somewhere in Germany.

    • Re:Yea, and (Score:3, Funny)

      by meabolex ( 788745 )
      Initiating a Mooglesoft search:

      Instead of clicking a button named Google Search, it simply says "KupoKupo!"

      You are then returned a page where 100% of the text is the word "Kupo"

      This is slightly less optimized than a Marklar search (which at least has some words other than 'Marklar').
  • by biffnix ( 174407 ) on Thursday November 11, 2004 @03:39PM (#10790840) Homepage Journal
    Couldn't Google just crawl Microsoft in return? Then they'd be stuck in an endless loop, and William Shatner can then swoop in, crack some skulls, and save the day.

    Or something like that.

    biffnix
  • by Shant3030 ( 414048 ) * on Thursday November 11, 2004 @03:39PM (#10790842)
    Nah, never happens....
  • by mpost4 ( 115369 ) * on Thursday November 11, 2004 @03:40PM (#10790852) Homepage Journal
    I can say that they have been crawling like mad as of late: Google, Yahoo, and MSN. I say this because my site has had a lot of traffic from all three, and it is not a popular or even an important one. Not just once or a few times a week, but every day. There are big updates coming. I was not surprised to see the article about Google doubling their index; I knew something was coming from the way they are crawling unimportant/unpopular sites.
    • Ever think they might all be following the 2 links you have to your site in every slashdot comment you post?
  • by bbzzdd ( 769894 ) on Thursday November 11, 2004 @03:40PM (#10790860)

    more evil than satan [msn.com]

    ROOFLES!

  • by Wrathie ( 668211 ) on Thursday November 11, 2004 @03:40PM (#10790863)
    Such trouble. Just buy the damned company.
  • by account_deleted ( 4530225 ) on Thursday November 11, 2004 @03:41PM (#10790869)
    Comment removed based on user account deletion
  • Look I dislike M$ as much as the next guy, but if this were true then it would become immediately obvious to Google as they would be receiving a huge number of page requests from Microsoft. It would become even more obvious because they would be of the form

    site:example.com


    Doing this for, say, 100,000 domains would be noticeable but would not even scratch the surface of what's on the web.
  • Meta-search? (Score:4, Interesting)

    by grasshoppa ( 657393 ) on Thursday November 11, 2004 @03:44PM (#10790904) Homepage
    The question is why? If they are doing this, are they simply going to present the results as their own, or are they going to work some magic and find the most relevant search results from ALL the engines and use those.

    In the first case, it's a slimy business practice. In the second, it's fairly cunning ( and has been tried before ).

    In either case, I doubt google is in any real danger. They are to search engines what MS is to the desktop. And while MS has squandered that advantage in the desktop arena ( reader homework: 250 word essay as to why ), google is only improving on their work.
  • Why can't Google just block MS from crawling their site? Wouldn't Google notice if other spiders were crawling them?
  • But doesn't Google index other search engines as well?
  • Msn Crawling (Score:4, Informative)

    by clinko ( 232501 ) on Thursday November 11, 2004 @03:46PM (#10790936) Journal
    If you've been watching the logs to your site lately Microsoft has been RAPING most servers. Most crawlers will pick through pages with large lists 1 at a time, then come back every hour or so.

    MSN starting last week has been pulling EVERY LINK in sequence from my site. Even the larger Artist Index pages [clinko.com] of my site.

    Seriously, I've had this same spider on my site for about 36 hours now.
  • by Anonymous Coward on Thursday November 11, 2004 @03:46PM (#10790937)
    From Google's Terms of Service [google.com]
    Personal Use Only

    The Google Services are made available for your personal, non-commercial use only. You may not use the Google Services to sell a product or service, or to increase traffic to your Web site for commercial reasons, such as advertising sales. You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not "meta-search" Google. If you want to make commercial use of the Google Services, you must enter into an agreement with Google to do so in advance. Please contact us for more information.
  • Absurd (Score:5, Insightful)

    by targo ( 409974 ) <targo_t@@@hotmail...com> on Thursday November 11, 2004 @03:50PM (#10790984) Homepage
    The claims are so absurd I don't even know where to start.
    1) His whole theory is based on the "fact" that the only way in the world to find his pages is to use site:www.sitename.com in Google, implying that Google has cached the results from an earlier crawl. Of course, there is no way that the Microsoft search couldn't have also cached it.
    2) Then, he claims that Microsoft is probably screen-scraping Google's results (for all the millions of sites out there), and using these results to recrawl those sites? This doesn't even make any sense.
    3) And last but not least, Microsoft is certainly basing its whole search architecture on the assumption that Google wouldn't ever notice MSN mirroring its whole index. Yeah right.
    • I think the large amount of traffic coming from MS alone would be enough to clue Google in on something fishy.
  • Probably Not.. (Score:2, Interesting)

    by DelawareBoy ( 757170 )
    My website is the #1 site listed with specific Criteria on Google. Consistently for the last 2 months. I try the same thing with MSN search and My site does not even show up at all.

    If they are searching Google, they haven't done it recently, or else they haven't gotten to my site yet.
  • by G4from128k ( 686170 ) on Thursday November 11, 2004 @03:52PM (#10791014)
    It would be easy for Google to insert a small fraction of non-sequiturs in the results, look at Microsoft's search results, and then sue for misuse. Even if MSFT uses random proxies to avoid detection, it cannot manually recheck all the hits to make sure they are correct (if it could, it would have the resources to check all the sites itself and would not need to crawl Google). A few made-up sites or inappropriate search hits would be enough to establish a pattern of abuse.
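The "made-up sites" idea can be sketched as a canary scheme: derive a unique fake URL per query from a secret, plant it in that query's results, and later look for those URLs in a rival's index. This is only an illustration; the key, the host `canary.example.com`, and the function names are hypothetical, and nothing here describes what Google actually does.

```python
import hashlib
import hmac

SECRET_KEY = b"example-secret"        # hypothetical; kept private by the planter
CANARY_HOST = "canary.example.com"    # hypothetical host that exists only as bait


def canary_url(query: str) -> str:
    """Derive a unique, hard-to-guess fake URL for a given query."""
    digest = hmac.new(SECRET_KEY, query.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"http://{CANARY_HOST}/{digest[:16]}"


def find_leaks(competitor_urls, planted_queries):
    """Map each canary URL seen in the competitor's results back to its query."""
    planted = {canary_url(q): q for q in planted_queries}
    return {url: planted[url] for url in competitor_urls if url in planted}


if __name__ == "__main__":
    queries = ["purple monkey dishwasher", "zxqv flibbertigibbet"]
    rival_results = ["http://example.org/", canary_url(queries[0])]
    print(find_leaks(rival_results, queries))
```

Because the canary URLs are linked from nowhere else, any appearance in another engine's index would be hard to explain as independent crawling.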
  • I might be mistaken, but I thought google has a 10,000 query limit per IP address per day. So it might be conceivable that enough computers over several days could get it, though I imagine it wouldn't be trivial

    I think this is mentioned in Google Hacks by O'Reilly. Those with an online account there can check it out and mock me if I'm wrong :)
  • by Skiron ( 735617 )
    I see bots hitting a cgi test set-up forum I ran 2 years ago (before uploading to remote ISP) STILL try to index pages. I think the bloke is spot on with his analysis.
  • by JustNiz ( 692889 ) on Thursday November 11, 2004 @03:55PM (#10791052)
    You can't get to every page on the internet just by starting at one page and recursively following links, therefore the more places you start from, the more likely you are to have 100% coverage.

    I could imagine that Microsoft just needs a few thousand URLs evenly spread across the internet just to seed their crawler, which they can get from Google by using a list of most popular queries.

    Once their crawler has so many starting points it can do the rest itself.
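The seeding argument can be shown with a toy link graph: a breadth-first crawl only reaches pages connected to its starting points, so an isolated cluster stays invisible until one of its pages is supplied as a seed. The graph and page names below are invented for the example.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
# The c.example pages link only to each other, so nothing outside points at them.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": [],
    "b.example/": ["a.example/"],
    "c.example/": ["c.example/blog"],
    "c.example/blog": ["c.example/"],
}


def crawl(seeds, links):
    """Breadth-first crawl: return every page reachable from the seed pages."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


if __name__ == "__main__":
    print(sorted(crawl(["a.example/"], LINKS)))                # misses c.example
    print(sorted(crawl(["a.example/", "c.example/"], LINKS)))  # covers everything
```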
  • It's called a router. It can be set to null route whole chunks of IP address space. Set it to forget where Microsoft is and forget it.
  • Anybody know what IP address ranges msnbot is using? Might be possible to limit the rate of connection from those addresses using firewall rules (or, for that matter, forbid connection entirely if that's your preference) to avoid the "hammering" that msnbot is said to be doing...
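For anyone who would rather filter in the application or in log analysis than in the firewall, here is a small sketch of a netblock check using Python's standard `ipaddress` module. The ranges below are reserved documentation ranges used as placeholders; they are not msnbot's real addresses, which this thread does not establish.

```python
import ipaddress

# Placeholder netblocks (RFC 5737 documentation ranges), NOT real crawler ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(addr: str, networks=BLOCKED_NETWORKS) -> bool:
    """Return True if `addr` falls inside any of the listed netblocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)


if __name__ == "__main__":
    for addr in ("192.0.2.77", "203.0.113.5"):
        print(addr, "blocked" if is_blocked(addr) else "allowed")
```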

  • that article was so ambiguous..."some person was searching some site and it was being spidered by some MSN bot and the links were added sometime after"

    Yay, way to go Slashdot, thanks for posting the most blatant flamebait article ever. How about for your next post, you repost that Reuters article about a machine that makes more energy than it uses....

  • by Skuld-Chan ( 302449 ) on Thursday November 11, 2004 @04:00PM (#10791132)
    And got banned from using google. Seriously.
  • Isn't Google a webpage? Is MSN doing anything wrong by indexing a webpage and its subpages?
    Look at it this way. If Google were to complain about someone searching their page/databases, they would be the largest hypocrites in the history of history.
  • Terrible article (Score:5, Insightful)

    by angio ( 33504 ) on Thursday November 11, 2004 @04:02PM (#10791159) Homepage
    The author suggests that microsoft must be scraping google b/c the only place _he_ could find the URLs they're requesting was google's cache.

    Uh.

    Microsoft has been developing their internal search engine for quite a while now. Part of developing a search engine is using it to crawl and creating a large corpus of test data. It's hugely likely that M$ has had a working crawler system for much, much longer than would be indicated by their public announcement. Quite a few people who helped develop Altavista at HP/Compaq/DEC research joined Microsoft Research about two years ago - the kind of people who could write a high-performance crawler in their sleep and wake up feeling refreshed.

    That article seems like baseless, uninformed speculation, to put it not-so-politely.
  • by theluckyleper ( 758120 ) on Thursday November 11, 2004 @04:02PM (#10791161) Homepage
    I'm certainly no Microsoft groupie, but this behavior may not be as sinister as it seems. After all, Google is on the internet, too. There are links found all over the internet to Google, with some specific search term embedded in the URL. If MSN's bot happened upon a link to a Google search page, is it somehow wrong for the MSN bot to follow that link, and spider as normal?
    • If MSN's bot happened upon a link to a Google search page, is it somehow wrong for the MSN bot to follow that link, and spider as normal?

      Find a link, fine
      Follow the link, fine
      Spider the link, not fine - google's Robots.txt [google.com] does not give them permission to.

  • Interesting (Score:2, Insightful)

    by Eric119 ( 797949 )
    Try entering a known Googlebomb into the MS search engine. "litigious bastards" [msn.com] shows up www.sco.com as the number one hit.
  • by dfj225 ( 587560 ) on Thursday November 11, 2004 @04:03PM (#10791176) Homepage Journal
    Microsoft's beta search engine's index doubled in size to over 8 billion pages.
  • So I can see how you can distill the entire content of the web that your bot has crawled into a database, but is it possible to pump enough queries into Google to get the entire database? (Or in more mathematical speak: Is this a well posed inverse problem?)

    I don't think so. You still have to have your own crawler (to use on the top ranked results of any query). And a good set of queries to hit google with (so you have an idea of what to index)...which changes constantly. Look at Google's zeitgeist som
  • by potus98 ( 741836 ) on Thursday November 11, 2004 @04:07PM (#10791233) Journal

    Hey Google, please don't make us read those wacky JPG/GIF letter scrambles with criss-cross lines and input the random characters into a field before submitting a search.

    "Hold on a sec while I Goog- Huh? Grrrr.... H... P... 7... O... wait no, 7... zero... ummm...

  • Bogus article (Score:3, Insightful)

    by YU Nicks NE Way ( 129084 ) on Thursday November 11, 2004 @04:13PM (#10791303)
    This whole article is based on the speculation of a web master who notices that a bot which allegedly isn't leaving behind a bot name is crawling his site. He then figures out that, oh look, there is a standard record in his server log.

    And I'm supposed to take this clown's "friend" seriously? That's not a good start, anyway.

    But then there's the real howler: the site can allegedly only be found through site: on Google. How does the friend know that? Has he done a complete crawl of the web to find all forward links to any image in his site -- even broken ones? MSNBot, like all bots, recognizes that many anchors are broken, and tries plausible corrections around the broken links. That's particularly useful with a deep link, where the deep link may have timed out but the shallow link still exists.
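One plausible form of "corrections around a broken link" is to walk a dead deep link back toward the site root and try each shallower URL in turn. The sketch below only generates those candidates; it is an assumption about the general technique, not a description of MSNBot's actual behavior, and `shallower_urls` is a made-up name.

```python
from urllib.parse import urlsplit, urlunsplit


def shallower_urls(url: str) -> list:
    """List progressively shallower URLs, from the parent directory up to the root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    while segments:
        segments.pop()  # drop the deepest remaining path component
        path = "/" + "/".join(segments)
        if segments:
            path += "/"
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


if __name__ == "__main__":
    for candidate in shallower_urls("http://www.example.com/gallery/2004/img01.jpg"):
        print(candidate)
```

A crawler doing this would request pages that no live link points to, which is one way a bot could end up asking for URLs the site owner believes are unreachable.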
  • Full Circle (Score:5, Interesting)

    by Guppy06 ( 410832 ) on Thursday November 11, 2004 @04:17PM (#10791354)
    "Dowdell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own."

    It's interesting to know that Bill Gates has been forced to go back to his roots...
    The best way to prepare [to be a programmer] is to write programs, and to study great programs that other people have written. In my case, I went to the garbage cans at the Computer Science Center and fished out listings of their operating system.
  • Arg I hate M$ (Score:4, Interesting)

    by OverlordQ ( 264228 ) on Thursday November 11, 2004 @04:28PM (#10791503) Journal
    Yes this might sound like a rant, but somehow (partly my fault), the MSN Spider bot found one of my joke cgi scripts that translate pages to my own imaginary language. It's linked nowhere on my site, and maybe 3-4 places on the entire web. Said MSNBot began to pull PDF after PDF through the script, in addition to other large files, it also tried mailto: links. All in all said spider pulled about 1GB of data in a single day. My site's previous average was about maybe 300-400MB a Month. Let's just say that entire M$ IP Netblock was quickly filtered through iptables.
  • Highly unlikely (Score:4, Insightful)

    by David Leppik ( 158017 ) on Thursday November 11, 2004 @04:28PM (#10791513) Homepage
    Google keeps track of IP addresses and blocks which are doing an unusually high number of searches and disables requests from them.

    How do I know? Because a friend of mine decided to find out how common all TLAs are (three-letter acronyms) by counting Google hits on each TLA. This was before the Google API, so he did it with good old fashioned HTTP/HTML. It didn't take long for Google to flag him as evil and block access from his IP block.

    Sure, Microsoft could find some way around this-- using different enough IP addresses to conceal the source-- but that's more trouble than it's worth. Worse yet, it sets up a cat-and-mouse game and keeps M$ dependent on Google-- when their stated goal is to beat Google at its own game.

    I've got a simpler explanation for what the author is seeing. His evidence is based on the fact that some pages being requested exist only in Google's cache. Well, spiders are supposed to do breadth-first searches so they don't hit the same site too often. Microsoft is probably going against data it collected a few weeks ago but hasn't put on its public servers yet. (Why not? Could be lots of things. Maybe they haven't put enough hardware on the front end to support the amount of data they have on the back end. Or maybe they're just slow.)

    As much as I'd like to bash M$, there's nothing here that really looks suspicious to me.
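The kind of per-address throttling described above can be sketched as a sliding-window counter. How Google actually flags heavy users is not public in this thread, so the class name, limit, and window below are purely illustrative.

```python
from collections import defaultdict, deque


class QueryRateMonitor:
    """Allow at most `max_queries` per address within a sliding time window."""

    def __init__(self, max_queries: int, window_seconds: float):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # address -> timestamps of allowed queries

    def record(self, ip: str, now: float) -> bool:
        """Register a query at time `now`; return False if the address is over the limit."""
        times = self._history[ip]
        # Forget queries that have aged out of the window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) >= self.max_queries:
            return False
        times.append(now)
        return True


if __name__ == "__main__":
    monitor = QueryRateMonitor(max_queries=3, window_seconds=60.0)
    for t in (0, 1, 2, 3, 61):
        print(t, monitor.record("192.0.2.1", now=t))
```

A scraper spreading its queries over many addresses stays under any single limit, which is the cat-and-mouse game mentioned above.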
  • Not quite (Score:4, Insightful)

    by SamMichaels ( 213605 ) on Thursday November 11, 2004 @04:36PM (#10791590)
    Dowdell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own.
    My garbage doesn't have a copyright statement, contain my patented technology, nor does it come with terms of service or licensing agreements.
  • by the-build-chicken ( 644253 ) on Thursday November 11, 2004 @06:01PM (#10792533)

    microsoft is looking at old pages, google uses a cache...ergo microsoft must be using google.

    if we're going to use that kind of logic, I could just as easily come up with "afghanistan is in the middle east and supports terrorists, iraq is in the middle east...ergo, iraq must support terrorists", and use it to make a case for invading iraq...but you don't see......oh wait
