Google Businesses The Internet

NCSA Issues Disclaimer on Google/Yahoo Study

Jean Veronis writes "NCSA has issued a strong disclaimer on the study announced recently on Slashdot that seemed to contradict the claim that Yahoo's index is bigger than Google's: 'Staff at the NCSA noted several issues with the study'. This study, conducted by students, is 'not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA'."
This discussion has been archived. No new comments can be posted.

  • Disclaimer Text (Score:5, Interesting)

    by Stanistani ( 808333 ) on Monday August 22, 2005 @12:24PM (#13372514) Homepage Journal
    From http://vburton.ncsa.uiuc.edu/indexsize.html [uiuc.edu]:
    "The following study was completed by two of Professor Vernon Burton's students at the University of Illinois. Though one of the students previously worked with Professor Burton at the National Center for Supercomputing Applications (NCSA), the study was done outside the scope of any NCSA core projects. When first published online, staff at the NCSA noted several issues with the study, and some revisions have been made to the document to reflect several of these concerns. Changes are detailed at the bottom of the following page.

    Please note again that this study is not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA.

    A Comparison of the Size of the Yahoo and Google Indices [uiuc.edu] "
    • Ok, so NCSA claims to not be associated with this paper, but several changes have been made to reflect several concerns of NCSA staff. So which is it, NCSA, are you involved or not? And from the fact that changes were made to reflect your concern, it sounds like you were involved. It also sounds like you pissed off Google.
  • Comment removed (Score:3, Interesting)

    by account_deleted ( 4530225 ) on Monday August 22, 2005 @12:24PM (#13372516)
    Comment removed based on user account deletion
    • Re:... so? (Score:2, Interesting)

      by Anonymous Coward
      Preliminary results (from 7000 test queries) indicates that the results of this verification study confirms the conclusions of this study, but final results are still forthcoming.

      Looks like they're still doing some looking to make sure their results are rock solid, but that so far they seem to be. As such, the current state of reality is that the fact is that Google has a much bigger index of the world wide web (or Internet, or whatever you want to call it) than Yahoo. Yahoo may have a bigger index squ
      • Re:... so? (Score:2, Interesting)

        by mi ( 197448 )
        The whole method seems flawed. Trying to compare the sizes of two sets by the sizes of various subsets makes sense only if the method of selecting the subsets is the same.

        This is not the case. The methods depend on each search engine's algorithms and are very likely to differ greatly.

        In any case, whether a particular query returns 40 results or 40000 does not matter -- only the first 20 are ever of any use...

  • /. 503 error (Score:2, Interesting)

    by dhasenan ( 758719 )
    Off topic...

    Anyone else get 503 errors when trying to reach Slashdot?

    Where do you go to talk about Slashdot being Slashdotted?
  • by d3m057h3n35 ( 695460 ) on Monday August 22, 2005 @12:26PM (#13372521)
    Also pertinent was the discovery that Yahoo's claims to increased index size were based on the hope that buying products from companies which advertise "longer, thicker index size in two weeks, money-back guarantee, all-natural supplements" would yield actual results.
  • Wait... (Score:5, Funny)

    by lbmouse ( 473316 ) on Monday August 22, 2005 @12:28PM (#13372524) Homepage
    I thought that size didn't matter.
  • by ChrisF79 ( 829953 ) on Monday August 22, 2005 @12:34PM (#13372549) Homepage
    Although they don't say it in the disclaimer, their actions of posting a disclaimer after posting the article screams that they realize the article is flawed. If that's the case, why publish it in the first place? Shouldn't they have had some foresight and left this one on the cutting room floor? Maybe Finance is different, but I remember it being very difficult to get an article published unless it was groundbreaking and free from any minor flaws.
    • From the disclaimer I would say that the report was not a university sanctioned project, but a funtime project for a couple of students. They then published it in a manner that implied that it was official work of the university, or at least sanctioned by the professor. Now, whether the study is right or wrong come peer review, the university wants it known that it wasn't their project. A peer reviewed research project is much different than throwing together a bad stats class midterm and putting the resu
    • by kaan ( 88626 ) on Monday August 22, 2005 @01:04PM (#13372738)
      why publish it in the first place?

      Dude, it was never published, it was posted on one web server that is part of the ncsa.uiuc.edu sub-domain (specifically, vburton.ncsa.uiuc.edu). There are probably hundreds of machines that are in this network, and posting something on a web server running there does not equate to NCSA formally publishing an article. What we're talking about here is a web page written by two students, they worked on a project, they wanted to post it for other people to see. So that's what they did, period.

      Stupidly, everyone is claiming that NCSA backed this whole thing, like they (NCSA) are on some crusade to compare Yahoo and Google. But this must be taken for what it is - a project by two students. NCSA's disclaimer is just trying to make this clear for the idiots out there who think that every little thing a student says or does must have been funded, supported, backed, etc. by NCSA.
    • Although they don't say it in the disclaimer, their actions of posting a disclaimer after posting the article screams that they realize the article is flawed. If that's the case, why publish it in the first place? Shouldn't they have had some foresight and left this one on the cutting room floor? Maybe Finance is different, but I remember it being very difficult to get an article published unless it was groundbreaking and free from any minor flaws.

      Yes, the web page was lacking in methodology and had a numbe
      • Yes, you are underreacting. Did you miss the original Slashdot posting:

        NCSA Compares Google and Yahoo Index Numbers

        chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to "over 20 billion items", compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own, independent, comparison of the two engines. "

        Notice that the summary was submitted by a well known Google employee, and that it states the study was conducted by the N

  • Filtering (Score:5, Insightful)

    by Spazmania ( 174582 ) on Monday August 22, 2005 @12:35PM (#13372555) Homepage
    Readers can consult the list of search terms provided by the authors, and can see for themselves that, in the vast majority of cases retained (i.e. those with fewer than 1000 results), the results in question are lists and spam.

    I don't know which disturbs me more: The possibility that this is the correct explanation for the discrepancy or the possibility that it isn't.

    It seems to me that the correct solution to filtering results would be to put the "undesirable" results at the bottom of the list, not get rid of them entirely. One man's trash is another man's treasure after all.
  • The fact that a study conducted by students got mention on /. is impressive. Usually, most works done by students are ignored as class exercises. Now "retracted" can be added to the list.
  • Covering Ones Rear (Score:3, Insightful)

    by gkozlyk ( 247448 ) on Monday August 22, 2005 @12:39PM (#13372589) Homepage
    Ah, the good old disclaimer added to cover one's rear. With litigation flying free as newspaper in the wind, one can't be too careful these days.
    • What I really love is the fact that the page used to have the professor listed on the list of authors, NCSA logos were on the page, UIUC was listed under the authors' affiliation, and it looked much more official. Now that it's been aired out as non-scientific, there are all sorts of disclaimers saying that it was his students' work, and shifting the blame. Too bad it was published on his webspace :P

      Perhaps the professor of History and Sociology will think twice next time before attempting to put his nam

  • by frdmfghtr ( 603968 ) on Monday August 22, 2005 @12:48PM (#13372647)
    I didn't read the article the first time around, so maybe something was changed/removed that prompted the disclaimer. I read the report and couldn't find a single reference to NCSA, except in the URL and in the disclaimer itself.

    Aside from the URL, was there some sort of NCSA association implied or claimed in the original post then removed?
    • I think you're right, the original article had no visible association with NCSA other than the url. But this is just like the classic telephone game: I tell you something, you repeat it to somebody else with a minor addition/change, then that person tells somebody else, etc. By the time it goes 4 or 5 hops, it's been totally twisted around, and my original message has turned into something idiotic, and everyone thinks I said it. This is exactly what happened here, because it started showing up on blogs,
  • The dark web (Score:5, Insightful)

    by SpinyNorman ( 33776 ) on Monday August 22, 2005 @12:49PM (#13372652)
    The Yahoo vs Google page count methodology of counting numbers of pages returned for various high-response queries seems to be completely ignoring the fact that Yahoo *might be* picking up some of the less highly linked-to "dark web" that Google's PageRank algorithm is going to rate low, and which their crawler may be ignoring.

    This is the portion of the web that I'd like to see - not the commercial portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.

    What'd therefore be relevant and interesting to know isn't how many hundreds of pages Google vs Yahoo get for "my job sucks", but rather how many it gets for "my weevil collection".
    • Your wish fulfilled:
      Google: Approx 47,100
      Yahoo: Approx 258,000

      Both searches were for "my weevil collection" without quotes. With quotes the results are:

      Google: 3
      Yahoo: 4

      Yahoo is champ.

    • Search results for "weevil":
      google - 915,000
      yahoo - 2,200,000

      Search results for "my weevil collection":
      google - 3
      yahoo - 4

      Yahoo returned the wikipedia page for "weevil" on the first page, so now I know what it is :)

      • by Anonymous Coward
        Search results for "weevil":
        google - 915,000
        yahoo - 2,200,000

        Search results for "my weevil collection":
        google - 3
        yahoo - 4


        You're getting negative hits?
    • Re:The dark web (Score:3, Interesting)

      Interesting.

      The original search rating papers (Kleinberg's algorithm, PageRank) made the ground-breaking observation that links between pages contained lots of useful information that could be used for ranking in addition to the keywords contained in the pages themselves. This was in a time where websites were mostly personal, and there was an atmosphere of friendly sharing of information where people would link to other sites that they find interesting. However, how much have things changed today? Who s
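For anyone who hasn't seen the link-analysis idea spelled out, here is a minimal power-iteration sketch in the spirit of PageRank. The toy link graph, the damping factor of 0.85, the iteration count, and the function name `pagerank` are all illustrative assumptions; this is not Google's actual implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by repeatedly passing each page's score along its outgoing links."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread its score evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: the hobbyist "weevils" page links out, but nobody links back.
toy_links = {
    "portal": ["shop", "news"],
    "shop": ["portal"],
    "news": ["portal", "shop"],
    "weevils": ["portal"],
}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:8s} {score:.4f}")
```

In this toy graph the page with no incoming links ends up with only the baseline score, which is the "dark web" effect discussed upthread: link-based ranking alone says nothing good about a page nobody links to.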
      • At one time that was google's 'pages like this one', and then there are Alexa's "Related Links", which have been around since before Google. Unfortunately there are privacy issues, and there would be (and is for alexa) a whole industry built around gaming that system.
        • Good tip, thanks. The Alexa client is dead on. Abuse and privacy issues are inevitable, but I'm curious how a search engine using client-side information compares to a crawl based one.
      • I have had a few items pop in my head which I thought were "totally awesome" and if successfully employed, worth many many monies.

        As for the search engine, how about a little checkbox that says "No Business". What's a business? Someone who sells something (loosely). Anyway it'd take a heck of a lot of work to implement and define, but that's a checkbox I'd have thoroughly molested. I'd also want to make a thesaurus that edits words as you type.
      • What makes you think your method is any better? It would be gamed just like PageRank (much worse, actually). Overreliance on any single method is not good if you want to have a decent search engine.
        • It uses a different source of reputation, one that seems more in tune with what the Web content looks like today.

          There is always the issue of abuse, no matter what the method.
      • Interesting idea. I imagine that Google would have the bandwidth and server capacity to capture and process this data if browsers were able to make it available.

        I quite often find that Amazon's "people who bought this book also bought/viewed ..." section turns up useful stuff that a title search doesn't, so I expect the same may be true here too. One could even get a "user interest rating" of pages by how long they viewed them for...

        Maybe Mozilla/Firefox could work with Google to implement this type of fee
    • Re:The dark web (Score:3, Insightful)

      by RAMMS+EIN ( 578166 )
      ``This is the portion of the web that I'd like to see - not the commercial portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.''

      I personally don't think Google is _excluding_ pages that somehow don't get enough links to them. Typically, good resources will get linked to, and thus taking into account the number of links to a page seems sensible.

      From personal experience, I can't say I have anything
    • This is the portion of the web that I'd like to see - not the commercial portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.

      Dude, if your idea of the dark web is millions of spam infested blog pages that have been crawled by a million spam robots putting links in the comment pages, along with a smattering of Mediawiki sites that have similarly been "0wnz0r3d" by spam crawlers that edit the pages a
  • trust (Score:5, Funny)

    by dioscaido ( 541037 ) on Monday August 22, 2005 @01:01PM (#13372720)
    If it made it through the Slashdot filters, then the study is good enough for me.
  • I know it's been said before, but you cannot just measure search engines based on volume of hits returned. Clearly, when you get into the millions, it doesn't hurt the results to prune some crap off the end, and I'm sure they're both doing things -- either one could easily focus a little on breadth of hits per query and jump past the other.

    Important thing to note: The general principle is MORE COMPLEX than "find all pages containing this term". You can ADD terms and get MORE hits.

    As an example and as a t
    • Comment removed based on user account deletion
      • Though note that I'm not really referring to 'index size' but to the size of the list of hits returned for a term. I'm pretty much in favor of the indexes being as large as possible, and that's a reasonable thing to demand. But saying that one engine returns 2,000,000 more hits for 'banana store' than the other is not measuring the same thing at all, and is in fact dumb.
  • DISCLAIMER: This comment is influenced by Colt 45 malt liquor...

    Big deal about what some other corp. says. This is a Joe Schmoe study conducted by college students. This means they're an independent, non-funded (therefore non-corp influenced) study. Too bad they have seemingly been coerced into changing some things in their article. *sigh* Why can't they ever stick with their guns??
  • by xiaomonkey ( 872442 ) on Monday August 22, 2005 @01:58PM (#13373169)
    Try the following sets of key words on Google: This trend appears to continue, as seen in that repeating the "lawyer" keyword 10 times results in Google estimating that there are 389,000,000 [google.com] hits in its index.

    On yahoo, this sort of thing doesn't seem to happen as much, but it still does happen. For example, searching for "lawyer" returns 124,000,000 [yahoo.com] results, and searching for "lawyer lawyer" or "lawyer lawyer" returns 125,000,000 [yahoo.com] results.

    So, it probably doesn't really make sense to judge the relative size of either index based on the estimated number of hits for any given set of keywords in their index. Right now, Google's numbers look a little more suspect since they seem to vary so greatly just based on the repetition of a keyword. However, the stability of Yahoo's numbers doesn't necessarily mean that they're correct either.
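A quick way to put a number on "vary so greatly" is the ratio between the largest and smallest estimate for what should be the same query. The sketch below uses only figures quoted in this thread: the two Yahoo estimates above, and the four repeated-"lawyer" counts posted in a later reply (which does not say which engine they came from). The helper name `spread` is made up for illustration.

```python
def spread(counts):
    """Return the ratio of the largest to the smallest estimated hit count."""
    return max(counts) / min(counts)


# Yahoo estimates for "lawyer" and "lawyer lawyer", as quoted above.
yahoo_estimates = [124_000_000, 125_000_000]

# Counts for "lawyer" repeated 1 to 4 times, as quoted in a later reply
# (the reply does not name the engine, so treat the attribution as unknown).
repeated_keyword_estimates = [29_300_000, 29_300_000, 62_000_000, 78_600_000]

yahoo_spread = spread(yahoo_estimates)
repeated_spread = spread(repeated_keyword_estimates)

print(f"Yahoo spread:            {yahoo_spread:.3f}x")
print(f"Repeated-keyword spread: {repeated_spread:.3f}x")
```

A spread close to 1 only says the estimate is stable under keyword repetition; as noted above, it does not show that the estimate is accurate.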
    • Sorry, the yahoo links I gave above are erroneous.

      Here's the corrected version of the first one, "lawyer" [yahoo.com], which results in 125,000,000 estimated hits. The second one, "lawyer lawyer" [yahoo.com], results in 124,000,000 estimated hits.
    • Those estimates are pretty irrelevant for this discussion, I think. When there are many results, those estimates aren't supposed to be accurate at all, that's why the study focused on queries with very few results.

      But yes, those numbers you show are quite strange.
    • For the search terms bla, bla bla, bla bla bla, etc., the numbers at Google remain pretty stable (starting out at 2.040 mio, and later somewhere between 1.870 and 1.900 mio).

      So the interesting question is: Why does it work with lawyer, but not with bla?
    • lawyer - results 29,300,000
      lawyer lawyer - results 29,300,000
      lawyer lawyer lawyer - results 62,000,000
      lawyer lawyer lawyer lawyer - results 78,600,000

      lawyer lawyer lawyer lawyer
      lawyer lawyer lawyer lawyer
      lawyer lawyer lawyer lawyer
      LAW SUIT LAW SUIT!

      lawyer lawyer lawyer lawyer
      lawyer lawyer ...

  • by lcsjk ( 143581 ) on Monday August 22, 2005 @02:22PM (#13373386)
    I understand that Google uses a very efficient compression technology to compress documents before they are indexed, thereby making characters so small that they can only be read with a magnifying glass or microscope.

    In contrast, Yahoo, unless I misunderstand, only compresses the file after it has been indexed. Since only the file is compressed and not the individual characters, they indeed have a larger index file as the study concluded. :)

  • by freality ( 324306 ) on Monday August 22, 2005 @02:23PM (#13373391) Homepage Journal
    After criticising [slashdot.org] the study when I first saw it, I now have some constructive ideas on how to perform a better test of the relative search performance of the two engines.

    - Crawler Test

    Do a search of "microsoft site:microsoft.com" in Google, Yahoo and MSN search. Assume that Microsoft knows how to crawl its own site completely and judge the relative strength of Google's and Yahoo's crawlers based on how many of those pages they find. Google easily wins.

    Unfortunately, this test doesn't work with Yahoo as the reference site since Google returns no hits for "yahoo site:yahoo.com". This is very disappointing, as Yahoo is one of the largest sites on the Web. Neither Yahoo nor MSN shares this self-censorship policy.

    Another test of the same kind is "amazon site:amazon.com", since amazon is a very big site which everyone is presumably very interested in crawling well. Unfortunately, amazon doesn't allow this kind of search on themselves via their A9 engine.

    This is an interesting test because it compares the very likely actual size of a site (i.e. Microsoft's reported size for microsoft.com) with the reported size by the second-party search engine. This may be the best 3rd party test of a crawler's Recall on a per-site basis.

    - Common Word Test

    Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc. A search for "the" on each shows Yahoo significantly in the lead.

    This is an interesting test because the word "the" is probably at least as uniformly distributed in the source webpage population as the random phrases used in the NCSA test (possibly more so), and this word is more likely than any other to occur in every web document (at least in English). These two characteristics mean that finding the most pages with "the" may be the closest approximation of an actual Recall measurement for all sites on the web that can be done without a prior-knowledge testing set.

    - Conclusion

    Google and MSN seem to have high-quality crawlers, whereas Yahoo makes up for it with a much larger current index size. However, take this with a grain of salt as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with.

    Even assuming these results, Yahoo's bigger index isn't necessarily more desirable. A better crawler can yield a bigger and better index with relatively little work (just add more machines to store it on), while there is no easy fix for the meticulous work needed to ensure a crawler is getting to all the nooks and crannies of the Web.

    Longer-term, it is these caveats that make an Open-Source approach to crawling shine by comparison. Take for example the Nutch search engine. Though it is in its early days, there need be no doubt about its crawling and ranking algorithms or its Precision/Recall tradeoff.
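The Crawler Test above boils down to a per-site recall ratio: pages an outside engine reports for a site, divided by the count the site's own search reports. A minimal sketch follows. The function name `site_recall` and every number in it are hypothetical placeholders, not measurements from this thread or from any engine.

```python
def site_recall(engine_count, reference_count):
    """Fraction of a site's pages (per the reference count) that an engine reports."""
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    return engine_count / reference_count


# Hypothetical placeholder counts for a "name site:name.com" style query.
reference_count = 1_000_000  # what the site's own search claims to hold
engine_counts = {
    "engine_a": 800_000,
    "engine_b": 250_000,
}

recalls = {name: site_recall(count, reference_count)
           for name, count in engine_counts.items()}

# List engines from best to worst estimated recall on this one site.
for name, recall in sorted(recalls.items(), key=lambda item: -item[1]):
    print(f"{name}: {recall:.0%} of the reference count")
```

The ratio is only as good as the reference count, and it inherits whatever estimation error the reported hit counts carry, so it is a rough comparison rather than a true Recall measurement.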
    • What?

      A google search for yahoo site:yahoo.com turns up over 57,000,000 hits, not zero.
    • A search for "the" on each shows Yahoo significantly in the lead.

      The problem with such searches is that if a search engine misindexes a mirror site or a 401 page and that is returned in the count, then that SE looks bigger.

      On the other hand if you launch a query that has 5-10 answers you can actually examine every single page on both result pages and make sure that all hits are correct and distinct.

      Using that technique Google comes ahead of Yahoo, by a large margin. [slashdot.org]
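The small-query check described above (look at every hit, keep only the ones that are correct and distinct) can be sketched roughly as follows. The URLs, status codes, and the normalization rules in `normalize` are assumptions made up for illustration; a real check would also need to catch mirrors hosted under different domains, which this does not.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a comparison key: lowercase host, no trailing slash, no fragment."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    return (host, path, parts.query)


def count_valid_distinct(hits):
    """hits: iterable of (url, http_status). Count distinct pages that loaded OK."""
    seen = set()
    for url, status in hits:
        if status == 200:  # skip error pages that were indexed by mistake
            seen.add(normalize(url))
    return len(seen)


# Hypothetical result lists for one rare query on two engines.
engine_a_hits = [
    ("http://example.org/weevils/", 200),
    ("http://EXAMPLE.org/weevils", 200),   # same page as the line above
    ("http://example.org/gone", 404),      # dead link counted as a hit
    ("http://example.net/beetles", 200),
]
engine_b_hits = [
    ("http://example.org/weevils", 200),
    ("http://example.net/beetles", 200),
    ("http://example.com/collection", 200),
]

print("engine A raw:", len(engine_a_hits), "valid distinct:", count_valid_distinct(engine_a_hits))
print("engine B raw:", len(engine_b_hits), "valid distinct:", count_valid_distinct(engine_b_hits))
```

In this made-up example the engine with more raw hits ends up with fewer valid distinct pages, which is exactly the failure mode a raw hit count hides.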
      • Yeah, I agree there are many caveats. Like I said: "However, take this with a grain of salt as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with."

        What you describe would be to me a "substantial structural difference". Which means I agree.

        However, that doesn't change that I do think it's better to accept probable error in a huge population of samples than to choose a m

    • Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc. A search for "the" on each shows Yahoo significantly in the lead.

      From the article:

      Interestingly, the actual total number of results returns varies dramatically from the estimated total number of results that both Google and Yahoo! provide users in the search results. In the case of Google, the number of actual results returned is about one third of the estimation that Google gives. However
      • Well, yeah.

        But even though I said take everything I said with a grain of salt, I would take that claim about estimated vs. actual hits with 2 grains of salt.

        Here's my superb rationale why.

        Consider that if it was easy or efficient to return the exact # of hits, they would probably do it, instead of, for instance, "Results 251 - 259 of about 284". I mean, consider the UI people at Google.. do they want to muddy their otherwise famously precise and clear interface with a guffaw like that? I'd bet Not unless th
        • Actually, glad you brought this up. I turned off duplicate elimination on both sites for my test query and got an exact match between Yahoo's estimated and actual number of results pages. Google's actual number increased but still fell short of the estimate.

          So again, looks like Yahoo has a good handle on what its index size is, but is simply filtering out lots of the results from its index. Perhaps there's not really that much difference between the two after all :)
  • Felonies for the whole lot of'em!

    Oh, wait. Which students were these?

  • by RunzWithScissors ( 567704 ) on Monday August 22, 2005 @02:32PM (#13373462)
    I got flamed for proposing this theory when the article was first posted on /.

    One major problem with the study, not really addressed by this article on its problems, is one of comparison. Yes, Yahoo! and Google are two search engines, but they perform their searches differently, and more importantly, use different criteria when returning matches! It is quite possible that doing the exact same search in both yields different results. Why? Because the two different search engines have different criteria for providing output, or matches if you will, from said search. Perhaps Yahoo! does indeed have more pages indexed, but because of their search algorithm, or their program which displays results to the user, fewer matches are provided, even though more pages were looked at.

    I'm not saying that Yahoo! does have more or fewer pages than Google. But the study that was executed and published did not account for many of the differences between the products they were comparing. The above is merely another interpretation of said results. I'm glad that some folks at NCSA agree and provided some clarification; lest we get another urban myth like storks bring babies.

    -Runz
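A toy illustration of the point above, that a larger index can still show fewer matches if the engine filters more aggressively before displaying results. The page texts, quality scores, thresholds, and the function name `visible_matches` are all invented for the example; neither engine's real filtering criteria are known here.

```python
def visible_matches(index, query, min_score):
    """Count indexed pages that contain the query and pass the engine's display filter."""
    return sum(1 for text, score in index if query in text and score >= min_score)


# Hypothetical indexes: (page text, quality score assigned by the engine).
big_index = [
    ("weevil list", 0.2),
    ("weevil spam", 0.1),
    ("weevil photos", 0.9),
    ("weevil care", 0.8),
    ("weevil forum", 0.3),
]
small_index = [
    ("weevil photos", 0.9),
    ("weevil care", 0.8),
    ("weevil forum", 0.3),
]

# The bigger index uses a strict display filter; the smaller one shows everything.
strict_engine_shown = visible_matches(big_index, "weevil", min_score=0.5)
lenient_engine_shown = visible_matches(small_index, "weevil", min_score=0.0)

print("big index, strict filter:   ", strict_engine_shown, "of", len(big_index))
print("small index, lenient filter:", lenient_engine_shown, "of", len(small_index))
```

Here the engine holding five pages displays two, while the engine holding three pages displays all three, so counting displayed matches would rank the index sizes backwards.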
  • I took a philosophy class with Matt Cheney at the University of Illinois. Let me just say for the record that he is a douchebag. I am really not surprised that he tried to pass off this study under the auspices of NCSA. I'm just glad to see that someone called him on this.
  • Google has the great majority of the web covered. While obeying robots.txt and such, they can't index much more of meaningful content. So how did Yahoo almost triple Google's goal? Well, as long as you're looking for obvious stuff with "easy hits", the results will be similar. But if you enter REALLY obscure stuff, for which Google shows 3-5 hits, Yahoo will show the same 3-5 hits and 15 others, which are all different variants of 404, pages pointed to through broken links. Simply put, 2/3 of Yahoo index are "4
