The Internet Software

Nutch: An Open Source Search Engine 291

Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch. In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising? Nutch is clearly in their initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.
This discussion has been archived. No new comments can be posted.

  • Google? (Score:5, Informative)

    by devphaeton ( 695736 ) on Wednesday August 13, 2003 @04:54PM (#6689378)
    Last I heard, Google still doesn't accept bribes for page ranking.

    Unobtrusive adverts in the right-hand column notwithstanding.
  • by AtariAmarok ( 451306 ) on Wednesday August 13, 2003 @04:56PM (#6689407)
    To me, accuracy is the most important "Relevance".

    The problem with Google is that there are errors in it: you ask for something and sometimes you get something else.

    A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.

    Since it seems like Google will never fix this problem, I'm looking forward to something with all of Google's great features, plus accuracy.
  • by Anonymous Coward on Wednesday August 13, 2003 @04:57PM (#6689414)
    Just use google. Search for "SEARCH-STRING site:slashdot.org"
  • by binaryDigit ( 557647 ) on Wednesday August 13, 2003 @05:06PM (#6689499)
    A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.

    This is a bit of a misrepresentation. Google will toss the words 'to', 'be' and 'or'. So you effectively end up searching on 'not'. It does this to eliminate words that show up too frequently and to make the searches faster (and because of the overloading of the word 'or'). If you really want that text, then either quote the whole thing or place a '+' in front of those words, which will give you exactly what you're looking for. So there is no problem with its accuracy when you understand the proper way to ask it for something.
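A minimal sketch of the query handling described above, assuming a toy stop-word list. `STOP_WORDS` and `parse_query` are hypothetical names for illustration; this is not Google's actual implementation, only the behavior the comment describes (bare common words dropped, quoted phrases and '+' words kept).

```python
import re

# Hypothetical stop-word list: just the three words named in the comment.
STOP_WORDS = {"to", "be", "or"}

# A token is either a double-quoted phrase (optionally prefixed with '+')
# or a run of non-whitespace characters.
TOKEN = re.compile(r'\+?"([^"]+)"|(\S+)')


def parse_query(query):
    """Return the terms that survive stop-word filtering.

    - A quoted phrase is kept whole.
    - A word prefixed with '+' is kept even if it is a stop word.
    - Any other word is dropped if it is in STOP_WORDS.
    """
    terms = []
    for match in TOKEN.finditer(query):
        phrase, word = match.group(1), match.group(2)
        if phrase is not None:
            terms.append(phrase.lower())
        elif word.startswith("+"):
            forced = word[1:].lower()
            if forced:  # ignore a lone '+'
                terms.append(forced)
        elif word.lower() not in STOP_WORDS:
            terms.append(word.lower())
    return terms
```

With this sketch, the unquoted query reduces to the single term 'not', while the quoted form and the '+' form both keep every word.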
  • Re:Google? (Score:3, Informative)

    by fireboy1919 ( 257783 ) <rustyp AT freeshell DOT org> on Wednesday August 13, 2003 @05:15PM (#6689589) Homepage Journal
    Yeah, they've been known to do that when people build server farms to try to influence Google's rankings. It is in their best interest to ensure that the pages people actually want to see come up first, not the advertisers' pages.

    That's why people use Google. If they stacked the deck in favor of places people don't care about (advertisers' pages, for instance), then we'd all jump ship and use another search engine.

    They're like the Swiss and Consumer Reports. Part of the reason they make money is neutrality, and they won't make as much if they're not neutral.
  • by Anonymous Coward on Wednesday August 13, 2003 @05:18PM (#6689621)
    Check out Lucene [apache.org], the indexing and search engine used by Nutch. From what I've heard, Nutch is mainly the spider/crawler used to gather documents.
  • by nadadogg ( 652178 ) on Wednesday August 13, 2003 @05:21PM (#6689649)
    Grub is another open-source search engine. I have the client running right now; it's nice and distributed. I think this kind of idea is great.
  • Re:Hardware? (Score:2, Informative)

    by AsparagusChallenge ( 611475 ) on Wednesday August 13, 2003 @05:32PM (#6689768)
    Don't worry too much. This is software, not a service. When available it may be implemented by someone and be the infrastructure of a company, which may then provide bugfixes and development to the original project. Or it may not. Who knows.
  • by Wesley Felter ( 138342 ) <wesley@felter.org> on Wednesday August 13, 2003 @05:36PM (#6689803) Homepage
    Nutch has four developers, one of whom is Doug Cutting [sourceforge.net] who wrote several indexing engines. They count Alexa founder Brewster Kahle as a "friend" and are sponsored by Overture.
  • by cpeterso ( 19082 ) on Wednesday August 13, 2003 @05:45PM (#6689877) Homepage

    Lucene and Nutch are related:

    http://scriptingnews.userland.com/2003/08/13#When: 12:20:53PM [userland.com]

    Paul Nakada, via email: "It appears that the coding muscle for Nutch is Doug Cutting, the author of Lucene, an Apache Project open source search engine. We use it here at salesforce and have a huge amount of respect for Doug's coding."

  • Re:Google? (Score:1, Informative)

    by Anonymous Coward on Wednesday August 13, 2003 @05:48PM (#6689902)
    Yes, but google does delist pages when threatened with lawsuits.

    Remember the Scientologists?
  • Re:Patents. (Score:2, Informative)

    by alwayslurking ( 555708 ) <<jason.boissiere> <at> <gmail.com>> on Wednesday August 13, 2003 @05:50PM (#6689910)
    I still don't think you can describe Google's setup as distributed. They have multiple data centers each running a very large cluster and containing a similar, but not identical, snapshot of the database, indices, etc. A truly distributed engine is likely to require an innovative step or three to emulate that with no centralised control, unknown hardware and bandwidth resources and the real possibility that some "clients" may be corrupted by their owners to distort results. I haven't got any arguments about the real value of this effort though. Google has done nothing to lose my trust and seems to be run with retaining people's trust as an active ambition. Closest they came to worrying me was crippling for China, but that was really a no-win situation, IMHO.
  • by curunir ( 98273 ) * on Wednesday August 13, 2003 @06:01PM (#6689990) Homepage Journal
    You've entirely missed the point of this project.

    I highly doubt that Nutch is going to offer an alternative to Google in the area of web search. What they seem to be doing is offering an alternative in the area of Enterprise search.

    Currently, the company that I work for pays Verity (used to be Inktomi, before that Infoseek) tens of thousands of dollars a year for the use of their software. We use their software to make our own site searchable. If Nutch offered us a free alternative to our Ultraseek server, we'd definitely be interested.

    We don't have to worry about anyone "googlebombing" our search collections because, well, we create all the content that goes into those collections. We'd love it if the algorithm that determined rankings was open-source. That way, we could change it to suit our specific needs if we thought it would help return more relevant results. There are currently a number of undesirable phenomena that we live with or work around because the mechanics of the problem are buried within proprietary Ultraseek code.

    Google is the best of the best in web search and I don't think anyone short of MS is interested in challenging them for that. But 'search engine' in this case means something entirely different.
  • by KalvinB ( 205500 ) on Wednesday August 13, 2003 @06:02PM (#6689996) Homepage
    That's nice that they want to open source the engine but that's the least of a search engine. They're going to need multiple high end servers to process the searches and plenty of bandwidth to get the results to the users.

    How do they plan to pay for that? Apparently advertising is out. And we just had another monephobe complaining about lack of funds for his accounting software who expected people to donate because he couldn't figure out that maybe, just maybe he should find a way to sell his product in some form while also keeping one form free. I can get RedHat for free OR pay money to get a hard copy with some bonus stuff. Net result is that RedHat makes money and everyone is happy. Those who refuse to pay don't have to and those who are willing to pay have a reason to. Most people are not going to just give you money out of the goodness of their heart and accept nothing in return if they don't have to. Why do you think PBS gives you gifts with your donations?

    I'd be more impressed with such undertakings if the owners weren't convinced the bandwidth fairy was real and that money will fall from the sky like manna.

    When someone comes along who recognizes that the bandwidth fairy doesn't exist and that money needs to be acquired through marketing to get any real amount then I'll think twice before laughing it off.

    Free is a pretty dream but free don't pay the bills.

    Ben
  • by randyest ( 589159 ) on Wednesday August 13, 2003 @06:13PM (#6690072) Homepage
    167 posts and no mention of ht://dig [htdig.org]? It's a great open source search engine, and I've been using it daily (well, cron really uses it now, not me) to spider about 100 sites on my intranet, which has servers all over the world.

    While not currently designed for massive whole-web spidering (it's aimed at single websites or intranets), ht://dig is a great starting point (and a lot further along than the Nutch 'nascent effort' mentioned in the story). Some database optimization to ht://dig seems easier than starting over with Nutch. Plus, the name 'Nutch' sucks.
  • by randyest ( 589159 ) on Wednesday August 13, 2003 @06:42PM (#6690276) Homepage
    If you reach into the freezer without really looking, thinking that you are grabbing a freezer-pop, and get an 8 month old leg of lamb instead, are you going to shrug and eat the lamb anyway?

    Of course not. I'd put it back and try more carefully to get what I want. I, what's the word I'm looking for, . . . wait for it . . . refine my search :)

    Regarding your comments above about google inaccuracy: I searched for +"to be or not to be" [google.com] and consider the first page of 10 hits to definitely be 100% "correct". In fact, all of the 104,00 results that I checked (about 50, hehe) are 100% correct in that the sites on the list, or the sites linking to the sites on the list, contain the phrase "to be or not to be". Check the '2bee or nottoobee' link in google's cache and where you normally see the search term highlight colors, you'll see

    These terms only appear in links pointing to this page: to be or not to be

    Just because you wanted "Shakespeare" doesn't mean that "Shakespeare" is any more correct as an "answer" to "to be or not to be". If it were more popular (on the web), I'm confident that it would be higher on the list. That is, whether we like it or not, on the current www there are exactly 3 things more relevant to that famous phrase than Shakespeare, and they are, in order: barium enemas, BeOS, and a kids' grammar game starring a bee. Or, more accurately and revealingly: an article about barium enemas titled "To BE or Not to BE?", an article about BeOS titled "TO Be OR NOT TO be?", and a kids' grammar game starring a bee called "2Bee or Nottoobee" which is linked to by sites containing the phrase "to be or not to be" in or near those links.

    Lucky for us that ol' Bill is still in the top 10 at all, I'd say.
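The "terms only appear in links pointing to this page" behavior can be illustrated with a small sketch, assuming a toy in-memory crawl. The data layout and the names `build_anchor_index` and `phrase_search` are hypothetical, chosen for illustration; this shows the general idea of crediting a link's anchor text to its target page, not how Google actually stores or ranks anything.

```python
from collections import defaultdict


def build_anchor_index(pages):
    """Map each target URL to the anchor texts of links pointing at it.

    `pages` maps a URL to a (body_text, links) pair, where `links` is a
    list of (anchor_text, target_url) tuples found on that page.
    """
    anchors = defaultdict(list)
    for _source_url, (_body, links) in pages.items():
        for anchor_text, target_url in links:
            anchors[target_url].append(anchor_text.lower())
    return anchors


def phrase_search(pages, phrase):
    """Return {url: reason} for pages matching the phrase.

    A page matches on its own body text, or, failing that, on the
    anchor text of any link that points to it.
    """
    phrase = phrase.lower()
    anchors = build_anchor_index(pages)
    hits = {}
    for url in set(pages) | set(anchors):
        body = pages.get(url, ("", []))[0].lower()
        if phrase in body:
            hits[url] = "body"
        elif any(phrase in text for text in anchors[url]):
            hits[url] = "anchor text only"
    return hits
```

In this sketch a page about a grammar game that never contains the phrase itself would still be returned, flagged as "anchor text only", if another page links to it using the words "to be or not to be".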

  • by lvdrproject ( 626577 ) on Wednesday August 13, 2003 @08:34PM (#6690972) Homepage
    Interestingly enough, if i had read this story a few months ago, i would've said "Poppycock! Google should be good enough for anyone!". But lately i've been noticing that Google turns up a lot of garbage results. Like, if you search for something "generic" (like, no brand name or product name or anything like that), you're going to find a whole bunch of results that just lead to pop-up search sites.

    For example, look at the results [google.com] for the search 'convert wmv mpeg'. The first three results lead to the same exact search site. (Whether they have pop-ups or not, i can't tell, because i block them.) The fourth result is another search site. And then the last three are the same as the first three.

    Of course, this obviously works with stuff you'd expect it to, like 'mp3s' and 'warez' and 'porn', but it works with legitimate stuff too. I wonder if there'll be anything to combat this trend, whether it be implemented by Google or by someone else....

  • by msgregory@earthlink. ( 98641 ) on Wednesday August 13, 2003 @08:45PM (#6691030)
    I've noticed that searching for Eric S. Raymond's home page brings up his actual home page third or fourth in the listing. I don't know if that means Google is on its way to going downhill or what. The first listing it brings up doesn't appear to have anything to do with ESR. I don't even think his name appears anywhere on the page.
  • Re:Google? (Score:3, Informative)

    by RedWizzard ( 192002 ) on Wednesday August 13, 2003 @09:05PM (#6691168)
    See this article on slate for some interesting ideas on why Google's page-ranking system is being undermined due to the evolution of ecommerce and price-comparing portals.
    That article has already been dealt with on Slashdot (here [slashdot.org]). Using a bit of intelligence when searching will avoid the problems cited.
  • by pauljlucas ( 529435 ) on Wednesday August 13, 2003 @11:34PM (#6692182) Homepage Journal
    I see this project as a competitor to shrink wrapped search engines. IE google appliance or maybe even Folio based products. Typically corporations have many documents that need to be indexed and searchable to their needs.
    SWISH++ [mac.com] fills this niche nicely. It can index hundreds of thousands of documents very quickly; index not only HTML but also e-mail, news, man pages, LaTeX, RTF, and even the ID3 tags of MP3 files; apply filters on-the-fly (convert PDF to text, then index that); do incremental indexing; and run as a multi-threaded search daemon.
