Nutch: An Open Source Search Engine
Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch.
In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?
Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.
Google? (Score:5, Informative)
unobtrusive adverts in the right-hand column notwithstanding.
Accuracy is relevance (Score:3, Informative)
The problem with Google is that there are errors in it: you ask for something and sometimes you get something else.
A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.
Since it seems like Google will never fix this problem, I'm looking forward to something with all of Google's great features, plus accuracy.
Re:Hook it up to slashdot! (Score:3, Informative)
Re:Accuracy is relevance (Score:4, Informative)
This is a bit of a misrepresentation. Google will toss the words 'to', 'be', and 'or'. So you effectively end up searching on 'not'. It does this to eliminate words that show up too frequently and to make searches faster (and because of the overloading of the word 'or'). If you really want that text, then either quote the whole thing or place a '+' in front of those words, which will give you exactly what you're looking for. So there is no problem with its accuracy when you understand the proper way to ask it for something.
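A rough sketch of the behavior described above, for anyone who wants to see it spelled out. The stop-word list and the `parse_query` function are made up for illustration; this is a toy model of the idea, not how Google actually parses queries.

```python
import re

# Hypothetical stop-word list for illustration; not Google's actual list.
STOP_WORDS = {"to", "be", "or", "the", "a", "of"}

def parse_query(query):
    """Return (phrases, terms) that a toy engine would actually match."""
    # Quoted phrases are kept whole, stop words included.
    phrases = re.findall(r'"([^"]+)"', query)
    # Remove the quoted parts (and an optional leading '+') from the rest.
    rest = re.sub(r'\+?"[^"]*"', " ", query)
    terms = []
    for token in rest.split():
        if token.startswith("+") and len(token) > 1:
            # A leading '+' forces the word to be kept.
            terms.append(token[1:].lower())
        elif token.lower() not in STOP_WORDS:
            # Unmarked stop words are silently dropped.
            terms.append(token.lower())
    return phrases, terms

print(parse_query('to be or not to be'))
print(parse_query('"to be or not to be"'))
print(parse_query('+to +be or not'))
```

In this toy version the unquoted query is reduced to the single term 'not', while the quoted query survives as one exact phrase.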
Re:Google? (Score:3, Informative)
That's why people use Google. If they stacked the deck in favor of places people don't care about (advertisers' pages, for instance), then we'd all jump ship and use another search engine.
They're like the Swiss and Consumer Reports. Part of the reason they make money is neutrality, and they won't make as much if they're not.
Lucene (index and search engine) (Score:1, Informative)
Anyone ever heard of grub? (Score:2, Informative)
Re:Hardware? (Score:2, Informative)
I wouldn't count on it (Score:3, Informative)
Re:Lucene (index and search engine) (Score:5, Informative)
Lucene and Nutch are related:
http://scriptingnews.userland.com/2003/08/13#When
Paul Nakada, via email: "It appears that the coding muscle for Nutch is Doug Cutting, the author of Lucene, an Apache Project open source search engine. We use it here at salesforce and have a huge amount of respect for Doug's coding."
Re:Google? (Score:1, Informative)
Remember the Scientologists?
Re:Patents. (Score:2, Informative)
Re:not a good idea.... (Score:3, Informative)
I highly doubt that Nutch is going to offer an alternative to Google in the area of web search. What they seem to be doing is offering an alternative in the area of Enterprise search.
Currently, the company that I work for pays Verity (used to be Inktomi, before that Infoseek) tens of thousands of dollars a year for the use of their software. We use their software to make our own site searchable. If Nutch offered us a free alternative to our Ultraseek server, we'd definitely be interested.
We don't have to worry about anyone "googlebombing" our search collections because, well, we create all the content that goes into those collections. We'd love it if the algorithm that determined rankings was open-source. That way, we could change it to suit our specific needs if we thought it would help return more relevant results. There are currently a number of undesirable phenomena that we live with or work around because the mechanics of the problem are buried within proprietary Ultraseek code.
Google is the best of the best in web search and I don't think anyone short of MS is interested in challenging them for that. But 'search engine' in this case means something entirely different.
Irrational fear of money (Score:4, Informative)
How do they plan to pay for that? Apparently advertising is out. And we just had another monephobe complaining about lack of funds for his accounting software who expected people to donate because he couldn't figure out that maybe, just maybe he should find a way to sell his product in some form while also keeping one form free. I can get RedHat for free OR pay money to get a hard copy with some bonus stuff. Net result is that RedHat makes money and everyone is happy. Those who refuse to pay don't have to and those who are willing to pay have a reason to. Most people are not going to just give you money out of the goodness of their heart and accept nothing in return if they don't have to. Why do you think PBS gives you gifts with your donations?
I'd be more impressed with such undertakings if the owners weren't convinced the bandwidth fairy was real and that money will fall from the sky like manna.
When someone comes along who recognizes that the bandwidth fairy doesn't exist and that money needs to be acquired through marketing to get any real amount, then I'll think twice before laughing it off.
Free is a pretty dream but free don't pay the bills.
Ben
Re:Hook it up to slashdot! (Score:4, Informative)
While not currently designed for massive whole-web spidering (it's aimed at single websites or intranets), ht://dig is a great starting point (and a lot further along than the Nutch 'nascent effort' mentioned in the story). Some database optimization to ht://dig seems easier than starting over with Nutch. Plus, the name 'Nutch' sucks.
Re:Accuracy is relevance (Score:4, Informative)
Of course not. I'd put it back and try more carefully to get what I want. I, what's the word I'm looking for, . . . wait for it . . . refine my search
Regarding your comments above about Google inaccuracy: I searched for +"to be or not to be" [google.com] and consider the first page of 10 hits to definitely be 100% "correct". In fact, all of the 104,00 results that I checked (about 50, hehe) are 100% correct in that the sites on the list, or the sites linking to the sites on the list, contain the phrase "to be or not to be". Check the '2bee or nottoobee' link in Google's cache and where you normally see the search term highlight colors, you'll see:
These terms only appear in links pointing to this page: to be or not to be
Just because you wanted "Shakespeare" doesn't mean that "Shakespeare" is any more correct as an "answer" to "to be or not to be". If it were more popular (on the web), I'm confident that it would be higher on the list. That is, whether we like it or not, on the current www there are exactly 3 things more relevant to that famous phrase than Shakespeare, and they are, in order: barium enemas, BeOS, and a kids' grammar game starring a bee. Or, more accurately and revealingly: an article about barium enemas titled "To BE or Not to BE?", an article about BeOS titled "TO Be OR NOT TO be?", and a kids' grammar game starring a bee called "2Bee or Nottoobee" which is linked to by sites containing the phrase "to be or not to be" in or near those links.
Lucky for us that ol' Bill is still in the top 10 at all, I'd say.
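For the curious, here is a minimal sketch of how a page can match a phrase it never contains, purely because other sites link to it with that phrase. The index layout, the function names, and the example URLs are all invented for illustration; this is not Google's actual indexing scheme, and it ignores ranking entirely.

```python
from collections import defaultdict

def build_index(pages):
    """pages maps url -> {"text": str, "links": [(target_url, anchor_text)]}."""
    index = defaultdict(set)  # word -> set of urls associated with it
    for url, page in pages.items():
        # Words in the page body are credited to the page itself.
        for word in page["text"].lower().split():
            index[word].add(url)
        # Words in link (anchor) text are credited to the page linked to.
        for target, anchor in page["links"]:
            for word in anchor.lower().split():
                index[word].add(target)
    return index

def search(index, query):
    """Return the urls associated with every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages: the game never says "to be", but a fan site's link does.
pages = {
    "grammar-game.example": {"text": "2Bee or Nottoobee grammar game", "links": []},
    "fan.example": {
        "text": "fun links",
        "links": [("grammar-game.example", "to be or not to be")],
    },
}
print(search(build_index(pages), "to be or not to be"))
```

The game page comes back as a match even though the words only appear in a link pointing to it, which is the same situation the cache notice describes.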
Re:Hook it up to slashdot! (Score:4, Informative)
For example, look at the results [google.com] for the search 'convert wmv mpeg'. The first three results lead to the same exact search site. (Whether they have pop-ups or not, I can't tell, because I block them.) The fourth result is another search site. And then the last three are the same as the first three.
Of course, this obviously works with stuff you'd expect it to, like 'mp3s' and 'warez' and 'porn', but it works with legitimate stuff too. I wonder if there'll be anything to combat this trend, whether it be implemented by Google or by someone else....
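One simple way an engine might fight this, sketched under a big assumption: that the duplicate results share a hostname (the results above may well use different hostnames that all lead to one site, in which case this would not catch them). `collapse_by_host` is a made-up helper, not something Google or Nutch is known to use.

```python
from urllib.parse import urlparse

def collapse_by_host(result_urls, max_per_host=1):
    """Keep at most max_per_host results per hostname, preserving rank order."""
    seen = {}  # hostname -> number of results already kept
    kept = []
    for url in result_urls:
        # Fall back to the raw string if no hostname can be parsed.
        host = urlparse(url).hostname or url
        if seen.get(host, 0) < max_per_host:
            kept.append(url)
            seen[host] = seen.get(host, 0) + 1
    return kept

ranked = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/",
    "http://a.example/3",
]
print(collapse_by_host(ranked))
```

With the default limit of one result per host, the three a.example entries collapse to the first one and b.example moves up behind it.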
Re:Hook it up to slashdot! (Score:2, Informative)
Re:Google? (Score:3, Informative)
Shameless plug for SWISH++ (Score:4, Informative)