Nutch: An Open Source Search Engine
Posted by Hemos on Wed Aug 13, 2003 03:51 PM
from the but-will-it-matter dept.
Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch.
In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?
Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Patents. (Score:5, Interesting)
Re:Patents. (Score:5, Insightful)
Hmmm, I just realized something... with patents, you end up stepping on people's toes. Without patents, you get to stand on their shoulders. Which do you think is the better vantage point?
Re:Patents. (Score:4, Insightful)
Of course, in practice patents are a mess.
Re:Patents. (Score:5, Insightful)
Re:Lucene (index and search engine) (Score:5, Informative)
Lucene and Nutch are related:
http://scriptingnews.userland.com/2003/08/13#When
Paul Nakada, via email: "It appears that the coding muscle for Nutch is Doug Cutting, the author of Lucene, an Apache Project open source search engine. We use it here at salesforce and have a huge amount of respect for Doug's coding."
The purpose of a search engine (Score:2)
I'm pretty sure a search engine is supposed to be for whatever purpose the people making it want it to be.
Re:The purpose of a search engine (Score:3, Funny)
And I'm sure many Slashdotters would love a search engine dedicated to finding pr0n and anti-Microsoft propaganda. Right?
Re:The purpose of a search engine (Score:4, Funny)
Porn [sublimedirectory.com]
Anti-Microsoft Propaganda. [slashdot.org]
Google? (Score:5, Informative)
unobtrusive adverts in the right-hand column notwithstanding.
Re:Google? (Score:4, Insightful)
Re:Google? (Score:3, Interesting)
Re:Google? (Score:3, Informative)
Re:Google? (Score:3, Informative)
That's why people use Google. If they stacked the deck in favor of places people don't care about (advertisers' pages, for instance), then we'd all jump ship and use another search engine.
They're like the Swiss and Consumer Reports. Part of the reason they make money is
Slimey adverts? (Score:3, Insightful)
Also of note is that companies can still influence search engines in slimey ways - Google can be manipulated to make a page rank higher, although Google keeps an eye on this activity and works around it.
Re:Slimey adverts? (Score:3, Funny)
You speak blasphemy! How dare you speak of such practical issues as money when talking about free software!
Re:Slimey adverts? (Score:5, Insightful)
Anyone could take this source code and with enough money, challenge Google.com as the top search engine.
I see this project as a competitor to shrink-wrapped search engines, e.g. the Google appliance [google.ca] or maybe even Folio-based products. Typically corporations have many documents that need to be indexed and searchable to their needs.
I haven't seen this on the homepage: it doesn't list what content it can index. I hope it can at least index PDFs and popular Office documents. Maybe even media files? And which XML fields are indexed? Or external metadata?
Shameless plug for SWISH++ (Score:4, Informative)
Advertising != Manipulating the rankings (Score:3)
After the "paid" listings come the Inktomi listings. Those crawler based listings include PFI (pay for inclusion, you pay for daily spidering, but no "boost" in rankings) and the
Biased listings (Score:5, Insightful)
I'm quite comfortable with how Google does this (present commercial links clearly marked to the side), and am not convinced a non-commercial (open source) alternative is needed.
just don't get it (Score:4, Insightful)
Re:just don't get it (Score:5, Insightful)
Accuracy is relevance (Score:3, Informative)
The problem with Google is that there are errors in it: you ask for something and sometimes you get something else.
A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.
Since it seems like Google will never fix this problem, I'm looking forward to something with all of Google's great features, plus accuracy.
Re:Accuracy is relevance (Score:4, Informative)
This is a bit of a misrepresentation. Google will toss the words 'to', 'be' and 'or', so you effectively end up searching on 'not'. It does this to eliminate words that show up too frequently and to make searches faster (and because of the overloading of the word 'or'). If you really want that text, then either quote the whole thing or place a '+' in front of those words, which will give you exactly what you're looking for. So there is no problem with its accuracy when you understand the proper way to ask it for something.
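For anyone who wants to see the stop-word behaviour spelled out, here is a rough sketch in Python. It is a toy illustration only, not Google's actual query parser: the stop-word list and the function name `parse_query` are assumptions made up for this example.

```python
import re

# Hypothetical stop-word list; the list a real engine uses is not public.
STOP_WORDS = {"to", "be", "or", "the", "a", "of"}


def parse_query(query):
    """Return the terms a toy engine would actually search for.

    Quoted phrases are kept whole, +word forces a stop word to be kept,
    and bare stop words are silently dropped.
    """
    terms = []
    # Match either a "quoted phrase" (optionally prefixed by +) or a bare token.
    for match in re.finditer(r'\+?"([^"]+)"|(\S+)', query):
        phrase, token = match.group(1), match.group(2)
        if phrase is not None:
            terms.append(phrase.lower())
        elif token.startswith("+") and len(token) > 1:
            terms.append(token[1:].lower())
        elif token.lower() not in STOP_WORDS:
            terms.append(token.lower())
    return terms


print(parse_query("to be or not to be"))    # bare words: stop words dropped
print(parse_query('"to be or not to be"'))  # quoted: phrase kept whole
```

With this toy list, the unquoted query collapses to the single term 'not', while the quoted one survives as a whole phrase, which is the difference described above.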
Re:Accuracy is relevance (Score:4, Informative)
Of course not. I'd put it back and try more carefully to get what I want. I, what's the word I'm looking for, . . . wait for it . . . refine my search
Regarding your comments above about google inaccuracy: I searched for +"to be or not to be" [google.com] and consider the first page of 10 hits to definitely be 100% "correct". In fact, all of the 104,00 results that I checked (about 50, hehe) are 100% correct in that the sites on the list, or the sites linking to the sites on the list, contain the phrase "to be or not to be". Check the '2bee or nottoobee' link in google's cache and where you normally see the search term highlight colors, you'll see
These terms only appear in links pointing to this page: to be or not to be
Just because you wanted "Shakespeare" doesn't mean that "Shakespeare" is any more correct as an "answer" to "to be or not to be". If it were more popular (on the web), I'm confident that it would be higher on the list. That is, whether we like it or not, on the current www there are exactly 3 things more relevant to that famous phrase than Shakespeare, and they are, in order: barium enemas, BeOS, and a kids' grammar game starring a bee. Or, more accurately and revealingly: an article about barium enemas titled "To BE or Not to BE?", an article about BeOS titled "TO Be OR NOT TO be?", and a kids' grammar game starring a bee called "2Bee or Nottoobee" which is linked to by sites containing the phrase "to be or not to be" in or near those links.
Lucky for us that ol' Bill is still in the top 10 at all, I'd say.
Seems pretty pointless (Score:5, Insightful)
What they should be doing is pressuring the existing search engine companies for some integrity.
not a good idea.... (Score:4, Interesting)
people are already 'googlebombing' to try and get better rankings by signing up tons of domains and cross linking them all with the keyword that they want to be #1...
if the algorithm that decides who is #1 was public, then the best possible strategy to cheat the system could be devised... instead of paying the search engines for weight you would be paying web developers to make the search engine think you were #1. and as a web developer i feel that.... oh... wait, proceed.
Re:not a good idea.... (Score:3, Informative)
I highly doubt that Nutch is going to offer an alternative to Google in the area of web search. What they seem to be doing is offering an alternative in the area of Enterprise search.
Currently, the company that I work for pays Verity (used to be Inktomi, before that Infoseek) tens of thousands of dollars a year for the use of their software. We use their software to make our own site searchable. If Nutch offered us a free alternative to our Ultraseek ser
Can this work? (Score:5, Insightful)
The other major problem would be that, with the ranking criteria being available for all to see, it would be relatively simple to manipulate page rankings.
Re:Can this work? (Score:3, Insightful)
I think the way to overcome this obstacle is to develop a distributed system...run a nutch node on your server, host a few GBs of index data. There could be master nodes
A Tough Challenge (Score:5, Interesting)
With most of the major engines today including Google, they make an effort to prevent horribly unbalanced results (recent controversy over blogs outweighing professional sites in the rankings due to linking and other factors). Some even admit (again, Google does) to manually messing with the rankings a little. If you search for suicide methods, they will bend the engine to make sure you get reasons why you shouldn't commit suicide before you get the how-to. That's in their own public docs. It's also discussed in Wired.
I honestly don't know if open-source could do a better job. The algorithm might be better (likely, given the manpower), but would it really be that much fairer?
Are they thinking too big? (Score:3, Insightful)
I don't see a solution in one great open-source, independent search engine, but many individual specialized search engines, each mastering their own niche area of specialty stands a chance to compete, especially if run by people who focus on their areas of expertise. Alternative news search engines, music search engines, literary search engines, etc. each run by people who know what to filter in and out.
If Nutch.org could create the technology that would allow each of these search engines to exist autonomously, it could also be the hub/portal/start-page/blahblahblah that links all these engines and databases together.
Alex.
The answer is "Nutch" (Score:4, Funny)
The answer: "What did Sean Connery say when he saw the reviews for 'League of Extraordinary Gentlemen'?"
Let's check out the credits page... (Score:3, Interesting)
Overture Research has donated hardware and helped to fund development.
So, even an "open source," "unbiased" search engine is funded by a commercial search organization.
funding (Score:3, Interesting)
Distributing the Power (Score:4, Interesting)
With that in mind, how would this project help? It would allow webmasters to quickly & easily modify it for their needs, and deploy their own niche engines; in other words, Google would be supplemented by 10,000 niche search engines, each focusing on a specific field (microsoft propaganda, for instance). This would create a balance of power, ensuring that no single search engine accumulates an insane amount of control over the web as a whole.
Bias: Inevitable (Score:3, Insightful)
"In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine."
Bias is inevitable -- we're talking about ranking, which necessarily means bias.
The question is: what bias do you want? What bias suits your purposes?
My ideal search engine would offer a variety of biases from which to pick.
Search Engine Monoculture (Score:5, Interesting)
The point is, are you really comfortable having one, and only one, effective search engine? No matter how well it searches?
O'Reilly [userland.com] put it best
Actually, Nutch has no ambitions to dethrone Google. It's just trying to provide an open source reference implementation of search to help keep Google and other search engines honest, by letting people compare the results of an engine whose algorithms and methodologies are transparent and accessible. It also aims to give a platform for people outside of the search heavyweights to research new search algorithms.
Irrational fear of money (Score:4, Informative)
How do they plan to pay for that? Apparently advertising is out. And we just had another monephobe complaining about lack of funds for his accounting software who expected people to donate because he couldn't figure out that maybe, just maybe he should find a way to sell his product in some form while also keeping one form free. I can get RedHat for free OR pay money to get a hard copy with some bonus stuff. Net result is that RedHat makes money and everyone is happy. Those who refuse to pay don't have to and those who are willing to pay have a reason to. Most people are not going to just give you money out of the goodness of their heart and accept nothing in return if they don't have to. Why do you think PBS gives you gifts with your donations?
I'd be more impressed with such undertakings if the owners weren't convinced the bandwidth fairy was real and that money will fall from the sky like manna.
When someone comes along who recognizes that the bandwidth fairy doesn't exist and that money needs to be acquired through marketing to get any real amount then I'll think twice before laughing it off.
Free is a pretty dream but free don't pay the bills.
Ben
Large scale and DB (Score:3, Interesting)
Firstly, the choice of Java: personally, I have no gripe about this. And the choice to use language-independent formats is a good idea. My main concern is for the larger scaling and distribution over multiple machines.
At present I make the educated guess that a project on this scale, in Java, would still be best run on a `hardware base as uniform as possible', like UltraSparc 450s with a fibre backplane.
My second concern is that there is so much choice of indexing and searching technique that there are sure to be some problems due to patent restrictions.
Just browsing the US patent office gave me a couple of possible patent nasties:
6,463,428 or 6,278,992. (And about 10 others I glanced at...)
Lastly, the DB: in the short time I've been looking at the code it seems to me that a choice was made to implement a DB built for the problem. Although this could be a good thing, it is usually better to reuse existing products. I found SleepyCat (DB4) to match the requirements. And if the choice is final, read this [1].
I hope these comments are useful to somebody at least.
[1] http://www.xlnt-software.com/xml_dl.html
Some commentary... (Score:3, Insightful)
I have a few comments on this development:
An open search engine application is a nice idea, but unfortunately it's one of those applications which are essentially useless without an enormous ASP architecture behind them. An earlier poster indicated that it might be useful for searching and indexing intranets and the like, analogously to the Google Search Appliance. This is indeed a valid potential application, but then, HT://Dig exists already. Is this dramatically better?
Comments and suggestions... (Score:3, Insightful)
This is exactly like the problem the mice had one day. They couldn't come out of their mouse hole because there was a dangerous cat prowling around. One day, as food was getting scarce and everyone was afraid to leave the hole, the mice called a meeting to discuss the problem. One excited young mouse came up with the most wonderful idea: Let's put a bell around the cat's neck, so that when the cat is nearby, the mice would have advance warning and could escape! All the mice got excited at this proposal, until a very old, very wise mouse came over and asked, "And who will tie the bell around the cat's neck?"
What I'm trying to say is: If the search engine is free software and companies don't pay to increase their ranking... who will pay for the bandwidth to host the engine? I can tell you this much:
Proposed solution? Make it a distributed search engine, like SETI@home, or the DNS.
This is much easier said than done because:
- RAID-like distributed storage technology would have to be developed, so that the indexing database could be distributed among all computers worldwide that donate bandwidth and storage. This would have to guarantee statistically that all the data will be available at any point in time even if people turn off their computers for extended periods of time. However, this technology could make reliable clustered storage a reality, and the resulting free software implementation could be licensed for corporate use for an exorbitant price, which would go to the EFF, FSF and other organizations that develop free software and/or support the development thereof.
- An efficient P2P-like protocol, along with a network topology of some sort (like the DNS system has), would have to be developed to support the searching. It would have to be damn fast and, like before, very resilient to computers being shut off, chunks of data becoming lost at any moment, etc. Furthermore, changes would need to propagate at blazing speeds so that new items on the Internet could be found shortly after appearing.
- Bandwidth and disk quota would need to be managed at each participating host, so that limits set by the user are not exceeded.
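To put a rough number on the "guarantee statistically" part of the first point, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are not in the proposal itself: every host is online independently with the same probability, each index shard is copied to a fixed number of hosts, and the function names are made up for illustration.

```python
def shard_availability(p_online, replicas):
    """Probability that at least one replica of a single shard is online.

    Assumes each host is online independently with probability p_online.
    """
    return 1.0 - (1.0 - p_online) ** replicas


def index_availability(p_online, replicas, shards):
    """Probability that every shard of the index is reachable at once."""
    return shard_availability(p_online, replicas) ** shards


def replicas_needed(p_online, shards, target):
    """Smallest replica count that makes the whole index reachable
    with probability >= target, under the independence assumption."""
    if not 0.0 < p_online < 1.0:
        raise ValueError("p_online must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    replicas = 1
    while index_availability(p_online, replicas, shards) < target:
        replicas += 1
    return replicas


# Hosts that are online half the time, an index split into 1000 shards,
# and a 99% chance that the whole index is reachable at any moment.
print(replicas_needed(0.5, 1000, 0.99))
```

Real machines do not fail independently (time zones, power cuts, ISPs), so treat the result as a lower bound on the redundancy such a system would need, not as a design figure.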
Governments, companies, universities and individuals would likely support an effort like this by donating some bandwidth and storage, rather than money. In the spirit of worldwide computing on the Internet, I hope this makes some amount of sense.
Re:Hook it up to slashdot! (Score:3, Informative)
Re:Hook it up to slashdot! (Score:3, Insightful)
Re:Hook it up to slashdot! (Score:4, Informative)
While not currently designed for massive whole-web spidering (it's aimed at single websites or intranets), ht://dig is a great starting point (and a lot further along than the Nutch 'nascent effort' mentioned in the story). Some database optimization to ht://dig seems easier than starting over with Nutch. Plus, the name 'Nutch' sucks.
Re:Hook it up to slashdot! (Score:4, Informative)
For example, look at the results [google.com] for the search 'convert wmv mpeg'. The first three results lead to the same exact search site. (Whether they have pop-ups or not, i can't tell, because i block them.) The fourth result is another search site. And then the last three are the same as the first three.
Of course, this obviously works with stuff you'd expect it to, like 'mp3s' and 'warez' and 'porn', but it works with legitimate stuff too. I wonder if there'll be anything to combat this trend, whether it be implemented by Google or by someone else....
Search engine game is NOT over (Score:5, Insightful)
At one time, Oldsmobile won the auto company wars. Where are they now?
IBM ruled the PC roost. Hmmmm....
Command-line OS's were king. But now???
Altavista and infoseek and Lycos were search engine kings at one time. Whither this trio?
The point is, it is not over.
I wouldn't count on it (Score:3, Informative)
Re:Nutch? (Score:3, Funny)
Re:Not making nutch sense (Score:3, Funny)
Re:"written in Java" ?!? -- trashcan. Next ? (Score:3, Interesting)
I met someone the other day who had an associate's in Computer Science from a community college and had never used anything but an AS/400 and a Mac. (Not even Windows! Seriously!)