Nutch: An Open Source Search Engine
Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch.
In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?
Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.
Patents. (Score:5, Interesting)
I see two problems (Score:1, Interesting)
Two problems:
not a good idea.... (Score:4, Interesting)
people are already 'googlebombing' to try and get better rankings by signing up tons of domains and cross linking them all with the keyword that they want to be #1...
if the algorithm that decides who is #1 were public, then the best possible strategy to cheat the system could be devised... instead of paying the search engines for weight you would be paying web developers to make the search engine think you were #1. and as a web developer i feel that.... oh... wait, proceed.
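To make the cross-linking trick above concrete, here is a minimal sketch, assuming a simplified PageRank-style power iteration. This is not Nutch's or Google's actual ranking code, and every page name in it is made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration (illustrative only).

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical web: two honest sites link to each other; nobody links to "spam".
honest = {"a": ["b"], "b": ["a"], "spam": []}

# Same web plus 20 made-up farm domains, all cross-linked with "spam".
farm_pages = [f"farm{i}" for i in range(20)]
farmed = {"a": ["b"], "b": ["a"], "spam": list(farm_pages)}
for name in farm_pages:
    farmed[name] = ["spam"]

before = pagerank(honest)
after = pagerank(farmed)
```

In this toy graph the "spam" page ranks last in `before` and first in `after`, without a single honest site linking to it. Real engines add countermeasures on top of this, so treat it as a picture of the incentive, not of any real ranking.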
looking forward to it (Score:1, Interesting)
A Tough Challenge (Score:5, Interesting)
Most of the major engines today, including Google, make an effort to prevent horribly unbalanced results (see the recent controversy over blogs outweighing professional sites in the rankings due to linking and other factors). Some even admit (again, Google does) to manually messing with the rankings a little. If you search for suicide methods, they will bend the engine to make sure you get reasons why you shouldn't commit suicide before you get the how-to. That's in their own public docs. It's also discussed in Wired.
I honestly don't know if open-source could do a better job. The algorithm might be better (likely, given the manpower), but would it really be that much fairer?
Distributed Open Search Network (Score:2, Interesting)
Since this would be an open search network, some peer servers could specialize in searching what they're hosting, making it possible to index otherwise dynamically generated content. These specialized hosts would act as "search plugins" for some otherwise hard-to-define content.
An authentication method (a la Freenet) would be needed, though. Some form of authority to prevent rogue peers from injecting too much crap in the results.
Overall, a good idea. If they make it, I'll run it.
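A rough sketch of how results from such peers might be merged, assuming each peer returns (url, score) pairs and that some authentication step (not shown) has already produced a trust weight per peer. The function name, peer names, URLs, and weights are all hypothetical.

```python
from collections import defaultdict


def merge_peer_results(peer_results, trust, per_peer_cap=3, top_k=5):
    """Merge (url, score) lists from several peers into one ranking.

    peer_results: {peer_id: [(url, score), ...]}
    trust: {peer_id: weight between 0 and 1}; unknown peers count as 0.
    per_peer_cap: maximum number of results accepted from any one peer.
    """
    combined = defaultdict(float)
    for peer, results in peer_results.items():
        weight = trust.get(peer, 0.0)
        if weight <= 0.0:
            continue  # unauthenticated or distrusted peer: ignore it
        # Keep only the peer's best few hits so one peer cannot flood the list.
        best = sorted(results, key=lambda r: r[1], reverse=True)[:per_peer_cap]
        for url, score in best:
            combined[url] += weight * score
    # Highest combined score first; ties broken by URL for a stable order.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:top_k]]


# Made-up peers: two specialized hosts and one rogue peer injecting junk.
peer_results = {
    "wiki-peer": [("wiki/nutch", 0.9), ("wiki/lucene", 0.7)],
    "forum-peer": [("forum/nutch-thread", 0.8), ("wiki/nutch", 0.6)],
    "rogue-peer": [(f"spam/{i}", 1.0) for i in range(50)],
}
trust = {"wiki-peer": 1.0, "forum-peer": 0.8, "rogue-peer": 0.1}

merged = merge_peer_results(peer_results, trust)
```

The cap and the trust weight do different jobs: the cap bounds how much of the result list any single peer can occupy, while the weight pushes whatever a low-trust peer does contribute toward the bottom.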
Re:Patents. (Score:2, Interesting)
If this is to be cheap to run, it will probably have to be distributed, and thus a very different architecture than most of what we've seen up to now.
Re:Seems pretty pointless (Score:2, Interesting)
Let's check out the credits page... (Score:3, Interesting)
Overture Research has donated hardware and helped to fund development.
So, even an "open source," "unbiased" search engine is funded by a commercial search organization.
funding (Score:3, Interesting)
Distributing the Power (Score:4, Interesting)
With that in mind, how would this project help? It would allow webmasters to quickly & easily modify it for their needs, and deploy their own niche engines; in other words, Google would be supplemented by 10,000 niche search engines, each focusing on a specific field (microsoft propaganda, for instance). This would create a balance of power, ensuring that no single search engine accumulates an insane amount of control over the web as a whole.
Re:Google? (Score:3, Interesting)
Search Engine Monoculture (Score:5, Interesting)
The point is, are you really comfortable having one, and only one, effective search engine? No matter how well it searches?
O'Reilly [userland.com] put it best
Actually, Nutch has no ambitions to dethrone Google. It's just trying to provide an open source reference implementation of search to help keep Google and other search engines honest, by letting people compare the results of an engine whose algorithms and methodologies are transparent and accessible. It also aims to give a platform for people outside of the search heavyweights to research new search algorithms.
Java? (Score:2, Interesting)
Portability should be a moot point since the pages can be generated on the server, which could easily run an OS-specific binary.
Large scale and DB (Score:3, Interesting)
Firstly, the choice of Java: personally I have no gripe about this, and the decision to use language-independent formats is a good idea. My main concern is scaling up and distributing over multiple machines.
At present I make the educated guess that a project on this scale, in Java, would still be best run on a `hardware base as uniform as possible', like UltraSparc 450s with a fibre backplane.
My second concern is that there is so much choice of indexing and searching techniques that there are sure to be some problems due to patent restrictions.
Just browsing the US patent office gave me a couple of possible patent nasties:
6,463,428 or 6,278,992. (And about 10 others I glanced at...)
Lastly, the DB: in the short time I've been looking at the code, it seems to me that a choice was made to implement a DB built specifically for the problem. Although this could be a good thing, it is usually better to reuse existing products. I found Sleepycat (DB4) to match the requirements. And if the choice is final, read this. [1]
I hope these comments are useful to somebody at least.
[1] http://www.xlnt-software.com/xml_dl.html
Re:"written in Java" ?!? -- trashcan. Next ? (Score:3, Interesting)
I met someone the other day who had an associate's in Computer Science from a community college and had never used anything but an AS/400 and a Mac. (Not even Windows! Seriously!) I think people saw the dollar signs from 4-8 years ago and went to school for something they really are only marginally interested in just because they thought they could make a few more bucks. It's a damn shame too, because these shitty little tech schools (or community colleges hawking their tech "degrees") are doing a disservice to these people by making them believe that 1) they'll get a good education, 2) they'll have a job waiting for them when they graduate, and 3) they'll be respected by their peers in the IT biz as professionals.
And then there's my niece who got a nursing degree in 18 months, had 4 interviews her first week out, and now makes more than your average 4-year college graduate. Go figure.
But anyway, yeah, I agree with you. Don't