Nutch: An Open Source Search Engine

Posted by Hemos on Wednesday August 13, 2003 @04:51PM from the but-will-it-matter dept.

Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch. In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not an indirect and sometimes slimy method of advertising? Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.

Patents. (Score:5, Interesting)

by Christopher Thomas ( 11717 ) writes: on Wednesday August 13, 2003 @04:52PM (#6689370)

I hope the authors of this project do their homework. My impression is that most of the good search and indexing schemes have already been patented, which will make it difficult to release such a project without stepping on someone's toes.

I see two problems (Score:1, Interesting)

by Anonymous Coward writes: on Wednesday August 13, 2003 @04:55PM (#6689403)

Two problems:
1. Bandwidth. Searching through that much data is going to take a lot of bandwidth; how could you pay for it?
2. Patents. Google has lots of patents in this area, and I imagine other search engines do as well. This is one area where I think software patents are deserved, since Google's algorithms are actually innovative. I don't think they will be willing to let you use their patents in a GPLed app.

not a good idea.... (Score:4, Interesting)

by edrugtrader ( 442064 ) writes: on Wednesday August 13, 2003 @04:59PM (#6689435) Homepage

google is already ideal... the weight of search results is not sold, just text ads.

people are already 'googlebombing' to try and get better rankings by signing up tons of domains and cross linking them all with the keyword that they want to be #1...

if the algorithm that determines who is #1 were public, then the best possible strategy to cheat the system could be devised... instead of paying the search engines for weight, you would be paying web developers to make the search engine think you were #1. and as a web developer i feel that.... oh... wait, proceed.
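To make the cross-linking trick concrete, here is a minimal sketch of how a link farm can inflate a link-based score. It uses the textbook power-iteration form of PageRank, which is an assumption for illustration and not Google's actual production ranking; the domain names, the damping value, and the helper name `pagerank` are all hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical tiny web: nobody links to target.example.
honest = {
    "a.example": ["b.example"],
    "b.example": ["a.example"],
    "target.example": ["a.example"],
}

# Same web plus a link farm: five throwaway domains all pointing at the target.
farm = dict(honest)
for i in range(5):
    farm[f"farm{i}.example"] = ["target.example"]

before = pagerank(honest)["target.example"]
after = pagerank(farm)["target.example"]
print(f"target score without farm: {before:.4f}, with farm: {after:.4f}")
```

In this toy graph the target's score goes up once the farm domains exist, even though no independent site links to it. Real engines presumably layer spam detection on top of anything this simple, which is the arms race the comment is describing.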

looking forward to it (Score:1, Interesting)

by Anonymous Coward writes: on Wednesday August 13, 2003 @04:59PM (#6689441)

take a look at the developers and contributors. these guys are all top notch. doug cutting, one of the developers there, is the developer of lucene, one of the best libraries out there for developing application search engines in any language. not to mention overture, internet archive, and mitch kapor.. looks like an all-star team. can't wait to play with the software.

A Tough Challenge (Score:5, Interesting)

by Cloudmark ( 309003 ) writes: on Wednesday August 13, 2003 @05:01PM (#6689455) Homepage

One of the biggest issues with running a search-engine, open-source or otherwise, is that you can't eliminate bias in the results. No matter what scheme you put in place to handle rankings, someone will find a way to take advantage of it. It's a fact of any major system - there's always a way to twist it. Part of the challenge that Google and similar sites face is that they have to work constantly to protect themselves from systems designed to take advantage of their algorithm. While a completely unbiased search service would be nice, I think it would require the impossible. It would require that no one out here took advantage of it to further their own interests, be they political, commercial, or otherwise. That's fairly unlikely.

Most of the major engines today, including Google, make an effort to prevent horribly unbalanced results (see the recent controversy over blogs outweighing professional sites in the rankings due to linking and other factors). Some even admit (again, Google does) to manually messing with the rankings a little. If you search for suicide methods, they will bend the engine to make sure you get reasons why you shouldn't commit suicide before you get the how-to. That's in their own public docs. It's also discussed in Wired.

I honestly don't know if open-source could do a better job. The algorithm might be better (likely, given the manpower), but would it really be that much fairer?

Distributed Open Search Network (Score:2, Interesting)

by Massacrifice ( 249974 ) writes: on Wednesday August 13, 2003 @05:04PM (#6689487)

It'd be nice if they could make it distributed. Kinda like P2P search engines, but for the web. That way, the main searching server farm wouldn't be tied to any company in particular. That would give Google a run for their money, and would keep Microsoft at bay for another while.

Being an open search network, some peer servers could specialize in searching what they're hosting, making it possible to index otherwise dynamically generated content. These specialized hosts would act as "search plugins" for some otherwise hard-to-define content.

An authentication method (a la Freenet) would be needed, though. Some form of authority to prevent rogue peers from injecting too much crap in the results.

Overall, a good idea. If they make it, I'll run it.
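A rough sketch of the idea above, assuming a simple fan-out design: each peer searches only what it hosts, a central list of vouched-for peers stands in for the authority, and the best hits are merged. Nothing here comes from Nutch itself; `federated_search`, `make_peer`, the peer names, and the URLs are all hypothetical.

```python
import heapq


def federated_search(query, peers, trusted, limit=3):
    """Ask every trusted peer for (score, url) hits and merge the best ones."""
    hits = []
    for name, search in peers.items():
        if name not in trusted:
            # Skip peers the (hypothetical) authority has not vouched for.
            continue
        hits.extend(search(query))
    # Keep the highest-scoring results across all trusted peers.
    return [url for score, url in heapq.nlargest(limit, hits)]


def make_peer(index):
    """Build a toy peer that scores its own pages by how often the query word appears."""
    def search(query):
        word = query.lower()
        return [(text.lower().split().count(word), url)
                for url, text in index.items()
                if word in text.lower().split()]
    return search


peers = {
    "docs-peer": make_peer({"http://docs.example/nutch": "nutch crawler nutch index"}),
    "blog-peer": make_peer({"http://blog.example/post": "notes on nutch"}),
    # A rogue peer that claims an absurdly high score for spam.
    "rogue-peer": lambda query: [(999, "http://spam.example/")],
}

results = federated_search("nutch", peers, trusted={"docs-peer", "blog-peer"})
print(results)
```

The sketch takes each peer's self-reported score at face value, which is exactly why some form of authentication would be needed: without the trusted list, the rogue peer's made-up score would put its result first.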

Re:Patents. (Score:2, Interesting)

by socrates32 ( 650558 ) writes: <socrates32&gmail,com> on Wednesday August 13, 2003 @05:05PM (#6689494)

"most of the good search and indexing schemes have already been patented" Not at all... just the easy ones.
If this is to be cheap to run, it will probably have to be distributed, and thus a very different architecture than most of what we've seen up to now.

Re:Seems pretty pointless (Score:2, Interesting)

by jawtheshark ( 198669 ) * writes: <{moc.krahsehtwaj} {ta} {todhsals}> on Wednesday August 13, 2003 @05:20PM (#6689638) Homepage Journal

Yes, that would hold true if you want to index the WWW. But what about indexing an intranet? Right now businesses are paying Google for indexing servers (not that I think that is bad), but an open source search engine could save costs for medium-sized businesses. Just toss in another quad Xeon with a few gigs of RAM and it will do fine for a normal intranet.

Let's check out the credits page... (Score:3, Interesting)

by baggachipz ( 686602 ) writes: on Wednesday August 13, 2003 @05:22PM (#6689654)

Ooh, what's this?

Overture Research has donated hardware and helped to fund development.

So, even an "open source," "unbiased" search engine is funded by a commercial search organization.

funding (Score:3, Interesting)

by bindaaas ( 659754 ) writes: on Wednesday August 13, 2003 @05:27PM (#6689712) Homepage

let's see where the funding is coming from. The project is funded by Overture [overture.com], which is to be bought by Yahoo [yahoo.com]. More info is here [corporate-ir.net]. Hmm.. So I guess Yahoo needs a revival...

Distributing the Power (Score:4, Interesting)

by FsG ( 648587 ) writes: on Wednesday August 13, 2003 @05:39PM (#6689823)

I think having an open source search engine that people can modify and deploy would be an excellent thing, and here is why. Currently, google has the complete power to highlight or censor anything on the web. So far, they have used this power wisely, but that's no guarantee that it'll always be so. If they go public, you may find this power being used to increase the shareholders' wealth, rather than in the highest standards of fairness as it is today.

With that in mind, how would this project help? It would allow webmasters to quickly & easily modify it for their needs, and deploy their own niche engines; in other words, Google would be supplemented by 10,000 niche search engines, each focusing on a specific field (microsoft propaganda, for instance). This would create a balance of power, ensuring that no single search engine accumulates an insane amount of control over the web as a whole.

Re:Google? (Score:3, Interesting)

by g1zmo ( 315166 ) writes: on Wednesday August 13, 2003 @05:43PM (#6689860) Homepage

See this article [msn.com] on slate for some interesting ideas on why Google's page-ranking system is being undermined due to the evolution of ecommerce and price-comparing portals.

Search Engine Monoculture (Score:5, Interesting)

by peachawat ( 466977 ) writes: on Wednesday August 13, 2003 @05:52PM (#6689929) Journal

Why is it that when it comes to operating systems, everyone is bitching and screaming about how bad the monoculture created by Microsoft Windows is, but otherwise feels warm and fuzzy and swears to god that Google is and always will be the only search engine they use?

The point is, are you really comfortable to have one, and only one, effective search engine? No matter how well it searches?

O'Reilly [userland.com] put it best :

Actually, Nutch has no ambitions to dethrone Google. It's just trying to provide an open source reference implementation of search to help keep Google and other search engines honest, by letting people compare the results of an engine whose algorithms and methodologies are transparent and accessible. It also aims to give a platform for people outside of the search heavyweights to research new search algorithms.

Java? (Score:2, Interesting)

by slimak ( 593319 ) writes: on Wednesday August 13, 2003 @06:10PM (#6690049)

It seems like there would be a better choice than Java for the language when speed/efficiency is a must. Isn't the added overhead of the JVM going to decrease performance significantly?

Portability should be a moot point since the pages can be generated on the server, which could easily run an OS-specific binary.

Large scale and DB (Score:3, Interesting)

by webhat ( 558203 ) writes: <slashdotNO@SPAMspecialbrands.net> on Wednesday August 13, 2003 @08:38PM (#6690995) Homepage Journal

I was looking over the site and a number of things concerned me.

Firstly, the choice of Java: personally I have no gripe about this, and the choice to use language-independent formats is a good idea. My main concern is scaling up and distributing over multiple machines.
At present I make the educated guess that a project on this scale, in Java, would still be best run on a `hardware base as uniform as possible', like UltraSparc 450s with a fibre backplane.

My second concern is that there is so much choice of indexing and searching techniques that there are sure to be some problems due to patent restrictions.
Just browsing the US patent office gave me a couple of possible patent nasties:
6,463,428 or 6,278,992. (And about 10 others I glanced at...)

Lastly, the DB: in the short time I've been looking at the code, it seems to me that a choice was made to implement a DB built for the problem. Although this could be a good thing, it is usually better to reuse existing products. I found SleepyCat (DB4) to match the requirements. And if the choice is final, read this. [1]

I hope these comments are useful to somebody at least.

[1] http://www.xlnt-software.com/xml_dl.html

Re:"written in Java" ?!? -- trashcan. Next ? (Score:3, Interesting)

by forkboy ( 8644 ) writes: on Thursday August 14, 2003 @05:34AM (#6693440) Homepage

That's all they're teaching the kids in college these days. Seriously. At the school I go to (I'm not a CS major) you have to take C/C++ as an elective. The core CS curriculum is all Java. I don't think they even teach assembly there. Good schools are probably different of course, but who can afford a good school anymore?

I met someone the other day who had an associate's in Computer Science from a community college and had never used anything but an AS/400 and a Mac. (Not even Windows! Seriously!) I think people saw the dollar signs from 4-8 years ago and went to school for something they really are only marginally interested in just because they thought they could make a few more bucks. It's a damn shame too, because these shitty little tech schools (or community colleges hawking their tech "degrees") are doing a disservice to these people by making them believe that 1) they'll get a good education, 2) they'll have a job waiting for them when they graduate, and 3) they'll be respected by their peers in the IT biz as professionals.

And then there's my niece who got a nursing degree in 18 months, had 4 interviews her first week out, and now makes more than your average 4-year college graduate. Go figure.

But anyway, yeah, I agree with you. Don't /. while high, it makes one ramble.
