Peer-to-Peer Search Engine Wants You To Help Grub

FuzzyMan45 writes: "Check this out! Grub has finally finished writing their internet crawler. For those of you who don't know, grub is a distributed internet crawler that is indexing the internet and working towards an almost realtime index of all the pages, combined with a search engine. Think about it, a search with no dead links and no out-of-date pages! Grub is the way to go." It sounds like a cool hybrid of client- and server-side information (a crawler works from your computer, with updated findings sent to a central repository), and all GPL'd. If grub outperforms Google, I'll be happy as a Google-using clam -- but unless they have Google's logic and caching, that is a very tall order. A better search engine is a pleasant dream, though.
This discussion has been archived. No new comments can be posted.

Peer-to-Peer Meets Search Engine

  • by Anonymous Coward
    Have you tried []?

    It's the new best thing.

    Also [] seems promising but they don't have many pages *yet*.

  • by Anonymous Coward
    There's just way too much potential for abuse. People would constantly claim their site had changed, or generate content that isn't really there. There's already a huge problem with search engine spammers. This would just make it easier for them. The funny part is, when they (grub) start out, they probably won't have a problem with spam. But if they ever become popular... well, there goes the quality of the results.

    Besides, the real issue with search isn't crawling. It's indexing and ranking. They don't seem to have any real plan for how they're going to do that...

    The name grub is already taken []. I suggest the name 'maggot' as an alternative. This name would be very appropriate, because for the most part it's going to be crawling the rotten meat of the web.
  • by Anonymous Coward
    Check out my search engine, Kascade. It consists of a Linux client program that can be used to browse distributed open directories in a special format called DII and to communicate via IRC with others browsing the same categories. There is not a single open directory, as anyone can start a new one - for example by replacing the root category of an existing directory, so control is impossible. The DII format enables the use of programming-language like abstraction in open directories, as well as database like queries that dynamically create category structures while the user is browsing. Abstraction and queries can be used to present information in a multitude of ways, which makes searching much more precise than with other open directory systems. I've been working on it two years now and everything is working great. I'm now looking for people who would like to join me in further development and to start building those open directories!

    Please visit the project website []and tell me what you think.
  • The Gracenote problem is NOT really there. The way I see it, grub will need to keep its volunteers happy... otherwise their database decays in value - unlike the CDDB database. Web pages change (a lot), CD titles don't. So... if the volunteers become annoyed, grub's core business dies... that's a strong safeguard that grub must remain community friendly... added to that, the software is GPL.
  • by Klaruz ( 734 ) on Sunday May 13, 2001 @04:14AM (#226326)
    I have a few 'issues' with this.

    1) Lots of people running grub means that they don't have to spend money on bandwidth to index sites; they get nicely indexed data for free. Sure it's nice for them, but it seems like a kind of lame thing to do. I wonder how this will sit with people.

    2) What if I modify/run their GPL software to index my collection of web pages, and I return indexes that contain pages that aren't really there, to taint their search engine with my 10,000 non-existent pr0n pages. Do they have a way around that problem?

    3) Userland's sites stopped allowing certain bots to index them because they were being abusive. Google was accidentally blocked, even though they were being nice. But still, the load of Google crawling all those pages is huge. Grub says they can crawl every page on the internet every day. What kind of load will this cause? The FAQ says your client will only do part of the internet, so you won't have everybody hitting your site daily, but I'm wondering what kind of flaws exist in that logic. Could this result in huge loads on webservers, like the slashdot effect or a DDoS?
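The FAQ's claim that each client only does "part of the internet" suggests a deterministic partition of hosts across clients, so no single server gets hit by every volunteer. A minimal sketch of one way that could work - the hashing scheme here is an illustration, not Grub's documented design:

```python
import hashlib

def assigned_client(url: str, num_clients: int) -> int:
    """Map a URL's host to exactly one client by hashing.

    Every client computes the same mapping, so each host is
    crawled by a single volunteer rather than by all of them.
    """
    host = url.split("/")[2] if "://" in url else url
    digest = hashlib.sha1(host.encode()).hexdigest()
    return int(digest, 16) % num_clients

# Both pages live on the same host, so the same client owns them
# and the server sees traffic from one crawler, not thousands.
a = assigned_client("http://example.com/page1", 100)
b = assigned_client("http://example.com/page2", 100)
```

Under a scheme like this, the load on any one webserver is bounded by what a single client does, which is what would keep daily recrawls from turning into a distributed slashdotting.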
  • The software may be open source, but what's the license on the content of the database? I don't want to put huge amounts of work into creating what will become someone else's proprietary content, a la CDDB...
  • What if web masters manipulate the spider and return false search results (for luring people into pr0n, spam, propaganda...)?
    The grub concept sounds good, but I doubt it will hold up in reality unless you create a complex "web of trust" system. (Which in turn would be too complex for grub to become popular.)
  • Quick response:

    1) Some webmasters go to great lengths to ensure their pages are spidered often, whether it is nefarious attempts to get first page listing or just attempts to keep the engines up to date with content. These people would be happy to run a client that kept their page listing fresh.

    2) Count on it happening.

    3) Userland's content management system could do better to recognize robots and present them with a narrower view of the site. I set a robot to pull down one of my editthispage sites and it found every version of each page that Frontier can generate - at least 20 versions of each page. But you're right: what if grub starts spidering every morning at 8am? We're going to have some bad net congestion.

    I would imagine some of this has been addressed but these are important issues.

    Chris Cothrun
    Curator of Chaos

  • I've rescued a couple of dead sites out of it already, and been able to rip stuff from "obsolete" pages when it had disappeared from the original site.
  • A fairly easy solution would be to have every site scanned by one crawler and verified by another (possibly in browser mode, so it looks just like a browser to the web server); the search engine would then assume any page that differed between the two was dynamic and therefore shouldn't be cached.

    Hrm idea.. why couldn't you hook up a spider that searched the Net in this distributed way and added all files to FreeNet along w/ some sort of search system. Wasn't there something about an SQL engine for FreeNet posted the other day? That sounds interesting.
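The two-crawler verification idea above could be sketched like this: fetch the same URL from two independent clients and treat any mismatch as dynamic content that shouldn't be cached. Fetching itself is stubbed out here; a real client would issue the HTTP GETs.

```python
def looks_dynamic(body_a: str, body_b: str) -> bool:
    """Compare two independent fetches of the same URL; if the
    bodies differ, assume the page is dynamically generated and
    skip caching it."""
    return body_a != body_b

# A static page returns identical bytes to both crawlers.
static_differs = looks_dynamic("<html>hello</html>", "<html>hello</html>")
# A page that embeds the current time differs between fetches.
dynamic_differs = looks_dynamic("<html>10:01</html>", "<html>10:02</html>")
```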
  • I'm getting really fed up with all of this peer-to-peer shit lately. What ever happened to the client/server model? Because there are millions of computers on the internet doesn't mean they fucking need to be part of some borg collective, especially when they're connecting at 26.4kbps. Unless all of the network nodes are flying around on optical channels from a fibre line it is sort of ridiculous to expect everyone to contribute to any of these P2P projects. Napster worked but it wasn't a full fledged P2P network. You connected to a server and found other people with stuff you wanted and then your client negotiated a client to client send. Gnutella has had to adopt this same model in order to function with half the usability as Napster. It has incorporated the concept of connecting slow clients to a single fast one rather than directly to other slow clients.
  • Or worse, pretend to be pr0n just to lure people in. That would be much worse - nobody's more impatient than when they're cruisin' for porn :)

    Caution: contents may be quarrelsome and meticulous!

  • For one thing, it excludes common small words like "to" "that" "the". Those words can be important when you're searching for a specific quote, say an old song or a line from a movie that you once heard.

    There's a solution [].
  • Pfff... who came up with that name? It's ridiculous... GRUB is a bootloader, not a URL indexer.

    Worse than that: I've built their client, and it even clashes with binary names. The client binary is called 'grub', which clashes with my /usr/sbin/grub.

    See also:

    Please hand over the domain to GNU, and stop confusing me.

    Bram Stolk
  • by Restil ( 31903 ) on Sunday May 13, 2001 @05:47AM (#226336) Homepage
    Actually, we're not really providing an index for them; we're providing a list of pages that were updated. The clients don't do much except monitor pages to see when they get updated. Of course, the clients COULD just send all the data to the servers - it makes no difference on the server side, they'll have to archive all the data anyway. The advantage here is they get notified of all the pages that change and can poll them at that time, instead of having to poll their entire index constantly when only 0.01% of those pages change each time.

    Ultimately this comes down to fairness and who owns the database itself. If it's open and free to everyone, then this is a good cause, even if they have to generate revenue to support the site that serves the searches. In fact, if they distribute their servers properly, it won't even be necessary to have a large revenue stream, as the load could be scattered more or less evenly over the entire volunteer pool.

    Unfortunately, due to their somewhat fishy intentions with regards to revenue, it might end up killing the project before it ever takes off.
    And it really seems like a good idea too.

    Oh well.
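The notification-versus-constant-polling point above can be sketched as a simple checksum diff: the server only re-polls the URLs whose client-reported checksum differs from what it has stored. The data shapes here are illustrative, not anything from Grub's actual protocol.

```python
def pages_to_poll(stored: dict, reported: dict) -> list:
    """Given the server's stored checksum per URL and the
    checksums just reported by monitoring clients, return only
    the URLs whose content changed since the last crawl."""
    return [url for url, checksum in reported.items()
            if stored.get(url) != checksum]

stored = {"a.html": "111", "b.html": "222", "c.html": "333"}
reported = {"a.html": "111", "b.html": "999", "c.html": "333"}
changed = pages_to_poll(stored, reported)
```

With 0.01% churn, the server refetches a handful of URLs instead of the whole index.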

  • Please note that RMS didn't post the above - I asked him. Someone is using his name. However, he agrees with the sentiment.
  • Wow, it seems that you also have a problem with the guys that sell Resin or with all of the Linux distros. :)

    Does Grub want to make money? Yes; there's no way Grub can afford the overhead of the service without passing the costs on to someone somewhere. However, that in and of itself does not make Grub an opportunist. In this current market, investors wouldn't care if you claimed your project was powered by Jesus if it didn't have a solid business model under it.

    The reasons Grub is opensource are varied. Kord comes from an ISP and BSD background. He knows from his own experience that a sys admin is not going to install networked software on a box when that network software can't be inspected. Having Grub be open source is insurance for the wise sysadmins. Kord being a FreeBSD freak from the start, I'm pretty sure closed source doesn't do it for him. Also, if you analyze the business model, there's no compelling reason to make it closed source. The money is in the data, not the ins and outs of how the clients and servers work.

  • by maddboyy ( 32850 ) on Sunday May 13, 2001 @07:18AM (#226339) Homepage

    Wow, I'm surprised to see Grub on Slashdot this morning. The first client beta was _just_ released last night!

    Anyways, I know the Grub guys and was there when Grub was just an idea being discussed over coffee. Although I can't speak with 100% authority, I feel that I can give some insight and perhaps some clarification to a few concerns/questions floating around. It appears that Kord and Iggy may have left a bit to be desired on the FAQ :)

    From my understanding, the initial desired audience is the ISP admin. As an ISP, you'd be able to have your grub client index and crawl sites that you host. In turn, those sites will be available on whatever search engine Grub is supplying data to. Those running an ISP or hosting websites know how often clients request that you make sure they get crawled and listed in a search engine; this is a pretty nice value-add for your ISP service then. In this case, it's a win-win scenario. Grub gets up to date information on sites and the ISP gets to provide a much requested service to its customers.

    Later, I believe the plan to encourage individuals on broadband connections is to provide rewards for a certain number of sites crawled and also prizes for top crawlers.

    There are some concerns about the licensing of the database. It's my understanding that Grub is taking a commercial-pay/non-commercial-free approach. That means, for instance, that if you started an opensource search engine you could use the Grub data for free. But if you're Google or Inktomi, you'll have to pay for access.

    The data will not be free to everyone. There's just no way anybody can provide the overhead costs for that kind of service free to everyone. I think charging only for commercial use is the best option in this case. Also, keep in mind that the server will eventually be released as well. This means that individuals could run their own grub servers and stockpile their own data.

    As far as the few statements regarding the stock options payment, I'm pretty sure all of the in house full-time developers get paid real money. However, Kord is really determined to make sure that those people kicking in 5-10hrs a week in their spare time get to share in some of the success when Grub hits it big. Once again, that's a win-win situation. The contributors get to work on a promising, useful OS project and if the world comes knocking for this better mouse trap, the contributors also get a bit of cash for their troubles.

    I'd encourage those that have concerns or are curious about the project to go ahead and download the client now while it's in such early development. Take a look at the code. Email Kord and Iggy and tell them what you think. Even email them if you think Grub is a stupid idea, but tell them why. I don't think wanting to make a successful commercial P2P application is a bad idea in and of itself.

  • Interesting business plan. Pay people 100% in stock options -- and in a business where many stock options have proven to be worthless. Well, it might fly.
  • The FTP server already seems to be bogged down by /. - I have mirrored a version here:
  • From the client README:

    We plan to port a release to Windows within a few months.
  • Well they are doomed to failure then, aren't they? After all everyone (M$) *knows* that you cannot make money or be viable if you use Open Source software.

    Good luck to them, I think we need a few companies making extortionate profits from free software. I won't be queuing up to help them make a buck, but I will be expecting them to add to the software base.


    > Fourth, Grub will provide consulting services
    > for companies wanting to set up their own Grub
    > networks. Large corporate intranets could be
    > quickly and efficiently indexed into a central
    > database with the Grub client/server model.
    > Consulting and coding for these proprietary
    > installations is a common model in Open Source
    > oriented businesses like Sendmail, MySQL and
    > Apache.

  • We aren't planning on charging just anyone for this service. If you want to be indexed on a regular basis, then you'll have to pay for it. If you run the Grub client, you don't have to pay, nor does anyone else that you want to list. We realize that you are giving up bandwidth for the project and we plan on making good to those that do run it.

    More info in my next post.

  • by kordless ( 48957 ) on Sunday May 13, 2001 @12:42PM (#226345)
    Ok. It's us, the Grub guys here.

    First off, we didn't expect to get Slashdotted so fast. We really weren't ready, but we do appreciate the attention that it has brought. I'd like to address some of the issues that you guys have brought up, because they ARE important to us. After all, you are (hopefully) running the client and we do care about what you think.

    OK, about the money thing. A few of you are blasting us for trying to make a buck off this idea. Last time I looked, it takes money to pay for servers, bandwidth and programmers! We didn't start this thing with the intention of ripping people off - we did it because current search engine crawling technology is behind the times and we thought we could fix it. Don't fault us for having a revenue model and a desire to build a solid company that feeds us.

    A LOT of you guys could benefit from having your web pages continuously indexed. It would help your customers AND it could possibly increase the quality of service you provide. Besides, we aren't proposing to charge you for this service if you help out by running the client - that would defeat the purpose of the whole project.

    Don't you already pay for bandwidth that gets used up by Google, or even Excite? What's the difference if we use it instead, and it possibly works better for you in the end?

    About the bootloader thing. Sorry about that, guys - I didn't realize there was an Open Source project named Grub until we had the cards printed, domain registered, incorporated and had the plaque on the door. I've fielded a few emails about Grub (the bootloader) and we try to get them pointed in the right direction. BTW, we don't have anything to do with either.

    About the security problems. We have thought about this and do have a solution proposed (though not implemented). We are planning on scheduling the same URLs out to multiple clients, in much the same way that SETI@Home does. If we get bogus results back from a particular client, then we'll know fairly quickly that someone is pulling a fast one on us. There are a few other things we can do, but it will take time to implement them.

    About the database. We really don't know about licensing the thing. Any comments or suggestions are MORE than welcome. We would like to leave it open for anyone to use or query, but charge LARGE corporations (like Google) for accessing LARGE bits of it.

    Give us some ideas on what you would like to see us do and we'll listen.
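The SETI@home-style redundancy kordless describes could look roughly like this: schedule the same URL to several clients, accept the majority checksum, and flag every client that disagreed. This is a sketch under the assumption of a simple majority vote; Grub's actual mechanism is unspecified.

```python
from collections import Counter

def vote_on_result(results):
    """results: list of (client_id, checksum) pairs for one URL,
    each from an independently scheduled client. Accept the
    majority checksum and flag every client that disagreed."""
    counts = Counter(checksum for _, checksum in results)
    accepted = counts.most_common(1)[0][0]
    dissenters = [client for client, checksum in results
                  if checksum != accepted]
    return accepted, dissenters

# Three clients crawl the same URL; "evil" submits a bogus result.
accepted, dissenters = vote_on_result(
    [("c1", "abc"), ("c2", "abc"), ("evil", "xyz")])
```

Clients that land on the dissenter list often enough could then be dropped from scheduling, which is what "we'll know fairly quickly that someone is pulling a fast one" amounts to.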


  • Will it kill my connection as efficiently as BearShare does?
  • From the FAQ []: (emphasis mine)
    • Q: What exactly does Grub do?
    • A: Grub is a company with a single purpose -

      ...We will make all software written during the project Open Source as well as all the hows and whats of setting up the network and database. If there is someone we can help by sharing what we've got, we'll share it.

      Q: That's insane, what will Grub's revenue be if it doesn't charge for the software?

      A: Open Source is not synonymous with NOT making money! We have come up with a hybrid business model that uses four distinct methods for generating revenue...

      By placing the crawler closer to the data (i.e. on the web server itself) our client will be able to analyze and index the data local to the system on which it is running.

      Q: So if I were a system admin or a website author I'd want to run the client?

      A: Yes! Anyone that provides web hosting/authoring services will have a use for running our client. In addition to crawling a portion of the Internet, the client can index the admin's/author's entire site each and every night, and then submit that summary to grub's servers for incorporation into the database. Running the client will allow them to provide an added value for their clients - having their web pages updated to the biggest index, each and every day.

    So there are supposed to be selfish reasons for people to run grub nodes.
  • I wonder if they could pay people running the client, just like Processtree [] plans to. I'm against some company getting a free ride with my bandwidth, but if they're going to pay me for it (even a relatively small amount), then I'm all for it.

    What do you guys think?

  • As it's a search engine, the content is only useful if it's kept fresh. I don't think that people would continue running the client if the database went the way of CDDB. Whether they like it or not, they've got to keep the people running the clients happy.

    As an investor, I wouldn't look too favourably on a business plan that relied on keeping the great unwashed slashdot audience happy (who else is going to run the client?)
  • Words like "to", "that" and "the" are known as stopwords in the search engine biz. They are not indexed because in and of themselves they contain no valuable content. They are only valuable as part of a search phrase, where they lend some additional meaning to the content around them. No search engine will include these stopwords in its index. Google would not be able to survive on a mere 4000 machines; it would require 20,000 or so were it to index stopwords as well.

    When you do a search for a phrase - indicated however that search engine prefers, usually just by putting it in quotes - the search engine usually does look at placement, even if it ignores the stopwords. So the number of words between two indexed keywords should be considered when it is searching the index.

    As for the way the interface works - it differs from search engine to search engine, but I would agree that it ought to support simple Booleans and probably phrasing in the manner of mathematical statements using brackets.
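The stopword-plus-placement scheme described above can be sketched as a positional index that drops stopwords but keeps word positions, so a phrase query can still check the gap between the surviving keywords. The STOPWORDS set here is a toy list, not any engine's real one.

```python
STOPWORDS = {"the", "to", "that", "a", "of"}

def index_positions(doc: str) -> dict:
    """Build a positional index that skips stopwords but records
    word positions, so a quoted-phrase query can still check the
    gap between the surviving keywords."""
    positions = {}
    for i, word in enumerate(doc.lower().split()):
        if word not in STOPWORDS:
            positions.setdefault(word, []).append(i)
    return positions

idx = index_positions("the quick brown fox jumps over the lazy dog")
```

For the phrase "the lazy dog", the engine would look for "lazy" and "dog" one position apart; "lazy" is at position 7 and "dog" at 8 here, so the phrase matches even though "the" was never indexed.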

  • If grub can provide adequate real-time indexing of the web then they
    don't really need to compete with google's search heuristics. They'll
    be providing a genuinely new service that could comfortably coexist with
    google's excellent searches of static pages.
  • Why not add a "Report Dead link" button next to each search result and have that site queued for indexing? This sounds very logical and yet it hasn't been implemented. Why not? Will there be too many dead links?
  • You should read the Google help page []. But here are some hints:

    1. Google does not support parenthesis grouping, as in "(dogs OR cats) AND (birds OR fish)". All searches are AND by default, so you can just type "dogs OR cats birds OR fish" to get the result. For "(dogs AND cats) OR (birds AND fish)", two searches is the easiest way.
    2. If you want to search for a word like "the", put a plus sign in front of it: "+the".
    3. Google does not find "cheerleaders" when you search for "cheerleader"; you have to search for "naked cheerleader OR cheerleaders".

    I'm not big on reading instructions either, but there are times when a moment's research pays big dividends.

  • This is an important point. It really makes grub and google complementary.
  • and why isn't this in java?

    This build has been compiled and run on:
    Linux RedHat 6.1 (2.2.12-20)
    Linux RedHat 7.0 (2.2.16-22)

    can't run it on windows, bwahahahaha

  • they should have just done it in java from the start.
  • So, this company uses your resources to create a large (distributed?) database...
    I wonder how we prevent the GraceNote scenario from occurring again?
    I wonder how we prevent the GraceNote scenario from occurring again?
  • This sounds like it could be fairly easy to abuse. Hack a client together that tells people your competitors sites are broken... bingo.
  • The crawler technology looks cool - i.e., why crawl a site unless you know it's been updated?
  • We are trying the client for a while. It seems to be nice about throttling its usage, which means it will not kill my bandwidth. Memory and processor usage are acceptable too. But I have concerns about the data as well. I WILL drop them a line with more specific questions, such as whether the data will stay open, since we are providing the mass resources, etc. Hope you do too; don't just post, ask.
  • Having read that people are shocked by someone selling all your hard earned data to someone else, I have to say I'm disappointed.

    While they initially might seem big and evil, the source to their product is freely available [].

    Is it a crime to make money? They haven't hidden anything. They clearly spell out what they plan to do with the data that they collect and how they plan to make money.

    How many open source projects, while being really good projects and open and free (in all senses) have lasted? Eazel being the most recent example. If they use open source and make money this can only be a good thing.

    If you don't like it, download the source and do it yourself.


  • So this means I have just installed a search engine in my MBR?

    Well, as long as it boots Linux I don't care what it REALLY is...
  • Seeing as how this company intends to use the work of volunteers to further their business plans, I really wouldn't want to help them.

    However, since their web crawling client is under the GPL, wouldn't it be possible for someone or a group of people to start a free (as in freedom) project using the client and a database that will always be available to the public? This would be a great idea, and would get many more users than a company that is likely to just close off its database after thousands of volunteers made contributions.

    Because, after all, that's the real power of Free Software.


  • First IMDB, then CDDB. Is 'grub' the next big steal-from-the-volunteers project? Might not be, but might be too. Come back when the content is open, free and GPL'ed (or under some equivalent license that prevents it from being taken away and locked up). For now, I'll pass, thanks.
  • My fundamental issue is with the following statement from your investors section:

    "...on what placement they get in Grub's search result sets."

    This would imply to me that if I pay money to have my site indexed regularly, my links will be placed higher in priority than other similar links. Yes, many commercial services do this. However, they don't rely on community bandwidth to do their indexing.

    The benefit your service would provide to the average user would be more reliable searches, as the index data is up to date. Mucking with the relevancy of returned data based on corporate funding changes this - in my opinion it makes it much less worthwhile. Hence it is much less worthwhile for me to provide valuable resources for indexing.

    Now, if you were to provide a benefit to the users of your indexing engine other than searching then it would be a different story (perhaps pay them for their services?).

    Perhaps we need a mojonation of the search world. If I run a search agent, then I earn mojo/points/dollars/whatever. I can use these later to get faster searches. If I don't provide cpu time/net bandwidth, then I need to provide some other form of input (such as money).
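The mojo idea in the last paragraph could be sketched as a toy credit ledger: clients earn credit for crawl work and spend it on searches. This is entirely hypothetical - nothing like it exists in Grub, and all names and rules here are invented.

```python
class MojoLedger:
    """Toy accounting for the proposed mojo economy: crawl work
    earns credit, searches spend it; anyone without credit would
    have to pay some other way."""

    def __init__(self):
        self.balance = {}

    def credit_crawl(self, client, pages_crawled):
        self.balance[client] = self.balance.get(client, 0) + pages_crawled

    def charge_search(self, client, cost=1):
        if self.balance.get(client, 0) < cost:
            return False  # no crawl credit: search denied
        self.balance[client] -= cost
        return True

ledger = MojoLedger()
ledger.credit_crawl("alice", 5)   # alice crawled 5 pages
ok = ledger.charge_search("alice")
denied = ledger.charge_search("mallory")  # never crawled anything
```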

  • by totalslacker ( 186017 ) on Sunday May 13, 2001 @11:02AM (#226366)

    Let me get this straight: they want me to run their client on my machines, using up my CPU and network bandwidth, so that they can resell that information to other search engines?

    I particularly like this piece from their "Investors" page:

    Third, Grub will begin charging website customers for content control. Content control consists of indexing updated information on a regular basis and controlling link placement in search results. Large sites who's revenue depends on sustained inbound web traffic will be charged based on the amount of data that they submit into Grub's database, and on what placement they get in Grub's search result sets.

    So basically, the sites that are willing to spend the most money will get their URLs pushed up to the top of the list. Relevancy be damned.

    Someone please tell me why I should dedicate my resources to this?

    I think the smartest thing about the whole idea was putting the whole thing under the guise of an "open source", "peer to peer", "distributed", "let's make the world better" search engine. They might have managed to get some real interest if they had done a better job at hiding their financial motives.

  • That said, I get frustrated by some of [Google's] quirks.

    That's the least of your problems.

    Lastly, it's still not clear to me whether a search for "naked cheerleader" gives the same result as "naked cheerleaders". Hence, I tend to use OR and AND (+) a lot in my searches, which as I just said doesn't seem to work very well.

    Here's your real problem. How can you tell that they're cheerleaders if they're naked? :o
  • I take it they don't have many 'clients' yet... I just signed up, and I'm number 24 :-)
  • When you put "the" or "to" in a quoted phrase, the "+" sign doesn't always work how it should. Or maybe it's the quotes that don't always work correctly. Either way, I get iffy results when I need to combine a plus and quotes.
  • I don't want to lose Google.

    "With a rubber duck, one's never alone."
  • "A better search engine is a pleasant dream though."

    Build a better search engine and the world will beat a path to your website?

    --Jo Hunter

  • So it sounds like they want to provide the info they gather to other existing search engines. Hey - now Grub crawling the internet and sending its data to Google to make Google even better - I'm all over that. Of course, if they send data to Excite, I'll stop running the client. I cannot believe how Excite (and all the affiliated search engines they have now purchased) pretty much requires payment to get added, and if you use the free form, 'the site will be reviewed and there is no assurance it will be added. Process may take 4 to 6 weeks.'

    Thank goodness for Google!

    But again - this brings up a question similar to what happened with CDDB. Here you have internet volunteers providing free CPU power and bandwidth as raw material for for-profit companies. Now granted - it is slightly different, since you can still Google for free :) I'm not that selfish, but obviously there are some companies I'd be HAPPY to play a small part in improving their data set (Google), and others that, given recent developments with URL submission and monetary sorting of search results, I wouldn't want to give data to unless they paid for it :)

    Which now that I read the site more is their business plan. Read their Investor Page [] I get a squirrely feeling about this. I don't care if the client is open source or not. Why should I use up my precious bandwidth to supply content to a for profit company to sell to other for profit companies? Yes, they give the data away to non profits, but heck - most of them use Google anyway :)

    And of course they are following the lead of the other greedy search sites - adjusting search result order for money, which I can't stand. Google is the one search engine that got it right - sort data by relevance and popularity.

    I'll read more about it - but I think I'm gonna pass on this one. I just don't see the benefit for the volunteers who run this, both on a selfish individual scale and on a broader Internet community scale.


  • Read their Investor Page [] - they absolutely plan on charging the search engines to use the data AND to sell top result spots to the highest bidder. Open source or no open source - this is a joke - they won't get a sliver of my bandwidth.

    Here is the section outlining what they plan to do with all this free data 'volunteers' give them:

    The first revenue stream will come from selling URL status information to companies like Google and Altavista. This status information will enable existing crawlers to target the crawls for a particular day, based on the highly up-to-date information contained in our database. These status updates are similar in nature to the service provided by someone like NetMind, in which a change on a website triggers an action. Grub's database will be much vaster by comparison however, enabling it to provide services directly to wholesale search engines.

    Second, Grub will begin selling "wholesale searches" to other search engines and companies. Grub will make strategic alliances with other search engines much in the same way that Google has done with Yahoo and Inktomi has done with Hotbot. Grub will also provide one-shot search results for a large search query, delivering the data in a database format (like XML) instead of a web format.

    Third, Grub will begin charging website customers for content control. Content control consists of indexing updated information on a regular basis and controlling link placement in search results. Large sites who's revenue depends on sustained inbound web traffic will be charged based on the amount of data that they submit into Grub's database, and on what placement they get in Grub's search result sets.

    Fourth, Grub will provide consulting services for companies wanting to set up their own Grub networks. Large corporate intranets could be quickly and efficiently indexed into a central database with the Grub client/server model. Consulting and coding for these proprietary installations is a common model in Open Source oriented businesses like Sendmail, MySQL and Apache.

    Guess they thought we were really that stupid!


  • Reading over the FAQ, they state that they will, at some point, have a front end to search the data:

    Q: How much longer will it be before has a searchable index?
    A: The first phase of the client and server project has just started. We expect that phase to take somewhere between 2-3 months to complete. At that time, we will begin deploying the client to beta testers - at which time the database will begin to grow. A searchable index will become available sometime between now and then that will access the database directly. Update: We expect the database to come online sometime in Jan 2001.

    Looks like they are a little behind schedule on this one.

    And another telling tidbit from their FAQ:

    Q: Why would I want to run this client? At least with SETI, I'm doing something - like looking for aliens.
    A: We like aliens too, but ours is a noble cause if there ever was one - to have a decent index of the Internet free for any individual to use when they need it. The reasons that you'll want to run it will vary, but we think you'll see the advantages to be gained by running our client - especially if you are a system admin, or author of a web site.

    So I guess my main concerns are that a) they could pull a GraceNote, and b) they plan on selling top result spots to big companies that may have NOTHING to do with what I'm searching for.


  • What if web masters manipulate the spider and return false search results (for luring people into pr0n, spam, propaganda...)?

    This is something web masters could do to existing search engines like Google as well. From a technical point of view, a search engine's crawler, or any other client, does not request static pages from a server; it invokes the methods of objects named by URLs. So to fool search engines, one simply has to make sure the GET method of such an object returns different results to crawlers than to browsers. In addition, the web master would have to hide the fact that invoking the GET method returns dynamically created content, but that's simple.
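    The trick described here - returning one page to crawlers and another to browsers - is usually called cloaking. A minimal sketch in Python (hypothetical names and content, not code from Grub or any real site; the crawler signatures are assumptions):

```python
# A hypothetical illustration of "cloaking": the server inspects the
# User-Agent header of a GET request and serves different content to
# search-engine crawlers than to ordinary browsers.

CRAWLER_SIGNATURES = ("Googlebot", "Grub", "Slurp")  # assumed crawler names

def handle_get(user_agent: str) -> str:
    """Return the page body for a GET request, varying by who is asking."""
    if any(sig in user_agent for sig in CRAWLER_SIGNATURES):
        # Keyword-stuffed page served only to crawlers, so the site ranks
        # for terms that never appear to human visitors.
        return "<html>free mp3s cheap tickets celebrity photos</html>"
    # The page a human visitor actually sees.
    return "<html>Buy our completely unrelated product!</html>"
```

    Since the crawler identifies itself in an ordinary request header, nothing on the wire distinguishes this from a legitimately dynamic page - which is why detecting it from the search engine's side is hard.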

  • I haven't read all of the comments, but it seems like the major problem most people have with Grub is that they are giving resources and then they have to pay to see the database. Why not offer a search button from the client? That would encourage use of the client, and the database could be free to the people who contributed their computers.
  • You couldn't be more wrong. I love paying the few bucks to linux distributors, but did you not read their Investor Page?

    "Large sites who's revenue depends on sustained inbound web traffic will be charged based on the amount of data that they submit into Grub's database, and on what placement they get in Grub's search result sets."

    The freeness of the client code is not the problem here; it's how they're planning to implement the search engine based on the data that they receive. And of course the client code has to be open source to get wind under its wings, but using the good impression the mere words "open source" give to promote their data selling/mangling schemes strikes me as a bit odd. At least it's the first such effort I've seen...

  • This is an excellent project.

    As far as I can remember, this is pretty much the first real business effort that's really trying to ride the "open source" reputation.

    Even their web site front page has :

    "it's an open source, distributed internet crawler!"

    What a load of crap. All I see is dollar signs flashing in their eyes; they're just hoping to score big by selling those search result slots, open source or not.

    A while ago it was multimedia this and interactive that, then it was everything starting with 'e', and now it seems to be pure greed in "open source" / "GPL" disguise. /. shouldn't fall for stuff like this.

    I can't wait for M$ to release their Open Source GPL'd office200x with some copyrighted document formats for which you need to pay $1000 licence fees... I hope at least then /. sees through the fog.

  • by GNU Zealot ( 442308 ) on Sunday May 13, 2001 @06:26AM (#226379) Homepage
    Why don't we create an open, wholesome Grub?

    We could use the existing Grub software, but modify it to report to a community-run free database. Modifying the software to report to different servers would be rather easy, as would reverse engineering and replicating their database and website. However, the sticking point would be funding for the large amount of bandwidth a site like this would need.
  • by 6EQUJ5 ( 446008 ) on Sunday May 13, 2001 @04:53AM (#226380) Homepage
    Preface: Google is by far the best search available for general, random stuff.

    That said, I get frustrated by some of its quirks.

    For one thing, it excludes common small words like "to", "that", and "the". Those words can be important when you're searching for a specific quote, say an old song or a line from a movie that you once heard.

    Google doesn't seem to understand strict, logical use of parentheses, almost like it's really searching for the characters "(...)", or even the word OR, which contradicts my first complaint!

    Lastly, it's still not clear to me whether a search for "naked cheerleader" gives the same result as "naked cheerleaders". Hence, I tend to use OR and AND (+) a lot in my searches, which as I just said doesn't seem to work very well.
  • No windows client? How useful. Shazam.
  • If you decide later on that the "owners" are naughty, despicable people, you can uninstall the software. The main advantage to this particular search engine, as I see it, is to maintain a real-time database. You have control over your contributions because they expire quickly.
  • Open Source is not the same thing as free. You are very right that they have confused open source with free, but their code is released under the GNU license. Since it is using the GPL instead of the LGPL, they cannot insert proprietary parts into the core at all. If they were to try putting functionality into external and proprietary processes, the project could be forked to benefit only the community, and not the proprietary corporation.

    There is no risk of the system becoming proprietary, because it is truly free. There is also nothing morally wrong with making money using free software. It is only bad when money becomes such a priority that the information is made proprietary. The only risk is that they may license the data in a different and non-free way. So long as they use the GPL and collect their data under a Free license, there is no risk.

  • This is an excellent project. I don't know of any other free software projects to index content on the Internet. Far too many companies develop client/server applications and make the mistake of keeping them proprietary, and even if free replacements get written they often only replace the proprietary clients. This is bad because it actually increases the use of the proprietary server!

    This project is entirely free. Thus it is much better. People should go to the project homepage on sourceforge [] and help out. The current goal is only to index content, and a later stage will implement intelligent search functionality. See the project overview here []. I am sure that they would love to have more people who are able to do that helping out.

    Hackers, get involved with this project that can replace one of the most used pieces of proprietary software, the Internet search engine!

  • The same thoughts crossed my mind. If the crawler becomes a success, will they be trying to claim the name grub for their sole use?

"The following is not for the weak of heart or Fundamentalists." -- Dave Barry