Building a Bigger Search Engine
skreuzer writes "Wired is running a story about a distributed web crawler called Grub. People who choose to download and run the client will assist in building the Web's largest, most accurate database of URLs. This database will be used to improve existing search engines' results by increasing the frequency at which sites are crawled and indexed. Conceivably, Grub's distributed network could enable state information to be gathered on every document on the Internet, each and every day."
Will Grub take off or be smashed? (Score:5, Insightful)
Also the grub engine crawls everything, including adult content and other questionable content. They have a setting to turn it off, but it does not block it. With the current questioning of international law relating to accessing illegal websites this could have major consequences for the average user.
So for the time being I have stopped using the grub client until some serious questions are answered. It's an interesting concept and if it was being used in more of an academic setting it could be interesting. However I believe that search engines like Google are doing pretty good themselves.
Go calculate [webcalc.net] something
Re:Will Grub take off or be smashed? (Score:2)
is that good or bad?
Re:Will Grub take off or be smashed? (Score:4, Insightful)
Actually, if I had a gun to my head, I'd choose to run Grub, because the client is open-source. I used to run SETI@home, but then the news came out that they'd been sitting on a potential root vulnerability for a long time. That really brought home to me the risks of running someone else's closed-source app on my box.
Re:Will Grub take off or be smashed? (Score:2)
Do you have any references? Please back up your claims.
I like the anecdote, "Gee, this closed source thing turned out to be a huge risk! I'll stay open source, thanks.", but I'd like some proof
Re:Will Grub take off or be smashed? (Score:5, Informative)
here [tudelft.nl], and here [slashdot.org]
Actually I think the hole potentially gave the ability to run arbitrary code, which isn't the same as a root vulnerability.
Re:Will Grub take off or be smashed? (Score:2, Insightful)
God knows that Google, by virtue of being a commercial entity, has absolutely nothing to offer you.
Anti-capitalist fucktard.
Re:Will Grub take off or be smashed? (Score:3, Interesting)
How about improving existing search engines with more accurate databases? Commercial organizations like Google might be involved and that's another matter. There might still be a reward to the public.
Re:Will Grub take off or be smashed? (Score:5, Insightful)
Re:Will Grub take off or be smashed? (Score:5, Interesting)
If you're a criminal, installing the Grub client might be a great idea.
Great idea, but will it pan out? (Score:5, Insightful)
That unfortunately seems like a naively optimistic hope. While the
vast majority of people may be altruistic, it only takes a few
unscrupulous individuals to completely undermine a fair result.
It's interesting that this idea is an extension to Google's model in
many ways. Essentially Google is able to index so much of the
internet by having 50,000+ servers. I don't think that's what makes
Google such a useful search tool; rather, I think it's accuracy and
relevancy. If my search results started getting polluted with bogus
hits, I would stop using it almost immediately.
Unfortunately, by letting people run the client on their machine and
having it send the results back to the server, I think spoofed
results are inevitable. I don't think it will be possible to
safeguard the results either. It will be interesting to see how well
this project survives *when* people start spoofing results. It's
been a problem for SETI@home, and it's something that undermined some
people's faith in the project as a whole. If the spoofed results are
more widespread and have a larger impact, as they would in a system
like this, it may ultimately prove fatal to the project.
One factor that has been absolutely critical to Google's success has
been their ability to remain resistant to spoofing attempts. It's
still a question mark how well grub will perform in that context.
Looksmart (Score:3, Interesting)
Altruistic? (Score:5, Funny)
(Oh, I can't remember. Have I MetaModerated Recently?)
Re:Altruistic? (Score:4, Insightful)
Of course, I am the first one to question this trend. Has anyone else considered the possibility that one day we'll wake up and notice that Google is charging for access to its basic searching services?
I, for one, would probably pay. I have become so dependent on it. What price? That's a good question...
Re:Altruistic? (Score:2)
Re:Altruistic? (Score:5, Funny)
======
The main executable has been renamed to "grubclient" out of respect for the GNU Grub bootloader, who's executable is named "grub". They were out first, so we decided to pick another name. If you have a catchy suggestion for a new name, please let us know.
I nominate "parasite".
Re:Great idea, but will it pan out? (Score:5, Interesting)
Biiig questions to answer (Score:5, Interesting)
I bet one of the big successes in Folding and distributed.net is that many people run the clients on work boxes, knowing that there's little actual overhead incurred to their work. How different that is for a URL sucker.
I wonder what broadband ISPs think of Grub.
Re:Biiig questions to answer (Score:5, Interesting)
If it becomes a problem, I imagine ISPs will declare it a commercial bandwidth usage, and order users to stop or move to a business class plan for more money.
Re:Biiig questions to answer (Score:2)
Haiku :-) (Score:5, Funny)
Sniffing out all the good porn
Not just bootloader
I love being a Slashdot subscriber - it gives me fifteen minutes to figure out a good joke before anyone has a chance to post!
Seriously though, shouldn't they change the name? "GRUB" is already a bootloader. They should change the name
"Agh! You have E-Coli on your computer!"
Re:Haiku :-) (Score:3, Funny)
Re:Haiku :-) (Score:4, Funny)
Hee hee.
Re:Haiku :-) (Score:5, Funny)
How about Firebird? I'm sure that won't cause any problems :-)
Re:Haiku :-) (Score:5, Funny)
It's ok though because they'll all still be different projects, so nobody will get confused.
Business Plan? (Score:2, Insightful)
Should we expect to see many commercial efforts focussed on providing similar "crawl" or "index" capabilities, but each honed to a specific niche market? A scientific crawler? A retail links database?
One could argue that similar efforts targeting music resources have resorted to less automated techniques, i.e. human-driven sharing.
Thoughts?
Hrmm, I wonder how long... (Score:4, Insightful)
It still sounds like a really cool idea though.
Re:Hrmm, I wonder how long... (Score:3, Insightful)
Clients artificially increasing their ranking isn't an issue, since the client has nothing to do with a site's ranking.
grub is already taken (Score:2, Insightful)
Hmm, search engine, eh? Why don't you call it grab?
Robert
Re:grub is already taken (Score:2)
I don't see Phoenix being used for BIOS and a browser as a problem, I don't see Firebird being used for a database and a browser as a problem, and I don't see grub the bootloader and grub the web spider conflicting. They're entirely different products, and there are only so many words out there. Here [google.com] is one of a million examples of a name that is taken by tons of different companies.
If previous results are any guide (Score:5, Funny)
1. Tech-savvy people will install this.
2. Tech-savvy people tend to be loners.
3. Loners most often search for porn.
C1. Tech-savvy people search for porn.
4. Items searched for most often reach the top of the list.
5. Porn is searched for often by tech-savvy people.
C2. Porn will be easier to find with this new search engine.
Count me in!
Re:If previous results are any guide (Score:2)
1. Tech-savvy people will install this.
2. Tech-savvy people tend to be loners.
3. Loners most often search for porn.
C1. Tech-savvy people search for porn.
4. Items searched for most often reach the top of the list.
5. Porn is searched for often by tech-savvy people.
C2. Porn will be easier to find with this new search engine.
6. pr0nit !?!
Re:If previous results are any guide (Score:5, Funny)
great news! API? (Score:2, Interesting)
This is going to challenge Google's search, which will entice them to cut loose some of those really cool google labs concepts. Froogle, Google News, and all of the other cool things that they are working on are great services and are going to be the focus of innovation over at Google.
Also, Looksmart needs to develop and release an API for this system. You can only use the Google API for 2,000 searches per day. If they allowed unlimited usage, it would get a lot of developer backing.
Unlimited Use? Try Wishful Thinking. (Score:3, Insightful)
The point is that I wouldn't look anytime soon for LookSmart to allow unlimited usage of this API. It's too large of a project for them to just let people use i
Re:Unlimited Use? Try Wishful Thinking. (Score:2)
This I dispute sir. Targeted keywords on google, where my clickthrough ratio has averaged 1.3-1.5%, are a goldmine for my site and money very well-spent (averaging $500 a month on those ads, paying .05 in 97% of all cases.)
I've been a google advertiser since Feb. 02, consider their program extremely lucrative, and I guess they like me 'cause I got a pictur
Re:Unlimited Use? Try Wishful Thinking. (Score:2)
Typically, where I advertise, there are eight or nine other people trying for the same keyword. I've got the green-shifted look despite paying the minimum because I'm allowed to include "free" in my description, but there's usually five people above me, meaning they're paying at least six cents; often as much as
Not news for us webmasters (Score:2, Insightful)
Re:Not news for us webmasters (Score:5, Interesting)
Re:Not news for us webmasters (Score:2)
No; because someone at Wired News wrote about it.
Re:Not news for us webmasters (Score:3, Insightful)
I never heard tell of Grub.org before.
I found it interesting....
not every link on slashdot is going to directly relate to you....
Comment removed (Score:3, Funny)
Firewalls? (Score:5, Insightful)
Re:Firewalls? (Score:4, Informative)
Actually, the robots.txt issue is one they're still working on. Right now it doesn't check the file very often, which upsets some webmasters.
They're open to suggestions, so maybe you could suggest a list of blacklisted IPs/hostnames. I suggested they look into supporting gzip-compressed web pages, and they said they'd look into it.
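For anyone wondering what "checking robots.txt more often" could look like on the crawler side, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Grub's code: the user-agent string, the one-day refresh interval, and the helper names `load_rules`, `is_stale` and `may_crawl` are all assumptions made up for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical values; neither comes from the Grub project.
USER_AGENT = "example-crawler"
MAX_AGE_SECONDS = 24 * 60 * 60  # re-read robots.txt at least once a day


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    parser.modified()  # record the time these rules were loaded
    return parser


def is_stale(parser: RobotFileParser, now=None) -> bool:
    """True when the cached rules are older than MAX_AGE_SECONDS."""
    now = time.time() if now is None else now
    return now - parser.mtime() > MAX_AGE_SECONDS


def may_crawl(parser: RobotFileParser, url: str) -> bool:
    """Ask the cached rules whether our user agent may fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)
```

A crawler built this way would call `is_stale` before each batch of requests to a site and download robots.txt again when it returns true, which is roughly what the upset webmasters are asking for.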
Re:Firewalls? (Score:3, Informative)
To forbid the client from working, network admins could block port 3136 (I think it is), which would prohibit communication with the central server.
My understanding is that grub does not just crawl away randomly, rather it's given a list of things to crawl by the central
Re:Firewalls? (Score:2, Interesting)
Google Toolbar (Score:5, Interesting)
Re:Google Toolbar (Score:5, Interesting)
Re:Google Toolbar (Score:2)
Does being a kick-ass tool (for those unfortunate enough to be using Internet Explorer) count as incentive?
Re:Google Toolbar (Score:2)
Hardly distributed crawling (Score:2, Interesting)
They use the screensaver grub clients to check if a web page has been modified since the last time it was crawled (by the centralized crawl done by Looksmart). They probably use some smart MD5 checksum of the pages and send that with the urls to be crawled to the clients. If the checksum of what the grub client crawled doesn't match then the centralized crawl is instructed to re-fetch that url.
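The checksum comparison guessed at above is easy to sketch. This is only an illustration of the idea, not Looksmart's implementation: the function names are hypothetical, and MD5 is used simply because the paragraph above suggests it.

```python
import hashlib


def page_checksum(body: bytes) -> str:
    """Hex MD5 digest of the raw page bytes."""
    return hashlib.md5(body).hexdigest()


def needs_refetch(stored_checksum: str, fetched_body: bytes) -> bool:
    """True when the page a client just fetched differs from the stored copy."""
    return page_checksum(fetched_body) != stored_checksum
```

Under this scheme the client only has to report "changed" or "unchanged" for each URL, and the central crawl re-fetches the changed ones. One obvious weakness is that any byte-level difference (a rotating ad, a timestamp in the footer) counts as a change.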
They go this route because the If-Modified-Since HTTP 1.1 request
Re:Hardly distributed crawling (Score:3, Insightful)
Grub hits us quite often. I've seen the same URL hit multiple times in one day by different hosts. It's ignoring the "revisit-after" meta tag (7 days), but then, so are most of the
The Distributed Search Engine (Score:2, Interesting)
Google's technology is superior... (Score:4, Funny)
Re:Google's technology is superior... (Score:2)
Re:Google's technology is superior... (Score:2)
I think it's pretty lame myself.
But whenever someone mentions or links to pigeon rank around here it gets +4/5 funny every time.
My Take on Grub (Score:2, Informative)
What about the RIAA? (Score:4, Insightful)
What's the difference between my machine indexing them and the university students recently being hauled into court for indexing open shares? Why would I not be held liable for contributory copyright infringement?
No thanks.
Re:What about the RIAA? (Score:2, Interesting)
The Church of Scientology [clambake.org] has already threatened Google and gotten results moved; I can, in all honesty, see the RIAA going for it.
It would be an earthshattering case, but here's the thing: the RIAA stands a disturbingly good chance of winning.
I hope, I pray they don't win if they try it, and try they most certainly will, because they think they can get money out of the lawsuit and they want money. That's very likely a major motive.
Oh, and to mods-for-a-d
They realize they aren't the REAL GRUB (Score:5, Informative)
Notice
======
The main executable has been renamed to "grubclient" out of respect for the GNU Grub bootloader, who's executable is named "grub". They were out first, so we decided to pick another name. If you have a catchy suggestion for a new name, please let us know.
Re:They realize they aren't the REAL GRUB (Score:2)
Honestly, it's going to be hard to come up with any name that someone, in some way, thinks they already have claims to..
But, to keep this completely on topic, it seems the grubclient has problems.. It works fine on a Slackware 8.1 workstation, but bombs out with a segfault after a few minutes on a Slackwar
A better use for my screensaver time (Score:5, Insightful)
Altruism has its place, but since I'm more likely to die of cancer than of not having the complete www indexed I think I'll be selfish and work towards a cure for something that may affect me.
You can run both (Score:4, Informative)
Re:You can run both (Score:5, Funny)
Unfortunately, so is my ISP. In fact, they've already sold it to other customers.
Re:A better use for my screensaver time (Score:2)
curious. (Score:2)
Oh, just great. (Score:2)
Joy of freaking joys.
Flood Control (Score:2, Interesting)
The sheer volume of this project concerns me, however. The very fact that it got Slashdotted may cause it to be a bit heavier than expected!
It sounds like a good use of spare bandwidth, but if it's going to wind up a superscanner, it's going to send a hell of a lot of requests.
I tried it and deleted it as quickly: it's not very good at bein
Re:Oh, just great. (Score:2)
Indexer or Search Engine? (Score:5, Interesting)
I expected some way to search... this looks more like a project to index the web rather than make the results available for public use via web interface. Did it strike anyone else odd that there was no web form on the home page with which to search?!
It seems like a good concept, but the availability of the information collected needs to be accessible without installing the client. I'm not game to install distributed computing apps without some freely available benefit. The "for the good of the world" motivation went out the window for me about a day after my first Seti At Home experience. (But now BitTorrent [bitconjurer.org], there was appreciable benefit. I had RedHat 9 isos within 8 hours of their initial release!)
blah (Score:2)
It's all about eyeballs.
baloney.
Show me the profits.
actually.. (Score:2)
Web searching will only get harder... (Score:3, Insightful)
I've already discovered this with comic books turned into movies. Finding synopses of the comic book [24.191.192.76] X-Men is nigh impossible. Finding synopses of the movies [xmen-the-movie.com] [xmen2-themovie.com] is much, much easier. Damn near every site online about X-Men, Spiderman, The Hulk, Batman, etc. deals with the movies, and sifting through the cruft is not easy. And that's just comic books. Other topics can be just as hard to find, and this doesn't even touch upon fake search results that only turn up porn or, worse, a blank page (happens frequently).
Searching for MORE stuff isn't going to help. Searching better is the key. Google goes a long way towards this, but even it has the same problems of finding too much crud.
Re:Web searching will only get harder... (Score:2)
And in related news . . . (Score:2, Redundant)
Or not. What a difference maturity makes.
Good Idea, Bad Implementation (Score:3, Insightful)
Re:Good Idea, Bad Implementation (Score:2, Insightful)
Alternate idea (Score:2)
Something that acts like, e.g., the squid cache and is also some kind of client of that kind of network would be more useful, at least for common users (the ones that don't yet have a proxy cache would gain a lot in internet navigation, and it would not use extra bandwidth, just what they already downloaded), and for the "search" engine it would give another approach to ranked results, giving more results for the sites that are more accessed,
What _is_ a good project? (Score:4, Interesting)
So what worthy causes are out there?
Re:What _is_ a good project? (Score:2)
Re:What _is_ a good project? (Score:3, Interesting)
How about helping with some cool math prime search?
ars Team Prime Rib [jobnegotiator.com] - cool prime searching stuff.
A mix of misc science stuff.
dc projects - some Opensource, some not. [rynok.org]
And all projects at distributed.net [distributed.net] come with source too.
DDoS (Score:4, Interesting)
If this thing gets too popular without proper throttling, they could cause real havoc.
The open faucet, not the blown dam (Score:2, Informative)
The database of "check-me"s is randomized rather evenly. Even if this takes off, I don't see how it could really do serious damage to any but the truly dinky servers: the hits will not come in all at once and flood the whole connection. While it very well could end up a constant stream, it's unlikely to be the massive stream that make
Legalities? (Score:4, Interesting)
1) How different is this from the Princeton kiddies' system? I don't know about you, but I don't want a 95 billion dollar bill arriving in the mail...
2) What if your local cache(?) contains a few links to kiddie porn? Not your fault, right? The software does its own thing, which you cannot control, BUT what will the FBI think? The FBI, Scotland Yard, and the RCMP are currently heavily investigating kiddie porn cases (good work IMHO), but what if you're the unlucky sap who gets stuck with a few sketchy URLs? Or worse yet, what if this GRUB keeps a cache of the website like Google does? Then what?
3) What about material that is legal locally, but illegal somewhere else... e.g. Nazi stuff in Germany, Falun Gong in China, etc... The last thing I want is to be refused a travel visa cuz my PC has an illegal cache...
Good idea in principle, but with sketchy content on the web, I don't think I will be the one keeping track of it all. If there is a way to filter out the questionable stuff then maybe, but since the purpose is to be as inclusive as possible, it seems incompatible.
_CMK
Re:Legalities? (Score:2, Interesting)
Not to mention the fact that it still goes and hits all those sites, and with the government trying to smash that little thing we call "privacy," anything questionable will likely go on your permanent record- the one that doesn't exist, but they somehow have anyway.
The approach is inherently flawed (Score:4, Interesting)
So the more popular it gets, the more incentive people will have to promote their sites by feeding it fake index information. If this magically got to be very popular, within weeks search results would become meaningless and it would drop back into obscurity. The more likely result would be that it will never become popular in the first place.
Besides, who wants to donate his CPU and bandwidth resources for a commercial company, anyway?
Just terrific. A massively powerful DDOS tool. (Score:2)
Normally, most search engines' spidering methods are designed to be pretty nice to servers, such as only requesting pages once every 30 seconds or so.
However, I've seen times when the methods of some of the search engine spiders were foiled by such simple things as having a large number of virtual hosts on a machine. Combine that with a number of front-end machines all connected to the same database server, and things can get really nasty.
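To make the virtual-host problem concrete, here is a minimal sketch of a politeness scheduler that spaces out requests per server key. The class name, the 30-second default, and the choice of key are assumptions for illustration, not any real spider's code. The point is that if the key is the resolved IP address rather than the hostname, a hundred virtual hosts on one box share a single request budget instead of each getting their own.

```python
class PolitenessScheduler:
    """Spaces requests to the same server at least min_delay seconds apart."""

    def __init__(self, min_delay: float = 30.0):
        self.min_delay = min_delay
        self._next_allowed = {}  # server key -> earliest time of the next request

    def wait_time(self, server_key: str, now: float) -> float:
        """Reserve the next slot for server_key and return how long to wait for it."""
        earliest = self._next_allowed.get(server_key, now)
        start = max(now, earliest)
        # The following request to this server may not start before this time.
        self._next_allowed[server_key] = start + self.min_delay
        return start - now
```

For example, calling `wait_time("192.0.2.10", now=0.0)` twice in a row returns 0.0 and then 30.0, no matter which hostname each request was meant for. This does nothing for the other case mentioned above, where several front-end machines with different addresses share one database server.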
In one particularly bad incident, several fairly big-nam
They have cracked it (Score:2, Funny)
2. Let everyone else fill it
3. Profit
The second step is finally found!!! YAY
Grub does NOT look for robots.txt (Score:2)
Re:Grub does NOT look for robots.txt (Score:3, Informative)
64.241.242.18 - - [18/Mar/2003:17:25:30 -0700] "GET
64.241.242.18 - - [19/Mar/2003:19:41:05 -0700] "GET
64.241.243.81 - - [30/Mar/2003:22:10:41 -0700] "GET
CPU cycles are NOT wasted or "available" (Score:2, Insightful)
However, that is not true at all! CPU cycles are not wasted. When the CPU has nothing to do, it sleeps. At least in a modern operating system (i.e. about everything after Windows 95).
By "donating your wasted CPU cycles" you will actually increase the power consumption of your computer. This will be very noticeable in a laptop, but when you watch
My first Grub hit coming over to my site (Score:2)
Re:search.msn.com is the future (Score:5, Interesting)
You have to be kidding or working for Microsoft, or both! Have you ever searched for Linux on MSN? Try it - here [msn.com].
Notice the third result? "Learn about the Microsoft alternatives and how to move to them from open source products." I shit you not! I don't think Google would ever use this kind of dirty, underhanded trick. Great "hand-picking", mate.
Re:search.msn.com is the future (Score:3, Funny)
Results 1-15 of about 609 containing "linux"
I seem to remember there being more than 609 websites with Linux information on them...
Re:search.msn.com is the future (Score:2, Interesting)
Re:search.msn.com is the future (Score:2, Insightful)
Read the fine print (Score:3, Insightful)
Nothing that other search sites don't do. They just mark their paid adverts a little more obviously.
Re:search.msn.com is the future (Score:2)
Yes, Google's algo only asked Microsoft to go to hell [computerworld.com], of course, taking it down after the story was reported far and wide.
Re:Search engine software and lack of A . I . (Score:3, Insightful)
But it still kind of irks me that people think that a computerized 'dumb' search result could compete with a human rating system that filters spam, porn, and other garbage results. Google should hire some REAL PEOPLE who can do some sort of categorized intelligent directory so we can have QUALITY at the beginning of a search result. Some sort of HUMAN RATING system is needed to sort. The software is not up to par.
Re:How about picking the types of content to crawl (Score:2)
Re:Whatever (Score:2)