Indexing the Entire Web? 98
cah1 writes "BBC is carrying a story about another new search engine All The Web. The designers are planning to have the whole shooting match, all billion pages, indexed by the end of the year. " You can also read the press release from the company. I'm skeptical: they claim to be able to catch up within the first year, and keep up thereafter. But they claim to have 200 million already, so who knows?
Very Strange... (Score:1)
Re:ad infinitum et ad nausium (Score:2)
Re: Distributed Spidering (Score:1)
This is probably obvious, but it's not only computational load that would be more evenly distributed. With some knowledge of the preferred routes of various levels of the net hierarchy, spidering traffic could be confined to smaller areas of the network.
Far-flung links could be handled at higher levels and passed down to other spidering nodes closer to the link target (from a routing perspective). This would mean a little more computation overhead somewhere, but I imagine it wouldn't be too bad. The benefits of distributed spidering seem quite attractive to me...
On the other hand, if it really was that feasible, wouldn't one of the Big Boys take it up, or is it too much hassle to develop a business model for a search engine based on volunteer spiders?
Christopher
Search engine span, accuracy (Score:1)
(a) Spanning more pages is only half the story. You need to combine huge page indexes with a lookup scheme like Google's, where the chaff is separated from the wheat. Otherwise you'll just be drowning in 5 times as many useless hits, and you'll need a search engine to search through the 100,000+ hits returned for your query to find what you're actually interested in.
(b) Does anyone have statistics for what %age of the web is excluded in
Re:yeah, it's fast. but it's pretty weak (Score:1)
this search engine... (Score:1)
But, if I do the same thing on Yahoo/Infoseek/Lycos/Altavista, I either get nothing that pertains to me with my name or another different, obscure page on the website. When I type in the companies' names I may or may not get the company's website at the top.
What is more important to me than how many pages it returns is how many RELEVANT pages it returns. And yes, it is supposed to read my mind to some extent and know what I want.
The Ideal Search Engine? Not quite, but good. (Score:1)
I think he misses the point. IMHO the ideal search engine (1) covers all of the Web (Yes, I *know* it's impossible! This is an *ideal*.) and (2) allows me to construct a proper Boolean search argument.
Boolean is very important to me. It allows me to pare those results down from 1,276,349 to 280. When I pare down the number of results then the top hits are far more likely to be relevant. So far as I know (and correct me please if I'm wrong) the only search engine that allows the proper construction of Boolean arguments (AND, OR, parentheses and NEAR) is Alta Vista. Other engines such as HotBot and Google allow some ability to refine the argument, but not enough for my taste. This new engine still doesn't satisfy that desire either.
However, it does give some tools (phrases, + and - but no parentheses or OR) so having a bigger database is a Good Thing.
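To sketch why proper Boolean paring works so well (a toy example with made-up pages, not how any real engine stores its index): once each page is reduced to the set of words on it, AND and AND NOT are just cheap set operations.

```python
# Toy corpus: each "page" is just the set of words it contains.
docs = {
    "page1": {"etymology", "strawberry", "origin"},
    "page2": {"strawberry", "shortcake", "recipe"},
    "page3": {"etymology", "surname"},
}

def matching(required, forbidden=()):
    """Pages containing every required term and none of the forbidden ones."""
    return {
        name for name, words in docs.items()
        if set(required) <= words and not (set(forbidden) & words)
    }

matching(["etymology", "strawberry"])    # only page1 has both terms
matching(["strawberry"], ["shortcake"])  # page2 is excluded by the NOT
```

The hard part in a real engine isn't the set algebra; it's keeping a billion such sets on disk and intersecting them fast.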
I found it snapped back the results pretty quickly too.
By the way, something that isn't discussed very often, but is pretty relevant in evaluating the effectiveness of search engines, is consistency over time. At the WWW8 conference in Toronto, I heard a paper that made the observation that search engines have a bad tendency to "forget" URLs. That is, the same argument given over time will sometimes not discover a site that an earlier search found. On occasion, a later search will then "rediscover" the page. (Sorry I don't have the reference to hand. I've really got to do some housekeeping...) The moral of this is: bookmark that interestin' site when you find it or you may never see it again.
Re: Distributed Spidering (Score:2)
I've only considered this as a strictly volunteer project, directed by a university and the top level hosts and database hosted there, with some corporate sponsorship thrown in for good measure.
I don't know if this would work if commercialized, since a lot of the folks who have the knowledge, experience and compute power to participate would probably not feel too warm or fuzzy about helping to build the next Yahoo!, especially when the IPO made the company worth millions overnight. It would certainly be tough to maintain the same level of participation after going commercial, unless some hitherto unforeseen way of rewarding participation per contribution were discovered. Perhaps corporate sponsors could offer premiums to contributors based on sites spidered? Maybe something along the lines of frequent flyer miles?
Re:all and then some? (Score:1)
>Ok so repeating an effort to find the various purported etymology of the word "strawberry" I
>searched with +etymology +strawberry +origin on both yahoo (my standard) and alltheweb.
>Yahoo found 60 while alltheweb found 117, but a number of allthewebs' finds were xxx sites!?
>How many xxx sites actually use the word etymology and if this is more do we really want more?
I tried "CalTrans Bridge Design Manual" in Google! [google.com], Inference Find [inference.com], and All The Web [alltheweb.com]. Google gave me many links to CalTrans sites and some associated ones. Inference Find found the CalTrans sites and a bunch of tangentially related sites. All the Web found a bunch of CalTrans sites and related sites, but numbers 19 and 21 were porn sites, and putting CalTrans at the end of the string got me more porn sites.
Not a terribly useful site, IMO.
looking for something that isn't there. (Score:1)
god! I hate search engines.
(but, he did it fast!)
Re:It seems that... (Score:1)
And then there are all the privacy concerns...
--
Re:It's the custom hardware, stupid.. (Score:1)
> You claim to KNOW this?
> I noticed you live in the same city as FAST headquarters..
> But maybe you cant talk about that
Actually, I think Frode should update the curriculum vitae on his home page to include the fact that he's a FAST employee. I claim to know this.
:)
Let's do it! (Score:2)
I just created http://www.egroups.com/group/dizz-net/ [egroups.com] as an email discussion list. You can subscribe by sending email to dizz-net-subscribe@egroups.com [mailto]. There are a lot of interesting issues, many already mentioned here:
-david.
Re:I sure HOPE it doesn't index the entire web (Score:1)
Who says a billion? (Score:1)
BTW: who needs to sort through all that junk after doing a search? Use Metacrawler, and get a pretty good compilation of the best search engines out there.
Re:I sure HOPE it doesn't index the entire web (Score:1)
Re:QUESTION .... (Score:1)
Try view source on this page to see the way I would handle it:
Anti-Linking Script [angelfire.com]
Of course, there are other more sophisticated ways to deal with it, but this can work if the people aren't bound and determined to link to you.
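For the curious, the general technique behind such anti-linking scripts is a referrer check. Here's a minimal hypothetical sketch as a CGI (this is NOT the actual script behind the Angelfire link above, and the Referer header is trivially forged, so it only deters casual deep-linking):

```python
#!/usr/bin/env python3
# Hypothetical referrer-check CGI: serve the page only to visitors who
# arrived via one of our own pages. Hostname is a made-up placeholder.
import os

ALLOWED_HOSTS = ("www.example.com",)  # assumption: your own hostname(s)

def is_allowed(referer: str) -> bool:
    """Allow only requests whose Referer host is one of our own."""
    if not referer:
        return False  # no Referer at all: direct hit or stripped header
    # crude host extraction: "http://host/path" -> "host"
    host = referer.split("/")[2] if referer.count("/") >= 2 else ""
    return host in ALLOWED_HOSTS

if __name__ == "__main__":
    referer = os.environ.get("HTTP_REFERER", "")
    if is_allowed(referer):
        print("Content-Type: text/html\n")
        print("<html>the protected page</html>")
    else:
        print("Status: 403 Forbidden")
        print("Content-Type: text/plain\n")
        print("Please enter through the front page.")
```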
Re:Non-scientific analysis (Score:1)
Out of date (Score:1)
Wow - Looks Great (Score:3)
Relevance before size (Score:1)
Nice touch, they use OpenSource software :-) (Score:1)
Here's what Netcraft [netcraft.com] has to say about it: www.alltheweb.com is running Apache/1.3.6 (Unix) PHP/3.0.11 on FreeBSD [netcraft.com] .
Both Apache [apache.org] and FreeBSD [freebsd.org] are well-proven OpenSource software projects. I imagine this is going to be very stable ;)
Most search engines are distributed! (Score:1)
I even read somewhere some of the engines even use multiple Linux machines with applications written in Perl for indexing.
sulka
Re:Linked to Lycos ? (Score:1)
Eyvind Bernhardsen
Not that great (Score:1)
----------------------
"This moon-cheese will make me very rich! Very rich indeed!
Re:They're running Apache/FreeBSD (Score:1)
Their spider was not very nice (Score:1)
Re:I sure HOPE it doesn't index the entire web (Score:1)
Has anyone else noticed this?
Yes, but it's not playing nice. (Score:1)
Shame on them.
It's fast, anyway. (Score:2)
As for its coverage: it may be "the result of more than a decade of research into optimising search algorithms and architectures", but frankly that sounds dubious.
If it covers 30% of the web it'll be twice as good as existing engines, but I suppose thirdoftheweb.com isn't that catchy.
It seems that... (Score:2)
Re:What is the problem ? (Score:2)
Another problem is the amount of dynamically generated content. There simply ISN'T any way for a search engine to safely index everything on the web, because it can't know which CGIs just serve up a finite selection of pages from a database and which randomly generate content, as long as no decent clues are given.
The amount of dynamically generated content is growing dramatically, so this will be an increasing problem.
They're running Apache/FreeBSD (Score:1)
Seems to be the platform of choice for serious stuff like this.
What is the problem ? (Score:1)
Surely if you just hit port 80 of every machine registered in DNS, and spider recursively from the pages retrieved that way, you'll get a greater number of pages than the 10-20 percent most search engines have?
Or is it the case that the problem is in the indexing of the data, and searching it quickly enough, rather than retrieving it ?
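To illustrate: the recursive part really is that simple. Here's a toy sketch over an in-memory link graph (a real spider would fetch over HTTP and parse HTML, which is exactly where the scale and politeness problems bite; the graph and names below are made up):

```python
from collections import deque

def crawl(start, get_links, limit=1_000_000):
    """Breadth-first spider sketch. `get_links(url)` stands in for
    fetching a page and extracting its hrefs. The retrieval logic is
    trivial; the hard parts are fetching a billion pages and indexing
    them so queries come back in under a second."""
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < limit:
        url = queue.popleft()
        for link in get_links(url):
            if link not in seen:  # dedup, or the spider loops forever
                seen.add(link)
                queue.append(link)
    return seen

# Toy in-memory "web" instead of real HTTP fetches:
web = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}
pages = crawl("a", lambda u: web.get(u, []))
```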
Re:It's fast, anyway. (Score:1)
They have a special fast search chip or something, hardware regexp matching etc.
They are certainly not beginners on the searching scene so they might be able to do it.
(This is really old news, it was on
Zyklone
Maybe all the web, but it's not useful (Score:1)
I'll stick to Google.
--
http://www.wholepop.com/ [wholepop.com]
Whole Pop Magazine Online - Pop Culture
Re:It seems that... (Score:2)
ad infinitum et ad nausium (Score:1)
Non-scientific analysis (Score:2)
I made two searches; one for the word 'Microsoft' and the other for 'Linux'.
Altavista gave : 12,682,370 (M$) and 4,526,430 (LX).
FAST gave : 4,689,227 (M$) and 2,570,827 (LX).
So... if FAST currently is ~40% bigger than Altavista, how come they return numbers that are so much lower? With such large numbers it can't be pure coincidence, In My Humble Opinion.
-Snotboble
Re:Non-scientific analysis (Score:1)
Obviously, this would tend to skew the results somewhat.
I imagine as they get closer to their goal, the search results will become more relevant.
Re:It seems that... (Score:3)
Not to harp on one of my pet ideas or anything, but I think a distributed spidering project could be pulled off. The trick would be to delegate the work based on compute power and bandwidth, with the "low-end" clients doing the grunt work of spidering, then passing the raw data up to the bigger iron with more bandwidth where the relationships between sites could be ferreted out, keywords could be indexed and context established, etc. These sites could then pass the cooked data back to the top level servers (compressed, of course) for whatever final work needs to be done and then insertion into the database. The idea is to have each client do the work it's best suited for, and to distribute the load more evenly. Bandwidth could be a problem, but I think a lot of the data could be "tokenized" somewhat once references have been established, and some compression would probably help.
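The "tokenized" idea might look something like this hypothetical sketch: a spidering client replaces each word with a small integer ID from a dictionary it builds as it goes, then compresses the result before shipping it upstream to the indexing nodes.

```python
# Hypothetical sketch of "tokenize before shipping upstream": repeated
# words collapse to one small integer each, and the integer stream
# compresses better than raw text. Names are made up for illustration.
import zlib

class Tokenizer:
    def __init__(self):
        self.ids = {}  # word -> integer token

    def encode(self, text):
        out = []
        for word in text.lower().split():
            if word not in self.ids:
                self.ids[word] = len(self.ids)  # assign next free ID
            out.append(self.ids[word])
        return out

tok = Tokenizer()
page = "the quick brown fox jumps over the lazy dog the end"
tokens = tok.encode(page)  # repeated words ("the") reuse a single ID
payload = zlib.compress(bytes(str(tokens), "ascii"))  # ship this upstream
```

The upstream node would need the dictionary too, of course; keeping the dictionaries in sync across clients is the part this sketch waves its hands at.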
If I had the networking know-how I would put together a proposal and start taking flame-mail, er, suggestions. Since I don't, I hope someone who does and is as crazy as me will pick up on the idea.
Re:searchenginewatch.com (Score:1)
Eyvind Bernhardsen
It's the custom hardware, stupid.. (Score:2)
Check out this
http://www.fast.no/product/fastpmc.html [www.fast.no]
gaute
-- We plunge for the slipstream the realness to find
Re:They're running Apache/FreeBSD (Score:2)
Linux Max Uptime: 845 days, 08:59m
FreeBSD Max Uptime: 690 days, 23:48m
Then again, there are about 1/10th the number of FreeBSD entrants... overall not a real big sampling group in general.
Plus there's no information about what hardware anyone is using or why the machine was rebooted (kernel upgrades, hardware upgrades or crash).
Overall, it's sorta pointless, other than being a nice figure to say my Oscar Mayer is bigger than yours.
--
Re:It's fast, anyway. (Score:2)
Eyvind Bernhardsen
Re:It seems that... (Score:1)
much faster if the index was carried at the server, with a hierarchy of index servers not doing any spidering at all, if possible.

Sound familiar? It's Harvest's SOIF format:
http://www.tardis.ed.ac.uk/harvest/
http://www.tardis.ed.ac.uk/harvest/docs/old-man

Just my 2c - I'd be happier if much *less* of the web was indexed... just the useful stuff. And if search engines could only recognize a mirror when they see one, then I wouldn't get so many identical replies...

-Baz
dynamic content (Score:1)
Two problems: (Score:1)
2. Slow: if they index everything, you will notice definite slowness. Even if they find some kind of uber-fast way of searching through stuff, their servers will be slowed down by net-troglodytes searching for the "internet" or the letter "a".
Imagine how many pages would pop up if you searched for the word "pictures".
Re:Wow - Looks Great (Score:1)
Re:It's the custom hardware, stupid.. (Score:1)
> You claim to KNOW this?
> I noticed you live in the same city as FAST headquarters..
> But maybe you cant talk about that
No, seriously, there are a couple of pages at the FAST site that imply rather clearly that alltheweb uses the PMC.
Not explicitly though, you're right about that.
I seem to remember a picture of one of those Dell machines full of those cards, but of course I can't find it now...
Anyway, just look at this quote from the PMC FAQ,
and compare it with alltheweb's claim of scaling linearly.
>Since the PMC search through data at a fixed speed (100 MB/s), the
>response time for a query is independent of its complexity. In a
>software solutions the response time increases more than linear with
>increasing query complexity.
Gaute
-- We plunge for the slipstream the realness to find
Re:I sure HOPE it doesn't index the entire web (Score:1)
My search-engine criteria... (Score:1)
I think I'll stick to Google.
Dupe! (sorta) (Score:1)
I sure HOPE it doesn't index the entire web (Score:1)
I, for one, would like to keep some of the webpages I post on the internet un-indexed, because they were meant for a couple of friends, not a couple billion people to rummage through.
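(For what it's worth, the standard way to opt out is a robots.txt file, which only helps against spiders that honor it and does nothing against humans who know the URL. A sketch of how a polite spider checks it, using Python's stdlib parser on an inline example file rather than a real fetch; the paths and crawler name are made up:)

```python
# A page author who doesn't want indexing can say so in robots.txt;
# a well-behaved spider checks it before fetching anything.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /friends-only/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

rp.can_fetch("SomeCrawler", "/friends-only/party.html")  # disallowed
rp.can_fetch("SomeCrawler", "/public/index.html")        # allowed
```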
Re:Maybe all the web, but it's not useful (Score:1)
I hope that made some sense :)
Re:It's the custom hardware, stupid.. (Score:1)
Re:I sure HOPE it doesn't index the entire web (Score:1)
Re:Wow - Looks Great (Score:1)
even though AllTheWeb returns 2.5 times the hits that Google returns, there's a slight difference: Google puts my homepage firmly on spot #1, whereas AllTheWeb (probably by coincidence) has it at number eight between a mass of irrelevant mailing list archive links.

I'll stick with Google - it has this uncanny ability of putting what you want behind the "I'm feeling lucky" button...
Re:It seems that... (Score:1)
How many people would want their browser recording and sending off a list of every site they visited during the day? Even then, I doubt it would be a particularly good way of finding new sites that weren't already in your search engine.
Chris
more URLs are not good (Score:1)
expired and bad URLs. The more URLs an engine adds, the more unusable it becomes.
Re:Non-scientific analysis (Score:2)
Linked to Lycos ? (Score:1)
Re:It seems that... (Score:1)
It has been suggested on the mailing list once, but I don't know what happened with that idea.
The problem is that you would have to store a huge amount of data somewhere, so you would probably need a Big Company(tm) sponsoring or leading the project. The clients would probably duplicate a lot of work, but this is not a major problem.
Re:They're running Apache/FreeBSD (Score:1)
Harvest: Distributed indexing (Re:It seems...) (Score:1)
The basic idea was that pages are indexed locally at the server, and the indexed data are gathered and can be queried at "brokers".
old story (Score:1)
I'm tired of old stories being new.
That story last week about N2H2 and Bess...
Bess is not new, as the subject thingy said; it's been around for several years. I know, I fought it at my friend's house.
Re:It seems that... (Score:1)
My opinion is that no ONE center could organize all the data on the whole net, since it is so widespread and far-flung. My idea (somewhat corresponding to distributed filesystems) was that every client held a piece of the index and had some sort of reliability rating. Low-reliability nodes would have to be backed up by duplicate failover nodes. Anyway, there would be a whole distributed hierarchy of nodes, organized both spatially and, I guess, by reliability. When you asked the master node, or perhaps your regional node, for something, it would forward it on to whichever node IT thought might have the right answers. Each node would do the same, in turn, until the host itself or a terminal node was reached. The info would then be fed back to you. Yes, it would be slower, but you WOULD get the correct answers. Also, if nodes were distributed spatially, then regional/local nodes could more frequently check for page expiration and 404s... one of the major problems is that all these CENTRAL search engines have LOADS of outdated crap. Sure you find a lot... but it's all invalid.
My Seti client could sure share some CPU with a distributed indexing client...somebody set this up already!
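The forwarding scheme described above could be sketched like this (a hypothetical toy that forwards to every child rather than just the likely ones, with made-up node names and URLs):

```python
# Hierarchical lookup sketch: each node holds a slice of the index and
# forwards queries to its children; results flow back up to the asker.
class Node:
    def __init__(self, name, index=None, children=()):
        self.name = name
        self.index = index or {}  # term -> list of URLs this node knows
        self.children = list(children)

    def query(self, term):
        hits = list(self.index.get(term, []))
        for child in self.children:  # forward to nodes that might know more
            hits.extend(child.query(term))
        return hits

# Regional nodes hold local index slices; the master holds none itself.
west = Node("west", {"linux": ["http://west.example/linux.html"]})
east = Node("east", {"linux": ["http://east.example/lug.html"]})
master = Node("master", children=[west, east])
results = master.query("linux")  # gathers hits from both regions
```

A real version would route each query only to promising children (that's the "who IT thought might have the right answers" part), which is what keeps it from degenerating into asking everyone everything.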
Re:What is the problem ? (Score:1)
~Tim
--
Re:It seems that... (Score:1)
Will they pay to get the porn sites? (Score:1)
to pay for porn sites?
Do they have a copy of my son's
Final Fantasy tribute pages?
Questions Questions Questions
~broken~, all hits not shown? (Score:1)
Running the above query says "12 documents found," but it only shows results 1-10, and doesn't have a link to more results.
Now I don't know exactly how many pages actually match these criteria, but it seems as though you should show all the matches that you count, unless you're padding your counts.
The name must mean it's true (Score:1)
Re:Wow - I'm famous! (Score:2)
1. The plan for an old CS group project from college, where my name was referenced!
2. 2 broken links to ZDNet talkbacks of mine.
3. A CNet page with a dorky little media player I wrote and released as freeware.
4. Some random Italian site hosting Win95 software including my dorky media player with full description extracted!!
Wow...my head is swelling...
Hmm...it didn't find my page though...heh
Aaron
Failure to display all results. (Score:1)
(by the way, "Notopia" was the name of a great radio program on KCSB that disappeared several years ago, and Evil Farmer was a great band in the Santa Barbara Calif area which also disappeared several years ago. I miss both of them. Unfortunately, antizeus is still with us.)
Re:The name must mean it's true (Score:1)
QUESTION .... (Score:1)
-
example of a site that doesn't want indexing: I know of a pagan group's site that has info for the group to view quickly without waiting for snail mail
-
Re:It seems that... (Score:1)
The nice thing about this approach would be that you could have multiple front-ends, too, so the search engine "site" itself wouldn't get bogged down--automatic mirrors!
This should be fairly simple to implement--a list of sites visited (with dates) on the one hand, and index diffs (for the content itself) on the other. The only question is: how do we keep it from getting "sold out" and losing quality? (Not that selling out is bad, but someone mentioned Lycos going to hell after getting sold.)
Re:What is the problem ? (Score:1)
Someone else (who is still down at 0, because he posted it anonymously) came up with a much better answer, which is that the hardware and bandwidth required to index 100% of static content is extremely large, and anyway most content is not static. It's this last point, I think, which is most important - by definition, nothing you read daily is static content.
all and then some? (Score:1)
+etymology +strawberry +origin
on both yahoo (my standard) and alltheweb.
Yahoo found 60 while alltheweb found 117, but a number of alltheweb's finds were xxx sites!?
How many xxx sites actually use the word etymology and if this is more do we really want more?
Re:Scan the WHOLE of the web? (Score:1)
Re:It's the custom hardware, stupid.. (Score:1)
- Frode