The Internet

Indexing the Entire Web?

cah1 writes "BBC is carrying a story about another new search engine, All The Web. The designers are planning to have the whole shooting match, all billion pages, indexed by the end of the year." You can also read the press release from the company. I'm skeptical: they claim to be able to catch up within the first year, and keep up thereafter. But they claim to have 200 million already, so who knows?
This discussion has been archived. No new comments can be posted.

  • That was weird, I clicked reply on the H-1B article. I then waited half an hour, entered my message and submitted it, and it was attached to an article that didn't even exist when I first hit the reply button. Strange...

  • Assuming it will be a rather large amount of data, who will index their index? (and who will index that index... and that one... and that one......)
    It seems to me that this is not a new problem. Juvenal said, `Sed quis custodiet ipsos custodes?' :-)
  • The idea is to have each client do the work it's best suited for, and to distribute the load more evenly. Bandwidth could be a problem, but I think a lot of the data could be "tokenized" somewhat once references have been established, and some compression would probably help.

    This is probably obvious, but it's not only computational load that would be more evenly distributed. With some knowledge of the preferred routes of various levels of the net hierarchy, the traffic of the spidering could be more contained to small areas of the network.

    Far flung links could be handled at higher levels and passed down to other spidering nodes closer to the link target (from a routing perspective). This would mean a little more computation overhead somewhere but I imagine it wouldn't be too bad. The benefits of distributed spidering seem to me quite attractive...

    On the other hand, if it really was that feasible, wouldn't one of the Big Boys take it up, or is it too much hassle to develop a business model for a search engine based on volunteer spiders?


  • Two points:
    (a) Spanning more pages is only half the story. You need to combine huge page indexes with a lookup scheme like google's where the chaff is separated from the wheat. Otherwise you'll just be drowning in 5 times as many useless hits, and you'll need a search engine to search through the 100,000+ hits returned for your query to find what you're actually interested in.
    (b) Does anyone have statistics for what %age of the web is excluded in /robots.txt?
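On point (b), the exclusions themselves are at least easy to check mechanically. Here's a minimal sketch using Python's standard-library robots.txt parser; the file contents and crawler name are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content -- the kind of file a polite crawler
# must consult before indexing a site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A well-behaved spider asks before every fetch:
print(rp.can_fetch("FAST-WebCrawler", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("FAST-WebCrawler", "http://example.com/index.html"))         # True
```

Measuring what percentage of the *whole* web this excludes would still require fetching every site's robots.txt, which is itself a crawl.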
  • If you used it a couple of weeks back, you used the old index. A demo-version running an index of about 70 million pages (I believe) have been running for some months. The announcement yesterday is about the new index that claims to be the world's largest.
  • It's ok. I would say "mediocre". The only reason that it doesn't get a "bad" score is when I type my name in, it brings up right at the very top some pages from my website. I don't know what kind of search it uses, but if I type in "Sun Microsystems" or "Dana Corporation" .. I kind of expect to get the company's web site right at the top. But mostly what I see are news articles with the companys' names within them. Also, when I type my name in, I get the really obscure pages on my website.

    But, if I do the same thing on Yahoo/Infoseek/Lycos/Altavista, I either get nothing that pertains to me with my name or another different, obscure page on the website. When I type in the companys' names I may or may not get the company's website at the top.

    What is more important to me than how many pages it returns is how many RELEVANT pages it returns. And yes, it is supposed to read my mind to some extent and know what I want. :)
  • The BBC story was pretty good, but at another story on this new search engine ( tml), one commentator said, "What does it mean to have another 100,000 or 200,000 links show up in a search? . . . The only thing that matters is the top 10 links you get back . . ."

    I think he misses the point. IMHO the ideal search engine (1) covers all of the Web (Yes, I *know* it's impossible! This is an *ideal*.) and (2) allows me to construct a proper Boolean search argument.

    Boolean is very important to me. It allows me to pare those results down from 1,276,349 to 280. When I pare down the number of results then the top hits are far more likely to be relevant. So far as I know (and correct me please if I'm wrong) the only search engine that allows the proper construction of Boolean arguments (AND, OR, parentheses and NEAR) is Alta Vista. Other engines such as HotBot and Google allow some ability to refine the argument, but not enough for my taste. This new engine still doesn't satisfy that desire either.

    However, it does give some tools (phrases, + and - but no parentheses or OR) so having a bigger database is a Good Thing.

    I found it snapped back the results pretty quickly too.

    By the way, something that isn't discussed very often, but is pretty relevant in evaluating the effectiveness of search engines is latency. At the WWW8 conference in Toronto, I heard a paper that made the observation that search engines have a bad tendency to "forget" URLs. I.e. the same argument given over time will some times not discover a site that an earlier search found. On occasion, a later search will then "rediscover" the page. (Sorry I don't have the reference to hand. I've really got to do some housekeeping. . . ) The moral of this is: bookmark that interestin' site when you find it or you may never see it again.
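The Boolean paring described above maps directly onto set operations over an inverted index, which is why it's cheap for an engine to support once the index exists. A toy sketch (term sets and document ids are invented; NEAR is omitted because it needs positional data, not just document sets):

```python
# Toy inverted index: term -> set of document ids (all made up).
index = {
    "etymology":  {1, 2, 5},
    "strawberry": {2, 3, 5, 7},
    "origin":     {2, 5, 6},
    "golf":       {3, 9},
}

def docs(term):
    """Documents containing a term; empty set if the term is unknown."""
    return index.get(term, set())

# AND, OR and NOT are just set intersection, union and difference:
result = docs("etymology") & docs("strawberry") & docs("origin")  # +a +b +c
print(sorted(result))                              # the 1,276,349 -> 280 effect
print(sorted(docs("strawberry") - docs("golf")))   # +strawberry -golf
```

Parenthesized combinations fall out for free, since set expressions nest.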
  • I've only considered this as a strictly volunteer project, directed by a university and the top level hosts and database hosted there, with some corporate sponsorship thrown in for good measure.

    I don't know if this would work if commercialized, since a lot of the folks who have the knowledge, experience and compute power to participate would probably not feel too warm or fuzzy about helping to build the next Yahoo!, especially when the IPO made the company worth millions overnight. It would certainly be tough to maintain the same level of participation after going commercial, unless some hitherto unforeseen way of rewarding participation per contribution were discovered. Perhaps corporate sponsors could offer premiums to contributors based on sites spidered? Maybe something along the lines of frequent flyer miles?

  • >Ok so repeating an effort to find the various purported etymology of the word "strawberry" I
    >searched with +etymology +strawberry +origin on both yahoo (my standard) and alltheweb.
    >Yahoo found 60 while alltheweb found 117, but a number of allthewebs' finds were xxx sites!?
    >How many xxx sites actually use the word etymology and if this is more do we really want more?

    I tried "CalTrans Bridge Design Manual" in Google! [], Inference Find [], and All The Web []. Google gave me many links to CalTrans sites and some associated ones. Inference Find found the CalTrans sites and a bunch of tangentially related sites. All the Web found a bunch of CalTrans sites and related sites, but numbers 19 and 21 were porn sites, and putting CalTrans at the end of the string got me more porn sites.

    Not a terribly useful site, IMO.

  • I've just used the search engine,... got two results, of which one was non-existent.

    god! I hate search engines.

    (but, he did it fast!)
  • This would find too many false links. Here's one reason why: often when I edit and view an existing page, I edit a temporary copy instead and replace the real page only when I'm satisfied with the changes. Clearly, you don't want the temporary version to be indexed...

    And then there are all the privacy concerns...


  • > "Actually"?
    > You claim to KNOW this?
    > I noticed you live in the same city as FAST headquarters..
    > But maybe you cant talk about that ;-)

    Actually, I think Frode should update the curriculum vitae on his home page, to include the fact that he's a FAST employee. I claim to know this.

  • Hey, I've been thinking about this very same problem for quite some time and some fellow nerds and I have been thinking about how to do it. How about we start a mailing list to further discuss this as an open source initiative?

    I just created [] as an email discussion list. You can subscribe by sending email to [mailto]. There are a lot of interesting issues, many already mentioned here:

    • quality is usually more important than quantity
    • a distributed app has the potential to be much more "fresh" than other search services
    • a network protocol needs to be designed carefully -- you don't want to be sending all the web haphazardly around the web every day. clients might be assigned to monitor nearby sites. there are some cool opportunities to use this system just to map the internet.
    • searching is a different beast from crawling. parallel searching -- like FAST and others -- requires major resources which an open source project couldn't manage.
    • full text vs topic searching: does a distributed system with clients fetch documents index every word or summarize? Topic searching is probably more appropriate for distributed searching, but full text is often more desirable.
    • interesting security issues come up, like how to keep clients from polluting the database.
    • etc...
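One way to make the "clients assigned to monitor sites" idea concrete is deterministic partitioning: hash each hostname to a client so the same volunteer always re-crawls the same sites, with no central coordinator. This is a simplification (a real scheme would also weigh network locality, as mentioned above); hostnames here are hypothetical:

```python
import hashlib

def assign_client(hostname, n_clients):
    """Deterministically map a hostname to one of n volunteer clients.
    The same client always re-crawls the same sites, which helps keep
    the distributed index fresh without central bookkeeping."""
    h = int(hashlib.sha1(hostname.encode()).hexdigest(), 16)
    return h % n_clients

hosts = ["", "", ""]
for host in hosts:
    print(host, "-> client", assign_client(host, 8))
```

A drawback worth noting: changing n_clients reshuffles almost every assignment, which is why schemes like consistent hashing were later invented.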


  • Well, they say they honor ROBOTS.TXT. However, your post suggests differently. You ought to e-mail them and find out. Their robots policy is stated here: aq/faqfastwebsearch/faqfastwebcrawler.html []
  • Who says that there are only a billion pages out there? Maybe 2 billion exist. So how do they know when they're done?

    BTW: Who needs to sort through all that junk after doing a search. Use metacrawler, and get a pretty good compilation of the best search engines out there.
  • Yep, I got an email from FAST telling me that they had a bug in their ROBOTS.TXT handling when they indexed my site, and they've now fixed it.
  • Hi,

    Try view source on this page to see the way I would handle it:

    Anti-Linking Script []

    Of course, there are other more sophisticated ways to deal with it, but this can work if the people aren't bound and determined to link to you.

  • Uh, Altavista doesn't ignore home pages. I hit them all the time when they happen to have a word I'm searching for.
  • Whenever I hear about a new search engine, like many other people on /., I have a set of queries that I tend to try. My results were that it seemed somewhere in the ballpark with Google and AltaVista as far as numbers. One of the things that they pride themselves on is the ability to check more often and catch broken links faster... If this is the case, why am I seeing more broken links off of their engine than any of the other major ones? They are getting the same pages, but the other engines wiped these pages out quite a while ago because they were no longer current.
  • by Aaron M. Renn ( 539 ) <> on Tuesday August 03, 1999 @05:18AM (#1768467) Homepage
    I judge search engines by the most important criteria of all - how many references to me they have. Alltheweb now has vastly more than runner up Google, making them the biggest ever. I type in "Aaron M. Renn" and I got 1604 on AllTheWeb, ~500 on Google and only ~180 on AltaVista. Even if that number drops as I searched through the pages, it's still impressive. I did look through the plain "Aaron Renn" listings too, where they also crushed the competition (though it's a much smaller number of pages since I virtually always use my middle initial). Believe it or not, there is a page out there with another "Aaron Renn" on it. Pretty weird.
  • It was good to see that the BBC pointed out that relevance of the search results is probably more important than the number of pages in the database -- and that Google seems somehow to have the most relevant results time after time. Alltheweb didn't do very well on my standard test word, so I'm sticking with Google.
  • Here's what Netcraft [] has to say about it: is running Apache/1.3.6 (Unix) PHP/3.0.11 on FreeBSD [] .

    Both Apache [] and FreeBSD [] are well-proven OpenSource software projects. I imagine this is going to be very stable ;)

  • Alltheweb is distributed (see ), Hotbot is distributed and I guess most of the others are distributed too.

    I even read somewhere some of the engines even use multiple Linux machines with applications written in Perl for indexing.

  • All the Web is not Lycos and has nothing to do with Lycos.

    Eyvind Bernhardsen
  • I've been using alltheweb for a couple weeks (yeah, ever since it first showed up on Slashdot). My take on it is that it is fast, as in blazing zippy fast, but mostly useless. It tends to return a lot of pages that are all the same. Dig past the first 10 results or so, and you find that after that it's just one page from the same FAQ, only they list every mirror separately. I haven't found it to be better than any other search engine, and worse than most. Stick with a combination of Google, Altavista, and Yahoo. And, for howto questions :-)
    "This moon-cheese will make me very rich! Very rich indeed!
  • Exactly. Linux users are much more likely to recompile/rebuild/tweak the kernel, whereas *BSD users just run it. They wait for the kernel writers to distribute and don't bother rebooting. I have a Linux box up 236 days and it won't get rebooted.
  • Their crawler (FAST-WebCrawler/0.3) was not very nice when it blasted through my site. The general guideline is that a crawler should grab one page a minute. Their crawler grabbed multiple pages a second even when going through a bunch of pages generated by CGI.
  • Unfortunately, AllTheWeb does seem to ignore ROBOTS.TXT. It has indexed every page on my site ( []), including all those that have been disallowed by ROBOTS.TXT, where no other search engines have. I don't know if that's because it has bugs in its ROBOTS.TXT analysis or because it just ignores it. Either way, it's not good.

    Has anyone else noticed this?

  • This particular search engine isn't honoring the robots.txt file (at least, not on my site). I checked to see if it knew about my pages, and it had indexed deep into my site DESPITE the "disallow" directives in my robots.txt file.

    Shame on them.
  • I've been using it for a few days now, and it seems impressive. It's certainly fast. Google is still my engine of choice (even though it's visited my page a ton of times, and still won't find it when I search for it).
    As for its coverage: it may be "the result of more than a decade of research into optimising search algorithms and architectures", but frankly this sounds dubious.
    If it covers 30% of the web it'll be twice as good as existing engines, but I suppose that isn't as catchy.
  • This would be a great application for a distributed computing application, lots of computers indexing the web, and after they finish that, they can revisit sites for broken, moved and changed content sites... First post?
  • by Anonymous Coward
    It's a problem of cost, bandwidth, and enough hardware. All of which can be solved relatively easily. The software to do the indexing is hardly difficult to write - I have one I've written myself, and have indexed a few million pages with it. The reason I don't put up a search engine tomorrow is that I certainly couldn't afford the hardware, and the fact that it's a lot of work to retrieve data from the index in a way that gives good results.

    But another problem is the amount of dynamically generated content. There simply ISN'T any way for a search engine to safely index everything on the web, because it can't know which CGIs just serve up a finite selection of pages from a database, and which randomly generate content, as long as no decent clues are given.

    The amount of dynamically generated content is growing dramatically, so this will be an increasing problem.

  • According to

    Seems to be the platform of choice for serious stuff like this.
  • I may be being a bit slow here, but what is the problem which prevents coverage of the entire web by search engines?

    Surely if you just hit port 80 of every machine registered in DNS, and search recursively from the pages retrieved by that, you'll get a greater number of pages than the 10-20 percent most search engines have ?

    Or is it the case that the problem is in the indexing of the data, and searching it quickly enough, rather than retrieving it ?
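The proposed strategy is a breadth-first crawl seeded from every known host. A sketch over an in-memory stand-in for the web (hostnames invented), which also illustrates the catch raised in the replies: link-following only reaches pages that are linked from somewhere.

```python
from collections import deque

# Stand-in for the live web: each "page" lists the pages it links to.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": [],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": ["c.example/deep"],
    "c.example/deep": [],
}

def crawl(seeds):
    """Breadth-first crawl from a seed set: fetch a page, queue every
    link on it, never revisit. Pages nothing links to are never found."""
    seen, queue = set(), deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        queue.extend(LINKS.get(url, []))
    return seen

print(sorted(crawl(["a.example/"])))
```

Seeding from DNS only hands you the front pages; everything unlinked, off port 80, or dynamically generated still falls through.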
  • Well, it's from the same people who used to run and that one was VERY fast until lycos bought it and messed it up.

    They have a special fast search chip or something, hardware regexp matching etc.
    They are certainly not beginners on the searching scene so they might be able to do it.

    (This is really old news, it was on /. a couple of months ago.)

  • I try searching for "HotMedia". In Google, the first(!) result is the HotMedia homepage at IBM. Here, I don't see this page in the first results page.

    I'll stick to Google.
    -- []
    Whole Pop Magazine Online - Pop Culture
  • On that topic: surely this could actually be done by the web browser rather than a distributed client. If you have a page online you're bound to check it yourself to make sure it's OK. With an appropriate browser or plugin your page could then be indexed and submitted to a search engine. And then once you start surfing, any page you visit could be automatically indexed. The only problem is the millions of submissions you'd get each day.
  • I wondered about those 200M pages already indexed, and I dug into Altavista, which says it has ~140M pages indexed.

    I made two searches; one for the word 'Microsoft' and the other for 'Linux'.

    Altavista gave : 12,682,370 (M$) and 4,526,430 (LX).
    FAST gave : 4689227 (M$) and 2570827 (LX).

    So.. If FAST currently is ~40% bigger than Altavista, how come they return numbers that are much lower? With such large numbers it can't be pure coincidence, In My Humble Opinion.


  • As one of the other posts pointed out, they currently have little to no content from NA.

    Obviously, this would tend to skew the results somewhat. ;0

    I imagine as they get closer to their goal, the search results will become more relevant.
  • by davie ( 191 ) on Tuesday August 03, 1999 @03:44AM (#1768499) Journal

    Not to harp on one of my pet ideas or anything, but I think a distributed spidering project could be pulled off. The trick would be to delegate the work based on compute power and bandwidth, with the "low-end" clients doing the grunt work of spidering, then passing the raw data up to the bigger iron with more bandwidth where the relationships between sites could be ferreted out, keywords could be indexed and context established, etc. These sites could then pass the cooked data back to the top level servers (compressed, of course) for whatever final work needs to be done and then insertion into the database. The idea is to have each client do the work it's best suited for, and to distribute the load more evenly. Bandwidth could be a problem, but I think a lot of the data could be "tokenized" somewhat once references have been established, and some compression would probably help.

    If I had the networking know-how I would put together a proposal and start taking flame-mail, er, suggestions. Since I don't, I hope someone who does and is as crazy as me will pick up on the idea.

  • Searchenginewatch's current size comparison is correct as of July 1, but All the Web hasn't been running with 200M documents for that long.

    Eyvind Bernhardsen
  • Sloppy journalists...

    Check out this []


    -- We plunge for the slipstream the realness to find

  • No one ever said Linux was stable on every single machine in the world; it supports a whole lot of hardware which isn't all that stable itself. :)

    Linux Max Uptime: 845 days, 08:59m
    FreeBSD Max Uptime: 690 days, 23:48m

    Then again, there are about 1/10th the number of FreeBSD entrants... overall not a real big sampling group in general.

    Plus there's no information about hardware anyone is using and why the machine was rebooted (kernel upgrades, hardware upgrades or crashes).

    Overall, it's sorta pointless other than a nice figure to say my Oscar Mayer is bigger than yours.

  • All the Web doesn't use the pattern matching chip, it's all done in software on 50 Dell servers running BSD. What's new is the 200M documents and the official announcement (it's only been up on trial until now).

    Eyvind Bernhardsen
  • Why do it on the client? Indexing would be much faster if the index was carried at the server, with a hierarchy of index servers not doing any spidering at all, if possible.

    Sound familiar? It's Harvest's SOIF format: al/node151.html#SECTION0001200000000000000 00

    Just my 2c - I'd be happier if much *less* of the web was indexed... just the useful stuff. And if search engines could only recognize a mirror when they see one, then I wouldn't get so many identical replies...

  • How is it possible to index the entire web? The entire *static* web should be relatively simple, but dynamic content really throws a monkey wrench in things. And dynamic content is becoming much more commonplace. Not even going into forms, a page referenced by a URL may be different day to day, or even minute to minute (like slashdot).
  • 1. For every site that goes up, one goes down. I don't know how they are going to keep up with dead links.

    2. Slow, if they index everything you will notice definite slowness. Even if they find some kind of uber-fast way of searching through stuff their servers will be slowed down by net-troglodytes searching for the "internet" or the letter "a".

    Imagine how many pages would pop up if you searched for the word "pictures".

  • I applied a similar test - supplied some of the keywords in my web-page (samba encryption smbmount)- and BINGO. The very first entry. This really surprised me because my provider is an obscure German one. I don't know what it does to the competition but it certainly impressed me.
  • "Actually"?
    You claim to KNOW this?
    I noticed you live in the same city as FAST headquarters..
    But maybe you cant talk about that ;-)

    No, seriously, there are a couple of pages at the fast site that imply rather clearly that alltheweb uses the PMC.
    Not explicitly though, you're right about that.
    I seem to remember a picture of one of those dell machines full of those cards, but of course I cant find it now...

    Anyway, just look at this quote from the PMC faq,
    and compare this with alltheweb's claim of scaling linearly.

    >Since the PMC search through data at a fixed speed (100 MB/s), the
    >response time for a query is independent of its complexity. In a
    >software solutions the response time increases more than linear with
    >increasing query complexity.


    -- We plunge for the slipstream the realness to find

  • Good point. I can also see the xtian right and censorware manufacturers being excited about having a comprehensive list of sites that need shutting down/blocking. If people want to remain unpublicized on search engines, they should be able to. Respecting ROBOTS.TXT is a simple solution already available, and I hope FAST will come around on this.
  • Is how well it finds my own home page. Google takes me straight to my main page if I enter my e-mail addy, but this engine only shows me one of the files on my page. It has to do with music files, so I think maybe their algorithm decided it was more relevant/interesting to the average person than my main page.

    I think I'll stick to Google.
  • In this article [].
  • I really hope that they don't seriously mean they will be indexing the entire web. That would mean their crawler would have to completely ignore ROBOTS.TXT.

    I, for one, would like to keep some of the webpages I post on the internet un-indexed, because they were meant for a couple of friends, not a couple billion people, to rummage through.

  • This is probably because, from memory, Google works by basing the relevance of a link on how many other pages link to it. This is why it's hard to find obscure stuff on Google, but if it's something you know is quite popular, it's the best way to find the most popular sites about it.

    I hope that made some sense :)
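The link-counting idea described above can be sketched as a power iteration over a toy link graph. This is only an approximation of the concept, not Google's actual algorithm, and the page names are made up:

```python
# Toy link graph: page -> pages it links to.
LINKS = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
}

def pagerank(links, damping=0.85, iters=50):
    """Iteratively share each page's rank among the pages it links to;
    heavily linked-to pages accumulate the most rank."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs)
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

r = pagerank(LINKS)
print(max(r, key=r.get))  # the most linked-to page ends up on top
```

This also explains the "hard to find obscure stuff" effect: a page nobody links to starts and stays near the rank floor.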

  • also it appears dell has some type of sponsorship - at least has a dell logo on it...
  • If your pages are only intended for a couple of people, try putting .htaccess/.htpasswd access on your directory, or even just leave your page "out of the web" by making sure nothing else links to it (and make sure it doesn't show up in a dir listing). If there's a crawler that can get to your page that way, I'd be VERY surprised.
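For the .htaccess/.htpasswd route, a minimal Basic Auth setup looks like this (a sketch assuming Apache with AllowOverride AuthConfig enabled for the directory; the paths and realm name are hypothetical):

```apache
# .htaccess in the directory you want to protect
AuthType Basic
AuthName "Friends only"
AuthUserFile /home/you/.htpasswd
Require valid-user
```

The password file is created with `htpasswd -c /home/you/.htpasswd friendname`. Unlike robots.txt, this actually blocks a misbehaving crawler rather than asking it nicely.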
  • I did the same check (but then for my name :), and even though AllTheWeb returns 2.5 times the hits that Google returns, there's a slight difference: Google puts my homepage firmly on spot #1, whereas AllTheWeb (probably by coincidence) has it at number eight between a mass of irrelevant mailing list archive links.

    I'll stick with Google - it has this uncanny ability of putting what you want behind the "I'm feeling lucky" button...
  • You're forgetting the privacy issue.

    How many people would want their browser recording and sending off a list of every site they visited during the day? Even then, I doubt it would be a particularly good way of finding new sites that weren't already in your search engine.

  • by Anonymous Coward
    If you want to impress me with a search engine, re-run the search machines and get rid of all the expired and bad URLs. The more URLs an engine adds, the more it becomes unusable.
  • Probably because alltheweb is indexing EVERY page it comes across, even those "Hello, I'm so and so and I love cats..." pages that most search engines thankfully ignore. It even had my webpage in there, which is a first for search engines.
  • Did anyone else notice that these are the same guys that have (or had, before Lycos got hold of it), and that there are gratuitous links to Lycos all over their site? It's not just Lycos in new clothing, is it?
  • could do it.
    It has been suggested on the mailing list once but i don't know what happened with that idea.
    The problem is that you would have to store a huge amount of data somewhere, so you would probably need a Big Company(tm) sponsoring or leading the project. The clients would probably duplicate a lot of work, but this is not a major problem.
  • But [] seems to be running Linux []. Anyway...according to the Uptime List [], FreeBSD [] has much higher uptimes than Linux []. Looks like it is the choice of the folks that don't reboot. I think those are mainly to be found in commercial environments like this one. Quite funny - a search for my nick/handle only finds results on /. [] and [fm] [] :)
  • There was the Harvest project a while ago; I'm not sure what came out of it (a search revealed this document [], which seems sane and not much more than a year old).

    The basic idea was that the pages are indexed locally at the server, and the indexed data are gathered and can be queried at "brokers".

  • This was on /. several months ago
    I'm tired of old stories being new.
    That story last week about N2H2 and Bess...
    Bess is not new, as the subject thingy said, been around for several years, i know, i fought it at my friend's house.

  • This, too, was one of my ideas, a while back when there was a post that invoked talk about a distributed cryptographic filesystem.

    My opinion is that no ONE center could organize all the data on the whole net, since it is so widespread and far flung. My idea (somewhat corresponding to distributed filesystems) was that every client held a piece of the index and had some sort of reliability rating. Low-reliability nodes would have to be backed up with failover, duplicate nodes. Anyway, there would be a whole distributed hierarchy of nodes, based both spatially and, I guess, on reliability. When you asked the master node, or perhaps your regional node, for something, it would forward it on to who IT thought might have the right answers. Each node would do the same, in turn, until the host itself was reached, or a terminal node was reached. The info would then be fed back to you. Yes, it would be slower, but you WOULD get the correct answers. Also, if nodes were distributed spatially, then regional/local nodes could more frequently check for page expiration. One of the major problems is that all these CENTRAL search engines have LOADS of outdated crap. Sure you find a lot... but it's all invalid.

    My Seti client could sure share some CPU with a distributed indexing client...somebody set this up already!
  • The problems are that not all machines are listed in DNS, not all web servers run on port 80, and, on top of that, not all pages are linked from somewhere!

  • Sorta like my idea...every node is a server AND a client...
  • How much do they have in the budget
    to pay for porn sites?

    Do they have a copy of my son's
    Final Fantasy tribute pages?

    Questions Questions Questions
  • uery=%22Robert+Ames%22+woodlands+-golf

    Running the above query says: "12 documents found," but it only shows results 1-10, and doesn't have a link to more results.

    Now I don't know exactly how many pages actually match these criteria, but it seems as though you should show all the matches that you count, unless you're padding your counts ;^)= (btw, that last claim is completely unsubstantiated, I'm just feeling mean :^)=
  • This must be the bestest search engine ever, because the name says it is. You can't do better than "alltheweb". Everyone else might as well pack up and go home.
  • Just for fun I decided to search for myself on Alltheweb. To my surprise I found:

    1. The plan for an old CS group project from college, where my name was referenced!

    2. 2 broken links to ZDNet talkbacks of mine.

    3. A CNet page with a dorky little media player I wrote and released as freeware.

    4. Some random Italian site hosting Win95 software including my dorky media player with full description extracted!! head is swelling... didn't find my page though...heh

  • I did three searches on this thing:

    1. antizeus: 27 hits, and it displayed only the first 20.
    2. notopia: 17 hits, and it displayed only the first 10.
    3. "evil farmer": 33 hits, and it displayed only the first 20.

    One would think that they'd get this sort of detail worked out early on in the development process. Despite that problem, I was impressed by the thoroughness. There was some stuff there that I'd never seen on other search engines.

    (by the way, "Notopia" was the name of a great radio program on KCSB that disappeared several years ago, and Evil Farmer was a great band in the Santa Barbara Calif area which also disappeared several years ago. I miss both of them. Unfortunately, antizeus is still with us.)

  • I heard 130%oftheweb was actually fabricating new content to swell its index...heh
  • anyone know what they intend to do about the sites that don't want to be indexed? ... is it forced indexing? ... are there any laws (???) anyone can think of that may affect this?
    example of a site that doesn't want to be indexed: i know of a pagan group's site that has info for the group to view quickly without waiting for snail mail ... they are not on any search engine because they don't want to be ... i seriously doubt they want to be indexed either
  • Isn't this the way that the old Archie system worked? Lots of different servers would index stuff, and every night they'd exchange what they'd learned (to spread the knowledge, and to ensure that none of the other servers revisited sites too soon). They claimed (IIRC) to visit every public FTP server at least once a month.

    The nice thing about this approach would be that you could have multiple front-ends, too, so the search engine "site" itself wouldn't get bogged down--automatic mirrors!

    This should be fairly simple to implement - a list of sites visited (with dates) on the one hand, and index diffs (for the content itself) on the other. The only question is: How do we keep it from getting "sold out" and losing quality? (not that selling out is bad, but someone mentioned lycos going to hell after getting sold).
  • But surely all the index pages on machines listed in DNS on port 80 + all the pages they (recursively) link to is more than 20% of the web's content ? Almost all sites are both on port 80 and in DNS.

    Someone else (who is still down at 0, because he posted it anonymously) came up with a much better answer, which is that the hardware and bandwidth required to index 100% of static content is extremely large, and anyway most content is not static. It's this last point, I think, which is most important - by definition, nothing you read daily is static content.
  • Ok so repeating an effort to find the various purported etymology of the word "strawberry" I searched with
    +etymology +strawberry +origin
    on both yahoo (my standard) and alltheweb.

    Yahoo found 60 while alltheweb found 117, but a number of allthewebs' finds were xxx sites!?

    How many xxx sites actually use the word etymology and if this is more do we really want more?
  • You never stopped to wonder why your Quakz Gamez were so slow?
  • Actually, All The Web doesn't use the PMC. It's running on 50 standard Dell servers using the Fast Search software.

    - Frode
