The Internet

Search Engines Can't Keep Up 82

joshwa writes "The Boston Globe today reported on a study in Nature saying that search engines barely index one-sixth of the pages on the net. To a certain extent it's a plug for the Northern Light search engine, which claims to be the most comprehensive (at a staggering 16 percent of the web), but it's an interesting read nonetheless."
This discussion has been archived. No new comments can be posted.

  • The article points out that one reason for low coverage is the lag (search engines are months out of date), combined with the incredibly rapid increase in the number of pages (100% growth in about a year). So even if *everything* six months old were indexed, coverage would still be only 50%.
    Anyone who does any searching quickly realizes this, so the study isn't breaking ground here, although maybe it quantifies the problem.

    Beyond this, I don't see how the study's result could be meaningful.

    1) How did they come up with their estimate of 800 million web pages? If that number is bogus, so is the %age. They can measure the pages they found, but how do they measure the pages they couldn't find? Different techniques of estimation might provide great variance in the number of web pages.

    2) Counting pages (and computing coverage) is especially problematic given the increasing amount of content generated dynamically.
  • No one would want a product which says: "I think you will really hate this page. Am I right?"
    I think that would be a very useful thing. Sort of like moderators on Slashdot. The need for negative feedback is clearly seen here, as well as positive feedback. Something as simple as +1 and -1 works very well here. If I came across a page that was way off the topic (say a keyword spammed porn site) I could say "irrelevant - porn site" and it would negative rate that page.... etc. And again, like Slashdot's threshold, the ability to look below the threshold is critical. Yes, you might negative weigh some pages wrongly, but someone could reweigh those (reset to zero? or plus 1?) so that they'd come back into the 'normal' view. Hmmm... I think a slashdot moderation based search engine would be great. You would have the same limit we have here: Can't moderate your own urls, only a limited number of points at a time. Moderation checks and balances. Etc. Rob or Hemos... remember me if you end up the next Jerry Yang! Seth
  • Me thinks you have a bit of a way to go yet! I attempted my search for Linux Home Automation and it failed to bring up a site in England (Fortune City). I kept the search to just central & eastern Europe. It only brought up 1 site about Linux (there are quite a few more sites than that in Europe).

    --
    Linux Home Automation - Neil Cherry - ncherry@home.net [mailto]
    http://members.home.net/ncherry [home.net] (Text only)
    http://meltingpot.fortunecity.com/lightsey/52 [fortunecity.com] (Graphics)


  • An interesting read:
    http://www.research.digital.com/SRC/personal/Krishna_Bharat/WebArcheology/measurement.html [digital.com].

    Also from Compaq (DEC) SRC:
    Web Archeology [digital.com]
    Mercator Web Crawler [digital.com]

  • Google's algorithm is simpler than the one described in Scientific American. The Clever project marks certain pages (authorities) as having _content_ and other pages (hubs) as having links to good pages. Authorities don't necessarily link to good pages, and hubs don't necessarily have content of their own. Google treats everything the same, so in theory it's not as good. Still, since the IBM folk don't have anything available for us to try, it's hard to compare.
  • heh, sorry.
    had to put my 2 cents in. google rules.
    ------------------------------------------
    Reveal your Source, Unleash the Power. (tm)

  • i don't know about that. yahoo seems to be leaning more towards shopping and less towards information. besides the fact that yahoo does a lousy job of cleaning links, a list of results generated from a search is a little like playing minesweeper for dead links, which renders it pretty useless when the listing of results is limited.
  • It's a little late, but here are links to older versions of the search engine coverage study

    98 Results [nec.com]

    99 IEEE Paper [nec.com]

  • So use the text-only version of Hotbot. The solution is right in front of your face.
  • Hey, thanks for the link. I've seen the Netscape open directory at Netcenter, but it never seems to come to the forefront - I guess that's what branding is all about. When I need an answer, I think Hotbot :-)


    Yes, humans do do it better! I use NetMechanic's link checker to keep my links pages up to date - the only problem is, it seems to cache the pages somehow - links that have been removed from the page physically still show up in the report.


    I should also mention the Mining Co, now About.com. I had given up hope on looking for decent 3D graphics sites, and to my amazement, I found a whole section devoted to it and VRML there!

  • When I put up my first pages [tripod.com], I submitted them to the search engines, waited a few days, searched on my name and got pages and pages of junk. I did a new page, on people with the same name as me, figuring that would get me into the running. Nothing.

    So where are my hits coming from? Well, go to MetaCrawler and search for scuba, pictures, women [go2net.com]. That gets you pictures of me with various celebrities (none underwater) along with a mix of dive sites, scuba porn sites and the charming pages of www.whitesonly.org.

  • by twdorris ( 29395 ) on Thursday July 08, 1999 @06:39AM (#1813694)
    It's OK that only 16% of the web is summarized by search engines. The other 84% is dedicated to sex sites anyway...and we all have those bookmarked by now...

  • It's not _quite_ as it seems. Read about moderation; it explains how scores are awarded. The moderator has no control over what word (Informative, Insightful) is used, in the literal sense anyway. Look, just read the pages about moderation [slashdot.org]; apparently these are a little out of date, though.

    To become a moderator, you need to be a user. I really don't understand why regular readers aren't users - IMHO of course :)

    Hopefully, this piece was "informative".

    Mong.

    * Paul Madley ...Student, Artist, Techie - Geek *
  • Yes, I want all the web pages, so if I'm trying to track down my friend JimBob I can find him.

    I also want them to ignore meta tags, or any text in a tag - or at least have that be one of the search options, which'll cut down on an 31337 P0rn site popping up on EVERY search.

    I agree with the sidebar problem. The other problem is half the stuff they seem to have indexed has moved on by the time I search and a lot of "That member's page can't be found" or just 404 errors pop up.

  • How is Yahoo! (an index) even able to compete anymore? Goes to show you what good a little name-recognition can get you...I would bet they are at less-than 1% coverage now!

    I think that a 100% human-entered index is still handy. If only they could somehow quadruple the number of monkeys on typewriters we might really have something: http://dmoz.org/


    -AP
  • Interestingly I find hotbot to be the best search engine in this respect as it has a very useful advanced search option ... "Find color with word stemming located anywhere created at any time but do not include pages containing x,y,z,a,b,c" ... I do a basic search, get a few dozen thousand matches, look over what pages I'm getting that I don't want, hit back and add those keywords to my "don't match" list ... eventually I get a few hundred websites that are highly related. Then I bookmark the search results (as they're dynamic).
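
    (As an illustration of that exclude-and-refine loop, here is a toy Python sketch; the documents and excluded terms are made up, and a real engine matches far more cleverly than this substring test.)

    ```python
    # Toy corpus standing in for a search index; entries are invented.
    docs = {
        "http://example.org/color-theory": "color theory for painters and designers",
        "http://example.org/colorado-hotels": "cheap hotels in Colorado, book online",
        "http://example.org/html-colors": "HTML color codes and word stemming tips",
    }

    def search(query, exclude=()):
        """Return pages matching `query` but none of the `exclude` terms."""
        hits = []
        for url, text in docs.items():
            words = text.lower()
            if query.lower() in words and not any(x.lower() in words for x in exclude):
                hits.append(url)
        return hits

    print(search("color"))                       # first pass: too broad
    print(search("color", exclude=["hotels"]))   # refine by excluding unwanted terms
    ```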
  • To make what I'm thinking of possible, you'd need to have a standard indexing format. I'm sure Microsoft has one we can use, as long as half the links point back to them :)

    Anyway, Altavista, Yahoo, Infoseek, etc... could make deals with the big ISP's/web host services such as Mindspring, Netcom, Earthlink, Geocities, Tripod, etc... Those sites would then index their own sites, which would save your spider/crawler a lot of time.

    The indexes could then be merged back at Altavista, Infoseek, etc... Or the search of those sites could hit all the distributed indexes.

    There is still the issue of other sites not located on these big ISP's, like .edu's and ibm.com's.
  • 1.) Get a massive machine (we're talking massive, beefy, huge, makes any mere mortal shake with fear)


    2.) Grab ahold of a gigantic dataline (OC-192 anyone?)


    3.) Set up an engine to visit every IP conceivable. Then check each site for every directory conceivable... And every filename conceivable..


    4.) Brace for lawsuits


    5.) Throw more hardware at it


    6.) Run SETI@home (in its spare time)


    7.) a month later, all indexed, sued up the wazoo, time to start *all* over again


  • I think a distributed project would be great, assuming anonymity, and given the fact that most people's outgoing pipeline is fairly unused. Browsers could simply toss the current page's META summary to a search server for it to check whether that page is indexed and update that information. If the page isn't indexed, it can do its spidering on it ... of course, a majority of visited pages would probably be indexed already, but the work completed would increase enormously (especially for personal sites).

    For an excellent example that is almost there, see the Open Directory [dmoz.org].
  • Once again I couldn't find a Sherlock plugin, although they do have nice comment tags for parsing the output. If anyone knows of an official one (unlike Google, which just has three unofficial ones, and their recommended one looks out of date to me), let me know. Until then, I have put one up on our Sherlock page [electricfish.com]. Enjoy.
  • I have been a moderator at times.

    An individual moderator can only nudge the score up or down by 1 point. The adjective displayed is the one selected by the last moderator to grade a message.
  • by ChrisGoodwin ( 24375 ) on Thursday July 08, 1999 @04:15AM (#1813705) Journal
    One sixth is a staggering 16 percent.

    Reminds me of a joke, but I can't remember the specifics. Something like "Fully 33 percent of our foos are bar, but only one third of their foos are bar."
  • by Anonymous Coward
    Tangentially related is this short preprint on "the diameter of the WWW" [lanl.gov]. Talks about how many average hops it is from any given web page to any other, and how this might affect search engines.
  • Hmmm...what is the limiting factor in indexing pages? Is it bandwidth? Or CPU? Or just the fact that so many go up and down so fast? If it's bandwidth or CPU, would a distributed project work??? I know you can get dumb Yahoo pager and Altavista Search and all that junk...what if they had "Download: Altavista Index Agent/Spider" or something, where people could use their spare cycles/bandwidth to index...would it work? Does that even make sense? Like SETI, the server could give them some chunk of "namespace" to index and the spider/agents could go at it.
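
    (A hand-wavy Python sketch of what handing out chunks of "namespace" to volunteer spiders might look like; the Coordinator class, the URL prefixes, and the submission format are all assumptions, not anything AltaVista or SETI@home actually provides.)

    ```python
    # Sketch of a SETI-style coordinator that distributes crawl work.
    from queue import Queue

    class Coordinator:
        def __init__(self, seed_prefixes):
            self.work = Queue()
            for prefix in seed_prefixes:
                self.work.put(prefix)
            self.index = {}          # url -> extracted keywords

        def get_chunk(self):
            """Hand a volunteer spider one unit of work (a URL prefix)."""
            return None if self.work.empty() else self.work.get()

        def submit(self, url, keywords):
            """Merge results sent back by a volunteer."""
            self.index[url] = keywords

    coord = Coordinator(["http://example.net/docs/", "http://example.net/blog/"])
    chunk = coord.get_chunk()
    # ... a volunteer would crawl pages under `chunk` with its spare bandwidth ...
    coord.submit(chunk + "index.html", ["linux", "automation"])
    print(coord.index)
    ```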
  • What if someone were to design a 'neural net'-based search engine? Initially it could be much like any other 'dumb' search engine. Instead of linking directly to the target sites, have the links go back to a redirector on the search engine's server, enabling the search engine to get feedback on what pages the user actually utilizes, out of the multitude returned. For instance, when I do a search on "ADSL and linux", it would learn that I only clicked on the links that actually had relevant material, and ignored the multitude of XXX/porn sites that put large blocks of common keywords on their pages... Over time and use, the engine could learn what sort of information is really relevant to "ADSL and linux", and what sort is really not, and rank them accordingly.

    Well, perhaps the computing power for something like this isn't available yet. But it'd be nice...
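
    (A minimal sketch of the redirector-plus-feedback idea, using a plain click log rather than any real neural net; the query, URLs, and weighting are illustrative only.)

    ```python
    # Result links point back at the engine, which records which hits a user
    # actually follows and nudges their ranking for that query.
    from collections import defaultdict

    clicks = defaultdict(lambda: defaultdict(int))   # query -> url -> click count

    def redirect(query, url):
        """Called when the user follows a result link via the engine."""
        clicks[query][url] += 1
        return url                                    # then issue the HTTP redirect

    def rank(query, candidates):
        """Order candidate URLs by how often past users clicked them."""
        return sorted(candidates, key=lambda u: clicks[query][u], reverse=True)

    redirect("ADSL and linux", "http://example.com/adsl-howto")
    redirect("ADSL and linux", "http://example.com/adsl-howto")
    redirect("ADSL and linux", "http://example.com/xxx-keyword-spam")
    print(rank("ADSL and linux",
               ["http://example.com/xxx-keyword-spam",
                "http://example.com/adsl-howto"]))
    ```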
  • by substrate ( 2628 ) on Thursday July 08, 1999 @04:36AM (#1813712)
    The percentage of indexed web sites is small, but the amount of data that represents is pretty staggering. Unlike an encyclopedia or other reference book, which can cross-reference between a concept in the index and a number of appearances of the concept in the body of the text, a web search engine has a much harder job (as do people trying to use the search engine). For an encyclopedia some person does the job of indexing things with an understanding of context, so for instance 'green' in the index would be referenced to entries on 'colours' and 'the spectrum' but not 'grass'. The web search engine blindly returns every instance of the word 'green' with no regard to context. So if the person was actually wondering how to make 'green' with his box of Crayolas (since his sister ate every shade of green in his box of 64) he'd either have to wade through each site till he found what he was after or choose a better search term.

    Machines aren't very good at being intelligent in this manner, so suppose a new search engine was created. You type in a search term and it comes back with a list of matching pages. You again wade through the list, but now you can also award a number of relevance points to the ones that matched closest. This would work well for a while, but would break down in the long run: as the web continues to expand, new pages will be unranked, so they would not appear in the ranked lists of potential hits (at least for popular search terms) and so will never be ranked.

    What might work better would be a search by reduction. Type in some overgeneralized search term and the text on the page is distilled down to a brief outline. There are already packages which can create fairly decent summaries of documents. You click on a button that indicates "I like this, find me more like it" which means that there's something you like about the summary so it generates a number of new more specific search terms from the summary and comes up with a new list.
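
    (A rough sketch of that search-by-reduction loop; the summarizer here is just a crude word-frequency cut and the corpus is invented, but it shows the distill-then-requery flow.)

    ```python
    # Distill a chosen hit into its most frequent terms and reuse them as a query.
    from collections import Counter
    import re

    STOP = {"the", "a", "of", "and", "to", "in", "is", "for", "how"}

    def summarize(text, n=5):
        """Crude summary: the n most frequent non-stopwords."""
        words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
        return [w for w, _ in Counter(words).most_common(n)]

    def more_like_this(liked_text, corpus):
        """Score pages by how many summary terms they share with the liked page."""
        terms = summarize(liked_text)
        scored = {url: sum(t in text.lower() for t in terms)
                  for url, text in corpus.items()}
        return sorted(corpus, key=lambda u: scored[u], reverse=True)

    corpus = {
        "http://example.org/crayon-mixing": "mixing green from blue and yellow crayons",
        "http://example.org/lawn-care": "keeping grass green in summer",
    }
    liked = "how to mix green paint: combine blue and yellow pigment for green"
    print(summarize(liked))
    print(more_like_this(liked, corpus))
    ```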
  • The search by reduction sounds similar to the concept Aeiwi [aeiwi.com] is based on.

    Aeiwi has a unique interface that allows users to add more search terms until they have a manageable number of results.

    Knud
  • I searched a few terms on Northern Light and was sorely disappointed by the accuracy of the hits I received back. It's not even as good as other spider search engines IMHO. Google.com does a better job at having serious hits appear early in the returns.


    The layout is pretty nice, or rather clean, but the search was slightly slow. Who wants to catalog more of the Web when it means that much more noise to wade through? I also think the 1/6 to 16% thing is hilarious.

  • The most frequently searched topic is "sex" and hopefully their 16% covers predominantly sex sites so all is actually well.

    Should anyone bother with meta tags anymore?
  • Where I work "they" are cracking down because 40% of all sick days are taken on Mondays or Fridays.

    Joe
  • I dunno about the rest of you, but my sarcasm meter was really pinging on joshwa's "staggering"...
  • When 80% of the pages on the web are ``JimBob's Personal Web Page'' or ``Click HERE FOR 31337 Pr0N!'' do we really need (or want?) all those web pages bogging down the search engines? I'd say that only about 10% of the web is useful information. If the crawlers can get that (and if it's useful, it will get linked to (in theory) from other web pages) then that's probably all we'd want...

    One problem I have with engines are sites with changing sidebars... when the sidebars mention one of my keywords because it was a recent article when the crawler went by, but the article has nothing to do with what I want...

  • I would be so happy if search engines kept their links current. There's nothing worse than searching Yahoo and finding something you need, but half the links listed are dead...a good example is the category for color picker tools.
    I was looking for a Director page...and came up with a page apologizing to people who came from Yahoo. It read something like "This link has been dead since Dec. 16, 1997, if you're wondering how long Yahoo keeps old URLs".


    Yeah, when I need something in-depth, Google, Hotbot and Ask Jeeves do the job pretty good!

  • I guess the way I'd go about getting total number of web pages would be:

    1. Search a random set of IP addresses for web servers. Make this as big as you can. Just look at port 80 for now to make the task easier.

    2. Find out how many pages are on each of the servers you find (OK _this_ is hard)

    3. Choose a subset of the machines you scanned and check them for web servers on unusual ports.

    Work out the number of servers on odd ports per server running on port 80, and add this to the servers found in (1).

    Average (mean) the number of pages per server that you attempted to establish in (2).

    Now we know the % of IPs that run a server, and we know the number of possible IP addresses there are, so we can get a fair guess at the number of servers out there. We know the average pages per server, and Robert's your father's brother...

    Problems:

    Big sites with many pages versus home servers on cable. Hit a few too many big sites, or too few, and you are way out.

    Just finding the number of pages on a server: how many does Geocities have? Do they tell this stuff?

    Active pages: does Slashdot have infinite pages if you keep adding users and generating unique views?
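
    (Putting invented sample numbers through that recipe shows the arithmetic; every figure below is made up.)

    ```python
    # Back-of-the-envelope extrapolation from a random IP sample.
    ips_probed        = 100_000    # random IPs checked on port 80
    servers_found     = 250        # of those, how many answered with a web server
    mean_pages        = 120        # average pages per sampled server (the hard part)
    odd_port_ratio    = 0.05       # extra servers on unusual ports, per port-80 server
    total_ipv4        = 2 ** 32    # upper bound on addresses to extrapolate over

    server_fraction   = servers_found / ips_probed
    estimated_servers = total_ipv4 * server_fraction * (1 + odd_port_ratio)
    estimated_pages   = estimated_servers * mean_pages

    print(f"~{estimated_servers:,.0f} servers, ~{estimated_pages:,.0f} pages")
    ```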
  • I've always had trouble with search engines. I've registered my pages with various services and basically it hasn't helped. Most people find my pages through my sigs, or off other similar pages which have links to my page.

    I did a search on Northern Light and it didn't find my pages, but it did find pages with links to my page. I also used the power search, but that failed also.

    I probably shouldn't be too upset as I get about 1000 hits per month. Since it is a specialized page I don't think I'll get any more hits. But it does tick me off that if I want people to find my page I have to pay for it. I thought a search engine's reputation was supposed to garner it more attention and therefore more advertisement dollars. Now, to increase their reps, I have to pay them.
    --
    Linux Home Automation - Neil Cherry - ncherry@home.net [mailto]
    http://members.home.net/ncherry [home.net] (Text only)
    http://meltingpot.fortunecity.com/lightsey/52 [fortunecity.com] (Graphics)

  • Ideally, then, the "index" would really be a "living" database... every node containing a fraction of it which changed over time (nodes should probably be ISPs or people with dedicated connections). A "query" would somehow trickle through index servers/nodes getting the most relevant hits. For instance, it makes sense for the indexes of items to be served from *where* they are located (hence ISPs). Wanna find out about "foobar"? Well, the central server knows that server Y over there has high "foobar"-ish content, so it forwards its request and gives results from that node a higher priority. Each node does this until some threshold is hit. Of course latency would probably be staggering, and if a node went down, there'd have to be a backup.

    I sort of like the idea of a self-indexing web... almost like a neural net (insofar as it is mutable through requests being submitted)... central databases don't seem to work well... they are too "removed" from the actual destination...
  • This research is actually being published in today's Nature. [nature.com] The Globe just regurgitated yesterday's NEC Research Institute press release, [eurekalert.org] and did a good job of hiding the attribution in the middle of the article. NECRI [nec.com] will be making more info [wwwmetrics.com] available via the web, but it wasn't up as of last night.

    (Note to Rob: I submitted this same story to /. yesterday afternoon, with links and proper attribution to NECRI and Nature, but I guess accuracy doesn't count as much as timing.)

  • I was actually involved in research on this very concept, back in 1996 (seems like pre-history, eh?). You can check out Professor Jude Shavlik's research at http://www.cs.wisc.edu.

    I left the project because I don't think it works. The system would have involved THE WORLD'S LARGEST NEURAL NET, by having inputs which contain information describing all of the "important" words on the page, the distance between various words, and the font size used to display the words.

    IMHO, there were a few insurmountable problems with the project. One, the neural net was way too large. There are too many words to search, and the word list would need to grow over time (in 1996, would the words "Linux" and "PalmOS" and "WinCE" have been frequent enough to merit their own input nodes? Probably not. Today, on the other hand....). How do you design a neural net which changes the number of input nodes over time, but doesn't lose its current weights? I don't know if there's any research on this, but it would be interesting.

    There are various problems with synonyms and related words as well. I also wasn't sure that the Hn tags were good indicators of importance. Web pages aren't structured like outlines anymore.

    The biggest problem is the lack of NEGATIVE feedback. You only tell the neural net search engine what you like, not what you don't like. Neural nets are initialized with random weights for various technical reasons (Prof. Shavlik has experimented with starting neural nets off with rule-based knowledge in his KBANN project). That means that some things which you DO like will most likely get negative weights at first and you'll never see them. While you might specify a list of words you do NOT want to see (which would help the inputs), you would probably not spend time examining pages to see if they do NOT interest you (which means you would never do back propagation with a negative answer). No one would want a product which says: "I think you will really hate this page. Am I right?" The problem is that this is a very necessary part of training a neural net.

    This isn't to say that the research project hasn't shown some results, but it isn't as ideal of a solution as you'd think.

    -jon

  • To make what I'm thinking of possible, you'd need to have a standard indexing format. I'm sure Microsoft has one we can use, as long as half the links point back to them :)

    Isn't that part of what the META tag is for? Or the LINK tag?

    Looking over my copy of the HTML 4.0 specification, there's not a specified list of META attributes, but maybe the following should be considered standard for search engines:

    • "description": for an overview of your page
    • "keywords": give something for spiders to index by

    The following LINK attributes should be set also:

    • "home": Topmost level of your site
    • "copyright": Copyright info
    • "made": Author information

    That way, a search result could take the format of:

    • Page Title
    • META description
    • URL
    • Home LINK attribute
    • Author name (or webmaster of a larger site)
    • Copyright information
    • Keyword relevancy

    The best thing about the LINK attributes is that at least one browser, iCab [www.icab.de], provides a set of buttons for several LINK attributes -- start, end, next, prev, home, search, help, made, etc. Too bad it's MacOS only; maybe someone could create a similar set of buttons for Mozilla?

    Anyway, Altavista, Yahoo, Infoseek, etc... could make deals with the big ISP's/web host services such as Mindspring, Netcom, Earthlink, Geocities, Tripod, etc... Those sites would then index their own sites, which would save your spider/crawler a lot of time.

    Now there's a thought! Then meta-search engines like Metacrawler could have more meaningful returns.

    Am I the only one that thinks a search engine should be a commodity? I don't care which search engine I use, so long as I get the best results. (Keeping paid advertisements out of the search results would be a benefit, too...)

    There is still the issue of other sites not located on these big ISP's, like .edu's and ibm.com's.

    Maybe someone should consider an EduSearch search engine, indexing only sites under the .edu domain? (Especially if its index can be used by a larger metasearch engine...)

    As for ibm.com and the like, large corporate web sites should have some form of search facility; an Alertbox column from UseIT.com [useit.com] discussing corporate intranets says that having some form of search facility should be considered essential -- I don't see why the same shouldn't be true for their Web shingle as well.

    Jay (=
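
    (A small sketch of how a spider might pull those META and LINK fields into a search-result record, using Python's standard html.parser; the sample page and field names just follow the conventions suggested above and are not any standardized format.)

    ```python
    # Extract the proposed META and LINK fields from a page.
    from html.parser import HTMLParser

    class MetaLinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.record = {}
            self._in_title = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name") in ("description", "keywords"):
                self.record[attrs["name"]] = attrs.get("content", "")
            elif tag == "link" and attrs.get("rel") in ("home", "copyright", "made"):
                self.record[attrs["rel"]] = attrs.get("href", "")
            elif tag == "title":
                self._in_title = True

        def handle_data(self, data):
            if self._in_title:
                self.record["title"] = data.strip()
                self._in_title = False

    sample = """<html><head><title>Example Page</title>
    <meta name="description" content="An overview of the page">
    <meta name="keywords" content="search, indexing, metadata">
    <link rel="home" href="/"><link rel="made" href="mailto:webmaster@example.com">
    </head><body>...</body></html>"""

    parser = MetaLinkExtractor()
    parser.feed(sample)
    print(parser.record)
    ```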

  • I don't think Yahoo! is a human-entered index.

    But I agree that Yahoo! can't compete anymore, if you want your site to be indexed with it you have 2 options.
    1. You add it for free and it shows up in a month or six :(
    2. You pay (this sucks) for it and it shows up very fast
    However, I still think search engines are the biggest solution for somebody finding your site; banner ads come second.


    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Belgium HyperBanner
    http://belgium.hyperbanner.net
  • The moderation choices are things like "funny," "insightful," etc. Just like the adjectives you see here. There is also "overrated" and "underrated," if I recall correctly.

    Moderators don't directly set the score, they just somehow nudge it in a particular direction with a particular adjective.

    I don't know if it takes more than one moderator to assign a particular adjective to a post.
  • In another message on this topic, I commented that search engines should be a commodity. It shouldn't matter what engine you use, so long as you get the most, best results.

    Could we turn web search engines into a distributed hierarchy like DNS? I don't expect my ISP's DNS server to have every IP address on the planet, but I expect it to be able to find the ones I need.

    Have each of the major ISPs (especially those that give their members web space!), free web page providers, companies that do virtual domain hosting, and large corporate/education/organization sites maintain their own index of web pages.

    There could be generic, "top level" engines like Yahoo and Altavista (which could choose to exclude indexes of porn sites) but also more focused engines -- educational sites, business sites, scientific and technical sites; hell, why not a porn engine?

    Would this work?

    Jay (=
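
    (A toy sketch of that DNS-style delegation; the provider names and their mini-indexes are invented, and a real system would fan the query out over the network rather than inside one process.)

    ```python
    # A top-level engine forwards a query to per-provider indexes and merges results.
    PROVIDER_INDEXES = {
        "geocities.example": {"scuba": ["http://geocities.example/~jim/scuba.html"]},
        "university.edu":    {"scuba": ["http://university.edu/~diveclub/"]},
    }

    def query_provider(provider, term):
        """Stand-in for a network call to a provider's own index."""
        return PROVIDER_INDEXES.get(provider, {}).get(term, [])

    def top_level_search(term, providers=PROVIDER_INDEXES):
        results = []
        for provider in providers:        # a real engine would query these in parallel
            results.extend(query_provider(provider, term))
        return results

    print(top_level_search("scuba"))
    ```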
  • There's been quite a bit of work done in the last couple of years for devising completely new methods of storing spider information. Scientific American very recently had a description of one of them, although there are a few others as well.

    The system in Scientific American works by analysing not merely the contents, but the relationship of the links. It then classifies sites and documents according to the pattern of links into and out of them. This helps in prioritizing "authoritative" sites, for example.

    You should check out the article and its bibliography.
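
    (For the curious, the hub/authority scheme the article describes boils down to a power iteration along the lines of this toy sketch, essentially Kleinberg's HITS; the link graph below is made up.)

    ```python
    # Tiny hub/authority (HITS-style) iteration over an invented link graph.
    links = {                       # page -> pages it links to
        "hub1": ["authA", "authB"],
        "hub2": ["authA"],
        "authA": [],
        "authB": ["authA"],
    }

    hub = {p: 1.0 for p in links}
    auth = {p: 1.0 for p in links}

    for _ in range(20):             # iterate until the scores stabilize
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
        hub = {p: sum(auth[q] for q in links[p]) for p in links}
        # normalize so the scores stay bounded
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    print("authorities:", sorted(auth, key=auth.get, reverse=True))
    print("hubs:", sorted(hub, key=hub.get, reverse=True))
    ```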

  • Hey...how about running standard index daemons or something then instead of pointless chargens?
  • Why do we need to search the whole Web?
    Are we afraid that someone in New Guinea has the answer to our life's problems?
    I don't see why searching the whole web is any more relevant an activity than reading every book that has been written. Some will see a flaw in this: they'll say, "Reading the web and searching the web aren't the same thing - I want to know my choices." Fine, I say - you don't know all your choices when it comes to books, either.

    Then there's the "quality" argument: "I don't want all of the references to 'X' - I want only the 'good' references to 'X'." On the Internet not only does no one know if you're a dog, they don't know if you're a dog with bad taste! I think this argument needs to be changed; I like the Social Sciences Index idea, personally: the number of references to an article makes it "important". That is, the greater the number of times that an article is refererred to by another article, even if the reference is only to refute the original, the higher the ranking of the article. We already see this in action - they're called portals. They are the hot spots of the web...

    --
  • Just to set the record straight, the poster's assertion that the article is a Northern Light plug is completely baseless. The authors (Lawrence and Giles) work at the NEC Research Institute (where I work), which has no connection to Northern Light. In fact, they did an earlier and less comprehensive study a year ago that showed HotBot and AltaVista had the greatest coverage at that time.
  • Posted by Jeff Martin:

    These cached engines need to update on a daily basis if they intend to remain functional.
    As websites often update, they change the pages and the names of pages to fit a new look or feel. The searches I used found pages that did not exist anymore, nor have they for a few months now.
    Oh well maybe people will "back up" in the URL when they visit...
  • I switched to NL quite some time ago, and it's comprehensive enough for me (dunno about your "common searches"). What I like best is
    1. Breaks down the search results by their type and location (mini-directory)
    2. Doesn't annoy you with stupid plugs (ahem, "recommendations").

    Apparently they also have a fair amount of non-web (presumably OCR scanned) material, but I've never tried purchasing it.
  • I run and manage a web site in my *COUGH* spare time, whose purpose is to categorize other sites with Middle Eastern dance (better known as belly dance) content.
    Having started up a couple of years back, I can say I've seen some of what this article is talking about. More and more, I see sites listed and mentioned by word of mouth that I had not found via any of the major search engines. Even with date restraints, a search of the majors (AltaVista and HotBot in my case) can eat up days, literally.
    The reviews I write tend to note this fact -- although I have a few "big" Middle Eastern dance sites, my focus and goal is noting all the little sites that are being left behind. Most of them still come from the search engines, but it's just too much. Even with 100 workers, I'd still not get them all, could not.
    I can't say I know of a realistic way of overcoming this. What would be good is a strong effort to have all the major ISPs offer an easy way to register with all the search engines any pages their users create. It's easy to create a web site, but so many people get left behind in actually promoting it, and when they do, they do so very poorly. (For the moment, let's ignore those who just don't do HTML well.) Without the promotion, it's just for a few family and friends, unless the content is really interesting, and it is promptly drowned out by the chaos of the web.
    Also, I think projects like Google and the push towards XML are imperative to the health of the web. We need to move away from the free-form nature of _everything_ on the WWW, and towards some more structure, more focus. People simply need to be able to find stuff, and they cannot right now. I'm going to do my part -- my site is being converted to XML for the far future, and, for the near future, the perl scripts that build it have already been rewritten to be moved to a server with CGI, so that people can search my site, specifically.
    Just my two cents.
  • One search engine not complete enough for you? Search a bunch of them with a meta-search engine. I like SavvySearch [savvysearch.com].

  • If a page is truly useful, likely someone is accessing it. A distributed program to harvest those pages could be quite useful. You could choose when to allow it to examine your browsing history, and when to pull back the curtain, as it were. Of course, you'd have to make privacy guarantees. You'd also want to make the source code visible to the world. If a page you were browsing was unknown to the system, then spidering from it would probably be quite productive, so the program could harvest your spare CPU cycles to spider from any pages that you visit that the search engine does not yet know about. Everyone would have an incentive to participate to make sure that the pages they want to see indexed are actually indexed.

    To avoid the Netscape "What's Related?" fiasco, the authors should allow the end user editorial control, and provide for some discretion over and anonymizing of the results submission.
  • Posted by foole:

    Searching "Linux home page perl" on hotbot:
    (After clicking Reload, due to "Connection Reset by Peer")
    8 advertising graphics (including a mini-form letting me look up "Linux home page perl" at kidflix.com)
    4 "search partner" links (including "How to Buy a House Online")
    The search results start halfway down the page.

    I do not enjoy scrolling to look at the reason I'm on the site in the first place: search results. Hotbot is good as far as the search engine itself goes, but I find myself at Google and Northern Light these days simply because of presentation!
  • Where I work "they" are cracking down because 40% of all sick days are taken on Mondays or Fridays.

    Dilbert used that very item, and that's where I heard it from.
  • Outdated information != useless information

    If I have a thirty-year-old piece of equipment, I still want to be able to find a thirty-year-old document describing it in detail. Even better -- thirty years of accumulated information describing it in detail.
  • for Selling Fantasy Real-Estate.

    If you scroll down the page, you'll find the story about Kevin Roseler, an employee at Origin Systems, who was dismissed after he abused his privileges at Ultima Online to generate castles/gold/etc. and sell it for $7000 on eBay. What he did wasn't technically legal, but it was an abuse of power and whatnot.
  • The link to the Boston Globe is dead. Check out an article from Pigdog Journal [pigdog.org] about the exact same topic. It also has a link to a BBC article about it.

    Web Search Engines Are Falling Down on the Job! [pigdog.org]
    1999-07-07 18:18:38

  • Whoops, little trouble typing there...

    Anyway, just go to the Pigdog front page (www.pigdog.org). The article is right there. This slashdot message board doesn't like the long URL for the article.

  • by SimonK ( 7722 )
    That's what Google (www.google.com) does. It's pretty good when you use good search terms. It still sucks when your search turns up a bunch of irrelevant links, and they also end up in the indexing process along with the relevant ones.

    In the end, the only solution is to structure the data better than HTML allows. XML here we come ...
  • Frankly the fact that only 16% of sites are indexed is something of a relief until the search engines can get their indexing better sorted out. Google does the best job of prioritising away obviously irrelevant results, but it still gets it wrong a depressing amount of the time.

    It seems to me that the only way we're ever going to get away from ever-deteriorating keyword searches and ever more corruptible and less competent cataloging sites is to switch over to better (ie. more logical, more meaningful) forms of mark up than HTML provides. XML anyone ?
  • That's just it - throw "quality" out the window.
    Several journal indices don't measure "quality" per se - they measure impact, essentially.
    That is, the more often an article is cited, the more impact the article has - either as a source of truth, radical theory, or wrong thinking.
    There was an interesting article in the July Scientific American about something similar to this, where sites were classified as hubs and sources and ranked according to a scheme where the more hubs that linked to you, the higher your score as a source, and conversely a hub scored higher depending on the quality of the sources it linked to. They call it hyperlinking, due to their determination of quality through the meta-information of the actual infrastructure of the web.

    The article should be located here:

    http://www.sciam.com/1999/0699issue/0699raghavan.html
  • Sounds like you might be interested in the Mozilla Directory, i.e. DMOZ.org [dmoz.org] -- it's an Open Source web directory, more or less. There's a feature on there for the editors which point out dead links and even sends an e-mail to the editors warning them about the dead link so it can be corrected.

    Plus, as the tag line goes, "humans do it better."

    I use Google [google.com] for a search engine and DMOZ for a web directory. Either way, I tend to find what I need much more often than not.

    -Augie, is an editor on DMOZ by way of full disclosure

  • so how does a new site that has in-depth detail on your required subject ever get seen? when this site is launched it has zero references to it - so in your terms it has no value, will not get indexed, leading to it not getting seen, leading to no new references to it...

    alternatively - for example if you want to know about biker groups in your area, so you do a search on biker - the results are predominantly porn sites, since these sites sometimes have a tangential reference to bikes, but reference each other a huge amount, giving them 'quality', in your terms, relative to bikes.

    I don't think that quality can be measured using number of references as the primary criterion.
  • I believe that's what Google is based on, only indirectly. Google claims to rank results based on the keywords matched and the number of referrals in its database that point to the page. For example, if you search for "linux", the first seven hits are as follows: "www.linux.org", "www.redhat.com", "www.planetit.com/[something]", "www.debian.org", "www.li.org", "linuxtoday.com", "www.linuxjournal.com". If you search for "linux stuff" your first hit is Slashdot.

    Once you add more words to your search, the feature starts to stand out, picking out the most popular of the sites by the number of other pages which link to it.
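
    (A much-simplified sketch of that ranking idea: among pages matching the query, prefer those with more inbound links. This ignores Google's actual PageRank computation, and the pages and link counts below are invented.)

    ```python
    # Rank query matches by a crude inbound-link count.
    pages = {
        "http://www.linux.org":      "linux kernel distributions and documentation",
        "http://slashdot.example":   "news for nerds, linux stuff that matters",
        "http://tiny.example/linux": "my linux stuff page",
    }

    inbound = {                      # url -> number of pages linking to it
        "http://www.linux.org": 5000,
        "http://slashdot.example": 3000,
        "http://tiny.example/linux": 2,
    }

    def search(query):
        terms = query.lower().split()
        hits = [u for u, text in pages.items() if all(t in text for t in terms)]
        return sorted(hits, key=lambda u: inbound.get(u, 0), reverse=True)

    print(search("linux"))           # big, well-linked sites float to the top
    print(search("linux stuff"))     # a narrower query surfaces a different winner
    ```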
