Is The Web Becoming Unsearchable? 249
wayne writes: "CNN is running a story on web search engines and their inability to keep up with the growth of the web. Web directories such as Yahoo! and the Open Directory Project can take months to add a site, and the queue of unreviewed sites is growing. Most search engines are even further behind and are filled with off-topic and dead pages. The trend is toward pay-for-listing. Will the free, searchable web fade away?" The article gets beyond the typical "Wowie, so much content, engines can't keep up" blather and addresses some of the reasons search engines have a hard time keeping up.
Signal to noise ratio (Score:1)
Yahoo taking months to add a site (Score:1)
Yowie. (Score:4)
Yowie.
----
Yes (Score:1)
Next question?
Call Intercept (Score:1)
I wouldn't call all of those services useless... there's an interesting one from Verizon called Call Intercept. If the caller's number is unavailable or anonymous on Caller ID, they are sent to a message asking them to identify themselves. If they don't, then they don't get through. Great for those "please stay on the line for an important message..." phone calls that telemarketers & bill collectors love :).
Re:Google (Score:1)
Re:Yes (Score:1)
Re:AltaVista hates Lynx (Score:2)
Apparently this is an attempt at foiling script-based "ping and, if down, submit as dead" attacks on other people's entries.
I think a more reasonable way of handling this would be to, e.g., check the site for 2 days in 12-hour increments (to allow for things like eBay's Sunday maintenance windows). If there's no positive response during that period, then drop the link.
In any case, I was only using that mechanism as an example of a saner approach than requiring 100 votes to automatically mark a site as dead. I don't personally use AltaVista's search engine or endorse it, but this mechanism could be linked to a browser button (which could work with AltaVista if they used my method instead of requiring a multi-step submission plus entering text from a GIF).
Sounds like a good title for a trivial patent, even..
Method of verifying URL availability for a database of URLs
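The retry scheme described above is easy to sketch. This is a minimal illustration, assuming a plain-HTTP probe; the check count and interval follow the comment's suggestion, and the probe and sleep hooks are parameters only so the schedule can be exercised without a network:

```python
import time
import urllib.request
import urllib.error

def probe(url, timeout=10):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        return False

def link_is_dead(url, checks=4, interval_hours=12, probe=probe, sleep=time.sleep):
    """Declare a reported link dead only if every probe over ~2 days fails.

    A site that is merely down for a weekend maintenance window will
    answer at least one probe and so survives a malicious dead-link report.
    """
    for attempt in range(checks):
        if probe(url):
            return False                  # any positive response clears it
        if attempt < checks - 1:
            sleep(interval_hours * 3600)  # wait for the next check slot
    return True
```

Any single successful probe clears the report, which is the whole point of spreading the checks out rather than trusting one failed ping.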
Re:Another way (Score:3)
The correct way to handle this situation is what the search engines already do: when a link is reported dead, they make a request to it. If it returns an HTTP 404 response code, or the site is down, it's marked as actually dead.
I'm not convinced this is always a good idea, though - I've worked for a guy who would battle a few competitors for top positioning on the search engines. Whenever one of them noticed that the other's site was down, they'd submit it as a dead link. I like Google's cached-page mechanism, which lets you view sites that are currently unreachable. Great for when you need docs from a site that happens to be down at the time.
This is actually trivial to implement, as shown on Google's toolbar page: http://www.google.com/options/toolbar.html [google.com] Of course, you'd need to use this technique with a search engine that takes dead-link submissions. E.g., AltaVista and its "Add or Remove a Page" link here: http://web.altavista.com/cgi-bin/query?pg=addurl [altavista.com]
Good one! (Score:1)
The *only* way to search the web.. (Score:1)
(Well, not really, but it's damn good...)
It's about the only Window$ app I use anymore.
It's kinda gone downhill since the parent company was bought out by ZDNet, but it still works pretty well.
It meta-searches about a dozen of the major search sites simultaneously.
I use it a lot to search for the meaning of obscure error messages and error codes and stuff like that.
Used to use it a lot for searching out what cryptic .dll filenames were related to...
t_t_b
--
I think not; therefore I ain't®
Google spidering (Score:2)
I've also been kicked from first to sixth on a search for "book reviews" [google.com] :-(.
Danny.
Re:Hmm... (Score:1)
--
A few errors (Score:1)
Search engines can't type things into forms and get results in an intelligent way.
It's just a shame that they get confused in their expressions.
Nice piece generally, though. 550 billion web pages is an awful lot.
Unsearchable... nope not here. (Score:1)
I can usually find what I'm looking for using either Google or AltaVista. The heuristics used by Google are the best I've seen in any search engine. I can usually find stuff that is anywhere from several days to several years old.
Come on... these are the same people who were claiming that we would all have run out of IP addresses by now. They don't seem to realize that everything adapts.
Re:possible solution (Score:1)
Re:Google (Score:1)
There is a way to fix this... (Score:2)
BTW, AFAIK Google doesn't change rankings for money; it adds those little side-links for money. I do hope they stop adding gingerbread now, lest the site end up as cluttered and useless as Deja did.
Neurogrid (Score:1)
What I've been looking for (Score:2)
There are lots of really informative
(btw, if anyone finds a
========= Put my nick in front of the "_". I love my computer
Yeah (Score:2)
Either that or friends will send me links.
Hmm... (Score:2)
Re:Hmm... (Score:2)
Re:Gnutella (Score:2)
If you want 99.9% of Internet traffic to be nodes forwarding search requests and results back and forth, that's the way to go.
bad web site design. (Score:2)
yeah, but (Score:2)
Ok, then here's an easy one.... (Score:2)
Yes, it's "Vanity Web Surfing", but if Google indexes my site, why doesn't it automatically categorize it? (whine whine)
So, yes, Google is pretty derned good. But it's still not a directory, and the directory it does have covers, what, 1% of the web? 0.01%?
Primitive Replacement for a Directory (Score:3)
The phone company provides you with one free listing (unlisted is optional), and makes you pay for each extra category (like in the Yellow Pages -- and if you're not from the U.S., please see http://www.bigyellow.com/supertopics for an example) that you want something listed in. Search engines ought to be replaced with something similar.
Yes, I know Yahoo and Dmoz try, but they don't go out and actively index sites, making their use limited, and the number of sites even more limited. If Google were to create a Yahoo/Dmoz style directory, that would help. Better yet, if people were forced to provide either META tags, or some information when they acquired their domain (part of whois?)....
For example, where can I get my oil changed in Paris, France?
This was solved years ago, but... (Score:3)
To use a book analogy, the entire web is built on Dewey Decimal addresses (URLs), when what we need is those combined with ISBN numbers (URNs).
I didn't make up the idea of URNs - the concept was first described to me by Peter Deutsch, the inventor of Archie, at Interop sometime in the early '90s, shortly after the web got going. (Back when there were no search engines, and we found out about new web sites by visiting NCSA's What's New page, which for a while, anyway, actually cataloged *every* new web site that appeared, and some of us could claim to have surfed the entire web...)
The idea behind URNs is that they would be a unique identifier for the content. The same content living on different sites would have several URLs, but only a single URN. This is still needed today, but the problems that kept it from being implemented then are even more intractable now: Who hands out URNs? (IANA didn't want to touch that!) How do you handle versioning? What about dynamic content? Who are the librarians?
We still desperately need something that fills this need, but it's not likely we'll get it. One last parting thought - in discussing this with Deutsch, he pointed out that these are new problems to us, but that the library scientists solved them quite some time ago: it is only the typical CS insistence on reinventing everything and dismissing the knowledge of those in other fields that makes the process so incredibly painful... Hubris strikes again.
Dizz-net (Score:2)
We had some cool ideas, but the infrastructure for such a thing would be huge. I have a bunch of interesting messages from the mailing list describing some pretty cool stuff, like having nodes only search for stuff that near them, network-wise, to lessen the load at critical points. There was also some talk about moderation ("Click here if this link is not relevant to your search") and heuristics to stop common abuses (spider-bait).
It never happened, because it's pretty heavy stuff to implement properly.
I'm sure some patent-squatter has a patent on it already, with the full intention of letting someone else do the hard work.
Are libraries becoming useless? (Score:2)
Posted by Hemos on 03:53 PM March 27th, 2001
from the we-talk-and-talk-about-same-crap dept.
segmond writes: "CNN is running a story on libraries around the world and their inability to keep up with the growth of the number of books published. Even the libraries of the biggest institutions, such as Harvard, Yale, and MIT, can take months to add a book to their collection, and the queue of unreviewed books is growing. Most libraries are even further behind and are filled with off-topic and old assembly books about VAX and Z80 programming. The trend is toward paying to list your book. Will the free, searchable library fade away?" The article gets beyond the typical "Wowie, so much content, libraries can't keep up" blather and addresses some of the reasons libraries have a hard time keeping up.
Search engines can't find everything... (Score:3)
DirectSearch - Invisible Web Search [gwu.edu]
The InvisibleWeb [invisibleweb.com]
WebData.com - Invisible Web Search [webdata.com]
InfoMine - Scholarly Internet Resource Collections [ucr.edu]
AlphaSearch - Invisible Web Search [calvin.edu]
IIRC, Slashdot even ran an article about this not too long ago - I think this [slashdot.org] is it, not sure...
Worldcom [worldcom.com] - Generation Duh!
Re:possible solution (Score:3)
Of course it is. (Score:2)
The trend is toward pay for listing (Score:2)
Is this really a big deal? Hasn't anyone used the yellow pages in a phone book before? People have to pay to be listed in that, and it's very useful for finding companies.
Ad impressions are increasing! Increasing! (Score:2)
--
Oh boy, more gloom and doom (Score:2)
Now maybe there are vast areas of the web unavailable to google searches because of language quirks or protective admins, but so what.
They have as much a right to exist uncataloged as I do to have an unlisted phone number. If sites want to be indexed, they can register with a search engine. If they don't, and are unreachable, so be it. I don't see what the problem is.
Google (Score:5)
------------
CitizenC
Progress between Yahoo! and Google (Score:4)
Google's approach is novel; make the web pages rank themselves. If more people link to your site, it's probably a better site. If few enough people link to it, it probably isn't and besides that it'll probably never be found.
Web site creators have to do the legwork to get their sites recognized, and going to a general search engine to do it isn't the way. If someone makes a site and tells their friends about it, and their friends like it and link to it, it'll get picked up; that's the way of the web. (At least, it'll get picked up by crawlers like Google, and even ranked highly if enough people link to it).
Search engine tech has yet to catch up with dynamic pages, but it's the fault of the content creators if they want their pages on search engines but can't code enough alt tags to make their stuff show up.
In any case, the bulk of the web does work, and good pages get recognition. I've always eventually been able to find what I'm looking for on the web, no matter what the topic. Search engines have to grow like everything else, but so far they're the best thing going and getting better.
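The ranking idea described above (pages that attract more links score higher, and links from high-scoring pages count for more) can be sketched as a simple power iteration. This is a textbook-style approximation, not Google's actual implementation:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch of link-based ranking.

    `links` maps each page to the list of pages it links out to.
    A page's score is fed by the scores of the pages linking to it,
    so heavily linked pages float to the top on their own.
    """
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # dangling page: spread its score evenly over everyone
                for target in pages:
                    new[target] += damping * rank[page] / len(pages)
        rank = new
    return rank
```

On a toy graph where many pages link to one site and nothing links to another, the linked-to site ends up with the higher score, which is exactly the "the web ranks itself" effect the comment describes.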
The power of "Word of Mouth" (Score:4)
This is how I found
Did anyone out there get hooked up to
-----
Then why did they refuse me on DMOZ? (Score:2)
Re:Then why did they refuse me on DMOZ? (Score:2)
Topic Specific Search Engines (Score:2)
If you need to find more relevant documents on specific subjects, I recommend using topic-specific search engines. I maintain one for all subjects relating to Paganism and Wicca on my Omphalos website [omphalos.net]. True, the site submissions have to be manually approved and this can lead to backlogs of site submissions, but since I spider all of the websites I have included in the directory (totalling over 140,000 webpages so far) the relevancy of any search results is raised by the lack of clutter from unrelated websites.
Similarly, if you are searching for information on Space Exploration try Spaceref [spaceref.com] where I used to work. Again, the directory is manually generated, and the results are greatly improved overall.
Nothing guarantees improved relevancy (for general purposes nothing beats Google in this respect), but using specialty search sites helps immensely in many cases.
Micropayments and Minipayments (Score:2)
If I were to set up a search engine:
Every unique domain name found would get crawled for free. You paid for a domain name, you must care about your content.
Every geocities-style cheap personal page would require a small fee to get crawled. Too much schlock; scan only the stuff people care about. You don't wanna pay your own fee? Ask a visitor to pay the fee. PayPal or something newer/better should do the trick.
Every dynamic page like slashdot, everything2, or real estate listings, would have to have a more expensive agreement in place to get anything indexed. The buck stops at cgi. Waste no time on something that will probably be gone tomorrow.
Commit on the resources it will take to prune and groom the stale dead stuff out of the index, regularly. Dead links are bad business.
The Poor Man's Site Announcement Service (Score:2)
Well, I'm tired of them too, and I write pages that I submit to search engines from time to time, and I've come up with what I feel is the best way to submit links to a bunch of sites:
Direct links into the pages that have the URL submission forms on a bunch of search engines.
That's it!
I got all these search engines off the Search Engines Category [dmoz.org] at the Open Directory Project [dmoz.org]. If you know of any pages that list a bunch of other search engines (there are many smaller ones, and a lot of special purpose ones) then drop me a line at crawford@goingware.com [mailto].
In my index I provide brief notes about some of the engines, including mentioning whether they refuse to accept submissions without payment. I don't provide links to submission forms for the engines that won't list a site for free, and I'd like to ask you not to support the trend towards paid index and spider placement.
You should understand that the vast majority of visitors to your sites don't get there through search engines, they get there because other people like your page and give you a link. The main value of search engines is to "prime the pump" so a few people start finding your site and then know to create a link for it.
Create successful web sites by writing good web sites - see Some Web Application Design Basics [sunsite.dk] for links to a few good pages written by experts that will start you well on the road to an appealing, successful website.
Thank you for your attention.
Mike [goingware.com]
Problems with DMOZ (Score:2)
Also, I'm not impressed with ODP's handling of new applicants [dmoz.org]. I applied once last year and received NO reply, not even a rejection letter. I had applied to edit the category of "Personal Pages -- Surnames starting with U". It was to get my feet wet, learn how to be an editor, see how time consuming it might be before adding a more serious category. I mentioned that in my application.
I resubmitted it in February and successfully received . . . a rejection letter! They decided I have a personal stake in the category (note my last name) and might be biased. Oh no! We must prevent the potential for abuse of Web Pages about people named U* [dmoz.org]!
If I'm not allowed to edit for categories that I know something about and I'm interested in, then what exactly should I volunteer for, and why should I?
possible solution (Score:2)
My idea is to come up with a standard set of headers that provide directory/hierarchy information for search engines. This is much more useful than keywords, et al., because they allow for top-down directories such as Yahoo! and the Open Directory project. Sites like this could be automatically created simply by crawling the web and organizing sites according to a category specified in their header.
The problem with keywords is that it's easy to spam them. If you need more hits, just add "bestiality", "Natalie Portman", and "hot sluts" to your keywords. The keywords often have nothing to do with the actual site.
It would be much harder, however, to spam a directory structure, especially if most search engines limited the amount of directories a page could specify to, say, two or three.
The header would be easy to implement. It could be done very easily within the comment tags of existing HTML. The only problem is getting people to do it. It would work beautifully if Yahoo! or another large site were to give up on "hand-picked" sites and start letting people specify their own location on the structure. Then anyone who wanted their site to be locatable would specify a hierarchical subject category in their header.
Great idea. It'll never happen.
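As a sketch of the proposal above, a crawler could look for a category header hidden in an ordinary HTML comment and cap how many categories a page may claim. The `category:` convention and the cap of three are assumptions for illustration, not an existing standard:

```python
import re

# Hypothetical header convention: <!-- category: Computers/Internet/Searching -->
CATEGORY_RE = re.compile(r"<!--\s*category:\s*([\w /&-]+?)\s*-->", re.IGNORECASE)
MAX_CATEGORIES = 3   # cap per page to make category spamming harder

def page_categories(html):
    """Extract self-declared directory paths from an HTML page.

    A crawler could use these to slot pages into a Yahoo!-style tree
    automatically, instead of waiting months for a human editor.
    """
    paths = [m.strip() for m in CATEGORY_RE.findall(html)]
    return paths[:MAX_CATEGORIES]
```

Because the header lives in a comment, existing browsers ignore it completely; only a cooperating crawler would ever see it.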
Re:Of course it is. (Score:3)
It has a second, separate business re-selling articles from trade journals, professional publications, etc., for which you do pay... but less than you would pay to buy the same thing in dead-tree format from the publisher.
What confuses people is that, by default, the main engine will return hits on both the web and the special collection.
Re:Directories are not search engines (Score:2)
This would require a lot of human verification, for there are many possibilities for abuse. I could always report my competitors for false keywords, just to keep them out of the listings. And as soon as we get to more exotic topics, who can say if a keyword is relevant or not? And how relevant is relevant anyway - if a porn site does have many pictures of women getting out of girl-scout uniforms, is "girl-scout" a valid keyword?
There are simple ranking algorithms that weigh uncommon keywords more and take into consideration how many keywords the site claims to relate to. These might be more effective.
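A minimal sketch of that kind of weighting, assuming an inverse-document-frequency style rarity score divided by the number of keywords a site claims (both the formula and the corpus shape are illustrative):

```python
import math

def keyword_weights(site_keywords, corpus):
    """Weigh a site's claimed keywords: rare keywords count more,
    and claiming many keywords dilutes each one's contribution.

    `corpus` is a list of keyword sets, one per known site.
    """
    n_sites = len(corpus)
    weights = {}
    for kw in site_keywords:
        sites_claiming = sum(1 for kws in corpus if kw in kws)
        # IDF-style rarity: a keyword claimed by everyone is worth little
        rarity = math.log((1 + n_sites) / (1 + sites_claiming))
        # dilution penalty: each extra claimed keyword weakens the rest
        weights[kw] = rarity / len(site_keywords)
    return weights
```

A spammer who tacks on dozens of popular keywords gains almost nothing per keyword, while a site with a few rare, specific ones scores well.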
If you want "information," ask for it. (Score:2)
Most searches for herbal medicines (e.g. "5-HTP") turn up way more hits (especially the high-ranking ones) from companies trying to sell it to you than actual objective information about it.
Had you typed 5-htp information [google.com] into Google, you would see 5-htp information, with Harvard as result #2.
HTML; +the (Score:2)
"html" 188,000,000
But, as usual for Google, the first three results are highly relevant for at least one common sense of the search term. (The first is W3C's official HTML standards site.) I didn't realize how bad AltaVista sucked until I tried it after using Google for a year.
does anyone find anything better than "and"???
+a comes close. It seems they're blocking searches for +the.
Sites you may have missed (Score:2)
Yep, all that content, and yet when there's a slow day at work I can still run out of interesting stuff to look at on the internet.
little gamers [gamespy.com], penny arcade [penny-arcade.com], goats (not goatse) [goats.com], and badtech [badtech.com]: online comics. It'll take a while to browse the entire archive.
everything 2 [everything2.com]: nearly half a million writeups on topics from aardvark [everything2.com]s to zzyzx [everything2.com].
P2P advertising explained (Score:2)
Basically, applying the peer-to-peer revolution (buzzword alert) to advertising is the next thing.
I hope you're not talking about spamming Gnutella [slashdot.org].
Some companies are trying to combine the peer-to-peer aspect of traditional word of mouth with the web.
In this model, surfers are paid to recommend the sites to other surfers. Spedia [spedia.net] is a prime example, as was AllAdvantage until it went to a "sweepstakes" scheme. Other examples can be found in the many sites that use Recommend-It [recommend-it.com].
Hatten är din, hatten är din, habeetik, habeetik.
AltaVista hates Lynx (Score:3)
Of course, you'd need to use this technique with a search engine who takes dead link submissions. Eg., Altavista and its "Add or Remove a Page" link
AltaVista does not allow submissions [rose-hulman.edu] from visually impaired users or users of text-based web browsers such as Lynx, Links, or w3m. Its submission page [altavista.com] uses a GIF image (burn all GIFs [burnallgifs.org]) to display rotated text in various fonts. The user is supposed to read the text and enter it into a field below. But visually impaired users, users on text browsers, and users on browsers whose developers have been cease-and-desisted by Unisys [burnallgifs.org] never see the GIF and cannot contribute links to AltaVista.
Re:I think it can be good (Score:2)
But you're right in some ways, too. If you search for "children's toy company" or something (and temporarily ignoring the other 'toys' listed
Good points and bad points about both. I think the best would be a two-tier system - a pay-per-listing one for commercial stuff (Amazon, etc) and a free one with a reference-check system for information-search purposes. Maybe the pay-per-listing could subsidise the free one?
Grab.
Re:Progress between Yahoo! and Google (Score:2)
Re:Google (Score:2)
Sure this is a problem, but it's more an example of applying the wrong tool. Google was never intended for comprehensively finding every scrap of information about a particular topic; it was designed to find the few most relevant and interesting sites discussing a particular topic. Using a general purpose tool for a highly specific task is a wonderful way of getting frustrated but not an efficient approach to solving your problems.
In fact, there are specialized search engines for dealing with specific topics. There are engines specifically for looking for images, ones for looking at specialized topics, and so on. There are also specialized, classified catalogues of information of exactly the kind you suggest are needed out there for people who need to know about them. If, for instance, I want to learn about a specific topic in biology, I might very well start out by looking at PubMed [nih.gov], a special purpose index of biological research articles. You just have to know where to look for the special purpose tools.
Re:Yes (Score:3)
Except that this isn't true. If I look up, say, Ronald Reagan [google.com], none of the top 5 hits are big commercial sites. They include the White House pages on former presidents, a fan page, the Reagan Presidential Foundation, the Reagan Library, and the Official Reagan Web Site. If I look up Linux Kernel [google.com], the #1 site is the Kernel Archives page. Maybe you're looking for data where there just aren't many interesting independent web sites out there, which is not something that can be cured with a better search engine.
guide to guides (Score:2)
The content on mini-portals is a million times better than Yahoo's old haphazard system. I gave up submitting non-commercial links to Yahoo because you wait months before being sure they didn't list you, then resubmit and wait months, then resubmit... etc.
Gnutella (Score:3)
--------------------
Re:What about the Yellow Pages? (Score:2)
Re:Directories are not search engines (Score:2)
Re:Another way (Score:2)
Re:Google (Score:2)
The most difficult thing on the internet, to my belief, is to find the very specialized article that you are looking for. The problem is that it may not even exist. Finding the same very specialized article in a huge library full of journals is even more complicated. So what? Next article.
Re:Directories are not search engines (Score:2)
This would make searches SO much more accurate. It would just take someone with the balls to say, "You are abusing the keywords, so now nobody will ever get to your site from our search engine."
Web directories could be automated. (Score:3)
I have a suggestion to anyone who is thinking of implementing a better directory. First, define the categories, and allow any site to submit their site to their categories. Then, introduce moderation to the mix. Allow users of your directory to rank sites in terms of suitability to the category. Allow them to create red flags for people submitting porn to health->teens->sexuality, and so forth. Let the users do the work!
I think moderation works well for sites like slashdot, why not a moderated web directory?
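A sketch of how such moderation might feed the directory listing, assuming each entry carries suitability votes and a red-flag count (the flag threshold is an arbitrary choice for illustration):

```python
def directory_rank(entries):
    """Rank a moderated directory category.

    Users rate how well a site fits its category (votes), and can red-flag
    abuse (e.g. porn submitted to health->teens->sexuality). Heavily
    flagged sites drop out; the rest are listed best fit first.

    Each entry is (site, suitability_votes, red_flags).
    """
    FLAG_LIMIT = 3   # illustrative cutoff before a site is pulled
    kept = [(site, sum(votes) / len(votes))
            for site, votes, flags in entries
            if flags < FLAG_LIMIT and votes]
    return [site for site, score in sorted(kept, key=lambda x: -x[1])]
```

As with Slashdot moderation, the editors never see most of the traffic; the users of the category do the filtering themselves.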
Pay For Placement Engines (Score:2)
Whilst Google is clearly the best for non-commercial searches, GoTo is apparently the best for commercial searches (if you want a service someone will make money from supplying).
It nicely gets around the problem of manual classification by effectively using market forces to make advertisers classify themselves correctly (or pay for referrals that make them no money).
Let's say I have a hotel in San Francisco, but bid on the general term HOTEL ($1.03). Now I will presumably only get some custom if the searcher was looking for a hotel in SF - otherwise I just paid GoTo $1.03 for a useless referral. Better to list myself under HOTEL SAN FRANCISCO; even though this costs more ($1.71), I will have a much higher conversion ratio.
Of course, if I am a US Hotel Chain or Broker, then maybe I would bid on the general Hotel keyword.
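The arithmetic behind that choice is easy to check. The bids are the ones quoted above; the conversion rates are pure assumptions for illustration:

```python
def cost_per_booking(bid, conversion_rate):
    """Average referral cost per actual booking, given a per-click bid
    and a conversion rate (bookings per paid click)."""
    return bid / conversion_rate

# Bids quoted in the comment; conversion rates are illustrative guesses.
general = cost_per_booking(1.03, 0.01)    # "hotel": say 1 in 100 clicks books
targeted = cost_per_booking(1.71, 0.05)   # "hotel san francisco": say 1 in 20
assert targeted < general  # higher bid, but far cheaper per booking
```

Under these assumed rates, the general term costs $103 per booking versus $34.20 for the targeted one, which is why market forces push advertisers toward accurate self-classification.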
End of self serving Sales Pitch :) Personally I'd like to see us create a GoTogle (TM) :-) that combines the best of both approaches.
Winton
Re:Directories are not search engines (Score:4)
Well, what you're describing sounds a lot like META KEYWORD tags.
Having been an Open Directory editor in the past, I don't really think the problem is finding the right pages. Actually the biggest problem is just that a lot of editors aren't active, and it's hard to know who's active, because they're listed as editors even if they haven't logged in or checked submissions for a year. This creates problems for editors who have to cooperate with other editors, and may also give outsiders the impression that Open Directory is overwhelmed in general, when really it's just that the editor they submitted to is AWOL.
Yahoo is doomed to failure because they don't have enough people working for them. Open Directory works just fine, because they have orders of magnitude more eyeballs working in parallel. No, Open Directory doesn't list every page on the web, and that's just fine with me as a user -- it's more useful because it's selective.
The Assayer [theassayer.org] - free-information book reviews
Re:Google (Score:2)
Known item searching is dead easy using any search engine, so long as the item is in the database. It's also easy to find something about anything, so people who just want some information without being overly concerned with how accurate or complete that information is can also easily find something to keep them happy.
Serious research, on the other hand, requires a more quality-conscious search. A researcher will want all of the most relevant information about a topic, and Web search engines do not provide this very well at all. Weighted keyword searching is no substitute for professionally catalogued and classified documents in cases like this. In some cases, researchers will want an exhaustive search: everything relevant about the topic. For example, a Ph.D. candidate would almost certainly begin their thesis by locating everything academic published in their field of study. This is downright impossible with Web search engines: even if their databases were complete, relevancy is so bad that you would probably have to wade through thousands upon thousands of hits to find a hundred or so truly relevant sites. This is especially true of any subject that is susceptible to search engine spamming.
Re:Google (Score:2)
Re:Google (Score:2)
Re:The power of "Word of Mouth" (Score:2)
I can't be karma whoring - I've already hit 50!
Re:Directories are not search engines (Score:2)
Open Directory works rather well, IMHO, as a directory because the editors have a strong sense of ownership and are given small enough chunks to do that the work is very manageable at the individual level (and they can do it in their spare time easily). But the human element is always going to be a potential issue with any directory. A problem you just don't have with Google.
Re:Directories are not search engines (Score:2)
Re:Directories are not search engines (Score:3)
Having actually tried to implement a DDC based web directory once, I am familiar with the problem that many pages would possibly fall under many categories. This is a problem with any directory-based approach, especially if you list a page in one category and then the page changes enough so that the category no longer applies.
In your example, I would hope it would not be too much trouble for you to put a different class number into the pages that make up each logical section of your site. Or if the site is small enough, it would likely fall under something like "personal web pages", which may have a number of subclasses itself, and then you'd choose the one you felt appropriate.
Again, this is a common issue among all directories, where do you put stuff? Do you allow multiple listings/classes per site/page? You still end up having to include some sort of keyword or text-based search so that users are not forced to browse the directory structure, guessing at the classification they are looking for or where it lies in the hierarchy. Text searches also allow for the possibility of searching based on content rather than metadata.
Most of this is a non-issue, given that Google seems to have rather successfully implemented a non-directory type of engine-- succeeding where Altavista was simply unwieldy. At least that's my impression. I usually find what I want with Google.
Re:possible solution (Score:3)
Directories are not search engines (Score:5)
Yahoo and the like are doomed to failure until someone implements something like the Dewey Decimal System for web pages and then convinces a large number of webmasters to classify their pages with it. That way a machine can do the heavy lifting, and only the person designing the page needs to make sure it is classified correctly.
Obviously this is fraught with problems similar to those of keyword spamming, but it's either that or build something like DMOZ on a decentralized basis, so that any individual maintainer builds a set of links tailored to his/her interests and either uploads them to a central server or provides them as an XML document for an engine to work with.
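A sketch of what one maintainer's decentralized link set might look like as XML, and how an engine could ingest it. The element names and `class` attribute are invented for illustration, not any real standard:

```python
import xml.etree.ElementTree as ET

def parse_link_collection(xml_text):
    """Parse one maintainer's classified link set.

    Hypothetical shape:
        <links maintainer="...">
          <link url="..." class="Computers/Internet">title</link>
        </links>
    An engine would merge many such documents into one directory.
    """
    root = ET.fromstring(xml_text)
    return [(link.get("url"), link.get("class"), (link.text or "").strip())
            for link in root.iter("link")]
```

Each maintainer only curates the corner of the web they care about, and the merge step is where a central engine (or a peer network) earns its keep.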
Google's crawlers are part of the problem (Score:2)
and thousands of clippings. The indexing started in 1983.
For any name, you can get all the other names that share
pages with that name throughout the entire database. In
other words, each name search produces a page that contains
anywhere from several to several hundred additional names
-- all pre-linked directly to their own searches, which do
the same thing. You get the idea.
It's a bot's worst nightmare. But if you are Google, with
lots of crawlers to sic on the task, it quickly can become
my nightmare instead of Google's. Indeed, Google doesn't
seem to care much.
Last October I noticed that Google was inclined to stumble into our cgi-bin on rare occasions, and actually do a decent job of delivering referrals to the name data that it got from us. I lifted the robots.txt exclusion to see what would happen. No other bots have delivered referrals as consistently as Google, so I can only assume that Google is the only bot that's serious about going after the dynamic web.
Either that, or their algorithms do a much better job on our names, which are all listed surname-first throughout our site. If you search for a name in the news as Firstname Lastname without quotes, Google will put our Lastname, Firstname high on the list for two reasons: our name is part of the anchor text, and they give link data more weight; and the two words are close to each other, which adds to the score (even though they are backwards).
Google has come by once a month ever since I lifted the robots.txt. Each time they spend about 10 days solid, 24/7, with three to five crawlers, chasing all the name searches. The rate from all the crawlers together for those 10 days varies from about two name searches per second to several per minute.
It's very erratic during that time; the crawlers don't talk to each other, and there's no detectable pattern that they're following. They don't manage to get through the entire database of 115,000 names by any means. There is an incredible amount of waste and duplication.
I had to install a load-sensitive thermostat so that when our server hits a certain load threshold and it's Google calling, it starts delivering "Server too busy" responses instead of the search that was requested. That seems to work pretty well, but they get all those "Server too busy" messages stored in their cache copy for that name.
To put it bluntly, their bots are dumber than toast, and if you don't watch them, they can turn your server into toast.
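The "thermostat" described above amounts to a simple request filter. A minimal sketch (the threshold value and the user-agent check are assumptions, not the poster's actual code):

```python
import os

LOAD_THRESHOLD = 4.0  # hypothetical cutoff; tune per server

def should_throttle(user_agent, load_1min=None):
    """Return True when a known crawler hits us while system load is high.
    load_1min defaults to the live 1-minute load average; a handler would
    then answer 503 "Server too busy" instead of running the search."""
    if load_1min is None:
        load_1min = os.getloadavg()[0]
    is_crawler = "Googlebot" in user_agent
    return is_crawler and load_1min >= LOAD_THRESHOLD

print(should_throttle("Googlebot/2.1", load_1min=6.2))  # True -> send 503
print(should_throttle("Mozilla/4.0", load_1min=6.2))    # False -> serve normally
```

Sending a proper 503 status (rather than a 200 with "Server too busy" in the body) would also keep those error pages out of the crawler's cached copy.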
Last November I wrote to Larry Page and offered to send him the damn database on CD-ROMs, in discrete HTML files using any specification he cared to define, so that his crawlers wouldn't have to load down our servers once per month. Mr. Page never responded. The letter was e-mailed, faxed, and snail-mailed. Someone from google.com did a Larry Page search shortly after I faxed it, so I'm pretty sure they read the thing. I offered these CD-ROMs for free, and I didn't ask for any changes in PageRank or any other considerations. It would simply mean that I can get my names onto Google efficiently and comprehensively, without enduring that 10-day orgy once a month.
My point is that there is no real effort at Google to make any sort of accommodation on a case-by-case basis with the so-called "deep web." Until that happens, sites such as mine have difficulty in allowing Google's crawlers to run amuck once per month. We have other customers to consider.
searching is always hard (Score:2)
Although technologies such as frames, ASP and JSP, ColdFusion, or Flash may make it harder to design a crawler-friendly web page, such pages need not be crawler-hostile. As the article points out, the issue is how the site handles requests that contain no parameters. The incompetent designer will treat such a request as an error. The more thoughtful designer will display a useful page with appropriate meta tags.
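That parameterless-request point is easy to illustrate. A rough sketch (the page content and parameter names are made up):

```python
def handle_request(query_params):
    """Sketch: a dynamic page that degrades gracefully for crawlers.
    A request with no parameters (what a crawler sends) gets a useful,
    indexable overview page with meta tags instead of an error."""
    if not query_params:
        return ("200 OK",
                '<meta name="description" content="Site overview">'
                "<h1>Overview</h1><p>Browse the full catalog here.</p>")
    # Normal interactive path: render the requested record.
    return ("200 OK", f"<h1>Record {query_params['id']}</h1>")

status, body = handle_request({})   # the crawler's parameterless request
print("meta" in body)  # True: something useful to index, not an error page
```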
The second issue is intellectual property and the true number of pages on the web. Suppose we create a site on the history of widgets. This site contains 10 base pages backed by a database of 100,000 widgets. Is the true size of the site 10 pages or 1 million? I would say it is 10 pages, and that indexing them covers only 0.001% of the possible pages a complete index would contain. The problem is how to make these 10 pages representative of the site. It may be reasonable that a search for '1145 crusade keepsake widget' might fail, but our design should allow the more general search 'history widgets' to succeed.
Anyone who has done library research in the pre-computer age knows that it takes skill and determination to find citations. The fact that we have replaced 1 million tiny cards and 1 thousand volumes of indexes with an online database does not mean that search and design skills are no longer necessary. Unfortunately, we cannot assume that users will have the proper search skills, so we, as designers, must learn better design skills.
No one expected Yahoo to scale infinitely (Score:4)
The only "problem" is that the Internet is simply too large for one engine to index. People go to Google expecting to search every web document that's online, a labor comparable to going to your local library and expecting their database to tell you about every book in existence on a particular topic or by a particular author. Even the Library of Congress [loc.gov] isn't that comprehensive.
I disagree with the article's claim that "much of the most interesting and valuable content [on the Web] remains hard to find." I think that the most interesting and valuable content is easy to find, provided that you start looking in the right place. Which means that if I want information on the latest US school shootings, I don't go to Yahoo or Google and search for "school shootings", I go to those sites and search for major news sources (BBC, CNN, Reuters, etc.) and use their up-to-the-minute search engines.
The role of search engines isn't "shrinking" by a long shot; it's just becoming less comprehensive. Searching on the Web is now a two-step process instead of a one-step process, and you have to apply a little more intelligence than you could back in 1995. If high school students researching their latest humanities paper have a problem with that, well, they should ask us twentysomethings what it was like to have to use card catalogs and microfiche for our own high school projects.
Not unsearchable yet (Score:2)
Google consistently returns good information on every search I make. A fairly superficial, PR-ish overview of their technology is here [google.com]. The gist of it is that, among other things, the number of links TO a page is considered part of the criteria for ranking. (The theory is that an important or well-established page will have many links to it.)
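The link-counting idea can be sketched as a small power iteration in the spirit of PageRank (toy graph; the damping factor 0.85 is the value Brin and Page's paper uses, everything else here is illustrative):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns a score per page;
    pages with more (and better-ranked) inbound links score higher."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# "b" has two inbound links, "a" and "c" have one or none:
r = pagerank({"a": ["b"], "b": ["c"], "c": ["b"]})
print(max(r, key=r.get))  # "b" ranks highest
```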
OTOH, human-edited directories like Yahoo and dmoz are going to have a really tough time as the web continues its exponential growth. I get so many dead links from these services that it's not worth the bother.
What about the Yellow Pages? (Score:2)
Google thriving on bloat... (Score:2)
Maybe search engines relying on older methods are having problems, but using Google, I honestly haven't had a problem locating material quickly at all. You just have to have the right approach in searching for things...
Like I said, most of this is common sense and redundant to most people who've searched for stuff before. But you'd be amazed how many people have no idea how to find the information they need, when you can get it in less than ten seconds, including the time needed to plan the search and type in the query. I try to use this sort of list when telling people how to find info, sort of like teaching a person to fish so they can feed themselves for a lifetime.
Re:Google (Score:4)
Trolls throughout history:
Fluff (Score:2)
I was always very happy with Internet searching, so I was surprised to see an article talking about some big Internet content crisis. I see their point about the 'surface' and the 'deep' web, but these are the same terms used in BrightPlanet's whitepaper [completeplanet.com] on the subject. Since it's pretty obvious that BrightPlanet invented the terms, the entire article comes into question: why didn't they draw a distinction between the company whitepaper's thoughts and actual facts?
And in the fourth paragraph:
An unsubstantiated 550 billion pages, or about 100 pages for every living human being? I'm no expert, but that's ridiculous.
They quoted the Google people saying how hard it is to search for anything besides text, and then passed along some BrightPlanet PR. It sounds like someone's meeting a quota at Reuters: more of that fantastic deep content we should all pay for.
Yes (Score:2)
Google IS indexing dynamic pages (Score:2)
Not so fast.
While that is true of older, cr@ppier search engines like AltaVista and Inktomi, Google can and does index dynamic pages. (Indeed, more than 60 percent of new users to one of my sites come in via dynamically generated .cfm detail pages that have been indexed on Google.)
It seems to me that if you want your content to be indexed, getting on Google (and by extension, Yahoo, since Yahoo uses Google results in addition to its directory), is pretty darn easy. I have to say, I'm not nearly as frustrated with search engines as I was in the days B.G. (Before Google)
Unsearchable? (Score:2)
Another way (Score:4)
The solution to indexing the web completely, or much more completely, has to lie in another methodology. How about a distributed solution? Google@home? distributedYahoo!.net? Honestly...there are ways to tackle the problems, and the reason why this entire system exists is because people refused to just shake their heads and say, "Nope, can't do it...sorry!"
How about a button in browsers that enables you to mark a page as a dead link? Just hit that button and a centralized system gets a reference to the URL currently in your browser. That centralized system is funded by all search engines and all search engines draw from it. Yes, I know..."What if a user falsely claims a site to be dead?" Well, what if it took 100 different IPs claiming it to be dead before it really was considered dead? If you don't get many people hitting the site from a search engine in the first place, then you probably aren't serving it up to too many people.
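The 100-distinct-IPs threshold proposed above is a few lines of bookkeeping on the centralized system's side. A minimal sketch (the threshold and names are taken from, or invented for, the proposal, not any real service):

```python
DEAD_THRESHOLD = 100  # distinct IPs required before a link counts as dead

reports = {}  # url -> set of reporting IPs

def report_dead(url, ip):
    """Record one dead-link report; duplicate reports from the same IP
    are ignored. Return True once enough distinct IPs agree."""
    reports.setdefault(url, set()).add(ip)
    return len(reports[url]) >= DEAD_THRESHOLD

# 99 distinct reporters are not enough; the 100th tips it over:
for i in range(99):
    assert not report_dead("http://example.com/gone", f"10.0.0.{i}")
print(report_dead("http://example.com/gone", "10.0.1.1"))  # True
```

Using a set per URL means one malicious user re-reporting from the same address gets counted only once, which is the whole point of the IP threshold.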
How about a system for pre-indexing an entire site, such that the person who runs it can have a single document at the root of their domain with the index results? A standard could be developed that would even go so far as to map out the existing sub-sites (for AOL personal sites, for example) so that the engine could go to each one for the index documents.
I guess that what I mean to say here is that the problem is largely based around the hugeness of the web, and how brute force is no longer enough. But that's not really that big a problem...all that's needed is a bit of creativity.
The Holy Bible is MY search engine of choice. (Score:2)
searching (Score:2)
It's not a trend, it's companies attempting to keep afloat in what's becoming a bull market. It's amazing to see how companies like Google stay in business when they show so few methods of collecting any kind of revenue. E.g., what is Google's only means of obtaining revenue? Charging a company for a copy of its search engine? Why would a company pay for a search engine when the market is overflooded with them?
Ad based revenue, we all know where those click me businesses are going.
We also know most of the "web rings" never went anywhere, but if a search company thinks people would pay to find something on the net, they'd be shit out of luck. Maybe corporations would pay, but I'd just make my own search engine (freely distributed), post it somewhere, and let the whole "submit your site for free" revolution take place again.
Privacy Info [antioffline.com]
Deep web content and other searching problems. (Score:2)
If the sites themselves are complaining that no one can find their content, aren't there ways to help that? Run a query on their database to generate a list of possible content pages and then provide that list to the search engines. The search engines could then provide a link (found via a content search) that would put the user on the page where they enter the form (or whatever) information to generate the page needed. I'm not familiar with XML, but knowing that it has some features to aid in content grouping, could it be what these sites need to be recoded in?
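Generating that list from the database is straightforward. A rough sketch, assuming a hypothetical site whose records are reachable by an `id` query parameter:

```python
from urllib.parse import urlencode

def build_url_list(base_url, records):
    """Turn database rows into directly crawlable entry-point URLs
    that could be handed to a search engine. Names are illustrative."""
    return [f"{base_url}?{urlencode({'id': r['id']})}" for r in records]

rows = [{"id": 17}, {"id": 42}]  # stand-ins for real database rows
for url in build_url_list("http://example.com/widget", rows):
    print(url)
# http://example.com/widget?id=17
# http://example.com/widget?id=42
```

The same list could just as easily be emitted as an XML document if the engines standardized on a format for it.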
Obviously, if the sites themselves don't want this deep content easily viewed except by deep clicking through the whole site, or through some pay-per-view system, that is their choice. I feel that they are limiting themselves, however. If they think their content is robust enough to be useful to users, they should strive to make that content as widely available as possible.
Should proprietary websites even be considered 'Internet web content'? Those seem to me to be 'intranet content,' which most often should not be seen by the general public (i.e., internal company policies needed only by employees of company X). For that information to be set free you would need either a very savvy person to break in from the outside or a traitor on the inside. If it's only certain product listings that the company doesn't want available to the public, well, that is too bad for them; I'll just get a quote elsewhere and pay someone else my money.
"evidence of a widening gap between the deep Web and the freely-accessible 'surface Web,' which could become a clutter of recreational and amateur-oriented content -- the online equivalent of public cable access television or self-published novels." Funny, ever since the late eighties, I've always seen the whole web like this. It's more like the big corporations tried to muscle in on the public cable channel and realized they might be better off on their own channel.
Not your normal AC.
Dynamic sites (Score:2)
Much fuss is made about the search engines needing to "fix the problem" of not being able to index sites like microsoft.com because the pages are dynamically generated. Is this really a problem?
Microsoft (or whatever other dynamic site you wish to pick) chose to make their content unindexable. Don't try to make it someone else's problem. Let people who use the search engines find third-party information instead. If the site designers wanted their site in the search engines, it would be there. Many of the sites built with ColdFusion or ASP contain basically static information anyway, and making them dynamic just reduces your traffic.
Sites like Slashdot are dynamic. A search engine can't be expected to keep up with something that changes every 30 seconds. However, making all of the archives static HTML allows them to be searchable by the engines and takes some load off the server, to boot.
I went for a "best of both worlds" approach on my personal site [robson.org] by writing a perl site generator. Each time I update the site, I re-run the site generator, which takes about a minute. My server carries a lighter load, but I still have "dynamic" links to related articles and such that the site generator builds.
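That "best of both worlds" approach can be sketched in a few lines. This is not the poster's Perl generator, just an illustrative equivalent: regenerate static HTML files from structured data, with the "dynamic"-looking related-article links baked in at build time:

```python
import os
import tempfile

def generate_site(articles, out_dir):
    """Write one static HTML file per article. Crawlers see plain
    pages; related-article links are computed once, at build time."""
    for slug, info in articles.items():
        related = "".join(
            f'<a href="{r}.html">{articles[r]["title"]}</a> '
            for r in info.get("related", []))
        html = f"<h1>{info['title']}</h1>{info['body']}<p>Related: {related}</p>"
        with open(os.path.join(out_dir, f"{slug}.html"), "w") as f:
            f.write(html)

out = tempfile.mkdtemp()
generate_site({
    "search": {"title": "On Search", "body": "...", "related": ["crawlers"]},
    "crawlers": {"title": "On Crawlers", "body": "..."},
}, out)
print(sorted(os.listdir(out)))  # ['crawlers.html', 'search.html']
```

Re-running the generator after each update keeps the pages current while the server only ever has to hand out static files.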
The Sky Is Falling, The Sky Is Falling! (Score:5)
Re:Of course it is. (Score:2)
Very cool and clever idea. Now small businesses can promote their sites without having to invest mega-$$$$ for the traditional "banner ad".
And Freenet? (Score:2)
Victim of its success (Score:3)
Even Slashdot is too big. How the hell are you supposed to follow a conversation this big?
especially with the goatsex.
I'm gonna start mailing postcards.
Excelsior,
ME
No, it's just a big opportunity... (Score:3)
Magnitude of Problem (Score:2)
I'm one of the authors of Sparkseek [sparkseek.com], a remotely-hosted search service. I'm also a student at Pennsylvania State University. I want to give you an idea of what kind of problems researchers in the field of internet text retrieval have to deal with.
Larry Page, one of the co-developers of the Google search engine said in his 1997 research paper entitled "The Anatomy of a Large-Scale Hypertextual Web Search Engine" [scu.edu.au] that the primary benchmark for information retrieval, the Text Retrieval Conference, uses a fairly small, well controlled collection for their benchmarks. The largest benchmark they have available is only 20GB compared to the 147GB from Google's crawl of 24 million web pages. Today, Google has over 1.4 billion web pages in their database and a reported 4,000 node linux cluster [slashdot.org].
One of the problems I have encountered, and found difficult to deal with, is the sheer amount of redundancy in web content. Anybody who has ever tried a search for any Linux command has no doubt encountered hordes of duplicate man pages in their results.
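One common way to attack that redundancy is to fingerprint page content after normalization, so trivially reformatted mirrors of the same man page collapse into one result. A minimal sketch (not Sparkseek's actual method; real engines use fancier near-duplicate detection such as shingling):

```python
import hashlib
import re

def fingerprint(text):
    """Normalize case and whitespace, then hash, so cosmetically
    different copies of the same document map to one fingerprint."""
    canonical = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha1(canonical.encode()).hexdigest()

def dedupe(results):
    """Keep only the first result per content fingerprint."""
    seen, unique = set(), []
    for url, text in results:
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique.append((url, text))
    return unique

hits = [("http://a.example/ls.1", "LS(1)   list directory contents"),
        ("http://b.example/man/ls", "ls(1) list  directory contents"),
        ("http://c.example/grep", "GREP(1) print matching lines")]
print(len(dedupe(hits)))  # 2: the two ls pages collapse into one
```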
Not only that, but I honestly don't believe that when it comes to search engines, more is better. I have noticed over the past 6 months, as Google has made great increases in its index sizes, that results have consistently become worse and worse. Search engines really need to begin narrowing the focus of their index and creating multiple indexes. Educational institutions should be separated from commercial establishments; if I'm performing research on some subject, the last thing I want is to arrive at a commercial establishment pitching some product.
Also, the method Google utilizes when creating its indexes creates a huge scalability problem. Its indexes are updated less frequently than ever, and if you read the document they published in '97, it's not hard to see why.
Michael Tanczos
recommendation instead of seeking (Score:2)
A totally new approach could be that you don't search, but interesting web resources get recommended to you by your personal agent. We are currently working on a peer-to-peer system that doesn't exchange files but exchanges recommendations for web sites.
It's much like a good friend suggesting that you look at an interesting web site. You can see all the marketing blurb at http://www.iowl.net/. At the moment this is a seminar paper by some people (including me) at the Wuerzburg University of Applied Sciences. We have a working prototype that will be released, hopefully, in about a month or so.