Is The Web Becoming Unsearchable?

wayne writes: "CNN is running a story on web search engines and their inability to keep up with the growth of the web. Web directories such as Yahoo! and the Open Directory Project can take months to add a site, and the queue of unreviewed sites is growing. Most search engines are even further behind and are filled with off-topic and dead pages. The trend is toward pay for listing. Will the free, searchable web fade away?" The article gets beyond the "Wowie, so much content, engines can't keep up" typical blather and addresses some of the reasons search engines have a hard time keeping up.

  • by Anonymous Coward
    The quality of the entire web has degraded so much that it's not the search engines that are full of useless garbage... it's the web itself, and the engines have simply indexed what exists "out there". Garbage In, Garbage Out -- still holds true after all these years.
  • You can say that again. I submitted a site last November and it @!##@% still isn't in the directory! What's the deal?

  • by Skyshadow ( 508 ) on Tuesday March 27, 2001 @01:03PM (#336253) Homepage
    Yep, all that content, and yet when there's a slow day at work I can still run out of interesting stuff to look at on the internet.

    Yowie.

  • by Trepidity ( 597 )
    Yes.

    Next question?
  • It's kind of like caller ID and all the other useless services.

    I wouldn't call all of those services useless... there's an interesting one by Verizon called Call Intercept. If the caller's number is unavailable or anonymous on Caller ID, they are sent to a message asking them to identify themselves. If they don't, they don't get through. Great for those "please stay on the line for an important message..." phone calls that telemarketers & bill collectors love :).

  • This [google.com] is probably something like what you're looking for, though the word "Aida" can't be found at all with the others.
  • by pod ( 1103 )
    I've found this to be the case as well. I don't know what the author of the parent post was looking for (porn perhaps?) but everything I've searched for so far has turned up plenty of free resources in the first few pages of hits. In fact, looking to buy something (looking for a supplier) is pretty tough, and sites like Yahoo are more useful in this area.
  • Interesting - I had never personally submitted a link to AltaVista to see how the whole process works. After seeing your description of the GIF mechanism, I tried it and see what you mean.

    Apparently this is an attempt at foiling script-based "ping it, and if it's down, submit it as dead" attacks on other people's entries.

    I think a more reasonable way of handling this would be to, e.g., check the site over 2 days in 12-hour increments (to allow for eBay's Sunday maintenance windows and such). If there is no positive response during that period, then drop the link.

    In any case, I was only using that mechanism as an example of a saner way than having 100 votes automatically mark a site as dead. I don't personally use AltaVista's search engine or endorse it; I was just noting how this mechanism could be linked to a browser button (which could work with AltaVista if they used my method instead of requiring a multi-step submission plus entering text from a GIF).

    Sounds like a good title for a trivial patent, even..

    Method of verifying URL availability for a database of URLs

  • by ninjaz ( 1202 ) on Tuesday March 27, 2001 @02:01PM (#336259)
    "What if a user falsely claims a site to be dead?" Well, what if it took 100 different IPs claiming it to be dead before it really was considered dead?
    Actually, it is trivial to maliciously get 100 IPs to claim a site is dead. All you need is a page that gets 100 hits a day and an IMG tag pointing at the dead-link reporting page with the target URL embedded in it. Whoever hits the page will unwittingly make a request, from their own IP, to mark your target dead. Or script kiddies could create botnets for the purpose of submitting dead links to get high-profile sites delinked, etc.

    The correct way to handle this situation is what the search engines already do: when a link is reported dead, they make a request to the link themselves. Only if it returns an HTTP 404 response code, or the site is down, is it actually marked dead.
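
    A minimal sketch of that verification step, assuming a small Python helper on the engine's side (the function name, and the policy of treating only a 404 or an unreachable host as suspect, are my own illustration rather than any particular engine's code):

        import urllib.request
        import urllib.error

        def verify_dead_link_report(url):
            """Re-check a user-reported dead link before trusting the report."""
            try:
                with urllib.request.urlopen(url, timeout=15):
                    return "alive"                 # any 2xx/3xx response: keep the link
            except urllib.error.HTTPError as e:
                # The server answered, so only a 404 counts as "gone".
                return "dead" if e.code == 404 else "alive"
            except (urllib.error.URLError, OSError):
                return "unreachable"               # host is down: recheck later before dropping

    A link that comes back "unreachable" could be kept and rechecked later, which fits nicely with the cached-page idea mentioned below.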

    I'm not convinced this is always a good idea, though - I've worked for a guy who would battle a few competitors for top positioning on the search engines. When either of them noticed that the other's site was down, they'd submit the other site as a dead link. I like Google's cached-page mechanism, which allows you to view sites that are currently unreachable. Great for when you need docs from a site which happens to be down at the time.

    How about a button in browsers that enables you to mark a page as a dead link?
    This is actually trivial to implement, as shown in Google's toolbar page: http://www.google.com/options/toolbar.html [google.com]

    Of course, you'd need to use this technique with a search engine that takes dead-link submissions, e.g. AltaVista and its "Add or Remove a Page" link here: http://web.altavista.com/cgi-bin/query?pg=addurl [altavista.com]

  • Thanks for the laugh...
  • ..remains WebFerret.

    (Well, not really, but it's damn good...)

    It's about the only Window$ app I use anymore.

    It's kinda gone downhill after the parent company was bought out by ZDNet, but it still works pretty well.

    It meta-searches about a dozen of the major search sites simultaneously.

    I use it a lot to search for the meaning of obscure error messages, error codes, and stuff like that.

    I used to use it a lot for searching out what cryptic .dll filenames were related to...

    t_t_b
    --
    I think not; therefore I ain't®

  • Google used to spider my sites almost twice a month, but it seems to have reduced its crawl frequency since it started indexing dynamic content. Another problem with the latter is that e.g. book pages at Amazon can appear multiple times in search results, as Google follows links from different associate programs.

    I've also been kicked from first to sixth on a search for "book reviews" [google.com] :-(.

    Danny.

  • Really? Google found it for me on about 2,510 pages.
  • The piece makes a good point about dynamic content generated by user input not being indexable, which is true.
    Search engines can't type things into forms and get results in an intelligent way.
    It's just a shame the piece gets a bit confused in how it expresses this.
    Nice piece generally, though. 550 billion web pages is an awful lot.

  • I can usually find what I'm looking for using either Google or AltaVista. The heuristics used by Google are the best I've seen in any search engine. I can usually find stuff that is anywhere from several days to several years old.

    Come on... these are the same people who were claiming that we would all run out of IPs by now. They don't seem to realize that everything adapts.
  • Umm... sure, it would be a great idea if it would work. But the whole proposal depends on the directory structure being harder to spam than keywords. I don't see any reason why it would be any harder to put "teens->education->health" in the directory structure for hardcoreteensex.com than it would be with current keyword-based schemes. I'd love to hear why you think this would be different from what already goes on... but I'm not holding my breath about being convinced.
  • Darn. We thought we could get that one past you...
  • ...on Google, at least: link early, link often. Link to your favourite sites on every page you make, which boosts them in the ratings.

    BTW, AFAIK Google doesn't change rankings for money; it adds those little side-links for money. I do hope they stop adding gingerbread now, lest the site end up as cluttered and useless as Deja did.

  • My friend is developing a system and tools for creating a p2p search network. This seems like one way to interconnect searches and information as it becomes more interspersed throughout the known universe. Have a look at Neurogrid [neurogrid.net]
  • I'd like to see some specialized search engines, nothing too complicated. What I've been wanting for some time is a search engine of just .edu.

    There are lots of really informative .edu sites out there, but they don't show up well on search engines, and many are buried levels deep, i.e. college.edu/~professor/fall2000/class/topic/lotsofinfo.html

    (btw, if anyone finds a .edu engine, PLEASE let me know!)

    ========= Put my nick in front of the "_". I love my computer
  • I hardly ever use search engines anymore. Most of the sites that I find are linked directly off pages that specialize in what I am looking for, and I find that the content is usually of a higher quality.

    Either that, or friends will send me links.

  • Well, I can't find anything when I search for addesses ..
  • It's amazing that there are 2,510 people who spell at least as badly as Hemos. :)
  • ...and look how well Gnutella scales.

    If you want 99.9% of Internet traffic to be nodes forwarding search requests and results back and forth, that's the way to go.

  • The unsearchable net is largely the result of ignorance on the part of search-bot writers and web-site creators.
    • Ignorant robots.txt usage. A site keeps all its text in a database. Bots start hitting it, and after a while the admin notices that they are churning too much CPU time. To fix the problem, the admin puts up a robots.txt that kills all bots, instead of creating a lightweight, robot-friendly area.
    • Greedy bots. Robot writers don't help much either, attacking servers in overloading bursts and causing mayhem on many sites.
    • Too limited a syntax in robots.txt. Why can't we ask bots to visit only at specific hours? Why can't we set a sensible hit rate? Of course, at first most bots would ignore such directives, but forcing the issue has to start somewhere (see the sketch after this list).
    • The unindexed. If there are no known references to someone's personal homepage or scientific article, how can the bots find them? A nightly generated site index could help, but it raises some privacy questions if done on a public server.
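
    Since the current robots.txt syntax can't express visiting hours or hit rates, that politeness has to live in the bot for now. Here is a minimal sketch of what a considerate crawler could do on its own; the site URL, bot name, delay, and quiet hours are all made-up values for illustration:

        import time
        import urllib.request
        import urllib.robotparser

        SITE = "http://www.example.com"      # hypothetical site being crawled
        USER_AGENT = "ExampleBot"            # hypothetical bot name
        DELAY_SECONDS = 30                   # self-imposed hit rate
        QUIET_HOURS = range(2, 6)            # only crawl between 02:00 and 06:00 local time

        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(SITE + "/robots.txt")
        robots.read()

        def polite_fetch(path):
            """Fetch a page only if robots.txt allows it and we're inside the quiet hours."""
            if time.localtime().tm_hour not in QUIET_HOURS:
                return None                  # respect the (self-imposed) visiting hours
            if not robots.can_fetch(USER_AGENT, SITE + path):
                return None                  # robots.txt says keep out
            time.sleep(DELAY_SECONDS)        # throttle instead of hitting in bursts
            request = urllib.request.Request(SITE + path, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(request) as response:
                return response.read()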
  • Seven of the first ten have nothing to do with auto repair. Two of them are iffy at best. One right and two half-right out of ten answers is still an "F" in my grading.
  • Try to find, using the Google Directory, pictures of Yellowstone National Park, taken by me. No fair using the search function. (However searching the directory for "pictures of yellowstone and scott purl" will result in two misses, and nothing else.)

    Yes, it's "Vanity Web Surfing", but if Google indexes my site, why doesn't it automatically categorize it? (whine whine)

    So, yes, Google is pretty derned good. But it's still not a directory, and the directory it does have covers, what, 1% of the web? 0.01%?
  • by scotpurl ( 28825 ) on Tuesday March 27, 2001 @01:19PM (#336292)
    What the web REALLY needs is a directory. An honest-to-goodness, telephone/yellow pages style directory. This whole nonsense about keyword searching is providing people who just want traffic with a lot of free advertising and listings.

    The phone company provides you with one free listing (unlisted is optional), and makes you pay for each extra category (like in the Yellow Pages -- and if you're not from the U.S., please see http://www.bigyellow.com/supertopics for an example) that you want something listed in. Search engines ought to be replaced with something similar.

    Yes, I know Yahoo and Dmoz try, but they don't go out and actively index sites, making their use limited, and the number of sites even more limited. If Google were to create a Yahoo/Dmoz style directory, that would help. Better yet, if people were forced to provide either META tags, or some information when they acquired their domain (part of whois?)....

    For example, where can I get my oil changed in Paris, France?
  • by dublin ( 31215 ) on Tuesday March 27, 2001 @10:02PM (#336295) Homepage
    This is a real problem, but the fundamental reason it's a problem is one that's well-understood by library scientists: We only have addresses, not content identifiers.

    To use a book analogy, the entire web is built on Dewey Decimal addresses (URLs), when what we need is those combined with ISBN numbers (URNs).

    I didn't make up the idea of URNs - the concept was first described to me by Peter Deutsch, the inventor of Archie, at Interop sometime in the early 90's, shortly after the web got going. (Back when there were no search engines, and we found out about new web sites by visiting NCSA's What's New page, which for a while, anyway, actually cataloged *every* new web site that appeared, and some of us could claim to have surfed the entire web...)

    The idea behind URNs is that they would be a unique identifier for the content. The same content living on different sites would have several URLs, but only a single URN. This is still needed today, but the problems that kept it from being implemented then are even more intractable today: Who hands out URNs? (IANA didn't want to touch that!) How do you handle versioning? What about dynamic content? Who are the librarians?

    We still desperately need something that fills this need, but it's not likely we'll get it. One last parting thought - in discussing this with Deutsch, he pointed out that these are new problems to us, but the library scientists solved them quite some time ago: It is only the typical CS insistence on reinventing everything and dismissing the knowledge of those in other fields that makes the process so incredibly painful... Hubris strikes again.
  • Check out Dizz-net [dizz.net]. It's basically an article spawned by a conversation on Slashdot over a year ago that moved to a mailing list.

    We had some cool ideas, but the infrastructure for such a thing would be huge. I have a bunch of interesting messages from the mailing list describing some pretty cool stuff, like having nodes only search for stuff that's near them, network-wise, to lessen the load at critical points. There was also some talk about moderation ("Click here if this link is not relevant to your search") and heuristics to stop common abuses (spider-bait).

    It never happened, because it's pretty heavy stuff to implement properly.

    I'm sure some patent-squatter has a patent on it already, with the full intention of letting someone else do the hard work. :-)*
  • Are libraries becoming useless?

    Posted by Hemos on 03:53 PM March 27th, 2001
    from the we-talk-and-talk-about-same-crap dept.
    segmond writes: "CNN is running a story on libraries around the world and their inability to keep up with the growth of the number of books published. Even the libraries of the biggest institutions, such as Harvard, Yale and MIT, can take months to add a book to their collection, and the queue of unreviewed books is growing. Most libraries are even further behind and are filled with off-topic and old assembly books about VAX and Z80 programming. The trend is toward paying to list your book. Will the free, searchable library fade away?" The article gets beyond the "Wowie, so much content, libraries can't keep up" typical blather and addresses some of the reasons libraries have a hard time keeping up.

  • by cr0sh ( 43134 ) on Tuesday March 27, 2001 @02:55PM (#336301) Homepage
    Look up information on the "Invisible Web" - islands typically untouched by search engines, where you need another site to "hop" to these nets of information - cool stuff can abound in these disconnected areas. Here are some links to get started with:

    DirectSearch - Invisible Web Search [gwu.edu]

    The InvisibleWeb [invisibleweb.com]

    WebData.com - Invisible Web Search [webdata.com]

    InfoMine - Scholarly Internet Resource Collections [ucr.edu]

    AlphaSearch - Invisible Web Search [calvin.edu]

    IIRC, Slashdot even ran an article about this not too long ago - I think this [slashdot.org] is it, not sure...

    Worldcom [worldcom.com] - Generation Duh!
  • by GoofyBoy ( 44399 ) on Tuesday March 27, 2001 @01:17PM (#336302) Journal
    You mean the META tag that already exists?
  • What do you expect? Pay-for-listing is the only way search engines will make money. Think about it: Would you use a search engine that charged a little but provided much better results (i.e. no dead links, no off-topic stuff)? I think NorthernLight.com does this.
  • The trend is toward pay for listing

    Is this really a big deal? Hasn't anyone used the yellow pages in a phone book before? People have to pay to be listed in those, and they're very useful for finding companies.

  • Frankly, this article doesn't depress me as much as the quality of google results impresses me. Whether it's 1% or 100% of the available space, I can very often find exactly what I'm looking for.

    Now maybe there are vast areas of the web unavailable to google searches because of language quirks or protective admins, but so what.

    They have as much a right to exist uncataloged as I do to have an unlisted phone number. If sites want to be indexed, they can register with a search engine. If they don't, and are unreachable, so be it. I don't see what the problem is.

  • by citizenc ( 60589 ) <caryNO@SPAMglidedesign.ca> on Tuesday March 27, 2001 @12:56PM (#336309) Journal
    I don't know WHAT they are talking about -- I can find ANYTHING that I look for on Google [google.com] -- even sites that I have just created a day or two ago have been found. These people just aren't using the right search engine, dammit! =)

    ------------
    CitizenC
  • by ZahrGnosis ( 66741 ) on Tuesday March 27, 2001 @01:24PM (#336310) Homepage
    The article skims over the fact that search engine technology is progressing fairly rapidly, and that some companies (Google) are creating new technologies that exploit the way the web works while Yahoo! and some others are relying on older technology for some things (like filtering pages by hand for their directory!).

    Google's approach is novel: make the web pages rank themselves. If more people link to your site, it's probably a better site. If few enough people link to it, it probably isn't; besides, it'll probably never be found anyway.
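
    For the curious, here is a toy version of that ranking idea, just to make the mechanism concrete; the link graph is invented, and Google's real algorithm (PageRank) is considerably more sophisticated than this sketch:

        # Pages "vote" for the pages they link to; votes from well-linked pages count more.
        links = {                       # hypothetical link graph: page -> pages it links to
            "a.com": ["b.com", "c.com"],
            "b.com": ["c.com"],
            "c.com": ["a.com"],
            "d.com": ["c.com"],
        }

        damping = 0.85                  # damping factor from the original PageRank paper
        rank = {page: 1.0 for page in links}

        for _ in range(50):             # iterate until the scores settle down
            new_rank = {page: 1 - damping for page in links}
            for page, outlinks in links.items():
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            rank = new_rank

        # c.com, with the most incoming links, ends up ranked highest.
        print(sorted(rank.items(), key=lambda item: -item[1]))

    The practical upshot is exactly what this comment says: the reliable way to climb the rankings is to get other people to link to you.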

    Web site creators have to do the legwork to get their sites recognized, and going to a general search engine to do it isn't the way. If someone makes a site and tells their friends about it, and their friends like it and link to it, it'll get picked up; that's the way of the web. (At least, it'll get picked up by crawlers like Google, and even ranked highly if enough people link to it).

    Search engine tech has yet to catch up with dynamic pages, but it's the fault of the content creators if they want their pages on search engines but can't code enough alt tags to make their stuff show up.

    In any case, the bulk of the web does work, and good pages get recognition. I've always eventually been able to find what I'm looking for on the web, no matter what the topic. Search engines have to grow like everything else, but so far they're the best thing going and getting better.
  • by decipher_saint ( 72686 ) on Tuesday March 27, 2001 @01:05PM (#336312)
    In the beginning all the best stuff was "word of mouth"... it still is ;-)

    This is how I found /. originally, many moons ago a fellow nerd clued me in.

    Did anyone out there get hooked up to /. through a Search Engine result?

  • I have been a PHP programmer for 2 years now, and I applied to review the PHP sites. They rejected me, citing an overabundance of PHP reviewers. Does this mean that they want people to review anything except what they know?
  • I too applied for that category more than a year ago, and despite the fact that I am both a PHP programmer, and worked for a Canadian Search Engine called Maplesquare [maplesquare.com] as the resident "Cybrarian" in charge of maintaining the database of links and descriptions, I was summarily refused.

  • If you need to find more relevant documents on specific subjects, I recommend using topic-specific search engines. I maintain one for all subjects relating to Paganism and Wicca on my Omphalos website [omphalos.net]. True, the site submissions have to be manually approved and this can lead to backlogs of site submissions, but since I spider all of the websites I have included in the directory (totalling over 140,000 webpages so far) the relevancy of any search results is raised by the lack of clutter from unrelated websites.

    Similarly, if you are searching for information on Space Exploration try Spaceref [spaceref.com] where I used to work. Again, the directory is manually generated, and the results are greatly improved overall.

    Nothing guarantees improved relevancy (for general purposes nothing beats Google in this respect), but using specialty search sites helps immensely in many cases.

  • If I were to set up a search engine:

    Every unique domain name found would get crawled for free. You paid for a domain name, you must care about your content.

    Every geocities-style cheap personal page would require a small fee to get crawled. Too much schlock; scan only the stuff people care about. You don't wanna pay your own fee? Ask a visitor to pay the fee. PayPal or something newer/better should do the trick.

    Every dynamic page like slashdot, everything2, or real estate listings, would have to have a more expensive agreement in place to get anything indexed. The buck stops at cgi. Waste no time on something that will probably be gone tomorrow.

    Commit on the resources it will take to prune and groom the stale dead stuff out of the index, regularly. Dead links are bad business.

  • Are you tired of all those annoying paid search engine placement services? Ever tried using the free ones, only to be annoyed with tons of ads and to find your URL submissions blocked by the robosubmission filters on the search sites?

    Well, I'm tired of them too, and I write pages that I submit to search engines from time to time, and I've come up with what I feel is the best way to submit links to a bunch of sites:

    Direct links into the pages that have the URL submission forms on a bunch of search engines.

    Keep a text window open with your URL, title, description, for-public-consumption email address and the like, and use "Open Page in New Window" on all these links to manually copy and paste your information into a bunch of search engine submission forms.

    That's it!

    I got all these search engines off the Search Engines Category [dmoz.org] at the Open Directory Project [dmoz.org]. If you know of any pages that list a bunch of other search engines (there are many smaller ones, and a lot of special purpose ones) then drop me a line at crawford@goingware.com [mailto].

    In my index I provide brief notes about some of the engines, including mentioning whether they refuse to accept submissions without payment. I don't provide links to submission forms for the engines that won't list a site for free, and I'd like to ask you not to support the trend towards paid index and spider placement.

    You should understand that the vast majority of visitors to your sites don't get there through search engines, they get there because other people like your page and give you a link. The main value of search engines is to "prime the pump" so a few people start finding your site and then know to create a link for it.

    Create successful web sites by writing good web sites - see Some Web Application Design Basics [sunsite.dk] for links to a few good pages written by experts that will start you well on the road to an appealing, successful website.

    Thank you for your attention.


    Mike [goingware.com]

  • the biggest problem is just that a lot of editors aren't active

    Also, I'm not impressed with ODP's handling of new applicants [dmoz.org]. I applied once last year and received NO reply, not even a rejection letter. I had applied to edit the category of "Personal Pages -- Surnames starting with U". It was to get my feet wet, learn how to be an editor, see how time consuming it might be before adding a more serious category. I mentioned that in my application.

    I resubmitted it in February and successfully received . . . a rejection letter! They decided I have a personal stake in the category (note my last name) and might be biased. Oh no! We must prevent the potential for abuse of Web Pages about people named U* [dmoz.org]!

    If I'm not allowed to edit for categories that I know something about and I'm interested in, then what exactly should I volunteer for, and why should I?

  • Actually, I've been working on a proposal for a possible solution to this mess. It will never be implemented, of course, because the web is based on tradition and archaic protocols, not on innovation, but I think it's nifty food for thought anyway.

    My idea is to come up with a standard set of headers that provide directory/hierarchy information for search engines. This is much more useful than keywords, et al., because they allow for top-down directories such as Yahoo! and the Open Directory project. Sites like this could be automatically created simply by crawling the web and organizing sites according to a category specified in their header.

    The problem with keywords is that it's easy to spam them. If you need more hits, just add "bestiality", "Natalie Portman", and "hot sluts" to your keywords. The keywords often have nothing to do with the actual site.

    It would be much harder, however, to spam a directory structure, especially if most search engines limited the amount of directories a page could specify to, say, two or three.

    The header would be easy to implement. It could be done very easily within the comment tags of existing HTML. The only problem is getting people to do it. It would work beautifully if Yahoo! or another large site were to give up on "hand-picked" sites and start letting people specify their own location on the structure. Then anyone who wanted their site to be locatable would specify a hierarchical subject category in their header.
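
    To make the idea concrete, here is a rough sketch of the crawler side; the <!-- category: ... --> comment convention and the three-category cap are entirely hypothetical, since no such standard exists:

        import re
        import urllib.request

        # Hypothetical convention: a page declares its own place in the hierarchy, e.g.
        #   <!-- category: Recreation/Outdoors/Camping -->
        CATEGORY_RE = re.compile(r"<!--\s*category:\s*([A-Za-z0-9 _/-]+?)\s*-->")

        def declared_categories(html, limit=3):
            """Return at most `limit` self-declared categories, to blunt directory spam."""
            return CATEGORY_RE.findall(html)[:limit]

        def categorize(url):
            """Fetch a page and report where it asks to be filed."""
            with urllib.request.urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
            return declared_categories(html)

    A directory could then be assembled automatically by grouping crawled URLs under the categories they declare, with the cap making wholesale spamming at least slightly harder.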

    Great idea. It'll never happen.

  • by eldurbarn ( 111734 ) on Tuesday March 27, 2001 @01:21PM (#336334)
    Actually, Northern Light does not charge to access its search engine, or to access its classification links of the web.

    It has a second, separate business re-selling articles from trade journals, professional publications, etc., for which you do pay... but less than you would pay to buy the same thing in dead-tree format from the publisher.

    What confuses people is that, by default, the main engine will return hits on both the web and the special collection.

    If Yahoo had an option where you could submit a site that you think had off-topic keywords [...] and they would completely remove all occurrences of an offending site from their database [...]

    This would require a lot of human verification, for there are many possibilities for abuse. I could always report my competitors for false keywords, just to keep them out of the listings. And as soon as we get to more exotic topics, who can say if a keyword is relevant or not? And how relevant is relevant anyway - if a porn site does have many pictures of women getting out of girl-scout uniforms, is "girl-scout" a valid keyword?

    There are simple ranking algorithms that weigh uncommon keywords more heavily and take into consideration how many keywords a site claims to relate to. These might be more effective.
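
    A rough sketch of that kind of weighting; the numbers and sites are invented, and this is essentially the standard inverse-document-frequency idea rather than any particular engine's formula:

        import math

        TOTAL_SITES = 1_000_000
        # Hypothetical index: how many sites claim each keyword.
        sites_claiming = {"girl-scout": 120_000, "cookies": 300_000, "troop-meeting": 900}

        def keyword_weight(keyword):
            """Rare keywords are worth more than ones that every site claims."""
            return math.log(TOTAL_SITES / sites_claiming.get(keyword, 1))

        def site_score(claimed_keywords, query):
            """Penalize sites that claim everything by dividing by how much they claim."""
            matched = [k for k in query if k in claimed_keywords]
            return sum(keyword_weight(k) for k in matched) / math.sqrt(len(claimed_keywords))

        # A focused site beats a keyword-stuffed one for the same query.
        query = ["girl-scout", "troop-meeting"]
        print(site_score(["girl-scout", "troop-meeting"], query))
        print(site_score(["girl-scout"] + ["spam%d" % i for i in range(200)], query))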

  • Most searches for herbal medicines (e.g. "5-HTP") turn up way more hits (especially the high-ranking ones) from companies trying to sell it to you than actual objective information about it.

    Had you typed 5-htp information [google.com] into Google, you would see 5-htp information, with Harvard as result #2.

  • "html" 188,000,000

    But, as usual for Google, the first three results are highly relevant for at least one common sense of the search term. (The first is W3C's official HTML standards site.) I didn't realize how bad AltaVista sucked until I tried it after using Google for a year.

    does anyone find anything better than "and"???

    +a comes close. It seems they're blocking searches for +the.

  • Yep, all that content, and yet when there's a slow day at work I can still run out of interesting stuff to look at on the internet.

    little gamers [gamespy.com], penny arcade [penny-arcade.com], goats (not goatse) [goats.com], and badtech [badtech.com]: online comics. It'll take a while to browse the entire archive.

    everything 2 [everything2.com]: nearly half a million writeups on topics from aardvark [everything2.com]s to zzyzx [everything2.com].

  • Basically, using the peer-2-peer revolution (buzzword alert) in advertising is the next thing.

    I hope you're not talking about spamming Gnutella [slashdot.org].

    some companies are trying to combine the peer-to-peer aspect of traditional word of mouth and the web.

    In this model, surfers are paid to recommend the sites to other surfers. Spedia [spedia.net] is a prime example, as was AllAdvantage until it went to a "sweepstakes" scheme. Other examples can be found in the many sites that use Recommend-It [recommend-it.com].

    Hatten är din, hatten är din, habeetik, habeetik.
  • by yerricde ( 125198 ) on Tuesday March 27, 2001 @11:59PM (#336345) Homepage Journal

    Of course, you'd need to use this technique with a search engine that takes dead-link submissions, e.g. AltaVista and its "Add or Remove a Page" link

    AltaVista does not allow submissions [rose-hulman.edu] from visually impaired users or users of text-based web browsers such as Lynx, Links, or w3m. Its submission page [altavista.com] uses a GIF image (burn all GIFs [burnallgifs.org]) to display rotated text in various fonts. The user is supposed to read the text and enter it into a field below. But visually impaired users, users on text browsers, and users on browsers whose developers have been cease-and-desisted by Unisys [burnallgifs.org] never see the GIF and cannot contribute links to AltaVista.

  • Not quite. Disney can pay zillions to be top in a search for "animation techniques", but they're actually not a reference site for learning how to do animation. Ditto a search for "electronic circuit design" - Intel could pay to be listed on there, but you're not going to find much info about designing electronics on their site. Paying for listing on those kind of things simply increases the noise, whereas Google's system looks for sites which are popular references on a subject.

    But you're right in some ways, too. If you search for "children's toy company" or something (and temporarily ignoring the other 'toys' listed ;-) then pay-per-listing is more likely to show you ToySmart or whoever (are they still going? can't remember), which you actually want.

    Good points and bad points about both. I think the best would be a two-tier system - a pay-per-listing one for commercial stuff (Amazon, etc) and a free one with a reference-check system for information-search purposes. Maybe the pay-per-listing could subsidise the free one?

    Grab.
  • Yeah, but without a truly intelligent AI, search algorithms will always be exploitable. Keyword spamming is the old-school method, and with Google, maybe a combination of keyword spamming and link spamming (having tons of other bogus sites link to yours) would work.
  • Sure this is a problem, but it's more an example of applying the wrong tool. Google was never intended for comprehensively finding every scrap of information about a particular topic; it was designed to find the few most relevant and interesting sites discussing a particular topic. Using a general purpose tool for a highly specific task is a wonderful way of getting frustrated but not an efficient approach to solving your problems.

    In fact, there are specialized search engines for dealing with specific topics. There are engines specifically for looking for images, ones for looking at specialized topics, and so on. There are also specialized, classified catalogues of information of exactly the kind you suggest are needed out there for people who need to know about them. If, for instance, I want to learn about a specific topic in biology, I might very well start out by looking at PubMed [nih.gov], a special purpose index of biological research articles. You just have to know where to look for the special purpose tools.

  • by rgmoore ( 133276 ) <glandauer@charter.net> on Tuesday March 27, 2001 @01:14PM (#336357) Homepage
    Yes, free, independent sites ARE tough to find, even with Slashdot's favorite Google. Every time you search for ANYTHING, the first 1000 hits are always for a commercial site.

    Except that this isn't true. If I look up, say, Ronald Reagan [google.com], none of the top 5 hits are big commercial sites. They include the White House pages on former presidents, a fan page, the Reagan Presidential Foundation, the Reagan Library, and the Official Reagan Web Site. If I look up Linux Kernel [google.com], the #1 site is the Kernel Archives page. Maybe you're looking for data where there just aren't many interesting independent web sites out there, which is not something that can be cured with a better search engine.

  • The answer to keeping pace with web growth is to have sites like Yahoo be a "guide to guides" instead of "guides to everything." Instead of listing 50 links in a "Cheese" category, list one or two or three links to web sites that are their own mini-portals to cheese.

    The content on mini-portals is a million times better than Yahoo's old haphazard system. I gave up submitting non-commercial links to Yahoo because you wait months before being sure they didn't list you, then resubmit and wait months, then resubmit... etc.

  • by Alexius ( 148791 ) <alexiusNO@SPAMnauticom.net> on Tuesday March 27, 2001 @01:10PM (#336365) Homepage
    What ever happened to the peer to peer idea of searching? I remember when Napster and GNUtella started, people were talking about how this might actually alter the way searching was happening on the web. By having each server tell us what they have, we are assured that when someone searches for how to replace a broken window, they won't get what they don't want [microsoft.com].
  • Libraries are government-funded, so everyone has paid for them already. A government-funded search engine might not be a bad idea though.
  • How does the Dewey system address that, since a book can also fall into more than one category?
  • And how long do you think it will be before microsoft.com, mpaa.org and riaa.org disappear from all search engines?
  • Well, the article has a point: finding the right webpage is difficult, but it has been ever since all this started, and it is not something particular to the Internet.
    The most difficult thing on the Internet, in my view, is finding the very specialized article that you are looking for. The problem is that it may not even exist. Finding the same very specialized article in a huge library full of journals is even more complicated. So what? Next article.
  • You know, solving the keyword problem wouldn't be too hard. I mean, if Yahoo had an option where you could submit a site that you think had off-topic keywords (like if it were a porn site and it had the keyword 'girl scouts' or something children might be searching for), and they would completely remove all occurrences of an offending site from their database, then maybe things could be well classified. People would only use on-topic keywords so that they don't get banned from Yahoo.

    This would make searches SO much more accurate. It would just take someone with the balls to say, "You are abusing the keywords, so now nobody will ever get to your site from our search engine."
  • I think most search engines are ignoring metatags these days because they are so commonly spammed. So, the only way to have a directory these days is to have it be completely manually built. Thus, it is impossible to have a comprehensive web directory, unless you are willing to put up with spam.

    I have a suggestion to anyone who is thinking of implementing a better directory. First, define the categories, and allow any site to submit their site to their categories. Then, introduce moderation to the mix. Allow users of your directory to rank sites in terms of suitability to the category. Allow them to create red flags for people submitting porn to health->teens->sexuality, and so forth. Let the users do the work!
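
    A sketch of the bookkeeping that might take; the flag threshold, field names, and scoring are arbitrary choices for illustration, not a description of any existing system:

        from collections import defaultdict

        FLAG_LIMIT = 5       # distinct users flagging a listing before a human reviews it

        listings = defaultdict(lambda: {"ratings": {}, "flags": set()})

        def rate(category, url, user, score):
            """A user rates (1-5) how well a submitted site fits its claimed category."""
            listings[(category, url)]["ratings"][user] = score

        def flag(category, url, user):
            """A user red-flags obvious abuse, e.g. porn submitted to health->teens->sexuality."""
            entry = listings[(category, url)]
            entry["flags"].add(user)
            return "needs-review" if len(entry["flags"]) >= FLAG_LIMIT else "ok"

        def ranked(category):
            """List a category's sites by average suitability rating, hiding flagged ones."""
            rows = []
            for (cat, url), entry in listings.items():
                if cat == category and entry["ratings"] and len(entry["flags"]) < FLAG_LIMIT:
                    average = sum(entry["ratings"].values()) / len(entry["ratings"])
                    rows.append((average, url))
            return sorted(rows, reverse=True)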

    I think moderation works well for sites like Slashdot; why not a moderated web directory?

  • [Disclaimer, I work for GOTO [goto.com]].

    Whilst Google is clearly the best for non-commercial searches, GoTo is apparently the best for commercial searches (if you want a service someone will make money from supplying).

    It nicely gets around the problem of manual classification by effectively using market forces to make advertisers classify themselves correctly (or pay for referrals which make them no money).

    Let's say I have a hotel in San Francisco, but bid on the general term Hotel ($1.03). Now I will presumably only get some custom if the searcher was looking for a hotel in SF - otherwise I just paid GOTO a dollar for a useless referral. Better to list myself under HOTEL SAN FRANCISCO; even though this costs more ($1.71), I will have a much higher conversion ratio.

    Of course, if I am a US Hotel Chain or Broker, then maybe I would bid on the general Hotel keyword.

    End of self serving Sales Pitch :) Personally I'd like to see us create a GoTogle (TM) :-) that combines the best of both approaches.

    Winton

  • by bcrowell ( 177657 ) on Tuesday March 27, 2001 @02:40PM (#336385) Homepage
    Yahoo and the like are doomed to failure until someone implements something like the Dewey Decimal System for web pages and then convinces a large number of webmasters to correctly classify their pages using it. That way a machine can do the hard work and only the person designing the page need do the actual work of making sure the page is classified correctly.

    Well, what you're describing sounds a lot like META KEYWORD tags.

    Having been an Open Directory editor in the past, I don't really think the problem is finding the right pages. Actually the biggest problem is just that a lot of editors aren't active, and it's hard to know who's active, because they're listed as editors even if they haven't logged in or checked submissions for a year. This creates problems for editors who have to cooperate with other editors, and may also give outsiders the impression that Open Directory is overwhelmed in general, when really it's just that the editor they submitted to is AWOL.

    Yahoo is doomed to failure because they don't have enough people working for them. Open Directory works just fine, because they have orders of magnitude more eyeballs working in parallel. No, Open Directory doesn't list every page on the web, and that's just fine with me as a user -- it's more useful because it's selective.


    The Assayer [theassayer.org] - free-information book reviews

  • Known item searching is dead easy using any search engine, so long as the item is in the database. It's also easy to find something about anything, so people who just want some information without being overly concerned with how accurate or complete that information is can also easily find something to keep them happy.

    Serious research, on the other hand, requires a more quality-conscious search. A researcher will want all of the most relevant information about a topic, and Web search engines do not provide this very well at all. Weighted keyword searching is no substitute for professionally catalogued and classified documents in cases like this. In some cases, researchers will want an exhaustive search: everything relevant about the topic. For example, a Ph.D. candidate would almost certainly begin their thesis by locating everything academic published in their field of study. This is downright impossible with Web search engines: even if their databases were complete, relevancy is so bad that you would probably have to wade through thousands upon thousands of hits to find a hundred or so truly relevant sites. This is especially true of any subject that is susceptible to search engine spamming.

  • WTF are you looking for?
  • I don't know WHAT they are talking about -- I can find ANYTHING that I look for on Google -- even sites that I have just created a day or two ago have been found. These people just aren't using the right search engine, dammit! =)
    They're talking about using the Web for serious research. The article actually misrepresents the problem for hardcore researchers on the Web. The problem is not so much finding information, it's finding information you can trust. But for most other people the problem is just finding the information, and it's not just that they're not all using Google; it's also that they don't know how to search properly. They don't know how to formulate queries which are specific enough to weed out the bad pages.
  • No, but I've done some searches after finding /. that would have led me to the site.

    I can't be karma whoring - I've already hit 50!
  • Keywords are not especially helpful in auto-creating directories. They are of limited value because only about 10% of web sites use them at all. Of those that do use them, there is no limit or structure to them. They are easily spammed. This is exactly why they were discarded as useful by SEs a long time ago. I have found keywords and descriptions helpful in my own efforts at classifying web pages because, once verified by a human (me), they could be used as a partial basis for text based searches (in which I also included META descriptions). If no keywords were given I frequently resorted to duplicating the description. If keywords were given, but no description, I could usually find a short excerpt from the site that could be copied and pasted.

    Open Directory works rather well, IMHO, as a directory because the editors have a strong sense of ownership and are given small enough chunks to do that the work is very manageable at the individual level (and they can do it in their spare time easily). But the human element is always going to be a potential issue with any directory. A problem you just don't have with Google.
  • A book can be cross-listed in a card catalog, from my understanding, but since the book can only be in one place on the shelf, it's not a big concern. The librarian simply chooses the dominant topic, or uses one of 000 general classes (for things like encyclopedias, periodicals, etc).
  • by ichimunki ( 194887 ) on Tuesday March 27, 2001 @01:24PM (#336398)
    I never said it would be easy! :)

    Having actually tried to implement a DDC based web directory once, I am familiar with the problem that many pages would possibly fall under many categories. This is a problem with any directory-based approach, especially if you list a page in one category and then the page changes enough so that the category no longer applies.

    In your example, I would hope it would not be too much trouble for you to put a different class number into the pages that make up each logical section of your site. Or if the site is small enough, it would likely fall under something like "personal web pages", which may have a number of subclasses itself, and then you'd choose the one you felt appropriate.

    Again, this is a common issue among all directories, where do you put stuff? Do you allow multiple listings/classes per site/page? You still end up having to include some sort of keyword or text-based search so that users are not forced to browse the directory structure, guessing at the classification they are looking for or where it lies in the hierarchy. Text searches also allow for the possibility of searching based on content rather than metadata.

    Most of this is a non-issue, given that Google seems to have rather successfully implemented a non-directory type of engine-- succeeding where Altavista was simply unwieldy. At least that's my impression. I usually find what I want with Google.
  • by ichimunki ( 194887 ) on Tuesday March 27, 2001 @02:47PM (#336399)
    This also brings up the problem of being able to use multiple pages that are essentially redirects to get around the listing limits. For instance, I make http://www.hotgrits.com/natalie1.html, .../natalie2.html, .../natalie3.html, etc which all are really mirrors of http://www.hotgrits.com/portman.html, which is the main page for my site. The only thing I change is the category for each page so that my site effectively shows up in numerous places in the directory. With a properly constructed CGI program I could be listed in every category without having to work that hard.
  • by ichimunki ( 194887 ) on Tuesday March 27, 2001 @01:07PM (#336400)
    Yahoo and DMOZ are web directories. This is a very human labor intensive way to categorize the web. Google is actually a search engine. It spiders out and runs an indexing algorithm of some sort to help it respond to queries. These are very different approaches.

    Yahoo and the like are doomed to failure until someone implements something like the Dewey Decimal System for web pages and then convinces a large number of webmasters to correctly classify their pages using it. That way a machine can do the hard work and only the person designing the page need do the actual work of making sure the page is classified correctly.

    Obviously this is fraught with problems similar to those of keyword spamming, but it's either that or build something like DMOZ on a decentralized basis, so that any individual maintainer builds a set of links that are tailored to his/her interests and either uploads them to a central server or provides them as an XML document for an engine to work with.
  • I run a site that's a cumulative name index of 700 books and thousands of clippings. The indexing started in 1983. For any name, you can get all the other names that share pages with that name throughout the entire database. In other words, each name search produces a page that contains anywhere from several to several hundred additional names -- all pre-linked directly to their own searches, which do the same thing. You get the idea.

    It's a bot's worst nightmare. But if you are Google, with lots of crawlers to sic on the task, it quickly can become my nightmare instead of Google's. Indeed, Google doesn't seem to care much.

    Last October I noticed that Google was inclined to stumble into our cgi-bin on rare occasions, and actually do a decent job of delivering referrals to the name data that it got from us. I lifted the robots.txt exclusion to see what would happen. No other bots have delivered referrals as consistently as Google, so I can only assume that Google is the only bot that's even serious about going after the dynamic web.

    Either that, or their algorithms do a much better job on our names, which are all listed surname-first throughout our site. If you search for a name in the news as Firstname Lastname without quotes, Google will put our Lastname, Firstname high on the list due to two facts: our name is part of the anchor description and they give link data more points, and secondly, the two words are close to each other and this adds to the score (even though they are backwards).

    Google has come by once a month ever since I lifted the robots.txt. Each time they spend about 10 days solid, 24/7, with from three to five crawlers, chasing all the name searches. The rate from all the crawlers together for those 10 days varies from about two name searches per second to several per minute.

    It's very erratic during that time; the crawlers don't talk to each other, and there's no detectable pattern that they're following. They don't manage to get through the entire database of 115,000 names by any means. There is an incredible amount of waste and duplication.

    I had to install a load-sensitive thermostat so that when our server hits a certain load threshold and it's Google calling, it starts delivering "Server too busy" responses instead of the search that was requested. That seems to work pretty well, but they get all those "Server too busy" messages stored in their cache copy for that name.
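
    The thermostat described above might look something like the following in a CGI handler; the load threshold and the user-agent check are illustrative guesses, not the poster's actual code:

        import os
        import sys

        LOAD_LIMIT = 4.0     # above this 1-minute load average, start shedding crawler traffic

        def shed_crawler_if_busy():
            """Answer crawler requests with a 503 when the box is already struggling (Unix only)."""
            load_1min = os.getloadavg()[0]
            agent = os.environ.get("HTTP_USER_AGENT", "").lower()
            if load_1min > LOAD_LIMIT and "googlebot" in agent:
                sys.stdout.write("Status: 503 Server Too Busy\r\n")
                sys.stdout.write("Retry-After: 3600\r\n")
                sys.stdout.write("Content-Type: text/plain\r\n\r\n")
                sys.stdout.write("Server too busy; please come back later.\n")
                return True  # the caller should skip the expensive name search
            return False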

    To put it bluntly, their bots are dumber than toast, and if you don't watch them, they can turn your server into toast.

    Last November I wrote to Larry Page and offered to send him the damn database on CD-ROMs, in discrete HTML files using any specification he cared to define, so that his crawlers wouldn't have to load down our servers once per month.

    Mr. Page never responded. The letter was e-mailed, faxed, and snail-mailed. Someone from google.com did a Larry Page search shortly after I faxed it, so I'm pretty sure they read the thing. I offered these CD-ROMs for free, and I didn't ask for any changes in PageRank or any other considerations. It would simply mean that I can get my names onto Google efficiently and comprehensively, without enduring that 10-day orgy once a month.

    My point is that there is no real effort at Google to make any sort of accommodation on a case-by-case basis with the so-called "deep web." Until that happens, sites such as mine have difficulty in allowing Google's crawlers to run amuck once per month. We have other customers to consider.

  • Though this article is more fluff than the useful information it would appear to contain to a search engine, it does ask a good question: are technological advances reducing the ability of search engines to keep up? I would say no. Rather, it is incompetent and malicious web page designers that are the problem.

    Although technologies such as frames, ASP and JSP, ColdFusion, or Flash may make it harder to design a crawler-friendly web page, such pages need not be crawler-hostile. As the article points out, the issue is how the site handles requests that contain no parameters. The incompetent designer will treat such a request as an error. The more thoughtful designer will display a useful page with appropriate meta tags.
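
    As a sketch of that "thoughtful designer" behavior, a dynamic page can fall back to a crawlable overview when it receives no parameters; the widget data and markup here are made up:

        from urllib.parse import parse_qs

        WIDGETS = {"42": "History of the keepsake widget"}   # stand-in for the real database

        def handle_request(query_string):
            """Serve a crawlable index page when a request arrives with no parameters."""
            params = parse_qs(query_string)
            widget_id = params.get("widget", [None])[0]
            if widget_id is None:
                # No parameters: instead of an error, return an overview with meta tags
                # and plain links, so a crawler has something useful to index and follow.
                links = "".join('<a href="?widget=%s">%s</a>' % (key, name)
                                for key, name in WIDGETS.items())
                return '<meta name="description" content="The history of widgets">' + links
            return "<h1>%s</h1>" % WIDGETS.get(widget_id, "Unknown widget")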

    The second issue is intellectual property and the true number of pages on the web. Suppose we create a site on the history of widgets. This site contains 10 base pages backed by a database of 100,000 widgets. Is the true size of the site 10 pages or 1 million? I would say it is 10 pages, and that indexing those 10, only 0.001% of the pages a complete index could contain, is enough. The problem is how to make these 10 pages representative of the site. It may be reasonable that a search for '1145 crusade keepsake widget' might fail, but our design should allow the more general search 'history widgets' to succeed.

    Anyone who has done library research in the pre-computer age knows that it takes skill and determination to find citations. The fact that we have replaced a million tiny cards and a thousand volumes of indexes with an online database does not mean that search and design skills are no longer necessary. Unfortunately, we cannot assume that users will have the proper search skills, so we, as designers, must learn better design skills.

  • by mblase ( 200735 ) on Tuesday March 27, 2001 @01:10PM (#336407)

    The only "problem" is that the Internet is simply too large for one engine to index. People go to Google expecting to search every web document that's online, a labor comparable to going to your local library and expecting their database to tell you about every book in existence on a particular topic or by a particular author. Even the Library of Congress [loc.gov] isn't that comprehensive.

    I disagree with the article's claim that "much of the most interesting and valuable content [on the Web] remains hard to find." I think that the most interesting and valuable content is easy to find, provided that you start looking in the right place. Which means that if I want information on the latest US school shootings, I don't go to Yahoo or Google and search for "school shootings", I go to those sites and search for major news sources (BBC, CNN, Reuters, etc.) and use their up-to-the-minute search engines.

    The role of search engines isn't "shrinking" by a long shot; it's just becoming less comprehensive. Searching on the Web is now a two-step process instead of a one-step process, and you have to apply a little more intelligence than you could back in 1995. If high school students researching their latest humanities paper have a problem with that, well, they should ask us twentysomethings what it was like to have to use card catalogs and microfiche for our own high school projects.

  • Sorry, folks, but the web is clearly not unsearchable, at least not yet.

    Google consistently returns good information on every search I make. A fairly superficial, PR-ish overview of their technology is here [google.com]. The gist of it is that, among other things, the number of links TO a page is considered part of the criteria for ranking. (The theory is that an important or well-established page will have many links to it.)

    OTOH, human-edited directories like Yahoo and dmoz are going to have a really tough time as the web continues its exponential growth. I get so many dead links from these services that it's not worth the bother.

  • People pay to advertise in the yellow pages, what's wrong with being charged to list on the Internet?
  • Maybe search engines relying on older methods are having problems, but using Google, I honestly haven't had a problem locating material quickly at all. You just have to have the right approach in searching for things...

    • Forget about possible titles of the page. Choose one to three words that you think will be in the body of the page you're looking for. Choose words that describe the theme of the website. Use mostly nouns; verbs have too many modifiers.
    • Avoid negatives, articles ("a", "the") or words that are frequently misspelled or have different international spellings ("colour" vs. "color").
    • Use correct spellings of last names for celebrities. If you can't figure out the correct spelling of a celebrity from the entertainment industry, figure out the name of an associative body of work (movie, tv show), and check out imdb [imdb.com]. If you know how to spell "Mystery Men", you'll know how many L's there are in "Garofalo" (or is it "Gerafallo"?). Then head to your search engine armed with that correct spelling.
    • To narrow the search (ie: "Jordan" [google.com] might turn up a ton of different references), try to use a second word that will narrow the context (ie: "Jordan Bulls" [google.com]).
    • Avoid using brand names unless you want .com sites returned first. Chances are they'll show up on searches anyway.
    • Search only on Google [google.com] (or Google-based engines) as it uses IMO the best methodology for ranking sites -- chances are you'll want to see what everyone else is seeing too, and it's based on referential merit that the sites are ranked.
    • Heh, and if you're searching for a specific porn site, good luck. Pretty much every method possible of getting your site ranked higher in the searches has been used. I mean, have you seen some of those Meta tag listings?

    Like I said, most of this is common sense and redundant to most people who've searched for stuff before. But you'd be amazed how many people have no idea how to find the information they need, when you can get it in less than ten seconds, including the time needed to plan the search and type in the query. I try to use this sort of list when telling people how to find info., sort of like teaching a person to fish so they can feed themselves for a lifetime.

  • by MeowMeow Jones ( 233640 ) on Tuesday March 27, 2001 @01:20PM (#336427)
    A google search on the word "internet" just returned 65,500,000 hits. With that many hits, it makes it hard to figure out how to even get on the internet in the first place, let alone use a search engine!

  • I was always very happy with Internet searching, so I was surprised to see an article talking about some big Internet content crisis. I see their point about the 'surface' and the 'deep' web, but those are also the terms used in BrightPlanet's whitepaper [completeplanet.com] on the subject. Since it's pretty obvious that BrightPlanet coined the terms, the entire article comes into question: why didn't they draw a distinction between the company whitepaper's opinions and established facts?

    And in the fourth paragraph:

    Despite the ever-ballooning size of the World Wide Web, which some experts claim is on the order of 550 billion Web pages, much of the most interesting and valuable content remains hard to find.

    An unsubstantiated 550 billion pages, or about 100 pages for every living human being? I'm no expert, but that's ridiculous.

    They quoted the Google people saying how hard it is to search for anything besides text, and then tacked on some BrightPlanet PR. It sounds like someone at Reuters is just meeting a quota: more of that fantastic deep content we should all pay for.

  • Yes, free, independent sites ARE tough to find, even with Slashdot's favorite Google. Every time you search for ANYTHING, the first 1000 hits are always for commercial sites. The thing is, the big commercial sites have most of the information that most people find most useful. Is there a good way to change this? Not that I can come up with, unless an 'alternative' search engine is created that doesn't accept large corporate sites. But realistically, that WAS Google once, and even they couldn't live on zero revenue.

  • The article [cnn.com] asserts that crawlers "can easily get trapped in a dynamically driven site."

    Not so fast.

    While that is true of older, cr@ppier search engines like AltaVista and Inktomi, Google can and does index dynamic pages. (Indeed, more than 60 percent of new users to one of my sites come in via dynamically generated .cfm detail pages that have been indexed on Google.)

    It seems to me that if you want your content to be indexed, getting on Google (and by extension Yahoo, since Yahoo uses Google results in addition to its directory) is pretty darn easy. I have to say, I'm not nearly as frustrated with search engines as I was in the days B.G. (Before Google).

  • Gee, I wonder why the net is getting so hard to search and index? People are probably using the wrong search engines and aren't "web-savvy" enough. I can find anything I want on the net... you just have to know where and how to look for it.
  • by Shoten ( 260439 ) on Tuesday March 27, 2001 @01:10PM (#336454)
    I think that neither the people who claim that this is impossible nor the people who want to dismiss it are correct. There is undoubtedly a major problem, and it is only getting worse. The flip side of that, however, is that while we are getting farther and farther from having a complete listing of the web in search engines, the ability of end users to find what they are looking for appears to be improving, particularly with the advent of better search engines like Google.

    The solution to indexing the web completely, or at least much more completely, has to lie in another methodology. How about a distributed solution? Google@home? distributedYahoo!.net? Honestly... there are ways to tackle the problem, and the reason this entire system exists is that people refused to just shake their heads and say, "Nope, can't do it... sorry!"

    How about a button in browsers that lets you mark a page as a dead link? Just hit that button and a centralized system gets a reference to the URL currently in your browser. That centralized system is funded by all search engines and all search engines draw from it. Yes, I know... "What if a user falsely claims a site to be dead?" Well, what if it took 100 different IPs claiming it to be dead before it was really considered dead? If a site doesn't get many visitors from a search engine in the first place, then the engine probably isn't serving it up to many people anyway.
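
    A minimal sketch of the tally such a centralized system might keep, assuming the 100-distinct-IP threshold suggested above (the function and variable names are invented):

        # Toy dead-link tally: a URL is only flagged once enough *distinct*
        # IP addresses have reported it, to blunt false or malicious reports.
        from collections import defaultdict

        REPORTS_NEEDED = 100                     # threshold suggested above

        reports = defaultdict(set)               # url -> set of reporting IPs
        dead = set()

        def report_dead(url, reporter_ip):
            reports[url].add(reporter_ip)
            if len(reports[url]) >= REPORTS_NEEDED:
                dead.add(url)                    # search engines drop or demote it

        report_dead("http://example.com/gone.html", "10.0.0.1")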

    How about a system for pre-indexing an entire site, such that the person who runs it can have a single document at the root of their domain with the index results? A standard could be developed that would even go so far as to map out the existing sub-sites (for AOL personal sites, for example) so that the engine could go to each one for the index documents.
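
    No such standard exists yet; purely as an illustration, the index document could be a flat file at the domain root that a crawler fetches in one request instead of spidering every page. A hypothetical generator (file name and format are invented):

        # Hypothetical "site-index.txt" published at the domain root so crawlers
        # can pick up every page (and sub-site) in one fetch.  Invented format.
        pages = [
            ("http://example.com/", "Home", "2001-03-20"),
            ("http://example.com/articles/search.html", "Search tips", "2001-03-25"),
            ("http://members.example.com/~alice/", "Alice's sub-site", "2001-03-18"),
        ]

        with open("site-index.txt", "w") as f:
            for url, title, modified in pages:
                f.write(f"{url}\t{title}\t{modified}\n")   # one page per line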

    I guess that what I mean to say here is that the problem is largely based around the hugeness of the web, and how brute force is no longer enough. But that's not really that big a problem...all that's needed is a bit of creativity.
  • It may be thousands of years old, but it has stood the test of time. It has no annoying banner ads and very little porn to distract you from what you were actually searching for. What is the name of this search engine? The Holy Bible. No matter what subject you are interested in, the Holy Bible has something to say. From geekiness, to installing Linux, to how to get a date, to what to eat. It's all in there. I realise I will be marked as flamebait by the anti-religious slashdot zealots, but if just one person is saved by my advice it will have been worth all the negative moderation in the world.
  • The trend is toward pay for listing. Will the free, searchable web fade away?"

    It's not a trend; it's companies trying to stay afloat in what's becoming a bear market. It's amazing to see how companies like Google stay in business when they have so few ways of collecting any kind of revenue. E.g., what is Google's only means of obtaining revenue? Charging a company for a copy of its search engine? Why would a company pay for a search engine when the market is flooded with them?

    Ad-based revenue? We all know where those click-me businesses are going.

    We also know most of the "web rings" never went anywhere, but any search company that thinks people will pay to find something on the net is shit out of luck. Maybe corporations would pay, but I'd just make my own search engine (freely distributed), post it somewhere, and let the whole "submit your site for free" revolution take place again.

    Privacy Info [antioffline.com]
  • The article touches on "deep web content" hidden behind newer technologies such as Active Server Pages and ColdFusion. Is this seen as a problem solely by the search engines themselves, or are the sites designed that way the ones complaining?

    If the sites themselves are complaining that no one can find their content, aren't there ways to help that? Run a query on their database to generate a list of pages covering the content, then provide that list to the search engines. The search engines could then provide a link (found via a content search) that would put the user on the page where they enter the form (or whatever) information to generate the page needed. I'm not familiar with XML, but knowing that it has some features to aid in content grouping, could the sites be recoded in it for this purpose?
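
    A rough sketch of that idea, assuming a hypothetical product database and a .cfm detail page that takes an id parameter: the site dumps one crawlable URL per row and hands the list to the engines.

        # Sketch: turn database rows into a flat list of crawlable URLs so the
        # "deep" dynamically generated pages become visible to search engines.
        # The table, column, and URL pattern are hypothetical.
        import sqlite3

        conn = sqlite3.connect("catalog.db")
        rows = conn.execute("SELECT id FROM products")

        with open("crawlable-urls.txt", "w") as f:
            for (product_id,) in rows:
                # each URL reproduces the page the site's search form would generate
                f.write(f"http://example.com/detail.cfm?id={product_id}\n")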

    Obviously, if the sites themselves don't want this deep content easily viewed except by deep clicking through their whole site, or through some pay-per-view system, that is their choice. I feel that they are limiting themselves, however. If they think they have content robust enough to be useful to users, they should strive to make that content as widely available as possible.

    Should proprietary websites even be considered 'Internet web content'? Those seem to me to be 'intranet content', which most often should not be seen by the general public (ie: internal company policies needed only by employees of company X). For that information to be set free, you'd need either a very savvy person breaking in from the outside or a traitor on the inside. If it's only certain product listings that the company doesn't want available to the public, well, that's too bad for them; I'll just get a quote elsewhere and give someone else my money.

    "evidence of a widening gap between the deep Web and the freely-accessible 'surface Web,' which could become a clutter of recreational and amateur-oriented content -- the online equivalent of public cable access television or self-published novels." Funny, ever since the late eighties, I've always seen the whole web like this. It's more like the big corporations tried to muscle in on the public cable channel and realized they might be better off on their own channel.

    Not your normal AC.

  • Much fuss is made about the search engines needing to "fix the problem" of not being able to index sites like microsoft.com because the pages are dynamically generated. Is this really a problem?

    Microsoft (or whatever other dynamic site you wish to pick) chose to make their content unindexable. Don't try to make it someone else's problem. Let people who use the search engines find third-party information instead. If the site designers wanted their site in the search engines, it would be there. Many of the sites built with ColdFusion or ASP contain basically static information anyway, and making them dynamic just reduces your traffic.

    Sites like Slashdot are dynamic. A search engine can't be expected to keep up with something that changes every 30 seconds. However, making all of the archives static HTML allows them to be searchable by the engines and takes some load off the server, to boot.

    I went for a "best of both worlds" approach on my personal site [robson.org] by writing a Perl site generator. Each time I update the site, I re-run the generator, which takes about a minute. My server carries a lighter load, but I still have "dynamic" links to related articles and such that the generator builds.
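
    For comparison, here is a minimal sketch of the same approach in Python rather than Perl (the directory names and the page template are invented): regenerate plain HTML from source articles on each update, so the published pages stay static and indexable.

        # Minimal static site generator sketch: turn a directory of text articles
        # into plain HTML pages plus an index, so crawlers see ordinary static files.
        from pathlib import Path

        SRC, OUT = Path("articles"), Path("public_html")
        OUT.mkdir(exist_ok=True)

        pages = []
        for src in sorted(SRC.glob("*.txt")):
            title = src.stem.replace("-", " ").title()
            body = src.read_text()
            html = f"<html><head><title>{title}</title></head><body><h1>{title}</h1><pre>{body}</pre></body></html>"
            (OUT / f"{src.stem}.html").write_text(html)
            pages.append((title, f"{src.stem}.html"))

        # "dynamic-looking" index of related articles, rebuilt on every run
        index = "".join(f'<li><a href="{href}">{title}</a></li>' for title, href in pages)
        (OUT / "index.html").write_text(f"<html><body><ul>{index}</ul></body></html>")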

  • by Eoli ( 320216 ) on Tuesday March 27, 2001 @12:57PM (#336472)
    You said the same thing two years ago [slashdot.org]!
  • Google has a really neato ad model that anyone can afford. You basically set up a small 2-3 line ad that is linked to certain keywords or phrases. You are billed around $15 per thousand impressions of your ad. You set the limit you're willing to pay, and bing! it's all done.

    Very cool and clever idea. Now small businesses can promote their sites without having to invest mega-$$$$ for the traditional "banner ad".
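
    The arithmetic is straightforward; with a made-up monthly cap of $150 at the quoted ~$15 CPM:

        cpm = 15.00           # dollars per 1,000 impressions (figure quoted above)
        budget = 150.00       # hypothetical monthly spending cap
        impressions = budget / cpm * 1000
        print(int(impressions))    # 10000 impressions before the cap is hit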

  • If you've been paying attention to the Freenet [sourceforge.net] scene, you'll see that it's impossible to search it. Everything is handled as a key-value pair, where you input keys to find pages. This makes it hard to censor and almost impossible to trace which server is hosting the page you're looking at, but it also means that there aren't any search engines. All that needs to happen is for authors to give their pages nice keys, and everything should work fine. This is a lot like how META keywords should be working on the Internet today.
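
    A highly simplified sketch of that key-based retrieval (not Freenet's actual protocol; the key names are invented): content is stored and fetched purely by a descriptive key, so good key naming stands in for search.

        # Toy key->content store: retrieval works only if you know (or can guess)
        # the descriptive key.  Not Freenet's real protocol.
        import hashlib

        store = {}

        def insert(key, content):
            # hash the human-readable key; the store only ever sees the digest
            digest = hashlib.sha1(key.encode()).hexdigest()
            store[digest] = content

        def fetch(key):
            return store.get(hashlib.sha1(key.encode()).hexdigest())

        insert("freenet:tutorials/installing-linux", "<html>...</html>")
        print(fetch("freenet:tutorials/installing-linux") is not None)   # True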

  • The Web is a victim of its own success. Now every snake-oil salesman, fanboy and their grandmother has a website.

    Even Slashdot is too big. How the hell are you supposed to follow a conversation this big?

    Especially with the goatsex.

    I'm gonna start mailing postcards.

    Excelsior,

    ME
  • by StarPie ( 411994 ) on Tuesday March 27, 2001 @01:12PM (#336485)
    Actually, this just is a great opportunity for the next Great Search Engine. Look at how well Google has done just indexing a small portion of the web (1%, according to the article). So that leaves the door wide open to anyone who can crack the puzzle of how to keep up with the web. If word gets around that something is better than Google, it'll be huge. You can say "oh, no one can index the whole web accurately," but there is someone out there with the brains and courage to try it -- and succeed.
  • Hello,

    I'm one of the authors of Sparkseek [sparkseek.com], a remotely-hosted search service. I'm also a student at Pennsylvania State University. I want to give you an idea of what kind of problems researchers in the field of internet text retrieval have to deal with.

    Larry Page, one of the co-developers of the Google search engine, said in his 1998 research paper entitled "The Anatomy of a Large-Scale Hypertextual Web Search Engine" [scu.edu.au] that the primary benchmark for information retrieval, the Text Retrieval Conference, uses a fairly small, well-controlled collection for its benchmarks. The largest benchmark available there is only 20GB, compared to the 147GB from Google's crawl of 24 million web pages. Today, Google has over 1.4 billion web pages in its database and a reported 4,000-node Linux cluster [slashdot.org].

    One of the problems I have encountered, and have found difficult to deal with, is the sheer amount of redundancy in web content. Anybody who has ever tried a search for any Linux command has no doubt encountered hordes of duplicate man pages in their results.
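
    One rough way to attack that redundancy (a sketch, not what Sparkseek actually does): fingerprint each page's normalized text and keep only the first copy per fingerprint, so the hundredth mirrored man page collapses into one entry. A real engine needs fuzzier near-duplicate detection (shingling and the like) on top of this.

        # Sketch of exact-duplicate collapsing via content fingerprints.
        import hashlib, re

        seen = {}                                 # fingerprint -> first URL seen

        def fingerprint(text):
            normalized = re.sub(r"\s+", " ", text).strip().lower()
            return hashlib.md5(normalized.encode()).hexdigest()

        def add_page(url, text):
            fp = fingerprint(text)
            if fp in seen:
                return seen[fp]                   # duplicate of an already-indexed page
            seen[fp] = url
            return None

        add_page("http://mirror-a.example.com/man/ls.1.html", "LS(1)  list directory contents ...")
        # prints the URL of the copy that was already indexed
        print(add_page("http://mirror-b.example.com/man/ls.1.html", "LS(1)  list directory contents ..."))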

    Not only that, but I honestly don't believe that when it comes to search engines, more is better. I have noticed over the past six months, as Google has made great increases in its index size, that results have consistently become worse and worse. Search engines really need to begin narrowing the focus of their index and creating multiple indexes. Educational institutions should be separated from commercial establishments: if I'm performing research on some subject, the last thing I want is to arrive at a commercial site pitching some product.
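
    A crude sketch of that split, keying only off the top-level domain (real classification would need far more than this; the URLs are just examples):

        # Route crawled URLs into separate indexes by top-level domain, so a
        # "research" search need never touch the commercial index.
        from urllib.parse import urlparse

        indexes = {"edu": [], "com": [], "other": []}

        def route(url):
            host = urlparse(url).hostname or ""
            tld = host.rsplit(".", 1)[-1]
            indexes[tld if tld in indexes else "other"].append(url)

        route("http://www.psu.edu/research/ir.html")
        route("http://www.example.com/buy-now.html")
        print(len(indexes["edu"]), len(indexes["com"]))   # 1 1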

    Also, the method Google utilizes when creating its indexes poses a huge scalability problem. Its indexes are updated less frequently than ever, and if you read the paper published in '98, it's not hard to see why.

    Michael Tanczos

  • A totally new approach could be that you don't search at all; instead, interesting web resources get recommended to you by your personal agent. We are currently working on a peer-to-peer system that doesn't exchange files but exchanges recommendations for web sites.

    It's much like a good friend suggesting that you look at an interesting web site. You can see all the marketing blurb at http://www.iowl.net/. At the moment this is a seminar project by some people (including me) at the Wuerzburg University of Applied Sciences. We have a working prototype that will hopefully be released in about a month or so.
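
    Purely as an illustration of the idea (I haven't seen the iOwl code; all names below are invented): peers exchange small recommendation records instead of files, and an agent only passes along sites its peer hasn't already visited.

        # Toy peer-to-peer recommendation exchange: peers trade lightweight
        # (url, note, rating) records rather than files.
        from dataclasses import dataclass, field

        @dataclass
        class Recommendation:
            url: str
            note: str
            rating: int                       # e.g. 1-5 from the recommending user

        @dataclass
        class Peer:
            name: str
            visited: set = field(default_factory=set)
            inbox: list = field(default_factory=list)

            def recommend_to(self, other, rec):
                if rec.url not in other.visited:      # only pass on genuinely new sites
                    other.inbox.append(rec)

        alice, bob = Peer("alice"), Peer("bob")
        alice.recommend_to(bob, Recommendation("http://slashdot.org/", "good tech news", 5))
        print([r.url for r in bob.inbox])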
