The Internet

Web Searches For What Lies Beneath

fat_hot writes: "The New York Times has an article [here] (registration required) about specialized search engines which try to drill into the submerged mass of the Internet iceberg to try to limit searches to particular subjects (and hopefully thereby increase coverage of the limited scope)." Considering that a google search for friends' web sites and other good stuff usually turns up more dirt than paydirt, it's pleasant to contemplate more relevance in search engines.

  • by Anonymous Coward

    Northern Light [nlsearch.com] strikes me as doing the best job of returning relevant results, going so far as to thoroughly categorize the results by topic. Also has a greater portion of the web indexed than any other engine. The downside is that there is a bit of lag time in adding new domains to the bot's indexing runs...

    Google [google.com] is pretty good at giving relevant results, but it misses a lot of sites. AltaVista [altavista.com] is rather thorough, but not very good at relevancy ranking.

    These observations are simply based upon my own experiences with these engines, so your mileage may vary. When performing intensive searches, I generally use all three, but I'll often start with Ask Jeeves [ask.com], which is easily the best meta-search engine out there...

  • by Anonymous Coward
    Here is another issue with web searching.

    None of the major search engines (even Google) crawl .pdf files.

    They may take you to the document (often that is an issue, note google url) but not crawl through the item.

    Try throwing this or any other pdf url into google or any other search tool. http://www.census.gov/prod/ec97/97cfd2.pdf

    Even the searchpdf.adobe.com engine only searches summaries and is not that large of a dbase.
    This is not a technical issue. Most crawlers can handle .pdf material. Even tools like Atomz are capable of crawling this data.
  • I just yesterday found an essay on this subject. It can be found at http://www.lucifer.com/~sasha/articles/ACF.html He goes on a bit at times, so make yourself some coffee and print it out to save your eyeball(s). It's all about what he calls Automated Collaborative Filtering and Semantic Transport. Of course, I rarely have more than a little trouble finding what I'm looking for, but that may just be that I think I've found the most relevant info, not that I actually have. This paper lays some of the theoretical groundwork for revamping search technology. However, I would be hesitant to give up on the current engines. I think a "smarter" search should be regarded as an addition to the current toolset, not as a replacement. (End user moderation would help cut down on the detritus currently clogging the pipes though!)
  • These guys [hugedisk.com] are claiming responsibility for it.

    Here [wired.com] is a story from Wired about it.
  • by jd ( 1658 ) <imipak.yahoo@com> on Thursday January 25, 2001 @11:11AM (#481376) Homepage Journal
    ...May or may not be the answer.

    However, I suspect that whatever the answer to the search engine problem actually turns out to be, it will have the following characteristics:

    • Context-Sensitive Searching. If a word is NOT relevant to the context, it should be ignored in the search.
    • Dummy-page Detection. A stolen page, or one designed to trigger search engines, which then redirects you, should be ignored.
    • Relevance of Nearby Pages. The more relevant "adjacent" pages, the more likely this is to be what you're looking for.
    • Thesaurus-based indexing. Any word of similar or exact-same meaning should (optionally) be findable.
    • User-evaluation. Search results should be moderatable by users, to eliminate pages designed to beat the system and improve the ranking of pages that are useful.
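
A minimal sketch of the thesaurus-based item in the list above, assuming a tiny hand-made synonym table. `SYNONYMS`, `expand`, `build_index`, and `search` are hypothetical names chosen for illustration; no real engine is claimed to work this way. The synonym lookup is optional at query time, as the comment suggests.

```python
from collections import defaultdict

# Hypothetical, hand-made synonym groups; a real system would need a full thesaurus.
SYNONYMS = [
    {"car", "automobile", "auto"},
    {"fast", "quick", "rapid"},
]


def expand(word):
    """Return the synonym group containing the word, or just the word itself."""
    word = word.lower()
    for group in SYNONYMS:
        if word in group:
            return set(group)
    return {word}


def build_index(pages):
    """pages: dict of url -> text. Returns word -> set of urls containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, word, use_thesaurus=True):
    """Look up one word, optionally widening the search to its synonyms."""
    terms = expand(word) if use_thesaurus else {word.lower()}
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits
```

With the thesaurus switched on, a query for "car" also finds pages that only say "automobile"; switched off, it behaves like a plain exact-word lookup.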
  • Maybe I have some amazing skill that I've just not been aware of until now, but, if you're getting such poor results on google, I'm thinking it just might be you who's doing it wrong. I can search for just about anything, technical or otherwise, and come up with good results.

    That's not to say that it couldn't be improved - I'd love to be able to "for sure" get exactly what I wanted in the top three or four returns, but, often I'm searching for something a bit obscure that is only being described by common words (alas, I can't think of what was vexing me in that department last week).

    But, I think my point is still valid even if this super-search engine comes around: The search is only as good as the searcher allows it to be.

  • I believe that maggard is right about schools not doing their part in the information age, and teaching kids how to effectively use search engines, especially considering the fact that many schools are moving towards electronic card catalogs.

    by asdef

    Huh? Where'd I say that? Clearly your school isn't doing its part in teaching research & attribution.

    On the contrary, I believe many schools *are* doing their part. No not all, but many. Tragically school libraries & school librarians have been tremendously short-changed in the past few decades, ironically often in order to fund sexy things like computer labs.

    The truth is that the skills one needs to use in a library are even more critical now than they were in the past. As you correctly pointed out card-catalogues are dead, I can't think of any post-HS system that still seriously maintains one. Unfortunately the helpful Reference Librarian willing to walk a random person around and re-teach them the ropes has also been budget-cut out of existence. With the information explosion / the information economy the ability to search, prioritize, and compile material has become even more critical (not to mention the ability to comprehend the materials.)

    Corporate knowledge-bases, electronic paperwork, web-based 'employee handbooks', online job searches & apartment rentals; these all require the ability to search for information in an efficient and comprehensive way. Search-engine cluelessness is simply a symptom of a wider problem.

    That said again I believe schools are doing a reasonably good job. I know my old elementary & high schools are teaching kids how to use search engines, as is my old university library. My concern is for those out of the educational system.

    Reading the directions doesn't seem onerous to me. If one is performing searches and coming up empty or with useless material then figuring out how to fine-tune one's searching doesn't seem to require any great intuitive leap. Yes it would be wonderful to live in a world as trivially comprehensible as the doorbell but lacking that most folks have learnt to READ THE DIRECTIONS.

    Generally search engines do a great job of explaining how to use them. There are even search engines that try to out-think the user and parse their natural-language requests into regular search expressions. Google isn't one of these engines; it's a high-powered bare-to-the-metal engine that requires a certain amount of understanding by its users to use. On the other hand there are literally dozens of other engines that *do* walk a person through performing a decent search. The fact that folks pick the wrong tool for the job (a tool they neither know how to operate nor are willing to invest the 1-screen/2-minutes to learn) and then complain about their results seems to be just idiocy on the part of the user (or in this case an article author.)

    Yes, the original article clearly set up a straw-man in order to promote these dedicated search engines, on the other hand there are legions of folks who continue to use search-engines every day with poor results and do complain about them.

    The solution? I dunno - sell them more lottery tickets?

  • Actually I'm blaming folks who insist on using something that gives them poor service yet don't invest the 1-screen/2-minutes to learn how to get good results. Are these folks "victims"? No, they're just idiots: Intelligent folks learn how to select and use tools.

    That point aside I'm trying to figure out the rest of your posting. You don't like the fact that different search engines use different formats? Well pick one and just use it. You prefer a GUI interface instead of a command-line type one? There are lots of those. You'd prefer a walk-through format? There's lots of those too.

    I think you've got a point somewhere but I can't find it. I suppose my only comment would be that folks should, again, pick tools suited for the job. If it's not worth it to them to learn a search syntax then they shouldn't use a search-engine that relies on one (DUH!) Google requires a syntax, many others don't, use one of them.

    As to search-engines getting tricked into returning misleading hits, yeah that's a problem but not a big one. So 5% or even 10% of the hits are come-ons to porn sites, there's still going to be ~30% good hits (the rest misses of varying degrees) and that's enough to be productive with.

    Finally - don't tell someone not to be "smug and negative", I could insert some comments here about the apparent tone of your posting but that wouldn't be productive, let's just say I don't see those in my posting & drop it.

  • by maggard ( 5579 ) <michael@michaelmaggard.com> on Thursday January 25, 2001 @11:06AM (#481380) Homepage Journal
    Of course much of the problem is that few folks actually understand how to search properly.

    Most of us recall being brought into the school library and shown how to use the card catalog, given a few assignments, etc. Unfortunately for those of us out of school there's not that set of skills in place to help searching.

    Boolean searches, using key words, supplying partial words, phrases, etc. are all supported by most search engines but few folks understand how to use them.
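
As a rough illustration of what a Boolean search does under the hood, here is a small sketch over an in-memory inverted index. The function names and the `must` / `any_of` / `must_not` parameters are assumptions made for this example, not any particular engine's syntax.

```python
def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns word -> set of doc_ids."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def boolean_search(index, all_docs, must=(), any_of=(), must_not=()):
    """AND together `must`, OR together `any_of`, then subtract `must_not`."""
    results = set(all_docs)
    for word in must:
        results &= index.get(word.lower(), set())
    if any_of:
        either = set()
        for word in any_of:
            either |= index.get(word.lower(), set())
        results &= either
    for word in must_not:
        results -= index.get(word.lower(), set())
    return results
```

Searching three toy documents for "chavez" alone returns every Chavez; adding a NOT term or a second required word narrows the result to the intended one.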

    What's really surprising to me is that folks who use search engines regularly, indeed even rely upon them (journalists I mean you!) seem some of the most poorly prepared. There are lots of resources for learning how to do a good search, many from the search engines themselves and many more from third parties yet we still get these perennial "I can't find ..." stories.

    Honestly, I'm not into blaming-the-victim but how difficult is it to learn how to perform a good search? One screen of directions? Two minutes of time?

    Yes there's a place for specialized engines handling unique or limited content but most of the larger, more general purpose engines do nearly as well if properly used. Again, it's dependent on the user to learn how to define what they want, all of the tools in the world are no good if they're not taken advantage of.

  • It's currently very hard to search for information on the Web that is less than 2 weeks old. When you're keeping up with current events and industry developments, 2 weeks is just too long to wait for information.

    That's where specialty search engines like Moreover [moreover.com] come in. Eventually, sites like this will let you search those bits of the Web that change often (news sources, weblogs, discussion groups, sites like Slashdot, message boards, financial news, etc.), allowing people to keep up with things as they happen.

    Existing search engines are great at finding things that are archived on the Web, but poor at keeping up with what's currently happening. Looking for all the articles on the latest Shuttle mission, as well as what people are saying about it? You might find one or two things about it on Yahoo! or Google, but a search engine like Moreover will find the fluff article on CNN, the more in-depth article on Space.com, and a discussion about the mission on Slashdot. That's pretty powerful.

  • Your directory search example didn't work too well for me. And while you *can* search Yahoo's news archive, you'll only be searching sources that have a syndication deal with Yahoo. What news search engines provide is the ability to search most major news sources without any non-news sites, and many people want that.
    By the way, the NYTimes story mentions moreover.com, which is a great service. But since their search feature only searches headlines, allow me to mention my own project, NewsBlip.com, which performs full-text searches. Give it a try, thanks!
  • The only solution is to develop a system that is not easily manipulated.

    Or a review system. There is a way to do it, although it might be a pain in the ass. Basically, what you need is a web of trust and digital signatures.

    For example, suppose I have a list of keywords. I submit my page to a reviewer, and they judge whether or not my keywords are a reasonably good match for my page. If I pass the test, they PGP-sign my page.

    Then you just have a modified search engine that only returns pages that have a valid signature by someone who is on a list of authorities that the searcher trusts.

    This type of thing could be used for a more general web page rating or reviewing system. It's just that perhaps some reviewers might judge pages solely on the criterion of meta tags matching the content.
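
The review-and-sign idea above can be sketched roughly as follows. This is only an illustration: Python's standard library has no PGP, so an HMAC with a per-reviewer secret stands in for a real public-key signature (an assumption that changes the trust model, since a real verifier would hold only public keys). The reviewer names, keys, and page layout are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical reviewers and secrets. In a real web of trust these would be
# PGP key pairs, and the search engine would only ever see the public halves.
REVIEWER_KEYS = {"alice": b"alice-secret", "bob": b"bob-secret"}


def sign_page(reviewer, url, keywords, content):
    """Reviewer vouches that `keywords` fairly describe `content` at `url`."""
    message = "\n".join([url, ",".join(sorted(keywords)), content]).encode("utf-8")
    return hmac.new(REVIEWER_KEYS[reviewer], message, hashlib.sha256).hexdigest()


def is_endorsed(page, trusted):
    """True if some reviewer the searcher trusts has validly signed the page."""
    for reviewer, signature in page["signatures"].items():
        if reviewer not in trusted or reviewer not in REVIEWER_KEYS:
            continue
        expected = sign_page(reviewer, page["url"], page["keywords"], page["content"])
        if hmac.compare_digest(expected, signature):
            return True
    return False


def trusted_search(pages, keyword, trusted):
    """Return only keyword matches that carry a trusted, valid signature."""
    return [p["url"] for p in pages
            if keyword in p["keywords"] and is_endorsed(p, trusted)]
```

A page whose keywords or content change after review no longer verifies, so it silently drops out of the results until it is reviewed again.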

  • Interesting indeed. If you go here:

    http://www.google.com/search?q=cache:www.georgewbu shstore.com/+dumb+motherf******&hl=en [google.com], which is the cached link from Google you'll see the following:

    This is Google's cache of http://www.georgewbushstore.com/.

    Google's cache is the snapshot that we took of the page as we crawled the web.

    The page may have changed since that time. Click here for the current page without highlighting.

    Google is not affiliated with the authors of this page nor responsible for its content.

    These terms only appear in links pointing to this page: dumb motherf******

    Very obviously Google uses words from OTHER websites to link back to websites in searches. I'm not sure I like this. This looks as though by someone linking to my website and putting bad words in their website, I could be affected by it.. anyone able to comment on this?
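
The behaviour described above, where a page is found through words that appear only in links pointing at it, can be sketched as a simple anchor-text index. This is a toy model with hypothetical names and example URLs, not a description of Google's actual implementation.

```python
from collections import defaultdict


def build_anchor_index(links):
    """links: iterable of (source_url, anchor_text, target_url).

    Words in the anchor text are credited to the *target* page, even though
    the target itself may never contain them.
    """
    index = defaultdict(set)
    for _source, anchor_text, target in links:
        for word in anchor_text.lower().split():
            index[word].add(target)
    return index


def search_anchor_index(index, query):
    """Return targets whose inbound anchor text contains every query word."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

In this model the target site has no say over which words get attached to it, which is exactly the concern raised in the comment.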

  • Yeah,

    For instance, if you do a search for pornography on google, you can often times get a link to Disney.com. The reason for this is because many porn sites, if you click the, I AM NOT OVER 18 link, take you to www.disney.com.

    This is both something good and something bad in the way that google indexes.

    I have to agree with Arkaein here, it is very odd that someone was able to fool Google into thinking that the GW store was a top linked site. It would be nice if Google were to show you where the reference came from :-)

  • I've many times searched the internet with a search program to find nothing that I was looking for. It's very upsetting when you try to find something. I used to like Ask Jeeves (www.aj.com), but it still wouldn't find what I was looking for, then I found google (www.google.com) and was very happy to find that it did in fact find what I was looking for most of the time. Why can't all search engines look for what you type in?

    I think part of the problem lies in the fact that they match words all over the website, i.e. if I type in "hot green hamsters" the words Hot, Green, and Hamsters can appear anywhere on the homepage; even if I put them in quotation marks the search programs don't always group them together. So a page talking about hot peppers, green peppers, and how hamsters eat the pepper gardens in Mexico would come up in a search, even though it wasn't anything about what I was looking for.
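
The difference between matching words anywhere on a page and matching an exact phrase can be shown in a few lines. This is a simplified sketch with hypothetical helper names; real engines tokenize far more carefully.

```python
def tokens(text):
    """Lowercase words with surrounding punctuation stripped."""
    return [w.strip('.,!?"').lower() for w in text.split()]


def matches_all_words(text, query):
    """True if every query word appears somewhere on the page, in any order."""
    page_words = set(tokens(text))
    return all(word in page_words for word in tokens(query))


def matches_phrase(text, query):
    """True only if the query words appear next to each other, in order."""
    page, phrase = tokens(text), tokens(query)
    n = len(phrase)
    return n > 0 and any(page[i:i + n] == phrase
                         for i in range(len(page) - n + 1))
```

The pepper-garden page from the comment passes the first test and fails the second, which is the grouping behaviour the poster was hoping quotation marks would give.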

  • I'm the webmaster of a Linux website [hardcorelinux.com] and AOLsearch [aol.com] ranks me number one under the keywords "big flabby butt" [aol.com]. Well, unless someone's willing to take a picture of my fat ass, the visitor is going to be sorely disappointed!

  • ...that average people are morons.

    IIRC, Google uses an algorithm that, based on a combination of HTML tag size and logged click-throughs would sort the links. Neato-keen.

    Well, about a year ago when google was still young and fresh, you could type in your search strings, hit the "I'm Feeling Lucky" and get EXACTLY what you wanted. Blew me away time and time with its strange accuracy.

    But, as more of that click-thru data got integrated into the sifting, I got more and more of the crap that the sheep (ie, normal mom and pop AOLer types) wanted to look up. What the hell, man. Don't get me wrong, I still use google, but now I have to scan three pages deep before relevant pages come up.

  • 5) The number of times a particular page is linked to.
    7) The number of times the linking page is linked.

    This is not hard to manipulate. Put lots of links [slashdot.org] in your pages instead of "keywords".
    Once again the porn industry leads the way on the web ;-)
    instead of pages that look like this

    sex sex sex sex sex
    sex sex sex sex sex
    sex sex sex sex sex
    sex sex sex sex sex

    they now look like this

    sex [slashdot.org] sex [slashdot.org] sex [slashdot.org] sex [slashdot.org] sex [slashdot.org]
    sex [slashdot.org] sex [slashdot.org] sex [slashdot.org] sex [slashdot.org] sex [slashdot.org]
    sex [slashdot.org] sex [slashdot.org] sex [slashdot.org] sex [slashdot.org] sex [slashdot.org]
    sex [slashdot.org] sex [slashdot.org] sex [slashdot.org] sex [slashdot.org] sex [slashdot.org]

    Also I think google looks at the url to see if it matches the search word; this is not difficult to manipulate either

    www.foo.com/sex.html has 20 links to www.foo1.com/sex.html etc etc
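
To make the manipulation concrete, here is a small sketch of ranking purely by the number of inbound links, which is the naive scheme this comment is attacking. The function name and the example URLs are hypothetical.

```python
from collections import Counter


def rank_by_inbound_links(links):
    """links: iterable of (source_url, target_url).

    Ranks targets by how many links point at them, with no regard for
    who is doing the linking.
    """
    counts = Counter(target for _source, target in links)
    return [url for url, _count in counts.most_common()]


# Three independent sites link to a useful page...
honest_links = [
    ("http://a.example/", "http://useful.example/"),
    ("http://b.example/", "http://useful.example/"),
    ("http://c.example/", "http://useful.example/"),
]
# ...while twenty throwaway pages all point at the spammer's page.
farm_links = [("http://farm%d.example/" % i, "http://spam.example/")
              for i in range(20)]

ranking = rank_by_inbound_links(honest_links + farm_links)
```

Because every link counts the same, the twenty-page farm outranks the page with three genuine links.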
  • I worked for a web portal that had a well-known search engine on it. Looking through the logs, the most common search was for "", i.e. nothing; the users had just hit the search button. This search was double that for "sex", which came a poor second.
  • I remember reading an interview with Tim Berners-Lee (I would post the URL but altavista can't seem to find it) where he was amazed that everyone wanted to be on the World Wide Web...he had thought that it would fragment into little specialized pieces where each interest group would have their own domain (i.e. mathematics, chemistry, physics, etc) and search engines would be limited to the domain.

    Will we end up there?
  • by WinDoze ( 52234 ) on Thursday January 25, 2001 @11:55AM (#481392)
    For some insight into how truly bad some people are at constructing a search request, check out Disturbing Search Requests [weblogs.com]. It's updated constantly and is a consistent source of wonder and amusement.
  • What did they expect? Google can't read minds yet.

    Almost a year ago (at the beginning of April, to be more exact) they announced this breakthrough technology. I'm not sure why it isn't in use now...

    Don't you love April Fools? :)

  • Google can't read minds yet

    True, Google can't read minds, but if you type "Can Google read minds?" and click the I'm feeling lucky button you'll find out about a nifty feature that's almost as good as being able to read your mind.
  • > User-evaluation. Search results should be
    > moderatable by users, to eliminate pages
    > designed to beat the system and improve the
    > ranking of pages that are useful.

    I like 'em all but that last one. I don't want to search for linux and get a bunch of rootkit sites modded up to (Score: +5, l33t).

    In all honesty though. You'd have to put so much abuse-protection into place in a moderation system for search engines that it would blow your mind. I don't want the web rendered useless by 5cr1p7 k1dd13z who cracked the mod system. Oh well, people suck.

    Justin Dubs
  • Go to Google. You know where to find it.

    Punch in "Dumb Motherfucker".

    Click "I'm feeling lucky".
  • And its been around for a while as a concept. I used to work for SpaceRef [spaceref.com] who maintain an excellent niche search engine devoted to space exploration.

    I maintain Omphalos [omphalos.net] which is a niche search engine devoted to the modern alternative religions (Paganism, Wicca, etc) and related subjects.

    All it really requires is a reliable collection of websites focusing on a specific range of subjects and good search engine software to index their pages. The results are often much more relevant than those from the major search engines - although Google is generally an excellent choice IMHO.
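
The "reliable collection of websites" part of the recipe above amounts to a scope filter applied before indexing or when returning results. A minimal sketch, assuming a hypothetical hand-curated host list (the `.example` hosts are placeholders, not the real sites named in the comment):

```python
from urllib.parse import urlparse

# Hypothetical curated list for a space-exploration niche engine.
CURATED_HOSTS = {"spaceref.example", "nasa.example"}


def in_scope(url, hosts=CURATED_HOSTS):
    """True if the URL's host is a curated host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in hosts)


def niche_results(urls, hosts=CURATED_HOSTS):
    """Keep only the URLs that fall inside the curated collection."""
    return [url for url in urls if in_scope(url, hosts)]
```

Everything outside the curated hosts is simply never considered, which is where the extra relevance comes from and also why coverage depends entirely on the quality of the list.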

  • I see lots of people trying to extract content from scientific journals using natural language processing. Tough problem. Getting the content isn't hard; it is trying to sort out what it all means. For example a computer might predict that method yyy is derived from method xxx simply because the sentence refers to both of them:

    Unlike method xxx, our method yyy does something completely different, unrelated and totally offtopic.

    Although you could envision ways of sorting through this example, realworld examples can be far more abstract and disjoined.
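
The false positive described above is easy to reproduce with a naive co-mention extractor. The sketch below is illustrative only: the function names are hypothetical, and the contrast-cue list is a hand-picked assumption standing in for one of the "ways of sorting through this example" the poster alludes to.

```python
from itertools import combinations

# Hypothetical cue phrases that suggest two methods are being contrasted.
CONTRAST_CUES = ("unlike", "in contrast to", "as opposed to")


def co_mentioned_pairs(sentence, methods):
    """Naive relation guess: any two known methods in one sentence are 'related'."""
    lowered = sentence.lower()
    present = [m for m in methods if m.lower() in lowered]
    return list(combinations(sorted(present), 2))


def looks_contrastive(sentence):
    """Crude check for wording that signals the methods are NOT related."""
    lowered = sentence.lower()
    return any(cue in lowered for cue in CONTRAST_CUES)
```

On the sample sentence the naive extractor happily pairs the two methods, and only the cue check hints that the pairing is wrong; less formulaic sentences would defeat a cue list like this quickly.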


  • They searched for "Chavez" and then complained that there wasn't any information on Linda Chavez (the nominee for Labor Secretary).

    What did they expect? Google can't read minds yet.

    Bunch of mojacks.


  • and I found Harrison Ford - damn good movie tho - too bad the trailer ruined it
  • Honestly, I'm not into blaming-the-victim but how difficult is it to learn how to perform a good search? One screen of directions? Two minutes of time?

    Yes you are blaming the victim. The basic concepts of searching take less time to learn than fancy terms like "boolean". Ideals are nice, but the devil is in the details. Search engine sites perform a difficult task and some do a first rate job. For that they should be thanked, but nothing is perfect.

    What confounds the user mostly are all the syntaxes used to express those concepts. They are different for every site and take some getting used to. It would be neat to see a search engine with more than one line for input. You could have a box for exact phrases, one for any-word matches, an exclusion box... It's not that command line syntax is ugly, it's that most people have better things to memorize.

    Another thing that confronts the user is the efficiency of the search itself. Very clever people constantly seek to fool search engines, and occasionally do. The result is garbage to wade through until the search engine can recover. I remember a time when all search.com would retrieve was porn sites. Even Google has been beaten a few times.

    Let's not be so smug and negative. Look for the opportunities presented by user confusion. Be happy that these new search engines are coming.

  • Most web sites have an advanced search. In this advanced search you'll find an option to search for the exact phrase. If you enter "hot green hamsters" there, and search, it will return web sites that only contain that phrase.

    Some search engines do this if you enter the phrase in quotation marks, too.

  • They searched for "Chavez" and then complained that there wasn't any information on Linda Chavez (the nominee for Labor Secretary).

    No, the problem is that Google (and every other major search engine) takes forever (weeks to months) to spider new pages. So just after Chavez was nominated, none of the news articles about her had been indexed. By the time the spiders hit them, she'd already been dumped.

    The basic problem is that HTML spidering is a horribly inefficient way of indexing information that is often (especially in the case of news articles) stored in a nice, neat database.
  • Thanks for proving my point. My company, Thinkstream [thinkstream.com], is working on search engine technology that overcomes just this sort of problem. Instead of centralized, HTML spider-based search sites that cough up stale data, our technology is centered around live connection to diverse, distributed data sources (especially databases), regardless of storage method.
  • However there is a difference between niche searching, using "focused/targeted crawlers", and search engines that provide access to material that no search tool crawls (Invisible Web).


    Nuclear Explosions Dbase
    http://www.ausseis.gov.au/information/structure/ is d/database/nukexp_query.html

    Finally, it could also be asked: even if this material was crawled, would the lack of an interface and search capability tailored to that data (specific sorts, etc) make pulling that material out of a massive dbase (Google, AV, Excite, etc.) effective?

  • by Lord Omlette ( 124579 ) on Thursday January 25, 2001 @10:51AM (#481407) Homepage
    If your friends have sites but not too many people link to them, they won't rank too highly in Google's eyes, will they?

    A Google search for 'dumb motherfucker' will yield George W. Bush's website, how inaccurate could Google possibly be?

    "a Google search on "chavez" led to several encyclopedia entries on Cesar Chavez" Would it have fucking killed them to type in "Linda Chavez labor secretary"? And this was very recent news, exactly how quickly do you expect Google to scan the entire internet for updates? How quickly could these 'iceberg drilling' search engines possibly scan the net? It's a deep web right now, what's invisible will bubble to the surface if it's relevant... Maybe they have a point on using the search engines to only scan specific areas, but I think websites which specialize in these areas should license the Google engine instead of Excite's... (you know what I'm talking about right? Every big site has some article you want to find, you go to look for it, you get the worst search interface possible that doesn't return any useful links...)
    Lord Omlette
    ICQ# 77863057
  • by rgmoore ( 133276 ) <glandauer@charter.net> on Thursday January 25, 2001 @11:18AM (#481408) Homepage
    Plus, don't you think it would be much easier if people actually didn't try to cheat search engines?

    Yeah, and it would be great if nobody stole money and gave to charity, too. It just isn't going to happen. Any system that is A) valuable and B) depends on everyone behaving honestly is doomed to failure. You're never going to get people to stop cheating the search engines as long as doing so is both possible and beneficial to the cheaters. The plain fact is that manipulating the system works, and people are going to keep doing it as long as it keeps working. The only solution is to develop a system that is not easily manipulated.

    5) The number of times a particular page is linked to.

    7) The number of times the linking page is linked.

    Perhaps you should try looking at Google [google.com], a search engine that actually uses these in a clever way as the key part of its ranking system. It's remarkably effective at finding relevant information and at avoiding the kinds of simple manipulation you complain about. Other ranking schemes (like GoTo.com [goto.com]'s straight pay for placement system) are also relatively resistant to manipulation. I think that the long term solution is going to be natural selection; search engines that are easy to manipulate to give lousy results will go out of business and leave behind the ones that are actually useful.
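
Points 5) and 7) together are the core of link-based ranking in the PageRank style: a link counts for more when the page it comes from is itself well linked. The sketch below is a simplified textbook power iteration, not Google's production system; the damping value, iteration count, and page names are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict of page -> list of pages it links to. Returns page -> score."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its own rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages with no outgoing links spread their rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

In this model a page with one link from a well-linked hub outscores a page with one link from an orphaned farm page, which is why stuffing a site with links from pages nobody links to buys comparatively little.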

    Personally, I think that it would be great if there was an editing team that would simply delete misrepresented pages.

    Good luck. The latest versions of Google include over 1 billion pages. Manual sifting for poorly labeled ones just plain isn't an option if your primary goal is comprehensiveness.

  • The proverbial iceberg of data on the net lives in databases not accessible to search engines as we know them today. The power and complexity of the little engine that could would be far too sophisticated for the public to be allowed access to. It'll be interesting to see how they pull off the privacy end of the whole thing...
  • by shalunov ( 149369 ) on Thursday January 25, 2001 @11:08AM (#481410) Homepage
    The article talks about hand-picked sites to crawl to "eliminate irrelevant results". Isn't this what directories [dmoz.org] are about?

    Why does one need cheesy [financialfind.com] dotcoms [fuckedcompany.com] to tell us what a directory [yahoo.com] is?

    A directory search [google.com] limited to U.S. newspapers immediately brings up, say, an explanation [washtimes.com] by Linda Chavez about her relationship with the illegal alien in question.

    If one wants political news, one can go to a political news source [yahoo.com]. If one wants information on Linda Chavez, one can do a more specific search [google.com]. If one wants political news about Linda Chavez, one can (this must be getting very complex for your average dotcom founder) search a news archive [yahoo.com].

  • The premise of the article is good, but I feel that in a way that theory would stunt some acquisition of knowledge. Often in my own web searches, while seeking information about a certain specific subject or theme, I have come across other topics that interested me that had absolutely nothing at all to do with my original criteria. I know that this is commonplace, but it just reiterates the whole miracle of the internet to me: Not just information is available, but all kinds of information.

  • Sorry for the wrong links...

    There are spaces in the middle of the two last. Delete them and they will work.
  • As the article says: "People may know to come to the library, but they probably do not know which reference books to pull off the shelf. Of course, in such cases, patrons can at least consult a reference librarian."

    In the example given by the article a "linda chavez" or "linda chavez labor secretary" query would be much better than the ordinary "chavez".

    Moreover, there exists the problem of determining the category of what is being searched. A trend is the use of AI and ontologies by the search engines, which determine what is really relevant in a page and classify it during the indexing phase based on the different categories (economy, medicine, technology, entertainment, ...) defined by the taxonomy used. In other words the idea is to search the meaning not the words (see also www.oingo.com).

    What the article talks about are the knowledge based agents. A quite interesting article can be found at: http://www.cs.technion.ac.il/~cs236512/www-search- lab/ka/KnowledgeAgents.htm

    Another interesting link:
    - CMU World Wide Knowledge Base (Web->KB) project:
    http://www.cs.cmu.edu/afs/cs.cmu.edu/project/the o- 11/www/wwkb/index.html
  • Actually you can see the list of links. Just click on the link below and there you go. Didn't find anything interesting but I only went through the first few pages.

    http://www.google.com/search?sourceid=navc lient&q= link:http://www.georgewbushstore.com/

  • This is a really in-depth mind-blowing thought. You mean we should use the hammer for nails and the screwdriver for screws.

    Seriously folks. The article is just saying use the right tool for the right job. It's a no-brainer. If you want news stories you search cnn.com or another news site. If you are looking for financial information search The Wall Street Journal [wsj.com] or another financial page. Search engines like google (get the toolbar, it is great) or AJ are for general searches to get you started out on a topic so you can refine your search from there. Duh.

  • by Pinball Wizard ( 161942 ) on Thursday January 25, 2001 @12:43PM (#481417) Homepage Journal
    ...is a search that can read our minds and instantly infer the most relevant results.


    Searching for "John Smith" should return my friend John Smith and no one else.

    Searching for "C++ implementation of Knuth algorithms" should return exactly that, and leave out references to C++, Knuth, or algorithms.

    At the very least, large search results should immediately separate the mass of results into categories - i.e. "Jessica Alba" - up at the top should be pr0n - fan sites - commercial sites - etc. Yahoo does this, but there are way too many categories. Really, the web has maybe 10-12 different broad types of sites - commercial, homepages, academic sites, pr0n, multimedia, weblog - you get the point, the list isn't that long. We should be able to filter entire broad categories out of our searches. Altavista does a fairly good job with multimedia searches - unfortunately there still is way too much manual searching - it still doesn't read our minds enough within the broad category search.

    Google uses PageRank to determine the order of results, but does it track the sites its users click on after performing a search? No, but it should. Further, it should track users individually and be able to customize its results based on that person's individual personality. The more you use a search engine, the better it should work for you.

    I can't stress this enough: A search engine needs to be able to read our minds.

  • That is one of the coolest things I have ever seen. In particular, I like the fact that they are going after whoever did it with their lawyers. And the HTML on the page is all hosed up. Thanks for the laugh.
  • But she is indexed, and the top ten references are related to her.

    Searched the web for Linda Chavez labor secretary. Results 1 - 10 of about 1,390. Search took 0.08 seconds


  • by gallir ( 171727 ) on Thursday January 25, 2001 @01:54PM (#481420) Homepage
    They already realised it. You can read in the page:

    (Note: If you have arrived at this site through inappropriate references via a search engine, please be assured that we did not utilize this language in our site, our HTML, nor in our internet promotion of this site. What happened was the result of a malicious act and we are pursuing remedies through the efforts of our staff and attorneys.)

    I hope I am not liable in Spain for using those words. Please don't tell them where Spain is.


  • I'm sure that the guys at Google are just as interested as you are, and are poring over logs + data-files trying to figure out how somebody 'cheated' the system.
  • Actually, you'd need "news +for nerds" since the 'for' is normally ignored, being such a common word.
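    A rough sketch of what that means in practice, assuming a made-up stop-word list and a simplified reading of the + operator (not any real engine's parser):

```python
# Hypothetical stop-word list; real engines each used their own.
STOP_WORDS = {"for", "the", "a", "an", "of", "and", "to", "in"}

def parse_query(query):
    """Return the terms that would actually be searched on."""
    terms = []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            # A leading + forces the word to be kept, even if it is common.
            terms.append(token[1:])
        elif token not in STOP_WORDS:
            terms.append(token)
    return terms

print(parse_query("news for nerds"))   # ['news', 'nerds']
print(parse_query("news +for nerds"))  # ['news', 'for', 'nerds']
```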
  • Yeah, I've read that... It was hysterical! I think I'll go read that again. Thanks for the link.
  • by zombieking ( 177383 ) on Thursday January 25, 2001 @11:44AM (#481424)
    I asked Jeeves "Where can I find a good search engine?" and was directed to a really good site where I can buy engine parts for my car online.

    Thanks for nothing you bastard butler!
  • Considering that a google search for friends' web sites and other good stuff usually turns up more dirt than paydirt, it's pleasant to contemplate more relevance in search engines.

    I disagree. I continually find close matches using Google, much better than anything I used previously (Hotbot was good for a while).

    When Yahoo started using them I rejoiced. It was the best of all possible worlds (good search engine, web content like the calendar, and hand-picked sites when all else failed).

  • The Cliche is:

    90% of everything is junk

    In truth, it may be more than that.

    So we come to the needle in the haystack, and how the databases that the search engines consult give priority to different terms, how they index the various sites, and how long it takes.

    Of course, for the person truly expert in these things, these are trivial details. They are as obvious as a traffic jam. For the rest of us, it is more a matter of "where did all these cars come from?"

    Unlike our computer, there is no central index for the full content of the web. It is a job that is done continuously at a surface level, and takes a month or two or three.

    In that context, of course last night's news will not get indexed while we wait.

    Just like the tradition of game installation, search engines have been designed to be used by people who have a clue.

    Sometimes I swear that until we get a system designed by geniuses to be used by idiots, we will need to have some sort of internet user license or something. Otherwise it is simply a matter of designing systems that can obey the command:

    "Do what I want, not what I say."

    This is an interesting problem in programming, is it not?

  • HugeDisk apparently takes sole credit for this issue. I doubt only they could do it, seeing as how much more popular sites (like Everything2 [everything2.com]) use the same words. But they do take the credit in their story [hugedisk.com] (use cached version [google.com] in case of slashdot effect)
  • ...is a search that can read our minds and instantly infer the most relevant results.
    Or people could learn some more skills and actually specify what they want more accurately. I rarely have any trouble finding what I want with Google.
    Google uses PageRank to determine the order of results, but does it track the sites its users click on after performing a search? No, but it should.
    It's not really that useful because there's no way for Google to tell if you are happy with the page you clicked on. They might be able to implement it via the toolbar, but it'd probably still require the user to tell them manually (something along the lines of the "did you get what you want?" question you see on a lot of site searches).
  • The problem is also that people expect search engines to do their thinking for them (an expectation admittedly encouraged by search engines themselves). Search engines are algorithms. The content generated is a transformation of what you put in. If you know jack about what you are looking for, this will often show in the results. If you know something to start with, your searches are likely to be more successful. Of course, you can always begin to intuit the workings of particular search algorithms -- that's how people get used to or get proficient with one engine but can't use others ...
  • Username: cyph3rpunk0
    Password: cyph3rpunk

    Enjoy the article.
  • Google was great up until a few months ago (maybe even a year), when they started throwing around lots of publicity and at some point they explained to the whole stupid world in simple terms how they made Google better than the rest of the crop (i.e. the PageRank system). At that point, the web-spammers, those pathetic fscks who spend their whole lives making content-free pages with only links and banners and popups, well, they figured it out. They started creating zillions of ditzy pages containing truckloads of keywords and only one link to the "real" website they were spamming about.

    The concept isn't new, it's just the sheer volume that made Google freak out. The reason behind it is that Google counts the number of links leading to one page as an indicator of that page's actual popularity. So the spammers simply created hundreds, thousands of dummy pages with single, prominently-placed links which fooled Google's crawler.

    The temporary solution, as always, will be to come up with a new crawling method that can filter out these poison pages, but of course it will only be a matter of time before someone "cracks" the new crawler. History repeating.
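    To make the trick concrete, here is a toy sketch. It models only the raw "count the links pointing at a page" idea described above, not Google's actual ranking, and every page name in it is made up:

```python
from collections import Counter

def inlink_counts(links):
    """Count distinct pages linking to each target.

    links is a list of (source, target) pairs.
    """
    return Counter(target for _source, target in set(links))

# Made-up pages: three honest links versus a farm of 1000 dummy pages.
honest_links = [("blog-a", "good-site"), ("blog-b", "good-site"), ("news-c", "good-site")]
spam_links = [(f"dummy-{i}", "spam-site") for i in range(1000)]

counts = inlink_counts(honest_links + spam_links)
print(counts.most_common(2))  # [('spam-site', 1000), ('good-site', 3)]
```

    A ranker that trusts the raw count puts the spam site on top, which is why the quality of the linking pages has to enter into it somehow.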
  • The example they used at the beginning of the article was fixed, they just typed "chavez" into a search engine, not "linda chavez". Of course they got tons of irrelevant links. You'd think the reporter could have picked up on what a bad example this was. I'm not saying that the search engines don't have flaws, but they could have picked something that demonstrated their point much better.
  • Upon a successful search they provide a choice at the bottom: "Try your query on altavista, deja, etc."

    Upon an unsuccessful search they do not offer you the choice.

    Obviously, they have no responsibility to offer it, but it's kind of slimy that the time you want the option the most, it's not offered.

    Also exact string searches are a little weird, particularly if you forget the +s for common words like "the."

  • I have a little experience at this. Keeping the trash out takes a lot of tedious work. I usually have to set a day aside to dig through the ton of unrelated crap that my site [radicalmatter.com] collects. I started it as sort of a way to keep up with what I thought was cool and exciting as I learned more and more about Linux. It ended up being yet another dreaded responsibility.

  • Ex-Squeeze me ?
    This is /. and we are teaching each other how to operate common search engines?
    People who regard this as useful information should not under any circumstances be reading /.


  • Well, the article speaks of a lot of things, mostly though, it's links to specialized search engines. It gives the impression that in order to really find what you are looking for, you should use a highly specialized search engine. I disagree a bit on that.
    I know there are companies out there that have the technology to "put it all in one", so to speak. I have worked a little with Autonomy [autonomy.com], and I gotta say, I am deeply impressed by what it does. They employ a technology called Bayesian inference (after Thomas Bayes). The technology has to do with "calculating the probabilistic relationship between multiple variables and determining the extent to which one variable impacts on another". Sounds wild, eh? Well, it is. Together with this, their core engine, called DRE (Dynamic Reasoning Engine), relies on the theory of Claude Shannon, which states that "the less frequently a unit of communication (for example a word or phrase) occurs, the more information it conveys".
    The more input you give it, the more accurate it will be. Oh, and it's actually for all kinds of unstructured information - also e-mail.
    I ramble. You should check it out.
    Autonomy also makes Kenjin [kenjin.com], which is a piece of software that you install that will understand what you are looking at, and help you search for similar stuff. Kinda kool.
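    For the curious, the Shannon quote can be illustrated with a few lines of Python. This is only the textbook self-information formula, -log2(p), applied to word frequencies in a made-up sentence; it is not Autonomy's DRE, whose internals I am not describing here:

```python
import math
from collections import Counter

def self_information(words):
    """Bits of information per word: -log2(relative frequency)."""
    counts = Counter(words)
    total = len(words)
    return {w: -math.log2(c / total) for w, c in counts.items()}

# Made-up sample text: "the" appears 3 times, "and" twice, the rest once.
text = "the cat and the dog and the voodoo doll".split()
info = self_information(text)

for word, bits in sorted(info.items(), key=lambda kv: kv[1]):
    print(f"{word:8s} {bits:.2f} bits")
```

    The common words come out at the bottom of the scale and the rare ones ("voodoo", "doll") at the top, which is the sense in which a rare word tells you more about what a document is about.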
  • The problem most users have is NOT with syntax and boolean functions, it's simply that they're rarely being logical and specific. I mean really specific.

    E.g., 3 people I work with were trying to find the name of the Abbott & Costello movie with the voodoo-doll-making witch in it (don't ask). They were searching and searching for made-up names, years, actors' names, ad nauseam, when 'Abbott Costello witch voodoo' (click I'm Feeling Lucky) brought it right up.

    The trick is visualizing and then boiling down your desired target text to specific unique words and then searching for those words. Sounds obvious but most people still expect technology to have animate, responsive, understanding qualities like it does in the movies.
  • The "search industry" is just that: an industry. With each passing day all the big players become more and more dependent on capturing seekers rather than helping them find what they are looking for. The business plan is to sell you something, and not to give you anything, especially easy access to someone else's store. The fact is that without a non-commercial alternative, the same forces that destroyed the great potential of television will close all roads on the internet that are not connected to their toll booths. Content that can't be found is content that might as well not exist. I think the destructive consequence of there being no clean, logical, complete index of the web is slowing the internet's growth, and if nothing changes no individual or small business will have any incentive to participate as a provider of content. The fact that I can walk to a local business in less time than it takes to find their web site is a sad and shameful commentary on how pitifully broken the superhighway is.

    I have suggested a "fix" for those who give a crud. See this [ezboard.com]

  • I believe that maggard is right about schools not doing their part in the information age by teaching kids how to use search engines effectively, especially considering that many schools are moving towards electronic card catalogs.

    I personally am self-taught in the 'art' of searching the net. This includes using boolean operators and, as previously mentioned, putting quotes around phrases. Why can't schools teach these useful skills?
  • I use Google. Google is great and you're right, it does implement many of those features. Without cheaters, the fact still remains that a majority of sites fail to use description and keyword metatags.

    "Good luck. The latest versions of Google include over 1 billion pages. Manual sifting for poorly labeled ones just plain isn't an option if your primary goal is comprehensiveness"

    Well, the idea wouldn't be to look at ALL the pages, but rather, the main sites themselves. Porn is easy enough to find. By looking at the frontpage of sites, you'll be in better shape, even if you are just removing several million cheater domains.

    Heck, you could hire high school students to determine if a site is cheating or not. Not that difficult.
  • Trust me it can read minds: Google Mentalplex [google.com]
  • by Cspine ( 263118 ) on Thursday January 25, 2001 @10:49AM (#481442) Homepage
    The problem isn't the searches, it's the people who make the webpages.

    Why doesn't everyone use metatags properly? What about specifying good (descriptive) title tags?

    Plus, don't you think it would be much easier if people actually didn't try to cheat search engines?

    In actuality there would then be some very easy ways to score pages for relevance:
    1) The number of times a particular word shows up in the keywords, and description of the page.
    2) If the word actually appeared in the title of the page.
    3) The number of times the word appears in the body of the text
    4) The length of the supposedly searched word
    5) The number of times a particular page is linked to.
    6) The words used in the link
    7) The number of times the linking page is linked.
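
    Something like the following sketch, where the Page fields, the weights and the sample pages are all invented for illustration (no real engine is being described, and it assumes nobody is cheating):

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    title: str
    keywords: str
    description: str
    body: str
    # Each inbound link: (anchor text, number of links to the linking page)
    inlinks: list = field(default_factory=list)

def count(word, text):
    """Number of whole-word occurrences of word in text (case-insensitive)."""
    return text.lower().split().count(word)

def score(page, word):
    """Combine the seven factors with arbitrary, made-up weights."""
    word = word.lower()
    s = 0.0
    s += 3.0 * (count(word, page.keywords) + count(word, page.description))  # 1
    s += 5.0 if count(word, page.title) else 0.0                             # 2
    s += 1.0 * count(word, page.body)                                        # 3
    s += 0.1 * len(word)                                                     # 4
    s += 2.0 * len(page.inlinks)                                             # 5
    for anchor, linker_inlinks in page.inlinks:
        if count(word, anchor):
            s += 2.0                                                         # 6
        s += 0.01 * linker_inlinks                                           # 7
    return s

widgets = Page("Acme Widgets", "widgets gadgets", "We make widgets",
               "Our widgets are the best widgets",
               inlinks=[("cheap widgets", 40), ("click here", 5)])
homepage = Page("My Homepage", "", "", "I once bought widgets")

for page in sorted([homepage, widgets], key=lambda p: score(p, "widgets"), reverse=True):
    print(page.title, round(score(page, "widgets"), 2))
```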

    Wouldn't the world be happier? Personally, I think it would be great if there were an editing team that would simply delete misrepresented pages.

    Anyway. That's my two cents.

  • I disagree with the line about searching for stuff on google turning up dirt. If you know how to format a search properly, and which words are key, nearly anything can be found on google.

    OTOH, it is always nice to see search technology getting better. There are some simple ideas which would aid searching, such as voluntary self-classification of web sites into general categories (I'm sure this could easily be worked into one of the emerging document standards, if it hasn't been already). This would effectively divide the internet into a large number of overlapping sub-nets, as far as searching was concerned -- you could search everything, or just websites pertaining to 'games', etc. I think that a solution along these lines (although probably a better/more complex version) will be necessary before truly powerful searching becomes easily available.

    I can't envision some complex algorithm and/or a team of people classifying stuff ever being a strong solution without the aid of enhanced standards for the web.

    -Robert Thornburg

  • That's the whole point of google, it generates its results based on the significance of the site in question based on its relevance in the web's "big picture".

    It is pretty strange, though, that anyone was able to fool Google into making it the top site returned for the query. Google gives links from highly visited sites more relevance, so some little two-bit web site won't have a great deal of influence.
  • Why does every fool out there submit stories that I have to jump through a hoop to get to? Give me YOUR stinkin' password or stop advertising for the Times. Do your own legwork and find the same story at another newspaper. Attach THAT URL, and stop working for the NYT, you huckster. Good god folks, think before you post info-demanding URLs.
  • If any system will ever gain self-awareness *without* its programmers' permission, a la Skynet, it will be a search engine.
  • your post might be insightful. In reality, however, Google does not log click-throughs. The links you click go straight to where they say they do. It's possible that they log which pages in their cache are accessed, but I doubt this contributes to a page's rank. As explained on their site, Google ranks pages based on which sites link to where.

    It may be true that as google indexes more newbie home pages, the average quality of the links it sees is going down, but that's another issue.

  • If we're talking about specialized search engines, then don't we need some way to know which sites to search? What is the feasibility of creating a system where metadata about a site is entered in a database tied to domain registration? When I go register widgets.com I can specify that I'm commercial, serve North and Central America, manufacture widgets, etc. Metadata about my individual pages could provide more detail, but the metadata at the domain level would direct the specialized search engine to my site in the first place. It just seems to me that even if a search engine is specialized, it needs some way to find appropriate sites without brute-force searching the net, or it will still have the same problems unless it has the manpower to filter the results.
  • Actually, I find that the most successful searches are ones which make use of a combination of search-engine techniques (quoted strings, boolean algebra, etc.) and intuition and/or common sense.

    Defining search criteria in such a way as to guarantee that your desired target shows up among the first page of hits is a bit like trying to find a jet that will take you to a specific street address.

    I usually search on a phrase that is likely to take me into the approximate quadrant of the haystack containing the needle, then take a scenic ramble through the neighborhood until I link my way to my destination. If the destination is not well linked-to, you won't likely find it with a search engine anyway.

    My apologies for the gruesome mixed metaphors! :-)

  • Interesting, but I don't know where you got your information, and it seems likely to be wrong. A Google representative gave an excellent talk here at the U of I [uiuc.edu] and explained as much as he could without giving away their "secrets". Actually, they make much of the information available [google.com] already. The key component is the PageRank system, which assigns a rank to a page based on how often it's linked to. Obviously, this isn't at all trivial in a graph as complicated as the web (esp. considering the amount of dynamic content around these days).

    In any case, the point is that I specifically asked the guy (he was actually one of their engineers, responsible for Google's SafeSearch, IIRC) if they did any click-through analysis to try and improve the relevance of their results. He responded with an emphatic "no". They believe it's just too much of a privacy concern.
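
    For anyone wondering what "rank based on how often it's linked to" looks like in code, here is a minimal sketch of the power-iteration form of PageRank as it is usually described in public. The four-page graph is made up, the damping value of 0.85 is just the commonly quoted one, and none of this claims to be what Google runs in production:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Made-up graph: a, b and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

    Note that "a" ends up well ahead of "b" and "d" even though each of them has the same number of outgoing links, because its single inbound link comes from the highest-ranked page.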
  • You should read SatireWire [satire-wire.com]'s hilarious article "Interview with the Search Engine" about Ask Jeeves. One of the funniest things I've seen on the net. magi_caspar
