
News for nerds, stuff that matters

Search Engines-Does Obscurity Prevent Exploitation?

Posted by Cliff on Wed Sep 13, 2000 06:51 PM
from the how-does-your-search-engine-work dept.
GeekLife.com asks: "Search engines refuse to release (and often change) the exact criteria that determine their ranked results, presumably both to prevent competitors from stealing their techniques and to stop (or at least make less successful) attempts at "cheating" - optimizing a site to exploit these criteria, resulting in a higher ranking than it deserves. Is this an example where keeping the specifics a secret actually improves the tool? Or would releasing all the rules result in enough feedback ('given enough eyeballs...'), honing the criteria towards unexploitable results?" Interesting thought. Can current systems be improved to give better results, or have we reached an 'accuracy limit' as far as keyword-based searching is concerned?
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Unexploitable? .... -1 flamebait by Emugamer (Score:1) Wednesday September 13 2000, @01:55PM
  • C;mon, open it up! by jailbrekr2 (Score:1) Wednesday September 13 2000, @01:56PM
  • Unfortunately not. (Score:3)

    by Hanzie (16075) on Wednesday September 13 2000, @01:56PM (#781000)
    Since there's a strong incentive to get your site listed in the search engines, the search criteria will always be exploited.

    A friend of mine left the company I work for and started making porn pages for an Australia-based porn company.

    He is supposed to make 400 pages per month, all somewhat different. He gets a bonus based on how many hits are generated, and a commission based on signups from his banner ads.

    He's doing pretty well financially.

  • by vertical-limit (207715) on Wednesday September 13 2000, @01:56PM (#781001)
    Have we reached an 'accuracy limit'? Not for now, at least. While search engines have been improving, there's still a long way to go before they can serve up the correct page 100% of the time. Obviously, it's impossible for the search engine code to emulate the human brain; there's no way to tell exactly what the searcher wanted. Instead, search engines can only "guess", which is why you always end up with a few oddball results.

    The only way to achieve true search engine accuracy is to have an actual person search for pages on request. Why no company has thought of this, I'm not sure, as this could certainly be an explosive business opportunity. The difficulty of finding trustworthy information on the Internet is legendary, and I'm sure plenty of clueless newbies would pay a monthly fee to get better search results.

  • criteria. by mtvsucks (Score:1) Wednesday September 13 2000, @01:57PM
  • by jmv (93421) on Wednesday September 13 2000, @01:59PM (#781003) Homepage
    I think the many-eyeball argument doesn't apply here, because it's not about finding bugs in the rules, but about preventing cheating. When a site decides to "cheat", it doesn't exploit a bug. The scoring system is not like a kernel, where you know exactly what should happen; it's a (generally) complex AI system. These systems are designed so that they work well enough 95% (or 80%, this is not the point) of the time. They're going to miss a few percent, but what else can you do? Now if you have access to the rules, you can make sure your site uses the 5% of errors to get to the top of the list. Unless someone thinks he can have 100% accuracy (how do you measure accuracy anyway?), the scoring rules shouldn't be released.
  • If everyone exploited the criteria... by ripicheep (Score:1) Wednesday September 13 2000, @01:59PM
  • Exploit for Google? by Adam9 (Score:1) Wednesday September 13 2000, @02:01PM
  • Limits (Score:4)

    by um... Lucas (13147) on Wednesday September 13 2000, @02:01PM (#781006) Journal
    I think that we may have reached an "accuracy limit" with search engines until such time that people don't mind search engines leaving cookies on their hard drives, so they can examine a user's past queries and use those to try to present more relevant results for that user's current query. I really think that will be the only way for them to grow, because most search terms I've seen (basically, referrer logs for my site and a few other sites I've worked on) consist of only three or fewer words. It's a rarity that someone enters more than that, so that doesn't give a search engine much to work with...

    However, if, say, Google knew that I'd done searches for "albini" and "shellac" in the past, it could probably surmise that when I did a search for "big black", I'm actually looking for Steve Albini's first band, and not BIG BLACK BOOBS, et al...

    I can't figure how else something like that could be accomplished without a sacrifice of our hope for privacy...
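
A minimal sketch of the history-based idea above, not any real engine's method. The `CANDIDATES` table, the page terms in it, and the `rerank` function are all hypothetical illustration names and made-up data. Each candidate result carries a set of terms, and results whose terms overlap the user's past queries float to the top.

```python
from collections import Counter

# Hypothetical candidates for the query "big black": title -> terms on the page.
# This is made-up illustration data, not real index content.
CANDIDATES = {
    "Big Black (band) discography": {"albini", "punk", "band", "chicago"},
    "Big black bear sightings": {"bear", "wildlife", "forest"},
    "Big black leather sofas": {"sofa", "furniture", "leather"},
}

def rerank(candidates, past_queries):
    """Order titles by how often their page terms occur in past queries.

    Ties keep their original order because Python's sort is stable.
    """
    history = Counter(term.lower() for q in past_queries for term in q.split())

    def overlap(title):
        return sum(history[term] for term in candidates[title])

    return sorted(candidates, key=overlap, reverse=True)

ranked = rerank(CANDIDATES, ["albini", "shellac"])
print(ranked[0])
```

With an empty history every overlap is zero and the original order is returned unchanged, which is the privacy trade-off the comment describes: no stored queries, no disambiguation.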
  • New AltaVista is now a squirrel tech Google by spudboy (Score:1) Wednesday September 13 2000, @02:01PM
  • I like inaccuracy in search engines by un_eternal (Score:1) Wednesday September 13 2000, @02:02PM
  • goooooooogle by nutty (Score:1) Wednesday September 13 2000, @02:02PM
  • I vote for obscurity... by Kierthos (Score:1) Wednesday September 13 2000, @02:02PM
  • Re:Unexploitable? .... -1 flamebait by Anonymous Coward (Score:1) Wednesday September 13 2000, @02:03PM
  • Didn't we debate this already today? by Tairan (Score:1) Wednesday September 13 2000, @02:04PM
  • Ranked by referring pages by duckworth (Score:2) Wednesday September 13 2000, @02:04PM
  • by Xandu (99419) <matt.truch@net> on Wednesday September 13 2000, @02:05PM (#781014) Homepage Journal
    The Google "algorithm" is explained on the Why Use [google.com] page on Google [google.com]. Although it doesn't give the *exact* code used, it explains (in English) the whole process pretty well.
  • Re:Unfortunately not. by Anonymous Coward (Score:1) Wednesday September 13 2000, @02:05PM
  • A matter of time (Score:3)

    by jjr (6873) on Wednesday September 13 2000, @02:06PM (#781016) Homepage
    With enough experimenting someone can find out how the system works. Whether through keywords, page text, bribing, etc., people will find out how it works. It's just a matter of time.
  • Re:Search engines can -always- be improved by Hanzie (Score:2) Wednesday September 13 2000, @02:07PM
  • by MarcoAtWork (28889) on Wednesday September 13 2000, @02:08PM (#781018)
    Well, while there are user-submitted lists of sites (Yahoo, whatever), I think it's just about time for a moderated search engine.

    The users could submit links in different subjects or categories with different keywords, adding them to the harvested ones, and, most important, registered users would be able to get x moderator points a week and vote down spam links or links that don't make much sense for the search one conducted.

    Add a healthy dose of meta-moderation (maybe three levels) and some obvious anti-cheating techniques and it should work much better than a normal search engine.

    God knows that many times, even on Google, obviously poisonous sites come up in the search. It would be so nice to have a button to click to moderate down the page or the domain itself...
  • It's all about the benjamins... by L-Train8 (Score:1) Wednesday September 13 2000, @02:08PM
  • Closed source has it's usage... by NetDrain (Score:1) Wednesday September 13 2000, @02:08PM
  • A question within a question by m0nkeyb0y (Score:2) Wednesday September 13 2000, @02:09PM
  • Re:Exploit for Google? by YellowBook (Score:2) Wednesday September 13 2000, @02:10PM
  • It will never be unexploitable. by The.Tempest (Score:1) Wednesday September 13 2000, @02:11PM
  • by rongen (103161) on Wednesday September 13 2000, @02:11PM (#781024) Homepage
    The only way to achieve true search engine accuracy is to have an actual person search for pages on request. Why no company has thought of this, I'm not sure, as this could certainly be an explosive business opportunity here.

    Dear GOD, those people at my local library! They must be part of some top secret start-up R&D initiative! So helpful, and for FREE! I KNEW they were up to something!!! :)

    --8<--
  • Rotating Criteria by MWoody (Score:2) Wednesday September 13 2000, @02:11PM
  • Odd search results. by chotlhpah (Score:1) Wednesday September 13 2000, @02:11PM
  • Google: The Criteria Aren't Exploitable by Saint Aardvark (Score:2) Wednesday September 13 2000, @02:12PM
  • Obscurity? Not here... by Millennium (Score:2) Wednesday September 13 2000, @02:12PM
  • It seems like Google is kinda exploit-proof... by Dirtside (Score:2) Wednesday September 13 2000, @02:12PM
  • Re: PS by King of the World (Score:1) Wednesday September 13 2000, @02:13PM
  • by Tumbleweed (3706) on Wednesday September 13 2000, @02:14PM (#781031) Homepage
    Well, also the fact that a huge chunk of the web isn't even indexed at all.

    Other than that, though, the interfaces that most search engines use are pretty bad. There is usually no way to filter through a set of results to eliminate things that are obviously not what the searcher wants. Just being able to eliminate a set of domains from the initial results would make a huge difference for me.

    Also, most people have no clue how to effectively use search engines - and they're not all that interested in doing so. I've been working in the web industry for quite a long time, and most of my colleagues seem to have no idea that changing the settings can yield better results. The setting 'phrase' for instance, makes a HUGE difference much of the time - yet I've never seen a colleague change any default settings when doing a web search. If you're not willing to do so much as even toggle an individual setting, you deserve the crappy results you get.

    Oh, another thing - many of the links I get back are of dubious quality - even on the setting 'phrase', many results come back that don't match what I specified. If you play by the rules and the results STILL don't match, I have little faith in ANY results, even if the web site operators aren't trying to override accuracy. This is aside from the very common result of '404 not found' pages.

    Right now, the best search engine I know of is a meta search engine called 'ProFusion' - I've had much better luck with it than with Google. Not enough control over Google... I also like that the results with ProFusion ( http://www.profusion.com [profusion.com] ) come back with an option next to each result to open in a new browser window - now THAT's a nice idea!
  • Another idea - 'demote' button by MWoody (Score:2) Wednesday September 13 2000, @02:15PM
  • Banner adds can teach us alot by Gerad (Score:2) Wednesday September 13 2000, @02:15PM
  • by 2nd Post! (213333) <gundbear@pacbTWAINell.net minus author> on Wednesday September 13 2000, @02:15PM (#781034) Homepage
    Some really good points by previous posters that I want to recap:

    If you open up the criteria such that *everyone* exploits the criteria, then there is no discrimination. When the criteria are closed, only those who have found the exploits can get increased exposure, making it inherently unfair.

    Another issue is that what a search engine wants you to see is different than what you want the search engine to give you, in some cases.

    We want the union of two criteria: the results that give the search engine the most use/reuse (usefulness of the search) and the results that give the search engine the most financial recompense (so that the search engine can grow, get better, get faster, etc.).

    They may not be correlated, but they are both very important. The most useful pages may not give them the most money, and the pages that pay them the most may not generate enough repeat use for them either.

    Perhaps the best search algorithm is two step:

    Rank according to links (the more links to a page, the more useful the page)
    Count repeat use (the more times a search has to be refined, the less useful the pages returned)

    Rank according to links already occurs at Altavista and Google.

    I don't know that anyone does the second.

    Say you do a search on Google; if you hit the next button, then the pages that were just listed get knocked down a few points. If you hit Google again a few minutes later with a variant search, then knock a few points off *all* the pages that got listed in the previous search. If a user goes back and hits 'related' pages, increase the points for that page and all the related pages. Repeat the above algorithm for every hit to Google.
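
The repeat-use scoring just described could be sketched roughly as follows. This is only an illustration under stated assumptions: the `FeedbackScores` class, its method names, and the one-point `penalty` and `reward` values are hypothetical, and the comment does not say how large "a few points" should be.

```python
from collections import defaultdict

class FeedbackScores:
    """Per-page score adjustments driven by what the searcher does next."""

    def __init__(self, penalty=1.0, reward=1.0):
        self.penalty = penalty  # hypothetical size of "a few points" off
        self.reward = reward    # hypothetical bonus for a 'related' click
        self.scores = defaultdict(float)

    def next_page(self, shown):
        """User hit 'next': the pages just shown did not satisfy them."""
        for url in shown:
            self.scores[url] -= self.penalty

    def refined_search(self, previous_results):
        """User searched again with a variant query: penalise the earlier list."""
        for url in previous_results:
            self.scores[url] -= self.penalty

    def related_clicked(self, url, related):
        """User asked for pages related to `url`: reward it and its relatives."""
        for u in [url, *related]:
            self.scores[u] += self.reward

fb = FeedbackScores()
fb.next_page(["a.example", "b.example"])
fb.refined_search(["a.example", "b.example", "c.example"])
fb.related_clicked("c.example", ["d.example"])
print(dict(fb.scores))
```

These adjustments would be added to whatever link-based rank the engine already computes. Note that the signal is itself exploitable: a script that repeatedly clicks 'related' on its own page would inflate its score unless clicks are rate-limited per user.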

    The nick is a joke! Really!
  • blah blah blah by dR.fuZZo (Score:2) Wednesday September 13 2000, @02:15PM
  • Re:Google: The Criteria Aren't Exploitable by Junks Jerzey (Score:2) Wednesday September 13 2000, @02:16PM
  • Track IPs by MWoody (Score:1) Wednesday September 13 2000, @02:16PM
  • I Asked Jeeves . . . by dgale (Score:2) Wednesday September 13 2000, @02:17PM
  • Re:criteria. by King of the World (Score:1) Wednesday September 13 2000, @02:18PM
  • Re:Search engines can -always- be improved by shion (Score:2) Wednesday September 13 2000, @02:19PM
  • Re:Exploit for Google? by King of the World (Score:1) Wednesday September 13 2000, @02:19PM
  • Re:Limits by 2nd Post! (Score:2) Wednesday September 13 2000, @02:20PM
  • Re:C;mon, open it up! by grahamsz (Score:2) Wednesday September 13 2000, @02:22PM
  • There is a way to vote! by 2nd Post! (Score:1) Wednesday September 13 2000, @02:22PM
  • Re:Odd search results. by MonkeyHanger (Score:1) Wednesday September 13 2000, @02:23PM
  • Probably an open source issue really by tbray (Score:1) Wednesday September 13 2000, @02:24PM
  • Why such a complicated system? by 2nd Post! (Score:2) Wednesday September 13 2000, @02:26PM
  • Re:Google: The Criteria Aren't Exploitable by Saint Aardvark (Score:1) Wednesday September 13 2000, @02:26PM
  • NEWSTORY: that slash forgot and is too lame by Anonymous Coward (Score:1) Wednesday September 13 2000, @02:29PM
  • Intelligent? I think not.... by NullStream (Score:2) Wednesday September 13 2000, @02:30PM
  • An AI-complete problem? by drfireman (Score:1) Wednesday September 13 2000, @02:49PM
  • Re:A question within a question by dbthomas (Score:1) Wednesday September 13 2000, @02:50PM
  • Take a step back for more power by Space Cow (Score:2) Wednesday September 13 2000, @02:51PM
  • Re:Another idea - 'demote' button by 2nd Post! (Score:1) Wednesday September 13 2000, @02:56PM
  • Re:An AI-complete problem? by NullStream (Score:1) Wednesday September 13 2000, @02:56PM
  • Re:Track IPs by Dwonis (Score:1) Wednesday September 13 2000, @03:01PM
  • Re:Search engines can -always- be improved by sracer9 (Score:1) Wednesday September 13 2000, @03:02PM
  • by Phrogman (80473) on Wednesday September 13 2000, @03:03PM (#781058) Homepage

    The biggest problem with search engines is relevancy. When I do a search for a word like "magic", the search engine will return results based upon its algorithm, but producing relevancy from a single search word is just about impossible as a task. With a term like "magic" I could be looking for:

    • Magic as in Magic the Gathering - a collectible card game I used to play.
    • Magic as in the occult.
    • Magic as in sleight-of-hand.

    Or any of a large number of subjects that I could have in mind at the time of my search. A search engine such as Google will rank pages higher if they contain the word magic in the page title, multiple times in the body of the page, in the META tags, in or near HREF links, or if they are linked to by many other sites, than pages which do not meet these criteria. The details differ from search engine to search engine.

    None of these criteria for ranking take into account the nature of my query - what I had in mind when I did the search. In other words they do not directly address the relevancy of the results. If a search engine offered me the opportunity to pick from results it returned and gradually refine the search to produce better results, it would be addressing this situation. Some do with a "search again in this result set" or "more like this" type option on their results pages, but it's still kinda mechanical, and not all that reliable.

    I think it will take some sort of AI analysis of search requests based on user-feedback of some sort and with a learning capability to surpass the current crop of search engines. Until such time as we have some smart systems working behind the scenes on searching any improvements will no doubt be incremental rather than radical.

    Now, as for keeping the specifics of how a page is ranked secret, I think it's absolutely necessary. There is a constant, quiet war going on between the search engines and the folks who want to get their websites listed at the top of the page when a result set is produced. The people who regularly submit their sites to the various search engines, with each search engine receiving a specially made page generated just for its benefit to ensure that the website gets the best ranking possible, are not interested in how accurate the search engine is; they simply want to come up first. The folks at the search engine generally want the most relevant pages to be returned. There is an essential difference of purpose between the two camps.

    On the side of the search engines, they have control over their ranking system, and change it periodically to prevent abuse of the system. The folks who are seriously trying to get to the top of the heap in the search engine results are constantly trying new methods to get ahead.

    For instance, at one point some webmasters were creating their webpages with a lot of text at the bottom of the page that was the same font color as the background, so that the search engines would spider the contents of the page but users would never see those contents. This let them list all sorts of words that scored higher in the search engines' results, but had little or no relevancy to the page contents. The search engines got wise to this trick and now most will penalize you for using it.
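
As a rough illustration of how a spider might catch the same-color trick, here is a sketch using Python's standard `html.parser`. It assumes the page sets its colors with `<body bgcolor=...>` and `<font color=...>` attributes and compares them as plain strings; the `HiddenTextFinder` class is a hypothetical name, and how the real engines detect this is not stated in the comment.

```python
from html.parser import HTMLParser

class HiddenTextFinder(HTMLParser):
    """Collect text inside <font color=X> where X equals the <body bgcolor>."""

    def __init__(self):
        super().__init__()
        self.bgcolor = None
        self.font_colors = []  # stack of colors for currently open <font> tags
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.bgcolor = (attrs.get("bgcolor") or "").lower()
        elif tag == "font":
            self.font_colors.append((attrs.get("color") or "").lower())

    def handle_endtag(self, tag):
        if tag == "font" and self.font_colors:
            self.font_colors.pop()

    def handle_data(self, data):
        text = data.strip()
        # Flag text only when a background is known and the innermost
        # font color matches it exactly.
        if (text and self.bgcolor and self.font_colors
                and self.font_colors[-1] == self.bgcolor):
            self.hidden.append(text)

page = ('<html><body bgcolor="#FFFFFF"><p>Welcome to my page.</p>'
        '<font color="#ffffff">mp3 mp3 free download</font></body></html>')
finder = HiddenTextFinder()
finder.feed(page)
print(finder.hidden)
```

A string comparison like this misses near-matches (for example `#FFFFFE` on `#FFFFFF`) and named colors, so a real check would need to compare color values rather than spellings.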

    Opening up the search engines' ranking rules would only make the system easier to abuse more precisely. No matter how many eyeballs pore over the code, it will still not change the nature of the guy who will use any method at his disposal to get his porn page returned as Link #1 when you do a search for MP3 because it's the hottest term currently being searched for.

    Google has altered this battle somewhat by ranking pages higher in their results based on how many other webpages contain links to that page, and also based upon the nature of the linking page. They distinguish between pages which contain a lot of links (like a web directory such as my own Omphalos [omphalos.net]) and those which are linked to by a lot of other pages. Both get points for different reasons and in different instances; I don't remember the details. Even this is open to abuse, although with a bit more effort required. I know of a website which has over 200 different URLs registered and operational, all of which contain pages which point back to the main URL they are promoting. When a search engine such as Google goes to analyze this website, it will rank it higher because it is linked to by so many separate domains and so many separate pages on those domains. It's harder to abuse, but it can be done.

    Of course, this is all basically irrelevant, since each of the search engine companies keeps their methodology and their source code highly protected. It is worth millions of dollars in revenue, and I cannot honestly see any of them deciding to release their software in this way.

    If you have not noticed, practically every graduate student who devises a new and effective method of indexing and ranking search results ends up creating their own company once they have delivered their thesis and entered the real world. That is certainly how Google started, and I believe it is also how Ask Jeeves got going. I am sure that most of the other main search engines got going in the same or a similar manner.

    All that said, if you want to play with a true search engine that is GPLed and works quite well, although not on the scale of a Google or an Altavista, try UDMSearch [mnogo.ru]. It runs just fine under Linux or FreeBSD (I have installed it on both in the past) and I am using it on my site under Solaris. It is still in an intense development cycle and new versions are released regularly, but it's worth exploring if you are interested in how a search engine works, and want to get your hands dirty.

    For more information on the big boys, check out Search Engine Watch [searchenginewatch.com], and finally, if you are simply interested in Space, Space Exploration or Space Science, check out SpaceRef [spaceref.com].

  • Re:The quality of results is the fault of users & by khym (Score:1) Wednesday September 13 2000, @03:04PM
  • Re:<META `/usr/dict/words` by willis (Score:2) Wednesday September 13 2000, @03:05PM
  • by jmv (93421) on Wednesday September 13 2000, @03:06PM (#781061) Homepage
    If you open up the criteria such that *everyone* exploits the criteria, then there is no discrimination. When the criteria are closed, only those who have found the exploits can get increased exposure, making it inherently unfair.

    You seem to forget that the idea of search engine result scoring/ranking is not about being fair to all sites, it's about returning the best result possible.

    If you open the criteria, the sites that make money from ads will all use them (the result is going to be "fair" between those sites), but the problem is that the not-for-profit websites (which are much more common) won't change their pages just to get more hits (they don't care). The result is that, though it ends up being fair to all the commercial sites, as a user you're less likely to find what you're looking for... which is the point of using a search engine.

    If you just want to be fair, have the search engine return a random URL. Now *that* would be fair!
  • Re:Unexploitable? .... -1 flamebait by g_mcbay (Score:2) Wednesday September 13 2000, @03:07PM
  • Release their assets? by Mr. McGibby (Score:2) Wednesday September 13 2000, @03:07PM
  • Re:Search engines can -always- be improved by B'Trey (Score:2) Wednesday September 13 2000, @03:13PM
  • What about Web Position software? by antdude (Score:2) Wednesday September 13 2000, @03:18PM
  • Search result usefulness declines over time. by kd5biv (Score:1) Wednesday September 13 2000, @03:22PM
  • About returning the best results possible... by 2nd Post! (Score:2) Wednesday September 13 2000, @03:23PM
  • Re:Search engines can -always- be improved by NicGCotton (Score:1) Wednesday September 13 2000, @03:26PM
  • Re:What about a moderated search engine ? by mrmag00 (Score:2) Wednesday September 13 2000, @03:26PM
  • by Restil (31903) on Wednesday September 13 2000, @03:29PM (#781070) Homepage
    That is only part of the way it works.

    Sites are grouped into categories known as authorities and hubs. A hub points to lots of different pages (Yahoo, for instance). An authority has lots of different places pointing to it.

    Where the ranking comes into play is dependent on how good the hub or authority is. A hub is good (better than others) if it points to a number of good authorities. Likewise, an authority is good if it is linked to by a number of good hubs. Yes, this is a recursive process, and yes, it takes a number of passes to get the ranking to level out.
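
The recursive hub/authority scoring described above matches what is published as Kleinberg's HITS algorithm; Google's own published method is PageRank, so the sketch below should be read as an illustration of the hub/authority idea rather than of Google's actual code. The `hits` function, the toy `links` graph, and the choice of 20 passes are assumptions made for the example.

```python
import math

def hits(links, iterations=20):
    """links: dict mapping page -> list of pages it links to.

    Returns (hub, authority) score dicts, each normalised to unit length.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # An authority is good if good hubs point to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # A hub is good if it points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalise so the scores level out instead of growing without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {
    "directory": ["site_a", "site_b", "site_c"],
    "fan_page": ["site_a", "site_b"],
    "dummy": ["spam"],
}
hub, auth = hits(links)
print(sorted(hub, key=hub.get, reverse=True)[:2])
```

In this toy graph the isolated `dummy` hub points only at `spam`, so neither gains from the well-connected pages, and both scores shrink toward zero as the passes repeat.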

    If a pr0n site wanted to exploit Google to get a higher ranking, they would first need to create a LOT of dummy sites to link to it, and all those dummy sites would need to be found by Google's robot.

    However, just having a large number of dummy sites linking to the pr0n site is not sufficient. Those dummy sites would also have to link to a large number of other GOOD authority sites (on pr0n or whatever).

    Now, throw another wrench into the works. Google doesn't search only on keywords ON the site itself, but on the sites that refer to it, and the other way around. That's why if you search for "more evil than satan himself" you end up with Microsoft as a prominent result, even though the words evil and satan probably don't appear anywhere on Microsoft's website (although maybe they should).

    This way, if you were searching for pages about a certain topic, but the pages themselves don't actually use the words you're looking for, you will still find that page as long as there are good hubs out there that refer to that page and use your search terms in close proximity to the links.
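
A small sketch of that link-text idea: words from the referring pages' link text are indexed under the page they point to, so a page can be found by words it never uses itself. The `build_anchor_index` and `search` functions and the example URLs are hypothetical, and only the link text itself is indexed here, not nearby words.

```python
from collections import defaultdict

def build_anchor_index(links):
    """links: iterable of (source_url, target_url, anchor_text).

    Returns word -> {target_url: number of inbound links using that word}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for _source, target, text in links:
        for word in set(text.lower().split()):
            index[word][target] += 1
    return index

def search(index, query):
    """Return targets whose inbound link text covers every query word,
    ordered by the total number of word matches."""
    words = query.lower().split()
    if not words:
        return []
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))

    def score(url):
        return sum(index[word][url] for word in words)

    return sorted(candidates, key=score, reverse=True)

links = [
    ("hub1.example", "bigcorp.example", "more evil than satan"),
    ("hub2.example", "bigcorp.example", "evil empire"),
    ("hub3.example", "other.example", "evil twin"),
]
index = build_anchor_index(links)
print(search(index, "evil satan"))
```

Here `bigcorp.example` is returned for "evil satan" purely because of what other pages say when linking to it. A fuller version would weight each link by the hub score of the page it comes from.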

    Now, if a hub points to a large number of authorities on a specific topic, words relevant to that topic will then become viable search terms to find the hub when searching, as the hub would also be a good source of information, even if it doesn't list the specific search terms. All of this affects the "ranking".

    So, for a dummy hub to get a high ranking, it would need to point to a large number of high ranking authority pr0n sites (which would be counterproductive when what you're trying to do is advertise your own site). This would raise the hub rating for certain terms (specific to pr0n sites), and therefore raise the bar on the site you're trying to promote.

    Of course, trying to get a pr0n site to come up on a search for "teen" or even "sex" is not easy, because while a pr0n site is generally fly by night, there are many legitimate sites which have been around for several years and have built themselves into the web structure well and therefore get categorized correctly.

    -Restil
  • New MegaSearch Search Engine by TWX_the_Linux_Zealot (Score:2) Wednesday September 13 2000, @03:29PM
  • Search quality is hard by dca (Score:1) Wednesday September 13 2000, @03:29PM
  • Re:Search engines can -always- be improved by BradleyUffner (Score:1) Wednesday September 13 2000, @03:32PM
  • A problem with the opening the rules by Dacta (Score:2) Wednesday September 13 2000, @03:47PM
  • Yep, by firstpostfirstpost (Score:1) Wednesday September 13 2000, @03:47PM
  • Re:What about a moderated search engine ? by tswinzig (Score:2) Wednesday September 13 2000, @03:49PM
  • by Felinoid (16872) <emot@m-net.arbornet.org> on Wednesday September 13 2000, @03:52PM (#781077) Homepage Journal
    This isn't security as much as it is in the same argument base...
    The arguments against "Security by obscurity" apply here.. so just insert those arguments [here] and I'll move on...

    It works not by prevention so much as "reduced body count" and I guess that's the best a search engine can hope for.

    When someone thwarts security that's it.. you're dead...
    When someone tricks a search to give them top results it's just a few websites.. it CAN be overlooked.

    So say... 1 person hacks AltaVista.. it's down... blah.. 100 persons hack AltaVista.. it's still down... 1 cracker vs 1,000 crackers... makes very little difference... it only takes one defect and one joker to ruin your day...

    But with searches... a defect becomes known and you don't fix it in time... 1,000 jokers and you're screwed...
    1 joker however isn't a problem.... you're still online and USUALLY you still give good results... just one bad result...
    You get bad results by random chance and user mistakes... so big deal...

    But you're expecting the joker.. once he's discovered this little trick... won't make it public....

    Right now this doesn't happen...
    But it's a lot to risk...

    Recommendation.... since obscurity is effective... but not perfect... give away the OLD system...
    Provide a license that basically says "Any changes may be used by us at any time without notice... but only we may do this... all else is open source"
  • Many of the listings are paid by threemile (Score:2) Wednesday September 13 2000, @03:52PM
  • by tswinzig (210999) on Wednesday September 13 2000, @03:52PM (#781079) Journal
    The biggest problem with search engines is relevancy. When I do a search for a word like "magic", the search engine will return results based upon its algorithm, but producing relevancy from a single search word is just about impossible as a task. With a term like "magic" I could be looking for:

    Magic as in Magic the Gathering - a collectible card game I used to play.
    Magic as in the occult.
    Magic as in sleight-of-hand.


    I know this will blow your mind, but no advanced AI is necessary.

    Instead of typing "magic," you can add one or two more words to your query, and actually get the info you need! E.g. "Magic the Gathering."

    Pretty neat, huh kids?

    -thomas


    "Extraordinary claims require extraordinary evidence."
  • Re:The quality of results is the fault of users & by sillysally (Score:1) Wednesday September 13 2000, @03:57PM
  • Accuracy limit by Pinball Wizard (Score:2) Wednesday September 13 2000, @04:00PM
  • Re:Ranked by referring pages by pod (Score:1) Wednesday September 13 2000, @04:05PM
  • Re:The Bugaboo is Relevancy by Field Marshall Stack (Score:1) Wednesday September 13 2000, @04:08PM
  • Real reason? Money by geekoid (Score:1) Wednesday September 13 2000, @04:09PM
  • Re:Search engines can -always- be improved by TMB (Score:1) Wednesday September 13 2000, @04:09PM
  • Re:The quality of results is the fault of users & by Jester99 (Score:1) Wednesday September 13 2000, @04:10PM
  • Re:criteria. by Mortanius (Score:1) Wednesday September 13 2000, @04:15PM
  • release away by onShore_Jake (Score:1) Wednesday September 13 2000, @04:15PM
  • Re:About returning the best results possible... by Samrobb (Score:2) Wednesday September 13 2000, @04:19PM
  • Why do people cheat so much? by iabervon (Score:1) Wednesday September 13 2000, @04:21PM
  • Re:Another idea - 'demote' button by Steve Smithies (Score:1) Wednesday September 13 2000, @04:24PM
  • Re:The Bugaboo is Relevancy by phantomlord (Score:1) Wednesday September 13 2000, @04:28PM
  • Re:criteria. by King of the World (Score:1) Wednesday September 13 2000, @04:31PM
  • Re:Open up the criteria! by jmv (Score:2) Wednesday September 13 2000, @04:36PM
  • Re:Limits by DrgnDancer (Score:1) Wednesday September 13 2000, @04:41PM
  • When a search engine hits a 404 error... by antdude (Score:2) Wednesday September 13 2000, @04:42PM
  • My Gripes with Search engines..... by oblisk (Score:1) Wednesday September 13 2000, @04:48PM
  • Re:Unexploitable? .... -1 flamebait by Emugamer (Score:1) Wednesday September 13 2000, @04:48PM
  • Why not find out? by graniteMonkey (Score:1) Wednesday September 13 2000, @04:51PM
  • Re:A matter of time by dgris (Score:1) Wednesday September 13 2000, @04:53PM
  • Re:I vote for obscurity... by dgris (Score:1) Wednesday September 13 2000, @04:57PM
  • Re:When a search engine hits a 404 error... by Amit J. Patel (Score:1) Wednesday September 13 2000, @04:59PM
  • Re:Limits by RedWizzard (Score:1) Wednesday September 13 2000, @05:00PM
  • Google by superlame (Score:1) Wednesday September 13 2000, @05:03PM
  • ''Adversary'' view and randomness by Tom7 (Score:2) Wednesday September 13 2000, @05:03PM
  • 404 not found by Tom7 (Score:1) Wednesday September 13 2000, @05:05PM
  • Re:Open up the criteria! by Pinball Wizard (Score:2) Wednesday September 13 2000, @05:07PM
  • Re:It seems like Google is kinda exploit-proof... by Aciel (Score:1) Wednesday September 13 2000, @05:13PM
  • in theory... by inciteful (Score:1) Wednesday September 13 2000, @05:25PM
  • Re:404 not found by Erik Hollensbe (Score:1) Wednesday September 13 2000, @05:25PM
  • The politics of search engines by Jim Madison (Score:2) Wednesday September 13 2000, @05:36PM
  • Re:About returning the best results possible... by Erik Hollensbe (Score:1) Wednesday September 13 2000, @05:39PM
  • Re:Search engines can -always- be improved by pclinger (Score:2) Wednesday September 13 2000, @05:50PM
  • Profusion == No Linux by winterstorm (Score:2) Wednesday September 13 2000, @05:57PM
  • To answer the question that was asked, by nels_tomlinson (Score:2) Wednesday September 13 2000, @05:59PM
  • Re:The Bugaboo is Relevancy by Phrogman (Score:2) Wednesday September 13 2000, @06:03PM
  • Re:Search engines can -always- be improved by eudas (Score:1) Wednesday September 13 2000, @06:17PM
  • 3 probs with the current search engines: by M@T (Score:1) Wednesday September 13 2000, @06:21PM
  • Re:Exploit for Google? by eudas (Score:1) Wednesday September 13 2000, @06:21PM
  • Re:There is a way to vote! by eudas (Score:1) Wednesday September 13 2000, @06:25PM
  • can search engines be improved? by daviddlewis (Score:1) Wednesday September 13 2000, @06:27PM
  • speeds the process by aozilla (Score:1) Wednesday September 13 2000, @06:39PM
  • Re:Another idea - 'demote' button by eudas (Score:1) Wednesday September 13 2000, @06:49PM
  • Re:Intelligent? I think not.... by eudas (Score:1) Wednesday September 13 2000, @06:56PM
  • Re:Didn't we debate this already today? by skoda (Score:1) Wednesday September 13 2000, @06:59PM
  • Re:The Bugaboo is Relevancy by mojotoad (Score:1) Wednesday September 13 2000, @07:18PM
  • Re:The Bugaboo is Relevancy by mojotoad (Score:1) Wednesday September 13 2000, @07:21PM
  • Misapplication of 'open source' by The Kow (Score:1) Wednesday September 13 2000, @07:23PM
  • Google Algorithm by dzhei (Score:1) Wednesday September 13 2000, @07:30PM
  • Re:A question within a question by m0nkeyb0y (Score:1) Wednesday September 13 2000, @07:37PM
  • Re:A question within a question by Compuser (Score:1) Wednesday September 13 2000, @07:38PM
  • Re:some SE's and web-pimps already do this by fence (Score:2) Wednesday September 13 2000, @07:38PM
  • Re:What about a moderated search engine ? by Steeltoe (Score:1) Wednesday September 13 2000, @08:20PM
  • Keeping things secret... by Ascender (Score:1) Wednesday September 13 2000, @08:52PM
  • Re:Search engines can -always- be improved by sydb (Score:1) Wednesday September 13 2000, @09:44PM
  • Re:It seems like Google is kinda exploit-proof... by timboy3 (Score:1) Wednesday September 13 2000, @09:56PM
  • Re:Search engines can -always- be improved by Karellen (Score:1) Wednesday September 13 2000, @10:34PM
  • by Bazzargh (39195) on Wednesday September 13 2000, @10:51PM (#781138)
    Excellent description. I can only top that by providing links which go over the research underlying this stuff.

    The classic algorithm of this type is called HITS [cornell.edu], by J. Kleinberg.

    IBM's 'Clever' [ibm.com] is an enhancement to 'HITS'.

    Part of the success of these is that they can be mapped onto well-known matrix-solving problems... there's enough information in the documents above for you to work out how to write one.

    One wrinkle Restil doesn't mention is that the technique is not purely based around link structure. You _seed_ the process with content-ranked pages (hoping the process 'crawls' to the best set independently of the seed), and subsequently you may select the most relevant 'communities' of pages by content ranking. So if you are already in the top 100, say, you may be able to content-mangle yourself up the list, but you need good linkages to get in first!

    A further criterion used is response time (I strongly suspect Google uses this; I got hooked on it when I found that its sites _responded_ rather than hanging, as most AltaVista sites did at the time). Again, there are publications on this stuff: the shark search algorithm [scu.edu.au] is a spider with this feature.
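
    For anyone who wants to try it, here is a minimal sketch of the hub/authority iteration that HITS is built on, assuming the link graph has already been collected from a content-ranked seed set. The function names, the toy graph and the page names are all made up for illustration, and this is not the code of any real engine. Repeating the two updates and normalizing is a power iteration, which is where the 'well-known matrix problem' comes in.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length so they do not grow without bound."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    links maps each page to the pages it links to. A good hub links to
    good authorities; a good authority is linked to by good hubs.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of every page linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum the authority scores of every page linked to.
        new_hub = {
            p: sum(new_auth[dst] for dst in links.get(p, ()))
            for p in pages
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Hypothetical toy graph: two link pages pointing at reference pages.
toy_links = {
    "portal1": ["ref", "other"],
    "portal2": ["ref"],
    "ref": [],
}
hub_scores, auth_scores = hits(toy_links)
best_authority = max(auth_scores, key=auth_scores.get)  # "ref"
best_hub = max(hub_scores, key=hub_scores.get)          # "portal1"
```

    Note that stuffing keywords into "other" changes nothing here; only the links count, which is why the content ranking has to come in through the seed set.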

  • Re:Search engines can -always- be improved by streetlawyer (Score:1) Wednesday September 13 2000, @10:52PM
  • Re:Unexploitable? Read the Google Paper by thenning (Score:1) Wednesday September 13 2000, @11:14PM
  • Re:The Bugaboo is Relevancy by DZign (Score:1) Wednesday September 13 2000, @11:14PM
  • Multiple search engines- inelegant but they work by cheekymonkey_68 (Score:1) Wednesday September 13 2000, @11:33PM
  • The more they tell us, the less they earn? by cah1 (Score:1) Wednesday September 13 2000, @11:51PM
  • Some bad things about dmoz.org by KjetilK (Score:2) Thursday September 14 2000, @12:08AM
  • Re:Limits by RegularFry (Score:1) Thursday September 14 2000, @12:14AM
  • Some engines control your clicks by Pseudonymus Bosch (Score:1) Thursday September 14 2000, @01:02AM
  • Re:Profusion == No Linux by rlk (Score:2) Thursday September 14 2000, @01:34AM
  • Open source search engine by dnnrly (Score:1) Thursday September 14 2000, @01:34AM
  • Your post, corrected by spell checker. by MaxGrant (Score:1) Thursday September 14 2000, @01:40AM
  • Re:Limits by thenerd (Score:1) Thursday September 14 2000, @01:48AM
  • Secret algorithms? by ficara (Score:1) Thursday September 14 2000, @01:50AM
  • Re:Limits by MrNixon (Score:1) Thursday September 14 2000, @01:52AM
  • :) by Pseudonymus Bosch (Score:1) Thursday September 14 2000, @01:58AM
  • Re:The quality of results is the fault of users & by magic (Score:1) Thursday September 14 2000, @02:14AM
  • Obscurity has its place. by Refrag (Score:1) Thursday September 14 2000, @02:19AM
  • Re:Many-eyeballs doesn't apply by BlueArcus (Score:1) Thursday September 14 2000, @02:35AM
  • Re:C;mon, open it up! by ShakespeareProj (Score:1) Thursday September 14 2000, @02:35AM
  • Re:Exploit for Google? by hey (Score:1) Thursday September 14 2000, @02:37AM
  • Google Cache by isaac_akira (Score:1) Thursday September 14 2000, @02:50AM
  • Oingo rocks, thanks for the pointer by Raffaello (Score:1) Thursday September 14 2000, @03:38AM
  • Re:What about a moderated search engine ? by MarcoAtWork (Score:1) Thursday September 14 2000, @03:42AM
  • Re:Limits by Phibian (Score:1) Thursday September 14 2000, @03:44AM
  • Problems with Inverted Indexing by harrisj (Score:1) Thursday September 14 2000, @04:07AM
  • Re:Limits by SquidBoy (Score:1) Thursday September 14 2000, @04:10AM
  • Re:Baner adds can teach us alot by ameoba (Score:1) Thursday September 14 2000, @04:22AM
  • Not at the local library any more... by goliard (Score:2) Thursday September 14 2000, @04:23AM
  • Re:Search engines can -always- be improved by Kaa (Score:1) Thursday September 14 2000, @04:28AM
  • Re:What about a moderated search engine ? by beth_linker (Score:1) Thursday September 14 2000, @04:40AM
  • How to make "ranking" really work. by jekk (Score:1) Thursday September 14 2000, @04:46AM
  • Re:Limits by jawtheshark (Score:1) Thursday September 14 2000, @04:55AM
  • Re:Ranked by referring pages by KingOfCartoons (Score:1) Thursday September 14 2000, @05:51AM
  • Re:Open up the criteria! by TheNightOwl (Score:1) Thursday September 14 2000, @06:09AM
  • by CaseyB (1105) on Thursday September 14 2000, @06:26AM (#781173)
    Moderation is a MUCH more difficult thing to apply to search engines than it might initially appear.

    You don't want the search to return the "best" sites! That's not the point. You want the search to return the most appropriate sites.

    If I search for "redhat reviews", I'll get both "redhat.com" and "joe's linux distro reviews" on GeoCities.

    Now, redhat is obviously going to have been historically rated higher than Joe's site, because it gets more traffic and is probably on the whole a much more useful and informative site. More people will have been happy with redhat.com as a search result than they have been with Joe's.

    So should redhat.com be at the top of the list, and Joe's site at the bottom? No! Because if I happen to be looking for an objective review, I don't care what redhat has to say -- I want to know what Joe thinks about the relative merits of redhat and debian. Redhat.com is NOT an appropriate site for a "redhat reviews" search even though it matches the terms and is a highly ranked site.

    So search results must be a function of both the site and the search terms, and moderation has to be based on this. This is a very nasty can of worms, because interpreting what the user wanted when he typed in the search is subjective. Doing a simple moderation on the intersection of the search terms and the desired result probably isn't feasible either, since freeform searches aren't discrete enough: there are too many possible ways to phrase the search for distro reviews. Determining what he wants and returning the best results based on previous moderation amounts to full-blown natural language parsing and artificial intelligence.
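
    To make the difference concrete, here is a rough sketch comparing a site-wide moderation score with one keyed on (search terms, site). The class, the vote counts and the site names are all invented for illustration, and the term normalization is deliberately crude.

```python
from collections import defaultdict


class ModerationStore:
    """Toy store of moderation votes, kept both per site and per (query, site)."""

    def __init__(self):
        self.site_votes = defaultdict(int)
        self.query_site_votes = defaultdict(int)

    @staticmethod
    def key(query):
        # Crude normalization: lowercase, order-insensitive set of terms.
        return frozenset(query.lower().split())

    def vote(self, query, site, delta):
        self.site_votes[site] += delta
        self.query_site_votes[(self.key(query), site)] += delta

    def rank_global(self, sites):
        # Ranks by overall popularity, ignoring what was searched for.
        return sorted(sites, key=lambda s: self.site_votes.get(s, 0),
                      reverse=True)

    def rank_for_query(self, query, sites):
        # Ranks by votes cast on exactly this (normalized) set of terms.
        k = self.key(query)
        return sorted(sites,
                      key=lambda s: self.query_site_votes.get((k, s), 0),
                      reverse=True)


store = ModerationStore()
store.vote("redhat download", "redhat.com", +5)
store.vote("redhat reviews", "redhat.com", -1)
store.vote("redhat reviews", "joes-distro-reviews", +3)

sites = ["redhat.com", "joes-distro-reviews"]
by_site = store.rank_global(sites)                        # redhat.com first
by_query = store.rank_for_query("redhat reviews", sites)  # Joe's site first
# A rephrased search matches no stored votes, so nothing is reordered.
rephrased = store.rank_for_query("reviews of redhat", sites)
```

    The last line is the can of worms: "reviews of redhat" means the same thing to a person, but it shares no key with the votes already cast, so all that moderation is wasted on it.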

  • Re:Google: The Criteria Aren't Exploitable by Tau Zero (Score:2) Thursday September 14 2000, @06:57AM
  • Aeiwi by AeiwiMaster (Score:1) Thursday September 14 2000, @07:07AM
  • Ontological Search Engines by winterstorm (Score:1) Thursday September 14 2000, @07:32AM
  • Re:Problems with Inverted Indexing by dehora (Score:1) Thursday September 14 2000, @07:45AM
  • Re:Profusion == No Linux by Keel (Score:1) Thursday September 14 2000, @07:56AM
  • audio/music search. by amchugh (Score:1) Thursday September 14 2000, @08:13AM
  • Re:Limits by lgas (Score:1) Thursday September 14 2000, @11:04AM
  • Re:Search engines can -always- be improved by SkunkPussy (Score:1) Thursday September 14 2000, @12:32PM
  • Re:Obscurity? Not here... by SkunkPussy (Score:1) Thursday September 14 2000, @12:45PM
  • Wow! That was magical! by exister (Score:1) Thursday September 14 2000, @02:32PM
  • Re:Submitcorner.com by AgentWebRanking Free (Score:1) Friday September 15 2000, @03:53AM
  • Re:profusion isn't so good by Tumbleweed (Score:2) Friday September 15 2000, @07:53AM