Could Open Source Lead to a Meritocratic Search Engine?

Posted by CmdrTaco on Wed Feb 14, 2007 12:00 PM
from the sure-if-you-donate-a-thousand-data-centers dept.
Slashdot contributor Bennett Haselton writes "When Jimmy Wales recently announced the Search Wikia project, an attempt to build an open-source search engine around the user-driven model that gave birth to Wikipedia, he said his goal was to create "the search engine that changes everything", as he underscored in a February 5 talk at New York University. I think it could, although not for the same main reasons that Wales has put forth -- I think that for a search engine to be truly meritocratic would be more of a revolution than for a search engine to be open-source, although both would be large steps forward. Indeed, if a search engine could be built that really returned results in order of average desirability to users, and resisted efforts by companies to "game" the system (even if everyone knew precisely how the ranking algorithm worked), it's hard to overstate how much that would change things both for businesses and consumers. The key question is whether such an algorithm could be created that wouldn't be vulnerable to non-merit-based manipulation. Regardless of what algorithms may be currently under consideration by thinkers within the Wikia company, I want to argue logically for some necessary properties that such an algorithm should have in order to be effective. Because if their search engine becomes popular, they will face such huge efforts from companies trying to manipulate the search results, that it will make Wikipedia vandalism look like a cakewalk." The rest of his essay follows.

This will be a trip into theory-land, so it may be frustrating to users who dislike talk about "vaporware" and want to see how something works in practice. I understand where you're coming from, but I submit it's valuable to raise these questions early. This is in any case not intended to supplant discussion about how things are currently progressing.

First, though, consider the benefits that such a search engine could bring, both to content consumers and content providers, if it really did return results sorted according to average community preferences. Suppose you wanted to find out if you had a knack for publishing recipes online and getting some AdSense revenue on the side. You take a recipe that you know, like apple pie, and check out the current results for "apple pie". There are some pretty straightforward recipes online, but you believe you can create a more complete and user-friendly one. So you write up your own recipe, complete with photographs of the process showing how ingredients should be chopped and what the crust mixture should look like, so that the steps are easier to follow. (Don't you hate it when a recipe says "cut into cubes" and you want to throttle the author and shout, "HOW BIG??" It drove me crazy until I found CookingForEngineers.com.) Anyway, you submit your recipe to the search engine to be included in the results for "apple pie", and if the sorting process is truly meritocratic, your recipe page rises to the top. Until, that is, someone decides to surpass you, and publishes an even more user-friendly recipe, perhaps with a link to a YouTube video of them showing how to make the pie, which they shot with a tripod video camera and a clip-on mike in their well-lit kitchen. In a world of perfect competition, content providers would be constantly leapfrogging each other with better and better content within each category (even a highly specific one like apple pie recipes), until further efforts would no longer pay for themselves with increased traffic revenue. (The more popular search terms, of course, would bring greater rewards for those listed at the top, and would be able to pay for greater efforts to improve the content within that category.) But this constant leapfrogging of better and better content requires efficient and speedy sorting of search results in order to work. 
It doesn't work if the search results can be gamed by someone willing to spend effort and money (not worth it for the author of a single apple pie recipe, but worth it for a big money-making recipe site), and it doesn't work if it's impossible for new entrants to get hits when the established players already dominate search results.

Efficient competition benefits consumers even more for results that are sorted by price (assuming that among comparable goods and services, the community promotes the cheapest-selling ones to the top of the search results, as "most desirable"). If you were a company selling dedicated Web hosting, for example, you would submit your site to the engine to be included in results for "dedicated hosting". If you could demonstrate to the community that your prices and services were superior to your competitors', and if the ranking algorithm really did rank sites according to the preferences of the average user, your site could quickly rise to the top, and you'd make a bundle on new sales -- until, of course, someone else had the same idea and knocked you out of the top spot by lowering their prices or improving their services. The more efficient the marketplace, the faster prices fall and service levels rise, until prices just cover the cost of providing the service and compensating the business owner for their time. It would be a pure buyer's market.

It's important to precisely answer the question: Why would this system be better than a system like Google's search algorithm, which can be "gamed" by enterprising businesses and which doesn't always return the results first that the user would like the most? You might be tempted to answer that in an inefficient marketplace created by an inefficient search result sorting algorithm, a user sometimes ends up paying $79/month for hosting, instead of the $29/month that they might pay if the marketplace were perfectly efficient. But this by itself is not necessarily wasteful. The extra $50 that the user pays is the user's loss, but it's also the hosting company's gain. If we consider costs and benefits across all parties, the two cancel out. The world as a whole is not poorer because someone overpaid for hosting.

The real losses caused by an inefficient search algorithm are the efforts spent by companies to game the search results (e.g. paying search engine optimization firms to try and get them to the top Google spot), and the reluctance of new players to enter that market if they don't have the resources to play those games. If two companies each spend $5,000 trying to knock each other off of the top spot for a search like "weddings", that's $5,000 apiece in effort that gets burned up with no offsetting amount of goods and services added to the world. This is what economists call a deadweight loss, with no corresponding benefit to any party. The two wedding planners might as well have smashed their pastel cars into each other. Even if a single company spends the effort and money to move from position #50 to position #1, that gain to them is offset by the loss to the other 49 companies that each moved down by one position, so the net benefit across all parties is zero, and the effort that the company spent to raise their position would still be a deadweight loss.

On the other hand, if search engine results were sorted according to a true meritocracy, then companies that wanted to raise their rankings would have to spend effort improving their services instead. This is not a deadweight loss, since these efforts result in benefits or savings to the consumer.

I've been a member of several online entrepreneur communities, and I'd conservatively estimate that members spend less than 10% of the time talking about actually improving products and services, and more than 90% of the time talking about how to "game" the various systems that people use to find them, such as search engines and the media. I don't blame them, of course; they're just doing what's best for their company, in the inefficient marketplace that we live in. But I feel almost lethargic thinking of that 90% of effort that gets spent on activities that produce no new goods and services. What if the information marketplace really were efficient, and business owners spent nearly 100% of their efforts improving goods and services, so that every ounce of effort added new value to the world?

Think of how differently we'd approach the problem of creating a new Web site and driving traffic to it. A good programmer with a good idea could literally become an overnight success. If you had more modest goals, you could shoot a video of yourself preparing a recipe or teaching a magic trick, and just throw it out there and watch it bubble its way up the meritocracy to see if it was any good. You wouldn't have to spend any time networking or trying to rig the results, you just create good stuff and put it out there. No, despite whatever cheer-leading you may have heard, it doesn't quite work that way yet -- good online businessmen still talk about the importance of networking, advertising, and all the other components of gaming the system that don't relate to actually improving products and services. But there is no reason, in principle, why a perfectly meritocratic content-sorting engine couldn't be built. Would it revolutionize content on the Internet? And, could Search Wikia be the project to do it, or play a part in it?

Whatever search engine the Wikia company produced, it would probably have such a large following among the built-in open-source and Wikipedia fan base, that traffic wouldn't be a problem -- companies at the top of popular search results would definitely benefit. The question is whether the system can be designed so that it cannot be gamed. I agree with Jimmy Wales's stated intention to make the algorithm completely open, since this makes it easier for helpful third parties to find weaknesses and get them fixed, but of course it also makes it easier for attackers to find those weaknesses and exploit them. If you think Microsoft paying a blogger to edit Wikipedia is a problem, imagine what companies will do to try and manipulate the search results for a term like "mortgage". So what can be done?

The basic problem with any community that makes important decisions by "consensus" is that it can be manipulated by someone who creates multiple phantom accounts all under their control. Then if a decision is influenced by voting -- for example, the relative position of a given site in a list of search results -- then the attacker can have the phantom accounts all vote for one preferred site. You can look for large numbers of accounts created from the same IP address, but the attacker could use Tor and similar systems to appear to be coming from different IPs. You could attempt to verify the unique identity of each account holder, by phone for example, but this requires a lot of effort and would alienate privacy-conscious users. You could require a Turing test for each new account, but all this means is that an attacker couldn't use a script to create their 1,000 accounts -- an attacker could still create the accounts if they had enough time, or if they paid some kid in India to create the accounts. You could give users voting power in proportion to some kind of "karma" that they had built up over time by using the site, but this gives new users little influence and little incentive to participate; it also does nothing to stop influential users from "selling out" their votes (either because they became disillusioned, or because they signed up with that as their intent from the beginning!).

So, any algorithm designed to protect the integrity of the Search Wikia results would have to deal with this type of attack. In a recent article about Citizendium, a proposed Wikipedia alternative, I argued that you could deal with conventional wiki vandalism by having identity-verified experts sign off on the accuracy of an article at different stages. That's practical for a subject like biology, where you could have a group of experts whose collective knowledge covers the subject at the depth expected in an encyclopedia, but probably not for a topic like "dedicated hosting" where the task is to sift through tens of thousands of potential matches and find the best ones to list first. You need a new algorithm to harness the power of the community. I don't know how many possible solutions there are, but here is one way in which it could be done.

Suppose a user submits a requested change to the search results -- the addition of their new Site A, or the proposal that Site A should be ranked higher. This decision could be reviewed by a small subset of registered users, selected at random from the entire user population. If a majority of the users rate the new site highly enough as a relevant result for a particular term, then the site gets a high ranking. If not, then the site is given a low ranking, possibly with feedback being sent to the submitter as to why the site was not rated highly. The key is that the users who vote on the site have to be selected at random from among all users, instead of letting users self-select to vote on a particular decision.
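
The review step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the essay does not specify a jury size, a rating scale, or a pass mark, so `jury_size`, `pass_mark`, the 1-to-5 scale, and the `get_rating` callback are all hypothetical names and values chosen for the example.

```python
import random


def review_submission(all_user_ids, get_rating, jury_size=25, pass_mark=4, rng=None):
    """Pick a random jury from the whole user base and take a majority vote.

    get_rating(user_id) is a hypothetical callback that returns that user's
    1-5 relevance rating for the submitted site (an assumed scale).
    """
    rng = rng or random.Random()
    # The site, not the submitter, decides who gets to vote.
    jury = rng.sample(all_user_ids, jury_size)
    approvals = sum(1 for uid in jury if get_rating(uid) >= pass_mark)
    # A strict majority of the jury must rate the site at or above pass_mark.
    return approvals * 2 > jury_size


users = list(range(10_000))
# Toy populations: everyone loves the site, then everyone dislikes it.
print(review_submission(users, lambda uid: 5, rng=random.Random(0)))
print(review_submission(users, lambda uid: 1, rng=random.Random(0)))
```

The point of the design is the `rng.sample` call: voters are drawn from the entire population rather than volunteering, so a submitter cannot choose who reviews the site.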

The nice property of this system is that an attacker can't manipulate the voting simply by having a large number of accounts at their control -- they would have to control a significant proportion of accounts across the entire user population, in order to ensure that when the voters were selected randomly from the user population, the attacker controlled enough of those accounts to influence the outcome. (If an attacker ever really did spend the resources to reach that threshold point, and it became apparent that they were manipulating the votes, those votes could be challenged and overridden by a vote of users whose identities were known to the system. This would allow the verified-identity users to be used as an appeal of last resort to block abuse by a very dedicated adversary, while not requiring most users to verify their identity. This is basically what Jimmy Wales does when he steps in and arbitrates a Wikipedia dispute, acting as his own "user whose identity is known".)
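
To get a feel for why the attacker needs "a significant proportion" of all accounts, here is a rough calculation of the chance that a random jury ends up with an attacker-controlled majority. It is a sketch, not part of the essay: the jury size of 101 is an assumed value, and the binomial model only approximates drawing without replacement, which is reasonable when the user population is much larger than the jury.

```python
from math import comb


def prob_attacker_majority(fraction_controlled, jury_size):
    """Chance that more than half of a randomly drawn jury is attacker-controlled.

    Binomial approximation: each juror is independently an attacker account
    with probability fraction_controlled.
    """
    p = fraction_controlled
    need = jury_size // 2 + 1  # smallest strict majority
    return sum(
        comb(jury_size, k) * p**k * (1 - p) ** (jury_size - k)
        for k in range(need, jury_size + 1)
    )


for fraction in (0.01, 0.10, 0.30, 0.50):
    chance = prob_attacker_majority(fraction, 101)
    print(f"attacker holds {fraction:.0%} of accounts -> chance of majority {chance:.3g}")
```

Under this model the chance stays negligible until the attacker's share of all accounts approaches one half, no matter how many accounts that is in absolute terms.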

This algorithm for an "automated meritocracy" (automeritocracy? still not very catchy at 7 syllables) could be extended to other types of user-built content sites as well. Musicians could submit songs to a peer review site, and the songs would be pushed out to a random subset of users interested in that genre, who would then vote on the songs. (If most users were too apathetic to vote, the site could tabulate the number of people who heard the song and then proceeded to buy or download it, and count those as "votes" in favor.) If the votes for the song are high enough, it gets pushed out to all users interested in that genre; if not, then the song doesn't make it past the first stage. If there are 100,000 users subscribed to a particular genre, but it only takes ratings from 100 users to determine whether or not a song is worth pushing out to everybody, then "good" content is sent out to all 100,000 people while "bad" content only wastes the time of 100 users. If as many "good" songs are submitted as "bad" ones, the average user gets 1,000 pieces of "good" content for every 1 piece of "bad" content. New musicians wouldn't have to spend any time networking, promoting, recruiting friends to vote for them -- all of which have nothing to do with making the music better, and which fall into the category of deadweight losses described above.
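
The 1,000-to-1 figure can be checked with a short calculation. One assumption is worth making explicit: the ratio comes out to exactly 1,000 only when equal numbers of good and bad songs are submitted, which is how I read the example. The function and parameter names below are illustrative.

```python
def good_to_bad_ratio(subscribers, sample_size, good_items, bad_items):
    """Ratio of good-content deliveries to bad-content deliveries."""
    # Every item is first shown to `sample_size` random subscribers.
    # Good items are then pushed to all subscribers; bad items stop at the sample.
    good_views = good_items * subscribers
    bad_views = bad_items * sample_size
    return good_views / bad_views


# One good song for every bad song, as in the essay's example.
print(good_to_bad_ratio(100_000, 100, good_items=1, bad_items=1))  # 1000.0

# If bad submissions outnumber good ones 9 to 1, the ratio shrinks accordingly.
print(good_to_bad_ratio(100_000, 100, good_items=1, bad_items=9))
```

Even when most submissions are bad, listeners still see far more good content than bad, because each bad item only ever reaches the small sample.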

An automeritocracy-like system could even be used as a spam filter for a large e-mail site. Suppose you want to send your newsletter to 100,000 Hotmail users (who really have signed up to receive it). Hotmail could allow your IP to send mail to 100,000 users the first time, and then if they receive too many spam complaints, block your future mailings as junk mail. But if that's their practice, there's nothing to stop you from moving to a new, unblocked IP and repeating the process from there. So instead, suppose that Hotmail temporarily stores your 100,000 received messages in users' "Junk Mail" folders, but releases a randomly selected subset of 100 messages into users' inboxes. Suppose for argument's sake that when a message is spam, 20% of users click the "This is spam" button, but when it is not, only 1% of users click it. Out of the 100 users who see the message, if the number who click "This is spam" looks close to 1%, then since those 100 users were selected as a representative sample of the whole population, Hotmail concludes that the rest of the 100,000 messages are not spam, and moves them retroactively to users' inboxes. If the percentage of those 100 users who click "This is spam" is closer to 20%, then the rest of the 100,000 messages stay in Junk Mail. A spammer could only rig this system if they controlled a significant proportion of the 100,000 addresses on their list -- not impossible, but difficult, since you have to pass a Turing test to create each new Hotmail account.
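
Here is a minimal sketch of that decision rule. The 1% and 20% complaint rates and the sample of 100 come from the example above; the midpoint threshold is my own assumption, since the text only says the observed rate should look "close to" one figure or the other, and nothing here reflects how Hotmail actually works.

```python
import random


def classify_mailing(complaints, sample_size=100, ham_rate=0.01, spam_rate=0.20):
    """Decide the fate of a mailing from complaints in the released sample.

    The midpoint threshold is an illustrative assumption, not a known rule.
    """
    observed = complaints / sample_size
    threshold = (ham_rate + spam_rate) / 2
    return "junk" if observed >= threshold else "inbox"


def simulate(true_complaint_rate, sample_size=100, seed=0):
    """Draw a random sample of recipients and classify the mailing."""
    rng = random.Random(seed)
    # Each sampled recipient clicks "This is spam" with the given probability.
    complaints = sum(rng.random() < true_complaint_rate for _ in range(sample_size))
    return classify_mailing(complaints, sample_size)


print(classify_mailing(1))   # inbox
print(classify_mailing(20))  # junk
print(simulate(0.01), simulate(0.20))
```

Because the 100 recipients are chosen at random by the mail service, a sender cannot steer the sample toward accounts they control.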

The problem is, there's a huge difference between systems that implement this algorithm, and systems that implement something that looks superficially like this algorithm but actually isn't. Specifically, any site like HotOrNot, Digg, or Gather that lets users decide what to vote on, is vulnerable to the attack of using friends or phantom users to vote yourself up (or to vote someone else down). In a recent thread on Gather about a new contest that relied on peer ratings, many users lamented the fact that it was essentially rigged in favor of people with lots of friends who could give them a high score (or that ratings could be offset unfairly in the other direction by "revenge raters" giving you a 1 as payback for some low rating you gave them). I assume that the reason such sites were designed that way is that it just seemed natural that if your site is driven by user ratings, and if people can see a specific piece of content by visiting a URL, they should have the option on that page to vote on that content. But this unfortunately makes the system vulnerable to the phantom-users attack.

(Spam filters on sites like Hotmail also probably have the same problem. We don't know for sure what happens when the user clicks "This is spam" on a piece of mail, but it's likely that if a high enough percentage of users click "This is spam" for mail coming from a particular IP address, then future mails from that IP are blocked as spam. This means you could get your arch-rival Joe's newsletter blacklisted, by creating multiple accounts, signing them up for Joe's newsletter, and clicking "This is spam" when his newsletters come in. This is an example of the same basic flaw -- letting users choose what they want to vote on.)

So if the Wikia search site uses something like this "automeritocracy" algorithm to guard the integrity of its results, it's imperative not to use an algorithm vulnerable to the hordes-of-phantom-users attack. Some variation of selecting random voters from a large population of users would be one way to handle that.

Finally, there is a reason why it's important to pay attention to getting the algorithm right, rather than hoping that the best algorithm will just naturally "emerge" from the "marketplace of ideas" that results from different wiki-driven search sites competing with each other. The problem is that competition between such sites is itself highly inefficient -- a given user may take a long time to discover which site provides better search results on average, and in any case, it may be that Wiki-Search Site "B" has a better design but Wiki-Search Site "A" had first-mover advantage and got a larger number of registered users. When I wrote earlier about why I thought the Citizendium model was better than Wikipedia, several users pointed out that it may be a moot point, for two main reasons. First, most users will not switch to a better alternative if it never occurs to them. Second, for sites that are powered by a user community, it's very hard for a new competitor to gain ground, even with a superior design, if the success of your community depends on lots of people starting to use it all at once. You could write a better eBay or a better Match.com, but who would use it? Your target market will go to the others because that's where everybody else is. Citizendium is, I think, a special case, since they can fork articles that started life on Wikipedia, so Wikipedia doesn't have as huge of an advantage over them as they would if Citizendium had to start from scratch. But the general rule about imperfect competition still applies.

It's a chicken-and-egg problem: You can have Site A that works as a pure meritocracy, and Site B that works as an almost-meritocracy but can be gamed with some effort. But Site B may still win because the larger environment in which they compete with each other is not itself a meritocracy. So we just have to cross our fingers and hope that Search Wikia gets it right, because if they don't, there's no guarantee that a better alternative will rise to take its place. But if they get it right, I can hardly wait to see what changes it would bring about.


Related Stories

[+] A Wikipedia Without Graffiti 290 comments
Frequent Slashdot Contributor Bennett Haselton writes "I'm a Wikipedia junkie. There's nothing more fun than switching back and forth between reading about the history of human evolution, and following the latest speculation about the identity of the mysterious R.A.B. in the Harry Potter books, and Wikipedia is the best site to find it all in one place. But as a fan, it's always been frustrating for me knowing that Wikipedia could never improve beyond a certain point -- as it becomes more popular, it becomes more tempting to vandalize, and in turn becomes less reliable, a point that many have made already. That's why I'm excited that sites like Citizendium are approaching the same problem with a different model, one that could enable them to become what Wikipedia almost was, but which its intrinsic nature kept it from being: a central, reliable source of freely redistributable information about almost anything. The main difference is that Citizendium articles, after initially being built up through the same collaborative process that Wikipedia uses, will go into an editor-approved stage, at which point an editor (publicly identifiable on the article's history page) signs off on the accuracy of the article, and further edits also have to be approved by an editor."
[+] How to Stop Digg-cheating, Forever 217 comments
The following was written by frequent Slashdot editorial contributor Bennett Haselton. He writes "Recently author Annalee Newitz created a bit of a stir with the revelation that she had bought her way to the front page of the story-ranking site Digg. Since Digg allows any registered user to go to a story's URL and "digg it" in order to push it upward through the story-ranking system, it was inevitable that services like User/Submitter would come along, where a Digg user can pay for other users to cast votes to push their story up to the top. User/Submitter says they are currently backlogged and not taking new orders, but they say the service will return and will soon feature services for manipulating similar sites like Digg competitor reddit. Even if the new U/S features are vaporware, it probably won't be long before other companies offer similar services. But it seems like all of these story-ranking sites could prevent the manipulation by making one simple change to their voting algorithm."
[+] The Knol Hypothesis 80 comments
Frequent Slashdot contributor Bennett Haselton sends in his latest, which begins like this and continues behind the link. "When Google's VP of Engineering announced their proposed Knol project, where users can submit articles on different subjects and share in the AdSense revenue from the article pages, he didn't mention "Wikipedia," but practically everyone else did who blogged about it. Here's what I think will happen, if Knol is implemented according to the plan: Even though it won't technically be a "Wikipedia fork," it will quickly become equivalent to one, with a "gold rush" of users copying content from Wikipedia to Knol articles hoping for a piece of the AdSense dollars. But I submit this will be a good thing, especially if bona fide experts in different fields join the gold rush as well and start signing their names to articles that they've vetted."
[+] Interviews: Hi, I Want To Meet (17.6% of) You! 372 comments
Frequent Slashdot contributor Bennett Haselton wants to make online dating better. Here's how he wants to do it. "Suppose you're an entrepreneur who wants to break into the online personals business, but you face impossible odds because everybody wants to go where everybody else already is (basically, either Match.com or Yahoo Personals). Here is a suggestion that would give you an edge. In a nutshell: Each member lists the criteria for people that they are looking for. Then when people contact them, they choose whether or not to respond. After the system has been keeping track of who contacts you and who you respond to, the site lists your profile in other people's search results along with your criteria-specific response rate: "Lisa has responded to 56% of people who contacted her who meet her criteria." Read on for the rest of his thoughts.
[+] Your Rights Online: Censorship By Glut 317 comments
Frequent Slashdot contributor Bennett Haselton writes "A 2006 paper by Matthew Salganik, Peter Dodds and Duncan Watts, about the patterns that users follow in choosing and recommending songs to each other on a music download site, may be the key to understanding the most effective form of "censorship" that still exists in mostly-free countries like the US. It also explains why your great ideas haven't made you famous, while lower-wattage bulbs always seem to find a platform to spout off their ideas (and you can keep your smart remarks to yourself)." Read on for the rest of Bennett's take on why the effects of peer ratings on a music download site go a long way towards explaining how good ideas can effectively be "censored" even in a country with no formal political censorship.
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by UbuntuDupe (970646) * on Wednesday February 14 2007, @12:09PM (#18013422) Journal
    I like the essay except for this:

    "The real losses caused by an inefficient search algorithm, are the efforts spent by companies to game the search results (e.g. paying search engine optimization firms to try and get them to the top Google spot), and the reluctance of new players to enter that market if they don't have the resources to play those games. If two companies each spend $5,000 trying to knock each other off of the top spot for a search like "weddings", that's $5,000 worth of effort that gets burned up with no offsetting amount of goods and services added to the world. This is what economists call a deadweight loss, with no corresponding benefit to any party."


    This issue has long bugged me and it's hard to get answers about it. I don't understand how this is a deadweight loss (DWL) by his definition. Who got the $5000 worth of effort from each of them that they spent? That was the corresponding benefit to another party. How is this DWL different from the "non-DWL" example directly preceding, in which someone overpaid for hosting, but that was the hosting company's gain?

    Does anyone have a rigorous DWL definition that can be backed up by a valid example?
    • Re: (Score:3, Informative)

      Because the first example is equivalent to someone just handing the hosting company 50 bucks a month as a free gift. Money is exchanged, but nothing happens. In the second example, money is exchanged AND people work very hard for a long time to earn it and yet produce nothing. It would be like me paying you to dig a hole and then fill it in. The time you spend doing that is time you can't spend curing cancer.
  • Wikia search site uses something like this "automeritocracy" algorithm to guard the integrity of its results, it's imperative not to use an algorithm vulnerable to the hordes-of-phantom-users attack

    That right there is a billion-dollar idea that I'm sure more than a small horde of devs are working on for themselves or for vulture capitalists.

    Will Mr. Wales own the magic algorithm to use as he sees fit or what?
  • All you have to do to substantially reduce "gaming" the system is to not make it worthwhile.

    Since you can pay Google to have your site link placed right at the top of the search results, for less than what you'd pay someone to game the system to reach a similar position, it wouldn't make sense for large companies to try to "game" Google at all.

    If it weren't for the advertising, we'd probably see a lot more of this on Google.

    Maybe this project could implement something similar.
  • by SirGarlon (845873) on Wednesday February 14 2007, @12:15PM (#18013512)
    I seriously doubt this will turn into anything useful because it relies on a collective definition of "merit." When you and I search for information on the same topic, your needs and my needs may be totally different (I may be looking for a little bit of general background and you may be looking to compare and contrast the opinions of two recognized experts in the field). Even if all the hurdles against manipulation can be overcome, I don't see how "merit" rankings will amount to anything more than a popularity contest.
    • by nine-times (778537) <nine.times@gmail.com> on Wednesday February 14 2007, @12:47PM (#18013914) Homepage

      In fairness, I don't think that "merit" is relative with respect to search-engine results. In a simplified example, if I search for "sony", I'm probably looking for one of three things:

      1. The Sony website
      2. A website that sells Sony products
      3. A website that gives reviews of Sony products

      Therefore, the top results should reflect that. Most likely, I'm not looking for porn. I remember the days where search engines would return porn for any and all searches. The fact that Google was able to avoid this is part of what brought about its rise to power.

      Of course, not every example is so simple, but clearly there are results that are or are not correct for a given search.

    • Re: (Score:3, Interesting)

      I seriously doubt this will turn into anything useful because it relies on a collective definition of "merit."

      Good point. But furthermore, I can guarantee you this won't work, simply because web page rankings and spam filtering are essentially the same thing, and the spam issue has not been solved. That is, even when we don't have the problem of multiple conflicting opinions and all we're trying to do is model the preferences of a single recipient, we still can't do it!

  • StumbleUpon (Score:4, Informative)

    by EricBoyd (532608) <mrericboyd AT yahoo DOT com> on Wednesday February 14 2007, @12:22PM (#18013618) Homepage
    It's not a "search engine" per se, but a lot of your talk of "automated meritocracy" sounds exactly like what StumbleUpon [stumbleupon.com] does in order to recommend content to users. People vote on a page, those votes are passed through an automated collaborative filtering system, and then the page is shown to more users who are predicted to like it; rinse, lather, and repeat. Good content rises to the top of the recommendation queue, so that new users (or people who just joined a category) are shown the things which the vast majority of people liked, in order to build up a rating history to personalize that person's recommendations.
  • by currivan (654314) on Wednesday February 14 2007, @12:23PM (#18013632)
    There are two main directions where search can improve. One is better understanding of natural language, to disambiguate query terms and provide results where the wording on pages is different from the wording of the query.

    The other, which this approach can address, is to improve the term relevance scores and overall page quality metrics that mainstream search engines are based on. Google had its initial success because of two features of this type: one was PageRank, a measure of overall topic-independent site popularity, and the other was better use of anchor text, the words people write when linking to other pages.

    In both cases, they mined the link structure of the web, which was essentially aggregate community generated information about site quality that wasn't being spammed at the time. As they succeeded, regular people put less effort into writing their own link text, and spammers took over.

    The next source of this type of community generated content will probably be something incidental instead of deliberately created. If you build a central repository of reviews of web sites, you both make it easy for people to game your results, and you open yourself up to lawsuits from interested parties.

    However, untapped information already exists on what people find useful on the web in the form of their browsing histories, a special case of this being their bookmarks. Someone who could aggregate this information on what millions of people ended up looking at after they ran a particular search query would be in an excellent position to improve the traditional search engine scoring algorithm beyond link data.
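
    A rough sketch of that last idea, assuming someone already has a log of which page each searcher ended up on after a query. The log records, the baseline scores, and the `weight` parameter are invented for the example; how such data would be collected is left open.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (query, url the user ended up on).
log = [
    ("sony", "sony.com"),
    ("sony", "sony.com"),
    ("sony", "reviews.example"),
    ("sony", "spam.example"),
    ("apple pie", "recipes.example"),
]

def build_click_counts(records):
    """Aggregate the log into query -> Counter of destination URLs."""
    counts = defaultdict(Counter)
    for query, url in records:
        counts[query][url] += 1
    return counts

def rerank(query, baseline, counts, weight=1.0):
    """Add weight * click share to each baseline score and sort, best first.

    `baseline` maps url -> score from the existing (e.g. link-based) ranking.
    """
    clicks = counts.get(query, Counter())
    total = sum(clicks.values())

    def score(url):
        share = clicks[url] / total if total else 0.0
        return baseline[url] + weight * share

    return sorted(baseline, key=score, reverse=True)
```

    With no click data for a query, the order falls back to the baseline scores unchanged.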

    • Re: (Score:3, Interesting)

      There are two main directions where search can improve. One is better understanding of natural language, to disambiguate query terms and provide results where the wording on pages is different from the wording of the query.

      I'm highly skeptical about this path because NL works best in a specified (narrow) context. So if you can specify the context, then you must have already put web pages into context - driven by what? the semantic web? If you've done that, then NL is almost redundant. Like, maybe I want

  • by Bananatree3 (872975) * on Wednesday February 14 2007, @12:29PM (#18013722)
    There already exists a distributed, open source engine that has been around a while, called Majestic 12 [majestic12.co.uk]. It uses a client-based crawler, which crawls the web for hundreds of millions of URLs and then sends the data back to central servers. The servers then compile the data and use user-based searching algorithms to perform the search. While the algorithms are still very much in alpha, it is still a very noteworthy project. Also, its URL base is currently around 30-35 billion URLs.
  • Which Community? (Score:3, Insightful)

    by RAMMS+EIN (578166) on Wednesday February 14 2007, @12:42PM (#18013852) Homepage Journal
    ``First, though, consider the benefits that such a search engine could bring, both to content consumers and content providers, if it really did return results sorted according to average community preferences.''

    It's also interesting to ask "which community?" There is a small number of categories of things that define some high percentage of the things I search for. I am pretty sure there is a very small intersection of those categories with the categories of things the world's population as a whole searches for. There are also differences based on location and language. In short, my preferences are almost certainly very different from the average of all searchers.

    On the other hand, there are definitely groups of searchers whose preferences coincide with mine. For example, people who are involved in open source development, *nix users, computer scientists, environmentalists, English speakers, and people in the Netherlands probably have preferences that largely overlap with mine.

    This suggests to me that some sort of machine learning might be used, where the system guesses your search preferences based on what links you have followed in the past, and what links other people have followed in the past. In other words, the system (implicitly) tries to determine which communities you are part of, and gives you results that are preferred by members of these communities.
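
    One very simple way to sketch this, assuming the system stores each user's set of followed links: treat overlap between click histories (Jaccard similarity) as implicit community membership, and score each result by how much similar users followed it. The histories and names below are hypothetical, and a real system would presumably use something more sophisticated.

```python
# Hypothetical click histories: user -> set of links followed in the past.
history = {
    "me":    {"kernel.example", "gnu.example", "nl-news.example"},
    "dev":   {"kernel.example", "gnu.example", "bsd.example"},
    "other": {"celebs.example", "sports.example"},
}

def jaccard(a, b):
    """Overlap of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def personalized_order(user, results):
    """Sort results by how much users with similar histories followed them."""
    def score(url):
        return sum(
            jaccard(history[user], links)
            for other, links in history.items()
            if other != user and url in links
        )
    return sorted(results, key=score, reverse=True)
```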
  • by DysenteryInTheRanks (902824) on Wednesday February 14 2007, @12:51PM (#18013980) Homepage
    He's thinking about this all wrong.

    A true open source search engine would let anyone roll their own algorithm. Each algorithm would be a sort of "plug in."

    The index would be the shared, open source part, collaboratively crawled (via PC software or browser plugin) by everyone who elects to participate.

    Algorithms would either work on the index after the fact, or, if they need access to the indexing process itself, would be part of a series of plugins run on the full HTML of each page.

    The index itself would have an open API, so people could build their own front end search websites.

    Trying to design the right algorithm up front is a premature optimization. I have no interest in helping Jimmy Wales become the next Sergey Brin. But I *would* participate in something that gives _me_ a shot, however distant, at founding the next Google, minus the massive spider farm.
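
    To make the plug-in idea concrete, here is a small sketch: one shared index, and a registry that anyone can add ranking functions to. The index contents, the registry, and both example rankers are invented for illustration and are not part of any existing project.

```python
from typing import Callable, Dict, List

# Hypothetical shared index: url -> page text, as produced by the crawlers.
Index = Dict[str, str]
# A ranking plug-in maps (query, index) to an ordered list of URLs.
Ranker = Callable[[str, Index], List[str]]

RANKERS: Dict[str, Ranker] = {}

def register(name: str):
    """Decorator that adds a ranking function to the plug-in registry."""
    def wrap(fn: Ranker) -> Ranker:
        RANKERS[name] = fn
        return fn
    return wrap

@register("term_count")
def term_count(query: str, index: Index) -> List[str]:
    """Rank matching pages by how often the query term appears."""
    q = query.lower()
    hits = {u: t.lower().split().count(q) for u, t in index.items()}
    return sorted((u for u in hits if hits[u]), key=lambda u: hits[u], reverse=True)

@register("shortest_first")
def shortest_first(query: str, index: Index) -> List[str]:
    """Rank matching pages by text length, shortest first."""
    q = query.lower()
    matches = [u for u, t in index.items() if q in t.lower().split()]
    return sorted(matches, key=lambda u: len(index[u]))

def search(query: str, index: Index, algorithm: str) -> List[str]:
    """Front ends pick an algorithm by name; the index is the same for all."""
    return RANKERS[algorithm](query, index)

INDEX: Index = {
    "a.example": "pie pie pie recipe with a long introduction",
    "b.example": "pie",
    "c.example": "cars",
}
```

    The same query against the same index gives different orderings depending on which plug-in the front end selects, which is the point of keeping the index shared and the algorithm swappable.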
  • by Animats (122034) on Wednesday February 14 2007, @12:59PM (#18014102) Homepage

    Rating by asking random users has been tried. At IBM. See United States Patent 7,080,064, Sundaresan July 18, 2006, "System and method for integrating on-line user ratings of businesses with search engines". Sundaresan has several patents related to schemes for asking users for ratings and using that info to adjust search rankings.

    The basic trouble with this approach is that, if you ask random users to rate random sites, they don't have enough time, energy, or effort to do a good job of it. If you ask self-selected users of the sites, the system can be gamed.

    This sort of thing only works where the set of things to rate is small compared to the interested user population. So it's great for movies, marginal for restaurants, and poor for websites generally.

  • by Bluesman (104513) on Wednesday February 14 2007, @01:01PM (#18014120) Homepage
    >The extra $50 that the user pays is the user's loss, but it's also the hosting company's gain.
    >If we consider costs and benefits across all parties, the two cancel out.
    >The world as a whole is not poorer because someone overpaid for hosting.

    And thus the broken window fallacy continues...

    Wealth is created through increased efficiency. A decrease in efficiency is a decrease in wealth, regardless of who benefits.

    By the "world is not poorer" logic, we might as well all ride horses, since we'd be paying oat producers and horseshoe manufacturers instead of the auto industry, so the world as a whole wouldn't be poorer.

    Paying more for inefficient hosting takes money away from more efficient uses.
  • The author of this piece talks about meritocratic search as if there were some real, fixed ordering of the search results that we just have to be smart enough to uncover. This is anything but the case. For instance, is the apple pie recipe that makes better-tasting pie but is too complicated for an inexperienced chef better or worse than the one that is extremely easy to follow but isn't as good? With pie this sort of issue might not be a big deal, but what happens when we start talking about things like climate science? Is the best result some environmental activist's site, a mass media story, a global warming skeptic's site, or the actual scientific results that are too technical for most of the public to understand?

    Sure, Wikipedia makes these compromises quite well, but the idea of content-neutral encyclopedia entries provides a well-defined goal. The second we get to a search engine we can no longer cling to content neutrality, because we must choose how to rank the advocacy sites on both sides of the spectrum. Unlike Wikipedia, where one can neutrally remark that some people believe X and others Y, in a search engine the community has to decide whether "unwanted pregnancy" is going to take someone to the Planned Parenthood site, an abortion clinic, or an anti-abortion site.

    In short there is no notion of the meritocratic search order, there are just tradeoffs between different sorts of searchers. Google is already navigating this maze of tradeoffs, including looking at what users like, so I fail to see the argument that a community search will obviously make better tradeoffs than Google.

    In fact anyone who has spent much time on the Internet realizes that every community tends to develop its own prejudices and biases pushing away those who disagree and attracting those who agree. Slashdot attracts open source zealots and repels the technically inept. Whatever community develops this search engine will have its own biases which will discourage participation by those who don't agree. This is just human nature.

    Likely I might enjoy the results returned by such a search since I suspect the participants are likely to be technically sophisticated nerds and others who have similar views as I do. However, it seems doubtful that they will provide the results people who are very different than those who run the search engine will appreciate.

    Besides, this whole project just smells hokey to me. It sounds like Wales is drunk on his success with Wikipedia and advocating it as THE solution to any problem. Problems are pragmatic things and they shouldn't be solved by ideologies.
  • by Animats (122034) on Wednesday February 14 2007, @01:38PM (#18014554) Homepage

    We hadn't planned to announce this quite yet, but this is a good opportunity.

    We have a new answer to search - SiteTruth. [sitetruth.com] It's working, but not yet open to the public.

    Other search engines rate businesses based on some measure of popularity - incoming links or user ratings. SiteTruth rates businesses for legitimacy.

    What determines legitimacy? The sources anti-fraud investigators tell you to check, but nobody ever does. Corporate registrations. Business licenses. Better Business Bureau reports. The contents of SSL certificates. Business addresses. Business credit ratings. Credit card processors. All that information is available. It's a data-mining problem, and we've solved it. The process is entirely automated.

    Most of the phony web sites, doorway pages, and other junk on the web have no identifiable business behind them. Try to find out who really owns them, and you can't. When we can't, we downgrade their ranking. With SiteTruth, you can create all the phony web sites you want, but they'll be nowhere near the beginning of any search result.

    Creating a phony company, or stealing the identity of another company, is possible, but it's difficult, expensive and involves committing felonies. Thus, SiteTruth cannot be "gamed" without committing a felony. This weeds out most of the phonies.

    SiteTruth only rates "commercial" sites. If you're not selling anything or advertising anything, SiteTruth gives you a neutral or blank rating. If you're engaged in commerce, you can't be anonymous. In many jurisdictions, it's a criminal offense to run a business without disclosing who's behind it. That's the key to SiteTruth.
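
    As a toy illustration only (this is not SiteTruth's actual algorithm, which is not described here), the rule "neutral for non-commercial sites, downgrade commercial sites with no identifiable business" could be sketched like this. The three checks, the field names, and the way legitimacy is combined with relevance are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Site:
    url: str
    commercial: bool                 # sells or advertises something
    has_registration: bool = False   # hypothetical check results
    has_business_address: bool = False
    ssl_names_owner: bool = False

def legitimacy(site):
    """None for non-commercial sites, else fraction of checks passed (0.0 to 1.0)."""
    if not site.commercial:
        return None  # neutral / blank rating
    checks = [site.has_registration, site.has_business_address, site.ssl_names_owner]
    return sum(checks) / len(checks)

def rank(sites, relevance):
    """Sort by relevance scaled by legitimacy; neutral sites keep their relevance."""
    def score(site):
        legit = legitimacy(site)
        return relevance[site.url] * (1.0 if legit is None else legit)
    return sorted(sites, key=score, reverse=True)
```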

    Our tag line: "SiteTruth - Know who you're dealing with."

    The site will open to the public in a few months. Meanwhile, we're starting outreach to the search engine optimization community to get them ready for SiteTruth. We want all legitimate sites to get the highest rating to which they're entitled. An expired corporate registration or seal of trust hurts your SiteTruth ranking, so we want to remind people to get their paperwork up to date.

    The patent is pending.

    • by Kelson (129150) * on Wednesday February 14 2007, @12:43PM (#18013864) Homepage Journal

      In general I see the term "gamed" as subjective. When outcomes match an individual's expectations, they see the system as working; when they disagree with the outcome, they call it gaming.

      Very true. For an example, look no further than the subset of SEO that sees no difference between setting up hundreds of automatically generated pages linking to a site for the sole purpose of increasing search rankings and hundreds of individual people independently writing about (and linking to) a site. I've actually seen people in the linkfarm business claim that they're not doing anything different from bloggers.

      This is basically equivalent to saying that there's no difference between one person writing 10 letters to a politician under assumed names, and 10 people writing their own letters.
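
      The distinction can be put in a few lines of code, assuming (and this is the hard part in practice) that we somehow know who controls each linking page. The link data and owner labels are hypothetical.

```python
# Hypothetical inbound links: (linking page, who controls that page).
inbound = [
    ("farm1.example", "seo-co"),
    ("farm2.example", "seo-co"),
    ("farm3.example", "seo-co"),
    ("blog-a.example", "ann"),
    ("blog-b.example", "raj"),
]

def raw_link_count(links):
    """Every linking page counts, no matter who set it up."""
    return len(links)

def independent_link_count(links):
    """Count each controlling party once, however many pages it runs."""
    return len({owner for _page, owner in links})
```

      The link farm contributes three links to the raw count but only one independent voice, just like the person writing letters under assumed names.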

      • I don't think Google is highly regressive in the way you describe, but I suppose it certainly depends on your definition of regressive.

        Google is definitely regressive from the point of view that it tries to represent the average total mindshare about search terms - NOT the average CURRENT mindshare. So if you want to find the up and coming site that's ABOUT to be the new hotness but hasn't reached critical mass yet, you need something like the derivative of Google's PageRank.

        But this is definitely NOT what
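
        A crude sketch of the "derivative" idea mentioned above: compare a page's score at two crawl dates and sort by the change rather than by the score itself. The scores are made-up numbers, not real PageRank values, and a single difference between two snapshots is only the simplest possible stand-in for a derivative.

```python
# Hypothetical link-based scores for the same pages at two crawl dates.
score_last_month = {"old-giant.example": 0.90, "newcomer.example": 0.02, "steady.example": 0.30}
score_today = {"old-giant.example": 0.91, "newcomer.example": 0.10, "steady.example": 0.30}

def by_score(scores):
    """Sort pages by total score, highest first."""
    return sorted(scores, key=lambda p: scores[p], reverse=True)

def by_growth(before, after):
    """Sort pages by change in score between two snapshots, fastest riser first."""
    return sorted(after, key=lambda p: after[p] - before.get(p, 0.0), reverse=True)
```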
    • Arrow's Theorem (Score:4, Informative)

      by attonitus (533238) on Wednesday February 14 2007, @03:23PM (#18015830)
      Such a theorem does exist and is proven! Arrow's Theorem [wikipedia.org] states that it's impossible to design a voting system that satisfies three really basic conditions:

      a) The removal of one candidate from the race would not affect the rank of the others;

      b) If everyone prefers candidate A to candidate B then the algorithm should rank A above B;

      c) There is no dictator (i.e. no single voter whose preferences always determine the result).

      The same criteria should also apply to a perfect search engine - the removal of one page from the web should not affect the relative ranking of the others, if everyone thinks page A is better than page B, page A should come first and, to be practical, the engine should take as input the priorities of more than just one person (it's not feasible to build a customized search engine that knows exactly the priorities of each individual user).

      Therefore, a perfect search algorithm does not exist
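
      To see condition (a) fail in a concrete case, here is a small sketch using the Borda count (my choice of example system; it is not mentioned above). It does not prove the theorem, it only shows one common ranking method where removing a candidate flips the order of the remaining two. The ballots are made up.

```python
def borda(ballots):
    """Return candidates sorted by Borda score, highest first.

    Each ballot lists candidates from most to least preferred; with n
    candidates, first place earns n-1 points and last place earns 0.
    """
    scores = {}
    for ballot in ballots:
        n = len(ballot)
        for place, cand in enumerate(ballot):
            scores[cand] = scores.get(cand, 0) + (n - 1 - place)
    return sorted(scores, key=lambda c: scores[c], reverse=True)

def without(ballots, removed):
    """The same ballots with one candidate struck from every ranking."""
    return [[c for c in b if c != removed] for b in ballots]

# Three voters rank A > B > C, two voters rank B > C > A.
ballots = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2
```

      With all three candidates the scores are B = 7, A = 6, C = 2, so B ranks above A. Strike C from every ballot and the scores become A = 3, B = 2, so A now ranks above B, even though nobody changed their mind about A versus B.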