Forgot your password?
typodupeerror
The Internet

How to Stop Digg-cheating, Forever 217

Posted by CmdrTaco
from the good-luck-with-that dept.
The following was written by frequent Slashdot editorial contributor Bennett Haselton. He writes "Recently author Annalee Newitz created a bit of a stir with the revelation that she had bought her way to the front page of the story-ranking site Digg. Since Digg allows any registered user to go to a story's URL and "digg it" in order to push it upward through the story-ranking system, it was inevitable that services like User/Submitter would come along, where a Digg user can pay for other users to cast votes to push their story up to the top. User/Submitter says they are currently backlogged and not taking new orders, but they say the service will return and will soon feature services for manipulating similar sites like Digg competitor reddit. Even if the new U/S features are vaporware, it probably won't be long before other companies offer similar services. But it seems like all of these story-ranking sites could prevent the manipulation by making one simple change to their voting algorithm."

Before getting to that though, what's at stake? The revelation that Digg could be trivially manipulated did not cause the site to be overrun with bogus stories all at once -- most of the links on the front page still look interesting. Newitz said that her story, which was deliberately chosen to be as lame as possible, got buried by users soon after it hit the front page, which is how Digg cleans spam stories out of the system. However, she also said that in the time that the story was on the front page, the story got about 35,000 hits, whereupon her server crashed and the traffic was thereafter divided with two other mirror sites; presumably if the server had stayed up, she would have gotten about 100,000 hits, all for an initial expenditure of $100, which is orders of magnitude cheaper than buying advertising any other way. (If she had done the same thing with a good story instead of a deliberately lame one, presumably the traffic gains resulting from word-of-mouth and repeat visitors would have been even higher.) As long as the benefits outweigh the cost, more and more unscrupulous users are likely to pay for such services, and since the service provided by User/Submitter is easy to copy, probably similar services will spring up to drive the price down even further. If nothing changes, then eventually sites like Digg and reddit will be flooded with nothing but paid stories. Most of the stories on the front page will probably still be interesting (why would you pay to promote a link, unless it was good enough to draw repeat visitors and get the most value for your money?), but everybody who didn't pay for votes would eventually get crowded out.

One Good Samaritan, Jim Messenger, managed to shut down one Digg manipulation service called Spike The Vote, by buying it out (for a paltry $1,275 - they must have wanted to get out fast) and then turning over to Digg. He warned people that the moral was: Don't sign up for Digg manipulation services, since Digg might get your information from them and then you'll be banned. Actually, I think the moral is simpler: if you're going to try anything like that, do it from a throwaway account that you don't care about losing if you get caught. (Or, only sign up with manipulation services which publish a privacy policy promising never to share your information, especially not with sites like Digg. Then if Digg buys them out, then the site has violated their privacy policy and Digg as the new owner inherits the liability for that, so you can sue them, right?) But as the idea spreads, it will probably become impractical to play whack-a-mole by shutting down manipulation services as they keep springing up. Any time the cost of providing a service (clicking on a few buttons) is small compared to the benefits of receiving the service (100,000 hits in 24 hours), a market will exist for it one way or another, whether you're talking about drug-smuggling, prostitution, or selling Digg votes.

However, I think there's a way to fix it, and here it is. Have you ever seen people put a link in their profile to their HotOrNot picture, saying "Go here and vote me a 10!!"? Similar to the people who send links to their friends and say, "I just posted this, please Digg this for me!" The difference is that on HotOrNot, it doesn't work. On HotOrNot, you can cast votes for a picture in one of two ways. The first way is to go directly to the URL for someone's picture; the second way is to load the front page, where a random picture from the database is selected at random, and vote for whatever picture comes up. The catch is that the votes that you cast by going directly to someone's picture, are simply ignored in calculating the average score for that photo. The only votes that are counted are the votes cast for random pictures displayed on the front page. So if you want to manipulate the voting for your own photo, you'd have to load the front page hundreds of thousands of times waiting for your own picture to come up repeatedly, which is hard to do without being detected.

To enable an algorithm like this on Digg and reddit, the sites could present users with a sidebar box that displays random stories from the pool of recent submissions. (reddit already has a serendipity feature that users can use to select a random story from the available pool, which could be leveraged for this purpose.) Once a story has collected, say, 100 votes -- or whatever number is considered sufficient to provide a representative random sample of how the story appeals to people -- then on that basis the story can either be buried or promoted to the top, where it would be seen by, say, 100,000 people. The elegance of this system is that bad content would only be seen by 100 people on average before it's buried, whereas good content would be seen by all the 100,000 people who view it on the front page, so the average user sees 1,000 pieces of good content for every 1 piece of crap. Even if 75% of users ignore the random story box completely, that just means you have to display it to 400 users instead of 100 before you have enough data points for a good random sample.

I suggested essentially the same algorithm for how an open-source search engine could work without being vulnerable to gaming even by those who understood all of its inner workings. The main difference, of course, is that Digg and reddit actually exist now. Digg declined to comment on the possible merits of such an algorithm; reddit's Steve Huffman said that the idea sounded interesting, although even if the idea got full buy-in, naturally any proposed change would take a long time to bring to fruition.

But it seems that an algorithm similar to this one would be the only way to prevent cheating on sites like Digg that sort content based on user votes. So it's ironic that HotOrNot, the only site I know of that is using a variation of this algorithm and hence is probably the most secure against cheating, is also the one where cheating is least likely to be a problem. Getting a high placement on Digg might enable you to make some money, but getting a highly rated picture on HotOrNot isn't going to make you rich (unless it helps you meet a millionaire who is using the site to find his third wife). Also, making HotOrNot meritocratic doesn't give people an incentive to improve the "content" that they submit, because up to the limits of what can be done with hair and wardrobe, you can't make yourself that much more attractive. With Digg and reddit, on the other hand, I might work harder at submitting a good story, if I knew that it worked in a perfectly meritocratic fashion that pushed good stories right to the top.

If you do this, you don't need any of the other countermeasures listed in Annalee Newitz's follow-up piece "Herding the Mob", such as analyzing user account history for suspicious behavior. As long as most users in the system are legitimate, most of the users in your random sample will be legitimate as well, and their voting will be representative of what most of the community would think. A story could also get a high score within a specific sub-area of the site like the sports page, but kept off of the main site front page, if the story got a high score from a random sampling of sports-oriented users but a low score from a sample of everyone else.

You could even sub-divide the topical areas further, down to a level of granularity like "Would Barack Obama make a good president?" A site called Helium is currently trying something like this -- users can submit essays on subjects like "Racial inequality or oppression: Do they truly exist in todays society?", and vote on how to rank other essays against each other. The voting works on the random selection principle that I'm advocating here -- users are presented with a pair of randomly chosen essays from a given category (not necessarily the same category for which you submitted an essay) and told to vote for the better one, so there's no way to tell all your friends to go to the link for your essay and give it a high rating. The main limitation though is that while the votes can push you to the top of a particular sub-category, that won't cause your article to "break out" and get to the front page of the site -- Helium says that those front-page articles are chosen at random by employees from the among those articles that are highly rated within their narrow category, so just being good is not enough. And if you want to write something that doesn't fit into any existing categories, you have to create a new category for your essay like I did, which will then be a category containing one essay that nobody else ever sees. Perhaps both of these limitations could be overcome by adding the option to rate randomly selected essays on a scale of 1 to 10 -- thus providing a way to rate essays that exist alone in their own category, and also a way to find the best essays across the entire site, rated against each other.

If Digg or reddit adopts a model that uses the random-voter-selection method, then there's the issue of how to handle the votes cast by users under the current system -- the ones who go to a story link and click "digg it", which is what makes the existing system vulnerable to gaming. Digg could do what HotOrNot does, and just ignore those votes outright, but users would probably view this as deceptive. Perhaps Digg could say that votes cast by self-selected users (the ones who go straight to the story link) are counted along with votes from randomly-selected users, unless the average of the self-selected votes is significantly different from the average from the randomly-selected votes, in which case the self-selected votes are ignored. Hopefully this would satisfy most users and preserve the "community" feel of the site, and only a spoilsport would point out that counting the self-selected votes only if they agree with the randomly-selected votes, is exactly the same thing as ignoring the self-selected votes entirely.

I asked the owner of User/Submitter what he thought about this. He was willing to talk with surprising candor (except about things like his real name) and spoke as if he'd like nothing better than for Digg to make changes to their service that would block his system from working. To both Annalee Newitz and me, he said, "We find it interesting that Digg still allows anybody to view any user's diggs. By way of this 'feature,' User/Submitter is able to verify that our users actually digg the stories they're given. Without this feature, Digg users are given complete digging privacy, and User/Submitter cannot exist." Some have expressed skepticism that the Digg cheaters really want Digg to fix the problem. But as a security tester, I can understand that mentality. If you report a problem, and a company doesn't fix it, eventually you get tempted to publicize the problem to draw attention to it. And if they still don't fix it, and it's a fairly benign security hole that merely enables some pranksters to get some undeserved attention, why not build a service around exploiting the hole, if will highlight the problem and encourage it to get fixed?

So I'm going to go out on a limb and say the U/S guy sincerely wants Digg to be more secure. However I disagree with him about his proposed fix, that of hiding a user's digg history. First of all, it won't stop anyone who creates a multitude of accounts all under their control -- you can use Tor to make it appear that you're coming from many different IP addresses, and build up a history of "legitimate" votes before using your votes to push sites deliberately. (Be sure to use different browsers, or vary your User-Agent header if you know how to do that, so that a series of votes from identical browser types doesn't give you away.) If your service does work by paying other users to cast votes, then you could still audit whether they're casting their votes honestly -- for example, create a test story, use 5 sockpuppet accounts to digg it 5 times, then tell your confederate to digg it. If the number of diggs doesn't go up to 6, then you know they're not honoring their end of the deal, and kick them out of the system. As long as most confederates think there might be some chance of getting caught if they don't play along, most of them would probably cast the votes that they were paid for, since it costs them nothing to do so and they wouldn't want to jeopardize their stream of easy money.

I asked the owner of User/Submitter if his service could defeat the random-sampling algorithm I described. "It would slow down our service," he answered, "but certainly wouldn't eliminate it because eventually a U/S User will have an opportunity to vote on a U/S Submission by way of chance." But I don't see how this would beat the algorithm -- some U/S voters would still get to vote on the story, but as long as there are far more legitimate voters than U/S voters, then a random sampling will almost always contain far more legitimate voters. The U/S owner also said, "Randomized voting privileges would be unnecessarily confusing, frustrating, and fragmenting. Not to forget: unfair and undemocratic." Well, you could keep it from being "confusing" or "frustrating" by keeping the existing interface (with the possible addition of a randomly-selected-story box), so that the only changes would be in how the votes are handled under the hood. "Fragmenting"? If anything, it seems to me that the existing Digg/reddit algorithms would be more fragmenting, keeping users within their existing communities of friend who vote for each others' stories; a random-selection box would give stories with "crossover appeal" a greater chance of success, bringing them to the attention of users who might otherwise never have seen them. As for "unfair and undemocratic", presumably this is a reaction to the fact that the votes of 100 users decide what everyone else sees. But it's already the case with Digg that the votes of a small number of users decide what content becomes popular. At least with a random sample of users, it would be the case that the vast majority of the time, the voting outcome would be the same as it would have been if the entire site had voted, due to the magic of representative sampling.

So, I'm putting this suggestion out there for the same reason that Jim Messenger bought out Spike The Vote -- because I don't want sites like Digg and reddit to be manipulated by the abusers. In fact, if they used this algorithm, they would become more meritocratic than they are now, because the systems would strictly favor the highest-rated content, instead of content written by people who have informal networks of friends who can all go digg their stories for them. If I were to design the user rating system to make it cheat-proof, these are the exact details of what I would do:

  • Wherever they decide to post the "random story sampling" box (on the front page, or on a link off to a separate page, etc.), have it work so that as soon as new stories are submitted, they can be rotated into that box and displayed to a random set of users, until it's reached its total of 100 votes or however many are required to get a random sample.
  • You can have "shutout voting" to kill off stories early that are obvious spam or otherwise really useless, without going through the full 100 votes. (For example, if 90% of the first 10 votes are negative, then stop collecting votes.) This decreases the number of users "inconvenienced" by really obvious spam and other garbage.
  • For someone to submit content that gets rotated into that voting process, have them submit a Turing test (read numbers off of a graphic and type them in), or something similar. This prevents spammers from submitting spam content over and over just to have it viewed by those initial 10 voters. If they have to type in a number each time, it's not worth it.
  • When users give votes to a story, give them the option to say why they voted the way that they did. (This is especially valuable if they're giving negative votes, then the submitter would know what to improve.) Personally I think the comments would be more valuable if each user can't see other users' comments, at the time they submit their own comments; this prevents the "me too" effect where everybody echoes the first two commenters. (When I ask for independent comments from people, and they almost all say the same thing without seeing each other's comments, that's when I know they have a point!)
  • To prevent an attacker from having their own username hit the random-voting page over and over in hopes of voting up their own content, make sure that each user account is only allowed to vote on a given piece of content once (even if they found the content through the random-story page).
  • Require a Turing test for new user signups. This would prevent an attacker from registering a huge number of accounts just to hit the random voting page with different users over and over, in hopes getting to vote on their own submitted content eventually.

Then after running this system for a while, look through some collected data to determine if the system could be more efficient. For example, do you really need a sample of 100 votes every time? Suppose you determine that in 99% of cases, you get the same result just from tabulating the first 50 votes, as you would have gotten from tabulating all 100 votes. Then you could modify the system to collect only the first 50 votes, and then make a decision.

Suggestions for improvement? Flaws (hopefully not fatal)? Everyone who cares about keeping community sites like Digg free from abuse, and who wants to create a path for the best content to rise to the top, let's put our heads together and see what we can think of. The above is intended merely as a jumping-off point, and although I've worked it over and I can't see any specific points to improve efficiency, that's probably just because I've been looking at it too long. And if you Digg this story for me I'll give you 1,000 times as much cash as I gave my Mom last Mother's Day.

This discussion has been archived. No new comments can be posted.

How to Stop Digg-cheating, Forever

Comments Filter:

CCI Power 6/40: one board, a megabyte of cache, and an attitude...

Working...