IBM

Randomizing Survey Answers For Accuracy 224

Saint Aardvark writes: "The New York Times reports that two researchers at IBM have come up with a way to persuade people to give correct answers to survey questions: randomize the results. Strangely enough, they can get accurate information out of the aggregate of enough answers -- but it's completely anonymized. Since conservative estimates say nearly half of all survey answers are bogus, there's an interest in persuading people to be more truthful. As ever, you can use the Random NY Times Registration Generator to falsify your registration details and read the article..."
  • If I want to intentionally put a bogus answer into a poll, such randomization doesn't affect it at all. Quite often, I will answer a poll not in the way that I actually feel, but in the way that interests me the most at that particular moment. This randomization doesn't affect that.
    • I will answer a poll not in the way that I actually feel, but in the way that interests me the most at that particular moment

      Even worse, it's a lot of fun just to screw the poll and prove statistics wrong :D
    • Basically, people providing false answers will very often pick a false answer based on its position in the list. Many will pick the middle option, with the others picked less often, though the first and last options may attract an oddly large number of people compared to the second and second-to-last options.

      When the answers are given in random order, each option cycles through the different spots. The liars end up cancelling out other liars who used the form before them. The differences in the answers are then mostly based on how the truth tellers answered (I say mostly because some liars may have a different way of selecting a lie, such as picking the longest string offered in the answers), and so you can derive more meaningful statistics from them.

      See this /. poll [slashdot.org] that illustrates the phenomenon. For an exercise, imagine what the results would look like if the options were randomly ordered for each person the poll was shown to. My bet is each would end up near 12.5%.
  • Hrm (Score:2, Funny)

    by Anonymous Coward
    In the past, I'd give false answers. Now I'll need to randomize my true/false answers to throw their randomness off.
  • I don't get it. (Score:4, Insightful)

    by AKAImBatman ( 238306 ) <`akaimbatman' `at' `gmail.com'> on Sunday July 21, 2002 @12:12PM (#3926317) Homepage Journal
    Ok, fine. They've managed to come up with a model that doesn't actually collect any data. And how will this help people to enter REAL data? People don't give data because they don't trust the company. If they don't trust the company, do you really think they'll believe some mumbo-jumbo about "randomizing"?
    • Re:I don't get it. (Score:3, Insightful)

      by plugger ( 450839 )

      If they don't trust the company, do you really think they'll believe some mumbo-jumbo about "randomizing"?

      Fair point. One solution might be to perform the randomization on the client side and display the result. That way the user can see that the answers have been munged before they are sent.

      Then again, if all you are interested in is aggregate data, just don't ask for any personally identifying information.

      • Re:I don't get it. (Score:2, Insightful)

        by Otter ( 3800 )
        Fair point. One solution might be to perform the randomization on the client side and display the result. That way the user can see that the answers have been munged before they are sent.

        But, again, why would a user bother? People resent being pestered for information. It's minimally more work to lie than to provide accurate information and much more satisfying.

        • Re:I don't get it. (Score:3, Interesting)

          by DennyK ( 308810 )
          Heck, usually it's LESS work to lie. Much easier to select the first or last option in a list than to hunt for the one that applies to you, or say you live in "dkjhgkjhdgs dshkjgdsh, AL" than to actually type your real address. And if they insist on cross-checking your ZIP and state, then what else is there except CA and 90210? ;) (Guess crappy TV shows can have their uses after all... ;) ) I'd love to see a study done about what % of visitors put CA/90210 for a state/ZIP in those places that do the cross-checking. That would give you a damn good idea about how many people lie like hell on those surveys... ;)

          DennyK

          • I work in Marketing and part of my job is occasionally analyzing the results of our eval surveys. I think far more people provide false answers out of laziness than out of deliberate lying.

            I think this because we use this data (in part) to analyze the effectiveness of our magazine ads in order to budget accordingly. For a long time, one of the magazines, beginning with the letter A, had been getting really good results from the eval survey. So we put more money into ads and articles for that magazine, but didn't notice any increase in evaluation downloads.

            Then we changed the order of the magazine dropdown, and suddenly, no one was picking that magazine. (Now we regularly rotate the list.)

            Yes, there is quite a bit of false data -- lots of people who work at Foo or Test -- but I think it's mostly people trying to get to the survey quickly and not people trying to protect their privacy.

    • Re:I don't get it. (Score:2, Interesting)

      by vipw ( 228 )
      I don't give out real data because I don't feel a need to at all.
      I find it takes a lot less time to fill in crap data than real data. What really pisses me off is places that correlate the state you select with the zip code. Places like that seem to be deliberately positioning themselves AGAINST me, so I intentionally fill them with erroneous data because they have become my adversary on that page.

      Filling in webforms doesn't become an issue of trust until I actually need them to have these data; in which case I try to be careful about who I give my credit card number to, but don't care all that much about the rest.

      I think the only reason people give out real data when presented with pointless web forms (ala NYT) is that they are unsure if it will operate properly if they enter the wrong information. I assume a goodly percentage of truthful answers come from a demographic that never intentionally fills erroneous answers into web forms; people who aren't very interested in where limitations exist in these computers that they just happen to use.
      • Re:I don't get it. (Score:3, Insightful)

        by dboyles ( 65512 )
        ...what really pisses me off is places that correlate the state you select with the zip code. Places like that seem to be deliberately positioning themselves AGAINST me, so I intentionally fill them with erroneous data because they have become my adversary on that page.

        You seem to have some sort of problem with this, as if they are somehow tricking you. No, it's just a validity check in an attempt to ensure accurate data. What I find interesting is that they would give you an error and ask you to fill in the form again.

        Let me explain: let's say you've filled out a 10-question form asking for name, email, age, location, and a few "consumer behavior" questions. If you've done all this accurately, it files your data and lets you proceed. But if you've done it inaccurately (in this case, filled out a ZIP/state that don't match), it kicks you back and makes you correct it. So this time you put in a valid ZIP/state. You submit it, and it files your data away and lets you proceed.

        The problem is that your data still isn't accurate, and therefore should be thrown out. Maybe your ZIP/state is correct now, but maybe you just put 90210/CA. A much better solution from a data integrity standpoint is to allow that user to enter junk data, but to not factor in that bad data when drawing conclusions.

        I think there needs to be much more research in this area if anybody expects to get good data out of the internet. IBM's studies seem to be a step in the right direction. Not only do they want to improve data integrity for the company, they're also factoring in another important issue: privacy.
        • I remember once I was watching an eleven year old kid fill out a form for something completely truthfully. When he hit submit, it took him back to the form, complaining that the age he gave was too young (for them to be collecting information on him), and suggesting that he fix it. So of course, he did.

          huh?
        • Re:I don't get it. (Score:3, Informative)

          by pheonix ( 14223 )
          My partner and I run a company that does this for large corporations (a great deal of it in the automotive sector), and here's what we've found.

          Frequently, the people that give input simply misread questions... for example 'How many males over the age of 18 in your household INCLUDING YOU' as opposed to 'NOT COUNTING YOURSELF'. Or they make typos. Error checking can fix that frequently. Saying the whole dataset is bad just because someone mis-keyed their zip isn't correct.

          We've found that the most positive way to get good data is to get people that WANT to tell you their opinions to take the survey. Forcing someone to take the survey for free stuff or to take part in something just doesn't work. Giving them the free stuff then saying "Hey, would you like to give us your opinion" on the other hand, does. The only drawback is that you would assume you're tainting the respondent's opinion. Given the amount of research we've put in, we've actually found the opposite... people say "hey, I've already got my free shit, now I'll tell em how I REALLY feel". I don't see much of a purpose in what IBM has come up with.
  • by jedwards ( 135260 ) on Sunday July 21, 2002 @12:13PM (#3926320) Homepage Journal
    Did you lie when answering this question?

    O Yes
    O No
    O Cowboy Neal told me the answer
    • If logic holds true, then there will be 100% No answers (minus the people who take the Cowboy Neal option). If you're a liar, then you can't click yes because then you wouldn't be lying, so you have to choose no (because you're a liar). If you aren't lying, then you have to choose no.
      • You're going about the problem the wrong way. Don't think about what you "can" and "can't" click. Because we both know this crowd would all click "yes" anyway.

        The fact is, those who are the "liars" must tick yes anyway, because it's the statement that cannot be true, hence it's a lie.

        Then again, anybody who is a liar can also tick NO, since he's lying.
    • by Subcarrier ( 262294 ) on Sunday July 21, 2002 @01:22PM (#3926551)
      Did you lie when answering this question? Yes

      Truth is often the most devious of lies.
      • Truth is often the most devious of lies.

        Truth is just an excuse for a lack of imagination
      • > > Did you lie when answering this question? Yes
        >
        >Truth is often the most devious of lies.

        "If I were to ask you what your answer to the question 'will you lie when answering this question' would be, how would you answer?" ;-)

  • There is a great amount of irony in the fact that we're all reading an article about obtaining accurate information by clicking on a link that will generate false information.

    That's just way too wonderful to put into mere words.
  • by treat ( 84622 ) on Sunday July 21, 2002 @12:14PM (#3926323)
    Do they expect that people will enter real data on the mere promise that it will be stored in some randomized, aggregate, or other form that does not invade their privacy? If the corporation could not be trusted in the first place, no statement they make will make them trustworthy.
  • Missing the Point (Score:3, Insightful)

    by Inexile2002 ( 540368 ) on Sunday July 21, 2002 @12:14PM (#3926324) Homepage Journal
    Sounds all fine and dandy for science, but people are usually honest with a professional researcher who is going to guarantee their anonymity, and moreover the research data is going to be used for something tangible rather than for selling something right back to you.

    Market researchers want information on YOU. They want generic info on your demographics, but this information has been available from other venues for a long time. When spyware and other information-gathering techniques are employed against someone, they are being used to collect data to target marketing at that person specifically. Literally employed against that person.

    As such, I'll still say that I'm female, in my 50's, from Yemen and making less than $12,000 a year. Randomize away.
  • While a lot of people are concerned about their privacy, somehow, I don't think that the fact that they won't be able to tie the answers to you will lead to any more truthful answering.

  • What's the point of living if you can't screw with market research? It's just fun, and my little way of getting some revenge for the countless webpages they cover with annoying advertisements, or the time they steal in between my TV programs.
    • What's the point of living if you can't screw with market research?

      Dr. Ann Cavoukian sounds like she can help you with that too. Maybe that was her plan in the first place. :)
  • I don't see how people would trust this any more than entering it normally.

    Typical session:
    What is your age? (Results will be randomized)
    23

    OK, we're putting down that you are 28 based on a random number we picked. Aren't we good to protect your privacy?

    (Then behind the scenes the database gets the real age put into it, how will the user ever know?)

    Even if the user can view their profile later on, the database can just store their real age + the so-called random modifier, and the user will be none the wiser.

    What a pointless "technology".
    • Re:Of course (Score:1, Insightful)

      by Anonymous Coward
      What a pointless "technology".

      Not at all, not at all. Like 80% of the stuff these days, it exists merely to get some nice paperwork for the students; after that it will be forgotten. Once they have their Masters/Doctorate in an incredibly narrow field, gotten themselves into debt, given money to textbook makers and given jobs to profs, they will have their paper that will get them a nice job, all the while perpetuating the myth of higher education and raising the bar for everyone else.

      Hardly pointless, is it? I mean, it's the only way for a modern society to still use capitalism.
    • by 80N ( 591022 )
      This is how it would work: You have a web page that asks you for your age (see 1 below). On the web-page is a JavaScript function that adds a random modifier. The value you entered is displayed as a non-input field to the right and the value you entered in the input field is replaced by the randomized version (see 2 below).

      1 Age [28] *Will be randomized*
      2 Age [56 (Randomized)] *28*

      The value 56 gets submitted to the server, not the value 28 - which is my real age ;).

      This is auditable because I can inspect the source code which is part of the web-page, and I can even monitor the network packets if I'm really paranoid.

      Now I could still lie, or mess with the algorithms in the Javascript, but what would be the point?
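
      For concreteness, here's a rough sketch of the arithmetic behind such a scheme (in Python rather than the JavaScript the page would actually use, and with an arbitrary uniform +/-30 noise just for illustration -- not IBM's actual algorithm). The individual values are junk, but the noise averages out of the aggregate:

          import random

          def randomize(age, spread=30):
              # Client-side step: add zero-mean noise before the value
              # ever leaves the browser.  (Assumed noise distribution.)
              return age + random.randint(-spread, spread)

          true_ages = [random.randint(18, 65) for _ in range(100_000)]
          noisy = [randomize(a) for a in true_ages]       # what the server sees

          print(sum(true_ages) / len(true_ages))          # true mean, e.g. ~41.5
          print(sum(noisy) / len(noisy))                  # nearly the same value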

      80N

  • This is a very novel scientific idea with very little useful application, IMO. What percentage of the public is going to actually believe that their individual answers are not going to be stored? Or an even better question - what percentage of companies claiming to use this technique will not actually store the data entered by their users?

    -CySurflex

  • So what they're saying is that they've proven that their random number generator isn't really all that random? :p
    • No, what they are saying is that their random number generator is very random. If you have a perfectly random, bell-curve-distributed set of numbers, it makes it easier to reconstruct the original data. Think of an audio signal, for example. If you have a sine wave (your data) mixed with white noise (perfectly random), you can quite easily pick out the sine wave. It's the one frequency that is louder than all the rest. However, if instead of white noise you have noise that is not perfectly random, you will not see a clear sine wave, but several different frequencies.
      • Or, what they are saying is that they used (or assumed) a realistically finite number of data points to try to reconstruct the original distribution. The random noise they add may well be perfectly characterized, and the random number generator perfectly random*, but if they are estimating from 1,000 randomized responses, or 10,000, there is also a predictable, non-zero uncertainty in the result when they try to extract the original distribution.

        However, since the reconstruction error would depend on the number of respondents, which will vary dramatically from site to site, I might also guess the 5% number was rectally extracted, and only used to make a point for the article that it will still be better than the error due to respondents lying, despite not being perfect.
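
        To put rough numbers on that (made-up figures, and just for the mean rather than the whole distribution): if each answer carries added noise of standard deviation sigma_noise, the recovered mean of N responses is uncertain by roughly sqrt(sigma_true^2 + sigma_noise^2)/sqrt(N).

            import math

            # Illustrative values: true ages spread ~12 years; uniform +/-30
            # noise has a standard deviation of about 17.3.
            sigma_true, sigma_noise = 12.0, 17.3
            for n in (1_000, 10_000, 100_000):
                se = math.sqrt(sigma_true**2 + sigma_noise**2) / math.sqrt(n)
                print(n, round(se, 2))      # ~0.67, ~0.21, ~0.07 years

        So the error does shrink like 1/sqrt(N), but how fast depends entirely on how many responses you have.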

        All of this, of course, under the dubious assumption that people will stop lying just because random numbers have been added to their information, as numerous other posts here have discussed...

        *...yea..yea..I know, there's no such thing as a perfect random number generator, but those tests you hear about mathematicians running on RNG algorithms are for the truly anal-retentive who are worried about patterns showing up after the 2^64th repetition or whatever. I doubt that even a relatively low-tech random number algorithm would be taxed by this technique.

  • a few people have posted that you still have to trust the company you're sending the info to to randomize the data for you. It doesn't have to work like this. You could have the program work so that the info is randomized at your end, maybe by having the browser make a call to a "registration" program. It could be open source so that we could be sure of what it's doing, and the company then can't get your real info. (without hacking your box)
    • This is probably the most sensible way of doing this, using, for example, Java or even ActiveX.

      However, it still doesn't fix the problem that people lie. Even if they know that their privacy is guaranteed, they'll still lie, simply because it's fun--after all, rules are made to be broken.
    • Trust is a necessity any way you slice it. If the randomization takes place completely on the client, the numbers probably won't be random enough. If the randomization takes place on their end, and you hit a button client-side to 'roll the dice' until you get a set of numbers you are comfortable with, the combined human psychology of those surveyed can mung the randomness (numbers ending in 7, for example, might be favored over numbers ending in 0 because of our subconscious understanding that numbers ending in 0 are easier to work with mathematically and are therefore 'less safe'). If you only get one set of numbers from the remote randomizer, you don't know that they aren't using an intentionally weak pseudorandom generator that they'll be able to reverse and get all the original results from (or simply giving the same set of numbers to everybody).

      I'm always a bit skeptical when I'm told I'm about to be surveyed anonymously, and I can't think of a way that this can be implemented (or at least is likely to be implemented) that would reassure me. The non-skeptics are filling in their information already. Perhaps businesses could pick one in five to survey and offer the people who don't want to take it the ability to just skip it; I'll bet a good amount of crap in the databases is coming from people who have to fill in eighty mandatory fields for free e-mail or music or whatever.

  • by verbatim ( 18390 ) on Sunday July 21, 2002 @12:30PM (#3926388) Homepage
    I think there is something to be said about companies that ask for information as an option versus companies that ask for information as a requirement.

    For example, company XYZ has released a program called Widget. In order to download Widget, users are asked to fill out a survey so that XYZ may gauge the demographics of their target audience.

    Some sites will allow you to bypass this step and proceed to download the software. Other sites require this information before revealing the download link. I think that the psychological difference between "required" and "optional" would heavily influence the honesty of the answers.

    I know that I never honestly fill out required forms. I'll fill in a bunch of bogus details, get the link, and be on my way. However, if the form is optional, I may download first and, if I like the program, provide some details to the company. The difference? I'm not being forced to give anything up in advance.

    Is this true in general? I don't know. But it makes sense to me.

    I have an idea for something to replace the survey forms - an AI program to carry out a conversation with the user. Ah ha! We just have to watch out for users that say to the AI - "I am lying" - and hope the AI doesn't need therapy.
    • I'm the same way. I've set up about a dozen Juno accounts for friends and family, and every time I "fill out" the form for them, I randomly click one thing from each selection. If it wasn't required, I wouldn't pollute their statistics, but oh well.
    • When a company tells me they aren't going to use the information I give them for anything but demographics research, then asks me for my phone number and address and makes both fields REQUIRED, I consider it safe to assume that company is lying, and don't think it's at all naughty to fib.
      On the other hand, if the company really only requires me to answer questions of demographic importance, such as what country and state/province I am from and my age, I am likely to respond truthfully.
      • You're right - the distance between the information and the user affects the result. That is, I'll tell you my age group, but not my birthdate. I'll tell you my region, but not my street-address.

        Kind of like what dboyles said [slashdot.org]: by allowing the user to skip questions they don't want to answer, the questions they do answer are far more likely to be honest.
    • I know that I never honestly fill out required forms. I'll fill in a bunch of bogus details, get the link, and be on my way. However, if the form is optional, I may download first and, if I like the program, provide some details to the company.

      I agree with your theory, but I want to expound on it a little bit.

      I don't think many people will be inclined to actually return to the site and voluntarily provide information. However, think about the people who would fill out optional forms in the first place. The demographic probably fits that of the casual internet user. That user is much more likely to provide accurate information - but just as importantly, they're unlikely to provide inaccurate information. So by making a form optional, you've seriously improved the integrity of your data.

      Then, a company can look at that (supposedly very good) data and make assumptions about the users. However, they must be careful not to assume that the data is a full picture just because it is not inaccurate (I'm purposely not using the word "accurate"). In other words, if 40% of the respondents indicate that they like Murder She Wrote, you can't assume that that extrapolates to 40% of your user base. Instead, the company must associate that data only with the respondents. But since they have very accurate information about their respondents, they can assume that their conclusions are equally accurate.

      So the question arises, "What about the non-respondents?" That's true, the company doesn't have accurate information about them. But what's better, good information about a small group, or bad information about a large one?
      • So the question arises, "What about the non-respondents?" That's true, the company doesn't have accurate information about them. But what's better, good information about a small group, or bad information about a large one?

        That's so backwards though. There is a difference between a Survey and a Census. Asking every single person that comes to your site what they think is a Census. Yes, that's obviously the best way, but not the most cost effective.

        A Survey is talking to a percentage of your user base, and extrapolating the data. If done properly, you can interview a random group of around 15% of your user base and be statistically 95% accurate. Thus far, it's the best we have, if you discount cheating and poor data collection practices.

        My point is, if you do your survey correctly, if 40% of your respondents indicate that they like Murder She Wrote, it's safe to say that 40% of your user base also does, plus or minus a small percentage. That's the whole point of statistics.

        • My point is, if you do your survey correctly, if 40% of your respondents indicate that they like Murder She Wrote, it's safe to say that 40% of your user base also does, plus or minus a small percentage. That's the whole point of statistics.

          That's assuming that the respondents represent an accurate model of your population. My argument is that that's not the case in optional, online polling. Maybe 40% of the respondents like Murder She Wrote, but maybe 70% of respondents were between the ages of 50 and 70.
          • That would call the quality of the surveying process into question, but not the survey itself. A few writeups we've been looking at indicate that, typically, there is no direct correlation between age or sex and likelihood of accurate survey completion. There is a vague correlation between tech-knowledge and accurate survey completion, but it's not particularly strong, and well within statistical bounds.

            Basically, if they don't get a representative sample within reason, it's due to a poorly administered survey, not the seemingly arbitrary nature of their polling.
    • >> information as an option versus...information as a requirement

      The New York Times thinks I'm a 146 year-old lady who makes less than $10,000 a year, has 3 children in high-school, and enjoys golf and motorsports in her spare time.
  • That's actually an old statistical trick. Adding homogeneous noise to statistical data doesn't actually hurt the accuracy of the final aggregate results. With a little Java button that randomizes the data you've entered in the form (thus before sending the data to the firm), it protects your privacy while still giving useful data to the firm. It's a nice idea, but it sure won't stop some people from faking answers "for fun". I do that sometimes :-)
  • I hope these companies aren't asking users to 'trust' them with their personal information based on the fact that we are supposed to trust them to randomize it.

    Personally, if I don't trust them enough to tell them how much I make, I'm not going to trust them to randomize my results. I don't see how this will increase accuracy -- especially if I keep telling everyone I'm a 108-year-old female in Uganda making $100,000+ per year who works in the sales department of an Educational field and plans to make purchases of an SUV, a house, a console gaming system, and an optical mouse in the next six months and rates their internet experience as very low. My e-mail address is sjobs@mac.com and I would like to apply for your quarterly, monthly, weekly, daily, and hourly newsletters and I do give permission to pass this information to your affiliates.

    • Grandma, is that you? How all of the family has been searching to find you. What a joyful day this is! No one believed you were serious when you, Sjembo Obsowetu, vowed to put an Apple computer in every classroom in our home country, but look at you now.

      Beef jerky?

  • Not only does this not make any sense, the article was really poorly written. There is no way this system will be any more truthful than the lame one we use now. What this does prove is that the morons at college actually believe the crap they write...what a sad concept. It is like actually believing the commercials we see on TV...
  • That's just stupid (Score:4, Insightful)

    by photon317 ( 208409 ) on Sunday July 21, 2002 @12:36PM (#3926407)

    Let me summarize:

    1) People lie on surveys, most likely because they don't trust the taker - but probably also just because they like putting in other answers (yeah, I'm a millionaire, woohoo!, etc). This only addresses the trust issue, ignoring other potential sources of lying.

    2) In order to work around the trust issue, they've developed a method of injecting random noise into the original answers as they are recorded and then extracting useful data in the end.

    Notice their technology doesn't do anything to fix the underlying problem. The hope is that users will understand and trust the backend randomizer system, and that based on this trust they will answer more truthfully.

    Without bothering with all this mumbo-jumbo, I can build a trustworthy system. I simply record survey statistics, and I promise not to use the individuals' personal data individually.

    They can either trust me that I'm telling the truth about this, or they can lie. In the IBM researchers' scenario, the users are again asked to trust that the backend system doesn't compromise them, and again they can choose to trust it or choose to lie.

    Given the above, why on earth would you bother with this research and unnecessary complexity? It's not going to make any difference over just promising your users that you don't invade their privacy. You could replace their research results with a banner on top of the survey that says "After you submit your data to us, we use Magical HibiJibi technology to prevent ourselves from invading your privacy, so please trust us and answer truthfully"

    What a waste of research.
    • Who said the randomization has to happen on the other side? That would be pointless indeed... With a simple Java randomizer CLIENT SIDE, with a little button, you could have total clarity about what you send! (You could still check the Java code if you are doubtful.) Don't take IBM scientists for dumber than they are!

      • Yes, but surveys are targeted at, and only work with, masses of people, the common plural man. From this person's perspective, it doesn't matter that the randomization technically happened on their PC in a Java applet or JavaScript code. In either case they're entering personal data and trusting the company not to abuse them.
    • You seem to have missed the point. This technology assumes that users are going to lie, and mitigates the effects of those lies on the final results of the survey with a minimal loss in "accuracy."
      • No, YOU are missing the point :) It's made so the user actually tells the TRUTH, and then homogeneous noise is applied over that truth, thus protecting their privacy without destroying the "statistical distribution" (if you've ever done some statistics, you'll know what I mean).
        So if people still lie, the accuracy won't change... you can't have good accuracy with "wrong" data...
  • I don't quite understand it yet, but I like it because it somehow arouses my interest in data mining.
  • There are three kinds of lies: Lies, Damn Lies, and Statistics.
  • "Right now, the rate of falsification on Web surveys is extremely high," Dr. Cavoukian said. Conservative estimates are 42 percent, but anecdotally the rates are far higher
    Gee, considering that, the /. polls (even with their prominent disclaimers) seem to have more meaningful results than polls you see on websites, and probably even more than some "scientific" web polls. At least the results usually look right.
  • I heard something similar to this a while back, where researchers were surveying college students with personal questions like "have you had sex?" To make students more inclined to answer truthfully, each student would go into a room alone with the survey and a coin, and for every question they were asked to flip the coin. If the coin came up heads they would answer truthfully, and if it came up tails they would flip the coin again, with heads meaning "true" and tails meaning "false".

    I can't remember where I read this. If someone has a link could you please post it?
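
    For what it's worth, the unbiasing arithmetic for that particular protocol is short. This is just a sketch of the scheme as described above, not anything from the actual study:

        def estimate_true_rate(observed_yes_rate):
            # Heads (prob 1/2): truthful answer.  Tails: a second flip gives
            # "yes" with prob 1/2.  So P(observed yes) = 0.5*p + 0.25.
            return 2 * observed_yes_rate - 0.5

        print(estimate_true_rate(0.40))   # observed 40% "yes" -> ~30% truly yes

    The individual answer sheets stay deniable, but the aggregate rate falls right out.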
  • I agree with lots of folks here that this system works only if you don't have to trust the remote site to apply the obfuscating transformation. Here's a suggestion to make things somewhat more transparent.

    Create a form with attached Javascript. You enter the real data and hit the "obfuscate" button. The script then locally adds noise to your answers. At this point, the "obfuscate" button turns into "submit", allowing you to send the visibly obfuscated responses to the remote site.

    Of course, you'll probably want to read the source to make sure the real answers are not sent along with the obfuscated ones. Still, this scheme would go a ways toward creating the perception of honesty.
  • by WEFUNK ( 471506 ) on Sunday July 21, 2002 @12:48PM (#3926454) Homepage
    Interesting approach, but useless unless people actually understand and trust the system. Making this happen will probably require widespread adoption, an easy-to-understand explanation of the process, and assurances that answers really are randomized. These requirements obviously create a bit of a chicken-and-egg scenario.

    Explaining the whole randomization process (how it protects privacy, how it provides useful info) will be a little much for most people I think, but a good user interface might alleviate this, perhaps with a 'randomize' button that is used before hitting the 'submit' button. This would take the user input and change it right in front of their eyes. Of course many would be rightfully concerned that the randomize button is just for show (or simply encodes but doesn't anonymize), but I think that enough people might buy into the false sense of security that demonstrated 'randomization' provides to at least partly improve the % of bona fide results. Also, the system could be set up so users who don't mind submitting traceable information could be encouraged ("extra 10% off") to submit without randomization, with a simple flag sorting data into randomized/anonymous and non-randomized/non-anonymous data.

    This approach would be even better if the randomization approach becomes a ubiquitous standard backed by a consistent and legally accountable and well-known entity/brand (IBM for instance). I'm not sure how well an open solution would work unless there was a central group assuming responsibility and accountability for the system, enforcing trademarks, and suing spoofers. Also, people feel safer when they feel there's someone to blame for any abuse/mistakes (hence, giving their credit card freely to a waiter but not to a website).
  • Old trick (Score:4, Informative)

    by guanxi ( 216397 ) on Sunday July 21, 2002 @01:17PM (#3926540)
    As another poster observes, if you don't trust them with the data, why trust them to randomize it?

    My college stats professor 10 years ago explained a simpler trick that puts control in the respondent's hands. It went something like this:

    With each question, the respondent flips a coin and looks at the second hand of a clock. Only the respondent can see the coin or the clock.

    If the second hand is between 1-30 seconds, they answer per the coin (e.g. heads=yes). If it's between 31-60, they tell the truth.

    The surveyor knows very precisely the number of 'lies', can extract accurate data, and the respondent has confidence and control over their privacy. All without a transistor.
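
    The unbiasing step for this family of schemes is one line. A sketch (mine, not the professor's notation): if a respondent answers at random with probability q (saying yes with probability r) and truthfully otherwise, then P(observed yes) = q*r + (1-q)*p, which you can solve for the true rate p.

        def debias(observed_yes, q, r):
            # q: chance the answer was coin-determined; r: chance that a
            # coin-determined answer is "yes".  Invert the mixing formula.
            return (observed_yes - q * r) / (1 - q)

        # The coin-and-clock scheme above: q = 0.5 (seconds 1-30), r = 0.5 (heads).
        print(debias(0.40, q=0.5, r=0.5))   # observed 40% yes -> 30% truly yes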
    • Re:Old trick (Score:3, Insightful)

      by cduffy ( 652 )
      The problem with these techniques is that you can't force the user to do it manually (as they won't), and the user can't trust their own computer (running someone else's software) to do it for themselves. That latter objection is the one that has botched any number of theoretically sound online voting systems.

      Useful in theory? Very. Useful in practice? Not so much.
    • Re:Old trick (Score:3, Interesting)

      by AJWM ( 19027 )
      Indeed, very old trick. (For my sins, in my earlier days I used to help PhD psych students run statistical analyses on their survey data.)

      A variation on this is to give the respondent a die (i.e., half a pair of dice), tell them to pick a number between one and six, and every time they roll that number, intentionally give a false answer on the survey. Thus, looking at any individual survey response, you don't know whether it's true or false, but you can factor the 16.7% false responses into the statistical analysis.

      Sure, that can be computerized, but as someone above pointed out, how does the respondent know he can trust it? The above old technique is entirely under the respondent's control.
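
      Factoring out that 16.7% is a one-liner, for what it's worth (a sketch of the die variant described above, with made-up numbers):

          def debias_flipped(observed_yes, flip_prob=1/6):
              # A fraction flip_prob of the answers are deliberately inverted,
              # so P(observed yes) = (1-f)*p + f*(1-p); solve for p.
              return (observed_yes - flip_prob) / (1 - 2 * flip_prob)

          print(round(debias_flipped(0.367), 2))   # ~36.7% observed -> ~30% true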
    • I believe this technique (or the variant above with a single d6) has been used as a standard textbook example in the literature on Bayesian methods in biomedical statistics.

      ISTR it's in Tanner's book on Gibbs sampling, as a method used to extract accurate population estimates about embarrassing, personal or even incriminating subjects, such as past exposure to STDs, sexual orientation, or the use of particular controlled drugs.

      Of course, your survey has to be big enough that the expected number of true positives, N*p, stands out above the expected uncertainty in the number of false positives, approximately sqrt(N*p'*(1-p')). If p is small, N may have to be really quite big.
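
      A quick feel for the numbers (my own illustration, taking the simplest forced-"yes" coin scheme for concreteness: heads means answer yes, tails means answer truthfully, so p' = 0.5 + 0.5*p and the estimator p_hat = 2*p' - 1 has standard error 2*sqrt(p'*(1-p')/N)):

          import math

          p = 0.02                        # a rare trait we hope to measure
          p_prime = 0.5 + 0.5 * p
          for n in (1_000, 10_000, 100_000):
              se = 2 * math.sqrt(p_prime * (1 - p_prime) / n)
              print(n, round(se, 4))      # ~0.0316, ~0.01, ~0.0032

      With 1,000 responses the uncertainty (about 0.03) swamps the 2% trait itself; it takes something like 100,000 responses before the estimate gets sharp.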

    • Another version of this is to have two questions. E.g. if you want to know how many people have shoplifted, you actually give them two questions:

      1) Have you ever shoplifted?
      2) Do you have any siblings? (Or some other innocuous question.)

      You tell the person "Roll a die (or just mentally choose a random number between 1 and 6). If you get a 5 or 6, answer question 1. Otherwise, answer question 2." You can use the fact that there is a 1/3 chance of answering question 1, together with Bayes' Theorem, to figure out the percentage of people who said yes to question 1. People feel more confident about answering honestly, because the experiment is simple enough that most people believe the researcher doesn't know which question they answered (although some people will still be suspicious, of course).

      Note: if you have them mentally choose a number between 1 and 6, you first need to do another experiment to find the percentage of people who choose 5 or 6, since it probably is not 1/3.

      I read a nice little article on this subject a while back called "How to ask sensitive questions without getting punched in the nose"; I believe it was in volume 3 of a series called Modules in Applied Mathematics, but I don't have it handy on my shelf. It's a very well-known example in statistics; I believe it's called a randomized response design.
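
      The estimator for the two-question version is simple too. Here's a sketch (the 80% siblings figure is invented for the example; in practice you'd need the innocuous question's prevalence from elsewhere):

          def estimate_sensitive_rate(observed_yes, p_innocuous, p_ask_sensitive=1/3):
              # P(yes) = (1/3)*p1 + (2/3)*p2, where p2 is the known rate
              # of "yes" to the innocuous question.  Solve for p1.
              return (observed_yes - (1 - p_ask_sensitive) * p_innocuous) / p_ask_sensitive

          # e.g. 60% answered "yes" overall and ~80% of people have siblings:
          print(round(estimate_sensitive_rate(0.60, p_innocuous=0.80), 2))   # ~0.20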
  • by HD Webdev ( 247266 ) on Sunday July 21, 2002 @01:29PM (#3926570) Homepage Journal
    "Judge, I did not know she was 14 years old. I'm pleading innocent by reason of randomized, aggregate data!"
  • by majcher ( 26219 ) <(moc.rehcjam) (ta) (todhsals)> on Sunday July 21, 2002 @01:33PM (#3926583) Homepage
    Hey, it's me. The guy who put together and hosts the New York Times random login generator [majcher.com]. First off, thanks for all your cards and letters - I originally just created that page to save myself some trouble, but I'm glad to see that everyone likes it so much.

    I'd also like to remind everyone that anyone who wants to can download, copy, and mirror the source of that page on their own servers, or even keep it as an HTML page on their desktop or whatever. It's just javascript, so it's portable, and that way you'll still be able to use it when the NYT lawyers finally get around to noticing it, or they start blocking requests from my page, or something. (It will also help distribute my load, though I haven't had any real trouble yet...)
    • It's a good idea... but I don't suppose there's a javascript-free version for those of us that know better than to enable javascript?
  • The kinds of questions that most of these sites ask include stuff that is sometimes impolite even for friends to ask each other, never mind some random business. If they want accurate results, they should let people answer with a "MYOB" option. People are rather unlikely to keep tossing in crap data when they have the "MYOB" option, at least not in the 40% range. There is no way in hell that anyone making 100k+/year would actually admit it and give a business their real e-mail address. They would be begging for a flood of advertisements.
    Why is it that online businesses feel they have the right to try and force so much personal information out of us? In brick 'n mortar stores, the worst info anyone asks me for is my zip code (or age to purchase alcohol). They can get my name if I use my credit card, but I can easily pay cash to avoid that.
    It's very ironic that the NYTimes would run this story.... Why do they expect me to tell them where I live, work, and what I make, just to read their articles? The paper version is nowhere near this invasive.
  • The idea of using randomness to get better survey results is not a new one. In his 1990 book "Innumeracy", John Allen Paulos posits a system for asking a potentially embarrassing yes-or-no question whereby the examiner asks the subject to flip a fair coin before responding. If the subject gets heads he should give the embarrassing answer; tails, he should tell the truth. The idea is that the subject is then spared the trauma of giving the embarrassing answer, since the examiner is not told the result of the coin flip and it is possible the subject just flipped heads. Knowing the "probability distribution" of a fair coin, it can then be assumed that half the respondents gave the embarrassing answer as a result of their coin flip. These can then be removed from the data, leaving a statistically accurate result.

    It seems that what the IBM folks are doing is a straightforward extension of this idea to a larger response domain (numerical ages as opposed to boolean questions) and to a more automated system in which the website flips the coin for the subject and amends his answer accordingly.
  • If the respondents are already randomizing the data, the statistical analysis should be able to produce the same result.

    Or hadn't they thought of that?
  • It was an interesting article, and I can see how this technique will work when the surveyors have the goodwill of the respondents, so that any respondent's primary concern is only that of keeping his individual privacy.

    But is privacy the core issue in market research, or is it simply a label of convenience that a lot people use for something else that we don't have easy words for? I will lie on many surveys even when I am fully confident of my personal anonymity-- though I prefer to avoid those surveys entirely when I can. OTOH, when a survey is done by a group that I have aligned myself with, I might well enthusiastically bare my soul without any regard to the privacy issue. And I know that I am not at all uncommon in these respects.

    I suspect that my reactions stem from the same source as nationalism, patriotism, ethnic pride, and that whole mess of things where I'm not behaving as an individual protecting my privacy, but as a member of a group who feels called upon to defend my group.

    Mostly I see marketing as an attempt by outsiders to mess with my group, to get us to buy stuff through conning us rather than letting us apply our own standards of value to the goods offered. I think I lie on surveys to protect my group from these subtle attacks; to misdirect and confound my group's enemy.

    So I really don't think privacy has much to do with it. I think all this lying is a natural group reaction to consumerism, and its belief that it is perfectly okay to sell product by conning your customers into thinking that what you are pushing today is something they want.

    Not in my group, buster. We don't need no steenkeeng pushers in our neighborhood.

  • People won't trust sites to actually randomize the data. Actually, people probably won't notice that the site is promising to, or take this as a reason to give good results. What they should do is set up a system where the randomization is done by the browser (which people trust), in accordance with a distribution specified by the site and provided to the user.

    That way, the browser tells you that your entry will be randomized to tell the site your age +-30 years, or give your actual gender 20% more frequently. Based on the numbers the site is using, you can decide whether to answer accurately, knowing just how hard it would be to track you based on this information. The web site would then be able to remove the noise from the aggregate data, and have a confidence based on the distribution they ask for (aside from people who think the margin is too small and lie).
  • by dpbsmith ( 263124 ) on Sunday July 21, 2002 @03:00PM (#3926830) Homepage
    As many others have noted, the technique is silly because if you don't trust survey takers in the first place, why would you trust them when they say they are following the IBM randomization technique?

    A couple of years ago, I received a survey in the mail that said the results would be kept completely confidential and anonymous. I thought it was odd that there was a mysterious seven-digit number in one corner, but anyway, I said to heck with it and pitched it. A week later I got a follow-up letter noting that I hadn't sent in my survey yet! Some anonymity!

    Incidentally, this is not the only time I've gotten "anonymous, confidential" surveys with mysterious multi-digit numbers. In at least one case, it was at a big company and the survey involved things that nobody in their right mind would want their bosses to know about... and there were mysterious multi-digit numbers on the forms and, indeed, checking with colleagues confirmed that the numbers were different on each of our forms. Naturally, we all put down safe, inaccurate answers.
  • I think it was first used by Robyn Dawes... anyway, a very similar technique was used in what was a brilliant design for a study on sexual behavior during the perceived height of the AIDS epidemic in this country. In a nutshell, we were faced with an epidemic spread by sexual contact, but did not really know what the base rates for any of the more (or less) dangerous activities were, or if they had ever been tried.

    Asking people right out "Hey, did you have unprotected anal sex on your casual encounter?" was found to be not a particularly good way to elicit truthful answers. So what you do is give people a fair coin (or the equivalent) and have them flip the coin for each question. If the coin lands heads, they answer "yes". If the coin lands tails, they answer *truthfully*. Looking over an answer sheet, you have no idea which "yes" answers are real and which are not, and subjects did feel like nobody really could "get" any personal information off their answer sheet. In the statistical aggregate, however, you could get perfectly useful average rates for a given population. (Basically, you just adjust for the "yes answer background".)

    A great idea, but its use in a wide-range study of this type was axed, I believe, when the study itself was blasted by certain members of congress...but that's another story.
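
    That "yes answer background" adjustment is easy to check by simulation. A quick sketch of the protocol as described above (the 7% rate is an invented number, not anything from the study):

        import random

        true_rate = 0.07                       # fraction who would truthfully say "yes"
        n, yes = 50_000, 0
        for _ in range(n):
            if random.random() < 0.5:          # heads: forced "yes"
                yes += 1
            elif random.random() < true_rate:  # tails: answer truthfully
                yes += 1
        observed = yes / n                     # about 0.5 + 0.5*true_rate
        print(round(2 * observed - 1, 3))      # recovers roughly 0.07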

  • They've missed the point about why their forms are full of bullshit.

    The forms are giant time-wasters.

    If the folks giving these surveys would stick to EXACTLY what they NEED to know, we wouldn't balk at filling them out properly- especially since personal data is one thing they generally do NOT need to know for marketing!

    Forget the name, address, interests (the BIGGEST time waster of all.) Generally, the most important information that you can get from site visitors is:

    1) Zip code. This tells you the geographic area that your visitors are coming from. Useful for location-relevant information, but completely impersonal.

    2) Age range. This is really the prime info that marketers want, as so much of their "science" is based on generational observation. Again, totally impersonal.

    3) How you heard about the site. This is the most important thing you can learn from your visitors, as it gives you some information on which advertisements are performing!

    If every site I signed up to asked me these three questions and these three questions ONLY, I'd answer them all truthfully. As it stands, I have to dig through a mountain of shit, and these days I generally just throw the shovel at the pile and move on.
  • ...which is not unusual on Slashdot - I do it all the time as well.

    The idea of randomising answers is not new. It has been used in 'socially sensitive' surveys for years, if not decades.

    Simple explanation:
    Have a survey of 10 questions people don't like to answer truthfully, each with a yes/no answer.

    For each question, either
    a) reply truthfully
    b) flip a coin and record whatever the coin gives.

    If challenged about your answer, you can always say that's the answer the coin required.

    Analyse the results for a large population of completed surveys. Any significant deviation from 50% yes and 50% no answers tells you which way the population answered, without revealing who actually holds those views.
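
    A rough sketch of that "significant deviation" check (my own illustration; note that because each respondent decides for themselves how often to use the coin, you learn which way the population leans rather than an exact rate):

        import math

        def leans_away_from_half(yes, n, z_threshold=1.96):
            # Two-sided z-test of the observed yes-rate against 50%.
            p_hat = yes / n
            z = (p_hat - 0.5) / math.sqrt(0.25 / n)
            return p_hat, round(z, 2), abs(z) > z_threshold

        print(leans_away_from_half(560, 1000))   # (0.56, 3.79, True): leans "yes"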

    All you need is a coin to randomise your answers. This is independent of any web form, doctored answer sheet etc etc - so particular answers cannot be pinned on you.

    It's fun administering the same survey to people with and without the randomisation - you get to see what people in general lie about!

    Hope this gives a useful summary of the method.

    Regards,

    pgrb

  • I remember reading about something similar to this in a psychology class in 1988 or so. The idea was for people doing a door-to-door survey asking about things like sexual behavior. There are important public health reasons to have the data, but also strong reluctance to give honest answers.

    What they did was give the person being polled a spinner, like from a board game. (Remember those, oh you young /.ers? Maybe not...) It was divided into two parts: 2/3 of it said "yes" and 1/3 said "no". The questioner would ask whether the person's answer to some yes/no question matched what was shown on the spinner (which the questioner couldn't see). You couldn't know what any single person's answer was, but you could do the math and get how many had done what.
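
    "Do the math" works out to another one-line inversion. A sketch of the spinner design as described above (illustrative numbers only):

        def estimate_from_matches(match_rate, spinner_yes=2/3):
            # P(says "it matches") = s*p + (1-s)*(1-p), with s = 2/3 here.
            # Rearranged: p = (match_rate - (1-s)) / (2*s - 1).
            s = spinner_yes
            return (match_rate - (1 - s)) / (2 * s - 1)

        print(round(estimate_from_matches(0.40), 2))   # 40% "matches" -> ~20% truly yes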

  • No company is really going to use this, but a company will claim it does to gain your "trust." Have you ever heard a hardcore marketing goon talk about trust? It's really chilling.

    I used to work for a company whose customers had to provide accurate information in order to sign up -- the service wouldn't work with false info -- but the problem was getting people to sign up.

    One of the main selling points was that customer data was completely secure: no one will ever be able to read your data, only an aggregate report of all our users. The company went to a lot of trouble to make this point convincing, going so far as to suggest that users had legal protections against abuse. There were people in the building who spent all their time trying to think of ways to convince more people to drop their defenses so we could exploit their information -- cold, calculating, 24-7, like WOPR spends all its time playing World War Three.

    I believed their claims until the day I saw a user's sensitive data on an engineer's screen. And then that engineer showed me another user's data, and another. "We've always had the ability to do this," he said, "for, ahem, quality control purposes."

    If a company tells you it isn't collecting the valuable data you provide, you need to assume it is lying (unless you can personally verify the claim or you are positive that the law protects you against abuse).

    Programs like this one could lead to greater truthfulness in the answers people volunteer on the Web, she said, provided that they were willing to replace some of their native caution with a bit of good will toward a company and its need for data-mining.

    "Right now, the rate of falsification on Web surveys is extremely high," Dr. Cavoukian said. Conservative estimates are 42 percent, but anecdotally the rates are far higher, she added. "People are lying," she said, "and vendors don't know what is false and accurate, so the information is useless."

    People are "lying" because corporations lie, as a matter of policy. This will never change because lies are more profitable than truth. Only corporations don't call their behavior "lying," they call it "marketing." So when I fill out an intrusive form with false information, I don't consider it lying either. I call it "standing up for my right to privacy." This system of "marketing" versus "standing up for my rights" is well-balanced, but this new masking technology is simply a marketing attempt to tip the scales in the corporations' favor by tricking consumers into volunteering information on false assumptions.
  • by hysterion ( 231229 ) on Sunday July 21, 2002 @05:00PM (#3927185) Homepage
    Rakesh Agrawal and Ramakrishnan Srikant have devised a data-mining program that would cloak individual truthful answers
    Don't trust these guys. They are (obviously) piping their names through some obfuscation algorithm.
  • Why do companies take these polls to begin with? To make money. Either there is money to be made in interpreting the results or even in providing the results in the first place (see election exit polls). If the pollsters are looking to make a profit off of the information, why not share that profit with the people that gave you the information to begin with?

    Going back to the example of the exit poll, if all you're going to do is try to make money by predicting who will win an election, it's much more satisfying for the voter to lie and watch them squirm when they get it wrong. Why should we tell the truth?
  • (From a mathematical perspective) isn't a random function reversible? The chance of reconstructing the scramble is slim, but we shouldn't rule out the risk. Why don't they use some irreversible function like MD5?
  • I find it quite amazing how people will justify their behavior. This is a good example of the selfishness of people: I want everything my way and if it conflicts with my belief then I should have the right to discard it.

    The company is providing you with a product, often for free, and they request that for you to use their product you give them a little personal information. It is their product, so they get to make the rules. Your choices are to give what they want and take what you want, or you could just live without it. I don't understand the position of taking what you want and not leaving what they want.

    Or you consider this tiny piece of personal information part of the price. Instead of giving them $5, you give them your age, salary, and email address. You don't try to trick the grocery store clerk when you think the bill was too high for what you bought, do you? Why would this be any different? If you don't like the invasion of privacy, then the cost is too high for you and you don't take the product.

    I can see where people may say that capital is a required part of making the product and personal information isn't. Since they don't need my email address, I should feel free to not give it to them. However, this personal information often does translate into capital for them (the goal of business is to make money, most of the time). Besides, that isn't your decision to make. The company wants your private information and they are giving you the product, so their desire carries more weight. If you were not receiving something back in return, then their desires would not override yours.

    That it is only a "little" lie doesn't change the fundamental fact that lying is a priori wrong.

  • ...that we use a Random NY Times Registration Generator to falsify [our] registration details to access an article about ways to persuade people to give correct answers to survey questions?

    Helluva page btw, majcher. Thanks :)
