Should Netflix have known? (Score 1)
There are a lot of comments about whether she could/should have been outed by the data set or not, but nobody has answered the question of whether Netflix really should have known that the data they published was personally identifiable. The answer is an unqualified 'yes'. People have been working on privacy issues in databases for a very long time (US census in the late 1800s, anyone?), but with the EU's strict privacy regulations, it has 'recently' become an important area of computer science research (see the "Privacy in Statistical Databases" conferences, such as PSD 2008).
The fact is, though, that there are several standard techniques that are well understood, easy to implement, and would have let them release the data without releasing any personally identifiable information. Probably the best fit would be to just lie about the zip codes: take the data and make sure that there are at least n people in each "zip code", merging adjacent codes into one until you get enough to protect the innocent. There's also a lot of research on generating fake records that maintain statistical properties similar to those of the original data set. Both techniques do result in some loss of information, but remember that's a good thing, because that loss is exactly what protects privacy. Besides, if for example I live in Green Bay, I really fail to see how much additional information can be gained by associating my records with my individual zip code within Green Bay instead of grouping everyone together into a single zip for the entire city.
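For the curious, the zip-code merging idea can be sketched in a few lines of Python. This is only a rough illustration, not anything Netflix or the research literature prescribes: the function name `merge_zip_codes` is made up, and "adjacent" is assumed to mean numerically adjacent codes, which is only a crude stand-in for real geographic adjacency.

```python
from collections import Counter

def merge_zip_codes(zips, n):
    """Return a mapping from each zip code to a released label such that
    every label covers at least n records (hypothetical helper).

    Assumption: numerically neighbouring codes are treated as "adjacent".
    If the whole data set has fewer than n records, the single resulting
    group will still be smaller than n.
    """
    counts = Counter(zips)
    groups = []              # each entry is a list of merged zip codes
    current, total = [], 0
    for code in sorted(counts):
        current.append(code)
        total += counts[code]
        if total >= n:       # group is big enough, close it
            groups.append(current)
            current, total = [], 0
    if current:              # leftovers are too small to stand alone
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)

    mapping = {}
    for codes in groups:
        # Label a merged group by its range, a lone code by itself.
        label = codes[0] if len(codes) == 1 else f"{codes[0]}-{codes[-1]}"
        for code in codes:
            mapping[code] = label
    return mapping

# Made-up example records, one zip code per person.
records = ["54301"] * 3 + ["54302"] + ["54303"] * 2 + ["54304"]
mapping = merge_zip_codes(records, n=3)
released = [mapping[z] for z in records]
```

You would publish `released` in place of the real zip codes. The larger you make n, the coarser the groups get, which is the information-for-privacy trade mentioned above.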