
Anonymized Data Holds Promise as Well as Questions

Back in September, we posted about the fallacies of "anonymized data" and how several researchers had been able to take information that was supposedly stripped of all identifying information and "reverse engineer" it to match it to its origins.

Last week, the New York Times sounded the alarm once again, pointing to several recent incidents in which anonymized data was anything but.

The first instance was seemingly innocuous: movie selections. When Netflix ran its widely publicized Netflix Prize (for which the winners were recently announced), it released the movie selections of thousands of its users. The teams were then expected to take these selections and their ratings (on a scale of 1 to 5) and create an algorithm that would generate better movie recommendations. The team that could improve Netflix's recommendation engine by 10% would win $1 million.

Pretty harmless, right? But what if you're a recent college grad looking for jobs, and you've recently rented some risqué movies and given them 5-star reviews? Would you want a potential employer to see that?

According to the Times piece, this is a question you should be asking yourself. Two computer scientists at the University of Texas at Austin wanted to test this scenario, so they took the supposedly anonymized data and compared it against ratings that people had volunteered to the Internet Movie Database. Given that many people use their personal emails when signing into the IMDB, the researchers were able to correlate the Netflix ratings with the IMDB ratings for several users.

Though Netflix disputes the results of the experiment, it appears the researchers succeeded in matching up at least some of the users.
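To make the idea of correlating two data sets concrete, here is a minimal Python sketch of a linkage check. It is a toy illustration, not the researchers' actual method: the movie titles, user IDs, names, the `link_profiles` function and the `min_matches` threshold are all made up for this example. It reports a link only when one anonymized record agrees with a public profile on several ratings and clearly beats every other candidate.

```python
def agreement(anon_ratings, public_ratings):
    """Count movies that appear in both profiles with the same score."""
    return sum(
        1
        for movie, score in public_ratings.items()
        if anon_ratings.get(movie) == score
    )


def link_profiles(anonymized, public, min_matches=3):
    """Map each public name to the anonymized ID it most plausibly matches.

    A link is reported only when the best candidate has at least
    `min_matches` agreeing ratings and strictly beats the runner-up.
    """
    links = {}
    for name, public_ratings in public.items():
        scored = sorted(
            (
                (agreement(ratings, public_ratings), anon_id)
                for anon_id, ratings in anonymized.items()
            ),
            reverse=True,
        )
        if not scored:
            continue  # nothing to compare against
        best_score, best_id = scored[0]
        runner_up = scored[1][0] if len(scored) > 1 else 0
        if best_score >= min_matches and best_score > runner_up:
            links[name] = best_id
    return links


# Hypothetical "anonymized" release: random IDs, real ratings.
anonymized = {
    "user_1041": {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    "user_2207": {"Movie A": 3, "Movie B": 2, "Movie E": 5},
    "user_3399": {"Movie C": 4, "Movie D": 5, "Movie F": 3},
}

# Hypothetical public profiles posted under identifiable names.
public = {
    "jane_doe": {"Movie A": 5, "Movie C": 4, "Movie D": 1},
    "john_roe": {"Movie B": 2, "Movie F": 1},
}

links = link_profiles(anonymized, public)
print(links)
```

In this toy data, only one public profile overlaps enough with a single anonymized record to be linked; the other shares too few ratings with any record. The point is that stripping names does not help when the remaining data is distinctive enough to act as a fingerprint.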

The dangers of such reverse engineering become even clearer when you consider medical records.

The clinical information systems market in the United States has sales of $8 billion to $10 billion annually, and about 5 percent of that comes from data and analysis, according to estimates by George Hill, an analyst at Leerink Swann, a health care investment bank.

But by 2020, when a vast majority of American health providers are expected to have electronic health systems, the data mining component alone could generate sales of up to $5 billion, Mr. Hill said. Demand for the data is likely to be robust. Policy makers and hospitals will want to dig into it to analyze physician practices and glean information about patient and public health trends.

As we pointed out in our previous post, back in 1997 the Governor of Massachusetts was identified from his health records in one of these reverse-engineering experiments. Meanwhile, the industry is already selling some of this data to insurance companies and other interested parties.

This news comes as the House unanimously passed a resolution yesterday honoring National Cyber Security Awareness Month. As Rep. Yvette Clarke said in a statement, "we are all interconnected and our national cyber infrastructure is only as strong as the weakest link in the chain." These recent experiments in identifying participants in anonymized data show that there are indeed still several "weak links" in the chain.

Data anonymization techniques

Data anonymization techniques have been the subject of intense investigation in recent years for many kinds of structured data, including tabular, graph and item set data. They aim to enable publication of detailed information, which permits ad hoc queries and analyses, while protecting the privacy of sensitive information in the data against a variety of attacks. Amid all this advancement, there is one thing I'd like to stress: privacy matters for everyone's safety and security. GAR Labs
