How anonymous is anonymized data?
In 2007, amid growing privacy concerns, Google announced it would begin "anonymizing" its server log data once that data is 18 to 24 months old. This meant masking both the IP addresses and the cookies attached to its users' search queries. Server log data is immensely valuable to companies like Google because it lets them analyze trends as they try to improve their search algorithms. Being able to view the search habits of millions of people -- especially if you know their location -- can yield insight into how users frame their search queries and what they're really looking for. Understandably, privacy advocates have expressed concern about all this data, since Google is essentially getting a glimpse into the private lives of all these people; it's not uncommon for a person to search for information about a medical condition or other sensitive personal matters. That is why Google decided to anonymize the data.
Writing about the decision, Search Engine Land's Danny Sullivan said, "By doing this, it will make it difficult, probably impossible, to trace any particular query back to a particular computer, much less a person that used that computer."
But is this true? Not necessarily. Before Google had even made that announcement, researchers were able to take "anonymous" data released by AOL and use it to trace the identity of at least one person. And earlier this week, Ars Technica published a damning piece making a strong argument that the cloak of anonymity is even thinner than you might think:
In AOL's case, the problem was that user IDs were scrubbed but were replaced with a number that uniquely identified each user. This seemed like a good idea at the time, since it allowed researchers using the data to see the complete list of a person's search queries, but it also created problems; those complete lists of search queries were so thorough that individuals could be tracked down simply based on what they had searched for. As Ohm notes, this illustrates a central reality of data collection: "data can either be useful or perfectly anonymous but never both."
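To make the AOL problem concrete, here is a minimal Python sketch of that kind of scrubbing. It is an illustration under stated assumptions, not AOL's actual process: the screen names, queries, and function names are all hypothetical. Each user ID is swapped for an arbitrary number, yet because the same user always gets the same number, every query that person made can still be gathered into a single profile.

```python
from collections import defaultdict
from itertools import count


def pseudonymize(log):
    """Replace each user ID with an arbitrary but consistent number."""
    next_id = count(1)
    mapping = {}  # real user ID -> replacement number
    scrubbed = []
    for user, query in log:
        if user not in mapping:
            mapping[user] = next(next_id)
        scrubbed.append((mapping[user], query))
    return scrubbed


def histories(scrubbed):
    """Group queries by replacement number, as anyone holding the data could."""
    grouped = defaultdict(list)
    for pseudonym, query in scrubbed:
        grouped[pseudonym].append(query)
    return dict(grouped)


# Hypothetical log entries, invented for illustration only.
raw_log = [
    ("alice_screenname", "landscapers in springfield"),
    ("bob_screenname", "weather tomorrow"),
    ("alice_screenname", "people with last name example"),
    ("alice_screenname", "arthritis symptoms"),
]

scrubbed = pseudonymize(raw_log)
profiles = histories(scrubbed)

# The screen names are gone, but user 1's full search history is intact.
for pseudonym, queries in profiles.items():
    print(pseudonym, queries)
```

The names never appear in the scrubbed output, but a profile that combines a town, a surname, and a medical concern may be enough to single out one person, which is the risk the passage above describes.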
This creates a tricky conundrum. There's no question that this data is valuable and can be put to good use -- both by companies and for the public good -- but that value brings inherent risks.
"Because most data privacy laws focus on restricting personally identifiable information (PII), most data privacy laws need to be rethought," Ars Technica concludes. "And there won't be any magic bullet; the measures that are taken will increase privacy or reduce the utility of data, but there will be no way to guarantee maximal usefulness and maximal privacy at the same time."
Today, the Washington Post published an article quoting Google economist Hal Varian, who used Google Trends data to determine that, based on search queries, there are clear signs of our economy improving. Without this mass of data, he never would have been able to reach that conclusion. Obviously, there is real value in what he and others are able to do, but it's important to keep the risks in mind as large companies like Google and AOL move forward with mining their user data.