Lessons for Location Data Privacy

Summary
Paul Ohm has written the first comprehensive law review article that incorporates an important new subspecialty of computer science, reidentification science, into legal scholarship. His research and findings unearth a tension that shakes a foundational belief about data privacy: data can be either useful or perfectly anonymous, but never both. The excerpt highlights his findings on the failures of anonymization, a long-standing privacy protection technique that has become the bedrock of privacy legal frameworks, laws, and policies. While the author’s audience is regulators and the legal community, his findings are relevant to location data, technology, and application providers who collect, aggregate, and distribute location data, and who seek to develop proactive policies that balance business objectives with privacy protections.


With the permission of Paul Ohm, we are reproducing an excerpt of his original article Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, published in the UCLA Law Review (57 UCLA Law Review 1701 (2010)).

Computer scientists have recently undermined our faith in the privacy-protecting power of anonymization, the name for techniques that protect the privacy of individuals in large databases by deleting information like names and social security numbers. These scientists have demonstrated that they can often “reidentify” or “deanonymize” individuals hidden in anonymized data with astonishing ease. By understanding this research, we realize we have made a mistake, labored beneath a fundamental misunderstanding, which has assured us much less privacy than we have assumed. This mistake pervades nearly every information privacy law, regulation, and debate, yet regulators and legal scholars have paid it scant attention.

Anonymization: The Purging of Personal Information

Anonymization plays a central role in modern data handling, forming the core of standard procedures for storing or disclosing personal information. Anonymization is a process by which information in a database is manipulated to make it difficult to identify data subjects. Database experts have developed scores of different anonymization techniques, which vary in their cost, complexity, ease of use, and robustness. A very common technique is suppression, whereby a data administrator suppresses data by deleting or omitting it entirely. For example, a hospital data administrator tracking prescriptions will suppress the names of patients before sharing data in order to anonymize it.
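For readers who handle data directly, suppression can be pictured as nothing more than dropping the columns that directly identify a person before a dataset leaves the organization. The sketch below is a minimal illustration only; the column names and the pandas-based approach are assumptions, not a description of any particular hospital’s workflow.

```python
# A minimal sketch of suppression: delete direct identifiers before sharing.
# All column names and records here are hypothetical.
import pandas as pd

prescriptions = pd.DataFrame({
    "name": ["Patient One", "Patient Two"],        # direct identifier
    "ssn": ["000-00-0001", "000-00-0002"],          # direct identifier
    "zip": ["02139", "30047"],
    "birth_date": ["1945-07-31", "1950-02-02"],
    "drug": ["drug A", "drug B"],
})

# Suppression: drop the fields that obviously identify the data subject.
shared = prescriptions.drop(columns=["name", "ssn"])
print(shared)
```

Note that the released table still carries ZIP code and birth date, the kind of “non-identifying” attributes whose surprising power the excerpt returns to below.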

Data administrators anonymize to protect the privacy of data subjects when storing or disclosing data. They disclose data to three groups:

THIRD PARTIES: For example, health researchers share patient data with other health researchers, websites sell transaction data to advertisers, and phone companies can be compelled to disclose call logs to law enforcement officials.

THE PUBLIC: Increasingly, administrators do this to engage in what is called crowdsourcing—attempting to harness large groups of volunteer users who can analyze data more efficiently and thoroughly than smaller groups of paid employees.

OTHERS WITHIN THEIR ORGANIZATION: Particularly within large organizations, data collectors may want to protect data subjects’ privacy even from others in the organization. For example, large banks may want to share some data with their marketing departments, but only after anonymizing it to protect customer privacy.

Reidentification: The “Reverse Engineering” of Anonymized Data

The reverse of anonymization is reidentification or deanonymization. Anonymized data is reidentified by linking anonymized records to outside information, hoping to discover the true identity of the data subjects. Advances in reidentification should trigger a sea change in the law because nearly every information privacy law or regulation grants a get-out-of-jail-free card to those who anonymize their data. In the United States, federal privacy statutes carve out exceptions for those who anonymize.

About fifteen years ago, researchers started to chip away at the robust anonymization assumption, the foundation upon which [privacy policies and laws] have been built. Recently, however, they have done more than chip away; they have essentially blown it up, casting serious doubt on the power of anonymization, proving its theoretical limits and establishing what I call the easy reidentification result. This is not to say that all anonymization techniques fail to protect privacy—some techniques are very difficult to reverse—but researchers have learned more than enough already for us to reject anonymization as a privacy-providing panacea.

Examples of Anonymization Failure

A. THE AOL DATA RELEASE. On August 3, 2006, America Online (AOL) announced a new initiative called “AOL Research.” To “embrace the vision of an open research community,” AOL Research publicly posted to a website twenty million search queries for 650,000 users of AOL’s search engine, summarizing three months of activity. Researchers of internet behavior rejoiced to receive this treasure trove of information, the kind of information that is usually treated by search engines as a closely guarded secret.

The euphoria was short-lived, however, as AOL and the rest of the world soon learned that search engine queries are windows to the soul. Before releasing the data to the public, AOL had tried to anonymize it to protect privacy. It suppressed any obviously identifying information such as AOL username and IP address in the released data. In order to preserve the usefulness of the data for research, however, it replaced these identifiers with unique identification numbers that allowed researchers to correlate different searches to individual users.
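The replacement AOL performed is a form of pseudonymization: each username maps to a stable opaque number so that searches by the same user remain linkable. As a hedged illustration of the general idea (the usernames, the sequential-counter scheme, and the code itself are assumptions, not AOL’s actual process), it might look like this:

```python
# Illustrative pseudonymization: replace each username with a stable numeric
# ID so queries by the same user can still be correlated after the name is gone.
import itertools

_counter = itertools.count(1)
_user_ids = {}  # username -> pseudonymous ID


def pseudonymize(username):
    """Return the same opaque ID every time the same username appears."""
    if username not in _user_ids:
        _user_ids[username] = next(_counter)
    return _user_ids[username]


# Hypothetical raw logs: (username, query)
queries = [
    ("user_a", "landscapers in Lilburn, Ga"),
    ("user_a", "numb fingers"),
    ("user_b", "car crash photo"),
]

released = [(pseudonymize(u), q) for u, q in queries]
print(released)
# Every search by "user_a" now carries the same number, which is exactly
# what let bloggers and reporters reassemble individual search histories.
```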

In the days following the release, bloggers pored through the data spotlighting repeatedly the nature and extent of the privacy breach. Thanks to this blogging and subsequent news reporting, certain user identification numbers have become sad little badges of infamy, associated with pitiful or chilling stories. User “No. 3505202 asked about ‘depression and medical leave.’ No. 7268042 typed ‘fear that spouse contemplating cheating.’” User 17556639 searched for “how to kill your wife” followed by a string of searches for things like “pictures of dead people” and “car crash photo.” While most of the blogosphere quickly and roundly condemned AOL, a few bloggers argued that the released data, while titillating, did not violate privacy because nobody had linked actual individuals with their anonymized queries. This argument was quickly silenced by New York Times reporters Michael Barbaro and Tom Zeller, who recognized clues to User 4417749’s identity in queries such as “‘landscapers in Lilburn, Ga,’ several people with the last name Arnold and ‘homes sold in shadow lake subdivision gwinnett county georgia.’” They quickly tracked down Thelma Arnold, a sixty-two-year-old widow from Lilburn, Georgia who acknowledged that she had authored the searches, including some mildly embarrassing queries such as “numb fingers,” “60 single men,” and “dog that urinates on everything.” The fallout was swift and crushing. AOL fired the researcher who released the data and also his supervisor. Chief Technology Officer Maureen Govern resigned. The fledgling AOL Research division has been silenced, and a year after the incident, the group still had no working website.

B. MASSACHUSETTS GROUP INSURANCE COMMISSION—ACCESSIBLE RESEARCH DATA. In Massachusetts, a government agency called the Group Insurance Commission (GIC) purchased health insurance for state employees. At some point in the mid-1990s, GIC decided to release records summarizing every state employee’s hospital visits at no cost to any researcher who requested them. By removing fields containing name, address, social security number, and other “explicit identifiers,” GIC assumed it had protected patient privacy, despite the fact that “nearly one hundred attributes per” patient and hospital visit were still included, including the critical trio of ZIP code, birth date, and sex.

At the time that GIC released the data, William Weld, then–Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then–graduate student Latanya Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of fifty-four thousand residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge—a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date; only three were men, and of the three, only he lived in his ZIP code.
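Sweeney’s attack is, at bottom, a database join on the quasi-identifier trio of ZIP code, birth date, and sex. The sketch below shows that linkage in miniature; the records are entirely invented and the pandas join is an assumed stand-in for her manual matching, not her actual method.

```python
# A toy linkage attack in the spirit of the GIC reidentification: join
# "anonymized" medical records to a named public voter roll on the
# quasi-identifiers ZIP code, birth date, and sex. All records are made up.
import pandas as pd

gic = pd.DataFrame({                      # "anonymized" hospital data
    "zip": ["02138", "02139"],
    "birth_date": ["1940-01-01", "1950-02-02"],
    "sex": ["M", "F"],
    "diagnosis": ["diagnosis A", "diagnosis B"],
})

voters = pd.DataFrame({                   # public voter roll with names
    "name": ["Voter One", "Voter Two"],
    "zip": ["02138", "02139"],
    "birth_date": ["1940-01-01", "1950-02-02"],
    "sex": ["M", "F"],
})

# Wherever the (zip, birth_date, sex) triple is unique in both tables,
# the join re-attaches a name to an "anonymous" medical record.
reidentified = gic.merge(voters, on=["zip", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```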

In a theatrical flourish, Dr. Sweeney sent the governor’s health records (including diagnoses and prescriptions) to his office.

C. THE NETFLIX PRIZE DATA STUDY. On October 2, 2006, about two months after the AOL debacle, Netflix, the “world’s largest online movie rental service,” publicly released one hundred million records revealing how nearly a half-million of its users had rated movies from December 1999 to December 2005. In each record, Netflix disclosed the movie rated, the rating assigned (from one to five stars), and the date of the rating. Like AOL and GIC, Netflix first anonymized the records, removing identifying information like usernames, but assigning a unique user identifier to preserve rating-to-rating continuity.

Thus, researchers could tell that user 1337 had rated Gattaca a 4 on March 3, 2003, and Minority Report a 5 on November 10, 2003. Unlike AOL, Netflix had a specific profit motive for releasing these records.
Netflix thrives by being able to make accurate movie recommendations; if Netflix knows, for example, that people who liked Gattaca will also like The Lives of Others, it can make recommendations that keep its customers coming back to the website. To improve its recommendations, Netflix released the hundred million records to launch what it called the “Netflix Prize,” a prize that took almost three years to claim. The first team that used the data to significantly improve on Netflix’s recommendation algorithm would win one million dollars.

As with the AOL release, researchers have hailed the Netflix Prize data release as a great boon for research, and many have used the competition to refine or develop important statistical theories. Two weeks after the data release, researchers from the University of Texas, Arvind Narayanan and Professor Vitaly Shmatikov, announced that “an attacker who knows only a little bit about an individual subscriber can easily identify this subscriber’s record if it is present in the [Netflix Prize] dataset, or, at the very least, identify a small set of records which include the subscriber’s record.” In other words, it is surprisingly easy to reidentify people in the database and thus discover all of the movies they have rated with only a little outside knowledge about their movie-watching preferences. The resulting research paper is brimming with startling examples of the ease with which someone could reidentify people in the database, and has been celebrated and cited as surprising and novel to computer scientists.

An adversary who knows the precise ratings a person in the database has assigned to six obscure movies, and nothing else, can identify that person 84 percent of the time. An adversary who also knows approximately when (give or take two weeks) a person rated six movies, whether or not they are obscure, can identify that person 99 percent of the time. In fact, knowing when ratings were assigned turns out to be so powerful that an adversary who knows only two movies a user has rated (with the precise ratings and the rating dates, give or take three days) can reidentify 68 percent of the users.
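The mechanics behind these numbers are essentially a similarity search: the adversary scores every pseudonymous record in the released data against the handful of (movie, rating, approximate date) facts already known about the target, and picks the best match. The sketch below is a simplified scoring loop under assumed data structures, not Narayanan and Shmatikov’s actual algorithm, and the records in it are illustrative.

```python
# Simplified sketch of sparse-data matching: given a few known
# (movie, rating, approximate date) facts about a target, score every
# pseudonymous user in the released dataset and return the best match.
from datetime import date


def score(record, aux, date_slack_days=14):
    """Count how many auxiliary facts are consistent with one user's record."""
    hits = 0
    for movie, rating, when in aux:
        if movie in record:
            r, d = record[movie]
            if r == rating and abs((d - when).days) <= date_slack_days:
                hits += 1
    return hits


# Hypothetical released data: user_id -> {movie: (rating, date_rated)}
released = {
    1337: {"Gattaca": (4, date(2003, 3, 3)),
           "Minority Report": (5, date(2003, 11, 10))},
    2001: {"Gattaca": (3, date(2004, 6, 1))},
}

# What the adversary knows about the target from outside sources.
aux = [("Gattaca", 4, date(2003, 3, 10)),
       ("Minority Report", 5, date(2003, 11, 1))]

best = max(released, key=lambda uid: score(released[uid], aux))
print(best)  # -> 1337, the record most consistent with the outside knowledge
```

Because the full rating history of each user is so distinctive, even a crude scoring rule like this one usually singles out a unique record once a few facts line up.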

To summarize, the next time your dinner party host asks you to list your six favorite obscure movies, unless you want everybody at the table to know every movie you have ever rated on Netflix, say nothing at all.

Data Fingerprints

What has startled observers about these new [research] results, however, is that researchers have found data fingerprints in non-PII [personally identifiable information] data, with much greater ease than most would have predicted. It is this element of surprise that has so disrupted the status quo. In the three examples above, researchers realized the surprising uniqueness of:

ZIP codes, birth dates, and sex in the U.S. population;

a person’s search queries; and

the set of movies a person had seen and rated.

These results suggest that maybe everything is PII to one who has access to the right outside information.

Rethinking Personally Identifiable Information

Prior to these studies, nobody would have classified ZIP code, birth date, sex, or movie ratings as PII. As a result, even after these studies, companies have disclosed this kind of information connected to sensitive data in supposedly anonymized databases, with absolute impunity. These studies and others like them sound the death knell for the idea that we protect privacy when we remove PII from our databases. This idea, which has been the central focus of information privacy law for almost forty years, must now yield to something else. But to what?

In search of privacy law’s new organizing principle, we can derive from reidentification science two conclusions of great importance:

→ First, the power of reidentification will create and amplify privacy harms. Reidentification combines datasets that were meant to be kept apart, and in doing so, gains power through accretion: Every successful reidentification, even one that reveals seemingly nonsensitive data like movie ratings, abets future reidentification. Accretive reidentification makes all of our secrets fundamentally easier to discover and reveal. Our enemies will find it easier to connect us to facts that they can use to blackmail, harass, defame, frame, or discriminate against us. Powerful reidentification will draw every one of us closer to what I call our personal “databases of ruin.”
→ Second, regulators can protect privacy in the face of easy reidentification only at great cost. Because the utility and privacy of data are intrinsically connected, no regulation can increase data privacy without also decreasing data utility. No useful database can ever be perfectly anonymous, and as the utility of data increases, the privacy decreases.

Thus, easy, cheap, powerful reidentification will cause significant harm that is difficult to avoid. Faced with these daunting new challenges, regulators must find new ways to measure the risk to privacy in different contexts. They can no longer model privacy risks as a wholly scientific, mathematical exercise, but instead must embrace new models that take messier human factors like motive and trust into account. Sometimes, they may need to resign themselves to a world with less privacy than they would like. But more often, regulators should prevent privacy harm by squeezing and reducing the flow of information in society, even though in doing so they may need to sacrifice, at least a little, important counter values like innovation, free speech, and security.

Conclusion

Privacy experts, primarily lawyers and business executives charged with protecting their companies’ users, clients, and customers, cling to the idea that although anonymization may be weaker than we assumed, it has not failed. They may concede the need to change privacy policies or invest a bit more heavily in technology and expertise in response to the studies cited above, but they hope they need only small tweaks and not overhauls. In the meantime, I predict that computer scientists and talented amateurs will continue to release new examples of powerful reidentification, with each announcement shaking those who still cling to false faiths. As have the past announcements, these future announcements will surprise experts by how cheaply, quickly, and easily supposedly robust anonymization will fall.


Five Factors for Assessing the Risk of Privacy Harm

DATA-HANDLING TECHNIQUES

At the very least, data-handling practices [should be graded] according to whether the risk of reidentification is high, medium, or low.

PRIVATE VERSUS PUBLIC RELEASE

Regulators should scrutinize data releases to the general public much more closely than they do private releases between trusted parties.

QUANTITY

Most privacy laws regulate data quality but not quantity. Laws tend to say nothing about how much data a data administrator may collect, nor how long the administrator can retain it. Yet, in every reidentification study cited, the researchers were aided by the size of the database.

MOTIVE

In many contexts, sensitive data is held only by a small number of actors who lack the motive to reidentify.

TRUST

The flip side of motive is trust. Regulators should try to craft mechanisms for instilling or building upon trust in people or institutions.