Over the last fifteen or so years, the belief that “de-identification” of personally identifiable information preserves the anonymity of the individuals it describes has been repeatedly pulled up short by scholars and journalists. It would be difficult to overstate the importance, for privacy law and policy, of the early work of “re-identification scholars,” as I’ll call them. In the mid-1990s, the Massachusetts Group Insurance Commission (GIC) released data on individual hospital visits by state employees in order to aid important research. As Massachusetts Governor Bill Weld assured employees, their data had been “anonymized,” with all obvious identifiers, such as name, address, and Social Security number, removed. But Latanya Sweeney, then an MIT graduate student, wasn’t buying it. When, in 1996, Weld collapsed at a local event and was admitted to the hospital, she set out to show that she could re-identify his GIC entry. For twenty dollars, she purchased the full roll of Cambridge voter-registration records, and by linking the two data sets, which individually were innocuous enough, she did just that. As privacy law scholar Paul Ohm put it, “In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.”
Sweeney's demonstration led to important changes in privacy law, especially under HIPAA. But that demonstration was just the beginning. In 2006, the New York Times was able to re-identify one individual (and only one individual) in a publicly available research dataset of the three-month AOL search history of over 600,000 users. The Times demonstration led to a class-action lawsuit (which settled out of court), an FTC complaint, and soul-searching in Congress. That same year, Netflix began a three-year contest, offering a $1 million prize to whoever could most improve the algorithm by which the company predicts how much a particular user will enjoy a particular movie. To enable the contest, Netflix made publicly available a dataset of the movie ratings of 500,000 of its customers, whose names it replaced with numerical identifiers. In a 2008 paper, Arvind Narayanan, then a graduate student at UT-Austin, along with his advisor, showed that by linking the “anonymized” Netflix prize dataset to the Internet Movie Database (IMDb), in which viewers review movies, often under their own names, many Netflix users could be re-identified, revealing information suggestive of their political preferences and other potentially sensitive details. (Remarkably, notwithstanding the re-identification demonstration, after awarding the prize in 2009 to a team from AT&T, in 2010, Netflix announced plans for a second contest, which it cancelled only after tussling with a class-action lawsuit (again, settled out of court) and the FTC.) Earlier this year, Yaniv Erlich and colleagues, using a novel technique involving surnames and the Y chromosome, re-identified five men who had participated both in a study of Mormon families in Utah and in the 1000 Genomes Project, an international consortium to place, in an open online database, the sequenced genomes of (as it turns out, 2500) “unidentified” people.
Most recently, Sweeney and colleagues re-identified participants in Harvard’s Personal Genome Project (PGP), who are warned of this risk, using the same technique she used to re-identify Weld in 1997. As a scholar of research ethics and regulation, and also a PGP participant, I found that this latest demonstration piqued my interest. Although much has been said about the appropriate legal and policy responses to these demonstrations (my own thoughts are here), there has been very little discussion about the legal and ethical aspects of the demonstrations themselves. As a modest step in filling that gap, I’m pleased to announce an online symposium, to take place the week of May 20th, that will address both the scientific and policy value of these demonstrations and the legal and ethical issues they raise. I’ll cross-post my own contribution here, but the full symposium will be hosted over at Bill of Health. Participants fill diverse stakeholder roles (data holder; data provider, i.e., research participant; re-identification researcher; privacy scholar; research ethicist) and will, I expect, have a range of perspectives on these questions:
I hope readers will join us on May 20.
This promises to be a very interesting event, but in the promotion I would beg that the icon of a cocaine addict who has two abortions be changed. It unintentionally furthers the widespread misperception that abortion is a phenomenon of the addicted or irresponsible. But consider these facts about abortion:
WHO HAS ABORTIONS?
• Non-Hispanic white women account for 36% of abortions, non-Hispanic black women for 30%, Hispanic women for 25% and women of other races for 9%.
• Thirty-seven percent of women obtaining abortions identify as Protestant and 28% as Catholic.
• About 61% of abortions are obtained by women who have one or more children.
• The reasons women give for having an abortion underscore their understanding of the responsibilities of parenthood and family life. Three-fourths of women cite concern for or responsibility to other individuals; three-fourths say they cannot afford a child; three-fourths say that having a baby would interfere with work, school or the ability to care for dependents; and half say they do not want to be a single parent or are having problems with their husband or partner.
Jones RK, Finer LB and Singh S, Characteristics of U.S. Abortion Patients, 2008, New York: Guttmacher Institute, 2010.
Finer LB et al., Reasons U.S. women have abortions: quantitative and qualitative perspectives, Perspectives on Sexual and Reproductive Health, 2005, 37(3):110–118.
See generally: http://www.guttmacher.org/
Posted by: Alta Charo | May 13, 2013 at 02:35 PM
Hi Alta,
Thanks for your comment; I've made the change you suggest. Let me explain where I was coming from, though. The examples in the graphic, which I created, are meant to be taken from actual re-identification demos, and are cherry-picked by re-identification demonstration researchers to show just how extremely sensitive the information at issue can be. So the figures above aren't meant to be representative of any group. The figure, above, who searched for "beauty and the beast disney porn" reflects an actual search term from the infamous AOL User 927, who as far as I know hasn't been re-identified but easily might have been. The "loves Ishtar" is my attempt at humorously referencing the Netflix prize dataset demonstration. Since the PGP demo "re-identified" some who had uploaded their 23andMe data, and since 23andMe returns ApoE genotyping, I included someone who had two copies of ApoE4 (though I don't know whether Sweeney in fact discovered anyone with any copies of ApoE4). And of course, we've had other genetic re-identification demonstrations.
Now to what you're concerned about. The abortion, cocaine and childhood abuse phenotypes were all taken from actual PGP participant profiles that Sweeney re-identified (though I mistakenly recalled sexual rather than physical abuse--now corrected). In her PGP demo paper, Sweeney singles out just three of the many PGP profiles she re-identified as particularly sensitive. What I'll call Participant 1 had one or more abortions and reported "marijuana intoxication," depression, panic disorder, and postpartum blues (among other things). Participant 2 reported "cocaine intoxication," "marijuana intoxication," childhood physical abuse, and liposuction (among other things). Participant 3 reported childhood physical abuse, "cocaine intoxication," depression, and bipolar disorder (among other things). (By the way, a reporter replicated Sweeney's results and contacted all three participants, who, as I understand it, agreed to be named by the reporter, or I wouldn't be calling attention to these profiles here, even though they are already part of the public record via Sweeney's paper.) So in these admittedly cherry-picked cases of re-identification, it turns out that single individuals often are linked to more than one (potentially) sensitive datum. It would have been more accurate to have shown a figure who had had an abortion and then went on to suffer from depression, postpartum blues, and panic disorder -- but I suspect that would have played into equally fraught perceptions about abortion and regret. In any case, to avoid all such objections, I've simply disaggregated all of the phenotypes.
Posted by: Michelle Meyer | May 13, 2013 at 03:31 PM