Main

Prompted by mounting concerns about the advancing science of reidentification, the National Cancer Institute (NCI) convened a group of experts to explore the scientific, ethical, and human participant dimensions of the identifiability of biospecimens and multiple types of “omic” data. The “omic” research fields are stimulating discovery and progress in medicine by leveraging highly annotated data sets in the generation and evaluation of numerous hypotheses with impact across a wide range of phenotypes. To foster the most productive and efficient use of research data to generate and test new hypotheses not necessarily envisioned at the time of data collection, the NCI endorses widespread sharing of data across different areas of research. This support for broad data sharing is exemplified by the National Institutes of Health (NIH) data-sharing policy for genome-wide association studies and anticipated policy for genomic data sharing, which seek to facilitate broad access by researchers to deidentified data in publicly supported or conducted studies, consistent with research participant consent.1,2

Provocative publications, however, have questioned the conventional wisdom that individuals included in data sets of aggregated molecular information are fully anonymous. These studies indicate that DNA variant data from pooled single-nucleotide polymorphism disease studies,3 data sets of RNA expression levels in tissue samples,4 or other quantitative traits5 can be linked directly to human research participants if a matched sample is available. Moreover, some investigators suggest that human beings can be uniquely identified from just 30–80 statistically independent single-nucleotide polymorphisms.6 Furthermore, a recent study demonstrated the ability to reidentify individuals without the availability of a matching sample. Researchers identified individual research participants using publicly available whole-genome research data to calculate short tandem repeats on the Y chromosome in order to query publicly accessible genealogy databases and infer surnames. The genomic research data were linked to metadata, which enabled the investigators to deduce personal identities.7
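To give a rough sense of why only a few dozen independent markers can suffice, the following minimal sketch estimates how many single-nucleotide polymorphisms are needed before a coincidental genotype match between unrelated people becomes unlikely. It does not reproduce the analysis in the cited work. It assumes biallelic, statistically independent markers in Hardy-Weinberg equilibrium, a single minor allele frequency shared by all markers, and a reference population of 7 billion; the function names and these parameters are illustrative assumptions rather than values taken from the studies discussed above.

```python
def genotype_match_probability(maf: float) -> float:
    """Chance that two unrelated people share a genotype at one biallelic SNP.

    Assumes Hardy-Weinberg equilibrium, so the three genotype frequencies
    are p^2, 2pq, and q^2 (an illustrative simplification).
    """
    p, q = 1.0 - maf, maf
    genotype_freqs = (p * p, 2.0 * p * q, q * q)
    return sum(f * f for f in genotype_freqs)


def expected_coincidental_matches(n_snps: int, maf: float, population: int) -> float:
    """Expected number of other people matching one profile at every SNP.

    Assumes the SNPs are statistically independent, so per-SNP match
    probabilities multiply.
    """
    return (population - 1) * genotype_match_probability(maf) ** n_snps


def snps_needed(maf: float, population: int, target: float = 1.0) -> int:
    """Smallest SNP count for which expected coincidental matches fall below target."""
    n = 0
    while expected_coincidental_matches(n, maf, population) >= target:
        n += 1
    return n


if __name__ == "__main__":
    WORLD_POPULATION = 7_000_000_000  # assumed round figure, not from the source
    for maf in (0.5, 0.3, 0.1):
        print(f"MAF {maf}: {snps_needed(maf, WORLD_POPULATION)} SNPs")
```

Under these assumptions, highly informative markers (minor allele frequency near 0.5) require roughly two dozen SNPs, whereas less informative markers (minor allele frequency near 0.1) require about 60. Real panels would need additional margin for genotyping error, relatedness, and population structure, none of which this sketch models.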

Evolving recognition of the increasing challenges for deidentification of human tissue and associated data inspired a federal regulatory proposal by the US Department of Health and Human Services (DHHS) to consider categorizing all biomedical research involving biospecimens, including collection, storage, and secondary analysis of existing tissue, as research involving identifiable data.8 The DHHS sought public comment on potential advantages and disadvantages of considering biospecimens and associated data inherently identifiable. The DHHS also solicited response as to whether specific data security protections, including encryption and periodic retrospective random audits, should extend to research using certain types of genomic data, such as genome-wide single-nucleotide polymorphism analyses or whole-genome sequences.

Ultimately, the ability to protect research participants from any unauthorized identification of their biospecimens or “omic” data is an issue of public trust that the research community must address forthrightly and transparently. The NCI must be proactive in contemplating the extent to which evolving technologies might undermine the protection of data that is currently deemed “deidentified” under the law and regulations. We wished to consider whether the burgeoning volume of “omic” data generated in biomedical research, coupled with evolving linkage capabilities posing potential informational risks to research participants and genetically linked populations, warranted a reexamination of informed consent disclosures or data-sharing policies. This article reflects a summary of the discussion from the NCI meeting and is not intended to represent an opinion of the NCI, the National Human Genome Research Institute, the NIH, or the DHHS.

Setting

On 11–12 June 2012, we gathered a group of leaders in the research community with varied relevant backgrounds (Supplementary Appendix S1 online) in order to convene a think tank on the science and policy of reidentifying human biospecimens and “omic” data. Our aim was to attempt to outline considerations for the protection of “omic” data and for the distribution to researchers of stored biospecimens and associated data within appropriate ethical and regulatory parameters. The agenda involved an initial day of presentations by leaders in medical research, bioinformatics, ethical and regulatory policy, consumer genetic testing, and patient advocacy. The presentations discussed current challenges posed by biotechnological advances and their effect on participant privacy and perceptions about data sharing. Moderator-led discussion followed each presentation, enabling think tank participants to share experiences and identify challenges and opportunities for empirical research. On the second day, participants convened in one of four preassigned breakout sessions (Supplementary Appendix S2 online) composed of individuals with varying expertise, for in-depth discussion of specified subtopics.

Major Findings

State of the science

The purpose of day 1 of the think tank was to report on the state of the science from a variety of expert perspectives and, specifically, to address current challenges surrounding the identifiability of “omic” data and biospecimens. There were a total of nine state-of-the-science presentations, including two keynotes. These talks provided background in multiple fields of expertise, including cancer genomics and proteomics, bioinformatics and statistics, legal and regulatory guidance, and industry applications of gene technologies. A brief description and highlights of the presentations are provided (Table 1).

Table 1 Highlights of individual state-of-the-science presentations

Breakout sessions

Day 2 was devoted to breakout sessions with groups focused on specific questions posed by the think tank organizers (Supplementary Appendix S2). These deliberations framed suggestions for next steps, designed to move the science forward, while helping to address policy concerns and develop best practices around identifiability. Outcomes of the breakout group sessions are highlighted in Table 2.

Table 2 Outcomes of breakout group discussions

Two broad themes emerged from think tank discussions about the identifiability of human tissue and data. The first theme concerned the issue of whether additional institutional or legal measures were necessary to address either negligent or intentional violations of research participant privacy. The second theme concerned the extent to which greater understanding and expansion of data-sharing options might provide an opportunity to increase the engagement of participants in the research enterprise.

With regard to the first theme, some experts argued that current governance strategies such as honest broker coding models and irreversible deidentification systems, although not immune to mishap, function well in many institutional biorepositories. According to this view, although it has indeed been demonstrated that with targeted effort, a researcher may infer specific information about an individual from aggregated data (such as cohort participation in a case or control group), this should not condemn current deidentification approaches and best practices. Although often highly publicized, incidents of reidentification from trace amounts of individual data within aggregated data sets are usually isolated and thus need not warrant rewriting of research consent disclosures about privacy, or constriction of researcher access to data. Although misuse of genetic information by researchers is demonstrably feasible, it constitutes such an extreme breach of medical and professional ethics as to best be handled by targeted professional measures rather than by broad policy strokes. For example, wrongdoers might potentially be banned from participation in federally funded research studies under procedures for handling scientific misconduct9 under federal regulations. These procedures contemplate researcher debarment for practices seriously deviating from those that are commonly accepted within the scientific community for proposing, conducting, or reporting research. Fines and compensation to individuals whose data have been improperly accessed or misused could provide additional deterrents to abuse.10 The research engine as a whole, however, should not be hindered by the remote possibility of an errant investigator reidentifying individuals without authorization. Several discussants suggested that discrimination or other harm caused by unauthorized reidentification of research participants seems largely to be a theoretical problem at this time. Moreover, in order to identify an individual in a pooled data set, a separate reference sample or at least some independent data about the individual is still generally required.

The second theme that emerged at the think tank takes a different, although not truly contradictory, approach to concerns regarding the risk of reidentification. This view is focused on the concept of balancing privacy risk with the potential benefits of broad data sharing. There is developing consensus that anonymity is more difficult to guarantee in the genomic era.11 Several meeting participants suggested that all “omic” data are theoretically identifiable. Moreover, the problem of reidentification is not unique to biomedical research, and existing public databases make reidentification of individuals easier.12 Given our waning ability to guarantee complete deidentification of data, some experts felt that medical researchers need to improve communication with participants regarding the remote potential for data security breaches and counterbalance the risks by conveying the potential societal and individual benefits of data sharing. Such benefits include the medical utility of participant contributions of data and the ability to return clinically relevant incidental findings or other research results under appropriate circumstances.

Discussion

As “omic” research advances and data are shared more broadly, communication of potential reidentification risks for participants is a challenge. Greater transparency is needed when informing participants about the limitations of data privacy, and efforts should be made to more clearly convey the complexity of this issue. Although consent forms routinely disclose that researchers cannot guarantee privacy, it is not clear whether research participants understand the rationale behind this limitation. In a study that examined comprehension of consent documents, participants scored poorly on a question evaluating the understanding of confidentiality issues.13 If communication improves, participants may become more willing to embrace sharing of their submitted data. Some meeting participants suggested that this is an opportunity to advance recruitment in medical research. A recent European study suggests that participants may be less concerned about the risks of data sharing and more interested in potential positive outcomes of such sharing.14

Much discussion at the think tank focused on the utility and empowerment of data sharing and its potential to stimulate collaborative engagement with research participants. Individuals facing illness or chronic disease may visit and join Internet groups such as CancerCompass and PatientsLikeMe, where they can share medical outcomes and learn about the symptoms or treatment response of others. The growth of such electronic communication spaces indicates the importance attached by many to the benefits of sharing their personal medical data.

As society becomes more accustomed to sharing and receiving information through social media and mobile communications, people who commonly use these resources might be more willing to release their genomic information to researchers who have engendered their trust. Others who are less comfortable with, or have no access to, such commercial technologies may be less inclined to disclose their personal information broadly. Nonetheless, the increasing popularity of social media and mobile technologies may underscore a cultural shift toward greater openness about personal information extending beyond early adopters.

In a randomized trial of consent options for data sharing in genome research by McGuire et al.,15 86.2% of research participants chose to release their genetic and clinical information in either open-access (through the Internet) or scientific (restricted for medical research only) databases, even after a debriefing and opportunity to change their data release option. A follow-up study showed that when deciding whether protecting privacy or advancing research is more important, participants who somewhat or highly trusted researchers predominantly chose advancing research as having greater priority.16 Participants in these studies were recruited by physicians at hospitals where they or their family members were receiving treatment, which may have influenced willingness to share data. The studies nonetheless suggest that research participants are willing to accept some privacy risk with their individual genomic data when they trust those conducting the research and perceive that their information will benefit others.

Discussions of individual privacy risk in medical research trigger additional ethical challenges relating to risks for family members and genetically similar populations. If a privacy breach results in the disclosure of information pertaining to an inherited mutation, for example, that information may harm not only the individual research participant but also other genetically related people. Under such circumstances, an individual’s choice to participate in research may result in group discrimination or stigmatization. A research participant belonging to a relatively homogeneous genetic population who permits broad sharing of individual research data may enable the discovery of incidental findings that impact the entire population. The severity of consequences for public trust and future research recruitment may, in such cases, be quite high even though the statistical risk of reidentification of the individual research participant is quite low.

NIH data-sharing policy1,2 embraces the importance of data sharing as a tool for research progress, but we are only just beginning to explore how research participants understand and balance the risks and benefits of broad “omic” data sharing for translational medical research. Although extending researchers’ access to “omic” data imposes a degree of individual and group privacy risks, some research participants might prefer the informational benefits of data sharing, particularly when offered the return of clinically actionable research results similar to the proposed collaborative informed cohort model.17 Others might prefer more conventional deidentification options. We would anticipate that when data-sharing strategies are clearly communicated during the consent process, and research participants understand the potential risks and benefits of available data-sharing options, they are likely to be more comfortable releasing their data for broad future research uses.

The NCI think tank concerning the identifiability of biospecimens and “omic” data not only motivated dialogue across a range of expert perspectives but also identified opportunities for agency guidance, policy development, and empirical research. As an initial next step, more thought must be given to use of the term “identifiability” in research documents, as well as to its application in regulations and guidance. Identifiability is often conflated with the term “privacy” in consent forms, although the meaning of these terms is not identical. Privacy is a multifaceted concept with roots in constitutional law, encompassing broadly the right to be free from unwarranted invasion of personal liberties and the ability to make personal decisions affecting marriage, contraception, and other intimate matters. Identifiability, on the other hand, refers simply to the ability to determine unique facts about somebody or to figure out who he or she is. Consent forms often speak to the ability to protect privacy, but in reality, investigators can only seek to protect identifiability.

Moreover, the term “identifiability” itself has multiple meanings. In the research context, the term refers to the technical capability to resolve individuals through linkage to matched samples or independent data. In the regulatory context, however, identifiability refers quite differently to whether specific investigators may readily ascertain a research participant’s identity or gain access to specified identifiers. Therefore, “deidentified” data in the regulatory context may be scientifically identifiable, causing confusion when the term is used. Institutions that engage in “omic” research, including biorepositories and academic medical centers, should examine how their data protection policies reflect the different meanings of “identifiability” and ensure clarity in their policies and consent documents. Institutional guidelines should be developed to assist investigators and institutional review boards in determining whether specific “omic” data are or are not identifiable. Such guidelines should illustrate the connections among the various definitions, laws, and policies related to privacy and identifiability. Beyond institutional policy, clarification regarding identifiability at a broader level can be implemented through the research industry, NIH policy measures, endorsements from professional organizations, and changes in regulations such as the Health Insurance Portability and Accountability Act and the DHHS Common Rule.

Additional important steps include the development of strategies for greater transparency in communicating data-sharing options and empirical investigation of participant preferences in balancing the risks and benefits of various data-sharing models. The public seems to support broad research uses of biospecimens and collected medical data,18 but the extent to which people understand the potential risks and benefits of different tiers of data sharing (i.e., controlled access versus public access) is not clear. To improve scientific literacy about biospecimen donation and possible modes of data sharing, a model suggested for consideration during the think tank discussion is that of organ donation. Organ donation programs provide education and community outreach, with the goal of enabling individuals to make informed decisions, documented on their driver’s licenses, about donating organs for transplant. Similarly, we could adopt a community-based model to disclose the potential risks and benefits of donating biospecimens and associated data for medical research and allow individuals to document their choices on driver’s licenses. Perhaps the most widely shared observation among think tank participants was the determination that additional empirical research about the perceived and actual risks and benefits of “omic” data sharing is needed. In particular, we should explore the underpinnings of information altruism (i.e., desire of participants to assist medical research by sharing their personal data broadly) and its relative impact across disease and population communities. Developing research collaborations with private corporations that provide personal genomic information can aid in moving the science forward at a more rapid pace, given that many of these companies have research arms and have expressed an interest in collaborating with academic institutions and the NIH. Such collaborations would need to proceed cautiously, however, in order to prevent untoward research focus on populations that have both the resources and the inclination to pay for personal genetic information. In addition to research into the relative risks versus benefits of data sharing, further statistical analysis of the probability that someone could be reidentified via his or her “omic” data is needed. This analysis should take into account fiduciary responsibilities and institutional relationships at various stages, including the consent process, storage and stewardship of data, and methods for addressing potential downstream misuse (although at present, misuse has not been documented as a widespread problem).

Another think tank suggestion was to conduct a pilot study of participants with a completely open-access data model and multiple types of “omic” or phenotypic data. Such a project could investigate the willingness of individuals to share data under this model, examine real-world consequences of truly open-access data, assess the feasibility of enrolling participants using such a model, and describe the experiences of individuals and empirical evidence about whether data would be shared more rapidly. Results from such suggested research could guide the development of best practices for research institutions and biorepositories and inform research consent disclosures about informational risk.

In conclusion, the NCI think tank concerning the identifiability of biospecimens and “omic” data identified two main themes regarding how the research community should manage the difficulty of ensuring deidentification and the remote risk that future reidentification will occur and cause harm. The first theme concerned whether additional institutional or legal measures are necessary to address a violation of research participant privacy. The second concerned the extent to which greater understanding and expansion of data-sharing options might provide an opportunity to increase engagement of participants in the research enterprise. The NCI think tank also identified several opportunities for improved guidance, policy development, and empirical research. The challenge of deciphering different uses of the term “identifiability” was discussed, suggesting that more should be done to clarify the meaning for both researchers and study participants. The ability to communicate the changing reality of reidentification risks is fundamental to the development of best practices for seeking the consent of research participants in biospecimen collection and storage protocols. In addition, we must gather more empirical data about people’s data-sharing preferences, to facilitate increased secondary research uses of their biospecimens and data and thereby hasten the pace of medical progress. Addressing these priorities should help the research community and policy makers in developing a measured approach to the issues of reidentification in the coming era of high-throughput population-based genomic research and should better define the balance between patient protections and advancing science.19

Disclosure

The authors declare no conflict of interest.