Routes for breaching and protecting genetic privacy

Erlich, Yaniv; Narayanan, Arvind

doi:10.1038/nrg3723

Review Article
Published: 08 May 2014

Yaniv Erlich¹ & Arvind Narayanan²

Nature Reviews Genetics volume 15, pages 409–421 (2014)

An Erratum to this article was published on 17 June 2014

This article has been updated

Key Points

Privacy-breaching techniques can work by cross-referencing two or more pieces of information to gain new, potentially harmful, knowledge about individuals or their families. Broadly speaking, the main routes to breach privacy are identity tracing, attribute disclosure attacks using DNA (ADAD) and completion of sensitive DNA information.
Identity tracing exploits quasi-identifiers in the DNA data or metadata to uncover the identity of an unknown genetic data set. ADAD links the identity of a known person to a sensitive phenotype using DNA-derived data. Completion techniques also work on known DNA data and aim to uncover sensitive genomic areas that were masked to protect the participant.
In the past few years, the range of techniques and tools to carry out privacy breaching attacks has expanded. Although most of these techniques are currently beyond the reach of the general public, they can be implemented by trained persons with varying degrees of effort and success.
There is considerable debate regarding risk management. Some support a pragmatic, ad-hoc approach of privacy by obscurity, whereas others support a systematic, mathematical approach of privacy by design. Privacy-by-design algorithms include access control, differential privacy and cryptographic techniques.
So far, data custodians of genetic databases have primarily adopted access control as a mitigation strategy. New developments in cryptographic methods may usher in additional 'security-by-design' techniques.
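
As a rough illustration of one privacy-by-design technique named in the key points, differential privacy, the sketch below releases a single count with Laplace noise. It is a minimal example under stated assumptions rather than the method of any particular study: the function names `laplace_noise` and `private_count` and the count of 137 carriers are hypothetical, and the query is assumed to be a simple count, so that adding or removing one participant changes the answer by at most 1.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two independent exponential variables, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy (assumed sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # Smaller epsilon means a larger noise scale and stronger privacy.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical example: number of cases carrying a given risk allele.
rng = random.Random(0)
carriers = 137
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(private_count(carriers, epsilon, rng), 2))
```

The trade-off is visible in the loop: a small epsilon hides any one participant's contribution well but can distort the count, whereas a large epsilon returns nearly the true value and offers little protection.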

Abstract

We are entering an era of ubiquitous genetic information for research, clinical care and personal curiosity. Sharing these data sets is vital for progress in biomedical research. However, a growing concern is the ability to protect the genetic privacy of the data originators. Here, we present an overview of genetic privacy breaching strategies. We outline the principles of each technique, indicate the underlying assumptions, and assess their technological complexity and maturation. We then review potential mitigation methods for privacy-preserving dissemination of sensitive data and highlight different cases that are relevant to genetic applications.

**Figure 1: An integrative map of genetic privacy breaching techniques.**

**Figure 2: A possible route for identity tracing.**

Change history

17 June 2014
In this article, an incorrect citation was given in reference 107. The citation should have been: Ayday, E., Raisaro, J. L., McLaren, P. J., Fellay, J. & Hubaux, J.-P. Privacy-preserving computation of disease risk by using genomic, clinical, and environmental data. Proc. USENIX Security Workshop Health Inf. Technol. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.309.1513 (2013). This has now been corrected online. The editors apologize for this error.

References

Fu, W. et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2013).
1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
Roberts, J. P. Million veterans sequenced. Nature Biotech. 31, 470–470 (2013).
Drmanac, R. Medicine. The ultimate genetic test. Science 336, 1110–1112 (2012).
Burn, J. Should we sequence everyone's genome? Yes. BMJ 346, f3133 (2013).
Kaye, J., Heeney, C., Hawkins, N., de Vries, J. & Boddington, P. Data sharing in genomics — re-shaping scientific practice. Nature Rev. Genet. 10, 331–335 (2009).
Park, J. H. et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nature Genet. 42, 570–575 (2010).
Chatterjee, N. et al. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nature Genet. 45, 400–405 (2013).
Friend, S. H. & Norman, T. C. Metcalfe's law and the biology information commons. Nature Biotech. 31, 297–303 (2013).
Rodriguez, L. L., Brooks, L. D., Greenberg, J. H. & Green, E. D. The complexities of genomic identifiability. Science 339, 275–276 (2013).
Institute of Medicine (US) Roundtable on Value & Science-Driven Health Care. Clinical Data as the Basic Staple of Health Learning: Creating and Protecting a Public Good: Workshop Summary (National Academies Press (US), 2010).
McGuire, A. L. et al. To share or not to share: a randomized trial of consent for data sharing in genome research. Genet. Med. 13, 948–955 (2011).
Oliver, J. M. et al. Balancing the risks and benefits of genomic data sharing: genome research participants' perspectives. Publ. Health Genom. 15, 106–114 (2012).
Careless.data. Nature 507, 7 (2014).
Schwartz, P. M. & Solove, D. J. Reconciling personal information in the United States and European Union. 102 California Law Rev. http://dx.doi.org/10.2139/ssrn.2271442 (2013).
El Emam, K. Heuristics for de-identifying health data. IEEE Secur. Priv. 6, 58–61 (2008).
Lunshof, J. E., Chadwick, R., Vorhaus, D. B. & Church, G. M. From genetic privacy to open consent. Nature Rev. Genet. 9, 406–411 (2008).
Brenner, S. E. Be prepared for the big genome leak. Nature 498, 139 (2013).
McClure, S., Scambray, J. & Kurtz, G. Hacking Exposed 7: Network Security Secrets and Solutions (McGraw Hill, 2012).
Solove, D. J. A taxonomy of privacy. Univ. Pennsylvania Law Rev. 154, 477 (2006). This work organizes various concepts of privacy violations from a legal perspective.
Ohm, P. Broken promises of privacy: responding to the surprising failure of anonymization. UCLA Law Rev. 57, 1701 (2010).
Golle, P. Revisiting the uniqueness of simple demographics in the US population. Proc. 5th ACM Workshop Privacy in Electron. Soc. 77–80 (2006).
Sweeney, L. A. Simple Demographics Often Identify People Uniquely. Carnegie Mellon Univ. Data Privacy Working Paper 3 (2000).
Sweeney, L. Testimony of Latanya Sweeney before the Privacy and Integrity Advisory Committee of the Department of Homeland Security. US Homeland Security [online], (2005).
Sweeney, L. A., Abu, A. & Winn, J. Identifying participants in the personal genome project by name. Data Privacy Lab [online], (2013). This study shows identity tracing of PGP participants using metadata and side-channel techniques.
Code of Federal Regulations Title 45 Section 164.514 (US Federal Register, 2002).
Benitez, K. & Malin, B. Evaluating re-identification risks with respect to the HIPAA Privacy Rule. J. Am. Med. Informat. Associ. 17, 169–177 (2010).
Kwok, P., Davern, M., Hair, E. & Lafky, D. Harder Than You Think: a Case Study of Re-identification Risk of HIPAA-Compliant Records. NORC at The University of Chicago Abstract 302255 (2011).
Bennett, R. L. et al. Recommendations for standardized human pedigree nomenclature. Pedigree standardization task force of the national society of genetic counselors. Am. J. Hum. Genet. 56, 745–752 (1995).
Malin, B. Re-identification of familial database records. AMIA Annu. Symp. Proc. 2006, 524–528 (2006).
Israel v. N. Bilik and others 24441-05-12 [online], (in Hebrew) (2013).
Khan, R. & Mittelman, D. Rumors of the death of consumer genomics are greatly exaggerated. Genome Biol. 14, 139 (2013).
Gitschier, J. Inferential genotyping of Y chromosomes in Latter-Day Saints founders and comparison to Utah samples in the HapMap project. Am. J. Hum. Genet. 84, 251–258 (2009).
Gymrek, M., McGuire, A. L., Golan, D., Halperin, E. & Erlich, Y. Identifying personal genomes by surname inference. Science 339, 321–324 (2013). This paper reports end-to-end identity tracing of anonymous research participants from DNA information and Internet searches, and a risk assessment for the US population.
King, T. E. & Jobling, M. A. What's in a name? Y chromosomes, surnames and the genetic genealogy revolution. Trends Genet. 25, 351–360 (2009).
King, T. E. & Jobling, M. A. Founders, drift, and infidelity: the relationship between Y chromosome diversity and patrilineal surnames. Mol. Biol. Evol. 26, 1093–1102 (2009).
Motluk, A. Anonymous sperm donor traced on internet. New Scientist 2 (3 Nov 2005). This article discusses the first public case of identity tracing using genealogical triangulation.
Stein, R. Found on the web, with DNA: a boy's father. Washington Post A09 (13 Nov 2005).
Naik, G. Family secrets: an adopted man's 26-year quest for his father. The Wall Street Journal (2 May 2009).
Lehmann-Haupt, R. Are sperm donors really anonymous anymore? Slate (1 Mar 2010).
Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012).
China News Network. Ministry of Public Security statistics: “King” into the most common surname in China has 9288 million. Eastday [online], (in Chinese) (2007).
Huff, C. D. et al. Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Res. 21, 768–774 (2011).
Henn, B. M. et al. Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples. PLoS ONE 7, e34267 (2012).
Lowrance, W. W. & Collins, F. S. Identifiability in genomic research. Science 317, 600–602 (2007).
Kayser, M. & de Knijff, P. Improving human forensics through advances in genetics, genomics and molecular biology. Nature Rev. Genet. 12, 179–192 (2011). This is a comprehensive review of methods to predict phenotypes from DNA information.
Silventoinen, K. et al. Heritability of adult body height: a comparative study of twin cohorts in eight countries. Twin Res. 6, 399–408 (2003).
Kohn, L. A. P. The role of genetics in craniofacial morphology and growth. Annu. Rev. Anthropol. 20, 261–278 (1991).
Zubakov, D. et al. Estimating human age from T-cell DNA rearrangements. Curr. Biol. 20, R970–R971 (2010).
Ou, X. L. et al. Predicting human age with bloodstains by sjTREC quantification. PLoS ONE 7, e42412 (2012).
Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832–838 (2010).
Manning, A. K. et al. A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nature Genet. 44, 659–669 (2012).
Liu, F. et al. A genome-wide association study identifies five loci influencing facial morphology in Europeans. PLoS Genet. 8, e1002932 (2012).
Walsh, S. et al. IrisPlex: a sensitive DNA tool for accurate prediction of blue and brown eye colour in the absence of ancestry information. Forens. Sci. Int. Genet. 5, 170–180 (2011).
Byers, S. Information leakage caused by hidden data in published documents. IEEE Secur. Priv. 2, 23–27 (2004).
Kaufman, S., Rosset, S. & Perlich, C. Leakage in data mining: formulation, detection, and avoidance. Proc. 17th ACM SIGKDD Int. Conf. Knowledge Discov. Data Mining 556–563 (2011).
Acquisti, A. & Gross, R. Predicting Social Security numbers from public data. Proc. Natl Acad. Sci. USA 106, 10975–10980 (2009).
Noumeir, R., Lemay, A. & Lina, J. M. Pseudonymization of radiology data for research purposes. J. Digital Imag. 20, 284–295 (2007).
Pakstis, A. J. et al. SNPs for a universal individual identification panel. Hum. Genet. 127, 315–324 (2010).
Lin, Z., Owen, A. B. & Altman, R. B. Genomic research and human subject privacy. Science 305, 183 (2004).
Mailman, M. D. et al. The NCBI dbGaP database of genotypes and phenotypes. Nature Genet. 39, 1181–1186 (2007).
Homer, N. et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 4, e1000167 (2008). This is the first study to show an ADAD from summary statistic data.
Jacobs, K. B. et al. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nature Genet. 41, 1253–1257 (2009).
Visscher, P. M. & Hill, W. G. The limits of individual identification from sample allele frequencies: theory and statistical analysis. PLoS Genet. 5, e1000628 (2009).
Sankararaman, S., Obozinski, G., Jordan, M. I. & Halperin, E. Genomic privacy and limits of individual detection in a pool. Nature Genet. 41, 965–967 (2009). References 64–65 provide excellent mathematical analyses of ADAD using allele frequency data.
Wang, R., Li, Y. F., Wang, X., Tang, H. & Zhou, X. Learning your identity and disease from research papers: information leaks in genome wide association study. Proc. 16th ACM Conf. Comput. Commun. Security 534–544 (2009).
Im, H. K., Gamazon, E. R., Nicolae, D. L. & Cox, N. J. On sharing quantitative trait, GWAS results in an era of multiple-omics data and the limits of genomic privacy. Am. J. Hum. Genet. 90, 591–598 (2012).
Lumley, T. Potential for revealing individual-level information in genome-wide association studies. JAMA 303, 659 (2010).
Zerhouni, E. A. & Nabel, E. G. Protecting aggregate genomic data. Science 322, 44 (2008).
Johnson, A. D., Leslie, R. & O'Donnell, C. J. Temporal trends in results availability from genome-wide association studies. PLoS Genet. 7, e1002269 (2011).
Gilbert, N. Researchers criticize genetic data restrictions. Nature http://dx.doi.org/10.1038/news.2008.1083 (2008).
Malin, B., Karp, D. & Scheuermann, R. H. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J. Investig. Med. 58, 11–18 (2010).
Clayton, D. On inferring presence of an individual in a mixture: a Bayesian approach. Biostatistics 11, 661–673 (2010).
Report on the workshop on establishing a central resource of data from genome sequencing projects. National Human Genome Research Institute [online], (2012).
Schadt, E. E., Woo, S. & Hao, K. Bayesian method to predict individual SNP genotypes from gene expression data. Nature Genet. 44, 603–608 (2012).
Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nature Rev. Genet. 11, 499–511 (2010).
Nyholt, D. R., Yu, C. E. & Visscher, P. M. On Jim Watson's APOE status: genetic information is hard to hide. Eur. J. Hum. Genet. 17, 147–149 (2009). This study clearly shows the limited use of masking sensitive DNA areas.
Humbert, M., Ayday, E., Hubaux, J.-P. & Telenti, A. Addressing the concerns of the Lacks family: quantification of kin genomic privacy. Proc. 2013 ACM SIGSAC Conf. Comput. Commun. Secur. 1141–1152 (2013).
Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nature Genet. 40, 1068–1075 (2008).
Kaiser, J. Agency nixes deCODE's new data-mining plan. Science 340, 1388–1389 (2013).
Bambauer, J. R. Tragedy of the data commons. Harvard J. Law Technol. http://dx.doi.org/10.2139/ssrn.1789749 (2011).
Hartzog, W. & Stutzman, F. The case for online obscurity. Calif. Law Rev. 101, 1 (2013).
Taleb, N. N. The Black Swan: the Impact of the Highly Improbable (Random House, 2007).
Shannon, C. Communication theory of secrecy systems. Bell System Techn. J. 28, 656–715 (1949).
Cavoukian, A. Privacy by design. Information and Privacy Commissioner, Ontario, Canada [online], (2009).
Tryka, K. A. et al. NCBI's database of genotypes and phenotypes: dbGaP. Nucleic Acids Res. 42, D975–D979 (2014).
Ramos, E. M. et al. A mechanism for controlled access to GWAS data: experience of the GAIN Data Access Committee. Am. J. Hum. Genet. 92, 479–488 (2013).
Church, G. et al. Public access to genome-wide data: five views on balancing research with privacy and protection. PLoS Genet. 5, e1000665 (2009).
Agrawal, R., Kiernan, J., Srikant, R. & Xu, Y. Hippocratic databases. Proc. 28th Int. Conf. Very Large Databases 143–154 (2002).
Agrawal, R. et al. Auditing compliance with a hippocratic database. Proc. 30th Int. Conf. Very Large Databases 516–527 (2004).
Venter, H. S., Olivier, M. S. & Eloff, J. H. PIDS: a privacy intrusion detection system. Internet Res. 14, 360–365 (2004).
Creating a global alliance to enable responsible sharing of genomic and clinical data. [online], (2013).
Bafna, V. et al. Abstractions for genomics. Commun. ACM 56, 83–93 (2013).
Terry, S. F. & Terry, P. F. Power to the people: participant ownership of clinical trial data. Sci. Transl Med. 3, 69cm3 (2011).
Kaye, J. et al. From patients to partners: participant-centric initiatives in biomedical research. Nature Rev. Genet. 13, 371–376 (2012).
Sweeney, L. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzz. 10, 557–570 (2002).
El Emam, K. & Dankar, F. K. Protecting privacy using k-anonymity. J. Am. Med. Informat. Associ. 15, 627–637 (2008).
Malin, B. A. Protecting genomic sequence anonymity with generalization lattices. Methods Inform. Med. 44, 687–692 (2005).
Machanavajjhala, A., Kifer, D., Gehrke, J. & Venkitasubramaniam, M. L-diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1, 3 (2007).
Li, N., Li, T. & Venkatasubramanian, S. t-closeness: privacy beyond k-anonymity and L-diversity. IEEE 23rd Int. Conf. Data Eng. 106–115 (2007).
Dwork, C. Differential privacy. Automata, Languages and Programming 1–12 (Springer Verlag, 2006).
Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J. & Vilhuber, L. Privacy: theory meets practice on the map. IEEE 24th Int. Conf. Data Eng. 277–286 (2008).
Uhler, C., Slavkovic, A. B. & Fienberg, S. E. Privacy-preserving data sharing for genome-wide association studies. arXiv 1205.0739 (2012).
Yu, F., Fienberg, S. E., Slavkovic, A. & Uhler, C. Scalable privacy-preserving data sharing methodology for genome-wide association studies. arXiv 1401.5193 (2014).
Johnson, A. & Shmatikov, V. Privacy-preserving data exploration in genome-wide association studies. Proc. 19th ACM SIGKDD Int. Conf. Knowledge Discov. Data Mining 1079–1087 (2013).
Ayday, E., Raisaro, J. L. & Hubaux, J. P. Privacy-enhancing technologies for medical tests using genomic data. Ecole Polytechnique Federale de Lausanne [online], (2013).
Ayday, E., Raisaro, J. L., McLaren, P. J., Fellay, J. & Hubaux, J.-P. Privacy-preserving computation of disease risk by using genomic, clinical, and environmental data. Proc. USENIX Security Workshop Health Inf. Technol. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.309.1513 (2013). This pioneering work shows the use of homomorphic encryption for privacy-preserving genetic risk predictions.
Atallah, M. J., Kerschbaum, F. & Du, W. Secure and private sequence comparisons. Proc. 2003 ACM Workshop Privacy in Electron. Soc. 39–44 (2003).
Jha, S., Kruger, L. & Shmatikov, V. Towards practical privacy for genomic computation. IEEE Symp. Security and Privacy 216–230 (2008).
Chen, Y., Peng, B., Wang, X. & Tang, H. Large-scale privacy-preserving mapping of human genomic sequences on hybrid clouds. Proc. 19th Annu. Netw. Distributed Syst. Security Symp. (2013). The paper presents an interesting concept of privacy-preserving alignment of high-throughput sequencing data that allows the use of untrusted cloud providers.
Yao, A. C.-C. Protocols for secure computations. 23rd Annu. Symp. Found. Comput. Sci. 160–164 (1982).
Li, H. & Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 11, 473–483 (2010).
Bohannon, P., Jakobsson, M. & Srikwan, S. in Public Key Cryptography (eds Imai, H. & Zheng, Y.) 373–390 (Springer, 2000).
Bruekers, F., Katzenbeisser, S., Kursawe, K. & Tuyls, P. Privacy-preserving matching of DNA profiles. Cryptology ePrint Archive 2008, 203 (2008).
Baldi, P., Baronio, R., De Cristofaro, E., Gasti, P. & Tsudik, G. Countering GATTACA: efficient and secure testing of fully-sequenced human genomes. Proc. 18th ACM Conf. Comput. Commun. Security 691–702 (2011).
De Cristofaro, E., Faber, S., Gasti, P. & Tsudik, G. Genodroid: are privacy-preserving genomic tests ready for prime time? Proc. 2012 ACM Workshop Privacy in Electron. Soc. 97–108 (2012).
He, D. et al. Identifying genetic relatives without compromising privacy. Genome Res. 24, 664–672 (2014).
Kantarcioglu, M., Jiang, W., Liu, Y. & Malin, B. A cryptographic approach to securely share and query genomic sequences. IEEE Trans. Inf. Technol. Biomed. 12, 606–617 (2008).
Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886–893 (2013).
Canim, M., Kantarcioglu, M. & Malin, B. Secure management of biomedical data with cryptographic hardware. IEEE Trans. Inf. Technol. Biomed. 16, 166–175 (2012).
Narayanan, A. What happened to the crypto dream? IEEE Secur. Priv. 11, 75–76 (2013).
Ayday, E., De Cristofaro, E., Hubaux, J.-P. & Tsudik, G. The chills and thrills of whole genome sequencing. Computer http://doi.ieeecomputersociety.org/10.1109/MC.2013.333 (2013). This is a good overview of cryptographic work for protecting genetic data and of open questions in the area.
Presidential Commission for the Study of Bioethical Issues. Privacy and Progress in Whole Genome Sequencing (2012).
Craig, D. W. et al. Assessing and managing risk when sharing aggregate genetic variant data. Nature Rev. Genet. 12, 730–736 (2011).
Braun, R., Rowe, W., Schaefer, C., Zhang, J. & Buetow, K. Needles in the haystack: identifying individuals present in pooled genomic data. PLoS Genet. 5, e1000668 (2009). This is a critical assessment of the performance of ADAD with allele frequency data.
Kendler, K. S., Gallagher, T. J., Abelson, J. M. & Kessler, R. C. Lifetime prevalence, demographic risk factors, and diagnostic validity of nonaffective psychosis as assessed in a US community sample: the National Comorbidity Survey. Arch. Gen. Psychiatry 53, 1022–1031 (1996).
Lee, J. & Clifton, C. in Information Security 325–340 (Springer, 2011).
Hsu, J. et al. Differential privacy: an economic method for choosing epsilon. arXiv 1402.3329 (2014).
Dwork, C., McSherry, F., Nissim, K. & Smith, A. in Theory of Cryptography 265–284 (Springer, 2006).
Paillier, P. in Advances in Cryptology — EUROCRYPT '99 (ed. Stern, J.) 223–238 (Springer, 1999).
Hill, W. G., Goddard, M. E. & Visscher, P. M. Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet. 4, e1000008 (2008).
Gentry, C. Fully homomorphic encryption using ideal lattices. Proc. 41st Annu. ACM Symp. Theory of Comput. 169–178 (2009).
Wang, R., Li, Y. F., Wang, X. F., Tang, H. & Zhou, X. Learning your identity and disease from research papers: information leaks in genome wide association study. Proc. 16th ACM Conf. Comput. Commun. Security 534–544 (2009).

Acknowledgements

Y.E. is an Andria and Paul Heafy Family Fellow and holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. This study was supported in part by a US National Human Genome Research Institute grant R21HG006167, and by a gift from C. Stone and J. Stone. The authors thank D. Zielinski and M. Gymrek for comments.

Author information

Authors and Affiliations

Whitehead Institute for Biomedical Research, Nine Cambridge Center, Cambridge, 02142, Massachusetts, USA
Yaniv Erlich
Department of Computer Science, Princeton University, 35 Olden Street, Princeton, 08540, New Jersey, USA
Arvind Narayanan

Corresponding author

Correspondence to Yaniv Erlich.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary information S1 (figure)

Differential privacy statistic of an association study. (PDF 1266 kb)

Glossary

Safe Harbor: A standard in the US Health Insurance Portability and Accountability Act (HIPAA) rule for de-identification of protected health information by removing 18 types of quasi-identifiers.
Haplotypes: Sets of alleles along the same chromosome.
Cryptographic hashing: A procedure that yields a fixed-length output from any size of input in a way that makes it hard to determine the input from the output.
Dictionary attacks: Approaches to reverse cryptographic hashing by scanning only highly probable inputs.
Alice: A common generic name in computer security to denote party A.
Bob: A common generic name in computer security to denote party B.
Type 1 error: The probability of obtaining a positive answer from a negative item.
Linkage equilibrium: Absence of correlation between the alleles at two loci.
Power: The probability of obtaining a positive answer for a positive item.
Specificity: The probability of obtaining a negative answer for a negative item.
Linkage disequilibrium (LD): The correlation between alleles at two loci.
Effect sizes: The contributions of alleles to the values of particular traits.
Positive predictive value: The probability that a positive answer belongs to a true positive.
Expression quantitative trait locus (eQTL): A genetic variant associated with variability in gene expression.
Genotype imputation: A class of statistical techniques to predict a genotype from information on surrounding genotypes.
Application programming interface (API): A set of commands that specify the interface with a data set or software applications.
χ²-statistic: A measure of association in case–control genome-wide association studies.
Read mapping: A computationally intensive step in the analysis of high-throughput sequencing to find the location of a short DNA sequence (string) in the genome.
Edit distance: The total number of insertions, deletions and substitutions between two strings.
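
Three of the glossary entries above (cryptographic hashing, dictionary attacks and edit distance) can be made concrete with a short sketch. The example is illustrative only: SHA-256 is one possible hash function chosen here as an assumption, the function names and the surname list are hypothetical, and the edit distance is computed as the minimum number of insertions, deletions and substitutions (the Levenshtein distance).

```python
import hashlib
from typing import Iterable, Optional


def hash_identifier(value: str) -> str:
    """Return the SHA-256 digest of an identifier as a fixed-length hex string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def dictionary_attack(target_digest: str, candidates: Iterable[str]) -> Optional[str]:
    """Hash only highly probable inputs and return the one that matches, if any."""
    for candidate in candidates:
        if hash_identifier(candidate) == target_digest:
            return candidate
    return None


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


# Hypothetical pseudonymized record: a surname replaced by its hash.
pseudonym = hash_identifier("smith")
common_surnames = ["johnson", "williams", "smith", "brown"]
print(dictionary_attack(pseudonym, common_surnames))
print(edit_distance("GATTACA", "GATTTACA"))
```

The dictionary attack succeeds here because the space of plausible inputs is small; hashing an identifier drawn from a short, guessable list offers little protection on its own.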

About this article

Cite this article

Erlich, Y., Narayanan, A. Routes for breaching and protecting genetic privacy. Nat Rev Genet 15, 409–421 (2014). https://doi.org/10.1038/nrg3723

Published: 08 May 2014
Issue Date: June 2014
DOI: https://doi.org/10.1038/nrg3723
