Review Article

Routes for breaching and protecting genetic privacy

Nature Reviews Genetics volume 15, pages 409–421 (2014)

  • An Erratum to this article was published on 17 June 2014

This article has been updated

Abstract

We are entering an era of ubiquitous genetic information for research, clinical care and personal curiosity. Sharing these data sets is vital for progress in biomedical research. However, a growing concern is the ability to protect the genetic privacy of the data originators. Here, we present an overview of genetic privacy breaching strategies. We outline the principles of each technique, indicate the underlying assumptions, and assess their technological complexity and maturation. We then review potential mitigation methods for privacy-preserving dissemination of sensitive data and highlight different cases that are relevant to genetic applications.

Key points

  • Privacy breaching techniques can work by cross-referencing two or more pieces of information to gain new, potentially harmful, knowledge about individuals or their families. Broadly speaking, the main routes to breach privacy are identity tracing, attribute disclosure attacks using DNA (ADAD) and completion of sensitive DNA information.

  • Identity tracing exploits quasi-identifiers in the DNA data or metadata to uncover the identity of an unknown genetic data set. ADAD links the identity of a known person to a sensitive phenotype using DNA-derived data. Completion techniques also work on known DNA data and aim to uncover sensitive genomic areas that were masked to protect the participant.

  • In the past few years, the range of techniques and tools to carry out privacy breaching attacks has expanded. Although most of these techniques are currently beyond the reach of the general public, they can be implemented by trained persons with varying degrees of effort and success.

  • There is considerable debate regarding risk management. Some support a pragmatic, ad-hoc approach of privacy by obscurity, whereas others support a systematic, mathematical approach of privacy by design. Privacy-by-design algorithms include access control, differential privacy and cryptographic techniques.

  • So far, data custodians of genetic databases have primarily adopted access control as a mitigation strategy. New developments in cryptographic methods may usher in additional 'security-by-design' techniques.
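
Of the privacy-by-design approaches named in the key points, differential privacy is the simplest to illustrate in a few lines. The following is a minimal sketch and not a method taken from this article: it assumes a counting query with sensitivity 1 and adds Laplace noise with scale 1/epsilon. The function names `laplace_noise` and `private_count`, and the example count, are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with Laplace noise (query sensitivity assumed to be 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(2014)
    # Hypothetical query: how many cohort members carry a given allele?
    for epsilon in (0.1, 1.0, 10.0):
        print(epsilon, round(private_count(100, epsilon, rng), 2))
```

Smaller values of epsilon add more noise and so give stronger protection at the cost of accuracy; choosing epsilon is a policy decision that this sketch does not address.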

Change history

  • 17 June 2014

    In this article, an incorrect citation was given in reference 107. The citation should have been: Ayday, E., Raisaro, J. L., McLaren, P. J., Fellay, J. & Hubaux, J.-P. Privacy-preserving computation of disease risk by using genomic, clinical, and environmental data. Proc. USENIX Security Workshop Health Inf. Technol. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.309.1513 (2013). This has now been corrected online. The editors apologize for this error.

Acknowledgements

Y.E. is an Andria and Paul Heafy Family Fellow and holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. This study was supported in part by a US National Human Genome Research Institute grant R21HG006167, and by a gift from C. Stone and J. Stone. The authors thank D. Zielinski and M. Gymrek for comments.

Author information

Affiliations

  1. Whitehead Institute for Biomedical Research, Nine Cambridge Center, Cambridge, Massachusetts 02142, USA.

    • Yaniv Erlich
  2. Department of Computer Science, Princeton University, 35 Olden Street, Princeton, New Jersey 08540, USA.

    • Arvind Narayanan

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Yaniv Erlich.

Supplementary information

PDF files

  1. Supplementary information S1 (figure)

    Differential privacy statistic of an association study.

Glossary

Safe Harbor

A standard in the US Health Insurance Portability and Accountability Act (HIPAA) rule for de-identification of protected health information by removing 18 types of quasi-identifiers.
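
Quasi-identifiers matter because a combination of individually innocuous fields can single out one person when cross-referenced with an identified data set. The toy sketch below illustrates that idea only; the records, field names and the `link_records` helper are invented for illustration and do not come from any real study.

```python
from collections import defaultdict


def link_records(deidentified, identified, keys):
    """Map de-identified record ids to names.

    A record is linked only when its combination of quasi-identifiers
    (`keys`) matches exactly one row of the identified data set.
    """
    index = defaultdict(list)
    for row in identified:
        index[tuple(row[k] for k in keys)].append(row["name"])
    matches = {}
    for row in deidentified:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # unique match: the identity is traced
            matches[row["id"]] = candidates[0]
    return matches


# Hypothetical data, invented for illustration.
study = [
    {"id": "S1", "zip": "02142", "birth": "1970-03-14", "sex": "F"},
    {"id": "S2", "zip": "08540", "birth": "1985-11-02", "sex": "M"},
]
roster = [
    {"name": "Alice Example", "zip": "02142", "birth": "1970-03-14", "sex": "F"},
    {"name": "Bob Example", "zip": "08540", "birth": "1985-11-02", "sex": "M"},
    {"name": "Carl Example", "zip": "08540", "birth": "1985-11-02", "sex": "M"},
]

print(link_records(study, roster, ["zip", "birth", "sex"]))
# prints {'S1': 'Alice Example'}
```

Record S2 is not linked because two roster entries share its quasi-identifiers, which is the effect that removing or coarsening such fields aims to produce for every record.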

Haplotypes

Sets of alleles along the same chromosome.

Cryptographic hashing

A procedure that yields a fixed-length output from an input of any size, in a way that makes it hard to determine the input from the output.

Dictionary attacks

Approaches to reverse cryptographic hashing by scanning only highly probable inputs.
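
The two preceding entries can be illustrated together. The sketch below uses SHA-256 from Python's standard `hashlib` module as one example of a cryptographic hash, and assumes a hypothetical scenario in which a surname was pseudonymized by unsalted hashing; the helper names and the candidate list are made up for the example.

```python
import hashlib


def sha256_hex(text):
    """Hash a string to a fixed-length digest (64 hexadecimal characters)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dictionary_attack(target_digest, candidates):
    """Return the candidate whose hash equals target_digest, or None."""
    for candidate in candidates:
        if sha256_hex(candidate) == target_digest:
            return candidate
    return None


# Hypothetical pseudonym: an unsalted hash of a surname.
pseudonym = sha256_hex("smith")
common_surnames = ["johnson", "williams", "smith", "brown"]
print(dictionary_attack(pseudonym, common_surnames))  # prints "smith"
```

The attack never inverts the hash function itself; it succeeds because the space of plausible inputs is small enough to enumerate.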

Alice

A common generic name in computer security to denote party A.

Bob

A common generic name in computer security to denote party B.

Type 1 error

The probability of obtaining a positive answer for a negative item.

Linkage equilibrium

Absence of correlation between the alleles at two loci.

Power

The probability of obtaining a positive answer for a positive item.

Specificity

The probability of obtaining a negative answer for a negative item.

Linkage disequilibrium

(LD). The correlation between alleles at two loci.
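
As a rough illustration of linkage equilibrium versus linkage disequilibrium, the sketch below computes two common summary measures for a pair of biallelic loci: the coefficient D = p_AB − p_A·p_B and the squared correlation r². These are standard population-genetics quantities rather than definitions given in this glossary, and the function name and the 0/1 allele coding are assumptions made for the example.

```python
def linkage_disequilibrium(haplotypes):
    """Compute D and r^2 for two biallelic loci.

    `haplotypes` is a list of (allele_at_locus_1, allele_at_locus_2) pairs,
    with alleles coded as 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either locus is monomorphic in the sample.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Alleles always travel together: complete LD.
print(linkage_disequilibrium([(0, 0), (1, 1), (0, 0), (1, 1)]))
# All four combinations equally frequent: linkage equilibrium.
print(linkage_disequilibrium([(0, 0), (0, 1), (1, 0), (1, 1)]))
```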

Effect sizes

The contributions of alleles to the values of particular traits.

Positive predictive value

The probability that a positive answer belongs to a true positive.
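
The glossary terms Type 1 error, power, specificity and positive predictive value can all be read off the four cells of a confusion table. The sketch below shows the arithmetic; the function name and the example counts are hypothetical, and every denominator is assumed to be non-zero.

```python
def detection_metrics(tp, fp, tn, fn):
    """Summarize a yes/no test from true/false positive and negative counts."""
    return {
        "type_1_error": fp / (fp + tn),               # positive answer, negative item
        "power": tp / (tp + fn),                      # positive answer, positive item
        "specificity": tn / (tn + fp),                # negative answer, negative item
        "positive_predictive_value": tp / (tp + fp),  # positive answers that are true
    }


# Hypothetical membership test applied to 1,000 people, 10 of whom are
# truly in the cohort.
for name, value in detection_metrics(tp=9, fp=99, tn=891, fn=1).items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the test has 90% power and 90% specificity, yet only about 8% of its positive answers are correct, because true members are rare among the people tested.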

Expression quantitative trait locus

(eQTL). A genetic variant associated with variability in gene expression.

Genotype imputation

A class of statistical techniques to predict a genotype from information on surrounding genotypes.
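
To make the idea concrete, the toy sketch below predicts a masked allele by majority vote among reference haplotypes that agree with the observed flanking alleles. This is a deliberate simplification for illustration, not how production imputation software works (such tools generally rely on probabilistic haplotype models); the function name and the reference panel are invented.

```python
from collections import Counter


def impute_allele(reference_haplotypes, observed, masked_index):
    """Guess the allele at `masked_index`.

    Only reference haplotypes that agree with `observed` at every other
    known position get a vote (None marks an unknown position). Returns
    (allele, fraction_of_votes), or (None, 0.0) if nothing matches.
    """
    votes = Counter()
    for hap in reference_haplotypes:
        consistent = all(
            o is None or o == h
            for i, (o, h) in enumerate(zip(observed, hap))
            if i != masked_index
        )
        if consistent:
            votes[hap[masked_index]] += 1
    if not votes:
        return None, 0.0
    allele, count = votes.most_common(1)[0]
    return allele, count / sum(votes.values())


# Hypothetical reference panel of four-site haplotypes.
panel = ["ACGT", "ACGT", "ACTT", "GCTA"]
print(impute_allele(panel, ["A", "C", None, "T"], masked_index=2))
```

Here three panel haplotypes match the flanking alleles and two of them carry G at the masked site, so G is returned with a vote share of two thirds.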

Application programming interface

(API). A set of commands that specify the interface with a data set or software application.

χ2-statistic

A measure of association in case–control genome-wide association studies.
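
One common form of this statistic is the Pearson χ² on a 2×2 table of allele counts in cases and controls. The sketch below assumes that form, with one degree of freedom and no continuity correction; the function name and the counts are hypothetical, and all expected cell counts are assumed to be non-zero.

```python
def chi_square_2x2(case_a, case_b, control_a, control_b):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    observed = [[case_a, case_b], [control_a, control_b]]
    total = case_a + case_b + control_a + control_b
    row_sums = [sum(row) for row in observed]
    col_sums = [case_a + control_a, case_b + control_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under no association between allele and status.
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical counts: the first allele is seen 60 times in cases and
# 40 times in controls, the second allele 40 and 60 times.
print(chi_square_2x2(60, 40, 40, 60))  # prints 8.0
```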

Read mapping

A computationally intensive step in the analysis of high-throughput sequencing to find the location of a short DNA sequence (string) in the genome.

Edit distance

The total number of insertions, deletions and substitutions between two strings.
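
The definition above, read as the minimum number of such operations needed to turn one string into the other, corresponds to the Levenshtein distance. A minimal dynamic-programming sketch follows; the function name is illustrative.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs from s
                current[j - 1] + 1,      # insert ct into s
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(edit_distance("GATTACA", "GACTATA"))  # prints 2
```

The table-filling approach takes time proportional to the product of the two string lengths.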

About this article

DOI

https://doi.org/10.1038/nrg3723
