  • Opinion
Assessing and managing risk when sharing aggregate genetic variant data

An Erratum to this article was published on 27 September 2011

This article has been updated

Abstract

Access to genetic data across studies is an important aspect of identifying new genetic associations through genome-wide association studies (GWASs). Meta-analysis across multiple GWASs with combined cohort sizes of tens of thousands of individuals often uncovers many more genome-wide associated loci than the original individual studies; this emphasizes the importance of tools and mechanisms for data sharing. However, even sharing summary-level data, such as allele frequencies, inherently carries some degree of privacy risk to study participants. Here we discuss mechanisms and resources for sharing data from GWASs, particularly focusing on approaches for assessing and quantifying the privacy risks to participants that result from the sharing of summary-level data.

Figure 1: Sharing 5,000 SNPs at different prevalence or prior probabilities.

Change history

  • 27 September 2011

    In the above article, the incorrect link was provided for GWAS Central. The correct link should have been http://www.gwascentral.org. In the Further Information Box, the link to http://gwas.nih.gov was incorrectly described as 'GWAS Central (includes policy)'. The editors apologize for this error.

Acknowledgements

This manuscript represents the views and opinions of its authors and does not necessarily represent the views or policies of the NIH or the US Department of Health and Human Services. This research was supported in part by the Intramural Research Program of the NIH National Library of Medicine. D.W.C. would like to acknowledge support from the US National Heart, Lung and Blood Institute (NHLBI), award U01 HL086528. The authors thank I. Marpuri, S. Buchholtz and L. Gyi for their support in coordinating the development of this work.

Author information

Corresponding author

Correspondence to David W. Craig.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

23andMe personal genomics company

Electronic Medical Records and Genomics (eMERGE) Network

Database of Genotypes and Phenotypes (dbGaP)

dbSNP

European Genome–Phenome Archive

Genome Medicine Database of Japan (GeMDBJ)

GWAS Central

HuGE Navigator

JSNP

Nature Reviews Genetics series on Genome-Wide Association Studies

NIH GWAS policy

Phenotype–Genotype Integrator (PheGenI)

Public Population Project in Genomics (P3G)

SecureGenome

SNP Health Association Resource (SHARe)

US National Human Genome Research Institute catalogue of published GWASs

Wellcome Trust Case–Control Consortium (WTCCC)

Glossary

Allele frequency

The frequency of the less-common allele of a polymorphism. It has a value between 0 and 0.5 and can vary between populations.

Bayesian

A statistical framework for evaluating a hypothesis. The Bayesian approach assesses the probability of a hypothesis being correct by incorporating the prior probability of the hypothesis.

Discrimination threshold

The significance threshold for rejecting the null hypothesis in a statistical test.

Frequentist

A statistical framework for evaluating a hypothesis. The frequentist approach tests a hypothesis as being correct given the strength of a data set.

Imputation

A method for inferring untyped variants from neighbouring variants, based on linkage disequilibrium and haplotype structure.

Linear regression

The estimation of a first-order relationship between two variables, which involves fitting a line of best fit to the data.

Missingness

The percentage of samples that do not receive a genotype call for a SNP in a genome-wide association study.

Neyman–Pearson lemma

A theorem that assures the optimality of a likelihood ratio test between simple hypotheses at a given threshold.

Prevalence

The prior probability that a person is in a data set of interest. Alternatively, the term can refer to the fraction of individuals in a data set out of the total number of individuals that could be in the data set.

Reference data set

A data set of samples from individuals who are from the same population that was sampled in the summary-level data set of interest.
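
The glossary entries for 'Bayesian' and 'Prevalence' describe how a prior probability of membership enters the assessment of a hypothesis. The following is a minimal sketch of that idea using Bayes' theorem. It is not the authors' method: the function name, the `power` and `false_positive_rate` parameters and the numeric values are all illustrative assumptions that do not come from the article.

```python
def posterior_membership_probability(prevalence, power, false_positive_rate):
    """Return P(person is in the data set | membership test is positive).

    prevalence          -- prior probability of being in the data set
    power               -- assumed P(test positive | in the data set)
    false_positive_rate -- assumed P(test positive | not in the data set)
    All three arguments are probabilities in the range [0, 1].
    """
    for name, value in (
        ("prevalence", prevalence),
        ("power", power),
        ("false_positive_rate", false_positive_rate),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    # Bayes' theorem: weight each outcome by the prior probability.
    true_positives = power * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    total_positives = true_positives + false_positives
    if total_positives == 0.0:
        return 0.0  # the test can never be positive under these inputs
    return true_positives / total_positives


if __name__ == "__main__":
    # Hypothetical test characteristics, chosen only for illustration.
    for prior in (0.5, 0.01, 0.0001):
        posterior = posterior_membership_probability(prior, 0.9, 0.001)
        print(f"prevalence={prior:g}  posterior={posterior:.4f}")
```

Under these assumed test characteristics, the same positive result supports a much weaker conclusion when the prevalence is low, which is why the prior matters when quantifying privacy risk.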

About this article

Cite this article

Craig, D., Goor, R., Wang, Z. et al. Assessing and managing risk when sharing aggregate genetic variant data. Nat Rev Genet 12, 730–736 (2011). https://doi.org/10.1038/nrg3067

  • DOI: https://doi.org/10.1038/nrg3067
