  • Review Article

Opportunities and challenges for the use of common controls in sequencing studies

Abstract

Genome-wide association studies using large-scale genome and exome sequencing data have become increasingly valuable in identifying associations between genetic variants and disease, transforming basic research and translational medicine. However, this progress has not been equally shared across all people and conditions, in part due to limited resources. Leveraging publicly available sequencing data as external common controls, rather than sequencing new controls for every study, can better allocate resources by augmenting control sample sizes or providing controls where none existed. However, common control studies must be carefully planned and executed as even small differences in sample ascertainment and processing can result in substantial bias. Here, we discuss challenges and opportunities for the robust use of common controls in high-throughput sequencing studies, including study design, quality control and statistical approaches. Thoughtful generation and use of large and valuable genetic sequencing data sets will enable investigation of a broader and more representative set of conditions, environments and genetic ancestries than otherwise possible.

Fig. 1: Where to use common controls?
Fig. 2: Types of control samples.
Fig. 3: Types of bias that could affect common control analyses.
Fig. 4: Common control analysis workflow and example infrastructure with AnVIL.


  151. Abel, H. J. et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583, 83–89 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  152. Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  153. Phan, L. et al. ALFA: Allele Frequency Aggregator. NCBI https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/ (2020).

  154. Tadaka, S. et al. jMorp updates in 2020: large enhancement of multi-omics data resources on the general Japanese population. Nucleic Acids Res. 49, D536–D544 (2021).

    Article  CAS  PubMed  Google Scholar 

  155. Sequencing Initiative Suomi Project. Sequencing Initiative Suomi. SISu http://sisuproject.fi (2021).

  156. Wam. Dubai to map genome of all its residents. Khaleej Times https://www.khaleejtimes.com/uae/dubai-to-map-genome-of-all-its-residents (2018).

  157. Geis, C. A Chinese province is sequencing one million of its residents’ genomes. Futurism https://futurism.com/neoscope/chinese-province-sequencing-1-million-residents-genomes (2017).

  158. Health RI. European ‘1+Million Genomes’ initiative (1+MG). Health RI https://www.health-ri.nl/initiatives/european-1million-genomes-initiative-1mg (2020).

  159. Gaziano, J. M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).

    Article  PubMed  Google Scholar 

  160. Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 1080 (2019).

    Article  CAS  PubMed  Google Scholar 

  161. Byrd, J. B., Greene, A. C., Prasad, D. V., Jiang, X. & Greene, C. S. Responsible, practical genomic data sharing that accelerates research. Nat. Rev. Genet. 21, 615–629 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  162. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). This foundational manuscript is the first to present the FAIR principles (that is, findable, accessible, interoperable and reusable) for data sharing.

    Article  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

This work was supported by the Genome Sequencing Program (R35HG011293 to A.E.H. and C.R.G.; U01HG009080 to A.E.H., A.G.I., C.R.G. and M.A.R.; and U24HG008956 to S.B.). The Genome Sequencing Program is funded by the National Institutes of Health (NIH) National Human Genome Research Institute (NHGRI), the National Heart, Lung, and Blood Institute (NHLBI) and the National Eye Institute (NEI). G.L.W. received support for this work from NHGRI (R35HG011944).

Author information

Contributions

G.L.W., J.M., J.L.E. and A.E.H. researched the literature. G.L.W., J.M., A.G.I., S.B. and A.E.H. provided substantial contributions to discussion of the content. G.L.W., J.M., S.B. and A.E.H. wrote the article. All authors reviewed and/or edited the manuscript before submission.

Corresponding author

Correspondence to Audrey E. Hendricks.

Ethics declarations

Competing interests

C.R.G. owns stock in 23andMe. M.A.R. is a scientific founder of Broadwing Bio, a consultant for MazeTx, and is currently on leave at HiBio. The other authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Genetics thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

1000 Genomes Project: https://www.internationalgenome.org

All of Us: https://www.researchallofus.org/

AnVIL: https://anvilproject.org

BioData Catalyst: https://biodatacatalyst.nhlbi.nih.gov

CCDG: https://ccdg.rutgers.edu

CSVS: http://csvs.babelomics.org/

dbGaP: https://www.ncbi.nlm.nih.gov/gap/

dbGaP ALFA: https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/

EGA: https://ega-archive.org

Estonian Biobank: https://genomics.ut.ee/en/content/estonian-biobank

FinnGen: https://finngen.gitbook.io/documentation/data-download

GenomeAsia 100K: https://browser.genomeasia100k.org

gnomAD v.2.1: https://gnomad.broadinstitute.org/downloads

gnomAD v.3.1: https://gnomad.broadinstitute.org/downloads

H3Africa: https://catalog.h3africa.org

HGDP: https://www.internationalgenome.org/data-portal/data-collection/hgdp

INTERVAL: https://www.intervalstudy.org.uk

jMorp: https://jmorp.megabank.tohoku.ac.jp/202109/downloads/

Researcher Workbench: https://www.researchallofus.org/data-tools/workbench/

SGDP: https://cloud.google.com/life-sciences/docs/resources/public-datasets/simons

SISu v4.1: https://sisuproject.fi

Taiwan Biobank: https://taiwanview.twbiobank.org.tw/browse38

TOPMed: https://topmed.nhlbi.nih.gov

TOPMed Bravo: https://bravo.sph.umich.edu/freeze8/hg38/

UK Biobank: https://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=263

Glossary

Monogenic

A condition influenced by one genetic locus.

Oligogenic

A condition influenced by a few genetic loci.

Polygenic

A condition influenced by a large number of genetic loci.

Allele frequencies

The rates of genetic variant types in a specified population.

Common controls

Controls used for multiple studies.

Bias

Systematic error (as opposed to error due to chance processes), whether caused by statistical methods, differences between sampled individuals and the population they nominally represent, differences between cases and controls in ascertainment or sample processing, or other issues.

Confounding

A spurious association or lack of association caused by a third variable that is related to both the predictor variable (for example, allele frequency) and the outcome (for example, case status).

Internal controls

Controls that were ascertained, sequenced and processed together with the case sample. By contrast, external common controls were recruited, sequenced and processed separately, often using different technology from the case sample.

Biobanks

Collections of both biological samples (particularly DNA) and health information from individuals generally assembled from a region or a health system.

Harmonization

The formation of a single cohesive data set from two or more separate data sets by standardizing scales, definitions, quality control and other processing.

Batch effects

Differences between groups induced by processing over different times, places or technologies unrelated to biological causes.

Quality control

A process where low-quality data or observations are identified and improved or removed from further analysis.

Statistical power

The probability of rejecting the null hypothesis when it is false.

Ascertained cases

Study participants recruited because they are known to have the disease, outcome or condition of interest.

Ascertained controls

Study participants recruited because they are known not to have the disease, outcome or condition of interest.

Convenience sample

A sample drawn from an easily accessible, but often not representative, cohort.

Population controls

A control group sampled from a population but possibly lacking information regarding the condition of interest, with the result that some of the population controls will likely have the condition of interest.

Admixed

A term to denote the mixture of genetic ancestries from two or more divergent groups.

Population stratification

The presence of subpopulations with differing allele frequencies in a study; a source of confounding if phenotypes also vary by subpopulation.

False positives

Test results that are statistically significant even though there is no real association. By contrast, a false negative is a test result that is not statistically significant even though there is a real association.

Fine-scale ancestry

Genetic differentiation at a regional level (such as subcontinental), as opposed to continental-level ancestry.

Metadata

A high-level description of a data set, often including details of the cohort and of data generation.

Local ancestry

The genetic ancestry of a particular chromosomal region on a haplotype level.

Minor allele frequency (MAF)

For a genetic variant with two alleles, the frequency, in a specified population, of the less frequent allele.

In silico validation

Secondary quality control analysis of genotype calls, often of top association results, that passed the initial harmonization process to ensure that differences in processing do not drive important association signals.

Partial replication

Repeating association analysis reusing some data from the discovery analysis (for example, discovery cases and new external common controls).
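
As an illustration of two glossary terms, minor allele frequency and population stratification, the following is a minimal Python sketch and not a method taken from the article. The function names, sample sizes and allele frequencies are all hypothetical. It assumes diploid individuals, a biallelic variant and a variant with no effect on disease, so that cases and controls drawn from the same subpopulation share the same allele frequency. Any difference in the pooled frequencies then comes only from the different subpopulation mix of the cases and the external controls.

```python
from typing import List, Tuple


def minor_allele_frequency(alt_allele_count: int, n_individuals: int) -> float:
    """Return the MAF of a biallelic variant in diploid individuals."""
    total_alleles = 2 * n_individuals  # two allele copies per diploid individual
    alt_freq = alt_allele_count / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def pooled_frequency(groups: List[Tuple[int, float]]) -> float:
    """Pool allele frequencies over (n_individuals, allele_frequency) groups."""
    total = sum(n for n, _ in groups)
    return sum(n * freq for n, freq in groups) / total


# Hypothetical allele frequencies in two subpopulations (assumed values).
FREQ_A, FREQ_B = 0.10, 0.30

# Cases come mostly from subpopulation A; external controls are evenly mixed.
cases = [(900, FREQ_A), (100, FREQ_B)]
controls = [(500, FREQ_A), (500, FREQ_B)]

case_freq = pooled_frequency(cases)
control_freq = pooled_frequency(controls)

# Within each subpopulation the frequencies are identical by construction,
# yet the pooled comparison shows an apparent case-control difference.
print(f"pooled case frequency:    {case_freq:.2f}")     # 0.12
print(f"pooled control frequency: {control_freq:.2f}")  # 0.20
```

In this toy setting, comparing cases and controls within each subpopulation gives no difference, while the pooled comparison suggests one. That is the confounding pattern the glossary describes for population stratification.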

Cite this article

Wojcik, G.L., Murphy, J., Edelson, J.L. et al. Opportunities and challenges for the use of common controls in sequencing studies. Nat Rev Genet 23, 665–679 (2022). https://doi.org/10.1038/s41576-022-00487-4

  • DOI: https://doi.org/10.1038/s41576-022-00487-4
