Genome-wide association studies using large-scale genome and exome sequencing data have become increasingly valuable in identifying associations between genetic variants and disease, transforming basic research and translational medicine. However, this progress has not been equally shared across all people and conditions, in part due to limited resources. Leveraging publicly available sequencing data as external common controls, rather than sequencing new controls for every study, can better allocate resources by augmenting control sample sizes or providing controls where none existed. However, common control studies must be carefully planned and executed as even small differences in sample ascertainment and processing can result in substantial bias. Here, we discuss challenges and opportunities for the robust use of common controls in high-throughput sequencing studies, including study design, quality control and statistical approaches. Thoughtful generation and use of large and valuable genetic sequencing data sets will enable investigation of a broader and more representative set of conditions, environments and genetic ancestries than otherwise possible.
This work was supported by the Genome Sequencing Program (R35HG011293 to A.E.H. and C.R.G.; U01HG009080 to A.E.H., A.G.I., C.R.G. and M.A.R.; and U24HG008956 to S.B.). The Genome Sequencing Program is funded by the National Institutes of Health (NIH) National Human Genome Research Institute (NHGRI), the National Heart, Lung, and Blood Institute (NHLBI) and the National Eye Institute (NEI). G.L.W. received support for this work from NHGRI (R35HG011944).
C.R.G. owns stock in 23andMe. M.A.R. is a scientific founder of Broadwing Bio, a consultant for MazeTx, and is currently on leave at HiBio. The other authors declare no competing interests.
Peer review information
Nature Reviews Genetics thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
1000 Genomes Project: https://www.internationalgenome.org
All of Us: https://www.researchallofus.org/
BioData Catalyst: https://biodatacatalyst.nhlbi.nih.gov
dbGaP ALFA: https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/
Estonian Biobank: https://genomics.ut.ee/en/content/estonian-biobank
GenomeAsia 100K: https://browser.genomeasia100k.org
gnomAD v.2.1: https://gnomad.broadinstitute.org/downloads
gnomAD v.3.1: https://gnomad.broadinstitute.org/downloads
Researcher Workbench: https://www.researchallofus.org/data-tools/workbench/
SISu v4.1: https://sisuproject.fi
Taiwan Biobank: https://taiwanview.twbiobank.org.tw/browse38
TOPMed Bravo: https://bravo.sph.umich.edu/freeze8/hg38/
UK Biobank: https://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=263
- Monogenic
A condition influenced by one genetic locus.
- Oligogenic
A condition influenced by a few genetic loci.
- Polygenic
A condition influenced by a large number of genetic loci.
- Allele frequencies
The rates of genetic variant types in a specified population.
- Common controls
Controls used for multiple studies.
- Bias
Systematic error (as opposed to error due to chance processes), whether caused by statistical methods, differences between sampled individuals and the population they nominally represent, differences between cases and controls in ascertainment or sample processing, or other issues.
- Confounding
A spurious association or lack of association caused by a third variable that is related to both the predictor variable (for example, allele frequency) and the outcome (for example, case status).
- Internal controls
Controls that were ascertained, sequenced and processed together with the case sample. By contrast, external common controls were recruited, sequenced and processed separately, often using different technology from the case sample.
- Biobanks
Collections of both biological samples (particularly DNA) and health information from individuals generally assembled from a region or a health system.
- Harmonization
The formation of a single cohesive data set from two or more separate data sets by standardizing scales, definitions, quality control and other processing.
- Batch effects
Differences between groups induced by processing over different times, places or technologies unrelated to biological causes.
- Quality control
A process where low-quality data or observations are identified and improved or removed from further analysis.
- Statistical power
The probability of rejecting the null hypothesis when it is false.
- Ascertained cases
Participants of a study who are recruited to have a known disease, outcome or condition of interest.
- Ascertained controls
Participants of a study who are recruited to not have a known disease, outcome or condition of interest.
- Convenience sample
A sample drawn from an easily accessible, but often not representative, cohort.
- Population controls
A control group sampled from a population but possibly lacking information regarding the condition of interest, with the result that some of the population controls will likely have the condition of interest.
- Admixed
A term to denote the mixture of genetic ancestries from two or more divergent groups.
- Population stratification
The presence of subpopulations with differing allele frequencies in a study; a source of confounding if phenotypes also vary by subpopulation.
- False positives
Test results that are statistically significant even though there is no real association. By contrast, a false negative is a test result that is not statistically significant even though there is a real association.
- Fine-scale ancestry
Genetic differentiation at a regional level (such as subcontinental), as opposed to continental-level ancestry.
- Metadata
A high-level description of a data set, often including details of the cohort and of data generation.
- Local ancestry
The genetic ancestry of a particular chromosomal region on a haplotype level.
- Minor allele frequency
(MAF). For a genetic variant with two alleles, the frequency, in a specified population, of the less frequent allele.
- In silico validation
Secondary quality control analysis of genotype calls, often of top association results, that passed the initial harmonization process to ensure that differences in processing do not drive important association signals.
- Partial replication
Repeating association analysis reusing some data from the discovery analysis (for example, discovery cases and new external common controls).
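Three of the glossary entries above (minor allele frequency, population controls and statistical power) can be illustrated numerically. The following is a minimal sketch and is not taken from the article: the function names, the normal-approximation two-proportion test and all input values are assumptions chosen for illustration. It shows how adding external controls might raise the power of a single-variant allelic test, and how unscreened population controls, some of whom have the condition, pull the control allele frequency towards the case frequency.

```python
from math import sqrt
from statistics import NormalDist


def minor_allele_frequency(alt_allele_count: int, total_alleles: int) -> float:
    """Frequency of the less frequent allele at a biallelic variant."""
    if total_alleles <= 0 or not 0 <= alt_allele_count <= total_alleles:
        raise ValueError("need total_alleles > 0 and 0 <= alt_allele_count <= total_alleles")
    alt_frequency = alt_allele_count / total_alleles
    return min(alt_frequency, 1.0 - alt_frequency)


def unscreened_control_frequency(case_freq: float, unaffected_freq: float,
                                 prevalence: float) -> float:
    """Expected allele frequency in population controls that were not screened.

    A fraction `prevalence` of such controls is assumed to have the condition,
    so the expected frequency is a mixture of case and unaffected frequencies.
    """
    return prevalence * case_freq + (1.0 - prevalence) * unaffected_freq


def allelic_test_power(case_freq: float, control_freq: float,
                       n_cases: int, n_controls: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test on allele counts.

    Assumes diploid samples (two alleles per person) and independent alleles;
    this is a textbook normal approximation, not the article's method.
    """
    if not (0.0 < case_freq < 1.0 and 0.0 < control_freq < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    n1, n2 = 2 * n_cases, 2 * n_controls
    pooled = (n1 * case_freq + n2 * control_freq) / (n1 + n2)
    se_null = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    se_alt = sqrt(case_freq * (1.0 - case_freq) / n1
                  + control_freq * (1.0 - control_freq) / n2)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    diff = abs(case_freq - control_freq)
    upper = NormalDist().cdf((diff - z_crit * se_null) / se_alt)
    lower = NormalDist().cdf((-diff - z_crit * se_null) / se_alt)
    return upper + lower


if __name__ == "__main__":
    # Hypothetical variant: 2% frequency in cases, 1% in unaffected individuals.
    internal_only = allelic_test_power(0.02, 0.01, n_cases=500, n_controls=500)
    with_external = allelic_test_power(0.02, 0.01, n_cases=500, n_controls=5000)
    diluted = unscreened_control_frequency(0.02, 0.01, prevalence=0.10)
    unscreened = allelic_test_power(0.02, diluted, n_cases=500, n_controls=5000)
    print(f"power, 500 internal controls:    {internal_only:.2f}")
    print(f"power, 5000 screened controls:   {with_external:.2f}")
    print(f"power, 5000 unscreened controls: {unscreened:.2f}")
```

The sketch deliberately ignores batch effects and population stratification, which is why it should not be read as a realistic power calculation for a common control study; it only illustrates the sample-size and misclassification terms in isolation.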
Wojcik, G.L., Murphy, J., Edelson, J.L. et al. Opportunities and challenges for the use of common controls in sequencing studies. Nat Rev Genet 23, 665–679 (2022). https://doi.org/10.1038/s41576-022-00487-4