Opportunities and challenges for the use of common controls in sequencing studies

Wojcik, Genevieve L.; Murphy, Jessica; Edelson, Jacob L.; Gignoux, Christopher R.; Ioannidis, Alexander G.; Manning, Alisa; Rivas, Manuel A.; Buyske, Steven; Hendricks, Audrey E.

doi:10.1038/s41576-022-00487-4

Review Article
Published: 17 May 2022

Nature Reviews Genetics volume 23, pages 665–679 (2022)

Abstract

Genome-wide association studies using large-scale genome and exome sequencing data have become increasingly valuable in identifying associations between genetic variants and disease, transforming basic research and translational medicine. However, this progress has not been equally shared across all people and conditions, in part due to limited resources. Leveraging publicly available sequencing data as external common controls, rather than sequencing new controls for every study, can better allocate resources by augmenting control sample sizes or providing controls where none existed. However, common control studies must be carefully planned and executed as even small differences in sample ascertainment and processing can result in substantial bias. Here, we discuss challenges and opportunities for the robust use of common controls in high-throughput sequencing studies, including study design, quality control and statistical approaches. Thoughtful generation and use of large and valuable genetic sequencing data sets will enable investigation of a broader and more representative set of conditions, environments and genetic ancestries than otherwise possible.

**Fig. 1: Where to use common controls?**

**Fig. 3: Types of bias that could affect common control analyses.**

**Fig. 4: Common control analysis workflow and example infrastructure with AnVIL.**

References

McGuire, A. L. et al. The road ahead in genetics and genomics. Nat. Rev. Genet. 21, 581–596 (2020). Perspective from a panel of leading genetics experts across the world describing the current state of the field and where genetics should go to ensure that the insights gained by modern genomic research will benefit all.
Article CAS PubMed PubMed Central Google Scholar
Rehm, H. L. et al. ClinGen — the clinical genome resource. N. Engl. J. Med. 372, 2235–2242 (2015).
Article CAS PubMed PubMed Central Google Scholar
Wang, Q. et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature 597, 527–532 (2021).
Article CAS PubMed PubMed Central Google Scholar
Szustakowski, J. D. et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat. Genet. 53, 942–948 (2021).
Article CAS PubMed Google Scholar
Gibbs, R. A. The Human Genome Project changed everything. Nat. Rev. Genet. 21, 575–576 (2020).
Article CAS PubMed PubMed Central Google Scholar
UK10K Consortium et al. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015).
Article Google Scholar
Minikel, E. V. et al. Evaluating drug targets through human loss-of-function genetic variation. Nature 581, 459–464 (2020).
Article CAS PubMed PubMed Central Google Scholar
Banka, S. et al. How genetically heterogeneous is Kabuki syndrome?: MLL2 testing in 116 patients, review and analyses of mutation and phenotypic spectrum. Eur. J. Hum. Genet. 20, 381–388 (2012).
Article CAS PubMed Google Scholar
Biesecker, L. G. Exome sequencing makes medical genomics a reality. Nat. Genet. 42, 13–14 (2010).
Article CAS PubMed Google Scholar
Ng, S. B. et al. Exome sequencing identifies the cause of a Mendelian disorder. Nat. Genet. 42, 30–35 (2010).
Article CAS PubMed Google Scholar
Akbari, P. et al. Sequencing of 640,000 exomes identifies GPR75 variants associated with protection from obesity. Science 373, eabf8683 (2021).
Article CAS PubMed Google Scholar
Flannick, J. et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls. Nature 570, 71–76 (2019).
Article CAS PubMed PubMed Central Google Scholar
Backman, J. D. et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021). Initial description of the data and potential provided by exomes for medical and genomic applications across the UK Biobank.
Article CAS PubMed PubMed Central Google Scholar
Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 100, 635–649 (2017).
Article CAS PubMed PubMed Central Google Scholar
Petrovski, S. & Goldstein, D. B. Unequal representation of genetic variation across ancestry groups creates healthcare inequality in the application of precision medicine. Genome Biol. 17, 157 (2016).
Article PubMed PubMed Central Google Scholar
Manrai, A. K. et al. Genetic misdiagnoses and the potential for health disparities. N. Engl. J. Med. 375, 655–665 (2016).
Article PubMed PubMed Central Google Scholar
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007). Foundational early genome-wide association study leveraging a common set of controls to enhance discovery possibility across seven diseases. The paper includes stringent QC now common to ensure homogeneity across a common control data set.
Article Google Scholar
Corredor-Orlandelli, D. et al. Association between paraoxonase-1 p.Q192R polymorphism and coronary artery disease susceptibility in the Colombian population. Vasc. Health Risk Manag. 17, 689–699 (2021).
Article CAS PubMed PubMed Central Google Scholar
Tan, M. et al. Whole genome sequencing identifies rare germline variants enriched in cancer related genes in first degree relatives of familial pancreatic cancer patients. Clin. Genet. 100, 551–562 (2021).
Article CAS PubMed PubMed Central Google Scholar
Taroc, E. Z. M. et al. Gli3 regulates vomeronasal neurogenesis, olfactory ensheathing cell formation, and GnRH-1 neuronal migration. J. Neurosci. 40, 311–326 (2020).
Article CAS PubMed PubMed Central Google Scholar
Muskens, I. S. et al. Germline cancer predisposition variants and pediatric glioma: a population-based study in California. Neuro. Oncol. 22, 864–874 (2020).
Article CAS PubMed PubMed Central Google Scholar
Lorenzo-Salazar, J. M. et al. Novel idiopathic pulmonary fibrosis susceptibility variants revealed by deep sequencing. ERJ Open Res. 5, 00071 (2019).
Article PubMed PubMed Central Google Scholar
Georges, A. et al. Rare loss-of-function mutations of PTGIR are enriched in fibromuscular dysplasia. Cardiovasc. Res. 117, 1154–1165 (2021).
Article CAS PubMed Google Scholar
Li, C. et al. Mutation analysis of DNAJC family for early-onset Parkinson’s disease in a Chinese cohort. Mov. Disord. 35, 2068–2076 (2020).
Article CAS PubMed Google Scholar
Hillman, P. et al. Identification of novel candidate risk genes for myelomeningocele within the glucose homeostasis/oxidative stress and folate/one-carbon metabolism networks. Mol. Genet. Genom. Med. 8, e1495 (2020).
CAS Google Scholar
Hebert, L. et al. Burden of rare deleterious variants in WNT signaling genes among 511 myelomeningocele patients. PLoS ONE 15, e0239083 (2020).
Article CAS PubMed PubMed Central Google Scholar
Yuan, J.-H. et al. Genomic analysis of 21 patients with corneal neuralgia after refractive surgery. Pain Rep. 5, e826 (2020).
Article PubMed PubMed Central Google Scholar
Rojas, R. A. et al. Phenotypic continuum between Waardenburg syndrome and idiopathic hypogonadotropic hypogonadism in humans with SOX10 variants. Genet. Med. 23, 629–636 (2021).
Article CAS PubMed PubMed Central Google Scholar
Terradas, M. et al. TP53, a gene for colorectal cancer predisposition in the absence of Li–Fraumeni-associated phenotypes. Gut 70, 1139–1146 (2021).
Article CAS PubMed Google Scholar
Li, C. et al. Mutation analysis of LRP10 in a large Chinese familial Parkinson disease cohort. Neurobiol. Aging 99, 99.e1–99.e6 (2021).
Article CAS Google Scholar
Gunadi et al. Effect of semaphorin 3C gene variants in multifactorial Hirschsprung disease. J. Int. Med. Res. 49, 300060520987789 (2021).
Article CAS PubMed Google Scholar
Messina, A. et al. Neuron-derived neurotrophic factor is mutated in congenital hypogonadotropic hypogonadism. Am. J. Hum. Genet. 106, 58–70 (2020).
Article CAS PubMed Google Scholar
Trimarchi, M. et al. Gene expression analysis in patients with cocaine-induced midline destructive lesions. Medicina 57, 861 (2021).
Article PubMed PubMed Central Google Scholar
Marenne, G. et al. Exome sequencing identifies genes and gene sets contributing to severe childhood obesity, linking PHIP variants to repressed POMC transcription. Cell Metab. 31, 1107–1119.e12 (2020).
Article CAS PubMed PubMed Central Google Scholar
Singh, T. et al. Rare loss-of-function variants in SETD1A are associated with schizophrenia and developmental disorders. Nat. Neurosci. 19, 571–577 (2016).
Article CAS PubMed PubMed Central Google Scholar
Sazonovs, A. et al. Sequencing of over 100,000 individuals identifies multiple genes and rare variants associated with Crohns disease susceptibility. Preprint at bioRxiv https://doi.org/10.1101/2021.06.15.21258641 (2021).
Article Google Scholar
Malki, L. et al. Variant PADI3 in central centrifugal cicatricial alopecia. N. Engl. J. Med. 380, 833–841 (2019).
Article CAS PubMed Google Scholar
Ulirsch, J. C. et al. The genetic landscape of Diamond–Blackfan anemia. Am. J. Hum. Genet. 103, 930–947 (2018).
Article CAS PubMed PubMed Central Google Scholar
Hubert, J.-N. et al. The PI3K/mTOR pathway is targeted by rare germline variants in patients with both melanoma and renal cell carcinoma. Cancers 13, 2243 (2021).
Article CAS PubMed PubMed Central Google Scholar
Rashid, M. et al. ALPK1 hotspot mutation as a driver of human spiradenoma and spiradenocarcinoma. Nat. Commun. 10, 2213 (2019).
Article PubMed PubMed Central Google Scholar
Belhadj, S. et al. Candidate genes for hereditary colorectal cancer: mutational screening and systematic review. Hum. Mutat. 41, 1563–1576 (2020).
Article CAS PubMed Google Scholar
Mosquera Orgueira, A. et al. Detection of rare germline variants in the genomes of patients with B-cell neoplasms. Cancers 13, 1340 (2021).
Article PubMed PubMed Central Google Scholar
Li, C. et al. Targeted next generation sequencing of nine osteoporosis-related genes in the Wnt signaling pathway among Chinese postmenopausal women. Endocrine 68, 669–678 (2020).
Article CAS PubMed Google Scholar
Thorlund, K., Dron, L., Park, J. J. H. & Mills, E. J. Synthetic and external controls in clinical trials — a primer for researchers. Clin. Epidemiol. 12, 457–467 (2020).
Article PubMed PubMed Central Google Scholar
Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature 538, 161–164 (2016).
Article CAS PubMed PubMed Central Google Scholar
Ben-Eghan, C. et al. Don’t ignore genetic data from minority populations. Nature 585, 184–186 (2020).
Article CAS PubMed Google Scholar
McMahon, A. et al. Sequencing-based genome-wide association studies reporting standards. Cell Genomics 1, 100005 (2021).
Article CAS PubMed PubMed Central Google Scholar
Gurdasani, D., Barroso, I., Zeggini, E. & Sandhu, M. S. Genomics of disease risk in globally diverse populations. Nat. Rev. Genet. 20, 520–535 (2019). This paper provides a summary of the current state of genomic diversity in research and how diversity is key to discovery and translation in genomics.
Article CAS PubMed Google Scholar
Zhang, Y. et al. The prevalence of vitiligo: a meta-analysis. PLoS ONE 11, e0163806 (2016).
Article PubMed PubMed Central Google Scholar
Conway, M. et al. Analyzing the heterogeneity and complexity of electronic health record oriented phenotyping algorithms. AMIA Annu. Symp. Proc. 2011, 274–283 (2011).
PubMed PubMed Central Google Scholar
Newton, K. M. et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J. Am. Med. Inform. Assoc. 20, e147–e154 (2013).
Article PubMed PubMed Central Google Scholar
Shang, N. et al. Making work visible for electronic phenotype implementation: lessons learned from the eMERGE network. J. Biomed. Inform. 99, 103293 (2019).
Article PubMed PubMed Central Google Scholar
Davis, K. A. S. et al. Indicators of mental disorders in UK Biobank — a comparison of approaches. Int. J. Methods Psychiatr. Res. 28, e1796 (2019).
Article PubMed PubMed Central Google Scholar
Singh, T. et al. Rare coding variants in ten genes confer substantial risk for schizophrenia. Nature 604, 509–516 (2022).
Article CAS PubMed Google Scholar
Ledford, H. Paper on genetics of longevity retracted. Nature https://doi.org/10.1038/news.2011.429 (2011).
Article PubMed Google Scholar
Viering, D. H. H. M. et al. Genetics of renovascular hypertension in children. J. Hypertens. 38, 1964–1970 (2020).
Article CAS PubMed Google Scholar
Mazzarotto, F. et al. Reevaluating the genetic contribution of monogenic dilated cardiomyopathy. Circulation 141, 387–398 (2020).
Article CAS PubMed PubMed Central Google Scholar
Steel, D. et al. Loss-of-function variants in HOPS complex genes VPS16 and VPS41 cause early onset dystonia associated with lysosomal abnormalities. Ann. Neurol. 88, 867–877 (2020).
Article CAS PubMed Google Scholar
Johnson, J. O. et al. Association of variants in the SPTLC1 gene with juvenile amyotrophic lateral sclerosis. JAMA Neurol. 78, 1236–1248 (2021).
Article PubMed Google Scholar
Gallego-Martinez, A., Requena, T., Roman-Naranjo, P., May, P. & Lopez-Escamez, J. A. Enrichment of damaging missense variants in genes related with axonal guidance signalling in sporadic Meniere’s disease. J. Med. Genet. 57, 82–88 (2020).
Article CAS PubMed Google Scholar
Kwok, A. J., Mentzer, A. & Knight, J. C. Host genetics and infectious disease: new tools, insights and translational opportunities. Nat. Rev. Genet. 22, 137–153 (2021).
Article CAS PubMed Google Scholar
Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK biobank participants with those of the general population. Am. J. Epidemiol. 186, 1026–1034 (2017).
Article PubMed PubMed Central Google Scholar
Wright, C. F. et al. Assessing the pathogenicity, penetrance, and expressivity of putative disease-causing variants in a population setting. Am. J. Hum. Genet. 104, 275 (2019).
Article CAS PubMed PubMed Central Google Scholar
Povysil, G. et al. Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat. Rev. Genet. 20, 747–759 (2019). Review describing rare variant aggregation testing, a common method for association in sequencing studies. Beyond describing techniques, the review covers specific filtering and quality control needed to ensure appropriate statistical calibration.
Article CAS PubMed Google Scholar
Riveros-McKay, F. et al. Genetic architecture of human thinness compared to severe obesity. PLoS Genet. 15, e1007603 (2019).
Article PubMed PubMed Central Google Scholar
Moskvina, V., Holmans, P., Schmidt, K. M. & Craddock, N. Design of case–controls studies with unscreened controls. Ann. Hum. Genet. 69, 566–576 (2005).
Article CAS PubMed Google Scholar
Sham, P. C. & Purcell, S. M. Statistical power and significance testing in large-scale genetic studies. Nat. Rev. Genet. 15, 335–346 (2014).
Article CAS PubMed Google Scholar
Auer, P. L. et al. Guidelines for large-scale sequence-based complex trait association studies: lessons learned from the NHLBI Exome Sequencing Project. Am. J. Hum. Genet. 99, 791–801 (2016).
Article CAS PubMed PubMed Central Google Scholar
Alberts, B. Editorial expression of concern. Science 330, 912 (2010).
Article CAS PubMed Google Scholar
Campbell, C. D. et al. Demonstrating stratification in a European American population. Nat. Genet. 37, 868–872 (2005).
Article CAS PubMed Google Scholar
Knowler, W. C., Williams, R. C., Pettitt, D. J. & Steinberg, A. G. Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am. J. Hum. Genet. 43, 520–526 (1988).
CAS PubMed PubMed Central Google Scholar
Hellwege, J. N. et al. Population stratification in genetic association studies. Curr. Protoc. Hum. Genet. 95, 1.22.1–1.22.23 (2017).
Google Scholar
Choudhry, S. et al. Population stratification confounds genetic association studies among Latinos. Hum. Genet. 118, 652–664 (2006).
Article PubMed Google Scholar
Helgason, A., Yngvadóttir, B., Hrafnkelsson, B., Gulcher, J. & Stefánsson, K. An Icelandic example of the impact of population structure on association studies. Nat. Genet. 37, 90–95 (2005).
Article CAS PubMed Google Scholar
Panarella, M. & Burkett, K. M. A cautionary note on the effects of population stratification under an extreme phenotype sampling design. Front. Genet. 10, 398 (2019).
Article PubMed PubMed Central Google Scholar
Gravel, S. et al. Demographic history and rare allele sharing among human populations. Proc. Natl Acad. Sci. USA 108, 11983–11988 (2011).
Article CAS PubMed PubMed Central Google Scholar
Mathieson, I. & McVean, G. Differential confounding of rare and common variants in spatially structured populations. Nat. Genet. 44, 243–246 (2012).
Article CAS PubMed PubMed Central Google Scholar
O’Connor, T. D. et al. Fine-scale patterns of population stratification confound rare variant association tests. PLoS ONE 8, e65834 (2013).
Article PubMed PubMed Central Google Scholar
Klann, J. G., Joss, M. A. H., Embree, K. & Murphy, S. N. Data model harmonization for the All Of Us Research Program: transforming i2b2 data into the OMOP common data model. PLoS ONE 14, e0212463 (2019).
Article CAS PubMed PubMed Central Google Scholar
Wei, W.-Q. et al. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLoS ONE 12, e0175508 (2017).
Article PubMed PubMed Central Google Scholar
Leitsalu, L. et al. Cohort profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int. J. Epidemiol. 44, 1137–1147 (2015).
Article PubMed Google Scholar
Choudhury, A. et al. Author correction: High-depth African genomes inform human migration and health. Nature 592, E26 (2021).
Article CAS PubMed PubMed Central Google Scholar
Di Angelantonio, E. et al. Efficiency and safety of varying the frequency of whole blood donation (INTERVAL): a randomised trial of 45 000 donors. Lancet 390, 2360–2371 (2017).
Article PubMed PubMed Central Google Scholar
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Article CAS PubMed PubMed Central Google Scholar
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Article CAS PubMed PubMed Central Google Scholar
Gutierrez-Sacristan, A. et al. GenoPheno: cataloging large-scale phenotypic and next-generation sequencing data within human datasets. Brief Bioinform. 22, 55–65 (2021).
Article CAS PubMed Google Scholar
FinnGen. FinnGen documentation of R5 release. FinnGen https://finngen.gitbook.io/documentation/ (2021).
Wei, C.-Y. et al. Genetic profiles of 103,106 individuals in the Taiwan Biobank provide insights into the health and history of Han Chinese. NPJ Genom. Med. 6, 10 (2021).
Article CAS PubMed PubMed Central Google Scholar
Karczewski, K. J., Francioli, L. C. & MacArthur, D. G. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Article CAS PubMed PubMed Central Google Scholar
Peña-Chilet, M. et al. CSVS, a crowdsourcing database of the Spanish population genetic variability. Nucleic Acids Res. 49, D1130–D1137 (2021).
Article PubMed Google Scholar
Mailman, M. D. et al. The NCBI dbGaP Database of Genotypes and Phenotypes. Nat. Genet. 39, 1181–1186 (2007).
Article CAS PubMed PubMed Central Google Scholar
Lappalainen, I. et al. The European Genome–Phenome Archive of human data consented for biomedical research. Nat. Genet. 47, 692–695 (2015).
Article CAS PubMed PubMed Central Google Scholar
UK Biobank. New costs for 2021. UK Biobank https://www.ukbiobank.ac.uk/enable-your-research/costs (2021).
Lee, S., Kim, S. & Fuchsberger, C. Improving power for rare-variant tests by integrating external controls. Genet. Epidemiol. 41, 610–619 (2017).
Article PubMed PubMed Central Google Scholar
Hendricks, A. E. et al. ProxECAT: Proxy External Controls Association Test. A new case–control gene region association test using allele frequencies from public controls. PLoS Genet. 14, e1007591 (2018).
Article PubMed PubMed Central Google Scholar
Guo, M. H., Plummer, L., Chan, Y.-M., Hirschhorn, J. N. & Lippincott, M. F. Burden testing of rare variants identified through exome sequencing via publicly available control data. Am. J. Hum. Genet. 103, 522–534 (2018).
Article CAS PubMed PubMed Central Google Scholar
Jiang, L. et al. Deviation from baseline mutation burden provides powerful and robust rare-variants association test for complex diseases. Nucleic Acids Res. 50, e34 (2022).
Article CAS PubMed Google Scholar
Lali, R. et al. Calibrated rare variant genetic risk scores for complex disease prediction using large exome sequence repositories. Nat. Commun. 12, 5852 (2021).
Article CAS PubMed PubMed Central Google Scholar
Bodea, C. A. et al. A method to exploit the structure of genetic ancestry space to enhance case–control studies. Am. J. Hum. Genet. 98, 857–868 (2016).
Article CAS PubMed PubMed Central Google Scholar
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Article CAS PubMed PubMed Central Google Scholar
Schatz, M. C. et al. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genom. 2, 100085 (2022).
Article CAS PubMed PubMed Central Google Scholar
National Heart, Lung, and Blood Institute, National Institutes of Health, US Department of Health and Human Services. The NHLBI BioData catalyst. Zenodo https://doi.org/10.5281/zenodo.3822858 (2020).
All of Us Research Program Investigators et al. The “All of Us” Research Program. N. Engl. J. Med. 381, 668–676 (2019).
Article Google Scholar
Langmead, B. & Nellore, A. Cloud computing for genomic data analysis and collaboration. Nat. Rev. Genet. 19, 208–219 (2018). This paper reviews how the current and future state of cloud computing will be fundamental for large-scale genomics research including for collaboration and reproducibility.
Article CAS PubMed PubMed Central Google Scholar
Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).
Yuen, D. et al. The Dockstore: enhancing a community platform for sharing reproducible and accessible computational protocols. Nucleic Acids Res. 49, W624–W632 (2021).
Article CAS PubMed PubMed Central Google Scholar
Uffelmann, E. et al. Genome-wide association studies. Nat. Rev. Methods Primers 1, 60 (2021).
Article Google Scholar
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Article CAS PubMed PubMed Central Google Scholar
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Article PubMed PubMed Central Google Scholar
Alexander, D. H. & Lange, K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12, 246 (2011).
Article PubMed PubMed Central Google Scholar
Reich, D., Price, A. L. & Patterson, N. Principal component analysis of genetic data. Nat. Genet. 40, 491–492 (2008).
Article CAS PubMed Google Scholar
Wang, C., Zhan, X., Liang, L., Abecasis, G. R. & Lin, X. Improved ancestry estimation for both genotyping and sequencing data using projection procrustes analysis and genotype imputation. Am. J. Hum. Genet. 96, 926–937 (2015).
Article CAS PubMed PubMed Central Google Scholar
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Article CAS PubMed Google Scholar
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Article Google Scholar
Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
Article PubMed PubMed Central Google Scholar
GenomeAsia100K Consortium. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).
Article Google Scholar
Maples, B. K., Gravel, S., Kenny, E. E. & Bustamante, C. D. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288 (2013).
Article CAS PubMed PubMed Central Google Scholar
Hilmarsson, H. et al. High resolution ancestry deconvolution for next generation genomic data. Preprint at bioRxiv https://doi.org/10.1101/2021.09.19.460980 (2021).
Article Google Scholar
Arriaga-MacKenzie, I. S. et al. Summix: a method for detecting and adjusting for population structure in genetic summary data. Am. J. Hum. Genet. 108, 1270–1282 (2021).
Article CAS PubMed PubMed Central Google Scholar
Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019). A large, multi-ethnic, multi-trait genome-wide association study paper from the Population Architecture using Genomics and Epidemiology (PAGE) study describing best practices for handling heterogeneous population data, including imputation, filtering and QC steps. The paper also describes the critical importance of genomic diversity in genetic association studies.
Article CAS PubMed PubMed Central Google Scholar
Choudhury, A. et al. High-depth African genomes inform human migration and health. Nature 586, 741–748 (2020).
Article CAS PubMed PubMed Central Google Scholar
Exome Variant Server. NHLBI Exome Sequencing Project (ESP). EVS http://evs.gs.washington.edu/EVS/ (2013).
Li, X. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 52, 969–983 (2020).
Article CAS PubMed PubMed Central Google Scholar
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Article CAS PubMed PubMed Central Google Scholar
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
Article CAS PubMed PubMed Central Google Scholar
Sim, N.-L. et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 40, W452–W457 (2012).
Article CAS PubMed PubMed Central Google Scholar
Li, Y. & Lee, S. Novel score test to increase power in association test by integrating external controls. Genet. Epidemiol. 45, 293–304 (2021).
Article CAS PubMed Google Scholar
Chen, S. & Lin, X. Analysis in case–control sequencing association studies with different sequencing depths. Biostatistics 21, 577–593 (2020).
Article PubMed Google Scholar
Hu, Y.-J., Liao, P., Johnston, H. R., Allen, A. S. & Satten, G. A. Testing rare-variant association without calling genotypes allows for systematic differences in sequencing between cases and controls. PLoS Genet. 12, e1006040 (2016).
Article PubMed PubMed Central Google Scholar
Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).
Article CAS PubMed PubMed Central Google Scholar
Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Article CAS PubMed PubMed Central Google Scholar
Clifton, E. A. D. et al. Associations between body mass index-related genetic variants and adult body composition: the Fenland cohort study. Int. J. Obes. 41, 613–619 (2017).
Article CAS Google Scholar
O’Connor, B. D. et al. The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows. F1000Res. 6, 52 (2017).
Article PubMed PubMed Central Google Scholar
Perkel, J. Democratic databases: science on GitHub. Nature 538, 127–128 (2016).
Article CAS PubMed Google Scholar
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Article CAS PubMed Google Scholar
Venkataraman G.R. et al. Bayesian model comparison for rare-variant association studies. Am. J. Hum. Genet. 108, 2354–2367 (2021).
Article CAS PubMed PubMed Central Google Scholar
Thomas, S. P. et al. Cultivating diversity as an ethos with an anti-racism approach in the scientific enterprise. HGG Adv. 108, 100052 (2021).
Google Scholar
Bonham, V. L. & Green, E. D. The genomics workforce must become more diverse: a strategic imperative. Am. J. Hum. Genet. 108, 3–7 (2021).
Article CAS PubMed PubMed Central Google Scholar
Bentley, A. R., Callier, S. L. & Rotimi, C. N. Evaluating the promise of inclusion of African ancestry populations in genomics. NPJ Genom. Med. 5, 5 (2020).
Article PubMed PubMed Central Google Scholar
Bezuidenhout, L. & Chakauya, E. Hidden concerns of sharing research data by low/middle-income country scientists. Glob. Bioeth. 29, 39–54 (2018).
Article PubMed PubMed Central Google Scholar
Tsosie, K. S., Yracheta, J. M. & Dickenson, D. Overvaluing individual consent ignores risks to tribal participants. Nat. Rev. Genet. 20, 497–498 (2019).
Article CAS PubMed PubMed Central Google Scholar
Tindana, P. & de Vries, J. Broad consent for genomic research and biobanking: perspectives from low- and middle-income countries. Annu. Rev. Genomics Hum. Genet. 17, 375–393 (2016). A review outlining the key elements to promote global health and equity when completing genomic research, such as through biobanks.
Article CAS PubMed Google Scholar
National Human Genome Research Institute. NOT-HG-21-022: notice announcing the National Human Genome Research Institute’s expectation for sharing quality metadata and phenotypic data. NIH https://grants.nih.gov/grants/guide/notice-files/NOT-HG-21-022.html (2021).
Fiume, M. et al. Federated discovery and sharing of genomic data using Beacons. Nat. Biotechnol. 37, 220–224 (2019).
Article CAS PubMed PubMed Central Google Scholar
Thorogood, A. et al. International federation of genomic medicine databases using GA4GH standards. Cell Genomics 1, 100032 (2021).
Article CAS PubMed PubMed Central Google Scholar
Rehm, H. L. et al. GA4GH: international policies and standards for data sharing across genomic research and healthcare. Cell Genom. 1, 100029 (2021).
Article CAS PubMed PubMed Central Google Scholar
Lawson, J. et al. The Data Use Ontology to streamline responsible access to human biomedical datasets. Cell Genom. 1, 100028 (2021).
Article CAS Google Scholar
National Heart, Lung, and Blood Institute. Catalyst Fellows Program. NHLBI https://biodatacatalyst.nhlbi.nih.gov/fellows/program/ (2021).
National Human Genome Research Institute. Massive Genome Informatics in the Cloud (MaGIC) Jamboree. AnVIL https://anvilproject.org/events/magic2020 (2020).
Global Alliance for Genomics and Health. GA4GH starter kit. GA4GH https://starterkit.ga4gh.org/ (2021).
Abel, H. J. et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583, 83–89 (2020).
Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
Phan, L. et al. ALFA: Allele Frequency Aggregator. NCBI https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/ (2020).
Tadaka, S. et al. jMorp updates in 2020: large enhancement of multi-omics data resources on the general Japanese population. Nucleic Acids Res. 49, D536–D544 (2021).
Sequencing Initiative Suomi Project. Sequencing Initiative Suomi. SISu http://sisuproject.fi (2021).
Wam. Dubai to map genome of all its residents. Khaleej Times https://www.khaleejtimes.com/uae/dubai-to-map-genome-of-all-its-residents (2018).
Geis, C. A Chinese province is sequencing one million of its residents’ genomes. Futurism https://futurism.com/neoscope/chinese-province-sequencing-1-million-residents-genomes (2017).
Health RI. European ‘1+Million Genomes’ initiative (1+MG). Health RI https://www.health-ri.nl/initiatives/european-1million-genomes-initiative-1mg (2020).
Gaziano, J. M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).
Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 1080 (2019).
Byrd, J. B., Greene, A. C., Prasad, D. V., Jiang, X. & Greene, C. S. Responsible, practical genomic data sharing that accelerates research. Nat. Rev. Genet. 21, 615–629 (2020).
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). This foundational manuscript is the first to present the FAIR principles (that is, findable, accessible, interoperable and reusable) for data sharing.

Acknowledgements

This work was supported by the Genome Sequencing Program (R35HG011293 to A.E.H. and C.R.G.; U01HG009080 to A.E.H., A.G.I., C.R.G. and M.A.R.; and U24HG008956 to S.B.). The Genome Sequencing Program is funded by the National Institutes of Health (NIH) National Human Genome Research Institute (NHGRI), the National Heart, Lung, and Blood Institute (NHLBI) and the National Eye Institute (NEI). G.L.W. received support for this work from NHGRI (R35HG011944).

Author information

Authors and Affiliations

Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
Genevieve L. Wojcik
Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
Jessica Murphy, Christopher R. Gignoux & Audrey E. Hendricks
Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, USA
Jessica Murphy & Audrey E. Hendricks
Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, USA
Jacob L. Edelson & Manuel A. Rivas
Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
Christopher R. Gignoux & Audrey E. Hendricks
Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
Christopher R. Gignoux & Audrey E. Hendricks
Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, USA
Alexander G. Ioannidis
Clinical and Translational Epidemiology Unit, Massachusetts General Hospital, Boston, MA, USA
Alexander G. Ioannidis
Metabolism Program, Broad Institute, Cambridge, MA, USA
Alisa Manning
Department of Medicine, Harvard Medical School, Boston, MA, USA
Alisa Manning
Department of Statistics, Rutgers University, Piscataway, NJ, USA
Steven Buyske

Authors

Genevieve L. Wojcik
Jessica Murphy
Jacob L. Edelson
Christopher R. Gignoux
Alexander G. Ioannidis
Alisa Manning
Manuel A. Rivas
Steven Buyske
Audrey E. Hendricks

Contributions

G.L.W., J.M., J.L.E. and A.E.H. researched the literature. G.L.W., J.M., A.G.I., S.B. and A.E.H. provided substantial contributions to discussion of the content. G.L.W., J.M., S.B. and A.E.H. wrote the article. All authors reviewed and/or edited the manuscript before submission.

Corresponding author

Correspondence to Audrey E. Hendricks.

Ethics declarations

Competing interests

C.R.G. owns stock in 23andMe. M.A.R. is a scientific founder of Broadwing Bio, a consultant for MazeTx, and is currently on leave at HiBio. The other authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Genetics thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Glossary

Monogenic: A condition influenced by one genetic locus.
Oligogenic: A condition influenced by a few genetic loci.
Polygenic: A condition influenced by a large number of genetic loci.
Allele frequencies: The rates of genetic variant types in a specified population.
Common controls: Controls used for multiple studies.
Bias: Systematic error (as opposed to error due to chance processes), whether caused by statistical methods, differences between sampled individuals and the population they nominally represent, differences between cases and controls in ascertainment or sample processing, or other issues.
Confounding: A spurious association or lack of association caused by a third variable that is related to both the predictor variable (for example, allele frequency) and the outcome (for example, case status).
Internal controls: Controls that were ascertained, sequenced and processed together with the case sample. By contrast, external common controls were recruited, sequenced and processed separately, often using different technology from the case sample.
Biobanks: Collections of both biological samples (particularly DNA) and health information from individuals generally assembled from a region or a health system.
Harmonization: The formation of a single cohesive data set from two or more separate data sets by standardizing scales, definitions, quality control and other processing.
Batch effects: Differences between groups induced by processing over different times, places or technologies unrelated to biological causes.
Quality control: A process where low-quality data or observations are identified and improved or removed from further analysis.
Statistical power: The probability of rejecting the null hypothesis when it is false.
Ascertained cases: Participants of a study who are recruited to have a known disease, outcome or condition of interest.
Ascertained controls: Participants of a study who are recruited to not have a known disease, outcome or condition of interest.
Convenience sample: A sample drawn from an easily accessible, but often not representative, cohort.
Population controls: A control group sampled from a population but possibly lacking information regarding the condition of interest, with the result that some of the population controls will likely have the condition of interest.
Admixed: A term to denote the mixture of genetic ancestries from two or more divergent groups.
Population stratification: The presence of subpopulations with differing allele frequencies in a study; a source of confounding if phenotypes also vary by subpopulation.
False positives: Test results that are statistically significant even though there is no real association. By contrast, a false negative is a test result that is not statistically significant even though there is a real association.
Fine-scale ancestry: Genetic differentiation at a regional level (such as subcontinental), as opposed to continental-level ancestry.
Metadata: A high-level description of a data set, often including details of the cohort and of data generation.
Local ancestry: The genetic ancestry of a particular chromosomal region on a haplotype level.
Minor allele frequency (MAF): For a genetic variant with two alleles, the frequency, in a specified population, of the less frequent allele.
In silico validation: Secondary quality control analysis of genotype calls, often of top association results, that passed the initial harmonization process to ensure that differences in processing do not drive important association signals.
Partial replication: Repeating association analysis reusing some data from the discovery analysis (for example, discovery cases and new external common controls).

About this article

Cite this article

Wojcik, G.L., Murphy, J., Edelson, J.L. et al. Opportunities and challenges for the use of common controls in sequencing studies. Nat Rev Genet 23, 665–679 (2022). https://doi.org/10.1038/s41576-022-00487-4

Accepted: 22 March 2022
Published: 17 May 2022
Issue Date: November 2022
DOI: https://doi.org/10.1038/s41576-022-00487-4
