Genetic association results are often interpreted with the assumption that study participation does not affect downstream analyses. Understanding the genetic basis of participation bias is challenging since it requires the genotypes of unseen individuals. Here we demonstrate that it is possible to estimate comparative biases by performing a genome-wide association study contrasting one subgroup versus another. For example, we showed that sex exhibits artifactual autosomal heritability in the presence of sex-differential participation bias. By performing a genome-wide association study of sex in approximately 3.3 million males and females, we identified over 158 autosomal loci spuriously associated with sex and highlighted complex traits underpinning differences in study participation between the sexes. For example, the body mass index–increasing allele at FTO was observed at higher frequency in males compared to females (odds ratio = 1.02, P = 4.4 × 10−36). Finally, we demonstrated how these biases can potentially lead to incorrect inferences in downstream analyses and propose a conceptual framework for addressing such biases. Our findings highlight a new challenge that genetic studies may face as sample sizes continue to grow.
This is a preview of subscription content, access via your institution
Open Access articles citing this article.
npj Precision Oncology Open Access 31 August 2023
Quantification of race/ethnicity representation in Alzheimer’s disease neuroimaging research in the USA: a systematic review
Communications Medicine Open Access 25 July 2023
Patterns of item nonresponse behaviour to survey questionnaires are systematic and associated with genetic loci
Nature Human Behaviour Open Access 29 June 2023
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$209.00 per year
only $17.42 per issue
Rent or buy this article
Prices vary by article type
Prices may be subject to local taxes which are calculated during checkout
The GWAS results are available through the GWAS catalog accession nos. GCST90013473 (23andMe) and GCST90013474. Full summary statistics for 23andMe are available upon request from https://research.23andme.com/dataset-access/.
Scripts are available at https://github.com/dsgelab/genobias.
Prictor, M., Teare, H. J. A. & Kaye, J. Equitable participation in biobanks: the risks and benefits of a “dynamic consent” approach. Front. Public Health 6, 253 (2018).
Leitsalu, L. et al. Cohort profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int. J. Epidemiol. 44, 1137–1147 (2015).
Klijs, B. et al. Representativeness of the LifeLines cohort study. PLoS ONE 10, e0137203 (2015).
Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am. J. Epidemiol. 186, 1026–1034 (2017).
Pedersen, C. B. et al. The iPSYCH2012 case-cohort sample: new directions for unravelling genetic and environmental architectures of severe mental disorders. Mol. Psychiatry 23, 6–14 (2018).
Rothman, K. J., Gallacher, J. E. J. & Hatch, E. E. Why representativeness should be avoided. Int. J. Epidemiol. 42, 1012–1014 (2013).
Keyes, K. M. & Westreich, D. UK Biobank, big data, and the consequences of non-representativeness. Lancet 393, 1297 (2019).
Swanson, J. M. The UK Biobank and selection bias. Lancet 380, 110 (2012).
Elwood, J. M. Commentary: on representativeness. Int. J. Epidemiol. 42, 1014–1015 (2013).
Pizzi, C. et al. Sample selection and validity of exposure–disease association estimates in cohort studies. J. Epidemiol. Community Health 65, 407–411 (2011).
Richiardi, L., Pizzi, C. & Pearce, N. Commentary: representativeness is usually not necessary and often should be avoided. Int. J. Epidemiol. 42, 1018–1022 (2013).
Perry, J. R. B. et al. Stratifying type 2 diabetes cases by BMI identifies genetic risk variants in LAMA1 and enrichment for risk variants in lean compared to obese cases. PLoS Genet. 8, e1002741 (2012).
Martin, J. et al. Association of genetic risk for schizophrenia with nonparticipation over time in a population-based cohort study. Am. J. Epidemiol. 183, 1149–1158 (2016).
Taylor, A. E. et al. Exploring the association of genetic factors with participation in the Avon Longitudinal Study of Parents and Children. Int. J. Epidemiol. 47, 1207–1216 (2018).
Adams, M. J. et al. Factors associated with sharing e-mail information and mental health survey participation in large population cohorts. Int. J. Epidemiol. 49, 410–421 (2020).
Tyrrell, J. et al. Genetic predictors of participation in optional components of UK Biobank. Nat. Commun. 12, 886 (2021).
Munafò, M. R., Tilling, K., Taylor, A. E., Evans, D. M. & Davey Smith, G. Collider scope: when selection bias can substantially influence observed associations. Int. J. Epidemiol. 47, 226–235 (2018).
Boraska, V. et al. Genome-wide meta-analysis of common variant differences between men and women. Hum. Mol. Genet. 21, 4805–4815 (2012).
Ryu, D., Ryu, J. & Lee, C. Genome-wide association study reveals sex-specific selection signals against autosomal nucleotide variants. J. Hum. Genet. 61, 423–426 (2016).
Watanabe, K. et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat. Genet. 51, 1339–1348 (2019).
Lee, J. J. et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat. Genet. 50, 1112–1121 (2018).
Censin, J. C. et al. Causal relationships between obesity and the leading causes of death in women and men. PLoS Genet. 15, e1008405 (2019).
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Gaziano, J. M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).
Chen, Z. et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int. J. Epidemiol. 40, 1652–1666 (2011).
Dewey, F. E. et al. Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study. Science 354, aaf6814 (2016).
Gottesman, O. et al. The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future. Genet. Med. 15, 761–771 (2013).
Denny, J. C. et al. The “All of Us” Research Program. N. Engl. J. Med. 381, 668–676 (2019).
Batty, G. D., Gale, C. R., Kivimäki, M., Deary, I. J. & Bell, S. Comparison of risk factor associations in UK Biobank against representative, general population based studies with conventional response rates: prospective cohort study and individual participant meta-analysis. BMJ 368, m131 (2020).
Richardson, D. B., Rzehak, P., Klenk, J. & Weiland, S. K. Analyses of case-control data for additional outcomes. Epidemiology 18, 441–445 (2007).
Monsees, G. M., Tamimi, R. M. & Kraft, P. Genome-wide association scans for secondary traits using case-control samples. Genet. Epidemiol. 33, 717–728 (2009).
Dudbridge, F. et al. Adjustment for index event bias in genome-wide association studies of subsequent events. Nat. Commun. 10, 1561 (2019).
Mahmoud, O., Dudbridge, F., Davey Smith, G., Munafò, M. & Tilling, K. Slope-Hunter: a robust method for index-event bias correction in genome-wide association studies of subsequent traits. Preprint at bioRxiv https://doi.org/10.1101/2020.01.31.928077 (2020).
Grotzinger, A. D. et al. Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat. Hum. Behav. 3, 513–525 (2019).
Heckman, J. J. Sample selection bias as a specification error. Econometrica 47, 153–161 (1979).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Olsen, L. et al. Prevalence of rearrangements in the 22q11.2 region and population-based risk of neuropsychiatric and developmental disorders in a Danish population: a case-cohort study. Lancet Psychiatry 5, 573–580 (2018).
Henn, B. M. et al. Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples. PLoS ONE 7, e34267 (2012).
Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
Zheng, X. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326–3328 (2012).
Watanabe, K., Taskesen, E., van Bochoven, A. & Posthuma, D. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 8, 1826 (2017).
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Baselmans, B. M. L. et al. Multivariate genome-wide analyses of the well-being spectrum. Nat. Genet. 51, 445–451 (2019).
Jansen, I. E. et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat. Genet. 51, 404–413 (2019).
Nolte, I. M. et al. Missing heritability: is the gap closing? An analysis of 32 complex traits in the Lifelines Cohort Study. Eur. J. Hum. Genet. 25, 877–885 (2017).
Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Gazal, S. et al. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421–1427 (2017).
Gazal, S., Marquez-Luna, C., Finucane, H. K. & Price, A. L. Reconciling S-LDSC and LDAK functional enrichment estimates. Nat. Genet. 51, 1202–1204 (2019).
Evans, L. M. et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat. Genet. 50, 737–745 (2018).
Lee, S. H., Wray, N. R., Goddard, M. E. & Visscher, P. M. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 88, 294–305 (2011).
Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015).
Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).
Hemani, G. et al. The MR-Base platform supports systematic causal inference across the human phenome. eLife 7, e34408 (2018).
Choi, S. W. & O’Reilly, P. F. PRSice-2: polygenic risk score software for biobank-scale data. Gigascience 8, giz082 (2019).
We thank G. Davey Smith for insightful comments. This research was conducted by using the UK Biobank resource under application no. 31063. A.G. was supported by the Academy of Finland Fellowship (no. 323116). This work was supported by the Medical Research Council (Unit Programme number MC_UU_12015/2). M.G.N. is a fellow of the Jacobs Foundation and is supported by ZonMw grant nos. 849200011 and 531003014 from the Netherlands Organization for Health Research and Development and a VENI grant awarded by the Dutch Research Council (VI.Veni.191 G.030). A. Abdellaoui is supported by the Foundation Volksbond Rotterdam and ZonMw grant no. 849200011 from the Netherlands Organization for Health Research and Development. The FinnGen project is funded by 2 grants from Business Finland (nos. HUS 4685/31/2016 and UH 4386/31/2016) and 11 industry partners (AbbVie, AstraZeneca UK, Biogen MA, Celgene Corporation, Celgene International II Sàrl, Genentech, Merck Sharp & Dohme Corp, Pfizer, GlaxoSmithKline, Sanofi, Maze Therapeutics, Janssen Biotech). We thank the following biobanks for collecting the FinnGen project samples: Auria Biobank (https://www.auria.fi/biopankki/); THL Biobank (https://thl.fi/en/web/thl-biobank); Helsinki Biobank (https://www.helsinginbiopankki.fi/fi/etusivu); Biobank Borealis of Northern Finland (https://www.ppshp.fi/Tutkimus-ja-opetus/Biopankki/Pages/Biobank-Borealis-briefly-in-English.aspx); Finnish Clinical Biobank Tampere (https://www.tays.fi/en-US/Research_and_development/Finnish_Clinical_Biobank_Tampere); Biobank of Eastern Finland (https://ita-suomenbiopankki.fi/en/); Central Finland Biobank (https://www.ksshp.fi/fi-FI/Potilaalle/Biopankki); Finnish Red Cross Blood Service Biobank (https://www.veripalvelu.fi/verenluovutus/biopankkitoiminta); and Terveystalo Biobank (https://www.terveystalo.com/fi/Yritystietoa/Terveystalo-Biopankki/Biopankki/). All Finnish Biobanks are members of the BBMRI.fi infrastructure (http://www.bbmri.fi/). We thank the research participants and employees of 23andMe who contributed to this study.
P.N., A. Auton and D.H. are employed by 23andMe. P.J. is a paid consultant to Global Gene Corp and Humanity Inc.
Peer review information Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 Different participation bias scenarios that may lead to a correlation between sex and genetic variants.
S, selection (that is participation in the study); X, trait; Gx, genotype causing X. The assumed causal paths are shown in blue, and the induced correlations are shown in red. Three scenarios exist in which sex can become heritable due to selection. a, Sex causes X which in turn causes selection. b, X and sex influence the selection independently. c, The effect of X on selection is different between the two sexes. This is the scenario discussed in the paper. We have run simulations (Supplementary Fig. 3) and scenarios a and b are less likely to be observed because the effect of the trait on selection would need to be extremely large.
On the y-axis is the effect in the entire study population (n = 2,462,132), and on the x-axis is the effect only among those younger than 30 years (n = 320,366). Error bars represent the confidence intervals for the effect size estimates.
Extended Data Fig. 3 Effect of sex-differential participation bias on the genetic correlation between y0 and y1 when the phenotypes have h2 = 0.1 or h2 = 0.3.
Each line represents a different degree of participation bias, expressed as the odds ratio (OR) used for the sampling. The higher the OR, the higher the degree of participation bias. The x-axis represents different values for the parameter k that gives the sex-differential effect. The smaller k is, the higher is the degree of the sex-differential effect. Under no partecipation bias or sex-differential effect y0 and y1 have a genetic correlation equal to 0.
The forest plot shows the effect of sampling men and women differentially based on BMI. The x-axis represents different values of bias introduced. For higher values, heavier males and leaner women are randomly picked. The number on top of the segment represents the P-value of the difference in effect between the two sexes using the Z-score method. The bias becomes large enough to be detected as ‘significant’ even at the lower values of bias applied. The straight lines represent the effect of BMI on T2D estimated without any sample selection.
About this article
Cite this article
Pirastu, N., Cordioli, M., Nandakumar, P. et al. Genetic analyses identify widespread sex-differential participation bias. Nat Genet 53, 663–671 (2021). https://doi.org/10.1038/s41588-021-00846-7
This article is cited by
Genome Medicine (2023)
Quantification of race/ethnicity representation in Alzheimer’s disease neuroimaging research in the USA: a systematic review
Communications Medicine (2023)
Adjusting for population stratification in polygenic risk score analyses: a guide for model specifications in the UK Biobank
Journal of Human Genetics (2023)
Nature Human Behaviour (2023)
Nature Genetics (2023)