Genetic analyses identify widespread sex-differential participation bias

Abstract

Genetic association results are often interpreted with the assumption that study participation does not affect downstream analyses. Understanding the genetic basis of participation bias is challenging since it requires the genotypes of unseen individuals. Here we demonstrate that it is possible to estimate comparative biases by performing a genome-wide association study contrasting one subgroup versus another. For example, we showed that sex exhibits artifactual autosomal heritability in the presence of sex-differential participation bias. By performing a genome-wide association study of sex in approximately 3.3 million males and females, we identified over 158 autosomal loci spuriously associated with sex and highlighted complex traits underpinning differences in study participation between the sexes. For example, the body mass index–increasing allele at FTO was observed at higher frequency in males compared to females (odds ratio = 1.02, P = 4.4 × 10⁻³⁶). Finally, we demonstrated how these biases can potentially lead to incorrect inferences in downstream analyses and propose a conceptual framework for addressing such biases. Our findings highlight a new challenge that genetic studies may face as sample sizes continue to grow.
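
To see why an odds ratio as small as 1.02 can still yield a very small P value, consider the simplest per-variant version of a GWAS of sex: a 2 × 2 table of allele counts by sex. The sketch below is an illustration only and is not the authors' pipeline. The function name and the allele counts are hypothetical, and it uses a Wald test on the log odds ratio rather than the regression models usually fitted in a GWAS.

```python
import math

def allelic_sex_association(alt_male, ref_male, alt_female, ref_female):
    """Wald test for an allele-frequency difference between the sexes.

    Returns the allelic odds ratio (males vs. females), the standard error
    of its logarithm (Woolf's formula), the z statistic and a two-sided
    P value from the normal approximation.
    """
    odds_ratio = (alt_male * ref_female) / (ref_male * alt_female)
    se_log_or = math.sqrt(
        1 / alt_male + 1 / ref_male + 1 / alt_female + 1 / ref_female
    )
    z = math.log(odds_ratio) / se_log_or
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return odds_ratio, se_log_or, z, p_value

# Hypothetical counts: allele frequency 0.405 in males and 0.400 in females.
small = allelic_sex_association(4_050, 5_950, 4_000, 6_000)
# Same frequencies with 200 times as many alleles per sex.
large = allelic_sex_association(810_000, 1_190_000, 800_000, 1_200_000)

print(f"OR = {small[0]:.3f}, P = {small[3]:.2g} (10,000 alleles per sex)")
print(f"OR = {large[0]:.3f}, P = {large[3]:.2g} (2,000,000 alleles per sex)")
```

The odds ratio is the same in both calls; only the standard error shrinks with sample size, which is why such small frequency differences become detectable only in very large cohorts.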

Fig. 1: Manhattan plot for a GWAS of sex in 2,462,132 participants from 23andMe.
Fig. 2: SNP heritability on the liability scale for sex across five studies.
Fig. 3: Illustration of the concept and consequences of sex-differential participation bias.
Fig. 4: Genetic correlation with being born female versus male and 38 traits in the UK Biobank and 23andMe.
Fig. 5: PS distribution highlights sex-differential participation by educational level in the UK Biobank.

Data availability

The GWAS results are available through the GWAS catalog accession nos. GCST90013473 (23andMe) and GCST90013474. Full summary statistics for 23andMe are available upon request from https://research.23andme.com/dataset-access/.

Code availability

Scripts are available at https://github.com/dsgelab/genobias.

Acknowledgements

We thank G. Davey Smith for insightful comments. This research was conducted by using the UK Biobank resource under application no. 31063. A.G. was supported by the Academy of Finland Fellowship (no. 323116). This work was supported by the Medical Research Council (Unit Programme number MC_UU_12015/2). M.G.N. is a fellow of the Jacobs Foundation and is supported by ZonMw grant nos. 849200011 and 531003014 from the Netherlands Organization for Health Research and Development and a VENI grant awarded by the Dutch Research Council (VI.Veni.191 G.030). A. Abdellaoui is supported by the Foundation Volksbond Rotterdam and ZonMw grant no. 849200011 from the Netherlands Organization for Health Research and Development. The FinnGen project is funded by 2 grants from Business Finland (nos. HUS 4685/31/2016 and UH 4386/31/2016) and 11 industry partners (AbbVie, AstraZeneca UK, Biogen MA, Celgene Corporation, Celgene International II Sàrl, Genentech, Merck Sharp & Dohme Corp, Pfizer, GlaxoSmithKline, Sanofi, Maze Therapeutics, Janssen Biotech). We thank the following biobanks for collecting the FinnGen project samples: Auria Biobank (https://www.auria.fi/biopankki/); THL Biobank (https://thl.fi/en/web/thl-biobank); Helsinki Biobank (https://www.helsinginbiopankki.fi/fi/etusivu); Biobank Borealis of Northern Finland (https://www.ppshp.fi/Tutkimus-ja-opetus/Biopankki/Pages/Biobank-Borealis-briefly-in-English.aspx); Finnish Clinical Biobank Tampere (https://www.tays.fi/en-US/Research_and_development/Finnish_Clinical_Biobank_Tampere); Biobank of Eastern Finland (https://ita-suomenbiopankki.fi/en/); Central Finland Biobank (https://www.ksshp.fi/fi-FI/Potilaalle/Biopankki); Finnish Red Cross Blood Service Biobank (https://www.veripalvelu.fi/verenluovutus/biopankkitoiminta); and Terveystalo Biobank (https://www.terveystalo.com/fi/Yritystietoa/Terveystalo-Biopankki/Biopankki/). All Finnish Biobanks are members of the BBMRI.fi infrastructure (http://www.bbmri.fi/). 
We thank the research participants and employees of 23andMe who contributed to this study.

Author information

Contributions

N.P., M.C., P.N., C.E.C., M.D.V.d.Z., A. Abdellaoui, D.H., B.M.N., R.K.W., M.G.N., J.R.B.P. and A.G. designed the study. N.P., M.C., P.N., G.M., A. Abdellaoui, B.H., M.K., V.M.R., P.D.B.P., N.B., J.K., T.D.A., M.D.V.d.Z., R.B., A.D.B., A. Auton, D.H., M.G.N., J.R.B.P. and A.G. analyzed the data. N.P., M.C., A. Abdellaoui, C.E.C., F.R.D., K.K.O., R.B., P.J., B.M.N., R.K.W., M.G.N., J.R.B.P. and A.G. interpreted the results. P.N., A. Abdellaoui, V.M.R., T.D.A., T.M., E.d.G., Y.O., A.D.B., A. Auton, D.H., B.M.N., M.G.N., J.R.B.P. and A.G. provided the data. N.P., M.C., B.M.N., M.G.N., J.R.B.P. and A.G. wrote the manuscript. The 23andMe Research Team, the FinnGen Study and the iPSYCH Consortium provided data.

Corresponding authors

Correspondence to John R. B. Perry or Andrea Ganna.

Ethics declarations

Competing interests

P.N., A. Auton and D.H. are employed by 23andMe. P.J. is a paid consultant to Global Gene Corp and Humanity Inc.

Additional information

Peer review information Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Different participation bias scenarios that may lead to a correlation between sex and genetic variants.

S, selection (that is, participation in the study); X, trait; Gx, genotype causing X. The assumed causal paths are shown in blue, and the induced correlations are shown in red. Three scenarios exist in which sex can become heritable due to selection. a, Sex causes X, which in turn causes selection. b, X and sex influence the selection independently. c, The effect of X on selection is different between the two sexes. This is the scenario discussed in the paper. We ran simulations (Supplementary Fig. 3); scenarios a and b are less likely to be observed because the effect of the trait on selection would need to be extremely large.
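
The contrast between scenarios b and c can be illustrated with a small deterministic calculation. The sketch below is not the authors' simulation (Supplementary Fig. 3). It assumes a hypothetical model in which a biallelic variant in Hardy-Weinberg equilibrium shifts a trait X with unit residual variance, and participation follows a logistic model in X with a sex-specific intercept and slope. All function names and parameter values are illustrative assumptions.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def participation_prob(genotype, beta, intercept, slope,
                       half_width=8.0, n_steps=4000):
    """P(S = 1 | G) for X = beta * G + e with e ~ N(0, 1).

    The expectation over e is approximated by a midpoint Riemann sum.
    """
    step = 2.0 * half_width / n_steps
    total = 0.0
    for i in range(n_steps):
        e = -half_width + (i + 0.5) * step
        density = math.exp(-0.5 * e * e) / math.sqrt(2.0 * math.pi)
        x = beta * genotype + e
        total += density * expit(intercept + slope * x) * step
    return total

def allele_freq_in_participants(p, beta, intercept, slope):
    """Frequency of the trait-increasing allele among participants."""
    hwe = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    weights = [
        hwe[g] * participation_prob(g, beta, intercept, slope)
        for g in range(3)
    ]
    return sum(g * w for g, w in enumerate(weights)) / (2 * sum(weights))

def sex_odds_ratio(p, beta, male, female):
    """Allelic odds ratio, males versus females, among participants.

    `male` and `female` are (intercept, slope) pairs of the selection model.
    """
    f_m = allele_freq_in_participants(p, beta, *male)
    f_f = allele_freq_in_participants(p, beta, *female)
    return (f_m / (1 - f_m)) / (f_f / (1 - f_f))

P, BETA = 0.4, 0.1  # hypothetical allele frequency and per-allele effect on X

# Scenario b: X and sex affect participation independently (same slope).
or_b = sex_odds_ratio(P, BETA, male=(-2.0, 0.5), female=(-1.0, 0.5))
# Scenario c: the effect of X on participation differs between the sexes.
or_c = sex_odds_ratio(P, BETA, male=(-2.0, 0.5), female=(-2.0, 0.0))

print(f"scenario b: OR = {or_b:.4f}")
print(f"scenario c: OR = {or_c:.4f}")
```

Under these assumed values the induced odds ratio is closer to 1 in scenario b than in scenario c, which is in line with the legend's remark that scenarios a and b would require a very large effect of the trait on selection.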

Extended Data Fig. 2 Effect size for association between SNPs and sex in 23andMe.

On the y-axis is the effect in the entire study population (n = 2,462,132), and on the x-axis is the effect only among those younger than 30 years (n = 320,366). Error bars represent the confidence intervals for the effect size estimates.

Extended Data Fig. 3 Effect of sex-differential participation bias on the genetic correlation between y0 and y1 when the phenotypes have h² = 0.1 or h² = 0.3.

Each line represents a different degree of participation bias, expressed as the odds ratio (OR) used for the sampling. The higher the OR, the higher the degree of participation bias. The x-axis represents different values for the parameter k that gives the sex-differential effect. The smaller k is, the higher the degree of the sex-differential effect. Under no participation bias or sex-differential effect, y0 and y1 have a genetic correlation equal to 0.

Extended Data Fig. 4 Effects of sex differential bias on the BMI→T2D relationship.

The forest plot shows the effect of sampling men and women differentially based on BMI. The x-axis represents different values of bias introduced. For higher values, heavier men and leaner women are randomly picked. The number on top of the segment represents the P-value of the difference in effect between the two sexes using the Z-score method. The bias becomes large enough to be detected as ‘significant’ even at the lower values of bias applied. The straight lines represent the effect of BMI on T2D estimated without any sample selection.
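
The Z-score method mentioned in this legend compares two independent effect estimates by dividing their difference by the standard error of that difference. A minimal sketch follows, assuming the male and female estimates are independent and approximately normally distributed. The function name and the example numbers are hypothetical and are not taken from the figure.

```python
import math

def sex_difference_z_test(beta_male, se_male, beta_female, se_female):
    """Two-sided test for a difference between two independent estimates."""
    # Variances add because the two estimates are assumed independent.
    se_diff = math.sqrt(se_male ** 2 + se_female ** 2)
    z = (beta_male - beta_female) / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical sex-specific estimates of the effect of BMI on T2D.
z, p = sex_difference_z_test(0.50, 0.05, 0.30, 0.05)
print(f"z = {z:.2f}, P = {p:.2g}")
```

If the two estimates come from overlapping or related samples, the independence assumption fails and the covariance between them would need to enter the denominator.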

Supplementary information

Supplementary Information

Supplementary Note and Figs. 1–6

Reporting Summary

Supplementary Tables

Supplementary Tables 1–11

About this article

Cite this article

Pirastu, N., Cordioli, M., Nandakumar, P. et al. Genetic analyses identify widespread sex-differential participation bias. Nat Genet 53, 663–671 (2021). https://doi.org/10.1038/s41588-021-00846-7

  • DOI: https://doi.org/10.1038/s41588-021-00846-7
