Introduction

Initial participation and continued engagement in cohort studies may be influenced by an individual’s social and lifestyle characteristics1. This selection has the potential to result in bias in estimating phenotypic and genotypic associations2. It is well established that large cohort studies tend to have a healthy volunteer bias in initial participation3. There is also a growing body of evidence suggesting that continued engagement in cohort studies may be influenced by a range of factors. Studies have demonstrated age, education, ancestry, geographic location and health status are associated with loss-to-follow-up4. However, as with all observational studies, these associations may be confounded and therefore not causal in nature.

In order to assess the causes of non-participation, genetic data on a more complete sample can be leveraged. Analysis of genetic data in the Avon Longitudinal Study of Parents and Children (ALSPAC) demonstrated a number of factors were causally related to participation. Education, agreeableness and openness caused higher participation, whilst higher BMI, schizophrenia, neuroticism and depression caused lower participation1,5. A study in the UK Biobank6 performed a genome-wide association study of completing the mental health questionnaire, identifying 25 loci associated with survey completion, and strong positive genetic correlations with educational attainment and better health and negative genetic correlations with psychological distress and schizophrenia.

In general, an analysis will give biased estimates if the exposure and the outcome variable (or causes of them) are associated with participation (conditional on the other variables in the analysis model7). Selection bias can also occur under other circumstances when only the outcome is related to selection, for example, if exposure does cause the outcome, and the outcome causes selection7. As another example, selection bias can occur if a modifier of the effect of exposure on outcome causes selection8. A comparison of associations between risk factors and overall and cause-specific mortality in UK Biobank and the less-selected Health Survey for England and Scottish Health Surveys showed wide variation in these associations9, with some over-estimated in UK Biobank and some under-estimated. Thus, to understand the impact of selective participation for a particular analysis, we need to identify factors that influence participation.

The UK Biobank has several measures of participation. Here, we utilise up to 451,036 individuals of European ancestry in the UK Biobank to identify factors that cause participation in the four available optional components of the baseline study in order to improve our understanding of the biases that may affect these associations and lead to false inferences. The four optional components tested were (a) the percentage of food frequency questionnaires (FFQ) completed, (b) acceptance of the invitation to wear a physical activity monitor, (c) acceptance of an invitation to participate in the mental health questionnaire (MHQ) and (d) the completion of the aide-memoire. We used two-sample Mendelian randomisation (MR) approaches to explore the role of over 80 predictors on participation in the UK Biobank. Finally, we also explored genetic correlations between participation in the UK Biobank and the ALSPAC study to test between-study consistency. If the same factors affect participation in studies that vary by geography, time period and design, then those studies will suffer the same bias, and thus replication of results across studies becomes meaningless10.

This study identified 32 variants associated with participation in at least one of the four optional components (P < 6 × 10−9), including loci with known links to intelligence and Alzheimer’s disease. Genetic correlations demonstrated that participation bias was common across studies, whilst MR provided evidence that longer educational duration, older menarche and taller stature increased participation, whilst higher levels of adiposity, dyslipidaemia, neuroticism, Alzheimer’s and schizophrenia reduced participation. Our effect estimates can be used for sensitivity analysis to account for selective participation biases in genetic or non-genetic analyses.

Results

Observational associations

The demographics of the participants included in this study are summarised (Table 1). Overall, 42,429 participants completed all four optional components of the UK Biobank study, whilst 51,141 participated in the food frequency questionnaire (FFQ), the physical activity actigraph monitoring and the mental health questionnaire (MHQ).

Table 1 Demographics of the four participation measures.

Participation in the four additional UK Biobank questionnaires and tests was associated with older age (FFQ and aide-memoire), female sex (all four outcomes), lower body mass index (all four outcomes), lower levels of deprivation (all four outcomes), higher fluid intelligence (all four outcomes), never smoking (all four outcomes), higher self-reported physical activity using the International Physical Activity Questionnaire (IPAQ) (FFQ and physical activity), higher measured physical activity (aide-memoire, MHQ), no depression (MHQ and aide-memoire) and no type 2 diabetes (all four outcomes). There was some evidence that the aide-memoire variable captured a different aspect of participation, with associations in the opposite direction to the other participation measures. For example, a longer duration in education was associated with lower odds of completing the aide-memoire, but higher odds of participating in the other three components. This was further evidenced by strong observational associations and genetic correlations between three of the participation variables, whilst completing the aide-memoire was not as robustly correlated with participation in the other optional surveys (Supplementary Tables 1 and 2).

We also generated a binary variable to compare participants who were invited to participate in at least one of the optional surveys (i.e., FFQ, MHQ and physical activity) versus those participants not invited (Supplementary Table 3). Receiving an invite to participate in the three optional surveys (n = 336,633) was associated with younger age, male sex, lower body mass index, lower levels of deprivation, higher fluid intelligence, never smoking and a lower prevalence of type 2 diabetes.

GWAS of the participation variables identified 32 loci

GWAS of the four participation traits was performed in individuals of European descent using BOLT-LMM, with sample sizes of N = 300,639 for the FFQ, N = 215,127 for physical activity monitoring, N = 294,787 for MHQ and N = 451,306 for aide-memoire. After clumping and using a stringent GWAS cut-off of P < 6 × 10−9, there were 8 loci for the FFQ, 1 locus for physical activity participation, 21 loci for MHQ participation and 2 loci for aide-memoire (Table 2 and Supplementary Fig. 1). Twenty-three variants were associated at P < 6 × 10−9 with receiving an invite to participate in any of the optional surveys (Supplementary Data 1). All variants reaching the less stringent P < 5 × 10−8 threshold are reported (Supplementary Tables 4 and 5). With the exception of the aide-memoire, many of the lead variants for the other participation measures were within 500 kb of another lead variant for a different participation measure (Table 2 and Supplementary Data 2). For example, 6/8 of the FFQ lead variants at P < 6 × 10−9 were within 500 kb of another lead variant for either actigraphy or MHQ participation, whilst the only variant at P < 6 × 10−9 for actigraphy was within 500 kb of an FFQ and MHQ variant and 4/21 variants at P < 6 × 10−9 for the MHQ were within 500 kb of an FFQ or actigraphy variant.
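The 500 kb proximity check between lead variants described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the `LeadVariant` type, the `near_other_trait` helper and all positions are hypothetical, and base-pair coordinates are assumed to be on the same genome build.

```python
from typing import List, NamedTuple


class LeadVariant(NamedTuple):
    trait: str  # participation measure the variant is a lead signal for
    chrom: int  # chromosome number
    pos: int    # base-pair position (same genome build assumed throughout)


WINDOW_BP = 500_000  # 500 kb window, as used in the text


def near_other_trait(variant: LeadVariant,
                     all_variants: List[LeadVariant],
                     window: int = WINDOW_BP) -> bool:
    """True if a lead variant for a different trait lies within `window` bp
    of `variant` on the same chromosome."""
    return any(
        other.trait != variant.trait
        and other.chrom == variant.chrom
        and abs(other.pos - variant.pos) <= window
        for other in all_variants
    )


# Hypothetical lead variants (positions are made up for illustration).
leads = [
    LeadVariant("FFQ", 1, 1_000_000),
    LeadVariant("MHQ", 1, 1_300_000),
    LeadVariant("MHQ", 2, 5_000_000),
    LeadVariant("aide-memoire", 3, 9_000_000),
]

shared = [near_other_trait(v, leads) for v in leads]
print(shared)  # [True, True, False, False]
```

Proximity alone does not establish that two traits share a causal variant; it only flags candidate shared loci.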

Table 2 Variants associated with participation from genome-wide association analyses in the UK Biobank (P < 6 × 10−9).

Two of the variants identified for FFQ participation were previously identified in GWAS of intelligence (rs11210871; refs. 11,12) and cognitive performance (rs13428598; ref. 13). For both variants, the allele associated with higher intelligence or cognitive performance is associated with completing more FFQ. A further two variants (rs9261655 and rs147412694) were associated with blood cell traits14. Here, the alleles associated with higher blood cell counts were associated with completing fewer FFQ. Whilst the reason behind this is unknown, higher white blood cell counts have previously been associated with poorer cognition15,16. Four of the eight variants were also in high LD (r2 > 0.8) with genome-wide significant (GWS) signals from behavioural GWAS, including ADHD, risk tolerance, smoking and alcohol consumption. As expected from previous work on participation and our understanding of risky behaviours, alleles associated with a higher risk of ADHD, higher risk tolerance and a higher risk of smoking or consuming alcohol were associated with lower FFQ participation (Table 2).

The locus identified for participation in actigraphy (rs55714359) was in partial linkage disequilibrium with the variants identified for participation in the mental health questionnaire (r2 = 0.52) and completing the food frequency questionnaire (r2 = 0.32). This variant was previously identified as associated with multiple sclerosis17, with the allele associated with higher odds of multiple sclerosis associated with lower odds of participation in physical activity. In the UK Biobank, this variant is also associated with adiposity related traits18. The allele associated with higher adiposity is also associated with lower odds of participation in physical activity monitoring.

Of the 25 loci identified in a GWAS of MHQ participation by Adams et al.6, we replicated 15/25 (60%) at P < 6 × 10−9 and 22/25 (88%) at P < 5 × 10−8 in this larger sample of related individuals. The three missing variants (rs35028061, rs13082026 and rs57692580) were directionally consistent and approaching GWS (P values were 5.1 × 10−8, 6.6 × 10−7 and 9.6 × 10−7, respectively). Of the 21 variants associated with MHQ participation at the stringent threshold, four were previously associated with cognitive function and intelligence measures (rs7542974, rs485929, rs11793831 and rs7108020; ref. 13) and a further three were in high LD with variants associated with intelligence outcomes. For all variants, the allele associated with higher intelligence or cognitive performance was associated with higher odds of completing the MHQ.

A missense mutation in APOE (rs429358) was associated with MHQ participation. The C-allele is a marker of the APOE-ε4 genotype, which is a major risk factor for Alzheimer’s disease19, and here was associated with lower odds of participation in the MHQ. Further analysis in the unrelated subset tested whether individuals with the APOE-ε4ε4 haplotype were less likely to participate in the MHQ compared to those with the APOE-ε2ε2 haplotype. Lower odds of MHQ participation were observed in APOE-ε4ε4 haplotype carriers, both in all individuals (OR: 0.89 (95% CI: 0.80, 1.00)) and in those who were less than 50 years old at recruitment (OR: 0.81 (95% CI: 0.65, 1.00)). This suggests that individuals with early signs of cognitive impairment had reduced capacity to participate in the MHQ.
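To make the form of these estimates concrete, the sketch below computes an unadjusted odds ratio with a Wald 95% confidence interval from a 2 × 2 table. It is a simplification under stated assumptions: the counts are hypothetical, the function name `odds_ratio_wald` is illustrative, and the reported estimates may well come from a covariate-adjusted regression model rather than a raw table.

```python
import math
from typing import Tuple


def odds_ratio_wald(a: int, b: int, c: int, d: int,
                    z: float = 1.96) -> Tuple[float, float, float]:
    """Unadjusted odds ratio and Wald confidence interval.

    a: carriers who participated      b: carriers who did not participate
    c: reference group, participated  d: reference group, did not participate
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio (Woolf's method).
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to illustrate the calculation.
or_, lo, hi = odds_ratio_wald(a=400, b=600, c=450, d=550)
print(f"OR: {or_:.2f} (95% CI: {lo:.2f}, {hi:.2f})")
```

An interval whose upper bound sits at 1.00, as reported above, indicates that the evidence for lower odds is present but not strong.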

The variant rs58101275 has previously been associated with bone mineral density20 and isoleucine levels21. The G allele raises both isoleucine levels and bone mineral density (BMD) and was associated with lower odds of completing the aide-memoire. Previous studies have demonstrated that BMD is inversely associated with cognition22 and Alzheimer’s disease, indicating those with higher BMD may have a better memory.

Of the 23 variants associated with receiving an invitation at P < 6 × 10−9, six were either top signals for MHQ participation or in high LD (r2 > 0.8) with variants for MHQ participation (Supplementary Table 3). For all six loci, the allele that was associated with higher odds of participation in the MHQ was associated with higher odds of receiving an invite to participate in at least one optional survey. Variant rs73078357 was previously identified as associated with email contact (Supplementary Table 3). Eight of the 23 variants were previously associated with cognitive performance13, intelligence12 and self-reported educational attainment13,23.

Genetic correlations with published GWAS studies

After Bonferroni correction (P < 1.5 × 10−5), we observed strong positive genetic correlations between three of the four participation measures (FFQ, MHQ and physical activity completion) and qualifications, fluid intelligence and years spent in education. Strong inverse genetic correlations were noted between three of the four participation measures (FFQ completion, MHQ and physical activity completion) and obesity-related traits. Completing the aide-memoire was strongly inversely correlated with risk-taking, ever smoking, driving fast, having fractured bones in the last 5 years and schizophrenia. It was positively associated with suffering from nerves and experiencing nervous feelings.

Genetic correlations with ALSPAC participation measures

There were positive genetic correlations between the ALSPAC participation measures and FFQ completion (mother: rg = 0.533, P = 3 × 10−8; child: rg = 0.488, P = 3 × 10−9), participation in MHQ (mother: rg = 0.616, P = 8 × 10−10, child: rg = 0.627, P = 2 × 10−12), and physical activity participation (mother rg = 0.487, P = 2 × 10−5; child rg = 0.319, P = 0.001) (Supplementary Table 2). The aide-memoire variable in UK Biobank was not strongly correlated with the ALSPAC participation measures (mother rg = 0.215, P = 0.08, child rg = 0.167, P = 0.14; Supplementary Table 2). Receiving an invite to participate was strongly correlated with the participation measures in ALSPAC (mother rg = 0.58, P = 1 × 10−9; child rg = 0.59, P = 1 × 10−10).

Mendelian randomisation analyses

In all individuals, Mendelian randomisation24,25 analysis demonstrated that 27 traits caused at least one participation measure at a threshold of P < 0.05 (P value based on the inverse-variance weighted (IVW) analysis), with 8 at the more stringent P < 0.0001 (Supplementary Table 6). Of the 27 traits, 15, 18, 10 and 6 were associated with FFQ, MHQ, physical activity and aide-memoire, respectively.
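For readers unfamiliar with the IVW estimator, a minimal sketch of its fixed-effect form from per-variant summary statistics is given below. The function name `ivw_estimate` and the input values are hypothetical, and this is not necessarily the exact implementation used for the reported results (random-effects or software-specific variants exist).

```python
import math
from typing import Sequence, Tuple


def ivw_estimate(beta_exposure: Sequence[float],
                 beta_outcome: Sequence[float],
                 se_outcome: Sequence[float]) -> Tuple[float, float]:
    """Fixed-effect inverse-variance weighted (IVW) MR estimate.

    Each variant j gives a Wald ratio beta_outcome[j] / beta_exposure[j];
    ratios are combined with weights beta_exposure[j]**2 / se_outcome[j]**2.
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total_weight = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return estimate, standard_error


# Hypothetical summary statistics for three variants; the outcome effects
# are exactly half the exposure effects, so the estimate should be 0.5.
bx = [0.10, 0.20, 0.05]
by = [0.05, 0.10, 0.025]
se_y = [0.01, 0.02, 0.01]

estimate, se = ivw_estimate(bx, by, se_y)
print(f"IVW estimate: {estimate:.3f} (SE {se:.3f})")
```

When the outcome is binary, the estimate is on the log-odds scale and is exponentiated to give an odds ratio per unit of the exposure.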

Longer duration in education and higher intelligence predicted higher odds of participation in the FFQ, MHQ and physical activity monitoring (Fig. 1A and Supplementary Table 6). For example, a one-SD longer duration (~5 years) in education caused higher odds of participation in the MHQ (1.78 (95% CI: 1.61, 1.98)) and physical activity monitoring (1.69 (95% CI: 1.36, 2.13)). In contrast, there was limited evidence for longer educational duration predicting the completion of the aide-memoire.

Fig. 1: Plots of the Mendelian Randomisation results.

Dot plots representing the inverse-variance weighted results from two-sample MR analyses for (A) educational, (B) anthropometric, (C) behavioural and (D) neurological and psychological traits. Error bars represent the 95% confidence intervals of the IVW estimate.

Higher adiposity caused lower odds of participation in the FFQ, MHQ and physical activity monitoring. For example, the odds ratios for participation in the MHQ and PA monitoring per one-SD higher waist:hip ratio were 0.85 (95% CI: 0.80, 0.89) and 0.83 (95% CI: 0.75, 0.93), respectively (Fig. 1B and Supplementary Data 3). Higher BMI caused lower odds of participation in the FFQ and physical activity monitoring in women (Supplementary Data 3). There was limited evidence that higher adiposity predicted aide-memoire completion. A one-SD taller stature caused higher odds of completing the MHQ (OR: 1.06 (95% CI: 1.04, 1.07)) and physical activity monitoring (OR: 1.07 (95% CI: 1.03, 1.11)). Taller stature also caused participants to complete more FFQ (Fig. 1B and Supplementary Data 3). There was no strong evidence that any of the other anthropometric measures tested caused participation, although many of the estimates have wide confidence intervals.

Genetic evidence demonstrated that behavioural characteristics caused participation (Fig. 1C). For example, older age of losing virginity caused participants to complete more FFQ and have higher odds of participation in the MHQ (OR: 1.15 (95% CI: 1.03, 1.28)). A twofold higher genetic liability for being a morning person chronotype caused higher odds of completing the aide-memoire (OR: 1.02 (95% CI: 1.01, 1.04)) and lower odds of completing the MHQ (OR: 0.98 (95% CI: 0.96, 1.00)). A twofold higher genetic liability for riskier behaviour caused lower odds of completing the aide-memoire (OR: 0.27 (95% CI: 0.19, 0.40)), but was not linked to completing the optional surveys. In a subset of former and current smokers, the role of smoking heaviness on participation was explored. A one-SD higher number of cigarettes smoked per day (~11 cigarettes) caused lower odds of participating in the MHQ (OR: 0.88 (95% CI: 0.85, 0.92)) and the physical activity monitoring (OR: 0.93 (95% CI: 0.89, 0.97)) (Supplementary Data 4).

C-reactive protein (CRP) was the only biomarker tested with some evidence of a causal association, with a twofold higher CRP causing higher odds of completing the MHQ (OR: 1.08 (95% CI: 1.04, 1.12); Supplementary Data 3).

Higher genetic liability of cancer and non-cancer diseases and poorer metabolic health generally caused lower odds of participation (Supplementary Data 3). For example, a twofold higher genetic liability of breast cancer was associated with lower odds of participating in the MHQ (OR: 0.98 (95% CI: 0.96, 1.00)), physical activity monitoring (OR: 0.97 (95% CI: 0.95, 1.00)) and completing the aide-memoire (OR: 0.97 (95% CI: 0.95, 1.00)).

Several psychological and neurological conditions caused lower odds of participation (Fig. 1D and Supplementary Data 4). For example, a genetic liability to ADHD and schizophrenia was associated with the completion of fewer FFQ and lower odds of participation in the MHQ and physical activity monitoring. A twofold higher genetic risk of schizophrenia lowered the odds of completing the MHQ by 3% (OR: 0.97 (95% CI: 0.95, 0.99)). A genetic liability for schizophrenia also lowered the odds of completing the aide-memoire. Genetic liability for autism and extraversion caused fewer FFQ to be completed. Alzheimer’s disease genetic liability was associated with lower odds of participation in the FFQ, physical activity monitoring and MHQ. A doubling in Alzheimer’s genetic liability was associated with lower odds of completing the MHQ (OR: 0.976 (95% CI: 0.969, 0.983)).
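The "twofold higher genetic liability" scale and the percentage statements above can be related with a little arithmetic. The sketch below assumes a common convention for binary exposures, in which an MR estimate per unit increase in the log-odds of the exposure is multiplied by ln 2 to express it per doubling of the odds of the exposure. This section does not state how the rescaling was done, so the convention, like the helper names, is an assumption.

```python
import math


def or_per_doubling(log_or_per_log_odds_unit: float) -> float:
    """Rescale an MR estimate (log odds ratio of the outcome per unit increase
    in the log-odds of a binary exposure) to an odds ratio per doubling of the
    odds of the exposure. Assumed convention: multiply by ln(2)."""
    return math.exp(log_or_per_log_odds_unit * math.log(2))


def percent_change_in_odds(odds_ratio: float) -> float:
    """Express an odds ratio as a percentage change in odds."""
    return (odds_ratio - 1.0) * 100.0


# The schizophrenia estimate quoted above: OR 0.97 corresponds to 3% lower odds.
print(round(percent_change_in_odds(0.97), 1))   # -3.0
# The Alzheimer's estimate quoted above: OR 0.976 corresponds to 2.4% lower odds.
print(round(percent_change_in_odds(0.976), 1))  # -2.4
```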

There was little evidence that reproductive traits in women caused participation, with the exception of age at menarche. For example, a one-year older age at menarche was associated with higher odds of completing the MHQ (OR: 1.07 (95% CI: 1.03, 1.11)) and physical activity monitoring (OR: 1.07 (95% CI: 1.03, 1.12)) (Supplementary Data 3).

Generally, results were consistent when analysed in men and women separately (Supplementary Data 3), with the exception of BMI and physical activity participation, where evidence suggested higher BMI only caused lower odds of participation in women (ORwomen: 0.88 (95% CI: 0.81, 0.96), ORmen: 1.01 (95% CI: 0.92, 1.12), Pinteraction = 0.07).

Two-sample MR methods that are more robust to pleiotropy generally provided similar results (Supplementary Data 3).

Discussion

This study explored the genetic basis of four different participation measures, plus whether or not participants were invited to at least one optional element in the UK Biobank study and used Mendelian randomisation to test the causal role of a broad range of factors in participation.

Some individual characteristics appear to decrease the likelihood of participation in all of the optional invited components of the UK Biobank study (i.e., physical activity monitoring, food frequency and MHQ). These include lower intelligence and educational attainment, higher adiposity and increased liability to ADHD, neuroticism and schizophrenia. Many of these were previously identified in the ALSPAC study1,5, previous UK Biobank study analyses6 and Generation Scotland6. This implies that missingness of all the variables collected in the optional components of UK Biobank will be influenced by these underlying traits. The fourth participation measure considered was the aide-memoire, where participants were asked at baseline to complete a short form with specific data. Our analyses suggest that this measure captures another aspect of behaviour, perhaps reflecting compliance rather than participation, with evidence that a genetic liability to riskier behaviour was inversely associated with completing the aide-memoire.

GWAS identified a number of loci robustly associated with the different participation measures. A number of genome-wide significant loci were shared across the participation traits, suggesting a general role in influencing participation, although further analyses using colocalization methods would be necessary to more formally test whether these shared loci represent the same signal. Many of the variants identified were in or near loci that had previously been identified as associated with intelligence and cognitive function or behaviour-based traits. Alleles associated with higher intelligence or risk aversion were consistently associated with completing the MHQ and more of the FFQ. In the MHQ GWAS, the top signal was in the highly pleiotropic APOE locus. The allele that raises participation in the MHQ (T) is associated with lower odds of Alzheimer’s disease19, heart disease, inflammation and dyslipidaemia26. Further analyses indicated that the APOE-ε4ε4 haplotype carriers were less likely to participate in the MHQ and high genetic liability for Alzheimer’s disease lowered odds of participation in the FFQ, physical activity monitoring and the MHQ.

In addition to performing GWAS of our four participation measures, we also performed a GWAS of invitation to participate in at least one of the three optional components. Because only those invited can participate, the fact that not everyone is invited could result in collider bias in our analysis of participation27. A factor that is positively associated with both being invited and participation is likely to have its association with participation biased towards the null when conditioning on having been invited, assuming that being invited and participating are positively correlated (as demonstrated here) and that there are no interactions (on the probability scale) between the variable and others that also influence invitation/participation. Here, we demonstrated that some variants were associated with higher odds of both being invited to participate and completing the MHQ. This suggests that here conditioning on being invited to participate could have resulted in the Mendelian randomisation analyses for these variables being biased towards the null, if they were in truth positively associated with participation. However, if there are non-linearities or interactions in the effects of the risk factors on invitation/participation, then the direction of the bias cannot be predicted. Similarly, a factor that affects being invited, but does not in truth affect participation, could appear to have a positive or negative spurious association with participation, conditional on being invited.

Using genetic correlation analyses, we have demonstrated that these participation issues are not specific to the UK Biobank. Two participation measures from the ALSPAC study1 were strongly correlated with the participation measures derived in the UK Biobank. This fits with a previous study where strong genetic correlations were noted between UK Biobank mental health participation and participation in follow-up in Generation Scotland6. These results suggest that similar genetic factors are driving participation in follow-up and optional components of studies, regardless of study design, recruitment strategies and the population demographics of the study. The similarity of factors affecting participation across different studies is potentially important for comparisons of results between studies—if similar factors cause participation in different studies, then collider bias will have the same impact on the results from each study. Thus, results from different studies would be subject to similar biases, causing replication of results across studies to become meaningless.

These results are important for informing analysis strategies and the likely direction and magnitude of bias due to conditioning only on those who participate. For the participator-only analysis for a given model to be unbiased, it is necessary for the outcome variable to be independent of missingness, given the variables in the analysis model7. Thus, when examining the factors affecting physical activity, all the factors that we have shown here to be related to participation in the physical activity monitoring (BMI, height, education, intelligence, ADHD, age at menarche) should be either included in the analysis model or used in other strategies such as inverse probability weighting (IPW) or multiple imputation (MI). Where selection is related to the underlying concept(s) measured by the optional component, then this concept will be missing not at random and analyses where it is the outcome will likely be biased7. On the other hand, a participator-only analysis of a model that involves only characteristics that are unrelated to participation will not be biased by conditioning on participation.
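As a rough illustration of inverse probability weighting, the sketch below compares a participator-only mean with an IPW mean in a toy setting. Everything here is assumed for illustration: the two strata, their outcome values and their participation probabilities are hypothetical, and the probabilities are treated as known, whereas in practice they must be estimated from a correctly specified selection model.

```python
from typing import Sequence


def ipw_mean(outcomes: Sequence[float],
             participation_probs: Sequence[float]) -> float:
    """Mean of `outcomes` among participants, weighting each participant by
    the inverse of their (assumed known) probability of participating."""
    weights = [1.0 / p for p in participation_probs]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


# Hypothetical source population: two equally sized strata of 1,000 people.
#   Stratum A: outcome 10.0, participation probability 0.50 -> 500 participants
#   Stratum B: outcome  6.0, participation probability 0.25 -> 250 participants
# The true population mean is therefore 8.0.
participants = [(10.0, 0.50)] * 500 + [(6.0, 0.25)] * 250
outcomes = [y for y, _ in participants]
probs = [p for _, p in participants]

naive = sum(outcomes) / len(outcomes)  # over-represents stratum A
weighted = ipw_mean(outcomes, probs)   # re-balances the two strata

print(f"participator-only mean: {naive:.2f}")
print(f"IPW mean: {weighted:.2f}")
```

The weighted mean recovers the population value only because the participation probabilities are correct here; with a misspecified or incomplete selection model the weighted estimate can remain biased.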

Selection of the type demonstrated here may cause bias in estimates of effects, and the size and direction of bias cannot (usually) be exactly determined. Previous work has shown that estimated effects of risk factors on mortality and cause-specific mortality differ between UK Biobank and the less-selected Health Survey for England—with some being moved towards, and some away from the null. This could imply that selection into UK Biobank is causing bias in estimating these effects. For example, we have shown that smoking is negatively associated with participation in UK Biobank. If a factor that causes lung cancer is also negatively associated with participation (e.g., socioeconomic position), then selecting on participating in UK Biobank would induce a negative association between smoking and lung cancer (assuming an additive model), which would bias the estimated effect of smoking on lung cancer towards the null. This is indeed what is seen in the comparison of estimates for this effect between UK Biobank and HSE-SHS9. It should be noted that this simple estimate of the direction of bias depends on assumptions about the underlying selection model, and cannot be verified with only UK Biobank data—e.g., an interaction between smoking and socioeconomic position in their effect on participation could change the size and direction of any bias. We have similarly shown previously that some effect estimates were different when calculated on only those continuing to participate in ALSPAC, compared to all those participating at baseline1. It has also been suggested that selection bias may (at least in part) be responsible for overestimates of the protective effect of moderate alcohol consumption28,29.
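The dependence of the induced association on the selection model can be shown with a small exact calculation. In the sketch below, two binary factors (labelled as smoking and low socioeconomic position purely for illustration) are independent in the source population, and both lower the probability of participating. All probabilities are hypothetical. Because the factors are independent, their marginal prevalences cancel, and the odds ratio between them among participants depends only on the four participation probabilities.

```python
from typing import Callable


def induced_odds_ratio(p_select: Callable[[int, int], float]) -> float:
    """Odds ratio between two binary factors among participants, given that
    the factors are independent in the source population.

    p_select(x, u) is the probability of participating for factor values x, u.
    """
    return (p_select(1, 1) * p_select(0, 0)) / (p_select(1, 0) * p_select(0, 1))


def additive(x: int, u: int) -> float:
    # Each factor lowers the participation probability by 0.2 (additive on
    # the probability scale).
    return 0.8 - 0.2 * x - 0.2 * u


def multiplicative(x: int, u: int) -> float:
    # Each factor multiplies the participation probability by 0.75.
    return 0.8 * (0.75 ** x) * (0.75 ** u)


print(f"additive selection model: OR = {induced_odds_ratio(additive):.3f}")
print(f"multiplicative selection model: OR = {induced_odds_ratio(multiplicative):.3f}")
```

Under the additive model the two factors become negatively associated among participants (odds ratio below 1), whereas under the multiplicative model no association is induced, which is why the direction of bias cannot be read off without assumptions about the selection model.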

Strategies to investigate or minimise the impact of selection on a given estimate depend on the data available on the population not selected into the study. Inverse probability weighting (IPW) has been suggested to overcome mortality bias30, but the validity of this depends on correctly specifying the selection model. If there is an unmeasured factor that affects selection and is related to the variables in the analysis model, then this may mean that inverse probability weighting is not unbiased31. IPW as a solution also depends on having data on all the variables affecting selection and their distribution in the population in which we wish to make the inference. Solutions using IPW to infer bounds on estimates have been proposed, although these can result in wide bounds, or depend on underlying assumptions about associations of unmeasured factors with selection32,33. Over-sampling of under-represented subgroups of the population is used, for example in the Millennium Cohort Study34. However, this solution will only remove bias due to selection into those specified subgroups (not any other selection bias). In addition, if the selection in those subgroups now differs according to other factors (e.g., the participators from the hard-to-reach groups are comparatively healthier than those in the easier-to-recruit group), then new biases may be introduced.

A key advantage to the genetic analyses presented here over the observational analyses usually reported (and reported here in Table 1) is the ability to draw conclusions about causality (under the usual assumptions of MR, in particular, the assumptions around horizontal pleiotropy). For example, smoking is related to participation in the aide-memoire observationally (Table 1) but may be due to confounding as there is little evidence of an association using genetic variants associated with smoking (Supplementary Table S8). This information about causality may be useful to inform strategies to improve participation—for example, if smoking caused participation then qualitative work could be done to find out why smokers were less prone to participate, and then to address this in recruitment/retention strategies. However, if actually the association between smoking and participation is driven by (for example) socioeconomic position, and had nothing to do with their actual smoking, then targeting only smokers could be counter-productive. A strategy based only on improving participation in smokers could even induce more bias, in that interaction between socioeconomic position and smoking in their effect on participation might be induced.

There are a number of limitations to this analysis. First, our analysis sample was based on Europeans only in the UK Biobank sample. The UK Biobank is not population-representative and therefore these findings may not be generalisable to other population studies. Second, email access was only available at baseline and therefore this might not accurately reflect access to email at the release of the various optional components. Third, it is possible that some participants died before being able to participate in some of the optional components; however, this number will likely be small. Fourth, factors relating to participation may change with age. However, we saw strong genetic correlations between our UK Biobank participation measures and the ALSPAC measures. Fifth, the predictors used in MR were selected a priori and it is possible we have missed some key predictors of participation. Finally, for MR we assume that the genetic variants used as an instrumental variable affect the outcome only through their effect on the exposure (i.e., the absence of horizontal pleiotropy). Our sensitivity analyses using MR-Egger and Median MR, which are more robust to horizontal pleiotropy, were generally consistent, although they often had much wider confidence intervals that crossed the null.

In summary, we demonstrate that genetic variants are associated with participation in several aspects of the UK Biobank study and that a wide range of traits cause differences in participation. This builds on previous work in the ALSPAC study, and here we demonstrate strong genetic correlations between the UK Biobank participation measures and ALSPAC, highlighting that these issues are likely to be seen in many studies. Our findings highlight the potential for introducing bias into both genetic and non-genetic analyses. All studies need to consider the importance of selection bias and use sensitivity analyses to assess the robustness of their conclusions.

Methods

UK Biobank

This study was conducted using the UK Biobank resource, which has ethical approval and its own ethics committee (https://www.ukbiobank.ac.uk/ethics/). Details of the patient and public involvement in the UK Biobank are available online (www.ukbiobank.ac.uk/about-biobank-uk/ and https://www.ukbiobank.ac.uk/wp-content/uploads/2011/07/Summary-EGF-consultation.pdf?phpMyAdmin=trmKQlYdjjnQIgJ%2CfAzikMhEnx6). No patients were specifically involved in setting the research question or the outcome measures, nor were they involved in developing plans for recruitment, design, or implementation of this study. No patients were asked to advise on interpretation or writing up of results. There are no specific plans to disseminate the results of the research to study participants, but the UK Biobank disseminates key findings from projects on its website.

The UK Biobank study recruited over 500,000 individuals aged between 37 and 73 years (with >99.5% aged between 40 and 70 years) from across the UK between 2006 and 2010. The UK Biobank35,36 collected extensive phenotypic and genotypic data on all participants. Here, we used data on up to 451,036 UK Biobank individuals who were defined as being of European ancestry using principal component analyses. Briefly, we generated principal components in the 1000 Genomes cohort, using high-confidence SNPs to obtain their individual loadings. The loadings were then used to project all of the UK Biobank samples into the same principal component space. The individuals were then clustered using principal components 1 to 4.
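The projection step described above can be illustrated with a minimal sketch. It assumes allele dosages coded 0, 1 or 2 and uses a plain singular value decomposition; the function names are hypothetical and this is not the pipeline used in the study. The clustering step is omitted because the clustering algorithm is not specified here.

```python
import numpy as np

def reference_pc_loadings(ref_genotypes, n_components=4):
    """Return per-SNP means and PC loadings estimated in a reference panel.

    ref_genotypes: (n_reference_samples, n_snps) array of allele dosages.
    """
    ref = np.asarray(ref_genotypes, dtype=float)
    snp_means = ref.mean(axis=0)
    centred = ref - snp_means
    # Rows of vt are the principal axes (SNP loadings), ordered by variance explained.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return snp_means, vt[:n_components].T  # loadings: (n_snps, n_components)

def project_samples(genotypes, snp_means, loadings):
    """Project new samples into the reference PC space using the saved loadings."""
    return (np.asarray(genotypes, dtype=float) - snp_means) @ loadings
```

Centring the new samples on the reference means, rather than their own, is what keeps both cohorts in the same coordinate system.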

Participation measures

Four participation phenotypes were derived in the UK Biobank:

  1.

    Percentage of food frequency questionnaires (FFQ) completed, based on the number of invites (data field 110002, 0–4) and the number of acceptances (data field 110001, 0–4). A binary variable was also created, coded 0 for participants who were sent a food frequency questionnaire but never accepted and 1 for participants who were sent a food frequency questionnaire and completed at least one. This variable is based on the online requests, which were sent out every 3–4 months, a total of four times between February 2011 and June 2012, to participants who provided an email address at recruitment37.

  2.

    Participation in physical activity monitoring, a binary variable defined using data fields 110005 and 110006. 0 represents invited but not accepted and 1 represents invited and accepted. Between February 2013 and December 2015, a random sample of participants with a valid email was invited to wear the accelerometer. Participants from the North West region were excluded due to participant burden concerns38.

  3.

    Participation in a mental health questionnaire (MHQ), a binary variable defined using data fields 20400 and 20005. 0 represents invited but not accepted and 1 represents invited and accepted. Participants with a valid email were invited to complete the MHQ. The UK Biobank’s contact approach was to (a) send an initial invitation email, (b) send a reminder email to non-responders (2 weeks after the initial invite), (c) send a reminder to partial responders (2 weeks after they started the questionnaire) and (d) send a last-chance invitation after 4 months.

  4.

    Aide-memoire completed, a binary variable derived from data field 111, which represents compliance with a request from the UK Biobank, made prior to attending the assessment centre, to fill out specific information to help with the questionnaire. 0 represents non-compliance and 1 represents compliance.

With the exception of the aide-memoire, which was requested of everyone, the remaining variables relied on UK Biobank participants being invited to take part. The general UK Biobank protocol was to invite everyone to participate in the optional questionnaires and surveys, although, as detailed above, these invitations were generally sent via email. To investigate the impact of this strategy, we also created a variable to represent whether participants were invited to participate in at least one of the optional surveys above (coded as 1) or not (coded as 0).
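As an illustration of how the FFQ and invitation variables above could be coded, the following is a minimal sketch. The function and argument names are hypothetical, and the handling of participants with zero invites (returned as missing) is an assumption rather than a documented rule.

```python
def ffq_participation(n_invites, n_acceptances):
    """Return (percentage of FFQs completed, binary ever-completed flag).

    Participants who were never invited have no defined participation
    value, so both outputs are None (an assumption for this sketch).
    """
    if n_invites == 0:
        return None, None
    percentage = 100.0 * n_acceptances / n_invites
    ever_completed = int(n_acceptances >= 1)
    return percentage, ever_completed

def invited_to_any(ffq_invited, accelerometer_invited, mhq_invited):
    """Code 1 if invited to at least one optional survey, otherwise 0."""
    return int(ffq_invited or accelerometer_invited or mhq_invited)
```

For example, a participant invited four times who accepted once would be coded as 25% with a binary value of 1.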

Genotypes

We used imputed genotypes available from the UK Biobank for association analyses39. Variants were excluded if imputation quality (INFO) was <0.3 or the minor allele frequency (MAF) was <0.1%. This quality control process resulted in 6,930,712 variants for association analyses. Lead SNPs were defined as those with the smallest P value and locus boundaries were defined using a ±0.5 Mb distance from the lead SNP.
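The variant filter and locus definition above can be sketched as follows. The thresholds are those stated in the text; the greedy procedure (take the variant with the smallest P value as lead, assign everything within ±0.5 Mb to that locus, then repeat) is an assumption about how the rule might be applied, and the function names are hypothetical.

```python
def passes_qc(info, maf, min_info=0.3, min_maf=0.001):
    """Keep a variant unless INFO < 0.3 or MAF < 0.1%."""
    return info >= min_info and maf >= min_maf

def define_loci(hits, window_bp=500_000):
    """Group significant variants into loci around lead SNPs.

    hits: list of (chromosome, position, p_value) tuples.
    Returns one dict per locus with the lead SNP and locus boundaries.
    """
    remaining = sorted(hits, key=lambda hit: hit[2])  # smallest P first
    loci = []
    while remaining:
        lead = remaining[0]
        chrom, pos, _ = lead
        loci.append({"lead": lead,
                     "start": max(0, pos - window_bp),
                     "end": pos + window_bp})
        # Drop the lead and every variant inside its window on the same chromosome.
        remaining = [hit for hit in remaining
                     if not (hit[0] == chrom and abs(hit[1] - pos) <= window_bp)]
    return loci
```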

Observational associations

Logistic regression analyses were used to explore the relationship between participant demographics and the four participation measures, plus the invitation measure. The Pearson correlations and overlap between the four participation measures were also investigated. Chi-squared analyses were used to explore the overlap of the binary participation measures.

Genome-wide association analysis

All individual variant association testing was performed using BOLT-LMM40 v2.3. This software applies a linear mixed model (LMM) to adjust for population structure and individual relatedness. From the ~805,000 directly-genotyped (non-imputed) variants available, we identified 524,307 good-quality variants (bi-allelic SNPs; MAF ≥ 1%; HWE P > 1 × 10⁻⁶; non-missing in all genotype batches, total missingness <1.5% and not in a region of long-range LD) which BOLT-LMM used to build its relatedness model. A number of covariates (age, sex, UK Biobank assessment centre and genotyping platform (categorical; UKBiLEVE array, UKB Axiom array interim release and UKB Axiom array full release)) were included at runtime. Here in the main paper, we only report variants that reached a stringent P < 6 × 10⁻⁹ cut-off based on simulations41. The results from the GWAS of receiving an invite to at least one of the three optional surveys are also reported.

Genetic correlations

We used a method based on LD score regression42 as implemented in the LD Hub software43, available at http://ldsc.broadinstitute.org/ldhub/, to quantify the genetic overlap between the four participation traits and 832 traits with publicly available GWA data. This method uses the cross-products of summary test statistics from two GWASs and regresses them against a measure of how much variation each SNP tags (its LD score). Variants with high LD scores are more likely to tag true signals and thus provide a greater chance of overlap with genuine signals between GWASs. Correlations were reported if they reached a Bonferroni-corrected P value threshold (number of tests = 3220; P < 1.5 × 10⁻⁵).
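The core regression can be sketched in a few lines. This is a deliberately simplified, unweighted version: it regresses the per-SNP cross-product of z-scores on the LD score and rescales the slope to a genetic covariance using the standard cross-trait relationship, slope = sqrt(N1 × N2) × genetic covariance / M. The published method additionally uses regression weights and a block jackknife for standard errors, neither of which is shown, and the function names are hypothetical.

```python
import numpy as np

def cross_trait_ld_regression(z1, z2, ld_scores):
    """Unweighted OLS of z1 * z2 on LD score; returns (intercept, slope)."""
    product = np.asarray(z1, dtype=float) * np.asarray(z2, dtype=float)
    ld = np.asarray(ld_scores, dtype=float)
    design = np.column_stack([np.ones_like(ld), ld])
    coef, *_ = np.linalg.lstsq(design, product, rcond=None)
    return coef[0], coef[1]

def genetic_covariance(slope, n1, n2, n_snps):
    """Rescale the regression slope to a genetic covariance."""
    return slope * n_snps / np.sqrt(n1 * n2)
```

Dividing the genetic covariance by the square root of the product of the two SNP heritabilities would give the genetic correlation; the intercept absorbs shared signal that does not scale with LD, such as sample overlap.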

We also used LD score regression to explore the genetic correlation between our participation measures and those available in the ALSPAC study1. The LD score regression method uses summary statistics from the GWAS meta-analysis of the four participation measures in UK Biobank and the participation measures of ALSPAC mothers and children, calculates the cross-product of test statistics at each SNP, and then regresses the cross-product on the LD score.

Finally, we utilised LD score regression to explore the genetic correlation between not receiving an invite to participate in the various optional components and the four participation measures.

Genome-wide genetic correlations do not provide evidence of causality, which we tested with Mendelian randomisation using specific sets of variants. Instead, they likely represent a complex mixture of direct and indirect causal associations in both directions, pleiotropy and residual stratification. These likely properties mean that genome-wide genetic correlations provide a way of projecting a phenotype measured in one study into another to test between-study consistency (e.g., the ALSPAC versus UK Biobank comparison). When comparing different traits within one study, they potentially provide a measure of correlation that is more representative of biological processes than observational correlations, although we note that observational correlations were usually very similar to genetic correlations.

Mendelian randomisation

We undertook two-sample MR analyses to further test the causal relationships between 80 exposure traits (decided a priori on the grounds that they are common exposures and used in current MR pipelines) (Supplementary Table 8) and the four different participation outcomes. The predictors were classified into nine broad categories (Supplementary Table 4).

The two-sample MR analyses used summary-level data from the BOLT-LMM GWAS of the participation traits. Known SNPs for each exposure trait (Supplementary Table 4) were extracted from the GWAS results to estimate the SNP-outcome associations, whilst published coefficients from the primary GWAS were used for the SNP-exposure associations. Four two-sample MR methods were performed using a custom pipeline: inverse-variance weighting (IVW), MR-Egger24, weighted median (WM)25 and penalised weighted median (PWM)25. We have presented the IVW approach as our main analysis method, with MR-Egger, WM and PWM representing sensitivity analyses to account for unidentified pleiotropy, which may bias our results. Horizontal pleiotropy occurs when the genetic variants related to the exposure of interest independently influence the outcome. IVW assumes there is either no horizontal pleiotropy under a fixed-effect model or, if using a random-effects model after detecting heterogeneity amongst the causal estimates, that the strength of the association between the genetic instruments and the exposure is not correlated with the magnitude of the pleiotropic effects (the InSIDE assumption) and that the pleiotropic effects have an average value of zero. MR-Egger estimates and adjusts for non-zero mean pleiotropy and therefore provides unbiased estimates if just the InSIDE assumption holds24.
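To make the difference between IVW and MR-Egger concrete, the sketch below computes both point estimates from summary statistics. It is not the custom pipeline referred to above: the function names are hypothetical, weights are taken as the inverse variance of the SNP-outcome estimates, standard errors are not computed, and the median-based estimators are omitted.

```python
import numpy as np

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW: weighted regression through the origin."""
    bx, by, se = (np.asarray(a, dtype=float)
                  for a in (beta_exposure, beta_outcome, se_outcome))
    weights = 1.0 / se**2
    return np.sum(weights * bx * by) / np.sum(weights * bx**2)

def mr_egger(beta_exposure, beta_outcome, se_outcome):
    """MR-Egger: the same weighted regression with a free intercept.

    Returns (intercept, slope). The intercept estimates the average
    pleiotropic effect; the slope is the causal estimate.
    """
    bx, by, se = (np.asarray(a, dtype=float)
                  for a in (beta_exposure, beta_outcome, se_outcome))
    # Orient every SNP so that its exposure effect is positive.
    sign = np.where(bx < 0, -1.0, 1.0)
    bx, by = bx * sign, by * sign
    root_w = 1.0 / se  # square root of the inverse-variance weights
    design = np.column_stack([np.ones_like(bx), bx]) * root_w[:, None]
    coef, *_ = np.linalg.lstsq(design, by * root_w, rcond=None)
    return coef[0], coef[1]
```

When the pleiotropic effects average to zero, the Egger intercept is close to zero and the two slopes agree; a non-zero intercept signals directional pleiotropy, at the cost of wider confidence intervals for the slope.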

To explore the role of smoking heaviness on participation in the different smoking strata, we performed one-sample MR in the unrelated subset of Europeans in the UK Biobank. We performed analyses in all individuals and stratified by smoking status into never, former, current and ever smokers. For our binary participation measures, we used a two-stage approach. First, we regressed cigarettes per day on the smoking genetic risk score (GRS) and saved the predicted values and residuals from this model. Second, the predicted values from stage 1 were used as the independent variable and the participation measures as the dependent variable in a logistic regression model. As the FFQ participation measure was continuous, we utilised the ivreg2 command in Stata.
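The two-stage procedure for a binary outcome can be sketched as below. This is an illustration only: the analyses were run in Stata or R, the function names are hypothetical, covariates are omitted, and the second-stage standard errors (not computed here) would not account for uncertainty in the first stage.

```python
import numpy as np

def ols_fit(x, y):
    """Ordinary least squares with an intercept; returns (intercept, slope)."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def logistic_fit(x, y, n_iter=50):
    """Logistic regression with an intercept, fitted by Newton-Raphson."""
    design = np.column_stack([np.ones(len(x)), x])
    coef = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ coef))
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * (p * (1.0 - p))[:, None])
        coef = coef + np.linalg.solve(hessian, gradient)
    return coef

def two_stage_logistic(grs, exposure, outcome):
    """Stage 1: exposure on GRS. Stage 2: outcome on predicted exposure.

    Returns the log odds ratio per unit of genetically predicted exposure.
    """
    grs, exposure, outcome = (np.asarray(a, dtype=float)
                              for a in (grs, exposure, outcome))
    intercept, slope = ols_fit(grs, exposure)
    predicted = intercept + slope * grs
    return logistic_fit(predicted, outcome)[1]
```

Running the same function within each smoking stratum would mirror the stratified analysis; an association among never smokers, where the score cannot act through cigarettes per day, would point towards pleiotropy rather than a causal effect of smoking heaviness.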

All analyses were performed in Stata version 14 or R version 3.5.0.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.