Main

For all sample surveys, ascertainment bias, meaning that the sample is not representative of the target population, could lead to seriously misleading conclusions1,2. By its very nature, ascertainment bias usually cannot be evaluated based on the sample alone3. Typically, other variables (covariates) that have known distributions for both sample and population are needed for adjustments1,3. Such adjustments are inherently imperfect as the covariates are unlikely to fully capture the underlying bias1,3. For genetic studies, among participants of the primary study who have contributed DNA, further engagement in optional components of the study has been demonstrated to have associations with genotypes and phenotypes4,5,6,7. That, however, does not address the genotypic difference between the primary study participants and the target population. Thus, it is striking that one can investigate how the sampled genotypes are biased based on themselves alone. A recent study identified single nucleotide polymorphisms (SNPs) that had significant allele frequency differences between the sampled males and females, and proposed that those variants have differential participation effects for the sexes8. This approach, however, cannot identify variants that affect primary study participation of both sexes in a similar manner. Here, we show how to do so.

Results

Three allele comparisons

All individuals are genetically related to some degree. Furthermore, each individual has two copies of genetic segments on autosomal chromosomes, and some of these segments are identical by descent (IBD), that is inherited from a recent common ancestor, with genetic segments in a relative. Instead of comparing individuals, we compare genetic segments. The key idea is that an allele that has higher frequency in participants than nonparticipants would also have higher frequency in segments that are in two participants than in segments that are in only one. Following this observation, we present below three principles of genetic induced participation bias, and show how to use only the sampled genetic data to perform genome-wide association scans (GWAS) for study participation that capture only direct genetic effects9,10, and are unaffected by population stratification11.

First principle of genetic induced ascertainment bias

On average, between two ascertained individuals, genetic segments shared IBD, relative to segments that are not, are enriched with alleles that have positive direct effects on ascertainment probability. Figure 1a illustrates this general principle. With a large sample of individuals of the same ancestry, at a specific SNP locus, many pairs of individuals would share one long haplotype inherited IBD from a not-very-distant common ancestor. Each of such pairs has one distinct shared haplotype, and two distinct not-shared haplotypes. The SNP allele that promotes participation would tend to have a higher frequency in the shared than the not-shared haplotypes. The shared and not-shared alleles are considered as cases and controls, respectively, and matched as they are in the same individuals. Still, that does not remove potential confounding entirely as haplotypes driven to higher frequency through natural selection would also be shared by more individuals. Ascertainment bias is a form of selection and, to cleanly distinguish it from other forms of selection, requires more stringent matching of shared and not-shared haplotypes. That is achieved by using ascertained parent–offspring and sibling pairs (sib-pairs).

Fig. 1: The first principle of genetic induced participation bias: comparing shared and not-shared alleles.
figure 1

a, General relative pairs sharing one allele IBD. b, Parent–offspring pairs. c, Sib-pairs sharing one allele IBD (IBD1). d, Sib-pairs sharing two alleles or no allele IBD (IBD2 and IBD0). Different shading denotes segments that are distinct by descent.

Ascertained parent–offspring pairs

For an ascertained parent–offspring pair (Fig. 1b) where the other parent is not ascertained, there are three distinct alleles by descent for a given SNP: the allele transmitted from parent to offspring (T), the parental allele not transmitted to the offspring (NT) and the allele inherited by the offspring from the other parent (O). The T allele is shared, and the other two are not-shared. The NT and T alleles are perfectly matched as both are in the ascertained parent. Mendelian inheritance dictates that each would have the same chance to be transmitted to the offspring to become the shared allele. The principle is similar to that underlying the transmission disequilibrium test12. The O allele is not as perfectly matched and is currently not used. For the ascertained parent–offspring pairs, with alleles coded as 0/1, and FT and FNT denoting the frequency of 1s among the T and NT alleles respectively, the difference

$${F}_{{\mathrm{T}}}-{F}_{{{\mathrm{NT}}}}$$
(1)

can be used to test for association between SNP and ascertainment (Methods). We call this the transmitted–nontransmitted comparison (TNTC). When both parents are ascertained, that can be treated as two parent–offspring pairs as the transmissions from the two parents are independent without a participation effect.

Ascertained sib-pairs, parents not ascertained

At a specific locus, assuming random Mendelian transmission, a sib-pair has probability 1/2 of inheriting the same allele IBD from the father, and the same independently from the mother. Consequentially, a sib-pair shares 2, 1 or 0 alleles IBD with probabilities 1/4, 1/2 and 1/4, respectively. With dense SNPs, the IBD state of a locus can usually be determined with high accuracy. For a specific SNP, based on the sib-pairs with known IBD states, two comparisons are made.

For each sib-pair that has IBD state 1 for a given SNP (Fig. 1c), there are one distinct shared allele (S) and two distinct not-shared alleles (NS1 and NS2). Let FIBD1S and FIBD1NS denote the frequency of allele 1 among the S alleles and the NS (NS1 and NS2 combined) alleles, respectively. The difference

$${F}_{{{\mathrm{IBD}}}{\mathrm{1S}}}-{F}_{{{\mathrm{IBD}}}1{{\mathrm{NS}}}}$$
(2)

is the within-sib-pair comparison (WSPC). For any such pair, if the S allele is paternally inherited, then the two NS alleles are maternally inherited, and vice versa. Despite that, conditional on the genotypes of the two parents, without ascertainment bias, the frequency difference has expectation 0. Assuming random transmissions, the shared allele is equally likely to be paternal or maternal (Extended Data Fig. 1). Thus, any systematic difference between fathers and mothers cancels in expectation.

For the sib-pairs sharing two alleles IBD at a locus, let \({F}_{{{\mathrm{IBD}}}2}\) be the frequency of allele 1. Similarly, let \({F}_{{{\mathrm{IBD}}}0}\) be the allele 1 frequency among sib-pairs that share no allele IBD. The difference (Fig. 1d)

$${F}_{{{\mathrm{IBD}}}2}-{F}_{{{\mathrm{IBD}}}0}$$
(3)

is the between-sib-pairs comparison (BSPC). This compares different sib-pairs and does not require splitting genotypes into shared and not-shared alleles. For a sib-pair at a particular locus, the chance to be in IBD state 0 or 2 is the same and, thus, without ascertainment bias, the frequency difference has expectation zero. In general, a sib-pair would be in IBD state 2 for some regions, and in IBD state 0 (or 1) in other regions.

Results from the three comparisons introduced can be combined, for testing, estimation or prediction purposes. However, having the individual results separately is important because they could capture different effects depending on the nature of the participation bias. Most importantly, they are impacted differently by genotyping and data-processing errors.

Analyses of UK Biobank data

The UK Biobank (UKBB) is a large-scale database with genetic and phenotypic information of individuals from across the UK13. Individuals aged between 40 and 69 years and living within a 25-mile radius of any of the 22 UKBB assessment centers were invited to participate14. Among the 9,238,453 invited, 5.5% (~500,000 individuals) participated and went through baseline assessments that took place from 2006 to 2010 (ref. 14). In addition to phenotypic details collected at the baseline visit, information continued to be added, including follow-up studies for large subsets of the cohort14,15,16. It is known that the UKBB sample is not fully representative of the UK population13,14,17. Participants were more likely to be female, less likely to smoke, older, taller, had lower body mass index (BMI)14 and more educated17.

We applied our methods to 4,427 parent–offspring pairs and 16,668 sib-pairs with White British descent in UKBB (Supplementary Note section 1; Methods). To start, association analysis was performed for 658,565 SNPs available in the UKBB phased haplotype data13. For each SNP, t-statistics for TNTC, WSPC and BSPC were computed by dividing each of the frequency difference, equations (1) to (3), by its standard error (s.e.) (Supplementary Note section 2; Methods). When results from all the SNPs were examined together, we observed a tendency for the major allele to be positively associated with participation. Investigations supported by extensive simulations revealed three causes to this bias: (a) IBD-calling errors, (b) phasing errors, (c) genotyping errors (Methods). (a) affects the processing of sib-pair data when the IBD sharing status (0, 1 or 2) of every SNP is ‘called’ and impacts WSPC and BSPC. The problem with (a) was mostly eliminated when we replaced KING18, a ‘general’ IBD-calling program, by snipar9, specifically designed for sib-pairs (Extended Data Fig. 2), and trimmed away 250 SNPs from each end of an IBD segment (contiguous SNPs with the same IBD status called; Extended Data Fig. 3; Methods). (b) and (c) affect TNTC and WSPC, which require splitting genotypes into shared and not-shared alleles; (b) enters when the relatives are both heterozygous and phasing with neighboring SNPs is required to determine the shared allele9 (Extended Data Fig. 4). Through theoretical calculations and simulations, we found that under most scenarios, (b) and (c) would contribute to the observed major-allele bias (Supplementary Note section 3; Methods). For variants with minor allele frequency (MAF) > 0.10, we estimated that around 50% of the observed major-allele bias can be explained by phasing errors, while genotyping errors play a greater role for low frequency variants (Extended Data Figs. 5 and 6; Methods). As (b) and (c) do not impact BSPC and with (a) addressed, results based on BSPC can be used in a straightforward manner. When LD score regression (LDSC)19 is applied to the \({\chi }^{2}\) statistics computed from the BSPC GWAS, the fitted intercept is nearly exactly 1 (0.9998), indicating that the \({\chi }^{2}\) statistics are neither inflated because of data artefacts, nor are they affected by issues such as population stratification. For TNTC and WSPC, the major-allele bias and related problems affect different analyses differently, as illustrated by analyses below and information in Methods and Supplementary Note section 3. For most SNPs individually, the bias induced by (b) and (c) is small and apparent only when analyzed as a group. However, in the major histocompatibility complex (MHC), a difficult region with extended linkage disequilibrium (LD), 17 SNPs have \(P < 5\times {10}^{-8}\) with WSPC. These are clearly data artefacts as none is even nominally significant with BSPC (all P values presented are two-sided). We discarded SNPs from the MHC and other extended LD regions20 and applied other quality control filtering (Supplementary Note section 4; Methods), leaving 500,632 remaining SNPs with MAF > 1% in our participation genome scans. To handle the data-error induced biases separately for TNTC and WSPC, a two-step allele frequency adjustment was applied, which eliminated the correlation with allele frequency and shrank the t-statistics towards zero (Extended Data Fig. 7; Methods). Figure 2 summarizes the steps underlying the procedure used to test SNPs for association with participation.

Fig. 2: Flowchart summarizing the procedure implemented to test genetic variants for association with primary participation using genotypes of participants only.
figure 2

Core steps are shown as yellow rectangles. The procedure involves dividing the data into different data groups, shown here as blue rectangles. Gray rectangles show additional quality control steps, implemented to reduce or adjust for genotyping and data-processing errors. Pink rectangles show within-study validation steps.

GWAS based on BSPC does not give any genome-wide significant SNP, which is unsurprising given that the power is not high. Combining the association results from all three comparisons, one SNP, rs113001936 on chromosome 16 (\(P=3.4\times {10}^{-9}\)), passed the genome-wide significant threshold. However, since the positively associated allele has a frequency of 0.988, and the significance is driven mainly by WSPC (\(P=0.048\) for BSPC), we consider this SNP as only suggestive. To validate and to provide a proper understanding of the results, we constructed three separate participation polygenic scores (pPGSs) whose SNP-weights are based on these three sets of t-statistics (Methods). Values of each of the three pPGSs, standardized to have variance 1, were computed for 272,409 White British individuals without relatives in UKBB of third degree or closer, referred to as the ‘unrelateds’. Associations between each of the three pPGSs and various quantitative traits were examined using the unrelateds (Supplementary Note section 5; Methods). Table 1 shows some of the strongest associations plus a few nonsignificant ones for reference. The BSPC pPGS and the WSPC pPGS are supposed to estimate essentially the same effects with comparable power (see simulation results below). Thus, it is comforting that their associations with the various traits are generally compatible. With multiple-comparison adjustment, the difference between the BSPC and the WSPC pPGS associations is not significant for any of the traits. Considering that the TNTC pPGS is based on a smaller sample size, its association results are also in general agreement. This shows that the data errors impacting TNTC and WSPC have only a small effect on polygenic score prediction. Indeed, even without adjustments, the TNTC and WSPC pPGSs give associations (Supplementary Table 1) similar to those in Table 1.

Table 1 Primary pPGSs association with phenotypes

We constructed a combined pPGS using the results from all three comparisons (Methods). Its strongest association is with educational attainment (EA) where the effect (in standard deviation (s.d.) units) is 0.0309 with \(P=3.9\times {10}^{-53}\). The effect, 0.0300, is nearly as strong for age-at-first-birth (AFB) of women (\(P=8.6\times {10}^{-21}\)). The next strongest association, notably negative, is with BMI (\(P=4.0\times {10}^{-22}\)). These results, consistent with the known differences in EA and BMI between the UKBB sample and population14,17, validated that our GWAS performed without phenotype data can nonetheless capture genetic associations with participation.

Some UKBB participants were invited to answer a dietary study questionnaire in 2011–2012 (ref. 16), and some were invited to a physical activity study in 2013–2015 that required wearing an accelerometer for a week15. Not everybody was invited (criteria included having a valid email address) and only a subset of those invited actually participated. We call participation in these follow-up studies ‘secondary participation’. For the dietary study, the estimated effect of the combined pPGS on being invited, in loge odds-ratio \(\left(\right.{\rm{log }}\left({{\mathrm{OR}}}\right)\)), is 0.0342 (\(P=4.8\times {10}^{-16}\); Table 1). For those invited, the estimated effect on actual participation in \({\rm{log }}({{\mathrm{OR}}})\) is 0.0272 (\(P=2.5\times {10}^{-7}\)). For the physical activity study, the corresponding estimates are 0.0277 (\(P=1.1\times {10}^{-12}\)) and 0.0300 (\(P=6.4\times {10}^{-8}\)).

Associations with the combined pPGS were further examined adjusting for EA (Table 1). For traits/variables with \(P < 1\times {10}^{-3}\) originally, the adjusted effects shrink but all remain significant. Relatively, the effect on participating in the physical activity study when invited shrinks the least, by 18%, from 0.0300 to 0.0246, while the effect on dietary study invitation shrinks by 38%, from 0.0342 to 0.0213. This difference in shrinkage is partly because the estimated effect of EA on dietary study invitation is 0.4661, much larger than its effect on physical activity study participation, 0.1826. From the EA effects alone, the ascertainment bias would seem to be much stronger with dietary study invitation. By contrast, the genetic component of the ascertainment bias for physical study participation is stronger than that for dietary study invitation after EA adjustment. Thus, while the participation genetic component is associated with EA, its effect on other traits is not manifested mainly through EA. Importantly, phenotypes known to correlate with participation do not fully capture the nature and magnitude of the ascertainment bias. When males and females are analyzed separately for the traits/variables in Table 1 (Fig. 3), no significant difference is found for the pPGS effects with multiple-comparison adjustment. Furthermore, while UKBB participation rates differ by sex and age14, the combined pPGS is associated with neither \((P > 0.05)\), implying that the effect of the combined pPGS is additive to the effects of sex and age on participation, with no detectable statistical interactions. This, however, does not rule out the existence of variants with strong differential effects for the sexes as the combined pPGS is based on GWASs that are not sex specific and thus not designed to capture such variants.

Fig. 3: Sex-specific analysis of pPGS associations.
figure 3

Centers of error bars correspond to effect estimates in Table 1 for the pPGS based on combined weights; here, the association analyses were performed for male (blue error bars) and female (red error bars) unrelateds separately. Error bars correspond to 95% CI (estimate ±1.96 s.e.). Quantitative phenotypes were standardized to have a variance of 1 in males and females separately (Supplementary Note section 5). For secondary participation traits, the sample size shown equals number of cases plus controls.

Advancing from hypothesis testing to parameter estimation, we note that the UKBB did not recruit families, and participants were all adults providing their own consent13. Under these conditions, for alleles associated with participation, relative frequencies in different groups of individuals and genetic segments depend on many factors, most important of which are overall participation rate, denoted here by \(\alpha\), and the participation rate of close relatives of participants (Fig. 4 and Extended Data Fig. 8). For sib-pairs, the participation rate of an individual given that its sibling participates divided by \(\alpha\) is the sibling recurrence participation rate ratio, denoted by \({\lambda }_{S}\). Simulating from a simplified scenario where the population consists of a set of sib-pairs, we examined the relationships between allele frequencies in various groups of individuals and segments (Fig. 4 and Extended Data Fig. 8; Methods). We assumed a liability-threshold model where a person participates if the liability score is above a certain threshold (Methods). Given \(\alpha ,\) the correlation of siblings’ liability scores, induced by both genetic and nongenetic factors, determines \({\lambda }_{S}\). Results based on assuming \(\alpha =5.5 \%\), the participation rate of UKBB, are presented here. For an SNP, let \({f}_{{\mathrm{pop}}}\) and \({F}_{{\mathrm{samp}}}\) denote respectively the frequency of allele 1 in the population and in the participants. Allele 1 is assumed to have a positive participation effect. Results highlighted here are frequency differences relative to \((F_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}})\). The latter has the following relationship with the frequency difference between participants and nonparticipants:

$$\begin{array}{c}({F}_{\mathrm{samp}}-{f}_{\mathrm{pop}})=(1-\alpha )({F}_{\mathrm{samp}}-{F}_{\mathrm{nonparticipants}}).\end{array}$$
(4)
Fig. 4: Relative frequency differences as functions of sib-pairs enrichment in sample.
figure 4

For \(\alpha =0.055\), the participation rate of UKBB, three frequency difference ratios of an SNP (indicated in the figure) with participation effects are displayed as functions of the sibling recurrence participation rate ratio, \({\lambda }_{S}\). The results are from simulations under a liability-threshold model where the participation of an individual is determined by its liability score and the participation rate \(\alpha\). Given \(\alpha\), \({\lambda }_{S}\) is a function of the correlation between the liability scores of two siblings (Methods). In particular, for \(\alpha =0.055\), a correlation of 0.193 between the siblings’ liability scores leads to \({\lambda }_{S}=2\), the reported enrichment of sib-pairs in the UKBB data. We simulated 500 replications from a population of \(\,5\times {10}^{7}\) sib-pairs. Allele 1 of the SNP is assumed to have a population frequency of 0.5 and the effect of the SNP is assumed to account for 0.1% of the variance of the liability score. The simulated averages of the two ratios, \(({F}_{{\mathrm{IBD}}1\mathrm{S}}-{F}_{{\mathrm{IBD}}1{\mathrm{NS}}})/({F}_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}})\) and \(\left({F}_{{\mathrm{IBD}}2}-{F}_{{\mathrm{IBD}}0}\right)/\left({F}_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}}\right),\) shown with red hollow squares and blue solid triangles respectively, are virtually indistinguishable from each other; they are roughly 1 when \({\lambda }_{S}\) is close to 1 and decrease gradually as \({\lambda }_{S}\) increases, but are always positive. The simulated average of the third ratio \(({F}_{{\mathrm{SIBS}}}-{F}_{{\mathrm{SING}}})/({F}_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}})\), where \({F}_{{\mathrm{SIBS}}}\) is the allele frequency in the participating sibling pairs and \({F}_{{\mathrm{SING}}}\) is the allele frequency in the participating individuals whose sibling does not participate, is shown with gray solid circles. For \({\lambda }_{S}=2\), the first two ratios are around 0.86 and the third ratio is around 0.32.

In Fig. 4, the simulated averages of the ratios

$$\frac{{F}_{{\mathrm{IBD}}1\mathrm{S}}-{F}_{{\mathrm{IBD}}1{\mathrm{NS}}}}{{F}_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}}}$$
(5)

and

$$\frac{{F}_{{\mathrm{IBD}}2}-{F}_{{\mathrm{IBD}}0}}{{F}_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}}}$$
(6)

are nearly identical, indicating that WSPC and BSPC are capturing real effects in a similar manner. Both ratios are close to 1 when \({\lambda }_{S}\) is close to 1, and decrease gradually as \({\lambda }_{S}\) increases. For \({\lambda }_{S}=2\), the reported UKBB sib-pair enrichment, the ratios are around 0.86. Results here are simulated from an SNP with \({f}_{{\mathrm{pop}}}=0.5\) and with effect accounting for 0.1% of the variance of the liability score. Ratios for other parameter values are similar if the individual SNP effect is small. Results in Fig. 4 are for the genotype having an additive effect. Assuming allele 1 has a dominant effect, \(({F}_{{\mathrm{IBD}}2}-{F}_{{\mathrm{IBD}}0})\), is about 10% smaller than \(({F}_{{\mathrm{IBD}}1\mathrm{s}}-{F}_{{\mathrm{IBD}}1{\mathrm{NS}}})\). For a recessive model, \(({F}_{{\mathrm{IBD}}1\mathrm{s}}-{F}_{{\mathrm{IBD}}1{\mathrm{NS}}})\) is about 8% smaller than \(({F}_{{\mathrm{IBD}}2}-{F}_{{\mathrm{IBD}}0})\). Based on the same set of sib-pairs, BSPC and WSPC are directly comparable. TNTC is based on a separate set of participant pairs and thus relative effects of SNPs can differ. Reasons include the parents being older and the asymmetric relationship between parent and offspring (Supplementary Note section 6).

Alleles in the unrelateds are not IBD shared through a recent common ancestor with any other participant, while participants with close relatives participating have both shared and not-shared alleles. This leads to the second principle.

Second principle of genetic induced ascertainment bias

The second principle is that alleles that promote participation would have different frequencies in participants with participating close relatives and in those without. In most cases, the frequencies in the former would be higher. From the same simulations, Fig. 4 displays the simulated average of the ratio

$$\frac{{F}_{{\mathrm{SIBS}}}-{F}_{{\mathrm{SING}}}}{{F}_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}}}$$
(7)

where \({F}_{{\mathrm{SIBS}}}\) is the allele frequency in the participating sibling pairs, and \({F}_{{\mathrm{SING}}}\) (SING denotes ‘singletons’) is the allele frequency in the participating individuals whose sibling does not participate. Examining this empirically without overfitting, we randomly partitioned the UKBB sibling pairs into two halves. The pPGS with weights derived from the GWASs using data of the first half (pPGS1) has average value for the second half of the sibling pairs that is 0.044 s.d. higher than that of the unrelateds (\(P=\,5.0\,\times \,{10}^{-6})\). In reverse, the first half of the sibling pairs have average pPGS2 value, weights derived from the second half, that is 0.026 s.d. higher than the unrelateds (\(P=8.5\,\times \,{10}^{-3})\).

Genetic correlation and heritability

We applied LD score regression using LDSC21 to estimate the correlations between the genetic component underlying primary participation and the genetic components of other traits included in Table 1. Results for traits where the 95% confidence interval (CI) for the correlation excludes zero are in Table 2. For primary participation, the data errors that affect TNTC and WSPC are found to induce a bias in the \({\chi }^{2}\) statistics that is negatively correlated with the LD scores (Methods). While the described adjustments reduce the problem (Supplementary Note section 3; Methods), to avoid residual artefacts, the LDSC results here are based on BSPC only. For the other traits, GWAS results are based on the same unrelateds used for the results in Table 1 (Methods). While the primary participation genetic correlations with EA and the secondary participation traits are all significantly positive and substantial, they are all below 0.5. That could be partly because, while BSPC captures only direct genetic effects, the other GWASs were performed in the ‘standard’ manner and thus capture population effects that include both direct and nondirect genetic effects9,10. Genetic correlation estimates based on the adjusted WSPC results (Supplementary Table 2) are not very different. However, further research is needed to determine how best to use the WSPC and TNTC results for genetic correlation estimates.

Table 2 Estimates of genetic correlations between primary participation and other traits

Using the BSPC GWAS results, we obtained an estimate of 12.5% for the heritability of the liability score underlying primary participation (Methods). This required making several adjustments to the LDSC-estimated heritability and involved utilizing the 0.86 estimated value for the ratio \(({F}_{{\mathrm{IBD}}2}-{F}_{{\mathrm{IBD}}0})/\left.{F}_{\mathrm{samp}}-{f}_{\mathrm{pop}}\right)\) (Supplementary Note section 7; Methods). For \(\alpha =0.055\), heritability of 12.5% would mean that the average value of the genetic component underlying the liability score is 0.72 s.d. higher for the participants than the population. Heritability estimates for the liability scores underlying the secondary participation events are distinctly smaller, ranging from 3.4% to 6.2% (Supplementary Table 3). This could be partly because these heritability estimates are computed from a biased sample. Notably, studying secondary participation is not the same as studying the bias underlying the primary sample, which affects everything that follows.

Third principle of genetic induced ascertainment bias

If genetics contribute to participation, there would be more close relative pairs among the participants than what is expected if participation is random. The third principle holds because if a participant has an above average genetic propensity to participate, so would its close relatives. Even though UKBB did not purposefully recruit families, the number of sib-pairs are twice as many as expected under random sampling13. It was speculated that mutual consultation and possibly shared environment contributed to correlated participation of close relatives13. Another likely contributor is shared DNA. With the liability-threshold model, \(\alpha =0.055\) and \({\lambda }_{S}=2.0\) correspond to a 0.193 correlation between the liability scores of sib-pairs. Assuming the liability score heritability is 12.5%, the direct genetics effect can account for (0.125/2)/0.193 = 32% of the liability score correlation.

Discussion

Participation bias, a concern for all sample surveys, is becoming increasingly relevant in the age of ‘Big Data’1,3. In addition to obvious pitfalls, it could induce, for genetic studies, more subtle consequences that include collider bias17 and the introduction of artificial epistatic effects (Supplementary Note section 8). Here, we show how ascertainment bias leaves footprints in the genetic data that can be exploited to study the bias itself. Our approach shares some principles with affected sib-pairs linkage analysis22. However, whereas the latter relies only on IBD, our method is an IBD-based association analysis.

Two of the three comparisons we proposed are sensitive to genotyping and phasing errors. This complicates analyses, but also creates a secondary usage of our method as a data-quality monitoring tool, like the Hardy–Weinberg equilibrium (HWE) test. For example, results that are significantly different between WSPC and BSPC at the MHC region would indicate data or data-processing problems. Furthermore, note that the genotyping-error induced bias also affects the transmission disequilibrium test—a test usually considered as fully robust.

There can be a common genetic component underlying many different participation events, for example, the pPGS constructed for primary participation associates with both the passive (being invited) and active (deciding to participate when invited) phases of the secondary participation events. Thus, the effect of a genetic component associated with participation could accumulate through many participation events of a person’s lifespan, or it can be magnified through nested participation, for example, the participants in the dietary and physical activity studies have higher average genetic propensity to participate than the other UKBB participants, who have higher average propensities than the population. Instead of thinking of participation as a consequence of other characteristics and established traits, we propose that the propensity to participate in a wide range of events is a behavioral trait in its own right. While our method exploits information previously not utilized, even more could be achieved by combining it with other available information.

Methods

Genetic data

The UKBB has a Research Tissue Bank approval (Research Ethics Committee reference 21/NW/0157) from the North West Multicenter Research Ethics Committee, and all participants gave informed consent. For our data application, we used the UKBB 500K data release previously described by Bycroft et al.13. We filtered out individuals that had withdrawn consent, were not in the kinship inference and phasing input, had a duplicate/twin in sample, with excess of third degree relatives, with missing rate above 2% as well as heterozygosity and missingness outliers and those who showed potential sex chromosome aneuploidy or potential sex mismatch.

The current analysis started with 658,565 biallelic sequence variants in the UKBB phased haplotype data13. We refer to them as SNPs even though a very small fraction are short indels. Compared with standard GWAS, some of our analyses are more sensitive to data artefacts. Thus, we applied a number of filters to the SNPs in addition to those applied by Bycroft et al.13 (see Supplementary Note section 4 for details). Together with the trimming described below, association results were obtained for 500,632 high-quality SNPs.

Identifying relatives and the group of unrelateds

Using kinship coefficients from the UKBB data release, we identified 16,668 White British sib-pairs (42.4% male, mean (year of birth (YOB)) = 1950.8), and 4,427 White British parent–offspring pairs (parents: 31.0% male, mean (YOB) = 1941.7; offsprings: 38.9% male, mean (YOB) = 1965.0, Supplementary Note section 1). White British descent was determined from self-identified ancestry and PC analysis13. For sibling-ships with more than two siblings, the first two participating siblings were chosen. Sib-pairs and parent–offspring pairs were chosen to have no overlap. Within the White British subset, 272,409 individuals have no relatives within the UKBB of third degree or closer (46.7% male, mean (YOB) = 1951.3). Those are referred to as the ‘unrelateds’ here.

Inferring IBD segments and trimming

We inferred IBD segments for sibling pairs using snipar9, a program designed for sib-pairs. Compared with KING18 (v.2.2.4) a program not tailored for sib-pairs, snipar entails a substantial improvement in IBD estimation (Extended Data Fig. 2). Neither program uses phasing information as input. Considering the uncertainty around recombination events, we trimmed away 250 SNPs from the beginnings and ends of the inferred IBD segments before tabulating allele frequencies (Extended Data Fig. 3). This removed 11,000 SNPs all together (2 × 250 × 22) and reduced the sample sizes for the remaining SNPs (median reduction of 1,565 sibling pairs).

Inferring shared allele

For a biallelic SNP and a relative pair, there are five possible unordered combinations of genotypes for IBD1. The IBD shared allele is clear except when both individuals are heterozygous. For that, the phasing information provided13 was utilized. The closest neighboring SNP, for which one individual was heterozygous and the other homozygous, was identified and used to determine the shared haplotype and, through that, the shared allele of the target SNP (Extended Data Fig. 4).

Test statistics

For each SNP, we computed separate t-statistics for TNTC, WSPC and BSPC, by dividing each of the frequency differences, equations (1) to (3), by its s.e. For TNTC and WSPC, the s.e. values were computed assuming the shared versus not-shared frequency differences of each pair were independent of each other. For BSPC, for every SNP, the allele frequencies computed for individual sibling pairs, whether IBD0 or IBD2 pairs, were assumed to be independent of each other. No further assumptions, for example, HWE or independence of genotypes of two siblings in the IBD0 case, were made. Essentially, TNTC and WSPC were treated as one-sample t-tests (of differences), and BSPC were treated as a two-sample t-test. Specific equations are given in Supplementary Note section 2, which also contains information on sample sizes.

Sources of errors inducing the major-allele bias in the test statistics

When the initial results for all the SNPs were examined together, the t-statistics showed a tendency towards favouring the major allele as having higher frequency in the shared alleles. We identified three sources to this bias:

IBD estimation errors

For every sib-pair, the analysis starts with determining the IBD status for each SNP. Error here could lead to biases. This problem was addressed by replacing the KING18 program by snipar9, as noted above. Together with the trimming noted above, the impact of IBD estimation errors seems to be mostly eliminated (see Supplementary Note section 3 for further discussion).

Phasing errors

Phasing errors can induce a systematic major-allele bias in the WSPC and TNTC t-statistics. We focus more on WSPC below as it captures essential the same real effects as BSPC. The induced bias on TNTC is similar.

For TNTC and WSPC, when the genotypes of both relatives are heterozygous, to determine the shared allele, we use phased haplotypes that include neighboring SNPs. Phasing errors can thus induce a bias. Let the frequency of allele 1 be \({f}_{{\mathrm{pop}}}=f.\) For simplicity, assume the SNP is in HWE in the population and not associated with participation. When two siblings share one allele IBD and both are heterozygous, the chance that the shared allele is 1 is \(\left(1-f\right).\) Let \({\varepsilon }_{f}\), a function of \(f\), be the error rate of calling the shared allele 0 when it is actually 1. By symmetry, the error rate of calling the shared allele 1 when it is actually 0 is \({\varepsilon }_{1-f}\). Because the probability of the shared allele being actually 1 is \((1-f)\), the induced bias of the two type of errors combined is \(f{\varepsilon }_{1-f}-\left(1-f\right){\varepsilon }_{f}\). For \(f > 0.5\), the bias is positive if

$$\frac{{\varepsilon }_{f}}{{\varepsilon }_{1-f}} < \,\frac{f}{1-f}$$
(8)

Using data from 739 UKBB trios, where the shared allele between a parent–offspring pair in the double-heterozygotes case could be resolved by the genotype of the other parent, we estimated \({\varepsilon }_{f}\) (Extended Data Fig. 5), which satisfies equation (8). More details about the quantitative results and the induced bias on the t-statistics are given in Supplementary Note section 3. One observation, highlighted by Extended Data Fig. 6, is that, for SNPs with \({\mathrm{MAF}} > \,0.10\), around \(\,50 \%\) of the observed bias of the WSPC t-statistics can be explained by the double-heterozygote errors. However, as MAF becomes small, the fraction of the empirical bias explainable by double-heterozygote errors decreases, for example, to 21% when \({\mathrm{MAF}}=\,0.01.\,\) We believe genotype-calling errors are responsible for the additional bias.

Genotyping errors

Genotyping errors can induce a systemic bias to the WSPC and TNTC statistics. Simulations show that random errors where each genotype has a small probability to be replaced by a random genotype drawn from the population would induce a bias on \({F}_{{\mathrm{IBD}}1\mathrm{S}}-{F}_{{\mathrm{IBD}}1{\mathrm{NS}}}\) which is positive if allele 1 has frequency >0.5. This happens even though such error mechanism would not even change the sampling distribution of the called genotypes. Furthermore, genotype error rate is known to be higher for SNPs with lower MAFs. We believe the main driving force of the major-allele bias is the minor allele being overcalled in the not-shared alleles (explanation in Supplementary Note section 3).

Two-step adjustment: adjustments of t-statistics and χ 2 statistics

For WSPC and TNTC, separately, we made a simple adjustment by regressing the t-statistics on centred allele frequency (cf), \({cf}=\left(\,f-0.5\right)\), through the origin and took the residuals as the adjusted values. This corresponds to deducting \((0.6464\times {cf}\,)\) and \((0.3781\times {cf}\,)\) from the unadjusted t-statistics of WSPC and TNTC, respectively. This adjustment avoids artificial association with other GWASs that are also subject to a major-allele bias, but resulting from a different mechanism. If the other GWASs exhibit major-allele biases for the same reasons, this adjustment would reduce, but not eliminate, an artificial association.

Effect of the major-allele bias on the χ 2 statistics

The errors underlying the major-allele bias of the t-statistics would also inflate the average values of the \({\chi }^{2}\) statistics. The t-statistic adjustment above will reduce, but not eliminate, this inflation. This is because the adjustment is based on allele frequency, that is, it removes the average bias of alleles with similar frequency. As SNPs with similar allele frequency will have biases that vary around the average, the variation of the adjusted t-statistics would still be inflated, and through that inflate the average value of the \({\chi }^{2}\) statistics. Notably, the average \({\chi }^{2}\) value for BSPC is 1.011 for the 500,632 SNPs, 1.025 for the 110,533 SNPs with \({\mathrm{MAF}} > 0.25\), and 1.007 for the 390,099 SNPs with \({\mathrm{MAF}} < 0.25.\) By contrast, for WSPC, the corresponding average \({\chi }^{2}\) values are 1.136, 1.039 and 1.163, respectively, without t-statistic adjustment, and 1.074, 1.029 and 1.086, respectively, after adjustment. With t-statistic adjustment, the average \({\chi }^{2}\) value for WSPC is only modestly inflated relative to BSPC for SNPs with \({\mathrm{MAF}} > \,0.25\), but remain substantially inflated for SNPs with \({\mathrm{MAF}} < \,0.25\). Moreover, while the \({\chi }^{2}\) statistics for BSCP are slightly positively correlated with MAF (r = 0.006), the \({\chi }^{2}\) statistics for WSCP, even with t-statistic adjustment, have a larger but negative correlation with MAF (r = −0.021). The latter is because the major-allele bias of an SNP increases as \({\mathrm{MAF}}\) decreases. The negative correlation between the WSPC \({\chi }^{2}\) values and \({\mathrm{MAF}}\) is problematic for the application of LD score regression as the LD scores have a substantial positive correlation (\(r=0.35\)) with MAF. Without further adjustments, applying LD score regression to the WSPC \({\chi }^{2}\) values gives a negative fitted slope, or a negative estimated heritability. To address this, we performed \({\mathrm{MAF}}\)-specific genomic control. Specifically, we started by regressing the \({\chi }^{2}\) values computed from the adjusted t-statistics on polynomial of MAF up to the third power. The fitted values of the χ2 values as a function of MAF are displayed in Extended Data Fig. 7. The \({\chi }^{2}\) values were then adjusted by dividing the original values by the fitted values. By construction, these \({\chi }^{2}\) values have average equal to 1. Taking the square-root and multiplying by the sign of the adjusted t-statistics gave the ‘final’ z-(or t-) scores of the WSPC GWAS. The same method was used for the TNTC GWAS. These z-scores were used to evaluate the significance of individual SNPs and to compute the polygenic scores used for Table 1. These adjustments reduce the impact of the artefacts, but the results are still imperfect (see Supplementary Note section 3 for further discussion). Thus, the LD score regression estimates of genetic correlations in Table 2 of the main text are based on the BSPC results only, and so is the heritability estimate given. However, the estimates in Supplementary Table 2, based on the adjusted WSPC results, are broadly consistent with those in Table 2, supporting that the adjustments have reduced the problems that arise in the application of LD score regression.

When we use the GWAS results to construct polygenic scores, we find the adjustments described above to have a very small effect on the predictive power of the polygenic scores. In particular, the polygenic score constructed from the WSPC t-statistics, with or without adjustments, has very similar predictive power as the polygenic score constructed from the BSPC t-statistics (Table 1 and Supplementary Table 1). This is because the bias is quite small per SNP for this filtered set and so would only be adding a little noise to the polygenic score prediction.

Polygenic score analysis

The pPGS was computed with PLINK 1.90 (ref. 24) by summing over the weighted genotypes of the 500,632 SNPs fulfilling quality control, using the z- (or t-) statistics from the primary participation GWASs as weights. The relationship between the pPGS, standardized to have a variance of 1, and the phenotypes in Table 1 was estimated with a linear regression and logistic regression in R (v.3.4.3) in the group of White British unrelateds, taking sex, YOB, age at recruitment, genotyping array and 40 PCs into account. The quantitative phenotypes were transformed to have a variance of 1 for men and women separately (for further information see Supplementary Note section 5). To account for population stratification, the P value for each pPGS-phenotype association was adjusted through dividing the squared test statistic (t-test for quantitative phenotypes and z-test for binary phenotypes) by the LD score regression intercept estimated from GWAS summary statistics for the corresponding phenotype (see below).

LD score regression

We performed LD score regression with the program LDSC (v.1.0.1) (ref. 19) using the European Ancestry LD scores computed by the Pan-UKB team23 (downloaded on 7 April 2021). Analyses were based on the 500,632 SNPs used to compute the pPGS.

LD score regression intercepts and genetic correlations were estimated for the primary participation test statistics described above and the phenotypes shown in Table 1. We obtained summary statistics for the phenotypes shown in Table 1 by running GWASs in the group of White British unrelated individuals in the UKBB using BOLT-LMM (v.2.3) (ref. 25). For quantitative phenotypes, which had been adjusted for sex, age, YOB and 40 PCs and transformed to have variance of 1 as described in Supplementary Note section 5, genotyping array was added as an additional covariate in the GWASs. For the binary phenotypes, YOB, age at recruitment up to the order of three, 40 PCs, sex and genotyping array were added as covariates. For LD score regression, we used the standard linear regression P values, P_LINREG, from the BOLT output. Heritability of participation traits was estimated for a liability-threshold model (see below).

Liability-threshold model for participation of individuals and sib-pairs

The liability-threshold model assumes that a liability score, denoted here by \(X\), underlies a 0/1 trait or response. \(X\) is assumed to have a standard normal distribution (or roughly so). Let \(I\) be the 0/1 participation variable, and participation rate is \(P\left(I=1\right)=\alpha\). It is assumed that \(I=1\) if \(X > \, {\tau}\), where \(\tau =\,{\Phi }^{-1}(1-\alpha )\) and \(\Phi\) is the cumulative distribution of the standard normal. Correlation of participation between individuals is modeled through the correlation of their liability scores.

For \(n\) sib-pairs, let \(\,i=1,\ldots ,n\) index the pairs and \(j=1,\,2\) index the two siblings in a pair. Focusing on one SNP with its standardized genotype denoted by g, the liability of sibling ij is modeled as

$${X}_{ij}=\,{{w}_{1}g}_{ij}+\,{w}_{A}{A}_i+{w}_{B}{B}_{ij}$$
(9)

where \({A}_i\) and Bij are standard normal variables. The variables Ai, Bi1 and Bi2 are assumed to be independent of gi1 and gi2, and each other. Ai captures effects from shared environment as well as shared genetic factors other than g. We assume \({w}_{1}^{2}+{w}_{A}^{2}+{w}_{B}^{2}=1\) so that var(Xij) = 1. Because cor(gi1, gi2) = 1/2, \({\mathrm{cor}}\left({X}_{i1},\,{X}_{i2}\right)={w}_{A}^{2}+{w}_{1}^{2}/2.\,\) Here Xij is not exactly normally distributed because of gij, but it is close if \({w}_{1}\) is small.

Simulations to study relationships between various frequency differences of an SNP

Results in Fig. 4 were simulated with \(\alpha =0.055\), the participation rate of UKBB. Allele 1 of the SNP is assumed to have a population frequency of 0.5 and a positive participation effect with \({w}_{1}^{2}=0.001\). Sixteen different values of wA are chosen so that \({\mathrm{cor}}\left({X}_{i1},\,{X}_{i2}\right)=\,{w}_{A}^{2}+{w}_{1}^{2}/2\) takes on values of 0.0005, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150, 0.175, 0.193, 0.225, 0.250, 0.275, 0.300, 0.325, 0.350 and 0.375. Notably, \({\mathrm{cor}}\left({X}_{i1},\,{X}_{i2}\right)=0.193\) leads to \({\lambda }_{S}=2.0\), the reported sibling enrichment in UKBB13.

Simulations for dominant and recessive models were performed similarly. For the dominant model, the gij in the liability score definition is taken as the standardized version of a 0/1 variable, which is 1 if the actual genotype is 1 or 2, and 0 otherwise. For the recessive model, gij is the standardized version of a 0/1 variable, which is 1 if the actual genotype is 2.

Estimating heritability for liability scores

Here, we show how the heritability of the participation traits were estimated, starting with secondary participation. When the LDSC program is given the \({\chi }^{2}\) statistics and sample sizes for a set of genome-wide SNPs, the heritability estimate produced is an estimate of the fraction of variance of the trait accountable by the genetic component, or \({r}^{2}\) between trait and the genetic component. If genetic component \(G\) influences the participation variable \(I\) through liability score \(X\), then

$${\mathrm{cor}}\left(G,I\right)={\mathrm{cor}}\left(G,X\right)\times {\mathrm{cor}}\left(X,I\right)$$
(10)

Heritability of \(X\) and \(I\) thus satisfy

$${h}^{2}\left(X\right)=\,\frac{{h}^{2}\left(I\right)}{{\mathrm{cor}}^{2}\left(X,\,I\right)}$$
(11)

with

$${\mathrm{cor}}(X,I)=\,\frac{E({XI})-E(X)E(I)}{\surd ({\mathrm{var}}(X){\mathrm{var}}(I))}=\,\frac{\alpha E(X|I=1)}{\sqrt{\alpha (1-\alpha )}}=\,\frac{\alpha \frac{\phi (\tau )}{1-\Phi (\tau )}}{\sqrt{\alpha (1-\alpha )}}=\frac{\phi (\tau )}{\sqrt{\alpha (1-\alpha )}}$$
(12)

where \(\phi\) is the density function of the standard normal and E stands for expectation. Consider the Dietary Study invitation event. Feeding the LDSC program the \({\chi }^{2}\) statistics from the GWAS analyses and a sample size of 272,409 (166,993 invited; 105,416 not-invited), the heritability estimate is 0.0372. Here \(\alpha =\,166993/272409=0.613,\) \(\tau =\,{\Phi }^{-1}\left(1-0.613\right)=\,\)−0.287, and \({\mathrm{cor}}\left(X,{I}\right)\) is 0.786. The heritability of \({X}\) is estimated as the estimated heritability of \({I}\) multiplied by the adjustment factor \([1/{\mathrm{cor}}^{2}\left(X,I\right)]\), or

$$0.0372\,\times \frac{1}{{0.786}^{2}}=0.060$$
(13)

The heritability of the liability scores of the other secondary participation events are similarly estimated. The \({r}^{2}\) between a genetic component, or the genotype of an individual SNP, is weaker, or statistically less efficient, with \({I}\) than with \(X\). In this sense, the adjustment factor is inversely proportional to the statistical efficiencies of the test statistics used, a principle that also applies to the two other adjustments described below. Moreover, Supplementary Note section 7 describes how the adjustment could be alternatively applied through providing the LDSC program with modified sample sizes.

With primary participation, heritability estimation requires the adjustment step above plus two others because the BSPC results are not direct comparisons of genotypes of participants and nonparticipants. Let \({n}_{{\mathrm{IBD}}2}\) and \({n}_{{\mathrm{IBD}}0}\) be the respective number of IBD2 and IBD0 pairs at the SNP location. Feeding the LDSC program the \({\chi }^{2}\) statistics from the BSPC GWAS with number of ‘cases’ equals \({n}_{{\mathrm{IBD}}2}\) and number of controls equals \(2\times {n}_{{\mathrm{IBD}}0}\), the estimated heritability given is 0.0838. Here \(\alpha =0.055\), \(\tau =1.598\) and \({\mathrm{cor}}\left(X,I\right)\) = 0.488. Hence, the first adjustment factor is \({\left(1/0.488\right)}^{2}=\,4.2.\,\) However, the numbers of participants and nonparticipants have a 0.055 to 0.945 ratio. By contrast, the cases to controls ratio for BSPC is around 1:2, much more efficient as the variance of the comparison is roughly proportional to

$$\frac{1}{{n}_{{\mathrm{cases}}}}+\,\frac{1}{{n}_{{\mathrm{controls}}}}$$
(14)

As variance here is inversely proportional to efficiency, this leads to the adjustment factor, for arbitrary \(n,\)

$$\frac{\frac{1}{n\times \left(1/3\right)}+\frac{1}{n\times \left(2/3\right)}}{\frac{1}{n\times \left(0.055\right)}+\frac{1}{n\times \left(0.945\right)}}=\,\frac{3+\left(\frac{3}{2}\right)}{18.18+1.06}=0.2338$$
(15)

There is a third adjustment because the allele frequency difference between the IBD2 sibs and the IBD0 sibs is smaller than the frequency difference between the participants and nonparticipants. Specifically, for \(\alpha =0.055\) and \({\lambda }_{S}=2.0\), our simulations gave

$$E\left[\frac{{F}_{{\mathrm{IBD}}2}-{F}_{{\mathrm{IBD}}0}}{{F}_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}}}\right]\approx 0.86$$
(16)

Since \((F_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}})=\left(1-\alpha \right)\left({F}_{{\mathrm{samp}}}-{F}_{{\mathrm{nonparticipants}}}\right)\)

$$E\left[\frac{{F}_{{\mathrm{IBD}}2}-{F}_{{\mathrm{IBD}}0}}{{F}_{{\mathrm{samp}}}-{F}_{{\mathrm{nonparticipants}}}}\right]\approx 0.86\times \,\left(1-\alpha \right)=0.86\times 0.945=0.8127$$
(17)

Because efficiency is proportional to effect2, the adjustment factor is

$${\left(\frac{1}{0.8127}\right)}^{2}=1.514$$
(18)

With the three adjustments, the estimated heritability of primary participation is

$$0.0838\times 4.2\times 0.2338\times 1.514=0.125$$
(19)

or 12.5%.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.