Studying the genetics of participation using footprints left on the ascertained genotypes

The trait of participating in a genetic study probably has a genetic component. Identifying this component is difficult as we cannot compare genetic information of participants with nonparticipants directly, the latter being unavailable. Here, we show that alleles that are more common in participants than nonparticipants would be further enriched in genetic segments shared by two related participants. Genome-wide analysis was performed by comparing allele frequencies in shared and not-shared genetic segments of first-degree relative pairs of the UK Biobank. In nonoverlapping samples, a polygenic score constructed from that analysis is significantly associated with educational attainment, body mass index and being invited to a dietary study. The estimated correlation between the genetic components underlying participation in UK Biobank and educational attainment is estimated to be 36.6%—substantial but far from total. Taking participation behaviour into account would improve the analyses of the study data, including those of health traits.

The trait of participating in a genetic study probably has a genetic component. Identifying this component is difficult as we cannot compare genetic information of participants with nonparticipants directly, the latter being unavailable. Here, we show that alleles that are more common in participants than nonparticipants would be further enriched in genetic segments shared by two related participants. Genome-wide analysis was performed by comparing allele frequencies in shared and not-shared genetic segments of first-degree relative pairs of the UK Biobank. In nonoverlapping samples, a polygenic score constructed from that analysis is significantly associated with educational attainment, body mass index and being invited to a dietary study. The estimated correlation between the genetic components underlying participation in UK Biobank and educational attainment is estimated to be 36.6%-substantial but far from total. Taking participation behaviour into account would improve the analyses of the study data, including those of health traits.
For all sample surveys, ascertainment bias, meaning that the sample is not representative of the target population, could lead to seriously misleading conclusions 1,2 . By its very nature, ascertainment bias usually cannot be evaluated based on the sample alone 3 . Typically, other variables (covariates) that have known distributions for both sample and population are needed for adjustments 1,3 . Such adjustments are inherently imperfect as the covariates are unlikely to fully capture the underlying bias 1,3 . For genetic studies, among participants of the primary study who have contributed DNA, further engagement in optional components of the study has been demonstrated to have associations with genotypes and phenotypes [4][5][6][7] . That, however, does not address the genotypic difference between the primary study participants and the target population. Thus, it is striking that one can investigate how the sampled genotypes are biased based on themselves alone. A recent study identified single nucleotide polymorphisms (SNPs) that had significant allele frequency differences between the sampled males and females, and proposed that those variants have differential participation effects for the sexes 8 . This approach, however, cannot identify variants that affect primary study participation of both sexes in a similar manner. Here, we show how to do so.

Three allele comparisons
All individuals are genetically related to some degree. Furthermore, each individual has two copies of genetic segments on autosomal chromosomes, and some of these segments are identical by descent (IBD), that is inherited from a recent common ancestor, with genetic segments in a relative. Instead of comparing individuals, we compare genetic segments. The key idea is that an allele that has higher frequency in participants than nonparticipants would also have higher frequency in segments that are in two participants than in segments that are in only one. Following this observation, we present below three principles of genetic induced participation bias, and show how to use only the sampled genetic data to perform genome-wide association scans (GWAS) for study participation that capture only direct genetic effects 9,10 , and are unaffected by population stratification 11 .
First principle of genetic induced ascertainment bias. On average, between two ascertained individuals, genetic segments shared IBD, relative to segments that are not, are enriched with alleles that have positive direct effects on ascertainment probability. Figure 1a illustrates this Technical Report https://doi.org/10.1038/s41588-023-01439-2 Ascertained sib-pairs, parents not ascertained. At a specific locus, assuming random Mendelian transmission, a sib-pair has probability 1/2 of inheriting the same allele IBD from the father, and the same independently from the mother. Consequentially, a sib-pair shares 2, 1 or 0 alleles IBD with probabilities 1/4, 1/2 and 1/4, respectively. With dense SNPs, the IBD state of a locus can usually be determined with high accuracy. For a specific SNP, based on the sib-pairs with known IBD states, two comparisons are made.
For each sib-pair that has IBD state 1 for a given SNP (Fig. 1c), there are one distinct shared allele (S) and two distinct not-shared alleles (NS 1 and NS 2 ). Let F IBD1S and F IBD1NS denote the frequency of allele 1 among the S alleles and the NS (NS 1 and NS 2 combined) alleles, respectively. The difference is the within-sib-pair comparison (WSPC). For any such pair, if the S allele is paternally inherited, then the two NS alleles are maternally inherited, and vice versa. Despite that, conditional on the genotypes of the two parents, without ascertainment bias, the frequency difference has expectation 0. Assuming random transmissions, the shared allele is equally likely to be paternal or maternal (Extended Data Fig. 1). Thus, any systematic difference between fathers and mothers cancels in expectation. For the sib-pairs sharing two alleles IBD at a locus, let F IBD2 be the frequency of allele 1. Similarly, let F IBD0 be the allele 1 frequency among sib-pairs that share no allele IBD. The difference (Fig. 1d) is the between-sib-pairs comparison (BSPC). This compares different sib-pairs and does not require splitting genotypes into shared and not-shared alleles. For a sib-pair at a particular locus, the chance to be in IBD state 0 or 2 is the same and, thus, without ascertainment bias, the frequency difference has expectation zero. In general, a sib-pair would be in IBD state 2 for some regions, and in IBD state 0 (or 1) in other regions. Results from the three comparisons introduced can be combined, for testing, estimation or prediction purposes. However, having the general principle. With a large sample of individuals of the same ancestry, at a specific SNP locus, many pairs of individuals would share one long haplotype inherited IBD from a not-very-distant common ancestor. Each of such pairs has one distinct shared haplotype, and two distinct not-shared haplotypes. The SNP allele that promotes participation would tend to have a higher frequency in the shared than the not-shared haplotypes. The shared and not-shared alleles are considered as cases and controls, respectively, and matched as they are in the same individuals. Still, that does not remove potential confounding entirely as haplotypes driven to higher frequency through natural selection would also be shared by more individuals. Ascertainment bias is a form of selection and, to cleanly distinguish it from other forms of selection, requires more stringent matching of shared and not-shared haplotypes. That is achieved by using ascertained parent-offspring and sibling pairs (sib-pairs).
Ascertained parent-offspring pairs. For an ascertained parent-offspring pair (Fig. 1b) where the other parent is not ascertained, there are three distinct alleles by descent for a given SNP: the allele transmitted from parent to offspring (T), the parental allele not transmitted to the offspring (NT) and the allele inherited by the offspring from the other parent (O). The T allele is shared, and the other two are not-shared. The NT and T alleles are perfectly matched as both are in the ascertained parent. Mendelian inheritance dictates that each would have the same chance to be transmitted to the offspring to become the shared allele. The principle is similar to that underlying the transmission disequilibrium test 12 . The O allele is not as perfectly matched and is currently not used. For the ascertained parent-offspring pairs, with alleles coded as 0/1, and F T and F NT denoting the frequency of 1s among the T and NT alleles respectively, the difference can be used to test for association between SNP and ascertainment (Methods). We call this the transmitted-nontransmitted comparison (TNTC). When both parents are ascertained, that can be treated as two parent-offspring pairs as the transmissions from the two parents are independent without a participation effect. individual results separately is important because they could capture different effects depending on the nature of the participation bias. Most importantly, they are impacted differently by genotyping and data-processing errors.

Analyses of UK Biobank data
The UK Biobank (UKBB) is a large-scale database with genetic and phenotypic information of individuals from across the UK 13 . Individuals aged between 40 and 69 years and living within a 25-mile radius of any of the 22 UKBB assessment centers were invited to participate 14 . Among the 9,238,453 invited, 5.5% (~500,000 individuals) participated and went through baseline assessments that took place from 2006 to 2010 (ref. 14). In addition to phenotypic details collected at the baseline visit, information continued to be added, including follow-up studies for large subsets of the cohort [14][15][16] . It is known that the UKBB sample is not fully representative of the UK population 13,14,17 . Participants were more likely to be female, less likely to smoke, older, taller, had lower body mass index (BMI) 14 and more educated 17 . We applied our methods to 4,427 parent-offspring pairs and 16,668 sib-pairs with White British descent in UKBB (Supplementary Note section 1; Methods). To start, association analysis was performed for 658,565 SNPs available in the UKBB phased haplotype data 13 . For each SNP, t-statistics for TNTC, WSPC and BSPC were computed by dividing each of the frequency difference, equations (1) to (3), by its standard error (s.e.) (Supplementary Note section 2; Methods). When results from all the SNPs were examined together, we observed a tendency for the major allele to be positively associated with participation. Investigations supported by extensive simulations revealed three causes to this bias: (a) IBD-calling errors, (b) phasing errors, (c) genotyping errors (Methods). (a) affects the processing of sib-pair data when the IBD sharing status (0, 1 or 2) of every SNP is 'called' and impacts WSPC and BSPC. The problem with (a) was mostly eliminated when we replaced KING 18 , a 'general' IBD-calling program, by snipar 9 , specifically designed for sib-pairs (Extended Data Fig. 2), and trimmed away 250 SNPs from each end of an IBD segment (contiguous SNPs with the same IBD status called; Extended Data Fig. 3; Methods). (b) and (c) affect TNTC and WSPC, which require splitting genotypes into shared and not-shared alleles; (b) enters when the relatives are both heterozygous and phasing with neighboring SNPs is required to determine the shared allele 9 (Extended Data Fig. 4). Through theoretical calculations and simulations, we found that under most scenarios, (b) and (c) would contribute to the observed major-allele bias (Supplementary Note section 3; Methods). For variants with minor allele frequency (MAF) > 0.10, we estimated that around 50% of the observed major-allele bias can be explained by phasing errors, while genotyping errors play a greater role for low frequency variants (Extended Data Figs. 5 and 6; Methods). As (b) and (c) do not impact BSPC and with (a) addressed, results based on BSPC can be used in a straightforward manner. When LD score regression (LDSC) 19 is applied to the χ 2 statistics computed from the BSPC GWAS, the fitted intercept is nearly exactly 1 (0.9998), indicating that the χ 2 statistics are neither inflated because of data artefacts, nor are they affected by issues such as population stratification. For TNTC and WSPC, the major-allele bias and related problems affect different analyses differently, as illustrated by analyses below and information in Methods and Supplementary Note section 3. For most SNPs individually, the bias induced by (b) and (c) is small and apparent only when analyzed as a group. However, in the major histocompatibility complex (MHC), a difficult region with extended linkage disequilibrium (LD), 17 SNPs have P < 5 × 10 −8 with WSPC. These are clearly data artefacts as none is even nominally significant with BSPC (all P values presented are two-sided). We discarded SNPs from the MHC and other extended LD regions 20 and applied other quality control filtering (Supplementary Note section 4; Methods), leaving 500,632 remaining SNPs with MAF > 1% in our participation genome scans. To handle the data-error induced biases separately for TNTC and WSPC, a two-step allele frequency adjustment was applied, which eliminated the correlation with allele frequency and shrank the t-statistics towards zero (Extended Data Fig. 7; Methods). Figure 2 summarizes the steps underlying the procedure used to test SNPs for association with participation.
GWAS based on BSPC does not give any genome-wide significant SNP, which is unsurprising given that the power is not high. Combining the association results from all three comparisons, one SNP, rs113001936 on chromosome 16 (P = 3.4 × 10 −9 ), passed the genomewide significant threshold. However, since the positively associated allele has a frequency of 0.988, and the significance is driven mainly by WSPC (P = 0.048 for BSPC), we consider this SNP as only suggestive. To validate and to provide a proper understanding of the results, we constructed three separate participation polygenic scores (pPGSs) whose SNP-weights are based on these three sets of t-statistics (Methods). Values of each of the three pPGSs, standardized to have variance 1, were computed for 272,409 White British individuals without relatives in UKBB of third degree or closer, referred to as the 'unrelateds'. Associations between each of the three pPGSs and various quantitative traits were examined using the unrelateds (Supplementary Note section 5; Methods). Table 1 shows some of the strongest associations plus a few nonsignificant ones for reference. The BSPC pPGS and the WSPC pPGS are supposed to estimate essentially the same effects with comparable power (see simulation results below). Thus, it is comforting that their associations with the various traits are generally compatible. With multiple-comparison adjustment, the difference between the BSPC and the WSPC pPGS associations is not significant for any of the traits. Considering that the TNTC pPGS is based on a smaller sample size, its association results are also in general agreement. This shows that the data errors impacting TNTC and WSPC have only a small effect on polygenic score prediction. Indeed, even without adjustments, the TNTC and WSPC pPGSs give associations (Supplementary Table 1) similar to those in Table 1.
We constructed a combined pPGS using the results from all three comparisons (Methods). Its strongest association is with educational attainment (EA) where the effect (in standard deviation (s.d.) units) is 0.0309 with P = 3.9 × 10 −53 . The effect, 0.0300, is nearly as strong for age-at-first-birth (AFB) of women (P = 8.6 × 10

−21
). The next strongest association, notably negative, is with BMI (P = 4.0 × 10 −22 ). These results, consistent with the known differences in EA and BMI between the UKBB sample and population 14,17 , validated that our GWAS performed without phenotype data can nonetheless capture genetic associations with participation. Some UKBB participants were invited to answer a dietary study questionnaire in 2011-2012 (ref. 16), and some were invited to a physical activity study in 2013-2015 that required wearing an accelerometer for a week 15 . Not everybody was invited (criteria included having a valid email address) and only a subset of those invited actually participated. We call participation in these follow-up studies 'secondary participation'. For the dietary study, the estimated effect of the combined pPGS on being invited, in log e odds-ratio ( log (OR)), is 0.0342 ( P = 4.8 × 10 −16 ; Table 1). For those invited, the estimated effect on actual participation in log(OR) is 0.0272 (P = 2.5 × 10 −7 ). For the physical activity study, the corresponding estimates are 0.0277 (P = 1.1 × 10 −12 ) and 0.0300 (P = 6.4 × 10 −8 ).
Associations with the combined pPGS were further examined adjusting for EA (Table 1). For traits/variables with P < 1 × 10 −3 originally, the adjusted effects shrink but all remain significant. Relatively, the effect on participating in the physical activity study when invited shrinks the least, by 18%, from 0.0300 to 0.0246, while the effect on dietary study invitation shrinks by 38%, from 0.0342 to 0.0213. This difference in shrinkage is partly because the estimated effect of EA on dietary study invitation is 0.4661, much larger than its effect on physical activity study participation, 0.1826. From the EA effects alone, the ascertainment bias would seem to be much stronger with dietary study Technical Report https://doi.org/10.1038/s41588-023-01439-2 invitation. By contrast, the genetic component of the ascertainment bias for physical study participation is stronger than that for dietary study invitation after EA adjustment. Thus, while the participation genetic component is associated with EA, its effect on other traits is not manifested mainly through EA. Importantly, phenotypes known to correlate with participation do not fully capture the nature and magnitude of the ascertainment bias. When males and females are analyzed separately for the traits/variables in Table 1 (Fig. 3), no significant difference is found for the pPGS effects with multiplecomparison adjustment. Furthermore, while UKBB participation rates differ by sex and age 14 , the combined pPGS is associated with neither (P > 0.05), implying that the effect of the combined pPGS is additive to the effects of sex and age on participation, with no detectable statistical interactions. This, however, does not rule out the existence of variants with strong differential effects for the sexes as the combined pPGS is based on GWASs that are not sex specific and thus not designed to capture such variants. Advancing from hypothesis testing to parameter estimation, we note that the UKBB did not recruit families, and participants were all adults providing their own consent 13 . Under these conditions, for alleles associated with participation, relative frequencies in different groups of individuals and genetic segments depend on many factors, most important of which are overall participation rate, denoted here by α, and the participation rate of close relatives of participants ( Fig. 4 and Extended Data Fig. 8). For sib-pairs, the participation rate of an individual given that its sibling participates divided by α is the sibling recurrence participation rate ratio, denoted by λ S . Simulating from a simplified scenario where the population consists of a set of sib-pairs, we examined the relationships between allele frequencies in various groups of individuals and segments ( Fig. 4 and Extended Data Fig. 8; Methods). We assumed a liability-threshold model where a person participates if the liability score is above a certain threshold (Methods). Given α, the correlation of siblings' liability scores, induced by both genetic and nongenetic factors, determines λ S . Results based on assuming α = 5.5%, the participation rate of UKBB, are presented here. For an SNP, let f pop and F samp denote respectively the frequency of allele 1 in the population and in the participants. Allele 1 is assumed to have a positive participation effect. Results highlighted here are frequency differences relative to (F samp − f pop ). The latter has the following relationship with the frequency difference between participants and nonparticipants: In Fig. 4, the simulated averages of the ratios . Based on the same set of sib-pairs, BSPC and WSPC are directly comparable. TNTC is based on a separate set of participant pairs and thus relative effects of SNPs can differ. Reasons include the parents being older and the asymmetric relationship between parent and offspring (Supplementary Note section 6). Alleles in the unrelateds are not IBD shared through a recent common ancestor with any other participant, while participants with close relatives participating have both shared and not-shared alleles. This leads to the second principle.
Second principle of genetic induced ascertainment bias. The second principle is that alleles that promote participation would have different frequencies in participants with participating close relatives and in those without. In most cases, the frequencies in the former would be higher. From the same simulations, Fig. 4 displays the simulated average of the ratio where F SIBS is the allele frequency in the participating sibling pairs, and F SING (SING denotes 'singletons') is the allele frequency in the participating individuals whose sibling does not participate. Examining this empirically without overfitting, we randomly partitioned the UKBB sibling pairs into two halves. The pPGS with weights derived from the GWASs using data of the first half (pPGS 1 ) has average value for the second half of the sibling pairs that is 0.044 s.d. higher than that of the unrelateds (P = 5.0 × 10 −6 ) . In reverse, the first half of the sibling pairs have average pPGS 2 value, weights derived from the second half, that is 0.026 s.d. higher than the unrelateds (P = 8.5 × 10 −3 ).
Genetic correlation and heritability. We applied LD score regression using LDSC 21 to estimate the correlations between the genetic component underlying primary participation and the genetic components of Regression was used for analyzing quantitative traits and logistic regression was used for binary traits.    figure) with participation effects are displayed as functions of the sibling recurrence participation rate ratio, λ S . The results are from simulations under a liability-threshold model where the participation of an individual is determined by its liability score and the participation rate α. Given α, λ S is a function of the correlation between the liability scores of two siblings (Methods). In particular, for α = 0.055, a correlation of 0.193 between the siblings' liability scores leads to λ S = 2, the reported enrichment of sib-pairs in the UKBB data. We simulated 500 replications from a population of 5 × 10 7 sib-pairs. Allele 1 of the SNP is assumed to have a population frequency of 0.5 and the effect of the SNP is assumed to account for 0.1% of the variance of the liability score. The simulated averages of the two ratios, shown with red hollow squares and blue solid triangles respectively, are virtually indistinguishable from each other; they are roughly 1 when λ S is close to 1 and decrease gradually as λ S increases, but are always positive. The simulated average of the third ratio (F SIBS − F SING )/(F samp − f pop ) , where F SIBS is the allele frequency in the participating sibling pairs and F SING is the allele frequency in the participating individuals whose sibling does not participate, is shown with gray solid circles. For λ S = 2, the first two ratios are around 0.86 and the third ratio is around 0.32.  Table 2) are not very different. However, further research is needed to determine how best to use the WSPC and TNTC results for genetic correlation estimates. Using the BSPC GWAS results, we obtained an estimate of 12.5% for the heritability of the liability score underlying primary participation (Methods). This required making several adjustments to the LDSC-estimated heritability and involved utilizing the 0.86 estimated value for the ratio (F IBD2 − F IBD0 )/ F samp − f pop ) (Supplementary Note section 7; Methods). For α = 0.055, heritability of 12.5% would mean that the average value of the genetic component underlying the liability score is 0.72 s.d. higher for the participants than the population. Heritability estimates for the liability scores underlying the secondary participation events are distinctly smaller, ranging from 3.4% to 6.2% (Supplementary Table 3). This could be partly because these heritability estimates are computed from a biased sample. Notably, studying secondary participation is not the same as studying the bias underlying the primary sample, which affects everything that follows.
Third principle of genetic induced ascertainment bias. If genetics contribute to participation, there would be more close relative pairs among the participants than what is expected if participation is random. The third principle holds because if a participant has an above average genetic propensity to participate, so would its close relatives. Even though UKBB did not purposefully recruit families, the number of sib-pairs are twice as many as expected under random sampling 13 . It was speculated that mutual consultation and possibly shared environment contributed to correlated participation of close relatives 13 .
Another likely contributor is shared DNA. With the liability-threshold model, α = 0.055 and λ S = 2.0 correspond to a 0.193 correlation between the liability scores of sib-pairs. Assuming the liability score heritability is 12.5%, the direct genetics effect can account for (0.125/2)/0.193 = 32% of the liability score correlation.

Discussion
Participation bias, a concern for all sample surveys, is becoming increasingly relevant in the age of 'Big Data' 1,3 . In addition to obvious pitfalls, it could induce, for genetic studies, more subtle consequences that include collider bias 17 and the introduction of artificial epistatic effects (Supplementary Note section 8). Here, we show how ascertainment bias leaves footprints in the genetic data that can be exploited to study the bias itself. Our approach shares some principles with affected sib-pairs linkage analysis 22 . However, whereas the latter relies only on IBD, our method is an IBD-based association analysis.
Two of the three comparisons we proposed are sensitive to genotyping and phasing errors. This complicates analyses, but also creates a secondary usage of our method as a data-quality monitoring tool, like the Hardy-Weinberg equilibrium (HWE) test. For example, results that are significantly different between WSPC and BSPC at the MHC region would indicate data or data-processing problems. Furthermore, note that the genotyping-error induced bias also affects the transmission disequilibrium test-a test usually considered as fully robust.
There can be a common genetic component underlying many different participation events, for example, the pPGS constructed for primary participation associates with both the passive (being invited) and active (deciding to participate when invited) phases of the secondary participation events. Thus, the effect of a genetic component associated with participation could accumulate through many participation events of a person's lifespan, or it can be magnified through nested participation, for example, the participants in the dietary and physical activity studies have higher average genetic propensity to participate than the other UKBB participants, who have higher average propensities than the population. Instead of thinking of participation as a consequence of other characteristics and established traits, we propose that the propensity to participate in a wide range of events is a behavioral trait in its own right. While our method exploits information previously not utilized, even more could be achieved by combining it with other available information.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41588-023-01439-2.

Genetic data
The UKBB has a Research Tissue Bank approval (Research Ethics Committee reference 21/NW/0157) from the North West Multicenter Research Ethics Committee, and all participants gave informed consent. For our data application, we used the UKBB 500K data release previously described by Bycroft et al. 13 . We filtered out individuals that had withdrawn consent, were not in the kinship inference and phasing input, had a duplicate/twin in sample, with excess of third degree relatives, with missing rate above 2% as well as heterozygosity and missingness outliers and those who showed potential sex chromosome aneuploidy or potential sex mismatch.
The current analysis started with 658,565 biallelic sequence variants in the UKBB phased haplotype data 13 . We refer to them as SNPs even though a very small fraction are short indels. Compared with standard GWAS, some of our analyses are more sensitive to data artefacts. Thus, we applied a number of filters to the SNPs in addition to those applied by Bycroft et al. 13 (see Supplementary Note section 4 for details). Together with the trimming described below, association results were obtained for 500,632 high-quality SNPs.

Identifying relatives and the group of unrelateds
Using kinship coefficients from the UKBB data release, we identi-

Inferring IBD segments and trimming
We inferred IBD segments for sibling pairs using snipar 9 , a program designed for sib-pairs. Compared with KING 18 (v.2.2.4) a program not tailored for sib-pairs, snipar entails a substantial improvement in IBD estimation (Extended Data Fig. 2). Neither program uses phasing information as input. Considering the uncertainty around recombination events, we trimmed away 250 SNPs from the beginnings and ends of the inferred IBD segments before tabulating allele frequencies (Extended Data Fig. 3). This removed 11,000 SNPs all together (2 × 250 × 22) and reduced the sample sizes for the remaining SNPs (median reduction of 1,565 sibling pairs).

Inferring shared allele
For a biallelic SNP and a relative pair, there are five possible unordered combinations of genotypes for IBD1. The IBD shared allele is clear except when both individuals are heterozygous. For that, the phasing information provided 13 was utilized. The closest neighboring SNP, for which one individual was heterozygous and the other homozygous, was identified and used to determine the shared haplotype and, through that, the shared allele of the target SNP (Extended Data Fig. 4).

Test statistics
For each SNP, we computed separate t-statistics for TNTC, WSPC and BSPC, by dividing each of the frequency differences, equations (1) to (3), by its s.e. For TNTC and WSPC, the s.e. values were computed assuming the shared versus not-shared frequency differences of each pair were independent of each other. For BSPC, for every SNP, the allele frequencies computed for individual sibling pairs, whether IBD0 or IBD2 pairs, were assumed to be independent of each other. No further assumptions, for example, HWE or independence of genotypes of two siblings in the IBD0 case, were made. Essentially, TNTC and WSPC were treated as one-sample t-tests (of differences), and BSPC were treated as a two-sample t-test. Specific equations are given in Supplementary Note section 2, which also contains information on sample sizes.

Sources of errors inducing the major-allele bias in the test statistics
When the initial results for all the SNPs were examined together, the t-statistics showed a tendency towards favouring the major allele as having higher frequency in the shared alleles. We identified three sources to this bias: IBD estimation errors. For every sib-pair, the analysis starts with determining the IBD status for each SNP. Error here could lead to biases. This problem was addressed by replacing the KING 18 program by snipar 9 , as noted above. Together with the trimming noted above, the impact of IBD estimation errors seems to be mostly eliminated (see Supplementary Note section 3 for further discussion).

Phasing errors.
Phasing errors can induce a systematic major-allele bias in the WSPC and TNTC t-statistics. We focus more on WSPC below as it captures essential the same real effects as BSPC. The induced bias on TNTC is similar.
For TNTC and WSPC, when the genotypes of both relatives are heterozygous, to determine the shared allele, we use phased haplotypes that include neighboring SNPs. Phasing errors can thus induce a bias. Let the frequency of allele 1 be f pop = f. For simplicity, assume the SNP is in HWE in the population and not associated with participation. When two siblings share one allele IBD and both are heterozygous, the chance that the shared allele is 1 is (1 − f) . Let ε f , a function of f , be the error rate of calling the shared allele 0 when it is actually 1. By symmetry, the error rate of calling the shared allele 1 when it is actually 0 is ε 1−f . Because the probability of the shared allele being actually 1 is (1 − f), the induced bias of the two type of errors combined is Using data from 739 UKBB trios, where the shared allele between a parent-offspring pair in the double-heterozygotes case could be resolved by the genotype of the other parent, we estimated ε f (Extended Data Fig. 5), which satisfies equation (8). More details about the quantitative results and the induced bias on the t-statistics are given in Supplementary Note section 3. One observation, highlighted by Extended Data Fig. 6, is that, for SNPs with MAF > 0.10, around 50% of the observed bias of the WSPC t-statistics can be explained by the double-heterozygote errors. However, as MAF becomes small, the fraction of the empirical bias explainable by double-heterozygote errors decreases, for example, to 21% when MAF = 0.01. We believe genotype-calling errors are responsible for the additional bias.
Genotyping errors. Genotyping errors can induce a systemic bias to the WSPC and TNTC statistics. Simulations show that random errors where each genotype has a small probability to be replaced by a random genotype drawn from the population would induce a bias on F IBD1S − F IBD1NS which is positive if allele 1 has frequency >0.5. This happens even though such error mechanism would not even change the sampling distribution of the called genotypes. Furthermore, genotype error rate is known to be higher for SNPs with lower MAFs. We believe the main driving force of the major-allele bias is the minor allele being overcalled in the not-shared alleles (explanation in Supplementary Note section 3). https://doi.org/10.1038/s41588-023-01439-2

Two-step adjustment: adjustments of t-statistics and χ 2 statistics
For WSPC and TNTC, separately, we made a simple adjustment by regressing the t-statistics on centred allele frequency (cf), cf = ( f − 0.5), through the origin and took the residuals as the adjusted values. This corresponds to deducting (0.6464 × cf ) and (0.3781 × cf ) from the unadjusted t-statistics of WSPC and TNTC, respectively. This adjustment avoids artificial association with other GWASs that are also subject to a major-allele bias, but resulting from a different mechanism. If the other GWASs exhibit major-allele biases for the same reasons, this adjustment would reduce, but not eliminate, an artificial association.
Effect of the major-allele bias on the χ 2 statistics. The errors underlying the major-allele bias of the t-statistics would also inflate the average values of the χ 2 statistics. The t-statistic adjustment above will reduce, but not eliminate, this inflation. This is because the adjustment is based on allele frequency, that is, it removes the average bias of alleles with similar frequency. As SNPs with similar allele frequency will have biases that vary around the average, the variation of the adjusted t-statistics would still be inflated, and through that inflate the average value of the χ 2 statistics. Notably, the average χ 2 value for BSPC is 1.011 for the 500,632 SNPs, 1.025 for the 110,533 SNPs with MAF > 0.25, and 1.007 for the 390,099 SNPs with MAF < 0.25. By contrast, for WSPC, the corresponding average χ 2 values are 1.136, 1.039 and 1.163, respectively, without t-statistic adjustment, and 1.074, 1.029 and 1.086, respectively, after adjustment. With t-statistic adjustment, the average χ 2 value for WSPC is only modestly inflated relative to BSPC for SNPs with MAF > 0.25, but remain substantially inflated for SNPs with MAF < 0.25. Moreover, while the χ 2 statistics for BSCP are slightly positively correlated with MAF (r = 0.006), the χ 2 statistics for WSCP, even with t-statistic adjustment, have a larger but negative correlation with MAF (r = −0.021). The latter is because the major-allele bias of an SNP increases as MAF decreases. The negative correlation between the WSPC χ 2 values and MAF is problematic for the application of LD score regression as the LD scores have a substantial positive correlation (r = 0.35) with MAF. Without further adjustments, applying LD score regression to the WSPC χ 2 values gives a negative fitted slope, or a negative estimated heritability. To address this, we performed MAF -specific genomic control. Specifically, we started by regressing the χ 2 values computed from the adjusted t-statistics on polynomial of MAF up to the third power. The fitted values of the χ 2 values as a function of MAF are displayed in Extended Data Fig. 7. The χ 2 values were then adjusted by dividing the original values by the fitted values. By construction, these χ 2 values have average equal to 1. Taking the square-root and multiplying by the sign of the adjusted t-statistics gave the 'final' z-(or t-) scores of the WSPC GWAS. The same method was used for the TNTC GWAS. These z-scores were used to evaluate the significance of individual SNPs and to compute the polygenic scores used for Table 1. These adjustments reduce the impact of the artefacts, but the results are still imperfect (see Supplementary Note section 3 for further discussion). Thus, the LD score regression estimates of genetic correlations in Table 2 of the main text are based on the BSPC results only, and so is the heritability estimate given. However, the estimates in Supplementary Table 2, based on the adjusted WSPC results, are broadly consistent with those in Table 2, supporting that the adjustments have reduced the problems that arise in the application of LD score regression.
When we use the GWAS results to construct polygenic scores, we find the adjustments described above to have a very small effect on the predictive power of the polygenic scores. In particular, the polygenic score constructed from the WSPC t-statistics, with or without adjustments, has very similar predictive power as the polygenic score constructed from the BSPC t-statistics (Table 1 and Supplementary Table 1). This is because the bias is quite small per SNP for this filtered set and so would only be adding a little noise to the polygenic score prediction.

Polygenic score analysis
The pPGS was computed with PLINK 1.90 (ref. 24) by summing over the weighted genotypes of the 500,632 SNPs fulfilling quality control, using the z-(or t-) statistics from the primary participation GWASs as weights. The relationship between the pPGS, standardized to have a variance of 1, and the phenotypes in Table 1 was estimated with a linear regression and logistic regression in R (v. 3.4.3) in the group of White British unrelateds, taking sex, YOB, age at recruitment, genotyping array and 40 PCs into account. The quantitative phenotypes were transformed to have a variance of 1 for men and women separately (for further information see Supplementary Note section 5). To account for population stratification, the P value for each pPGS-phenotype association was adjusted through dividing the squared test statistic (t-test for quantitative phenotypes and z-test for binary phenotypes) by the LD score regression intercept estimated from GWAS summary statistics for the corresponding phenotype (see below).

LD score regression
We performed LD score regression with the program LDSC (v.1.0.1) (ref. 19) using the European Ancestry LD scores computed by the Pan-UKB team 23 (downloaded on 7 April 2021). Analyses were based on the 500,632 SNPs used to compute the pPGS.
LD score regression intercepts and genetic correlations were estimated for the primary participation test statistics described above and the phenotypes shown in Table 1. We obtained summary statistics for the phenotypes shown in Table 1 by running GWASs in the group of White British unrelated individuals in the UKBB using BOLT-LMM (v.2.3) (ref. 25). For quantitative phenotypes, which had been adjusted for sex, age, YOB and 40 PCs and transformed to have variance of 1 as described in Supplementary Note section 5, genotyping array was added as an additional covariate in the GWASs. For the binary phenotypes, YOB, age at recruitment up to the order of three, 40 PCs, sex and genotyping array were added as covariates. For LD score regression, we used the standard linear regression P values, P_LINREG, from the BOLT output. Heritability of participation traits was estimated for a liability-threshold model (see below).

Liability-threshold model for participation of individuals and sib-pairs
The liability-threshold model assumes that a liability score, denoted here by X , underlies a 0/1 trait or response. X is assumed to have a standard normal distribution (or roughly so). Let I be the 0/1 participation variable, and participation rate is P (I = 1) = α . It is assumed (1 − α) and Φ is the cumulative distribution of the standard normal. Correlation of participation between individuals is modeled through the correlation of their liability scores.
For n sib-pairs, let i = 1, … , n index the pairs and j = 1, 2 index the two siblings in a pair. Focusing on one SNP with its standardized genotype denoted by g, the liability of sibling ij is modeled as (9) where A i and B ij are standard normal variables. The variables A i , B i1 and B i2 are assumed to be independent of g i1 and g i2 , and each other. A i captures effects from shared environment as well as shared genetic factors other than g. We assume w 2 1 + w 2 A + w 2 B = 1 so that var(X ij ) = 1. Because cor(g i1 , g i2 ) = 1/2, cor (X i1 , X i2 ) = w 2 A + w 2 1 /2. Here X ij is not exactly normally distributed because of g ij , but it is close if w 1 is small.
Simulations to study relationships between various frequency differences of an SNP. Results in Fig. 4 were simulated with α = 0.055, the participation rate of UKBB. Allele 1 of the SNP is assumed to have a https://doi.org/10.1038/s41588-023-01439-2 population frequency of 0.5 and a positive participation effect with  13 .
Simulations for dominant and recessive models were performed similarly. For the dominant model, the g ij in the liability score definition is taken as the standardized version of a 0/1 variable, which is 1 if the actual genotype is 1 or 2, and 0 otherwise. For the recessive model, g ij is the standardized version of a 0/1 variable, which is 1 if the actual genotype is 2.

Estimating heritability for liability scores
Here, we show how the heritability of the participation traits were estimated, starting with secondary participation. When the LDSC program is given the χ 2 statistics and sample sizes for a set of genome-wide SNPs, the heritability estimate produced is an estimate of the fraction of variance of the trait accountable by the genetic component, or r 2 between trait and the genetic component. If genetic component G influences the participation variable I through liability score X , then cor (G, I) = cor (G, X) × cor (X, I) (10) Heritability of X and I thus satisfy with cor(X, I) = E(XI) − E(X)E(I) √(var(X)var(I)) = αE(X|I = 1) √α(1 − α) (12) where ϕ is the density function of the standard normal and E stands for expectation. The heritability of the liability scores of the other secondary participation events are similarly estimated. The r 2 between a genetic component, or the genotype of an individual SNP, is weaker, or statistically less efficient, with I than with X . In this sense, the adjustment factor is inversely proportional to the statistical efficiencies of the test statistics used, a principle that also applies to the two other adjustments described below. Moreover, Supplementary Note section 7 describes how the adjustment could be alternatively applied through providing the LDSC program with modified sample sizes.
With primary participation, heritability estimation requires the adjustment step above plus two others because the BSPC results are not direct comparisons of genotypes of participants and nonparticipants. Let n IBD2 and n IBD0 be the respective number of IBD2 and IBD0 pairs at the SNP location. Feeding the LDSC program the χ 2 statistics from the BSPC GWAS with number of 'cases' equals n IBD2 and number of controls equals 2 × n IBD0 , the estimated heritability given is 0.0838.
Here α = 0.055, τ = 1.598 and cor (X, I) = 0.488. Hence, the first adjustment factor is (1/0.488) 2 = 4.2. However, the numbers of participants and nonparticipants have a 0.055 to 0.945 ratio. By contrast, the cases to controls ratio for BSPC is around 1:2, much more efficient as the variance of the comparison is roughly proportional to 1 n cases + 1 n controls (14) As variance here is inversely proportional to efficiency, this leads to the adjustment factor, for arbitrary n, 18.18 + 1.06 = 0.2338 (15) There is a third adjustment because the allele frequency difference between the IBD2 sibs and the IBD0 sibs is smaller than the frequency difference between the participants and nonparticipants. Specifically, for α = 0.055 and λ S = 2.0, our simulations gave ] ≈ 0.86 × (1 − α) = 0.86 × 0.945 = 0.8127 (17) Because efficiency is proportional to effect 2 , the adjustment factor is ( 1 0.8127 ) 2 = 1.514 (18) With the three adjustments, the estimated heritability of primary participation is 0.0838 × 4.2 × 0.2338 × 1.514 = 0.125 (19) or 12.5%.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The primary participation GWAS summary statistics generated in this study have been deposited to the GWAS catalog under the accession codes GCST90267220, GCST90267221, GCST90267222 and GCST90267223. Researchers can apply for access to individual-level UKBB data on their website (http://www.ukbiobank.ac.uk/ register-apply/).

Code availability
The genotype data was handled with QCTOOL v.  The two solid lines show the fit from regressing the χ 2 statistics, computed from the allele-frequency adjusted TNTC and WSPC t-statistics, on MAF up to the third power. The t-statistics are for the 500,632 SNPs. The fitted value for a particular MAF can be interpreted as the average χ 2 values for SNPs with MAFs close to that. The broken line is the corresponding fit for the BSPC χ 2 statistics. Given that BSPC and WSPC capture similar true effects with comparable power, the difference between the MAF-specific fitted/average χ 2 values is a measure of the average inflation of the WSPC χ 2 values. Notably, for WSPC, the fitted χ 2 value is much higher for SNPs with low MAFs. By contrast, the fitted χ 2 value for BSPC has an increasing trend as MAF gets bigger. When MAF is low, the WSPC fitted χ 2 value is substantially higher than that of BSPC, indicating that data errors are inducing a higher inflation there. As MAF increases, the difference between the WSPC and BSPC fitted χ 2 values decreases. The fitted χ 2 value of BSPC actually becomes slightly bigger than that of WSPC for MAF > 0.46, although that difference is not statistically significant. This is consistent with the WSPC results being close to unbiased when MAF is close to 0.5, which makes sense as the difference between major and minor alleles is small, and so is the major allele effect, when MAF is close to 0.5. The TNTC fitted χ 2 value is in general smaller than that of WSPC. That is mainly due to the smaller effective sample size of TNTC, which affects the contributions of both the true effect and the bias to the χ 2 statistics.