Mendelian imputation of parental genotypes improves estimates of direct genetic effects

Effects estimated by genome-wide association studies (GWASs) include effects of alleles in an individual on that individual (direct genetic effects), indirect genetic effects (for example, effects of alleles in parents on offspring through the environment) and bias from confounding. Within-family genetic variation is random, enabling unbiased estimation of direct genetic effects when parents are genotyped. However, parental genotypes are often missing. We introduce a method that imputes missing parental genotypes and estimates direct genetic effects. Our method, implemented in the software package snipar (single-nucleotide imputation of parents), gives more precise estimates of direct genetic effects than existing approaches. Using 39,614 individuals from the UK Biobank with at least one genotyped sibling/parent, we estimate the correlation between direct genetic effects and effects from standard GWASs for nine phenotypes, including educational attainment (r = 0.739, standard error (s.e.) = 0.086) and cognitive ability (r = 0.490, s.e. = 0.086). Our results demonstrate substantial confounding bias in standard GWASs for some phenotypes.

and cognitive ability, effects estimated from standard GWASs provide inaccurate estimates of direct genetic effects.
results Single-locus model. We consider a model for the effect of a SNP on the phenotypes of two siblings. (We consider a model for two siblings for the purposes of exposition, but our missing-data framework can handle any number of siblings, with or without phenotype observations.) Let Y ij be the phenotype of sibling j in family i. Then Y ij = δg ij + αpg p(i) + αmg m(i) + ϵ ij ; (1) where g ij is the genotype of sibling j in family i, δ is the direct effect of the SNP and g p(i) and g m(i) are the genotypes of the father and mother in family i. SNPs are assumed to be biallelic with alleles '0' and '1' , and genotypes are counts of the allele '1' with frequency f. Sibling genotypes are conditionally independent of environmental effects given parental genotypes. Therefore, estimates of direct effects from fitting model (1) are unbiased 12 . We refer to α p and α m as 'nontransmitted coefficients' (NTCs), as they are the expected coefficients on the alleles not transmitted to the proband in a regression of proband phenotype on proband genotype and nontransmitted alleles 4 . The NTCs capture IGEs from relatives, in addition to confounding due to population stratification and AM 3,4 . The residual ϵ ij is uncorrelated with g ij , g p(i) , and g m(i) , but ϵ i1 and ϵ i2 may be correlated. Note that standard GWAS methods that regress proband pheno type onto proband genotype estimate the 'population effect' , β = δ + (α p + α m )/2, which is the direct effect plus the average NTC, α = (α p + α m )/2. We also consider a model that adds IGEs from siblings: Y i1 = δg i1 + η s g i2 + (αp − η s /2) g p(i) + (αm − η s /2) g m(i) + ϵ i1 ; Y i2 = δg i2 + η s g i1 + (αp − η s /2) g p(i) + (αm − η s /2) g m(i) + ϵ i2 ;  Displayed are cases where one offspring is phenotyped (the proband), but our framework can handle cases where both offspring are phenotyped. We distinguish between imputations that are linear functions of observed genotypes and imputations that are nonlinear functions of observed genotypes, and thereby add information for estimating the parameters of models (1) and (2). There are seven cases (c-g, j and k) where nonlinear imputations are possible, but we note that in some of these cases (c-e, j and k), the resulting variance-covariance matrix of the observed and imputed genotypes is not of full rank (Extended Data Fig. 1), implying that the full parameter vector of model (2) cannot be identified based on data of that type alone. For example, in case e, the imputed paternal and maternal genotypes are the same, but the imputed sum of paternal and maternal can be used to estimate the parameters of model (3). We detail how to combine information from different data types below. Although we show the case of two offspring here to simplify exposition, our imputation method and software (snipar) can handle any number of genotyped offspring (Methods and Supplementary Note Sections 3 and 5). g par(i) = 0 + 0 + 1 + 1 g par(i) = 0 + 1 + 1 + f g par(i) = 0 + 1 + f + f Fig. 2 | Imputation of missing parental genotypes. a, Imputation from sibling pairs (Fig. 1e). Given knowledge of the IBD state of the siblings' alleles (alleles coded by '0' and '1'), the sum of the maternal and paternal genotypes can be imputed (ĝ par(i) ) . If the siblings do not share any alleles IBD, then all four parental alleles are observed (IBD0). If the siblings share one allele IBD, then three parental alleles are observed (IBD1). If the siblings share both alleles IBD, then two parental alleles are observed (IBD2). When parental alleles are unobserved, we impute them with the frequency of allele 1, f. The IBD state between siblings changes with the recombination events that occurred during meiosis in the parents and can be inferred (Methods and Supplementary Note Section 9). b, Shows how phased data can be used to determine which allele is shared between two individuals who share one allele IBD at a SNP where both are heterozygous. This applies to sibling pairs in IBD1 and parent-offspring pairs, who always shared one allele IBD. A neighboring SNP that has been phased with the target SNP and is homozygous for one individual and heterozygous for the other is used to resolve the uncertainty. For the individual on the left, the 0 allele must be the allele shared with the other individual at the neighboring SNP; thus, through the phased haplotype '1-0' (hap1), it is determined that the 1 allele is the shared allele at the target SNP.  39 , one that uses IBD and unphased data (red) and one that uses IBD and phased data (black). Effective sample size is measured relative to that from using sibling genotypes alone without any imputation and assuming that we have a sample of independent families with two genotyped and phenotyped siblings in each family (Supplementary Note Section 4). a, Effective sample size for estimation of the direct genetic effect when MAF is 20% as a function of correlation between siblings' residuals. b, Effective sample size for estimation of direct genetic effects as a function of MAF when the correlation between siblings' residuals is zero. (Results follow a similar pattern for other sibling correlations.) For imputation from unphased data, when both siblings are heterozygous and share one allele IBD, which allele is shared IBD cannot be determined (Fig. 2b), so the imputation averages over the two possibilities. When phased data are used, the observed parental alleles can be determined, so the relative efficiency does not depend upon MAF. c, The same as for a, but for average NTC. d, The same as for b, but for average NTC.
where η s is the IGE from the sibling. Because both proband and sibling genotypes are conditionally independent of environment given parental genotype, estimates of δ and η s from fitting model (2) are unbiased 4 . Because a parental allele has a 1/2 chance of being passed onto a sibling, the NTCs include one half of the indirect sibling effect, but this is removed from the coefficients on the parental genotypes in model (2) due to inclusion of the sibling genotype.
Imputing missing genotypes in a nuclear family. Genotypes in the complete-data model (2) that are unobserved are treated as missing data and imputed based on Mendelian laws. Imputations that are linear functions of observed genotypes do not add infor mation for estimation of the parameters of models (1) and (2). However, there are seven cases out of the 2 4 − 1 = 15 complete-missing data patterns ( Fig. 1 and Extended Data Fig. 1) where nonlinear imputations are possible. These seven cases can be divided into three equivalence classes (up to symmetry): genotyped sibling pairs (Fig. 1e), genotyped parent-offspring pairs (Fig. 1c,d,j,k) and genotyped sibling pairs with one genotyped parent (Fig. 1f,g). When two or more siblings are genotyped, we use the IBD state of the siblings to determine which parental alleles have been observed (Fig. 2a). We provide a method to infer the IBD states of siblings in snipar (Methods and Supplementary Note Section 9). In certain cases, phased genotypes are required to determine which parental alleles have been observed (Fig. 2b), although imputation can proceed without phased data at the cost of lower accuracy. See Methods and Supplementary Note Sections 3 and 5 for further details.
Estimating effects using imputed genotypes. We replace unobserved parental genotypes with their imputed values to estimate direct effects and NTCs. We estimate the direct effect and NTCs of each SNP as fixed effects in an LMM, which includes a family-level random effect, thereby accounting for phenotypic correlations between siblings (Methods). In Supplementary Note Section 2, we prove general theoretical properties of multiple regression using this type of imputation: where unobserved covariates are replaced with their expectations given the observed covariates. We prove that estimates remain unbiased and consistent, and that the empirical sampling variance-covariance matrix of the estimates is an unbiased estimate of the true sampling variance-covariance matrix.
Estimating direct effects using parent-offspring pairs. Consider a sample of families where the genotype of the proband and its mother have been observed but the father's genotype is unobserved (Fig. 1d). If we impute the father's genotype as ĝ p(i) = E[g p(i) |g i1 , g m(i) ] (Methods), then our theoretical results imply that by performing the regression we obtain unbiased and consistent estimates of δ, α p , and α m . If no imputation is performed, then it is impossible to obtain unbiased estimates of δ without making an additional assumption, such as α p = α m . When using phased data, the effective sample size for estimation of direct genetic effects relative to complete observation of parental genotypes (Fig. 1a,b) is approximately equal to 1/2 (Supplementary Note Section 5.3). With unphased data, this increases from a minimum of 1/6 when minor allele frequency (MAF) is 0.5 to a maximum of 1/2 as MAF approaches 0 (Extended Data Fig. 2). Imputation from siblings increases power. Many previous analyses regressed phenotypic differences between siblings onto genotypic differences 7,9,17 . In model (2), this corresponds to: This method yields unbiased estimates of δ − η s . When genotypes are imputed from sibling data (Fig. 1e), we have no information on differences between maternal and paternal genotypes, only their sum. We can express model (2) as (Supplementary Note Section 1) where α ′ = (αp + αm − η s ) /2, and for some ϵ ′ i1 , ϵ ′ i2 that are uncorrelated with the siblings' genotypes and g par(i) . By imputing g par(i) as the conditional expectation given g i1 and g i2 and the IBD state of the siblings (Fig. 2a), we can obtain unbiased estimates of the parameter vector (δ, η s , α′) by regression of phenotype jointly onto proband, sibling and imputed parental genotype. Note that by performing Proband ( Fig. 1h) Sibling pairs ( Fig. 1e) Father-child pairs (Fig. 1c) Mother-child pairs (Fig. 1d) In the first column, we give the data type in terms of the observed genotypes in the nuclear family, referencing the relevant panel of Fig. 1; in the second column, we give an example of a regression that could be performed using that data type and parental genotypes imputed from the observed genotypes; in the third column, we give the expected column vector of regression coefficients from performing the regression. Y i1 is the phenotype of sibling 1 in family i; g ij is the genotype of sibling j in family i; g p(i) is the paternal genotype, and ĝ p(i) is the imputed paternal genotype; g m(i) is the maternal genotype, and ĝ m(i) is the imputed maternal genotype; ĝ par(i) is the imputed sum of maternal and paternal genotypes; δ is the direct effect; η s is the indirect sibling effect; and α p and α m are, respectively, the paternal and maternal NTCs. Proband and sibling(s) but no parents ( Fig. 1e) 35,197 Proband and parent (Fig. 1c, d) 3,216 Proband, parent and sibling(s) (Fig. 1f, g) 312 Proband and both parents (Fig. 1b) 832 Proband, both parents and sibling (Fig. 1a) 62 Total 39,619 the imputation, we are able to separately estimate δ and η s , whereas the sibling difference approach can only estimate δ − η s . This shows how Mendelian imputation enables identification of more complex models than approaches that do not perform imputation. Although our method is able to distinguish IGEs through siblings from direct genetic effects, more precise estimates of direct genetic effects can be obtained by assuming η s = 0, at the cost of some bias if η s ≠ 0. Letting r be the correlation of the siblings' residuals (which will be approximately equal to their phenotypic correlation when the SNP explains a small fraction of the phenotypic variance, as is typical for complex human traits), then the bias in estimates of δ is −[(1 + 2r)/(2 + r)]η S (Supplementary Note Section 4.3). This is smaller than the bias from regression on differences in sibling genotype, which is −η s . For the following results, η s = 0 is assumed unless otherwise stated.
Using parental genotypes imputed from phased data increases the effective sample size for estimation of δ by a factor of 1 + 1−r 3(1+r) ≥ 1 relative to using genetic differences between siblings (Supplementary Note Section 4). This has a maximum of 4/3 at r = 0 (Fig. 3). We confirmed the theoretical result using simulated data (Extended Data Fig. 3). For estimation of the average NTC, α = (αp + αm)/2, the effective sample size is increased by a factor of 1 + 1−r/2 2(1+r) ≥ 1.125, which has a maximum of 1.5 when r = 0. For estimation of both the direct effect and the average NTC, the gain is somewhat lower when using unphased data, depending upon MAF and r (Fig. 3). Using parental genotypes imputed from phased data always gives more precise estimates of direct effects and average NTCs than using unphased data or genetic differences between siblings (Supplementary Note Section 4).
Combining different missing data types. The different missing data patterns in Fig. 1 Table  1. Although not all data types enable identification of (δ, η s , α p , α m ), they can contribute to an overall estimate of (δ, η s , α p , α m ) when the combination of data types enables identifiability. If genotypes are observed or imputed as outlined above, then a single regression that combines all the data types together gives consistent estimates of the full parameter vector provided that the resulting regression design matrix is not collinear (Supplementary Note Section 2). This is the approach we adopt for applications to real data (Methods). Alternatively, a form of multivariate meta-analysis can be used (Supplementary Note Section 6).
Imputing missing parental genotypes in the UKB. We applied our methods to the 'White British' subsample of the UKB 22 (Methods). Using KING 23 , we identified a sample of 39,619 individuals for which parental genotypes were observed or could be imputed ( Table 2). We inferred IBD segments for sibling pairs using snipar (Methods). We validated the IBD inference using 31 families with two siblings and both parents genotyped, finding that the IBD states were correct 99.65% of the time (Supplementary Table 1). The IBD sharing statistics of the siblings were close to theoretical expectations (Extended Data Fig. 4).
Using snipar, we imputed missing parental genotypes from phased haplotypes for 1,586,010 SNPs, the union of the genotyping array SNPs and HapMap3 SNPs with MAF > 1%. We found that there was negligible bias in the imputed genotypes (Methods).
We tested the performance of our method in realistic simulations based on genetic data from the UKB 'White British' sample (Supplementary Note Section 12.2). We simulated traits affected by AM, parental IGEs, vertical transmission 24 (where a phenotype of the parent(s) affects the phenotype of the offspring through the environment), vertical transmission and AM 25 and population in phenotype per s.d. change in PGI) is given along with the 95% confidence interval. Estimates were derived from a sample of 39,619 individuals with imputed and/or observed parental genotypes. Effects were estimated by joint regression of individuals' phenotypes onto their own PGI, and their mother's and father's (imputed or observed) PGIs (Methods). We give the average of the estimated maternal and paternal NTCs, adjusted for bias due to imputation in the presence of AM (Methods), as the 'average NTC' here, and the difference between maternal and paternal NTCs as 'maternal minus paternal'. Phenotypes were adjusted for 40 genetic PCs before analysis. HDL, high-density lipoprotein cholesterol; BMI, body mass index; FEV1, forced expiratory volume in one second. strati fication. We did not detect any bias in estimates of direct effects across the simulated traits (Supplementary Table 2).

Direct and indirect effects of an education PGI.
We analyzed the effects of an EA PGI on nine phenotypes in the UKB (Methods). By using observed and imputed parental PGIs, we obtained unbiased estimates of the indirect sibling effects (η s ) of the EA PGI (Extended Data Fig. 5 and Supplementary Table 3), which were not statistically significant for any phenotype (P > 0.05, two-sided Z-test). Assuming η s = 0, we obtained more precise estimates of direct effects and average NTCs ( Fig. 4 and Supplementary Table 4).
Statistically significant estimates of direct effects were obtained for all phenotypes; and statistically significant average NTCs (P < 0.05, two-sided Z-test) were obtained for all phenotypes other than eversmoked and neuroticism. Estimates of maternal minus paternal NTCs were not statistically significant for any of the phenotypes (P > 0.05, two-sided Z-test), although power for this analysis was limited.
Across the phenotypes, the direct/(direct + average NTC) ratio was similar to a previous analysis of an EA PGI in Icelandic data 4 , except for EA, where here the ratio was 0.50 (compared to 0.70 in Iceland). The fraction of variance explained by the direct effect is the square of this ratio, implying that only 25% of the variance explained by the EA PGI is due to its direct effect alone (compared to 49% in Iceland).

Genome-wide association analyses for nine phenotypes.
We estimated the direct effects, NTCs and population effects of 1,586,010 SNPs with MAF > 1% on nine phenotypes (Methods). Phenotypes were adjusted for 40 genetic PCs before SNP effects were estimated. In our PGI analyses (above), we found no evidence for substantial IGEs from siblings. Therefore, to increase precision, we estimated effects assuming η s = 0.
At these sample sizes, power is limited for analysis of direct effects and NTCs of individual SNPs. We therefore focused on estimating the genome-wide correlation between direct and population effects, r(δ, β) (Methods). This measures the degree to which population effects, as estimated by standard GWAS, reflect direct genetic effects. We also estimated r(δ, α), the genome-wide correlation between direct effects and average NTCs. To estimate the correlations, we used a moment-based estimator that adjusts for the known sampling variance-covariance matrix of the estimates (Supplementary Note Section 11).
We estimated r(δ, β) and r(δ, α) for the phenotypes simulated from genetic data from the UKB 'White British' subsample (Supplementary Table 2). For phenotypes affected by direct effects and parental IGEs in a random-mating population, r(δ, α) is the correlation between direct effects and average parental IGEs, which our simulation results confirmed.
A plausible model for IGEs is vertical transmission 24 . We simulated a phenotype affected by vertical transmission for 20 generations of random mating, reaching an approximate equilibrium. For this phenotype, r(δ, β) = 0.953 (s.e. = 0.009).
When there is population stratification or AM, average NTCs (and therefore population effects) capture effects due to other genetic and environmental factors with which the SNP is correlated due to nonrandom mating, in addition to IGEs from relatives. We simulated 20 generations of AM for the same vertical transmission phenotype model, reaching an approximate equilibrium. For this phenotype, r(δ, β) = 0.883 (s.e. = 0.009). For a phenotype affected by direct effects and a random environmental component (no indirect effects or vertical transmission), r(δ, β) = 0.949 (s.e. = 0.008) after 20 generations of AM. We also simulated a phenotype affected by direct effects and population stratification, for which r(δ, β) = 0.917 (s.e. = 0.007). These results show that population stratification, AM, and vertical transmission, along with their interactions, can lead to r(δ, β) substantially below 1.
Evidence for residual stratification. To investigate whether residual population stratification that persists after adjustment for PCs 7,8 contributes to the low correlations between direct and population effects for EA and cognitive ability, we adjusted those phenotypes for birth coordinates and the location where each individual was assessed, in addition to PCs (Methods). This increased the estimated correlation for EA to r(δ, β) = 0.791 (s.e. = 0.066), an increase of 0.053 (s.e. = 0.045; P = 0.124 from a one-sided Z-test for an increase); and increased the estimated correlation for cognitive ability to r(δ, β) = 0.568 (s.e. = 0.088), an increase of 0.078 (s.e. = 0.064; P = 0.113).
PCs based on rare variants or IBD sharing capture recent population structure better than PCs based on common variants 8 , which we used to adjust the phenotypes in our primary analysis. To better The estimate is given along with the 95% confidence interval. Direct effects are causal effects due to inheritance of alleles; population effects are estimated by standard GWASs and include direct effects, indirect effects from relatives and confounding due to population stratification and AM. We estimated the correlation between direct and population effect estimates using summary statistics derived from a sample of 39,619 individuals from the 'White British' subsample of the UKB where parental genotypes were imputed, using the developed methods, or observed (Methods). Phenotypes were adjusted for 40 genetic PCs before analysis. We do not show the results for age at first birth in women and ever-smoked here due to their large standard errors (see Supplementary Table 5 To further investigate the contribution of residual stratification, we computed correlations between genetic associations with birth coordinates (adjusted for PCs) and direct effects, average NTCs, and population effects (Methods and Supplementary Table 6). The correlations with birth coordinates reflect the degree to which SNP effects on phenotypes, after adjustment for PCs, are correlated with the geographic structure in the population 27 . Estimated correlations between birth coordinates and direct effects were attenuated toward zero relative to correlations between birth coordinates and average NTCs and population effects. Furthermore, correlations between birth coordinates and average NTCs and population effects tended to line up along a south-east to north-west axis (Extended Data Fig. 6), likely reflecting the phenotypes' correlations with socioeconomic status, genetic structure in the UK population and geographic variation in socioeconomic status across the United Kingdom 27,28 .

Correlations across phenotypes for direct and population effects.
We sought to test whether correlations between population effects on EA and cognitive ability and population effects on other phenotypes are inflated due to IGEs and confounding factors. To remove the influence of sampling correlations between effect estimates due to overlapping samples, we estimated population effects in a sample of 276,419 unrelated individuals who are unrelated (third degree or less) to the sample used in our primary analysis. We used BOLT-LMM to estimate population effects on the same nine pheno types as in our primary analysis as well as north and east birth coordinates (Methods). We refer to the population effects estimated in this sample as β BOLT for an unspecified phenotype and β BOLT:EA to refer to a specific phenotype (in this case EA). This enabled us to estimate, without needing to adjust for sampling correlations, correlations between direct effects on EA and cognitive ability (δ EA and δ cog ) from our primary analysis and population effects on nine phenotypes and north and east birth coordinates from the unrelated sample (β BOLT ), which we refer to as r(δ EA , β BOLT ) and r(δ cog , β BOLT ). We compared these to correlations between population effects on EA and cognitive ability (β EA and β cog ) from our primary analysis and population effects on nine phenotypes and north and east birth coordinates from the unrelated sample (β BOLT ), which we refer to as r(β EA , β BOLT ) and r(β cog , β BOLT ) (Supplementary Tables 5 and 7).
Across the phenotypes, population effects were more strongly correlated with population effects on EA and cognitive ability than with direct effects on EA and cognitive ability (Fig. 6), supporting the hypothesis that IGEs and confounding factors inflate correlations between population effects. The largest estimated difference was r(β cog , β BOLT:EA ) − r(δ cog , β BOLT:EA ) = 0.319 (s.e. = 0.073), suggesting either shared confounding in population effects on EA and cognitive ability, or shared IGEs that are not highly correlated with direct effects on cognitive ability, or both.

Discussion
We introduce Mendelian imputation as a tool to perform genetic association analyses. Conceptually, this is similar to multipoint linkage analysis performed with pedigrees 29 , familial imputations 30,31 and methods in animal breeding 19 . However, our approach focuses on imputing missing parental genotypes in a nuclear family rather than in the large pedigrees typical in animal breeding. We found that our imputation method improves on a recent method developed for imputing missing genotypes in a nuclear family, AlphaFamImpute 21 , both in terms of bias and R 2 between imputed and actual parental genotypes (Methods and Supplementary  , on the x axis. The vertical error bars give the 95% confidence intervals for the correlation with the direct effects, and the horizontal error bars give the 95% confidence intervals for the correlation with population effects. We give numerical results along with P values for differences between correlations with direct and population effects in Supplementary Table 7. AAFB, age at first birth (in women); north, north birth coordinate; east, east birth coordinate.
heterozygous cases (Fig. 2b). Our approach could be extended to use genotypes of other relatives of the missing parent(s). However, many data sets contain genotypes of siblings and/or parent-offspring pairs from families that have no known pedigree relation, and for these data sets, our approach provides close to optimal recovery of the genotypes of the missing parent(s).
Mendelian imputation, used appropriately, produces unbiased estimates of parameters along with valid sampling errors. This makes it rather unique among single imputation methods in situations where the amount of missing information is substantial 32 . These properties derive from the fact that missing genotypes are imputed as nonlinear functions of observed genotypes, using Mendelian laws, thereby adding information for parameter estimation without introducing noise. Mendelian imputation enables integrated analysis of different data types, maximizing power and enabling identification of models than cannot be identified without imputation.
We examined the degree to which GWAS estimates reflect direct effects by estimating the genome-wide correlation between direct and population effects, finding that population effects and direct effects are not highly correlated (<0.9) for EA and cognitive ability. We found evidence that this is in part due to recent structure in the population that is captured by PCs of the IBD relatedness matrix, but not by PCs computed from common variants 8 . Our simulation results (Supplementary Table 2) suggest that a combination of vertical transmission and AM 24,25 may also contribute to the low correlation between direct and population effects.
Another phenomenon that may contribute is ascertainment: If direct effects and NTCs are not very strongly correlated, but both are also correlated with ascertainment, then collider bias 33 could push the correlation estimate in the negative direction and reduce correlations between direct and population effects. Analysis of simulated phenotypes under ascertainment supports this hypothesis, where strong ascertainment reduced r(δ, α) to −0.264 (s.e. = 0.091) for a phenotype with uncorrelated direct effects and parental IGEs (true r(δ, α) = 0), which may explain why we observed negative r(δ, α) for cognitive ability and neuroticism, as observations for these phenotypes are ascertained for higher education and lower neuroticism 34,35 .
If population effects are not highly correlated with direct effects, then this has implications for genetic prediction methods. For example, for prediction of differences between embryos, only direct effects are relevant, so selecting embryos using population effects would perform poorly compared to using direct effects 36,37 (assuming equal precision of direct and population effect estimates) and could introduce confounding related biases. Confounding related biases can also lead to spurious inferences in mendelian randomization 10 and studies of selection 7,11 . If NTCs are substantial and imperfectly correlated with direct genetic effects, then prediction accuracy could be increased by including predictors based on NTCs of parental genotypes in addition to predictors based on proband genotypes.
We found evidence that correlations between GWAS summary statistics on many of the nine phenotypes examined here and EA and cognitive ability are inflated by factors other than direct effects. Application of the methods developed here to larger sample sizes will enable us to estimate the relative contribution of direct effects, IGEs, and confounding factors to estimates of genetic correlations 38 .
By analyzing an EA PGI, we observed a lower direct/(direct + average NTC) ratio for EA than was observed in Iceland 4 . This implies that the combined influence of IGEs, population stratification, and AM is stronger in the UK than in Iceland. The PGI was constructed from standard GWAS summary statistics, so the average NTCs of the PGI could reflect bias and/or IGEs in the original GWAS summary statistics. Future studies could examine prediction using PGIs constructed from direct effect estimates, which do not have these biases.
Collection of genetic data on close relatives is inevitable as sample sizes grow larger. However, samples of close relatives will remain much smaller than samples of distantly related individuals. We see data on unrelated individuals as one possible pattern of missing data in a framework for human genetic analysis that treats the nuclear family as the fundamental unit of analysis rather than the individual. By combining information from different missing data patterns, we will be able to construct a more accurate picture of the role of genetics in human phenotype variation.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/ s41588-022-01085-0.

Methods
Imputation from sibling pairs. Given the genotypes of a sibling pair (Fig. 1e) and the IBD state of the alleles (which alleles are shared by descent from the parents), the observed parental alleles can be determined (Fig. 2a).
Because parent-of-origin cannot be determined from sibling data, we impute the sum of maternal and paternal genotypes (Fig. 2). Let g par(i) = g m(i) + g p(i) be the sum of the genotype of the mother ( g m(i) ) and the genotype of the father (g p(i) ) in family i, and let g i1 and g i2 be the genotypes of the two siblings. We compute , where IBD i is the IBD state of the two siblings. Because all four alleles are observed in IBD state 0, we have that When we do not observe a parental allele, we impute it using the population frequency of allele 1, f. Therefore, When the siblings share one allele IBD, let g ¬s i2 be 1 if the allele in sibling 2 that is not shared IBD with sibling 1 is allele 1, and let g ¬s i2 be 0 otherwise. If we impute by then the squared correlation between imputed and observed parental genotype is ¾. This is because we observe two parental alleles in IBD2, with probability ¼; three parental alleles in IBD1, with probability 1/2; and four parental alleles in IBD0, with probability ¼; giving an average of 3 observed parental alleles. The squared correlation is higher than based on best linear unbiased imputation, 2/3 (Supplementary Note Appendix D).
When in IBD1 (the siblings share one allele IBD), the alleles not shared are known unless both siblings are heterozygous. When both are heterozygous, information from neighboring phased SNPs can be used to resolve the uncertainty (Fig. 2b). However, without phased data, imputation can proceed by averaging over the two possibilities (shared allele is 0 versus shared allele is 1), giving at the cost of lower correlation with the unobserved parental genotype (Supplementary Note Section 3.2). We generalize the above approach to imputing from genotype observations on three or more siblings (Supplementary Note 3.2.1). When n i siblings have been observed in family i, on average we observe 4 ( 1 − 2 −ni ) parental alleles, so the imputation approaches full recovery of the combined parental genotype as the number of siblings increases. Imputation from parent-offspring pairs. Consider imputing the genotype of a proband's father given observations of the proband and mother's genotypes (Fig. 1d). We impute as the expectation given the proband and mother's genotype: . Given the proband's paternally inherited allele, one half of the paternal genotype is determined, and the expectation of the other half is given by f. The resulting squared correlation of the paternal genotype with the imputed paternal genotype is therefore 1/2, higher than with the best linear unbiased imputation, 1/3 (Supplementary Note Appendix D).
Similar to the sibling case, the paternally inherited allele of the proband is known unless both mother and proband are heterozygous, in which case phased data are needed to resolve the uncertainty (Fig. 2b). Without phased data, the unobserved paternal genotype can be imputed by averaging over the two possible inheritance patterns (Supplementary Note Section 5.1), giving The loss of information relative to phased data increases with increasing heterozygosity.
We generalize the imputation procedure to incorporate situations where two or more siblings' genotypes and one parent's genotype have been observed (Fig.  1f, g). We leverage both IBD sharing between siblings and the observed parent to efficiently impute the missing parent's genotype, giving methods for both phased and unphased data (Supplementary Note Sections 5.1.1 and 5.2).
Estimation of SNP effects. Phenotype observations of siblings are correlated through both shared genetic factors and shared environmental factors. To obtain efficient estimates of SNP effects, the phenotypic correlations between siblings should be modeled. We implemented an LMM in snipar that achieves this by modeling the mean phenotype within each family as a random effect. Let Y ij be the phenotype of sibling j in family i; then, assuming the overall mean of the phenotype is zero, where X ij are the mean-centered (observed or imputed) genotypes; θ is the corresponding vector of parameters; μ i is the mean in family i, which we model as a mean-zero normally distributed random effect with variance σ 2 F , independent for each family; and ϵ ij is the residual for individual j in family i, independent for each individual. This implies that, conditional on X, the phenotypic correlation of siblings is σ 2 The columns of X and θ depend upon the data type and model being estimated (Table 1). The default is for the columns of X to be the individual's genotype, the individual's father's imputed or observed genotype, and the individual's mother's imputed or observed genotype, with θ = [ δ, αp, αm ] T . When only sibling genotypes are available, to prevent collinearity, the columns of X reduce to the individual's genotype and the imputed sum of maternal and paternal genotypes, We also provide an option in snipar to add the proband's siblings' genotypes to the regression to fit indirect effects from siblings. For estimation of the effects of genome-wide SNPs, snipar first infers the variance components σ 2 F and σ 2 ϵ by maximum likelihood for a null model without any SNP effects, which can be done in O(n) computations (Supplementary Note Section 10). We then fix the variance components at their maximum likelihood estimate for estimation of the SNP effects. Given the variance components, the estimate of θ can be obtained analytically in O(n) computations.
Effect of population structure. In the results above and the main text, we have assumed random mating. When population structure is present, this leads to bias in the imputed parental genotypes. We analyze the consequences of this in Supplementary Note Section 7. In general, estimates of NTCs are biased by structure, with the bias increasing with Wright's F st . Bias is introduced into estimates of direct effects when data types with different numbers of observed parental alleles are mixed together. For imputation from sibling pairs, the number of observed parental alleles differs with the IBD state of the siblings, introducing a bias into estimates of δ that is approximately equal to Fstα/2 when F st is small. For relatively homogeneous samples, any such bias is therefore likely to be negligible at the individual SNP level. Further, SNPs with large values of F st will tend to be filtered out during quality control because they violate Hardy-Weinberg Equilibrium.
In Supplementary Note Section 7.2, we derive an alternative estimator for δ that splits the regression by the number of observed parental alleles, and we prove that this estimator is not biased by population structure. Although this estimator is more robust, it is less precise than the estimator described above, which performs a single regression using all individuals, irrespective of the number of parental alleles observed. However, the alternative estimator for δ is still more precise than the estimator based on genetic differences between siblings, having an effective sample size 1 + 1−r 6(1+r) ≥ 1 times higher, with a maximum of 7/6 when r =0.
PGI analyses using imputed parental genotypes. Consider a PGI composed of L SNPs for the father in family i: where w l is the weight of SNP l, and g p(i)l is the genotype of the father at SNP l. If the father is not genotyped, then the imputed PGI is: where ĝ p(i)l is imputed as described above. Assuming the L SNPs are in linkage equilibrium, then theoretical results for single SNP analyses carry over. In practice, linkage disequilibrium (LD) between some of the SNPs is expected. However, if many SNPs from across the genome contribute to the PGI, only a small fraction of the pairs of SNPs will have non-negligible correlations due to local LD, and the effect on the imputations and estimates would be negligible. However, for a phenotype with AM, contributing SNPs can become correlated regardless of their physical positions 3,40,41 . Because each SNP is imputed individually without conditioning on other SNPs that contribute to the PGI, the imputed PGIs are not exactly the conditional expectations given the observed PGIs. Consider a model for the association between a phenotype and a PGI: where PGI ij is the PGI of sibling j in family i, and PGI m(i) is the PGI of the mother in family i. We show in the Supplementary Note Section 8 that using imputed parental PGIs in place of observed parental PGIs does not introduce bias to estimates of δ, even when AM is present, when the number of SNPs, L, is large. However, a slight bias in estimates of NTCs is introduced. For example, if using parental genotypes imputed from sibling pairs with phased data, the estimate of α would be inflated by a factor of (1 + ram)/(1 + ram/2), where r am is the equilibrium correlation between maternal and paternal PGI. We note that, even with fully observed genotypes, AM implies that α captures confounding due to correlation between the parental PGI and the genetic component of the phenotype that would be uncorrelated with the PGI under random mating, as described previously 4,42 .
UKB sample. We used the UKB sample that had been identified by UKB to have predominantly 'White British' ancestry 22 . We filtered out individuals identified by the UKB as having excess relatives, excess heterozygosity, or sex chromosome aneuploidy. We used the kinship coefficients computed by the UKB to identify individuals with a first-degree relative, where a first-degree relation is defined as a kinship coefficient of 0.177 and above 23 . We used KING 23 with the '-related -degree 1' options to infer the sibling and parent-offspring relations within that set of individuals (Table 2). We identified 157 duplicates/monozygotic twins and removed one from each pair from further analyses. There were 19,290 sibling pairs from 17,289 sibships, including 913 sibships of size greater than 2, with a maximum size of 6.
Haplotypes for the SNPs that were present on both the UKB Axiom and the UK BiLEVE genotyping arrays and that passed quality control were provided by the UKB 22 . Phasing was performed using SHAPEIT3 (ref. 43 ) and the 1000 Genomes Phase 3 dataset 44 as a reference panel. This resulted in phased haplotypes for a set of 658,720 autosomal SNPs with an estimated switch error rate of 0.229% 22 .
In addition, we used SHAPEIT2 with the -duohmm option (with -W 5 parameter) to phase 1.1 million HapMap3 SNPs with MAF > 1% from the imputed genotype data provided by the UKB. The 'duohmm' option takes advantage of parent-offspring relations to improve phasing. We merged the haplotypes provided by UKB with the haplotypes for HapMap3 SNPs using QCTOOL, giving haplotypes for 1,652,145 unique SNPs, 1,586,010 of which had MAF > 1%.
To compute the PCs of the IBD relatedness matrix, we used KING 23 with the -ibdseg option to infer IBD segments between all pairs in the 39,619 individuals from the 'White British' subsample of the UKB for which parental genotypes were observed or could be imputed, along with their first-degree relatives, giving a total sample of 44,553. The relatedness between two individuals based on IBD sharing was calculated as (1/2) × P(IBD1) + P(IBD2), where P(IBD1) and P(IBD2) are the fractions of the autosome shared in IBD1 and IBD2 segments respectively. We extracted the eigenvectors with the 40 largest eigenvalues from the resulting relatedness matrix.
UKB phenotypes. We analyzed EA; standing height (Data Field 50); body mass index (Data Field 21001); neuroticism score (Data Field 20127); whether an individual answered that they had ever smoked or not (Data Field 20160), encoded as a binary variable; cognitive ability, derived from a test of 'fluid intelligence' (Data Field 20016); high-density lipoprotein cholesterol (Data Field 30760); forced expiratory volume in one second (Data Field 3063); age at first live birth (in women) (Data Field 2754); and north (Data Field 129) and east (Data Field 130) birth coordinates. For EA, we converted the answers to the qualifications question (Data Field 6138) to years of education according to the method used in the most recent GWAS of EA 45 . For all phenotypes, we regressed out age, age 2 , age 3 , sex and interactions between sex and age, age 2 , and age 3 , along with the 40 genetic PCs provided by the UKB. For quantitative phenotypes, we normalized the phenotypes to have variance 1 separately in males and females.
To further investigate the impact of residual population stratification on EA and cognitive ability, we adjusted EA and cognitive ability for birth coordinates and the center where they were assessed (Data Field 54), in addition to the covariates listed above. To do this, we regressed the phenotype onto the covariates listed above and linear and nonlinear functions of north and east birth coordinates, assessment center coded as a categorical variable and interactions between assessment center and north and east birth coordinates and the squares and cubes of north and east birth coordinates. For the nonlinear functions of north and east birth coordinates, we used north coordinate, its square and cube; east coordinate, its square and cube; and all pairwise products between north coordinate and its square and cube, and east coordinate and its square and cube.

IBD inference.
We developed a hidden Markov model (HMM), implemented in snipar, to infer IBD segments shared between siblings (Supplementary Note Section 9). The HMM models the joint distribution of a sibling pair's (unphased) genotypes at a SNP conditional on the IBD state. To account for LD between nearby SNPs, we weighted the contribution of siblings' genotypes at each SNP to the overall likelihood for the chromosome by the inverse of the LD score of the SNP. We calculated the LD scores using the LD Score Regression 46 software with a 1 centimorgan (cM) window. The probability of transitioning from one IBD state to another is inferred from the genetic distance between the SNPs. We account for genotyping errors, which requires a parameter γ for the probability of a genotyping error. We smooth the IBD segments inferred by the HMM to remove short segments that are improbable based on their length in cM and whose neighboring segments have the same IBD state. This requires a parameter, m, the minimum allowed length (in cM) of an IBD segment that differs from its adjacent segments.
We optimized the parameters γ and m by using 31 families where two siblings and both parents are genotyped, and therefore the true IBD state can be inferred for many SNPs (Supplementary Note Section 9). We found that (γ, m) = (10 −4 , 0.01 cM) gave the highest probability of inferring the true IBD state, 99.65%. We give the proportions of SNPs with inferred IBD states 0, 1 and 2 as a function of the true IBD state in Supplementary Table 1.
We compared this to IBD segments inferred by KING using the -ibdsegs option, which had an overall probability of inferring the true IBD state of 98.5%.
Our method therefore gave around a fourfold reduction in IBD errors compared to KING. Furthermore, we found that the distribution of IBD states inferred by KING diverged substantially from the theoretical expectation near the ends of chromosomes and centromeres, whereas the distribution of IBD states inferred by snipar was close to theoretical expectations from end to end (Extended Data Fig. 4).

Imputation of missing parental genotypes.
Using the inferred IBD segments (above), we imputed missing parental genotypes from phased haplotypes for 1,586,010 SNPs, the union of the genotyping array SNPs and the HapMap3 SNPs with MAF > 1%. We examined the bias in the imputed parental genotypes by performing the imputation for families with both parental genotypes as if one or both parental genotypes were missing. If the imputation is unbiased, then the regression coefficient of the observed parental genotypes onto the imputed parental genotypes should be 1. This is because the covariance between the imputed parental genotypes and the observed parental genotypes should be equal to the variance of the imputed parental genotypes (Supplementary Note Section 2). Based on data from 31 families with two siblings and two parents genotyped, we obtained a regression coefficient of 0.9997 for regression of the sum of observed parental genotypes onto the imputed sum of parental genotypes. Based on data from 894 families with both parents genotyped, we set one parent's genotype as missing and imputed it from the remaining genotypes in the family, and we obtained a regression coefficient of 0.9989 for regression of the observed parent's genotype onto the imputed parent's genotype. These results show there is negligible bias in the imputed parental genotypes.
Estimating direct and indirect effects of an education PGI. We used summary statistics from a GWAS of EA 9 modified to remove the individuals in this study and their relatives, up to the third degree, from the summary statistics. We computed the PGI by applying LD-pred 47 to the summary statistics. We computed PGIs for individuals and their siblings and parents based on observed and imputed genotypes. We estimated the effects of the PGI by performing an LMM regression in snipar: where the columns of X were the intercept, the individual's PGI, the mean PGI of the siblings of the individual, the (imputed or observed) PGI of the individual's father, and the (imputed or observed) PGI of the individual's mother. Here, to account for the fact that the PGI might explain a substantial amount of phenotypic variance, the variance parameters σ 2 F and σ 2 ϵ were estimated jointly with θ. For the analysis assuming that the indirect effect from the sibling was zero, we dropped the sibling PGI from the regression and expanded the sample to include individuals with genotyped and/or imputed parents but without a genotyped sibling. We adjusted average NTC estimates for bias introduced by imputation when AM is present (Supplementary Note Section 8).
GWASs in unrelated individuals. We conducted GWASs using BOLT-LMM 48 in the sample of 'White British' UKB participants without third-degree or closer relatives also genotyped in the UKB 22 . This sample is therefore unrelated (less than third degree) from the sample of individuals with observed and/or imputed parental genotypes, who all have at least one first-degree relative also genotyped in the UKB. As in the related sample, we filtered out individuals identified by the UKB as having excess relatives, excess heterozygosity or sex chromosome aneuploidy, giving a sample size of 276,419. We applied BOLT-LMM to the 658,720 SNPs present on the UKB genotyping array.
In addition to the nine phenotypes used in the related sample, we also analyzed north and east birth coordinates. We adjusted the phenotypes for the same set of covariates as in our analysis of the related sample, including 40 genetic PCs. We estimated correlations between the summary statistics in the related sample and the summary statistics in the unrelated sample using the moment-based estimator (Supplementary Note Section 11) with the sampling correlation between the estimates set to zero.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Summary statistics for the direct effects, NTCs, and population effects of 1,586,010 SNPs on nine phenotypes can be downloaded from http://www.thessgac.org/data, subject to a terms of use to encourage responsible use of the data. Applications for access to the UKB data can be made on the UKB website (http://www.ukbiobank. ac.uk/register-apply/).

Code availability
The code for IBD inference, imputation and genome-wide association and PGI analyses is available as a Python package, snipar 49 , under an MIT license at Extended Data Fig. 1 | Variance-covariance matrices of observed and imputed genotypes. Each matrix shows the variance-covariance matrix within the nuclear family given observed and imputed genotypes. The labels a) to o) correspond to those in Fig. 1. g i1 denotes the proband's genotype, g i2 denotes the proband's sibling's genotype, g p(i) denotes father's genotype and g m(i) denotes mother's genotype. A caret ('hat') over the 'g', such as ĝ p(i) , indicates an unobserved genotype that is imputed (either linearly or non-linearly) using observed genotypes. Displayed are the variance-covariance matrices of the genotypes, normalized by the variance of an observed genotype, 2f(1−f) where f is the allele frequency. This scaling means that the diagonal entry corresponding to an observed genotype is 1. Best linear unbiased imputations can be derived from applying formulae for multivariate Gaussian random-variables to a) (Supplementary Note Appendix D).

Extended Data Fig. 3 | Confirmation of theoretical result for direct effects.
Here we compare theoretical predictions to results from simulated data for the effective sample size for estimation of direct genetic effects using parental genotypes imputed from both phased and unphased sibling genotype data relative to estimating direct genetic effects using genetic differences between siblings. For 3,000 independent families, we simulated three different traits affected by direct, paternal, and maternal effects of 1,000 SNPs with minor allele frequency 0.5 (Supplementary Note Section 12.1). The overall variance explained by the combined direct, paternal, and maternal effects varied between the simulations, leading to different correlations between siblings' phenotypes. We computed theoretical expectations based on formulae derived in Supplementary Note Section 4.3, which are drawn as the red curve (theoretical expectation for imputation from unphased data) and the black curve (theoretical expectation for imputation from phased data). The simulation results for unphased data are given by the black circles, and the simulation results for the phased data are given by black triangles. The relative effective sample size is given by the ratio between the sampling variance for estimation of direct effects when using differences between siblings to the sampling variance when using parental genotypes imputed from phased or unphased data. Fig. 4 | IBD0 proportion among sibling pairs across chromosome 1. We show the proportion of sibling pairs, out of 19,290 pairs, that are called as IBD0 for each SNP with MAF>1% on chromosome 1 on the UK Biobank genotyping array. We compare the fraction of pairs that are called IBD0 by snipar (black line) and KING (gray line). The theoretical expectation according to Mendelian segregation is 0.25, indicated by the red horizontal line. Fig. 6 | Correlations with north and east birth coordinates. Associations between SNPs and north and east birth coordinates that persist after principal component adjustment were assessed by performing a genome-wide association study of north and east birth coordinates in a sample of unrelated individuals from the 'White British' subsample of the UK Biobank (Methods). We estimated genome-wide correlations between SNP associations with north and east birth coordinate and a) direct effects, b) average non-transmitted coefficients (NTCs), and c) population effects on 9 phenotypes (Methods). Point estimates are plotted as points, with error bars giving 95% confidence intervals. See Supplementary Table 6 for numerical results. Abbreviations: AAFB, age at first birth; BMI, body mass index; FEV1, forced expiratory volume in one second; HDL, high-density lipoprotein cholesterol.