Controlling for background genetic effects using polygenic scores improves the power of genome-wide association studies

Ongoing increases in the size of human genotype and phenotype collections offer the promise of improved understanding of the genetics of complex diseases. In addition to the biological insights that can be gained from the nature of the variants that contribute to the genetic component of complex trait variability, these data bring forward the prospect of predicting complex traits and the risk of complex genetic diseases from genotype data. Here we show that advances in phenotype prediction can be applied to improve the power of genome-wide association studies. We demonstrate a simple and efficient method to model genetic background effects using polygenic scores derived from SNPs that are not on the same chromosome as the target SNP. Using simulated and real data we found that this can result in a substantial increase in the number of variants passing genome-wide significance thresholds. This increase in power to detect trait-associated variants also translates into an increase in the accuracy with which the resulting polygenic score predicts the phenotype from genotype data. Our results suggest that advances in methods for phenotype prediction can be exploited to improve the control of background genetic effects, leading to more accurate GWAS results and further improvements in phenotype prediction.

SNP. In practice, the PGS should be approximately independent of the selected SNP under the null hypothesis of no causal association at the target SNP, since the off-target SNPs that constitute the polygenic score are selected independently and are on differing chromosomes (that is they are not in LD with the target SNP and there is no-collider bias between the target SNPs and off-target SNPs since the null hypothesis is true).
This proves the conservation of type I error. Under the alternative hypothesis that the target SNP has a causal association with the outcome, collider bias might result in some correlation between the PGS and target SNP genotype; however, the extent of this correlation is likely extremely weak when there are a large number of variants that are associated with the trait in question, and unlikely to invalidate the following argument.
We first list the assumptions and notation we will use for the remainder of the argument.

Assumptions
• Let X correspond to the standardized SNP genotype at a particular location • Without loss of generality, assume that Var(X) = 1 and E(X) = 0 (that is if X * is the original genotype data, X = (X * − E(X * ))/SD(X * ) • Similarly, the outcome Y is standardized, so that E(Y ) = 0 and Var(Y ) = 1 • Data collected on outcome, Y , target SNP X, and offtarget genetic SNPs, G 1 , ..., G K for samples i = 1 . . . N • The estimated LOCO polygenic score,P = ∑ k∈Ŝβ k G k , constructed over SNPs in the selection setŜ. Again SNP variables G k for k ∈ S are standardized to have mean 0, variance 1. By construction,P has expected value 0. We assume thatβ k are scaled so that the empirical variance ofP over samples i ≤ N is 1.
• Finally, we consider the LOCO polygenic score P that corresponds to SNPs in S but weighted according to their "true" associations β k , P = ∑ k∈Ŝ β k G k • Subscript notation. i and j refer to individuals i, j ≤ N; k ≤ K refers to genetic location Variance ofβ in fastGWA model The fastGWA model takes the form: N ) = Πτ 2 0 , where the family matrix Π is assumed known (or can be estimated using the original genotypes). The overall variance matrix of Var(Y) = (Y 1 , ...,Y N ) in (1) accounting for both the environmental variance and genetic random effect is V = σ 2 0 I + Πτ 2 0 Assuming consistent REML estimates,τ 0 andσ 0 , of τ 0 and σ 0 , estimated by fastGWA, fastGWA estimates β by genalized least squares:β Since,β is computed using generalized least squares, it is easily shown that: with X being the vector of the target SNP over i = 1, ..., N Henceforth, we will assume that estimation error in the estimated variance components:σ 0 andτ 0 is negligible, so can effectively leave out the hat-notation when referring to variance components.
To examine the effect of the extent of family correlation structure on Var(β ) in a simplistic setting, we will assume that Π has a compound symmetry structure (implying that all individuals are equally related. That is That is Π has elements −1 ≤ ρ ≤ 1 on its off diagonals and 1 on its diagonals. It follows that the matrix V has also a compound symmetry form: The inverse of V (if it exists) can be calculated analytically and is equal to: It follows that: Now, noting that E(X 2 i ) =1 and assuming that E(X i X j ) = ρ, the genetic correlation, for large N one can show that the above is approximately equal to indicating that Var(β ) is smallest when fastGWA is run on unrelated individuals, that is where ρ = 0. From this, we see that the inclusion of a genetic-random effect (with a particular correlation matrix) in fastGWA does little to increase power (although the association estimate will be slightly more efficient than the corresponding estimate from a regression not taking into account family structure when ρ = 0. The goal in Fast-GWA is instead to properly incorporate family structure in the estimation of Var(β ).
In particular, related-ness in the GWAS reduces the power of finding associated SNPs 4 (which is indicated in that Var(β ) is a increasing function of ρ).

Variance ofβ in fastGWA-PGS model
The fastGWA-PGS model takes the form: whereP = P + ε P is the estimated polygenic risk score, assumed to be independent of X, and estimated in a LOCO fashion. We will later justify that the modified residual terms ε (1) and g (1) , are zero mean random variables that are independent of X conditional onP providedP is independent of X. Comparing with equation (1) we have that: Var(ε (0) ) +Var(g (0) ) = Var(ε (1) ) +Var(g (1) ) + γ 2 (4) Importantly, these independence conditions imply that conditional onP, Cov(X,Y |P) = βVar(X|P) = βVar(X). Noting then that Cov(X,Y |P) is constant, it must equal Cov(X,Y ), which implys that β = Cov(X,Y )/Var(X). This indicates that the coefficient β multiplying the SNP genotype is the same in (3) and (1). Note that the variances of both residual terms may be reduced due to addition of the polygenic risk score, that is Var(ε (1) ) = σ 2 1 < Var(ε (0) ) = σ 2 0 and Var(g (1) ) = τ 2 1 < Var(g (0) ) = τ 2 0 . As vector equations we again assume that Var(ε N ) = Πτ 2 1 . Comparing equations (1) and (3), it follows that adjustment for the polygenic score will reduce the variance of the environmental noise and genetic components in (1), by the quantities: Corr(P, ε (0) ) and Corr(P, g (0) ). Note if we instead adjusted for the "true" polygenic score, P, in the regression, we might reduce more of the noise in the genetic random effect but would not reduce noise in the environmental random effect.

5
The model can be approximately fit in 2 stages. First, we orthogonalize the out- where YP is the predicted outcome from a regression usingP. Second, we orthogonalize X with respect toP, that is calculate X (1) = X − XP. Assuming X is truly independent ofP one would expect that X (1) ∼ X. Finally, β is estimated by a generalized least squares fit, regressing Y (1) on X (1) , in the following model where the variance matrix Similarly to before,β = X (1) t V (1)−1 Y (1) and the variance ofβ is and under the circumstance that the off-diagonal elements of Π are all equal to ρ, and X (1) ∼ X, this is approximately noting that σ 2 1 < σ 2 0 and τ 2 1 < τ 2 0 and comparing to (2) indicates the variance ofβ is reduced by adding the informative (and independent) estimated PGS to the regression.
Because of near-orthogonality of X andP, one would not expect the absolute-size ofβ to be altered (indeed we argued previously that the β coefficient in the two regression formulae (1) and (5) should be equal), indicating that a test based onβ 2 /Var(β ) should have improved power.
Justification of independence of modified residuals and SNP genotype X under approximate independence of X andP As previously noted, if residuals, ε (1) and g (1) and genotype, X, in equation (3) are truly independent of each other, and ε (1) and g (1) are zero mean and finite variance, standard calculations as demonstrated later show that the variance calculated as (7) is asymptotically correct. In addition, the β parameters will 'match' in equations (1) and (3), and hence the PGS adjusted model will have improved power under the alternative whilst conserving type I error under the null. The following is an argument to justfify this condition. By assumption, in equation (1), the residual terms ε (0) and g (0) are independent of the genotype vector X. We also have assumed that the selected polygenic score, P is statistically independent of X. This implies that once standardized to have mean 0, X andP should be approximately orthogonal. Now, conditional on the vector of polygenic scores,P Let YP =γP be the projection of the response vector Y onto the vectorP.
By examining the right hand side of equation (1), and the approximate orthogonality of X andP, this projection is also equal to the sum of the projections of the vectors ε (0) and g (0) ontoP, which we denote ε P , we have the equation: where β is the same coefficient as in equation (1). Noting that conditional onP, the vectors ε (1) and g (1) are functions of the vectors ε (0) and g (0) , which are all independent of X, ε (1) and g (1) are also independent of X. In addition, ε (0) , g (0) andP are 0-mean random variables by assumption. Since, as vectors ε (1) and g (1) can be viewed as the difference of a zero mean vector and a projection onto a zero mean vector they can also be viewed as zero mean vectors, which completes the argument.
Conservation of Type I error, after adjustment forP, assuming independence of X and modified residuals Under the scenario that we have sucessfully reduced residual noise by incorporating a polygenic risk score as above, the association test checks the orthogonality of the genotype vector for the SNP, X with the noise reduced outcome vector (after subtracting off the predicted outcome based on the polygenic score). Since the polygenic risk score is approximately othogonal to the SNP in question, and was constructed with no reference to the SNP, the Type I error of this test should not be affected. This follows in a straightforward way from the observations that the modified genetic and environmental residuals are independent of X and have 0 mean and the variance matrix listed above as we have justified above.
In more detail, suppose that β = 0. If E(β ) = 0 and the variance of Var(β ) is really given by (7), it follows that the test statistic:β 2 /Var(β ) should be approximately chisquared with 1 degree of freedom, and p-values will be uniform as required for a valid statistical test.
Var(Y (1) ) = Var(ε (1) ) + Var(g (1) ), which by definition is given by (6), implying that Var(β ) is indeed given by (7) Supplementary figures & Tables Figure S1: Assessment of the false positive rate in 100 simulations, causal variants were simulated on the even chromosomes leaving the odd chromosomes to carry information on the false positive rate. The results of fastGWA-PGS are shown for three P&T P-value thresholds (LOCO PGS is calculated using 5 x 10 −5 , 0.05 & 0.5 pvalue cut off points) and LDpred2.        Figure S6: QQ plots comparing the distributions of the negative logarithm of the Pvalues obtained when different methods were applied to the height phenotype from the UK Biobank.  Figure S7: QQ plots comparing the distributions of the negative logarithm of the Pvalues obtained when different methods were applied to the heel bone mineral density (HBMD) phenotype from the UK Biobank  Figure S8: QQ plots comparing the distributions of the negative logarithm of the Pvalues obtained when different methods were applied to the body mass index (BMI) phenotype from the UK Biobank Figure S9: Proportion of phenotypic variance (in height, BMI, & HBMD) explained by polygenic scores, calculated using the P&T method, as a function of the P-value thresholds applied in the P&T method. The polygenic scores were calculated from summary statistics obtained using the methods shown. Table S6: Maximum difference in sensitivity between methods, and the corresponding specificity at which this maximum occurs (from 100 simulations with h 2 =0.5, N=100,000 & 1,000 causal loci).