Quantifying portable genetic effects and improving cross-ancestry genetic prediction with GWAS summary statistics

Polygenic risk scores (PRS) calculated from genome-wide association studies (GWAS) of Europeans are known to have substantially reduced predictive accuracy in non-European populations, limiting their clinical utility and raising concerns about health disparities across ancestral populations. Here, we introduce a statistical framework named X-Wing to improve predictive performance in ancestrally diverse populations. X-Wing quantifies local genetic correlations for complex traits between populations, employs an annotation-dependent estimation procedure to amplify correlated genetic effects between populations, and combines multiple population-specific PRS into a unified score with GWAS summary statistics alone as input. Through extensive benchmarking, we demonstrate that X-Wing pinpoints portable genetic effects and substantially improves PRS performance in non-European populations, showing 14.1%–119.1% relative gain in predictive R2 compared to state-of-the-art methods based on GWAS summary statistics. Overall, X-Wing addresses critical limitations in existing approaches and may have broad applications in cross-population polygenic risk prediction.

for 31 traits in East Asians. X-axis represents 4 folds and Y-axis is 10 folds repeated learning. The Pvalue of two-sided Wilcoxon signed-rank test is !"#$%&%' = 0.14. Suppose the standardized traits Y 1 and Y 2 in two populations follow the linear models with random effects: where X k is a N k × M standardized genotype matrix; β k is a M -dimensional vector of genetic effect sizes; ϵ k are non-genetic effects. We assume cross-population genetic covariance is localized in small regions R 1 , . . . , R r , i.e. the joint genetic effect sizes follow the multivariate normal distribution: where h 2 1 and h 2 2 denote heritability in two populations; ρ g is the cross-population genetic covariance;Ĩ is a diagonal matrix whereĨ[i, i] = 1 if and only if i ∈ ∪ r j=1 R j ; K = r j=1 |R j |, which is the number of SNPs with correlated genetic effect. We assume non-genetic effects ϵ k are independent across two populations.
Our goal is to use scan statistic to identify small regions enriched for local genetic correlation between two populations. Therefore, we design the numerator in the scan statistic as i∈R z 1i z 2i , the inner product of z-scores in a region across two populations, which quantifies the concordant association pattern of SNP effect sizes. To normalize the effects of LD in two populations, we use Sd i∈R z 1i z 2i under the null hypothesis as the denominator of scan statistic. Under the null hypothesis that cross-population genetic correlation is zero, the joint z-scores for two populations follow the multivariate normal distribution as Since individual genotype data is hardly accessible due to privacy issues in practice, we use LD matrices (denoted by V k ) estimated from reference panels (e.g. the European and East Asian individuals from 1000 Genome Project) to approximate the sample LD matrices can be unbiasedly estimated as is the sample size of the reference panel for population k. Using this approximation, we have We use XPASS 1 to estimate heritability for two populations h 2 1 and h 2 2 . Let Σ k = Denote z kR as the sub-vector of z k indexed by R, and Σ k,RR as the sub-matrix of Σ k whose rows and columns are both indexed by R. Then for a given region R, the joint z-scores in R follows the multivariate normal distribution Following bilinear form theory, we can show that Var i∈R z 1i z 2i = Tr [Σ 1,RR Σ 2,RR ], where Tr[A] denotes the trace of matrix A. However, computing value of Tr [Σ 1,RR Σ 2,RR ] for all possible regions R is computationally expensive. Therefore, we use i∈R Σ 1,ii * Σ 2,ii to replace Tr [Σ 1,RR Σ 2,RR ]. We add tuning parameter θ to the power term of i∈R Σ 1,ii * Σ 2,ii to accommodate the approximation bias and control the penalty strength of LD effects.
We select the best tuning parameter θ from a candidate set {0.5, 0.55, 0.6, 0.65, 0.7, 0.75} such that the identified regions for that given θ achieve the highest proportion of genetic covariance. The cross-population genetic covariance for aggregated regions can be estimated using XPASS 1 .
We compute the cross-population LD matrix by taking the maximum of population-specific LD matrix in an element-wise fashion, as previously suggested 2 . We use ldetect 3 to divide the genome into 185 LD blocks (average size of 15MB) that are approximately independent in both populations based on the cross-population LD matrix. We first apply LOGODetect to identify regions with family-wise error rate cutoff of 0.05 in each LD block separately. Then we collect all the candidate regions identified across different LD blocks and control FDR level of 0.05 using the Benjamini-Hochberg procedure.

Model
Consider an additive genetic model: where β k is a M -dimensional vector of SNP effect sizes in population k, ϵ k is a vector of error terms with variance σ 2 k , to which we assign a non-informative Jeffreys prior. MVN denotes multivariate normal distribution, and I k is an identity matrix.
Consider an annotation with A category, we assign an annotation-dependent horseshoe prior to β jk : Here, β jk denotes the effect of SNP j in population k, ϕ is the global shrinkage parameter shared across all M SNPs, ψ j represents the local shrinkage parameters for SNP j, λ f (j),k denotes the annotation-dependent shrinkage parameter for SNP j in population k, f : j → a ∈ {1, . . . A} is a function that maps the j-th SNP to its corresponding category a in the annotation.
To perform the full Bayesian model fitting, we assign the half-Cauchy priors to the global, local, and annotation-dependent shrinkage parameters as follows: Using the half-Cauchy decomposition, we have where IG denotes the inverse-gamma distribution.

Gibbs sampler
Next, we derive the full conditional distribution of all parameters in the above model.
For notation purpose, we rewrite the prior in matrix form: The Gibbs sampler involves the following steps in each Markov Chain Monte Carlo (MCMC) iteration: where D k is the LD-matrix for population k,β k is the marginal least squares estimates obtained from GWAS summary statistics. To avoid the numerical issue caused by colinearity between SNPs, we restrict where M k is number of SNPs in population k.
where k j = 1 if SNP j exists only in one population and r if it exists in r populations included.
where s a is the number of predictors in category a, l(a, k) = {j ∈ {1, . . . , M } : λ f (j),k = λ a,k } is the set of SNP predictors that belongs to category a.

Example of annotation-dependent shrinkage based on local genetic correlation annotation
Here, we provide an example of the annotation-dependent shrinkage λ f (j),k based on local genetic correlation annotation. WLOG, we assume that we have K = 3 populations in total and population 1 is the target population.
Given the local genetic correlation annotation Ω 2 and Ω 3 , the λ f (j),k in the Full Bayesian model fitting process is specified as:  Here, k-th row represents the specification of the annotation-dependent shrinkage parameter λ f (j),k ′ for k ′ = 1, 2, 3 when obtaining the posterior effects for k-th population.

Model-tuning strategy
Instead of assigning a prior for ϕ, we select the global shrinkage parameter ϕ from a grid of value {10 −6 , 10 −4 , 10 −2 , 1} with respect to the largest R 2 in the validation set. The detailed algorithm is listed below: Algorithm 1: Model-tuning X-Wing Input: GWAS summary statistics and population-matached LD reference panel from population 1 to K, target sample genotype.
Output: X-Wing PRS. 1 We perform local genetic correlation analysis between population 1 and population k (k = 2, . . . K) to identify top s regions with positive local genetic correlation. We denote the set of regions as Ω k .
2 For each ϕ ∈ {10 −6 , 10 −4 , 10 −2 , 1}, we fit our PRS model with annotation-dependent shrinkage specified below: when estimating the posterior SNP effects for the non-target population k that When estimating the posterior SNP effects for target population, we used λ f (j),k = 1 for all j = 1, 2, . . . M, k = 1, . . . K. 3 For each ϕ ∈ {10 −6 , 10 −4 , 10 −2 , 1}, based on the posterior mean effects of population k obtained in step2, we calculate population-specific score P RS k,ϕ . A common practice to combine these population-specific scores is to fit a regression model using the same phenotype Y (v) 1 and K population-specific PRS in an independent validation dataset from the target population: Instead of fitting a regression in independent samples, we employ a novel strategy to obtain the least squares estimates of regression weights (i.e. w 1,φ , . . . w K,φ ) using GWAS summary statistics. We introduce this approach in the section 3.3. 4 The final X-Wing PRS is then calculated as:

Incorporating multiple annotations 2.2.1 Model
We generalized our model to incorporate T annotations with the annotation-dependent shrinkage prior to β jk : Here, β jk denotes the effect of SNP j in population k, ϕ k is the global shrinkage parameter shared across all SNPs for population k, ψ j represents the local shrinkage parameters for SNP j and are shared across population, λ f (j,t),k is the annotation-dependent shrinkage parameters for SNP j in population k for t-th annoration, f : (j, t) → a t ∈ {1, . . . , . . . A t } is a function that maps the j-th SNP to its corresponding category a t in the t-th annotation.
To perform the full Bayesian model fitting, we assign the half-Cauchy priors to the global, local, and annotation-dependent shrinkage parameters as follows: Using the half-Cauchy decomposition, we have where IG denotes the inverse-gamma distribution.

Gibbs sampler
Next, we derive the full conditional distribution of all parameters in the above model.
The Gibbs sampler then involves the following steps in each MCMC iteration: where D k is the LD-matrix for population k,β k is the marginal least squares estimates obtained from GWAS summary statistics. To avoid the numerical issue caused by colinearity between SNPs, we restrict where k j = 1 if SNP j exists only in one population and r if it exists in r populations included.

Sufficient statistics for least squares estimator of linear combination weights
Consider the linear combination problem for K centered population-specific PRS using the individual-level validation data X Here, superscript v highlights the fact that phenotypes and PRS in this regression exercise need to be obtained from a validation dataset that is different from any data used for GWAS and PRS training. Y T is a K-dimensional linear combination weights vector.For simplicity, is standardized, and b quantifies standardized SNP effects.
Next, we showed the The least squares estimator for w is This indicates that b, X are sufficient statistics for w, where b is obtained from the PRS training procedure, X is from in-sample LD matrix, and X can be obtained from the summary statistics of the validation sample. When the in-sample LD information is not available, we use LD matrix from the reference panel as replacement. Then we have where N (ref ) and P RS (ref ) denote the sample size and PRS matrix in the reference panel. Taken together, this shows that in order to obtain w, we only need the LD reference and summary statistics from a validation sample.

Derivation for subsampling GWAS summary statistics from training and validation sets
Consider the phenotype-genotype model: where Y i is the standardized phenotype with mean 0 and variance 1 for individual i, X i is a 1×M standardized genotype matrix, and ϵ i is the error term, β is a p-dimensional effect sizes vector. Note that the subscript i in section 3.2 denotes the individual rather than the population.
Here, we consider X i and Y i as random and i.i.d. distributed (i.e., . We denote Y = (Y 1 , . . . , Y N ) T as a N -dimensional phenotype vector and X = (X T 1 , . . . , X T N ) T as a N × M standardized genotype matrix.
The standard approach the process for model validation technique involves first randomly sampling a subset of N −N (v) individuals from full sample (X, Y ) as the training data (X (tr) , Y (tr) ), and use the remaining The GWAS sample size is large and hence by the central limit theorem, we have approximately The covariance between X T Y and X (tr)T Y (tr) is Here, we use the formula for the conditional distribution of two multivariate normal random vectors: , and Cov(A, B) = Σ AB , we have the distribution of A|B following a multivariate normal distribution with mean and covariance matrix.
Thus, we have For the conditional expectation, we plug in the estimator x T y for N E[X T 1 Y 1 ], the estimator for the conditional expectation is For notation purpose, we define the conditional variance as The diagonal term Σ jj of Σ is The off-diagonal term Σ jk of Σ, j ̸ = k is It turns out that the Σ is exactly the LD matrix constructed using the standardized genotypes. Thus, we obtain the estimator for the conditional variance as is obtained from the reference panel, In conclusion, we have Thus, we subsample the summary statistics for training set given full summary statistics X T Y by where g is a N (ref ) -dimensional vector with elements drawn from a standard normal distribution.

Dealing with tuning parameters in the PRS model
If there are tuning parameters in the PRS model, we use the correlation R between phenotype and linearly combined PRS in the validation set to select the optimal tuning parameter, as well as to estimate the linear combination weights.
Followed the notation above, suppose there are tuning parameters γ in the PRS model, consider the linear combination problem for K centered PRS using the individual-level validation data: where P RS k,γ , . . . , P RS × K centered PRS matrix with respect to to the tuning parameter γ in the PRS model, w γ is a K-dimensional linear combination weights vector.
The least squares estimator for w γ iŝ where P RS The correlation R γ between the linearly combined PRS and phenotype in the validation set with respect to to the estimated weightsŵ γ is : Then we select the optimal tuning parameterγ aŝ and use the linear combination weightsŵγ with respect to the optimal tuning parameterγ to linearly combine the PRS.

Grid search to handle negative least squares estimates for mixing weights
In practice, the least squares estimates for linear combination weights of a particular PRS can be negative.
It may decrease the prediction accuracy of the linearly combined PRS. Thus, we provide a grid search strategy to mimic the non-negative least squares.
We pre-specify a grid of positive value for the linear combination weights Then, we use the formula in the above section to calculate the correlation between the linearly combined PRS and phenotype in the validation set. The linear combination weights with respect to to largest correlation, w grid = argmax w∈W R w , will be used to linearly combine the PRS.

Summary statistics-based ridge regression to combine multiple PRS
When linearly combined many PRS with multicollinearity problems, the least squares of the linear combination weights may be sub-optimal. A remedy for multicollinearity is ridge regression.
We first describe a individual-level data-based ridge regression. Consider the linear combination problem for K centered population-specific PRS using the individual-level validation data X Here, Y T is a K-dimensional linear combination weights vector. For simplicity, we assume Y is standardized, and b quantifies standardized SNP effects.
The ridge regression estimator for w is where λ is the shrinkage parameter. It has a closed-form solution: (39) common practice to obtain the ridge regression estimator is to use two disjoint validation set, one to select the optimal tuning parameter λ and the other to estimate w with the selected tuning parameter.
Next, we proposed a summary statistics-based ridge regression for combining multiple PRS. Given the GWAS summary statistics with sample size N 1 , we subsample GWAS summary statistics for the training set 1 , and for the validation set X to subsample GWAS for two disjoint validation sets: X We first apply the PRS method using X (tr)T 1 Y (tr) as training data to obatin SNP effects b. Next, we obtain grid of the ridge regression estimate w ridge,λ using X The λ with respect to the largest correlation between phenotype and linearly combined PRS in validation set 1 will be used:λ where P RS (ref ) = X To avoid overfitting, we recommend using distinct reference panel in summary statistics sampling, PRS model training, ridge regression hyperparameter λ selection, and linear combination weights estimation.

Transforming allele count scale SNP effects into standardized scale
For simplicity, we assume that the genotype matrix is standardized, thus the SNP effects should be on standardized allele scale. Since many PRS method outputs the allele count SNP effects , we use allele frequency from the target population to transform the allele count SNP effects to the standardized effects. The k-th where b (allele) k is the allele count SNP effects for M SNPs in k-th population, f 1 is the M -dimensional target population allele frequency vector. When the in-sample allele frequency is not available, we estimated it from the target population reference panel.

GWAS summary statistics-based cross-validation
Suppose we divide the full GWAS sample (X 1 , Y 1 ) into a training set X individuals, and a validation set X individuals. Given the association z-scores from GWAS summary statistics and genotype data from the reference panel, association summary statistics based on training and validation sets can be sampled as: To perform P -folds cross-validation, we first uses the above formula to sample X T 1,p Y 1,p , p = 1, . . . P − 1 from P independent subset with sample size N1 P and obtain the GWAS summary statistics from training and validation sets in fold p as: and estimate the linear combination weights in each fold.

Regarding the marginal linear assumptions
X-Wing assumes a marginal linear regression between the phenotype and SNP. In practice, many of the GWAS are performed using the mixed model. Mathematically, the GWAS association results using the linear mixed model is equivalent to the results using the marginal linear model on phenotypic residual after adjusting for best linear unbiased prediction (BLUP). This phenotypic residual can be considered the phenotype after adjusting for the sample relatedness (genetic relationship matrix). In our analysis, the BBJ GWAS are performed using BOLT-LMM. BOLT-LMM utilized a two-step approach to conduct the linear mixed model GWAS: "Our algorithm fits a Gaussian mixture model of SNP using a fast variational approximation to compute approximate phenotypic residuals and tests the residuals for association with candidate markers via a retrospective score statistic". Therefore, the mixed model GWAS from BOLT-LMM is from the marginal linear regression between the phenotypic residuals and the SNP. To summarize, most current GWAS (such as BBJ used in this study) essentially comes from a marginal linear model with phenotypic residuals as outcome and SNP as predictor. Therefore, our marginal effects assumption is valid, and we don't expect it will influence the PRS model performance. And our repeated learning approach can be considered as splitting on this phenotypic residual instead of the raw phenotype that is correlated among correlated individuals. Therefore, applying the summary statistics-based splitting using summary statistics that have already been accounted for sample relatedness should not cause problems.

Implementation of other methods
XPASS XPASS 1 is an empirical Bayes-based PRS framework that leverages genetic correlation for crosspopulation polygenic prediction. In our paper, XPASS is used to compute heritability and cross-population genetic covariance (correlation), and estimate the SNP posterior effects used to calculate PRS. We used populationmatched 1000 Genomes Project data as the reference panel. Five principal components of genotypes in reference panel were used as covariate files as suggested by the software. We estimated the global genetic correlation using genome-wide SNPs. We also created two SNP sets: SNPs inside and outside significant genome regions identified by X-Wing, and computed cross-population genetic correlation using GWAS summary statistics restricted to the two SNP sets separately. Standard errors of genetic parameters (heritability, genetic covariance, and genetic correlation) were estimated using block-wise jackknife method. For PRS construction, we obtained the posterior effects for each population to generate population-specific PRS. Although XPASS did not propose to linearly combine PRS, we applied the linear combination to XPASS-derived PRS for a fair comparison.
PESCA As suggested by PESCA paper 2 , we pruned SNPs such that correlation between SNPs does not exceed 0.95 in the population-matched 1000 Genomes Project data. We used ldetect 3  SNPs identified by PESCA into regions with equal size, such that the aggregated size was the same as that of X-Wing.