Reconstructing SNP allele and genotype frequencies from GWAS summary statistics

The emergence of genome-wide association studies (GWAS) has led to the creation of large repositories of human genetic variation, creating enormous opportunities for genetic research and worldwide collaboration. Methods that are based on GWAS summary statistics seek to leverage such records, overcoming barriers that often exist in individual-level data access while also offering significant computational savings. Such summary-statistics-based applications include GWAS meta-analysis, with and without sample overlap, and case-case GWAS. We compare the performance of leading methods for summary-statistics-based genomic analysis and also introduce a novel framework that can unify common summary-statistics-based implementations via the reconstruction of allelic and genotypic frequencies and counts (ReACt). First, we evaluate ASSET, METAL, and ReACt using both synthetic and real data for GWAS meta-analysis (with and without sample overlap) and find that, while all three methods are comparable in terms of power and error control, ReACt and METAL are faster than ASSET by a factor of at least one hundred. We then evaluate the performance of ReACt against an existing method for case-case GWAS and show comparable performance, with ReACt requiring minimal underlying assumptions and being more user-friendly. Finally, ReACt allows us to evaluate, for the first time, an implementation for calculating polygenic risk scores (PRS) for groups of cases and controls based on summary statistics. Our work demonstrates the power of GWAS summary-statistics-based methodologies; the proposed novel method provides a unifying framework and further extends the possibilities for researchers seeking to understand the genetics of complex disease.

Table S1. Average running time in seconds for fixed-effect meta-analysis using ReACt, METAL, and ASSET. All experiments were performed on Purdue's Snyder cluster on a dedicated node featuring a Haswell processor running at 2.6 GHz, 512 GB of RAM, and a 64-bit CentOS Linux 7 operating system. We report the average running time in seconds over ten iterations using ReACt, METAL, and ASSET. For METAL, we evaluated the performance of the latest release on GitHub [28]. In each iteration, two or four sets of summary statistics (for 100,000 SNPs) were meta-analyzed. Recall that all methods scale as a function of the number of SNPs and are independent of the number of samples, since only summary statistics are used.

                  ReACt   METAL   ASSET
2 input studies   2.2s    1.8s    696s
4 input studies   3.1s    3.3s    3715s

Table S2. Performance of fixed-effect meta-analysis with two input studies with uneven case/control sample sizes under different conditions. We compare the power and type I error rate (T1E) of our method meta-analyzing two studies with uneven case/control sample sizes vs. ASSET/METAL for a significance threshold p < 5 · 10^−5. Study one contains 1500 cases and 500 controls, and study two contains 500 cases and 1500 controls.

Table S3. Performance of fixed-effect meta-analysis with two input studies under different conditions. We compare the power and type I error rate (T1E) of our method meta-analyzing two studies vs. ASSET/METAL for a significance threshold p < 5 · 10^−5. METAL dev refers to the latest release on GitHub [28]. Two variants of ReACt are tested: Exact and Est, indicating whether the sample overlap was exactly known as part of the input or whether it was estimated, respectively. Sample overlap indicates the number of cases and controls that were shared between the two input studies; i.e., a sample overlap equal to 100 means that 100 cases and 100 controls are shared between the two input studies. Total sample sizes for each input study, including the shared samples, are equal to 2000 when the sample overlap is equal to zero; 2400 when the sample overlap is equal to 100; and 4000 when the sample overlap is equal to 500. In each case, the sample is split equally into cases and controls. Also see Figures 1 and 2.

Table S4. Performance of fixed-effect meta-analysis with four input studies under different conditions. We compare the power and type I error rate (T1E) of our method meta-analyzing four studies vs. ASSET/METAL for a significance threshold p < 5 · 10^−5. METAL dev refers to the latest release on GitHub [28]. Two variants of ReACt are tested: Exact and Est, indicating whether the sample overlap was exactly known as part of the input or whether it was estimated, respectively. Sample overlap indicates the number of cases and controls that were shared between two input studies; i.e., a sample overlap equal to 100 means that 100 cases and 100 controls are shared between two input studies. Total sample sizes for each input study, including the shared samples, are equal to 2000 when the sample overlap is equal to zero; 2400 when the sample overlap is equal to 100; and 4000 when the sample overlap is equal to 500. In each case, the sample is split equally into cases and controls.

Solving the non-linear system of equations of Section 2.1
For notational simplicity, let a = a^cse_iℓ, b = u^cse_iℓ, c = a^cnt_iℓ, and d = u^cnt_iℓ. Let z denote the odds ratio, let w denote the square of the standard error, and let x = a + b and y = c + d denote the total allele counts in cases and controls, respectively. We rewrite eqns. (1)-(4) as

    ad / (bc) = z,                       (19)
    1/a + 1/b + 1/c + 1/d = w,           (20)
    a + b = x,                           (21)
    c + d = y.                           (22)

Our goal is to compute values for the four unknowns a, b, c, and d. Combining eqns. (19), (21), and (22), all four unknowns can be written as functions of d and other known quantities:

    a = zx(y − d) / (d + z(y − d)),      (23)
    b = xd / (d + z(y − d)),             (24)
    c = y − d.                           (25)

Substituting eqns. (23)-(25) into eqn. (20) and simplifying, we get a quadratic equation in d:

    ((1 − z)^2 + wxz) d^2 − zy(wx − 2 + 2z) d + zy(zy + x) = 0.      (26)

Its real roots (if they exist) are

    d_{1,2} = ( zy(wx − 2 + 2z) ± sqrt( z^2 y^2 (wx − 2 + 2z)^2 − 4((1 − z)^2 + wxz) · zy(zy + x) ) ) / ( 2((1 − z)^2 + wxz) ).      (27)

Given d, we can immediately compute a, b, and c using eqns. (23), (24), and (25). In order to determine whether d is equal to d_1 or d_2, we first check whether d_1 or d_2 guarantees that a, b, c, and d are all positive numbers. If both d_1 and d_2 satisfy this constraint, then we choose the larger of the two roots, as it solves the trivial minimization problem of minimizing the frequency (y − d)/y of allele A1 in controls. This choice is based on the assumption that in summary statistics A1 (whose frequency is equal to the above fraction) typically denotes the effective (minor) allele. Additionally, our code performs a sanity check for allele alignment across studies given the solution d_1 or d_2.
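As a concrete illustration, the closed-form solution can be sketched in a few lines of Python. The function name and interface are hypothetical; the sketch assumes z is the odds ratio, w = SE^2, and x, y are the total allele counts in cases and controls, solving ad/(bc) = z, 1/a + 1/b + 1/c + 1/d = w, a + b = x, c + d = y:

```python
import math

def reconstruct_allele_counts(odds_ratio, se, x, y):
    """Recover allele counts a, b (cases) and c, d (controls) from an
    odds ratio and its standard error, given total allele counts x, y.
    Returns None when the quadratic in d has no real or valid root."""
    z, w = odds_ratio, se ** 2
    # quadratic in d: ((1-z)^2 + wxz) d^2 - zy(wx - 2 + 2z) d + zy(zy + x) = 0
    A = (1 - z) ** 2 + w * x * z
    B = -z * y * (w * x - 2 + 2 * z)
    C = z * y * (z * y + x)
    disc = B * B - 4 * A * C
    if disc < 0:
        return None
    # try the larger root first: A1 is assumed to be the minor (effect) allele
    for d in sorted([(-B + math.sqrt(disc)) / (2 * A),
                     (-B - math.sqrt(disc)) / (2 * A)], reverse=True):
        c = y - d
        a = z * x * c / (d + z * c)
        b = x - a
        if min(a, b, c, d) > 0:  # keep only an all-positive solution
            return a, b, c, d
    return None
```

For example, the counts a = 10, b = 20, c = 30, d = 40 give z = 2/3, w = 5/24, x = 30, y = 70, and the function recovers the original four counts.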
For the sake of completeness, we also prove that it is not possible for both d_1 and d_2 to be negative. First, note that, by Vieta's formulas applied to eqn. (26),

    d_1 + d_2 = zy(wx − 2 + 2z) / ((1 − z)^2 + wxz).

Using x = a + b > 0 and w = 1/a + 1/b + 1/c + 1/d ≥ 1/a + 1/b ≥ 4/(a + b) = 4/x, we get wx ≥ 4, which implies that wx − 2 + 2z > 0. Combining with eqn. (27), we conclude that d_1 + d_2 is non-negative; recall that w, x, y, and z are all non-negative. Additionally,

    d_1 · d_2 = zy(zy + x) / ((1 − z)^2 + wxz) > 0,

which implies that d_1 and d_2 must have the same sign, and since their sum is non-negative, they must both be positive. It is a simple exercise to prove that, as long as root(s) exist, at least one of them will guarantee that all values for a, b, and c will be positive.
One important exception arises when the discriminant in eqn. (26) is negative. In that case, no real roots exist for the quadratic equation. We note that, theoretically, this should never happen, since the underlying unknown quantities are positive real numbers. In practice, however, stratification correction and genotype missingness can force the discriminant to fall below zero. To address this issue, we inflate w (i.e., the square of the standard error for the respective SNP) and recompute the discriminant. More specifically, we iteratively multiply w by 1.001 (a 0.1% inflation) until a non-negative discriminant is obtained or until 50 iterations are reached. The maximum inflation we allow (after the full 50 iterations) is therefore 1.001^50 − 1 ≈ 5%. If after 50 iterations we have failed to obtain a non-negative discriminant, we omit this particular SNP from further analyses. Empirically, for most input SNPs, a real root can be found within at most ten iterations.
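The inflation heuristic itself is simple to sketch. The fragment below (function name hypothetical) assumes the same parameterization as above, with z the odds ratio, w = SE^2, and x, y the case/control allele totals, and the quadratic coefficients derived from eqns. (19)-(22):

```python
def inflate_discriminant(z, w, x, y, factor=1.001, max_iter=50):
    """Iteratively inflate w (the squared standard error) by 0.1% per step
    until the quadratic in d has a non-negative discriminant; return the
    (possibly inflated) w, or None after max_iter failures (SNP omitted)."""
    for _ in range(max_iter):
        A = (1 - z) ** 2 + w * x * z
        B = -z * y * (w * x - 2 + 2 * z)
        C = z * y * (z * y + x)
        if B * B - 4 * A * C >= 0:
            return w  # discriminant is non-negative; no (further) inflation needed
        w *= factor
    return None
```

When the input already admits a real root, w is returned unchanged after the first check.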

Correction for sample overlap between the base/target studies for group PRS
The existence of shared samples in the base (discovery) and target populations can lead to inflation of the association between the PRS and the trait of interest in the target population [19]. In our case, such overlap will cause higher levels of significance in the t-test comparing the case and control PRS distributions. So far, for conventional PRS, the most widely accepted approach to address this problem is simply to identify the overlapping individuals and remove them from the target population. However, in practice, this is not always possible, since it usually requires additional access to the individual-level data of the base population. In this section, we introduce a correction for sample overlap between the base and target populations, implemented in ReACt, that can alleviate such issues.
In the following, we use the case group as an example. Assume that the sample size for cases in the target population is N^cse_target, out of which N^cse_shr are also cases in the base population (overlap). If the probability of each sample being shared between the base and target studies is uniformly distributed in both studies, we would expect the observed mean PRS in target cases, PRS^cse_obs, to be a weighted sum of the mean PRS in base cases, PRS^cse_base, and the mean PRS of cases that exist only in the target population, PRS^cse_target:

    PRS^cse_obs = ( N^cse_shr · PRS^cse_base + (N^cse_target − N^cse_shr) · PRS^cse_target ) / N^cse_target.      (28)

Therefore, the mean PRS for cases only in the target population is

    PRS^cse_target = ( N^cse_target · PRS^cse_obs − N^cse_shr · PRS^cse_base ) / ( N^cse_target − N^cse_shr ),      (29)

where PRS^cse_obs is the uncorrected group mean computed as described in Section 4.3.2. PRS^cse_base can be obtained by simply setting the target population to be the same as the base population, i.e., using the base summary statistics to compute the group PRS for the base population itself. Similarly, we can adjust the variance computation using the law of total variance for a two-group mixture,

    Var(PRS^cse_obs) = ( N^cse_shr · (Var(PRS^cse_base) + (PRS^cse_base − PRS^cse_obs)^2) + (N^cse_target − N^cse_shr) · (Var(PRS^cse_target) + (PRS^cse_target − PRS^cse_obs)^2) ) / N^cse_target,      (30)

which we solve for Var(PRS^cse_target); the analogous equations hold for controls. Then, the corrected p-value is based on a t-test using the corrected means and variances, with degrees of freedom adjusted to the reduced sample sizes N^cse_target − N^cse_shr and N^cnt_target − N^cnt_shr. This is a straightforward correction of the target PRS using the scores of the base population that one would use if there were no stratification between the base and target populations. In practice, this idealized scenario does not hold. In order to deal with stratification between the base and target populations, prior to any correction, we shift the scores for base cases and controls by aligning the base population PRS mean to the target population:

    PRS^cse*_base = PRS^cse_base − PRS_base + PRS_target,
    PRS^cnt*_base = PRS^cnt_base − PRS_base + PRS_target.

In the above, PRS_base and PRS_target are the mean PRS for the base and target populations, respectively, averaged over their cases and controls. In practice, we use PRS^cse*_base and PRS^cnt*_base instead of PRS^cse_base and PRS^cnt_base in equations (29)-(32) for correction.
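In code, the mean correction reduces to one line. The sketch below (function and argument names hypothetical) assumes, as stated above, that shared samples are uniformly distributed between the base and target studies:

```python
def corrected_target_mean(prs_obs, n_target, prs_base, n_shared):
    """Recover the mean PRS of target-only samples by removing the
    contribution of the n_shared samples that also appear in the base
    study, assuming a uniform sharing probability across samples."""
    return (n_target * prs_obs - n_shared * prs_base) / (n_target - n_shared)
```

For example, if 100 of 400 target cases are shared with the base study, the observed mean is 1.25, and the base-case mean is 2.0, the corrected target-only mean is 1.0.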
We evaluated the performance of this correction scheme by introducing sample overlap between the base and target populations, using the same simulation model as the one we used to evaluate the performance of our group PRS approach. We computed the real individual-level PRS using PRSice2, from which we obtained the inflated PRS descriptive statistics (group mean, standard deviation, and t-test p-value) for all target samples, including the ones that are shared with the base population. We also computed PRS statistics for samples that are present only in the target population as the ground truth. We compared results from our corrected group PRS method to the PRS statistics for the samples that are exclusive to the target population computed using PRSice2. Results on synthetic data demonstrate that our correction can drastically alleviate the inflation in p-values that results from sample overlap between the base and target populations. See Table S5, which shows representative results from our experimental evaluation. If the number of overlapping samples is unknown to the user, we apply the approach proposed in [28] to estimate the overlapping sample size and correct the output statistics accordingly. Note that this correction approach is based on the assumption that all samples have an equal probability of being shared between the base and target populations, which might be unrealistic in certain settings.
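After correction, the group comparison reduces to a two-sample t-test that can be computed from summary statistics alone. The sketch below uses Welch's approximation for the degrees of freedom, which is a design choice for this illustration rather than necessarily the exact formula implemented in ReACt:

```python
def group_t_test_stats(mean1, var1, n1, mean2, var2, n2):
    """t statistic and Welch-Satterthwaite degrees of freedom for comparing
    two groups given only their means, variances, and sample sizes."""
    s1, s2 = var1 / n1, var2 / n2           # squared standard errors
    t = (mean1 - mean2) / (s1 + s2) ** 0.5
    df = (s1 + s2) ** 2 / (s1 ** 2 / (n1 - 1) + s2 ** 2 / (n2 - 1))
    return t, df
```

No individual-level data are required: the corrected group means, variances, and reduced sample sizes plug in directly.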

Speeding up the logistic regression computation
Recall that in Section 4.2.1, for any SNP i, the elements of H and G in one iteration take the form of per-sample sums of d_j · X_ju · X_jv and d_j · z_j · X_ju, respectively, where u, v ∈ {0, 1, 2}. As in Section 4.2.1, we drop the subscript i from d_j, z_j, and X. Following these equations directly, the computational burden grows linearly with the sample size ∑_{ℓ=1}^L N_ℓ. However, in practice this step can be achieved with O(L) complexity, as long as we take advantage of the fact that all elements of X are discrete values involving only the genotypes 0, 1, 2 and the study indicators I_iℓ. This implies that both d_j · X_ju · X_jv and d_j · z_j · X_ju can only take a few possible values. In fact, since there are only 3 · L possibilities for X_j* (3 different genotypes × L different studies), there are also only 3 · L possible values for d_j. We denote them as d_ℓn, with ℓ ∈ {1, . . . , L} and n ∈ {0, 1, 2}. For example, d_10 is the value of d_j for a sample j that belongs to study 1 and has genotype A2A2. Similarly, for z_j, since y_j is involved in this computation, we need to consider in total 3L · 2 = 6L possible values, as y is a binary indicator for the phenotypes. We denote those 6 · L possible values as z^cse_ℓn for cases and z^cnt_ℓn for controls, respectively. Then z^cnt_10 represents the value of z_j for a sample j that belongs to study 1, has genotype A2A2, and is a control. We then only need to plug in the element of X based on the u, v values of interest. Consequently, if we simply count the occurrences of those values, the summation can be computed directly. This can be done using the genotype counts that we have already computed in Section 4.1.3. Therefore, for any SNP i with genotype counts N^cse_iℓ(n) and N^cnt_iℓ(n) for each genotype n and indicator I_iℓ for each input study, in an iteration of the IRLS we can compute all d_ℓn, z^cse_ℓn, and z^cnt_ℓn needed for this SNP as described in Algorithm 1.
Then, for this iteration, each such summation reduces to a sum over studies ℓ and genotypes n weighted by the genotype counts, e.g.,

    ∑_{ℓ=1}^{L} ∑_{n=0}^{2} n · I_iℓ · d_ℓn · (N^cnt_iℓ(n) + N^cse_iℓ(n))      (39)

and

    ∑_{ℓ=1}^{L} ∑_{n=0}^{2} n · d_ℓn · (z^cnt_ℓn · N^cnt_iℓ(n) + z^cse_ℓn · N^cse_iℓ(n)).      (42)

Eqns. (35)-(43) grant us a fast update of w = H^{−1} G in each iteration. We repeat this in the IRLS until convergence to get the final result w.
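The counting argument above can be illustrated with a short Python sketch (function name hypothetical): once the genotype counts from Section 4.1.3 are available, a per-sample sum over ∑_ℓ N_ℓ samples collapses to a weighted sum over the 3 · L distinct values d_ℓn:

```python
def collapsed_sum(d_table, case_counts, control_counts):
    """Evaluate a per-sample sum of d_j in O(L) time: d_j depends only on
    the study l and genotype n of sample j, so each distinct value
    d_table[l][n] is weighted by the total genotype count
    N^cse_{il}(n) + N^cnt_{il}(n) instead of being added once per sample."""
    return sum(d_table[l][n] * (case_counts[l][n] + control_counts[l][n])
               for l in range(len(d_table)) for n in range(3))
```

The result is identical to looping over every sample, but the cost no longer grows with the total sample size.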