Across-cohort QC analyses of GWAS summary statistics from complex traits

Genome-wide association studies (GWASs) have been successful in discovering SNP-trait associations for many quantitative traits and common diseases. Typically, the effect sizes of SNP alleles are very small, so large genome-wide association meta-analyses (GWAMAs) are required to maximize statistical power. The trend towards ever-larger GWAMAs is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study, we propose four metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose methods to examine the concordance between demographic information and summary statistics, and methods to investigate sample overlap. (I) We use the population genetics Fst statistic to verify the genetic origin of each cohort and its geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. (II) We conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. (III) We propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. (IV) To quantify unknown sample overlap across all pairs of cohorts, we propose a method that uses randomly generated genetic predictors; it does not require the sharing of individual-level genotype data and does not breach individual privacy.


Supplementary Figures
The open circles represent the mean of the inferred geographic locations for the cohorts from the same country. Red triangles (near the ES circle, which is the 1000 Genomes Project IBS cohort) represent the MIGEN cohorts, which are from Sweden, Finland, the USA, and Spain. However, all MIGEN cohorts carry identical allele frequencies in the summary statistics uploaded to the data server.

Figure S3
$F_{ST}$ for GIANT height cohorts using 1000 Genomes European samples as reference.
Using all 1000 Genomes Project European samples as the reference panel (an "averaged" European population), $F_{ST}$ is calculated for each cohort. The Amish population (from the US) and the NSPHS population (from Sweden) show high $F_{ST}$. Cohorts from the same nation are arranged together on the x-axis.
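The cohort-versus-reference comparison above can be illustrated with a small sketch. It assumes a simple two-population Wright-style estimator computed from allele frequencies alone (a ratio of averages over SNPs, without sample-size correction); the exact estimator used for the figure is not stated in this caption, and the function name `fst_vs_reference` and the simulated frequencies are hypothetical.

```python
import numpy as np

def fst_vs_reference(p_cohort, p_ref):
    """Two-population F_ST from allele frequencies (ratio of averages over SNPs).

    Assumed estimator: variance of the two frequencies around their mean,
    divided by p_bar * (1 - p_bar). No sample-size correction is applied.
    """
    p_cohort = np.asarray(p_cohort, dtype=float)
    p_ref = np.asarray(p_ref, dtype=float)
    p_bar = (p_cohort + p_ref) / 2.0
    num = (p_cohort - p_ref) ** 2 / 4.0
    den = p_bar * (1.0 - p_bar)
    keep = den > 0  # skip SNPs that are monomorphic in both populations
    return num[keep].sum() / den[keep].sum()

# Toy data: a reference panel and two cohorts that have drifted by different amounts.
rng = np.random.default_rng(1)
p_ref = rng.uniform(0.1, 0.5, size=5000)
p_close = np.clip(p_ref + rng.normal(0, 0.01, size=5000), 0.01, 0.99)
p_far = np.clip(p_ref + rng.normal(0, 0.05, size=5000), 0.01, 0.99)

fst_close = fst_vs_reference(p_close, p_ref)
fst_far = fst_vs_reference(p_far, p_ref)
print(f"close cohort: {fst_close:.5f}, distant cohort: {fst_far:.5f}")
```

A cohort whose value stands far above those of cohorts from the same country would be a candidate outlier of the kind the caption describes.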

Figure S4
Comparison between meta-PCA and genotype PCA on 1000 Genomes samples using nearly 1M SNPs.
The cohort-level allele frequencies were estimated for the 26 1KG cohorts, and meta-PCA was conducted. (a) The projection of cohorts, based on cohort-level allele frequencies for 1KG samples, on the first two eigenvectors. (b) Conventional PCA based on individual genotypes on the first two eigenvectors; the mean of the 1KG individuals within each cohort is represented by the large circles. (c) The correlation between meta-PCA and genotype PCA for the first twenty eigenvectors. The projected cohorts were consistent with their genetic origins. For comparison, conventional PCA was also conducted directly on 1KG individual genotypes, and the mean coordinates for each cohort were then calculated. As illustrated, the two techniques resulted in nearly identical projections for 1KG, and the correlation between cohort coordinates remained consistently high (above 0.9) for the first eight eigenvectors.

Two cohorts, each with 1,000 individuals and 30,000 independent loci, were simulated. The genetic effects were estimated using single-marker association. Using these 30,000 summary statistics, the test statistic is computed for each locus and contrasted with the null distribution. Each vertical panel represents a different heritability, and each horizontal panel represents a different combination of overlapping samples. The red line in each plot, drawn with its expected slope, represents the expected distribution of the sampled loci, and the gray area represents the 95% confidence interval under the null distribution. Overlapping first-degree relatives cause correlation between summary statistics when heritability is large (first two horizontal panels). Overlapping samples always reduce $\lambda_{meta}$ (the last two horizontal panels).
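To make the per-locus comparison of two cohorts concrete, here is a minimal sketch. It assumes the statistic takes the form of a per-locus squared difference of effect estimates divided by the sum of their squared standard errors, summarized as the median divided by the median of a chi-square distribution with one degree of freedom; that form is an assumption consistent with the description (effect sizes and standard errors, values below 1 under overlap), not a definition taken from this section. Sample overlap is mimicked by giving the two cohorts correlated sampling errors, and the names `lambda_meta` and `simulate_pair` are hypothetical.

```python
import numpy as np

CHI2_1_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def lambda_meta(b1, se1, b2, se2):
    """Median of per-locus (b1 - b2)^2 / (se1^2 + se2^2), scaled by the null median."""
    t = (b1 - b2) ** 2 / (se1 ** 2 + se2 ** 2)
    return np.median(t) / CHI2_1_MEDIAN

def simulate_pair(n_loci, se, error_corr, rng):
    """Effect estimates for two cohorts under the null of no true effects.

    error_corr is the correlation between the two cohorts' sampling errors,
    used here as a stand-in for overlapping samples.
    """
    shared = rng.normal(size=n_loci)
    e1 = np.sqrt(error_corr) * shared + np.sqrt(1 - error_corr) * rng.normal(size=n_loci)
    e2 = np.sqrt(error_corr) * shared + np.sqrt(1 - error_corr) * rng.normal(size=n_loci)
    return se * e1, se * e2

rng = np.random.default_rng(7)
n_loci, se = 30000, 0.03

b1, b2 = simulate_pair(n_loci, se, 0.0, rng)
lam_indep = lambda_meta(b1, se, b2, se)    # independent cohorts: close to 1

b1, b2 = simulate_pair(n_loci, se, 0.3, rng)
lam_overlap = lambda_meta(b1, se, b2, se)  # correlated errors: below 1

print(f"independent: {lam_indep:.3f}, overlapping: {lam_overlap:.3f}")
```

Under these assumptions, values near 1 indicate neither overlap nor heterogeneity, values below 1 indicate correlated estimates, and values above 1 would indicate heterogeneity between the cohorts.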

Figure S6
$\lambda_{meta}$ for PGC schizophrenia summary statistics.

Figure S7
$\lambda_{meta}$ for rheumatoid arthritis summary statistics between European samples and Asian samples.

Figure S9
$\lambda_{meta}$ and $\lambda_{GC}$ for GIANT height GWAS cohorts.
We investigated the relationship between $\bar{\lambda}_{meta}$ (the mean of all $\lambda_{meta}$ values of a given cohort with each of the other 173 GIANT height cohorts) and $\lambda_{GC}$ among the GIANT height cohorts. If there are no technical issues, such as inflated or deflated sampling variance for the estimated effects, we would expect to see: i) a correlation between $\lambda_{GC}$ and sample size; ii) no correlation between $\bar{\lambda}_{meta}$ and sample size; iii) no correlation between $\bar{\lambda}_{meta}$ and $\lambda_{GC}$. Consistent with a previous study (ref. 25) for a polygenic trait such as height, the correlation between $\bar{\lambda}_{meta}$ and sample size was 0.116 (p = 0.127) (Figure S9a,b). Nevertheless, the correlation between $\bar{\lambda}_{meta}$ and $\lambda_{GC}$ was 0.836 (p < 10e-16) for the 174 GIANT height cohorts (Figure S9c). We note that the 20 MIGEN cohorts had proportionally small $\lambda_{GC}$ and $\bar{\lambda}_{meta}$, with very high correlation between them (0.98); in contrast, the SardiNIA cohort, which had the largest $\lambda_{GC}$, showed the largest $\bar{\lambda}_{meta}$ (1.070 ± 0.049), standing out as a special case among the GIANT height cohorts. Assuming a polygenic model of $h^2 = 0.5$ over 30,000 independent loci, we simulated 174 cohorts using the actual sample sizes of the GIANT height cohorts, and observed an increased correlation (0.78) between $\bar{\lambda}_{meta}$ and $\lambda_{GC}$ for simulated cohorts with the sample sizes of the MIGEN cohorts (Figure S9d). Other effects, such as inflated or deflated sampling variance of the estimated genetic effects, could also lead to correlation between $\bar{\lambda}_{meta}$ and $\lambda_{GC}$ (Supplementary Figure S8). In addition, we constructed a single MIGEN analysis by combining the 20 MIGEN cohorts using an inverse-variance-weighted meta-analysis (ref. 26), and calculated $\lambda_{meta}$ between this combined MIGEN cohort and all 174 cohorts. As expected, the combined MIGEN cohort had $\lambda_{meta}$ = 0.90 ± 0.07 with the 20 MIGEN cohorts, owing to overlapping samples. In contrast, $\lambda_{meta}$ = 1.01 ± 0.02 with the 154 other cohorts, which indicates neither heterogeneity nor sample overlap.
Given that the MIGEN (2,340 samples) and SardiNIA (4,303 samples) cohorts contributed less than 3% of the total sample size (253,288 samples from the GIANT height GWAS cohorts), any impact of unusual $\lambda_{meta}$ values on the meta-analysis results is very small.
Simulated cohort-level summary statistics for this figure. $M$ independent loci were generated for cohort-level summary statistics. Each locus had allele frequency $p_i$, which was sampled from a uniform distribution ranging from 0.1 to 0.5, and genetic effect $\beta_i$, sampled from a standard normal distribution $N(0, 1)$. After rescaling, $\sum_{i=1}^{M} 2 p_i (1 - p_i) \beta_i^2 = h^2$. $p_i$ and $\beta_i$ were treated as true parameters. For a particular cohort with $n$ samples, its estimated effect $b_i \sim N(\beta_i, \sigma_i^2)$, where $\sigma_i^2$ denotes the sampling variance. All cohorts were assumed to share a common genetic architecture, and differences were only due to genetic drift, allele frequencies, and sampling variance of genetic effects.
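The simulation recipe above can be sketched as follows. Allele frequencies are drawn from a uniform distribution on 0.1 to 0.5, effects from a standard normal, and the effects are rescaled so that the summed locus variances equal the target heritability. The sampling variance of each estimated effect is an assumption here (1 / (2p(1 - p)n), the usual approximation for a standardized phenotype), since the source does not give its form legibly; drift in cohort allele frequencies is omitted, and the function names are hypothetical.

```python
import numpy as np

def simulate_true_parameters(n_loci, h2, rng):
    """Draw true allele frequencies and effects, rescaled to explain h2 in total."""
    p = rng.uniform(0.1, 0.5, size=n_loci)
    beta = rng.normal(0.0, 1.0, size=n_loci)
    # Rescale so that sum_i 2 p_i (1 - p_i) beta_i^2 equals h2.
    scale = np.sqrt(h2 / np.sum(2 * p * (1 - p) * beta ** 2))
    return p, beta * scale

def simulate_cohort(p, beta, n, rng):
    """Sample cohort-level effect estimates around the true effects.

    Assumption: sampling variance is 1 / (2 p (1 - p) n) per locus.
    """
    se = np.sqrt(1.0 / (2 * p * (1 - p) * n))
    b_hat = rng.normal(beta, se)
    return b_hat, se

rng = np.random.default_rng(42)
p, beta = simulate_true_parameters(n_loci=30000, h2=0.5, rng=rng)
b_hat, se = simulate_cohort(p, beta, n=2000, rng=rng)

explained = np.sum(2 * p * (1 - p) * beta ** 2)
print(f"variance explained by true effects: {explained:.3f}")
```

Calling `simulate_cohort` once per cohort with that cohort's sample size yields a set of independent cohorts sharing one genetic architecture, which is the setting the caption describes.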

Figure S10
Statistical power for detecting overlapping samples between a pair of cohorts given a type I error rate of 0.05.
The y-axis represents statistical power, and the x-axis the number of overlapping samples. Cohort 1 has 1,000, 5,000, 10,000, or 25,000 samples, and cohort 2 has 1,000 samples.

Figure S11
Using $\lambda_{meta}$ constructed either on genetic effects or on allele frequency to estimate overlapping samples between the WTCCC 7 diseases.
$\lambda_{meta}$ can be constructed on reported genetic effects (red bars), or alternatively on allele frequency (blue bars). Both can be used to detect overlapping samples. The x-axis indicates the pair of cohorts in WTCCC, and the y-axis represents the number of overlapping samples estimated from $\lambda_{meta}$, which is computed over 30,000 markers. The 95% confidence interval is shown on top of each bar. The mean of the estimated overlapping samples is 2127.38 ± 257.73 using $\lambda_{meta}$ on genetic effects, and 2707.99 ± 58.41 using $\lambda_{meta}$ on MAF.

Figure S12
Workflow for PPSR (pseudo profile score regression).
Step 1: determine the number of pseudo profile scores. Given an experiment-wise type I error rate of 0.01 and a type II error rate of 0.05 (power = 0.95), K pseudo profile scores should be generated using M markers, with K chosen so that the privacy of individual genotypes is guaranteed.
Step 2: generate profile scores for each cohort. The meta-analysis center generates a K × M matrix of pseudo genetic effects. In total, K profile scores are generated for each individual in each cohort.
Step 3: apply the PPSR method to detect overlapping individuals using the profile scores. For a pair of cohorts, PPS regression is conducted for each possible pair of individuals, one from each cohort, over the generated pseudo profile scores. If the regression coefficient exceeds the threshold (here b = 0.95), the pair of individuals is inferred to have highly similar genotypes, which may indicate that the same individual has been included in both cohorts.
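Steps 2 and 3 can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' implementation: genotypes are standardized with a shared reference allele-frequency vector (an assumption, so that the same person receives identical scores in both cohorts), K and M are fixed by hand rather than derived from the error rates in Step 1, and the function names are hypothetical.

```python
import numpy as np

def profile_scores(genotypes, ref_freq, pseudo_effects):
    """Return an (n_individuals, K) matrix of pseudo profile scores.

    Genotypes (0/1/2 counts) are standardized with shared reference
    frequencies, then multiplied by the K x M pseudo-effect matrix.
    """
    std_geno = (genotypes - 2 * ref_freq) / np.sqrt(2 * ref_freq * (1 - ref_freq))
    return std_geno @ pseudo_effects.T

def regression_slopes(scores_a, scores_b):
    """Slope of regressing each cohort-A individual's K scores on each cohort-B individual's."""
    a = scores_a - scores_a.mean(axis=1, keepdims=True)
    b = scores_b - scores_b.mean(axis=1, keepdims=True)
    cov = a @ b.T / a.shape[1]          # shape (n_a, n_b)
    var_b = (b ** 2).mean(axis=1)       # shape (n_b,), broadcast over columns
    return cov / var_b

rng = np.random.default_rng(3)
n_markers, n_scores = 1000, 200                      # M and K, chosen for illustration
ref_freq = rng.uniform(0.1, 0.5, size=n_markers)
pseudo_effects = rng.normal(size=(n_scores, n_markers))  # K x M, generated centrally

cohort_a = rng.binomial(2, ref_freq, size=(60, n_markers))
cohort_b = rng.binomial(2, ref_freq, size=(50, n_markers))
cohort_b[:5] = cohort_a[:5]                          # plant five shared individuals

slopes = regression_slopes(
    profile_scores(cohort_a, ref_freq, pseudo_effects),
    profile_scores(cohort_b, ref_freq, pseudo_effects),
)
flagged = [(int(i), int(j)) for i, j in np.argwhere(slopes > 0.95)]
print(flagged)
```

Only the K scores per individual leave each cohort in this scheme; the genotype matrices themselves are never exchanged, which is the property the workflow relies on.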