Introduction

To elucidate genetic architecture, which requires maximized statistical power to discover risk alleles of small effect, genome-wide association meta-analyses (GWAMAs) are growing ever larger and may contain data from hundreds of cohorts. At the individual cohort level, genome-wide association study (GWAS) analysis is often based on different genotyping chips and conducted with different protocols, such as different software tools and reference populations for imputation, inclusion of study-specific covariates, and association analyses using different methods and software. Although solid quality control (QC) analysis pipelines for GWAMA exist,1 these analyses focus on QC for each cohort independently. With the ever-increasing size of GWAMA, there is a need for additional QC that goes beyond the cohort-by-cohort genotype-level analysis performed to date.

In this study, we propose a new set of QC metrics for GWAMA. All these applications assume that there is a central analysis hub, where summary statistic data from GWAS are uploaded for each cohort. All methods proposed are implemented in freely available software GEAR.

Materials and methods

Overview of materials and methods

Cohort-level summary statistics

The height GWAS summary statistics were provided by the GIANT Consortium and were from 82 cohorts (174 separate files) representing a total of 253 288 individuals, and ~2.5 million autosomal SNPs imputed to the HapMap2 reference.2 Metabochip summary statistics for body mass index (BMI) were from 43 cohorts (120 files), representing a total of 103 047 samples from multiple ethnicities with about 200 000 SNPs genotyped on customised chips.3, 4

1000 Genomes project samples

1000 Genomes Project (1KG) reference samples5 were used as the reference samples for estimating Fst and meta-PC. When assessing the global-level Fst measures, Yoruba (YRI, 108 individuals), Han Chinese in Beijing (CHB, 103 individuals), and Utah Residents with Northern and Western European Ancestry (CEU, 99 individuals) were employed as the reference panels, representing African, East Asian, and European samples, respectively. For calculating within-Europe Fst, CEU, Finnish (FIN, 99 individuals), and Toscani (TSI, 107 individuals) were employed to represent northwest, northeast, and southern Europeans, respectively. For analyses using a whole European panel, CEU, FIN, TSI, GBR (British, 91 individuals), and IBS (Iberian, 107 individuals) were pooled together as an ‘averaged’ European reference.

WTCCC GWAS data

The WTCCC GWAS data have 2934 shared controls for seven diseases with a total of 14 000 cases.6 An individual GWAS was conducted for each disease using PLINK,7 and the resulting summary statistics were used to estimate λmeta.

The four proposed metrics include:

  • Fpc: a genome-wide comparison of allele frequency differences across cohorts or against a common reference population.

  • Meta-PC: principal component analysis of reported allele frequencies.

  • λmeta: a pairwise cohort statistic that uses allele frequency or effect size concordance to detect the proportion of sample overlap or heterogeneity.

  • Pseudo profile score regression: an easy to implement analysis to pinpoint each between-cohort overlapping sample that does not require the sharing of individual-level genotype data.

The technical details of these four methods summarized here can be found in the Supplementary Notes. Overview and application of these four metrics in GWAMA can be found in the Text Box.

Results

Population genetic QC analysis using Fst

In GWAMA, because only summary statistics such as allele frequencies are available to the central analysis hub, it is difficult to identify population outliers. Checks for gross differentiation in allele frequencies at specific SNPs between GWAMA cohorts and a reference (such as the 1000 Genomes Project, denoted as 1KG)5 are part of standard QC protocols,1 but checking for more differentiation than expected across the entire genome is not usually part of the QC pipeline. We propose that a genetic distance inferred from Fst, which reflects the genetic distance between a pair of populations, is a useful additional QC statistic to detect cohorts that are population outliers. Using the relationship between Fst and principal components,8 our Fst cartographer algorithm can be used to estimate the relative genetic distance between cohorts (Supplementary Notes for Method I; Supplementary Figure S1).

We applied the Fst metric to the GIANT Consortium BMI Metabochip cohorts (55 male-only cohorts, 55 female-only cohorts, and 10 mixed-sex cohorts), which were recruited from multiple ethnicities,3 such as Europeans, African Americans in the Atherosclerosis Risk in Communities Study (ARIC) and cohorts from Jamaica (SPT), Pakistan (PROMISE), Philippines (CLHNS), and Seychelles (SEY). For each Metabochip cohort, we sampled 30 000 independent markers to calculate Fst values with each of three 1KG samples (CEU, CHB, and YRI). For validation of the method, we also calculated Fst values against the 1KG Japanese (JPT, Japanese in Tokyo, Japan), Indian (GIH, Gujarati Indian in Houston, US), Kenyan (LWK, Luhya in Webuye, Kenya), and European samples (IBS, Iberian populations, Spain; FIN, Finnish, Finland; TSI, Toscani, Italy; and GBR, British in England and Scotland), to see whether the known genetic origins of those cohorts could be recovered.
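As a rough illustration of the cohort-versus-reference comparison described above, the sketch below computes a genome-wide Fst from allele frequencies alone. It assumes a textbook Nei-style ratio-of-sums estimator without sample-size correction, which may differ from the estimator defined in the Supplementary Notes; the function names and the toy frequencies are hypothetical.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Ratio-of-sums Fst between two populations from per-SNP allele frequencies.

    Assumption: Nei-style estimator, (p_a - p_b)^2 / (4 * p_bar * (1 - p_bar)),
    summed over SNPs in numerator and denominator separately. No correction
    for finite sample size is applied.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("frequency lists must cover the same SNPs")
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2.0
        numerator += (p_a - p_b) ** 2
        denominator += 4.0 * p_bar * (1.0 - p_bar)
    if denominator == 0.0:
        return 0.0  # every SNP is monomorphic in both populations
    return numerator / denominator


def fst_spectrum(cohort_freqs, reference_panels):
    """Fst of one cohort against each named reference panel."""
    return {name: fst_two_populations(cohort_freqs, ref_freqs)
            for name, ref_freqs in reference_panels.items()}


def closest_reference(cohort_freqs, reference_panels):
    """Name of the reference panel with the smallest Fst to the cohort."""
    spectrum = fst_spectrum(cohort_freqs, reference_panels)
    return min(spectrum, key=spectrum.get)


# Toy (hypothetical) frequencies at three SNPs, for illustration only.
toy_panels = {
    "CEU": [0.12, 0.48, 0.88],
    "CHB": [0.40, 0.20, 0.60],
    "YRI": [0.70, 0.90, 0.30],
}
toy_cohort = [0.10, 0.50, 0.90]
toy_spectrum = fst_spectrum(toy_cohort, toy_panels)
```

In practice the frequency vectors would hold the sampled independent markers, aligned to the same reference allele in the cohort and in every panel.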

According to the origins of the samples, each Metabochip cohort showed a different genetic distance spectrum to the three reference populations (Figure 1a). The JPT and Philippine cohorts had very small genetic distances to CHB, as expected, but large to CEU and YRI; however, the Pakistan cohorts showed much closer genetic distances to CEU than to CHB and YRI, indicating their demographic history. The cohorts sampled from Jamaica, Seychelles, Hawaii, and the African American ARIC cohort had small genetic distances to YRI, but large distances to CHB and CEU. For most European cohorts, as expected, the distances to CEU were very small compared with those to CHB and YRI. Given their relative distances to CEU, CHB, and YRI, using our Fst cartographer algorithm (Supplementary Notes for Method I; Supplementary Figure S1), the cohorts were projected into a two-dimensional space, called Fst-derived principal components (FPC) space, constructed by YRI, CHB, and CEU as the reference populations (Figure 1b). The allocation of the cohorts to the FPC space resembles that of eigenvector 1 against eigenvector 2 in principal component analysis (PCA), and is similar to those observed in PCA using individual-level GWAS data for populations of various ethnicities such as in 1KG samples.5 Therefore, our method to place cohorts in geographical regions from GWAS summary statistics works well at a global-population scale.

Figure 1

Recovery of cohort-level genetic background and inference of their geographic locations for GIANT BMI Metabochip cohorts and GIANT GWAS height cohorts using the Fst-derived genetic distance measure. (a) Genetic distance spectrum for all Metabochip cohorts to CEU, CHB, and YRI. The origins of the cohorts are denoted on the horizontal axis. (b) Projection for the Metabochip cohorts into FPC space defined by YRI, CHB, and CEU reference populations. The x and y axis represent relative distances derived from the genetic distance spectrum. Three dashed lines, blue for CEU, green for CHB, and red for YRI, partitioned the whole FPC space to three genealogical subspaces. (c) The genetic distance spectrum for the Metabochip European cohorts to CEU – northwest Europeans, FIN – northeast European, and TSI – southern Europeans. The nationality of the cohorts is denoted on the horizontal axis. (d) The projection for the Metabochip European cohorts to the FPC space defined by CEU, FIN, and TSI reference populations. The whole space is further partitioned into three subspaces, CEU-TSI genealogical subspace (red and blue dashed lines), FIN-TSI genealogical subspace (green-blue dashed lines), and CEU-FIN genealogical subspace (red-green dashed lines), respectively. (e) Each cohort has three Fst values by comparing with CEU, FIN, and TSI reference samples. The height of each bar represents its relative genetic distance to these three reference populations. The nationalities of the cohorts are denoted along the horizontal axis. The grey triangles along the x axis indicate MIGEN cohorts. (f) Given the three Fst values, the location of each cohort can be mapped. The whole space was partitioned into three subspaces, CEU-TSI genealogical subspace (red and blue dashed lines), FIN-TSI genealogical subspace (green and blue dashed lines), and CEU-FIN genealogical subspace (red and green dashed lines). DGI (in the blue box) had samples from the Botnia study. 
Across the MIGEN cohorts (denoted as red triangles in the red box), the same allele frequencies (likely calculated from a South European cohort) were presented for each cohort. The open circles represent the mean of inferred geographic locations for the cohorts from the same country. Cohort/country codes: AF, African; AU, Australia; CA, Canada; CH, Switzerland; DE, Germany; DK, Denmark; EE, Estonia; ES, Iberian Population in Spain in 1KG; EU, European Nations; FI, Finland; FIN, Finns in 1000 Genomes Project (1KG); FR, France; GBR, British in 1KG; GIH, Gujarati Indian in 1KG; GR, Greece; Hawaii, Hawaii in USA; IBS, Iberian Population in Spain in 1KG; IT, Italy; IS, Iceland; JM, Jamaica; JPT, Japanese in 1KG; LWK, Luhya in 1KG; NL, Netherlands; NO, Norway; PH, the Philippines; PK, Pakistan; SC, Seychelles; SCT, Scotland; SE, Sweden; TSI, Toscani in 1KG; UK, United Kingdom; US, United States of America.

We next investigated whether our genetic distance method works at a much finer geographic scale. It is known that, using individual-level data, PCA can mirror the geographic locations of European samples.9 Here we analyzed the 103 GIANT European-ancestry Metabochip cohorts (48 male-only cohorts, 47 female-only cohorts, and 8 mixed-sex cohorts) with the fine-scale Fst genetic distance measure, using the CEU, FIN, and TSI reference populations, which represent northwest, northeast, and southern European populations, respectively. For each of the GIANT European-ancestry Metabochip cohorts, Fst was calculated relative to each of these three reference populations and showed concordance with the known origin of the samples (Figure 1c). For example, cohorts from Finland and Estonia were close to FIN but distant from TSI; cohorts from South Europe such as Italy and Greece had small genetic distances to TSI; and cohorts from West Europe had small genetic distances to CEU. Similarly, the projected origin of each European-ancestry Metabochip cohort resembles its geographic location on the European map, as expected (Figure 1d). Therefore, Fpc based upon population differentiation also works at a fine scale.

We next applied the Fst genetic distance measures to 174 GIANT height GWAS cohorts (79 male-only cohorts, 76 female-only cohorts, and 19 mixed-sex cohorts; excluding Metabochip data), which were all of European ancestry and imputed to the HapMap reference panel.2 Given the three Fst values to CEU, FIN, and TSI (Figure 1e), the geographic origin of each cohort can be inferred as for the GIANT BMI Metabochip data. The projected coordinates of each GWAS cohort match its origin very well (Figure 1f). For example, a Canadian cohort, the Quebec Family Study (QFS), was located close to DESIR, a French cohort, consistent with the French genetic heritage of the QFS.10 In addition, we also observed complexity due to mixed samples from different countries. For example, the DGI/Botnia study had samples recruited from Sweden and Finland, and its inferred geographic location lies between the Swedish cohorts and the Finnish cohorts.11 We also note that for the Myocardial Infarction Genetics Consortium (MIGEN) cohorts, which were recruited from Finland, Sweden, Spain, and the United States, the same allele frequencies were reported for all sub-cohorts, and all cohorts were allocated to southern Europe (very close to the 1KG IBS cohort; Figure 1f and Supplementary Figure S2). As the allele frequencies, used in QC steps to eliminate low-quality loci, were not directly used in estimating genetic effects in the GWAMA, the reported allele frequencies in MIGEN have had little impact on the published GWAMA results.2

Next, we show that Fst can detect populations that have a different demographic past. Using all 1KG European samples as the reference panel (ie, an ‘averaged’ European reference panel), most cohorts in GIANT had Fst<0.005 relative to this average, which agrees with previously reported results using individual-level data from European nations.9 A few cohorts showed large Fst, such as the AMISH cohort with Fst=0.018, and the North Swedish Population Health Study12 with Fst=0.014. Both populations are known to have been genetically isolated (Supplementary Figure S3).

PCA for allele frequencies (meta-PCA)

Given the same allele frequencies as used for Fst-based analysis above, we conducted PCA for allele frequencies, denoted as meta-PCA (or mPC). In meta-PCA, each cohort was analogously considered as an ‘individual’. For example, 120 Metabochip cohorts were considered as a sample of 120 ‘individuals’. Although the inferred ancestral information was for each cohort rather than any individuals, implementation of meta-PCA was the same as the conventional PCA (Supplementary Notes for Method II). Meta-PCA was tested with 1KG samples. It indicated that meta-PCA could reveal the genetic background for each cohort as precisely as that based on individual-level data (Supplementary Figure S4).
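To make the idea of treating each cohort as an 'individual' concrete, the following minimal sketch runs a PCA on a cohort-by-SNP matrix of allele frequencies using only the Python standard library. It is an assumption-laden illustration rather than the GEAR implementation: columns are centered but not rescaled, and the leading components are found by power iteration with deflation. The function name `meta_pca` and its parameters are hypothetical.

```python
import math
import random


def meta_pca(freqs, n_components=2, n_iter=500, seed=0):
    """PCA on a cohort-by-SNP allele-frequency matrix (each cohort is one row).

    Returns a list of unit-length vectors; component c holds one coordinate
    per cohort. Assumption: SNP columns are centered across cohorts but not
    scaled by their variance.
    """
    n = len(freqs)
    m = len(freqs[0])
    col_means = [sum(row[j] for row in freqs) / n for j in range(m)]
    x = [[row[j] - col_means[j] for j in range(m)] for row in freqs]
    # Cohort-by-cohort similarity matrix, analogous to a relationship matrix.
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) / m for k in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(n_iter):
            # Deflation: remove the directions of components already found.
            for u in components:
                proj = sum(a * b for a, b in zip(v, u))
                v = [a - proj * b for a, b in zip(v, u)]
            w = [sum(g * a for g, a in zip(row, v)) for row in gram]
            norm = math.sqrt(sum(a * a for a in w))
            if norm == 0.0:
                break  # no variance left in this direction
            v = [a / norm for a in w]
        components.append(v)
    return components


# Toy (hypothetical) input: two pairs of cohorts with similar frequencies.
toy_freqs = [
    [0.10, 0.20, 0.80, 0.30],
    [0.12, 0.18, 0.82, 0.31],
    [0.60, 0.70, 0.20, 0.32],
    [0.58, 0.72, 0.22, 0.29],
]
toy_pcs = meta_pca(toy_freqs)
```

Reference cohorts such as the 1KG populations would simply be appended as extra rows, so that study cohorts can be read relative to them on the first two components.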

We applied meta-PCA to 120 Metabochip cohorts for nearly 34K SNPs common to Metabochip and 1KG, with the inclusion of 10 1KG cohorts (East Asian: CHB and JPT; South Asian: GIH; European: CEU, FIN, GBR, IBS, and TSI; African: LWK and YRI) as the reference cohorts. The inferred ancestral information of each cohort agreed well with demographic information. For example, PROMISE (Pakistan) was located very close to GIH, CLHNS (Philippines) close to CHB and JPT, ARIC (African American) and SPT (Jamaican) close to YRI and LWK, and the European cohorts close to CEU and FIN (Figure 4a).

We also applied meta-PCA to 174 GIANT height GWAS cohorts for nearly 1M SNPs, with the inclusion of 10 1KG reference cohorts. At the global-population level, the 174 cohorts were all allocated close to CEU and FIN, consistent with their reported demographic information (Figure 4b). For fine-scale inference, we conducted meta-PCA again but with the inclusion of only the five 1KG European samples. As demonstrated (Figure 4c), the inferred relative locations of the European cohorts reflected their real geographical locations, as previously observed using individual-level data.9 For example, of the four cohorts from Italy, the MICROS cohort was from South Tyrol, northern Italy. MICROS had its meta-PC coordinates much closer to CEU than the other three Italian cohorts, reflecting its geographic location; the InCHIANTI cohort had coordinates almost identical to TSI; and the SardiNIA cohort was located further south than TSI, reflecting its relative geographic and genetic isolation, as recently confirmed.13 Similarly, in the sub-plots for Finland and Sweden, the cohorts from the MIGEN consortium, which all had reported allele frequencies of south European origin, were located near 1KG TSI and IBS.

These results were consistent with those observed from Fpc in the previous section, and also agreed well with demographic information. Therefore, based on the reported allele frequencies, the demographic information could be verified by the meta-PCA method.

λmeta to detect pairwise cohort heterogeneity and sample overlap

In this study, we use the summary statistics for a pair of cohorts to calculate λmeta, a metric that examines heterogeneity from the concordance of reported effect sizes and sampling variance. For a SNP marker (i), given its reported estimated effect size (bi) and sampling variance (σi2) in a pair of cohorts 1 and 2, we can calculate a test statistic $T_i = (b_{i,1} - b_{i,2})^2 / (\sigma_{i,1}^2 + \sigma_{i,2}^2)$, the ratio between the squared difference of their reported effects and the sum of their reported sampling variances. We constructed 30 000 T statistics using markers in linkage equilibrium along the genome for a pair of cohorts. Under the null hypothesis of no overlapping samples/heterogeneity, T follows a χ2 distribution with 1 degree of freedom (Supplementary Notes for Method III).

Analogous to λGC, $\lambda_{meta} = \mathrm{median}(T) / 0.455$, the ratio between the median of the 30 000 T values and the median of a χ2 statistic with 1 degree of freedom (a value of 0.455), has an expected value of 1 for two independent GWAS summary statistics sets for the same trait. When there is heterogeneity between estimated genetic effects, the expectation is λmeta>1, and in contrast λmeta<1 if there are overlapping samples. In general, not only overlapping samples but also close relatives present in different cohorts can lead to correlated summary statistics generating λmeta<1. However, unless the proportion of overlapping relatives is substantial and their phenotypic correlation is high, the correlation of the summary statistics due to the effective number of overlapping samples (no) is expected to be dominated by the same individuals contributing phenotypic and genetic information to different cohorts (Supplementary Figure S5). Furthermore, if genomic control is applied to adjust the sampling variance, then λmeta will be reduced relative to its value without genomic control.
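The definition above translates directly into a few lines of code. The sketch below computes the T statistics and λmeta from two sets of reported effect sizes and sampling variances, and includes a small simulator used only for illustration. The simulator is an assumption: it models sample overlap simply as a correlation between the sampling errors of the two cohorts, with no true effect-size differences; the function names are hypothetical.

```python
import math
import random
import statistics

CHI2_1DF_MEDIAN = 0.455  # median of a chi-square with 1 df, as quoted in the text


def t_statistics(b1, var1, b2, var2):
    """Per-marker T: squared effect difference over the sum of sampling variances."""
    return [(x - y) ** 2 / (v1 + v2)
            for x, v1, y, v2 in zip(b1, var1, b2, var2)]


def lambda_meta(b1, var1, b2, var2):
    """Median of the T statistics divided by the chi-square(1) median."""
    return statistics.median(t_statistics(b1, var1, b2, var2)) / CHI2_1DF_MEDIAN


def simulate_pair(n_markers, error_correlation, seed=0):
    """Toy summary statistics for two cohorts with unit sampling variance.

    Assumption: overlap is represented by `error_correlation`, the correlation
    between the two cohorts' sampling errors at each marker.
    """
    rng = random.Random(seed)
    shared_w = math.sqrt(error_correlation)
    own_w = math.sqrt(1.0 - error_correlation)
    b1, b2 = [], []
    for _ in range(n_markers):
        shared = rng.gauss(0.0, 1.0)
        b1.append(shared_w * shared + own_w * rng.gauss(0.0, 1.0))
        b2.append(shared_w * shared + own_w * rng.gauss(0.0, 1.0))
    ones = [1.0] * n_markers
    return b1, ones, b2, ones


independent = lambda_meta(*simulate_pair(30000, 0.0, seed=1))
overlapping = lambda_meta(*simulate_pair(30000, 0.5, seed=2))
```

Under this toy model the independent pair should give a value near 1, and the correlated pair a value near 1 minus the error correlation, which mirrors the deflation described for overlapping samples.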

GWAS summary statistics for schizophrenia were available in two phases: the first had 9394 controls and 12 462 cases,14 and in the next phase ~18 000 Swedish samples were added.15 Such substantial sample overlap between these two sets of summary statistics led to an estimated λmeta as low as 0.257 (Supplementary Figure S6), consistent with this known overlap. In contrast, heterogeneity between data sets (represented by λmeta>1) was observed between GWAS summary statistics of rheumatoid arthritis from European and Asian studies,16 for which λmeta=1.09 (Supplementary Figure S7). In addition, we note that the distribution of the empirical T-statistics deviates from expectation at the upper tail of the distribution, suggesting differences in effect size or linkage disequilibrium between these two ancestries.

Next, we estimated λmeta for pairs of cohorts from the 174 GIANT height GWAS cohorts.2 We found no evidence for substantial sample overlap but did observe between-cohort heterogeneity and technical artifacts. From the 174 GIANT height GWAS,2 we calculated 15 051 pairwise λmeta values, resulting in a bell-shaped distribution (Figure 2a and b), with a mean of 1.013 and an empirical SD of 0.022, which was greater than the theoretical SD of 0.014. The empirical mean and SD can be used to construct a z-score test for each λmeta. These results are consistent with a small amount of heterogeneity, which is not unexpected given variation in the actual (unknown) genetic architecture and in analysis protocols. However, the mean is close to 1.0 and, based upon this QC metric, the results are consistent with stringent QC and data cleaning. The minimum λmeta value was ~0.88 (between SORBS men and SORBS women; Figure 2c), with P-value<1e−10 (testing for the difference from 1), and the maximum was 1.245 (between SardiNIA and WGHS; Figure 2d), with P-value<1e−10. These were the most deflated and the most inflated λmeta values across the GIANT height study cohorts, and both were significant after correction for multiple testing. Of note, SORBS were analyzed using a method that corrected for relatedness, which potentially led to the deflated λmeta, as implied by the theory (Supplementary Notes for Method III). Illustrating λmeta (Figure 2a) highlighted that 20 cohorts from the MIGEN consortium showed substantially lower λmeta with many other cohorts (right-bottom triangle in Figure 2a) than the average, consistent with over-conservative models for statistical association analyses being used in these cohorts, which may be due to very small sample sizes (ranging from 36 to 320 for the 20 MIGEN cohorts, with an average sample size of 132). Consistent with this, many of the MIGEN cohorts also have λGC<1 (Supplementary Figures S8 and S9). In contrast, the SardiNIA cohort (4303 samples) showed heterogeneity with nearly all other cohorts (Supplementary Figures S8 and S9), perhaps due to unknown artifacts or a slightly different genetic architecture for height as a result of demographic history.17
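The z-score test mentioned above can be sketched as follows, assuming that each pairwise λmeta is standardized by the empirical mean and SD of all pairwise values and referred to a standard normal distribution. The helper names are hypothetical, and P-values from this sketch need not match those reported in the text, which test for a difference from 1 and may rest on a different SD.

```python
import math


def n_cohort_pairs(n_cohorts):
    """Number of unordered cohort pairs."""
    return n_cohorts * (n_cohorts - 1) // 2


def lambda_meta_z_test(lam, mean, sd):
    """Return (z, two-sided P) for one lambda_meta value.

    Assumption: z is compared with a standard normal distribution.
    """
    z = (lam - mean) / sd
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# Values quoted in the text for the GIANT height cohorts.
pairs = n_cohort_pairs(174)
z_max, p_max = lambda_meta_z_test(1.245, mean=1.013, sd=0.022)
z_min, p_min = lambda_meta_z_test(0.876, mean=1.013, sd=0.022)
```

With many thousands of pairs, a multiple-testing correction (for example a Bonferroni threshold of 0.05 divided by the number of pairs) would be applied before flagging a pair.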

Figure 2

λmeta for the GIANT height GWAS cohorts. (a) Given 174 cohorts, there are 15 051 λmeta values, which provide an overview of the quality control of the summary statistics. The heat map represents the 15 051 λmeta statistics, and the x and y axes index each pair of cohorts. The pairs of cohorts that showed heterogeneity (λmeta>1) are illustrated in the left-top triangle, and those that showed homogeneity (λmeta<1) in the right-bottom triangle. (b) The distribution of λmeta from the 174 cohorts/files used in the GIANT height meta-analysis. The overall mean of the 15 051 λmeta values is 1.013, and the SD is 0.022. (c) Illustration of homogeneity between two cohorts (SORBS MEN and WOMEN), λmeta=0.876. (d) Illustration of SardiNIA and WGHS; this pair of cohorts has λmeta=1.245. The grey band represents the 95% confidence interval for λmeta.

The statistical power to detect overlapping samples is maximized when a pair of cohorts has equal sample sizes (Supplementary Figure S10); in other words, the confidence interval under the null hypothesis of no overlapping samples depends on the sample sizes of the pair of cohorts. As a comparison, the estimated correlation between the genetic effects for a pair of cohorts has been proposed to quantify overlapping samples,18, 19 but this metric is confounded with genetic architecture, such as the heritability underlying the trait(s) (Table 1; Supplementary Notes IV). When there was heritability, the estimated correlation between genetic effects could be biased and could lead to an incorrect inference about overlapping samples for a pair of cohorts. When there was no heritability, the estimated correlation was correct and agreed well with the one estimated with λmeta. As the existence of heritability is one of the reasons to perform GWAMA, λmeta is preferred when estimating overlapping samples between cohorts.

Table 1 The estimated correlation for a pair of cohorts via their summary statistics given 30 000 independent loci

Another parameterization of λmeta is to estimate it from differences in allele frequencies between a pair of cohorts instead of differences between estimated effect sizes (Supplementary Notes III; Supplementary Figure S11).
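A possible form of this allele-frequency parameterization is sketched below. It is an assumption rather than the definition in the Supplementary Notes: each reported frequency is given the binomial sampling variance p(1−p)/(2n) for n diploid individuals, and the per-marker statistic is the squared frequency difference over the sum of the two variances. Function names are hypothetical.

```python
import statistics

CHI2_1DF_MEDIAN = 0.455  # median of a chi-square with 1 df, as quoted in the text


def freq_t_statistic(p1, n1, p2, n2):
    """Squared allele-frequency difference over the summed sampling variances.

    Assumption: binomial sampling of 2n alleles in each cohort. Returns None
    when the marker is monomorphic in both cohorts (zero variance).
    """
    var = p1 * (1.0 - p1) / (2.0 * n1) + p2 * (1.0 - p2) / (2.0 * n2)
    if var == 0.0:
        return None
    return (p1 - p2) ** 2 / var


def lambda_meta_from_freqs(freqs1, n1, freqs2, n2):
    """lambda_meta computed from allele frequencies instead of effect sizes."""
    t_values = []
    for p1, p2 in zip(freqs1, freqs2):
        t = freq_t_statistic(p1, n1, p2, n2)
        if t is not None:
            t_values.append(t)
    return statistics.median(t_values) / CHI2_1DF_MEDIAN
```

As with the effect-size version, values well below 1 would point towards shared samples, whereas values above 1 would point towards differences in ancestry or in how the frequencies were reported.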

Detection of overlapping samples using pseudo profile score regression

In many circumstances, individual cohorts are not permitted to share individual-level data, either by national law or by local ethical review board conditions. Although the metric λmeta can be transformed to give an estimate of no between cohorts for quantitative traits, it cannot give an estimate of overlapping samples in case–control studies owing to the ratio of cases to controls in each study. To get around this problem, Turchin and Hirschhorn20 created a software tool, Gencrypt, which utilizes a security protocol known as one-way cryptographic hashes to allow overlapping participants to be identified without sharing individual-level data. We propose an alternative approach, pseudo profile score regression (PPSR), which involves sharing weighted linear combinations of SNP genotypes with the central meta-analysis hub. In essence, multiple random profile scores are generated for each individual in each cohort, using SNP weights supplied by the analysis hub, and the resulting scores are provided back to the analysis hub. PPSR works through three steps (Supplementary Notes for Method IV; Supplementary Figure S12), and its purpose is to estimate a relationship-like matrix of dimension ni × nj for a pair of cohorts that have ni and nj individuals, respectively. Each entry of the matrix holds the genetic similarity for a pair of samples, one from each of the two cohorts, estimated via PPSR. The central hub analysts can determine the best set of SNPs that each individual cohort uses to generate PPS. Without loss of generality, a set of loci directly genotyped in all cohorts would make a good candidate set of SNPs for PPS.

We use WTCCC data as an illustration, detecting the 2934 shared controls between any two of the diseases by PPSR. Among 330K non-palindromic loci, we randomly picked M=100, 200, and 500 SNPs to generate pseudo profile scores. The seven cohorts gave 21 cohort-pair comparisons, comprising 488 587 090 individual-pair tests in total. To have an experiment-wise type I error rate=0.01 and type II error rate=0.05 (power=0.95) for detecting overlapping individuals, we needed to generate at least 57 PPSs. We generated the weights S=[s1,s2,s3,…,s57], where each s is a vector of M elements sampled from a standard normal distribution. S is shared across the seven cohorts for generating PPSs for each individual. In total, 57 PPSs were generated for each individual in each cohort. For a pair of cohorts, PPSR was conducted over the generated PPSs for each possible pair of individuals, one from each cohort. When the regression coefficient (b) exceeded the threshold, here 0.95, the pair of individuals was inferred to have highly similar genotypes, implying that the same individual was included in both cohorts (Supplementary Notes for Method IV).
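The procedure above can be illustrated with a minimal sketch. It assumes that genotypes (coded 0, 1, 2) are standardized by allele frequency before scoring and that the regression is an ordinary least-squares slope of one individual's PPS on another's; the exact construction in the Supplementary Notes may differ. All function names are hypothetical.

```python
import math
import random


def make_weights(n_markers, n_scores, seed=0):
    """S = [s1, ..., sK]: K weight vectors of M standard-normal elements."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_markers)]
            for _ in range(n_scores)]


def pseudo_profile_scores(genotype, freqs, weights):
    """K pseudo profile scores for one individual.

    Assumption: each genotype is standardized as (g - 2p) / sqrt(2p(1 - p)).
    """
    z = [(g - 2.0 * p) / math.sqrt(2.0 * p * (1.0 - p))
         for g, p in zip(genotype, freqs)]
    return [sum(w * x for w, x in zip(s_k, z)) for s_k in weights]


def ppsr_slope(scores_i, scores_j):
    """Least-squares slope of individual j's scores on individual i's scores."""
    k = len(scores_i)
    mean_i = sum(scores_i) / k
    mean_j = sum(scores_j) / k
    sxx = sum((x - mean_i) ** 2 for x in scores_i)
    sxy = sum((x - mean_i) * (y - mean_j) for x, y in zip(scores_i, scores_j))
    return sxy / sxx


def random_genotype(freqs, rng):
    """Draw a 0/1/2 genotype at each marker under Hardy-Weinberg proportions."""
    return [int(rng.random() < p) + int(rng.random() < p) for p in freqs]


def is_overlapping(scores_i, scores_j, threshold=0.95):
    """Flag a pair whose slope exceeds the threshold used in the text."""
    return ppsr_slope(scores_i, scores_j) > threshold
```

Each cohort would compute `pseudo_profile_scores` locally with the shared weights and send only the K scores per individual to the hub, which then evaluates `ppsr_slope` for every cross-cohort pair.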

When using 200 and 500 random SNPs, all the known 2934 shared controls were detected in the 21 cohort-pairwise comparisons; when using 100 random SNPs, on average 2931 shared samples were identified, which is more accurate than using λmeta constructed from either genetic effects or allele frequencies (Figure 3a). In addition, no false positives were observed among the detected overlapping samples, consistent with simulations showing that the method was conservative in controlling the type I error rate (Supplementary Notes for Method IV). For comparison, we also used Gencrypt to detect overlapping samples using the same set of SNPs as used in PPSR. Although Gencrypt guidelines suggest the use of at least 20 000 random SNPs,20 selecting 500 random SNPs in the WTCCC cohorts also provided good accuracy with Gencrypt, and on average about 2920 (99.6% of the shared controls) overlapping samples were detected, only slightly lower than with PPSR. For example, for BD and CAD, Gencrypt detected 2912 shared controls but was unable to identify ~20 overlapping controls because of missing data (on average a 1% missing rate).

Figure 3

Pseudo profile score regression for pinpointing overlapping samples/relatives. (a) Each cluster represents a pair of cohorts as denoted on the x axis. Within each cluster, from left to right, the bars show the overlapping controls detected using λmeta based on effect size estimates, λmeta based on minor allele frequency (MAF), and PPSR using 100, 200, and 500 markers. WTCCC cohort codes: BD for bipolar disorder, CAD for coronary artery disease, CD for Crohn’s disease, HT for hypertension, RA for rheumatoid arthritis, T1D for type 1 diabetes, and T2D for type 2 diabetes. (b) Illustration of regression coefficients between WTCCC BD and CAD from 57 pseudo profile scores (PPS) generated from 500 markers. The x axis is the PPSR regression coefficient and the y axis is the real genetic relatedness (as calculated from individual-level genotype data). The red points are the shared controls between the two cohorts, and the blue points are first-degree relatives. (c) The PPS regression coefficients for detecting overlapping first-degree relatives using 286 PPS generated from 500 markers. (d) Decoding genotypes from the PPS. Given the set of profile scores, one may run a GWAS-like analysis to infer the genotypes. The ratio between the number of markers (M) and the number of pseudo profile scores (K) determines the potential discovery of individual-level information. The higher the ratio, and the higher the allele frequency, the less information can be recovered. The y axis is an R2 metric representing the accuracy of the inferred genotypes relative to the real genotypes. In the panels from left to right, 100, 200, 500, and 1000 SNPs were used to generate 10, 20, 50, and 1000 profile scores. In each cluster, the three bars show the inferred accuracy for alleles in different MAF ranges, with the SE of the mean.

Furthermore, PPSR is able to detect pairs of relatives. For example, between the BD and CAD cohorts, two pairs of apparent first-degree relatives were detected (Figure 3b). To find additional first-degree relatives between BD and CAD cohorts, at least 265 PPSs were required to have a type I error rate of 0.01 and type II error rate of 0.05 for a regression coefficient cutoff of 0.45, a threshold for first-degree relatives. As expected, all other individuals that did not show high relatedness did not reach the threshold of 0.45 of the PPS regression coefficient for first-degree relatives (Figure 3c). Gencrypt did not detect any first-degree relatives.

The PPS for each individual carry very little personal information, and this information can be minimized so that there is a very low probability of decoding the genotypes. One way to attempt to decode the genotypes from PPS is to reverse the PPSR, so that the individual genotypes are predicted in the regression (Supplementary Notes for Method IV). The individual-level genotypic information that can be recovered by an analyst who knows the S matrix (the weights for generating PPS) is determined by the ratio between the number of markers (M) used to generate the PPS and the number of PPS (K). Therefore, the information that can be inferred on individual genotypes can be minimized and tailored to any specific ethics requirements. We suggest choosing this ratio so that privacy is protected while sufficient accuracy is retained (Figure 3d).
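To show why the ratio of M to K matters, the sketch below attempts the reverse regression for one individual: each marker's genotype is estimated as the slope of the individual's K scores on that marker's K weights, in the spirit of a single-marker GWAS-like scan. It assumes unstandardized 0/1/2 genotypes and an attacker who knows the full weight matrix; the function names are hypothetical and the Supplementary Notes may describe a different decoding scheme.

```python
import random


def make_weight_matrix(n_markers, n_scores, seed=0):
    """K rows of M standard-normal weights (row k generates score k)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_markers)]
            for _ in range(n_scores)]


def profile_scores(genotype, weights):
    """Score k is the weighted sum of the raw genotypes with weight row k."""
    return [sum(w * g for w, g in zip(row, genotype)) for row in weights]


def decode_genotypes(scores, weights):
    """Marker-by-marker regression of the scores on the known weights.

    Returns one estimated genotype per marker. The estimate is accurate when
    K is large relative to M and becomes noisy as M/K grows.
    """
    k = len(scores)
    n_markers = len(weights[0])
    mean_s = sum(scores) / k
    estimates = []
    for m in range(n_markers):
        column = [weights[row][m] for row in range(k)]
        mean_w = sum(column) / k
        sxx = sum((w - mean_w) ** 2 for w in column)
        sxy = sum((w - mean_w) * (s - mean_s) for w, s in zip(column, scores))
        estimates.append(sxy / sxx)
    return estimates


# Many scores from few markers (small M/K): genotypes are easy to recover.
toy_genotype = [0, 1, 2, 1, 0]
toy_weights = make_weight_matrix(n_markers=5, n_scores=2000, seed=3)
toy_estimates = decode_genotypes(profile_scores(toy_genotype, toy_weights),
                                 toy_weights)
```

Reversing the proportions, for example 500 markers and 50 scores, makes the per-marker estimates far noisier, which is the behaviour that a privacy-preserving choice of M and K relies on.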

Discussion

In this study, we provide four metrics for monitoring and improving the quality of large-scale GWAMA based on summary statistics. Using the Fst-derived genetic distance measure, we can place all cohorts on an inferred geographic map and can easily identify cohorts that are genetic outliers or that have unexpected ancestry. In application, we note that the Fst measure can identify unusual summary information, such as that detected in the MIGEN cohorts from GIANT Consortium GWAMAs, in which the same allele frequencies were reported for all cohorts. Meta-PCA can also be used to infer the genetic background of cohorts. The high concordance between Fpc and meta-PCA indicates that both methods are robust.

In practice, meta-PCA is much easier to implement when there are many cohorts, but FPC, which has closed-form analytical results, provides a theoretical grounding for meta-PCA. There are limitations to both FPC and meta-PCA. First, FPC depends on the choice of reference cohorts, such as the 1KG reference cohorts, and the projection may be slightly different when other reference cohorts are adopted. As with any PCA, the projection from meta-PCA depends on the context of all cohorts, and the inclusion or exclusion of other cohorts will change the projection slightly. However, we believe this will not influence the inference of the genetic background of cohorts in a meta-analysis. Second, various mechanisms can give an identical projection in PCA. The purpose of both methods is to find discordance between demographic information and genetic information, or outliers, in GWAMA.

Our third metric, λmeta, provides information on sample overlap and heterogeneity between cohorts by utilizing the estimated allelic effect sizes and their standard errors. In most meta-analyses, the overall λmeta is likely to be slightly >1, as observed here, solely due to unknown heterogeneity in generating the phenotype and genotype data that cannot be accounted for by QC. The observed mean of λmeta for the GIANT height GWAMA was 1.013, but with more variation than expected by chance. The strong correlation between λGC and λmeta indicated that the reported sampling variances were systematically driven by analysis protocols, such as single-marker regression and linear mixed model methods. For cohorts with λGC<1 and λmeta<1, it is likely that the GWAS modeling strategy employed in the cohort was too conservative; eg, the MIGEN cohorts might have on average too small a sample size per cohort. Conversely, for cohorts with λGC>1 and λmeta>1, results are too heterogeneous, perhaps reflecting systematically smaller sampling variances of the reported genetic effects. As GWAMA often uses inverse-variance-weighted meta-analysis,21 such cohorts may receive incorrect weights in the meta-analysis, suggesting that the statistical analysis in meta-analyses can be improved by applying better weighting factors.

It is well recognised that overlapping samples may inflate the type-I error rate of GWAMA and therefore lead to false positives. Although post hoc correction of the test statistic is possible,18, 19, 21 stringent QC ruling out overlapping samples makes the whole analysis easier and lowers the risk of false positives. A better solution would be to rule out shared samples at the start, for pairs of cohorts that show deflated λmeta, and we propose PPSR to accomplish this.