Abstract
Genome-wide association studies (GWASs) have been successful in discovering SNP-trait associations for many quantitative traits and common diseases. Typically, the effect sizes of SNP alleles are very small, and this requires large genome-wide association meta-analyses (GWAMAs) to maximize statistical power. A trend towards ever-larger GWAMA is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study, we propose four metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose methods to examine the concordance between demographic information and summary statistics, and methods to investigate sample overlap. (I) We use the population genetics Fst statistic to verify the genetic origin of each cohort and their geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. (II) We conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. (III) We propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. (IV) To quantify unknown sample overlap across all pairs of cohorts, we propose a method that uses randomly generated genetic predictors that does not require the sharing of individual-level genotype data and does not breach individual privacy.
Introduction
To elucidate genetic architecture, which requires maximized statistical power for discovery of risk alleles of small effect, large genome-wide association meta-analyses (GWAMAs) are tending towards ever-larger scale that may contain data from hundreds of cohorts. At the individual cohort level, genome-wide association study (GWAS) analysis is often based on various genotyping chips and conducted with different protocols, such as different software tools and reference populations for imputation, inclusion of study-specific covariates and association analyses using different methods and software. Although solid quality control (QC) analysis pipelines of GWAMA exist,1 these analyses focus on QC for each cohort independently. With ever-increasing sizes of GWAMA, there is a need for additional QC that goes beyond the cohort-by-cohort genotype-level analysis performed to date.
In this study, we propose a new set of QC metrics for GWAMA. All these applications assume that there is a central analysis hub, where summary statistic data from GWAS are uploaded for each cohort. All methods proposed are implemented in freely available software GEAR.
Materials and methods
Overview of materials and methods
Cohort-level summary statistics
The height GWAS summary statistics were provided by the GIANT Consortium and were from 82 cohorts (174 separate files) representing a total of 253 288 individuals, and ~2.5 million autosomal SNPs imputed to the HapMap2 reference.2 Metabochip summary statistics for body mass index (BMI) were from 43 cohorts (120 files), representing a total of 103 047 samples from multiple ethnicities with about 200 000 SNPs genotyped on customised chips.3, 4
1000 Genomes project samples
1000 Genomes Project (1KG) reference samples5 were used as the reference samples for estimating Fst and meta-PC. When assessing the global-level Fst measures, Yoruba (YRI, 108 individuals), Han Chinese in Beijing (CHB, 103 individuals), and Utah Residents with Northern and Western European Ancestry (CEU, 99 individuals) were employed as the reference panels, representing African, East Asian, and European samples, respectively. For calculating within-Europe Fst, CEU, Finnish (FIN, 99 individuals), and Toscani (TSI, 107 individuals) were employed to represent northwest, northeast, and southern Europeans, respectively. For analyses using a whole European panel, CEU, FIN, TSI, GBR (British, 91 individuals), and IBS (Iberian, 107 individuals) were pooled together as an ‘averaged’ European reference.
WTCCC GWAS data
The WTCCC GWAS data have 2934 shared controls for seven diseases with a total of 14 000 cases.6 An individual GWAS was conducted for each disease using PLINK,7 and the resulting summary statistics were used to estimate λmeta.
The four proposed metrics include:
- Fpc: a genome-wide comparison of allele frequency differences across cohorts or against a common reference population.
- Meta-PC: principal component analysis of reported allele frequencies.
- λmeta: a pairwise cohort statistic that uses allele frequency or effect size concordance to detect the proportion of sample overlap or heterogeneity.
- Pseudo profile score regression: an easy to implement analysis to pinpoint each between-cohort overlapping sample that does not require the sharing of individual-level genotype data.
The technical details of these four methods summarized here can be found in the Supplementary Notes. Overview and application of these four metrics in GWAMA can be found in the Text Box.
Results
Population genetic QC analysis using Fst
Because only summary statistics such as allele frequencies are available to the central analysis hub in GWAMA, it is difficult to identify population outliers. Checks for gross differentiation in allele frequencies at specific SNPs between GWAMA cohorts and a reference (such as 1000 Genomes Project, denoted as 1KG)5 are part of standard QC protocols,1 but checking for more differentiation than expected across the entire genome is not usually part of the QC pipeline. We propose that a genetic distance inferred from Fst, which reflects the genetic distance between a pair of populations, is a useful additional QC statistic to detect cohorts that are population outliers. Using the relationship between Fst and principal components,8 our Fst cartographer algorithm can be used to estimate the relative genetic distance between cohorts (Supplementary Notes for Method I; Supplementary Figure S1).
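As an illustration of how a genome-wide Fst can be obtained from reported allele frequencies alone, the following is a minimal sketch. It assumes a Nei-style ratio-of-sums estimator without sample-size correction, which is not necessarily the estimator used in the Supplementary Notes; the function names `nei_fst` and `fst_profile` are hypothetical.

```python
import numpy as np


def nei_fst(p1, p2):
    """Genome-wide Fst between two populations from per-SNP allele frequencies.

    Nei-style ratio of sums: sum(Ht - Hs) / sum(Ht), where Ht is the expected
    heterozygosity of the pooled pair and Hs is the mean within-population
    heterozygosity. No sample-size correction is applied (an assumption of
    this sketch, not a detail taken from the paper).
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean of 2p(1-p) over the two populations.
    h_within = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return float(np.sum(h_total - h_within) / np.sum(h_total))


def fst_profile(cohort_freqs, reference_freqs):
    """Fst of one cohort against each named reference panel (e.g. CEU, CHB, YRI).

    reference_freqs maps a panel name to its allele frequencies, aligned to the
    same SNPs and the same reference allele as cohort_freqs.
    """
    return {name: nei_fst(cohort_freqs, ref) for name, ref in reference_freqs.items()}
```

A cohort whose smallest value in `fst_profile` points to an unexpected reference panel would be a candidate population outlier under this sketch.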
We applied the Fst metric to the GIANT Consortium BMI Metabochip cohorts (55 male-only cohorts, 55 female-only cohorts, and 10 mixed-sex cohorts), which were recruited from multiple ethnicities,3 such as Europeans, African Americans in the Atherosclerosis Risk in Communities Study (ARIC) and cohorts from Jamaica (SPT), Pakistan (PROMISE), Philippines (CLHNS), and Seychelles (SEY). For each Metabochip cohort, we sampled 30 000 independent markers to calculate Fst values with each of three 1KG samples (CEU, CHB, and YRI, respectively). For validation of the method, we also calculated Fst values against the 1KG Japanese (JPT, Japanese in Tokyo, Japan), Indian (GIH, Gujarati Indian in Houston, US), Kenyan (LWK, Luhya in Webuye, Kenya), and European samples (IBS, Iberian populations, Spain; FIN, Finnish, Finland; TSI, Toscani, Italy; and GBR, British in England and Scotland), to see whether the known genetic origins of those cohorts can be recovered.
According to the origins of the samples, each Metabochip cohort showed a different genetic distance spectrum to the three reference populations (Figure 1a). The JPT and Philippine cohorts had very small genetic distances to CHB, as expected, but large to CEU and YRI; however, the Pakistan cohorts showed much closer genetic distances to CEU than to CHB and YRI, indicating their demographic history. The cohorts sampled from Jamaica, Seychelles, Hawaii, and the African American ARIC cohort had small genetic distances to YRI, but large distances to CHB and CEU. For most European cohorts, as expected, the distances to CEU were very small compared with those to CHB and YRI. Given their relative distances to CEU, CHB, and YRI, using our Fst cartographer algorithm (Supplementary Notes for Method I; Supplementary Figure S1), the cohorts were projected into a two-dimensional space, called Fst-derived principal components (FPC) space, constructed by YRI, CHB, and CEU as the reference populations (Figure 1b). The allocation of the cohorts to the FPC space resembles that of eigenvector 1 against eigenvector 2 in principal component analysis (PCA), and is similar to those observed in PCA using individual-level GWAS data for populations of various ethnicities such as in 1KG samples.5 Therefore, our method to place cohorts in geographical regions from GWAS summary statistics works well at a global-population scale.
We next investigated whether our genetic distance method works at a much finer geographic scale. It is known that using individual-level data, PCA can mirror the geographic locations for European samples.9 Here we analyzed the 103 GIANT European-ancestry Metabochip cohorts (48 male-only cohorts, 47 female-only cohorts, and 8 mixed-sex cohorts) for fine-scale Fst genetic distance measures using the CEU, FIN, and TSI reference populations, which represent northwest, northeast, and southern European populations, respectively. For each of the GIANT European-ancestry Metabochip cohorts, Fst was calculated relative to each of these three reference populations and showed concordance with the known origin of the samples (Figure 1c). For example, cohorts from Finland and Estonia were close to FIN but distant to TSI; cohorts from South Europe such as Italy and Greece had small genetic distance to TSI; and cohorts from West Europe had small genetic distance to CEU. Similarly, the projected origin for each European-ancestry Metabochip cohort resembles its geographic location within the European map as expected (Figure 1d). Therefore, Fpc based upon population differentiation also works at a fine scale.
We next applied the Fst genetic distance measures to 174 GIANT height GWAS cohorts (79 male-only cohorts, 76 female-only cohorts, and 19 mixed-sex cohorts; excluding Metabochip data), which were all of European ancestry imputed to the HapMap reference panel.2 Given the three Fst values to CEU, FIN, and TSI (Figure 1e), the geographic origin for each cohort can be inferred as for the GIANT BMI Metabochip data. The projected coordinates of each GWAS cohort match its origin very well (Figure 1f). For example, a Canadian cohort, the Quebec Family Study (QFS), was closely located to DESIR, a French cohort, consistent with the French genetic heritage of the QFS.10 In addition, we also observe complexity due to mixed samples from different countries. For example, the DGI/Botnia study had samples recruited from Sweden and Finland, and its inferred geographic location is between the Swedish cohorts and the Finnish cohorts.11 We also note that for the Myocardial Infarction Genetics Consortium (MIGEN) cohorts, which were recruited from Finland, Sweden, Spain, and the United States, the same allele frequencies were reported for all their sub-cohorts, and all cohorts were allocated to southern Europe (very closely located to the 1KG IBS cohort; Figure 1f and Supplementary Figure S2). As the allele frequencies, used in QC steps to eliminate low-quality loci, were not directly used in estimating genetic effects in the GWAMA, the reported allele frequencies in MIGEN have had little impact on the published GWAMA results.2
Next, we show that Fst can detect populations that have a different demographic past. Using all 1KG European samples as the reference panel (eg, an ‘averaged’ European reference panel), most cohorts in GIANT had Fst<0.005 with this average, which agrees with previously reported results using individual-level data from European nations.9 A few cohorts showed large Fst, such as the AMISH cohort with Fst=0.018, and the North Swedish Population Health Study12 with Fst=0.014. Both populations are known to have been genetically isolated (Supplementary Figure S3).
PCA for allele frequencies (meta-PCA)
Given the same allele frequencies as used for Fst-based analysis above, we conducted PCA for allele frequencies, denoted as meta-PCA (or mPC). In meta-PCA, each cohort was analogously considered as an ‘individual’. For example, 120 Metabochip cohorts were considered as a sample of 120 ‘individuals’. Although the inferred ancestral information was for each cohort rather than any individuals, implementation of meta-PCA was the same as the conventional PCA (Supplementary Notes for Method II). Meta-PCA was tested with 1KG samples. It indicated that meta-PCA could reveal the genetic background for each cohort as precisely as that based on individual-level data (Supplementary Figure S4).
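The idea of treating each cohort as an ‘individual’ can be sketched in a few lines. This is an illustrative sketch only: it assumes each SNP is centered and scaled across cohorts before a singular value decomposition, which may differ from the standardization in the Supplementary Notes, and `meta_pca` is a hypothetical name. Reference panels such as the 1KG populations would simply be added as extra rows.

```python
import numpy as np


def meta_pca(freqs, n_components=2):
    """PCA on a cohorts x SNPs matrix of reported allele frequencies.

    Every row is one cohort (or reference panel); all rows must report the
    frequency of the same allele at each SNP. Returns the cohort coordinates
    on the leading principal components.
    """
    f = np.asarray(freqs, dtype=float)
    x = f - f.mean(axis=0)              # center each SNP across cohorts
    sd = x.std(axis=0)
    keep = sd > 0                       # drop SNPs with no variation across cohorts
    x = x[:, keep] / sd[keep]           # scale each SNP (assumption of this sketch)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]
```

Plotting the first column against the second gives the kind of cohort-level map described in this section.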
We applied meta-PCA to 120 Metabochip cohorts using nearly 34K SNPs common to Metabochip and 1KG, with the inclusion of 10 1KG cohorts (East Asian: CHB and JPT; South Asian: GIH; European: CEU, FIN, GBR, IBS, and TSI; African: LWK and YRI) as the reference cohorts. The inferred ancestral information of each cohort agreed well with demographic information. For example, PROMISE (Pakistan) was located very close to GIH, CLHNS (Philippines) close to CHB and JPT, ARIC (African American) and SPT (Jamaican) close to YRI and LWK, and the European cohorts close to CEU and FIN (Figure 4a).
We also applied meta-PCA to 174 GIANT height GWAS cohorts for nearly 1M SNPs, with the inclusion of 10 1KG reference cohorts. At the global-population level, the 174 cohorts were all allocated close to CEU and FIN, consistent with their reported demographic information (Figure 4b). For fine-scale inference, we conducted meta-PCA again but with the inclusion of the five 1KG European samples. As demonstrated (Figure 4c), the resolution of the inferred relative location between European cohorts reflected their real geographical locations, as previously observed using individual-level data.9 For example, of the four cohorts from Italy, the MICROS cohort was from South Tyrol, northern Italy. MICROS had its meta-PC coordinates much closer to CEU than the other three Italian cohorts, reflecting its geographic location; the InCHIANTI cohort had its coordinates almost identical to TSI; the SardiNIA cohort was located further south than TSI, reflecting its relative geographic and genetic isolation as recently confirmed.13 Similarly, in the sub-plots for Finland and Sweden, the cohorts from the MIGEN consortium, which all had reported allele frequencies of southern European origin, were located near 1KG TSI and IBS.
These results were consistent with what was observed from Fpc as described in the last section, and also agreed well with demographic information. Therefore, based on the reported allele frequencies, the demographic information could be verified by the meta-PCA method.
λmeta to detect pairwise cohort heterogeneity and sample overlap
In this study, we use the summary statistics for a pair of cohorts to calculate λmeta, a metric that examines heterogeneity from the concordance of reported effect sizes and sampling variance. For a SNP marker (i), given its reported estimated effect size (bi) and sampling variance (σi²) in a pair of cohorts 1 and 2, we can calculate a test statistic $T_i = (b_{i,1} - b_{i,2})^2 / (\sigma_{i,1}^2 + \sigma_{i,2}^2)$, the ratio between the squared difference of their reported effects and the sum of their reported sampling variances (the second subscript indexes the cohort). We constructed 30 000 T statistics using markers in linkage equilibrium along the genome for a pair of cohorts. Under the null hypothesis of no overlapping samples/heterogeneity, T follows a χ² distribution with 1 degree of freedom (Supplementary Notes for Method III).
Analogous to λGC, $\lambda_{meta} = \mathrm{median}(T) / 0.455$, the ratio between the median of the 30 000 T values and the median of a χ² statistic with 1 degree of freedom (a value of 0.455), has an expected value of 1 for two independent GWAS summary statistics sets for the same trait. When there is heterogeneity between estimated genetic effects, the expectation is λmeta>1, and in contrast λmeta<1 if there are overlapping samples. In general, not only overlapping samples but also close relatives present in different cohorts can lead to correlated summary statistics generating λmeta<1. However, unless the proportion of overlapping relatives is substantial and their phenotypic correlation is high, the correlation of the summary statistics due to the effective number of overlapping samples (no) is expected to be dominated by the same individuals contributing phenotypic and genetic information to different cohorts (Supplementary Figure S5). Furthermore, if genomic control is applied to adjust the sampling variance, then λmeta will be reduced relative to its value without genomic control for λGC.
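The definition of λmeta given in words above can be sketched directly: T is the squared difference in reported effects over the sum of reported sampling variances, and λmeta is the median of T divided by 0.455. The function names and the toy simulator below are hypothetical, and the simulator (zero true effects, equal standard errors, correlated sampling errors standing in for sample overlap) is an assumption made only to show the expected direction of the statistic.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.455  # median of a chi-square with 1 df, as quoted in the text


def lambda_meta(b1, se1, b2, se2):
    """lambda_meta for one pair of cohorts from per-SNP effects and standard errors."""
    b1, se1, b2, se2 = (np.asarray(v, dtype=float) for v in (b1, se1, b2, se2))
    t = (b1 - b2) ** 2 / (se1 ** 2 + se2 ** 2)
    return float(np.median(t) / CHI2_1DF_MEDIAN)


def simulate_pair(rho, n_snps=30_000, se=0.05, seed=1):
    """Toy null data: both cohorts estimate a zero effect at every SNP.

    rho is the correlation between the two cohorts' sampling errors; rho > 0
    mimics overlapping samples. This is an illustrative simulator only.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n_snps)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n_snps)
    se_vec = np.full(n_snps, se)
    return z1 * se, se_vec, z2 * se, se_vec
```

Under this toy model, `lambda_meta(*simulate_pair(0.0))` should be close to 1 and `lambda_meta(*simulate_pair(0.5))` close to 0.5, matching the direction described for independent and overlapping cohorts.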
GWAS summary statistics for schizophrenia were available in two phases: the first had 9394 controls and 12 462 cases,14 and in the next phase ~18 000 Swedish samples were added.15 Such substantial sample overlap between these two sets of summary statistics led to an estimated λmeta as low as 0.257 (Supplementary Figure S6), consistent with this known overlap. In contrast, heterogeneity between data sets (represented by λmeta>1) was observed between GWAS summary statistics of rheumatoid arthritis from European and Asian studies,16 for which λmeta=1.09 (Supplementary Figure S7). In addition, we note that the distribution of the empirical T-statistics deviates from expectation at the upper tail of the distribution, suggesting differences in effect size or linkage disequilibrium between these two ancestries.
Next, we estimated λmeta from pairs of cohorts from the 174 GIANT height GWAS cohorts.2 We found no evidence for substantial sample overlap but did observe between-cohort heterogeneity and technical artifacts. From the 174 GIANT height GWAS,2 we calculated 15 051 cohort pairwise λmeta values, resulting in a bell-shaped distribution (Figure 2a and b), with a mean of 1.013 and an empirical SD of 0.022, which was greater than the theoretical SD of 0.014. The empirical mean and SD can be used to construct a z-score test for each λmeta. These results are consistent with a small amount of heterogeneity, which is not unexpected given variation in the actual (unknown) genetic architecture and in analysis protocols. However, the mean is close to 1.0, and based upon this QC metric the results are consistent with stringent QC and data cleaning. The minimum λmeta value was ~0.88 (between SORBS men and SORBS women; Figure 2c), with P-value<1e−10 (testing for the difference from 1), and the maximum was 1.245 (between SardiNIA and WGHS; Figure 2d), with P-value<1e−10. These were the most deflated and the most inflated λmeta across the GIANT height study cohorts, and both were significant after correction for multiple testing. Of note, SORBS were analyzed using a method that corrected for relatedness, which potentially led to the deflated λmeta as implied by the theory (Supplementary Notes for Method III). Plotting λmeta (Figure 2a) highlighted that 20 cohorts from the MIGEN consortium showed substantially lower λmeta with many other cohorts (right-bottom triangle in Figure 2a) than the average, consistent with over-conservative models for statistical association analyses being used in these cohorts, which may be due to very small sample sizes (ranging from 36 to 320 for the 20 MIGEN cohorts, with an average sample size of 132). Consistent with this, cohorts from MIGEN also have many of their λGC<1 (Supplementary Figures S8 and S9).
In contrast, the SardiNIA cohort (4303 samples) showed heterogeneity with nearly all other cohorts (Supplementary Figures S8 and S9), perhaps due to unknown artifacts or a slightly different genetic architecture for height as a result of demographic history.17
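The pair count and the z-score test mentioned for the GIANT height cohorts involve only simple arithmetic, sketched below. The function names are hypothetical, and a two-sided normal approximation is assumed. The center and SD are left as explicit arguments because this section mentions both an empirical and a theoretical SD.

```python
from math import comb, erfc, sqrt


def n_cohort_pairs(n_cohorts):
    """Number of distinct cohort pairs, n * (n - 1) / 2."""
    return comb(n_cohorts, 2)


def lambda_meta_z_test(lam, center, sd):
    """z-score and two-sided normal P-value for one lambda_meta value.

    center and sd are supplied by the analyst (for example 1 and the
    theoretical SD, or the empirical mean and SD across all pairs).
    """
    z = (lam - center) / sd
    p = erfc(abs(z) / sqrt(2.0))  # two-sided tail area of a standard normal
    return z, p
```

For example, `n_cohort_pairs(174)` gives the 15 051 pairs reported above; with thousands of pairs tested, a multiple-testing correction of the resulting P-values is needed, as the text notes.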
The statistical power to detect overlapping samples is maximized when a pair of cohorts has equal sample size (Supplementary Figure S10); in other words, the confidence interval under the null hypothesis of no overlapping samples depends on the sample sizes of the pair of cohorts. As a comparison, the estimation of a correlation between the genetic effects for a pair of cohorts has been proposed to quantify overlapping samples,18, 19 but this metric is confounded with genetic architecture, such as the heritability underlying the trait(s) (Table 1; Supplementary Notes IV). When there was heritability, the estimated correlation between genetic effects could be biased and could lead to an incorrect inference about overlapping samples for a pair of cohorts. When there was no heritability, the estimated correlation was correct and agreed well with the one estimated with λmeta. As the existence of heritability is one of the reasons to perform GWAMA, λmeta is preferred when estimating overlapping samples between cohorts.
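A short derivation shows one way to see why equal sample sizes help. It is a sketch under stated assumptions, not the result in the Supplementary Notes: both cohorts estimate the same true effect, each sampling variance is proportional to 1/n for a quantitative trait, T is the squared difference in reported effects over the sum of sampling variances, and the correlation ρ between the two estimates is taken to be n_o/√(n_1 n_2), where n_o is the number of overlapping samples.

```latex
\begin{aligned}
E\big[(b_{i,1}-b_{i,2})^2\big] &= \sigma_{i,1}^2+\sigma_{i,2}^2-2\rho\,\sigma_{i,1}\sigma_{i,2},\\
E[T_i] &= 1-\frac{2\rho\,\sigma_{i,1}\sigma_{i,2}}{\sigma_{i,1}^2+\sigma_{i,2}^2}
        = 1-\frac{2\rho\sqrt{n_1 n_2}}{n_1+n_2}
        = 1-\frac{2\,n_o}{n_1+n_2}.
\end{aligned}
```

For a fixed correlation ρ, the factor √(n_1 n_2)/(n_1+n_2) reaches its maximum of 1/2 when n_1 = n_2, so under these assumptions the deflation of λmeta below 1 is largest for equally sized cohorts.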
Another parameterization of λmeta is to estimate it from differences in allele frequencies between a pair of cohorts instead of differences between estimated effect sizes (Supplementary Notes III; Supplementary Figure S11).
Detection of overlapping samples using pseudo profile score regression
In many circumstances, individual cohorts are not permitted to share individual-level data, either by national law or by local ethical review board conditions. Although the metric λmeta can be transformed to give an estimate of no between cohorts for quantitative traits, it cannot give an estimate of overlapping samples in case–control studies due to the ratio of the cases and controls in each study. To get around this problem, Turchin and Hirschhorn20 created a software tool, Gencrypt, which utilizes a security protocol known as one-way cryptographic hashes to allow overlapping participants to be identified without sharing individual-level data. We propose an alternative approach, pseudo profile score regression (PPSR), which involves sharing of weighted linear combinations of SNP genotypes with the central meta-analysis hub. In essence, multiple random profile scores are generated for each individual in each cohort, using SNP weights supplied by the analysis hub, and the resulting scores are provided back to the analysis hub. PPSR works through three steps (Supplementary Notes for Method IV; Supplementary Figure S12), and the purpose of PPSR is to estimate a relationship-like matrix of ni × nj dimension for a pair of cohorts, which have ni and nj individuals, respectively. Each entry of the matrix is filled with the genetic similarity for a pair of samples, one from each of the two cohorts, estimated via PPSR. The central hub analysts can determine the best set of SNPs that each individual cohort uses to generate pseudo profile scores (PPSs). Without loss of generality, a set of loci directly genotyped in all cohorts would make a good candidate set of SNPs for PPS.
We use WTCCC data as an illustration to detect the 2934 shared controls between any two of the diseases by PPSR. Among 330K non-palindromic loci, we randomly picked M=100, 200, and 500 SNPs to generate pseudo profile scores. The seven cohorts gave 21 cohort-pair comparisons, summing to 488 587 090 individual-pair tests in total. To have an experiment-wise type I error rate=0.01 and type II error rate=0.05 (power=0.95) for detecting overlapping individuals, we needed to generate at least 57 PPSs. We generated score weights S=[s1,s2,s3,…,s57], where each s is a vector of M elements sampled from a standard normal distribution. S is shared across the seven cohorts for generating PPSs for each individual. In total, 57 PPSs were generated for each individual in each cohort. For a pair of cohorts, PPSR was conducted for each possible pair of individuals, one from each cohort, over the generated PPSs. Once the regression coefficient (b) was greater than the threshold, here b=0.95, the pair of individuals was inferred to have highly similar genotypes, implying that the same individual was included in both cohorts (Supplementary Notes for Method IV).
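The procedure described above can be sketched as follows. This is a minimal illustration, not the GEAR implementation: it assumes genotypes are coded as allele counts and standardized with allele frequencies shared by both cohorts before scoring, and that the regression is an ordinary least-squares slope of one individual's scores on the other's. The function names are hypothetical.

```python
import numpy as np


def make_weights(m_snps, k_scores, seed=0):
    """Hub side: draw the M x K weight matrix S from a standard normal."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((m_snps, k_scores))


def profile_scores(genotypes, freqs, weights):
    """Cohort side: K pseudo profile scores per individual.

    genotypes: n x M allele counts (0/1/2); freqs: length-M allele frequencies
    used for standardization (assumed here to be agreed across cohorts).
    """
    g = np.asarray(genotypes, dtype=float)
    p = np.asarray(freqs, dtype=float)
    x = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    return x @ weights  # n x K scores; only these are sent to the hub


def ppsr_slopes(scores_a, scores_b):
    """Hub side: slope of regressing each A individual's scores on each B individual's.

    Returns an n_a x n_b matrix; values near 1 suggest the same individual,
    values near 0 suggest unrelated individuals.
    """
    a = scores_a - scores_a.mean(axis=1, keepdims=True)
    b = scores_b - scores_b.mean(axis=1, keepdims=True)
    cross = a @ b.T                  # n_a x n_b sums of cross-products
    var_b = (b ** 2).sum(axis=1)     # one sum of squares per B individual
    return cross / var_b
```

Pairs with a slope above a chosen threshold (0.95 in the text) would then be flagged as the same individual; only the scores, never the genotypes, leave each cohort in this sketch.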
When using 200 and 500 random SNPs, all the known 2934 shared controls were detected in the 21 cohort-pairwise comparisons; when using 100 random SNPs, on average 2931 shared samples were identified, which is more accurate than using λmeta constructed using either genetic effects or allele frequencies (Figure 3a). In addition, for detected overlapping samples, there were no false positives observed, consistent with simulations that show the method was conservative in controlling the type I error rate (Supplementary Notes for Method IV). For comparison, we also used Gencrypt to detect overlapping samples using the same set of SNPs as used in PPSR. Although Gencrypt guidelines suggest use of at least 20 000 random SNPs,20 selecting 500 random SNPs in the WTCCC cohorts also provided good accuracy with Gencrypt, and on average about 2920 (99.6% of the shared controls) overlapping samples were detected, only slightly lower than PPSR. For example, for BD and CAD, Gencrypt detected 2912 shared controls, but was unable to identify ~20 overlapping controls, due to missing data (on average 1% missing rate).
Furthermore, PPSR is able to detect pairs of relatives. For example, between the BD and CAD cohorts, two pairs of apparent first-degree relatives were detected (Figure 3b). To find additional first-degree relatives between BD and CAD cohorts, at least 265 PPSs were required to have a type I error rate of 0.01 and type II error rate of 0.05 for a regression coefficient cutoff of 0.45, a threshold for first-degree relatives. As expected, all other individuals that did not show high relatedness did not reach the threshold of 0.45 of the PPS regression coefficient for first-degree relatives (Figure 3c). Gencrypt did not detect any first-degree relatives.
PPSR uses very little personal information for each individual, and this can be minimized so that there is a very low probability of decoding it. One way to attempt to decode the genotypes from PPS is to reverse the PPSR, so that the individual genotypes can be predicted in the regression (Supplementary Notes for Method IV). The individual-level genotypic information that can be recovered by an analyst, who knows the S matrix (the weights for generating PPS), is determined by the ratio between the number of markers (M) that generated PPS and the number of PPS (K). Therefore, inferred information on individual genotypes can be minimized and tailored to any specific ethics requirements. We suggest choosing M and K so as to protect privacy while retaining sufficient accuracy (Figure 3d).
Discussion
In this study, we provide four metrics for monitoring and improving the quality of large-scale GWAMA based on summary statistics. Using the Fst-derived genetic distance measure, we can place all cohorts on an inferred geographic map and can easily identify cohorts that are genetic outliers or that have unexpected ancestry. In application, we should note that the Fst measure can identify unusual summary information, such as that detected in the MIGEN cohorts from GIANT Consortium GWAMAs, in which the same allele frequencies were reported for all cohorts. Meta-PCA can also be used to infer the genetic background of cohorts. The high concordance between Fpc and meta-PCA indicates that both methods are robust.
In practice, meta-PCA is much easier to implement when there are many cohorts, but FPC, which has closed-form analytical results, provides a theoretical grounding for meta-PCA. There are limitations for both FPC and meta-PCA. First, FPC depends on the choice of reference cohorts, such as 1KG reference cohorts, and the projection may be slightly different when other reference cohorts are adopted. As with any PCA, the projection from meta-PCA depends on the context of all cohorts, and the inclusion or exclusion of other cohorts will change the projection slightly. However, we believe the impact will not influence the inference of the genetic background of cohorts in a meta-analysis. Second, various mechanisms can give an identical projection in PCA. The purpose of both methods is to find the discordance between demographic information and genetic information, or outliers, in GWAMA.
Our third metric λmeta provides information on sample overlap and heterogeneity between cohorts by utilizing the estimated allelic effect sizes and their standard errors. In most meta-analyses, the overall λmeta is likely to be slightly >1 solely due to unknown heterogeneity, slight as observed here, in generating the phenotype and genotype data that cannot be accounted for by QC. The observed mean of λmeta for the GIANT height GWAMA was 1.03 but with more variation than expected by chance. The strong correlation between λGC and λmeta indicated that the reported sampling variances were systematically driven by analysis protocols, such as single-marker regression and linear mixed model methods. For cohorts with λGC<1 and λmeta<1, it is likely that the modeling strategy employed for GWAS in the cohort was too conservative, eg, MIGEN cohorts might have on average too small a sample size for each cohort. Conversely, for cohorts with λGC>1 and λmeta>1, results are too heterogeneous, perhaps reflecting systematically smaller sampling variances of the reported genetic effects. As GWAMA often uses inverse-variance-weighted meta-analysis,21 such cohorts may lead to incorrect weights for the different cohorts in the meta-analysis, suggesting that the statistical analysis in meta-analyses can be improved by applying better weighting factors.
It is well recognised that overlapping samples may inflate the type-I error rate of GWAMA and therefore lead to false positives. Although post hoc correction of the test statistic is possible,18, 19, 21 stringent QC ruling out overlapping samples makes the whole analysis easier and lowers the risk of false positives. A better solution would be to rule out shared samples at the start, for pairs of cohorts that show deflated λmeta, and we propose PPSR to accomplish this.
References
Winkler TW, Day FR, Croteau-Chonka DC et al: Quality control and conduct of genome-wide association meta-analyses. Nat Protoc 2014; 9: 1192–1212.
Wood AR, Esko T, Yang J et al: Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet 2014; 46: 1173–1186.
Locke AE, Kahali B, Berndt SI et al: Genetic studies of body mass index yield new insights for obesity biology. Nature 2015; 518: 197–206.
Voight BF, Kang HM, Ding J et al: The Metabochip, a custom genotyping array for genetic studies of metabolic, cardiovascular, and anthropometric traits. PLoS Genet 2012; 8: e1002793.
The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature 2012; 491: 56–65.
The Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007; 447: 661–678.
Purcell S, Neale B, Todd-Brown K et al: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007; 81: 559–575.
Patterson N, Price AL, Reich D : Population structure and eigenanalysis. PLoS Genet 2006; 2: e190.
Novembre J, Johnson T, Bryc K et al: Genes mirror geography within Europe. Nature 2008; 456: 98–101.
Chaput J-P, Pérusse L, Després J-P, Tremblay A, Bouchard C : Findings from the Quebec family study on the etiology of obesity: genetics and environmental highlights. Curr Obes Rep 2014; 3: 54–66.
Diabetes Genetics Initiatives: Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 2007; 316: 1331–1336.
Igl W, Johansson A, Gyllensten U : The Northern Swedish Population Health Study (NSPHS)—a paradigmatic study in a rural population combining community health and basic research. Rural Remote Health 2010; 11: 1363.
Danjou F, Zoledziewska M, Sidore C et al: Genome-wide association analyses based on whole-genome sequencing in Sardinia provide insights into regulation of hemoglobin levels. Nat Genet 2015; 47: 1264–1271.
Ripke S, Sanders AR, Kendler KS et al: Genome-wide association study identifies five new schizophrenia loci. Nat Genet 2011; 43: 969–976.
Ripke S, O’Dushlaine C, Chambert K et al: Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nat Genet 2013; 45: 1150–1159.
Okada Y, Wu D, Trynka G et al: Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 2014; 506: 376–381.
Calò C, Melis A, Vona G, Piras I : Sardinian population (Italy): a genetic review. Int J Mod Anthropol 2010; 1: 39–64.
Bolormaa S, Pryce JE, Reverter A et al: A multi-trait, meta-analysis for detecting pleiotropic polymorphisms for stature, fatness and reproduction in beef cattle. PLoS Genet 2014; 10: e1004198.
Zhu X, Feng T, Tayo BO et al: Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension. Am J Hum Genet 2015; 96: 21–36.
Turchin MC, Hirschhorn JN : Gencrypt: one-way cryptographic hashes to detect overlapping individuals across samples. Bioinformatics 2012; 28: 886–888.
Lin D-Y, Sullivan PF : Meta-analysis of genome-wide association studies with overlapping subjects. Am J Hum Genet 2009; 85: 862–872.
Acknowledgements
This work was funded by Australian National Health and Medical Research Council Project and Fellowship grants (1011506, 613601, 613602, 1078901, and 1078037), grants GM 099568 from the National Institutes of Health and the Sylvia & Charles Viertel Charitable Foundation. This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available at www.wtccc.org.uk.
Author contributions
GBC and PMV designed the study. GBC, PMV, and SHL derived the analytical results. GBC performed all analyses. GBC and ZXZ developed the GEAR software. GBC and PMV wrote the first draft of the paper. MRR, MT, JY, and NW discussed results and methods, and provided comments that improved earlier versions of the manuscript. Other authors provided cohort-level summary statistics and contributed to improving the study and manuscript.
Ethics declarations
Competing interests
The authors declare no conflict of interest.
Additional information
Supplementary Information accompanies this paper on European Journal of Human Genetics website
A supplementary video accompanies this article on European Journal of Human Genetics website
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
About this article
Cite this article
Chen, GB., Lee, S., Robinson, M. et al. Across-cohort QC analyses of GWAS summary statistics from complex traits. Eur J Hum Genet 25, 137–146 (2017). https://doi.org/10.1038/ejhg.2016.106
DOI: https://doi.org/10.1038/ejhg.2016.106