Fast set-based association analysis using summary data from GWAS identifies novel gene loci for human complex traits

We propose a method (fastBAT) that performs a fast set-based association analysis for human complex traits using summary-level data from genome-wide association studies (GWAS) and linkage disequilibrium (LD) data from a reference sample with individual-level genotypes. We demonstrate using simulations and analyses of real datasets that fastBAT is more accurate and orders of magnitude faster than the prevailing methods. Using fastBAT, we analyze summary data from the latest meta-analyses of GWAS on 150,064–339,224 individuals for height, body mass index (BMI), and schizophrenia. We identify 6 novel gene loci for height, 2 for BMI, and 3 for schizophrenia at P_fastBAT < 5 × 10^-8. The gain of power is due to multiple small independent association signals at these loci (e.g. the THRB and FOXP1 loci for schizophrenia). The method is general and can be applied to GWAS data for all complex traits and diseases in humans and to such data in other species.

from the fastBAT analysis of the latest GWAS data using the 1KGP-imputed ARIC data as a reference for LD estimation vs. that using the 1KGP-imputed HRS data for (a) height, (b) BMI, and (c) schizophrenia.

Figure 8 Correlation between -log10(P_fastBAT) using WGS data and that using 1KGP-imputed data. Shown are the results from simulations based on UK10K-WGS data. In each simulation replicate, a quantitative trait was simulated using WGS data, and the analysis was performed using both WGS and 1KGP-imputed data (see Supplemental Note for details).

Comparison between fastBAT with sequence and imputed variants
We performed simulations based on WGS data from the UK10K project 7, which comprised 17.6M genetic variants across 3,781 unrelated individuals after quality control 8. Following the strategy proposed in Yang et al. 8, we extracted from the UK10K data the set of SNPs present on an Illumina CoreExome array, and used IMPUTE2 (ref 9) to impute this subset of SNPs to the 1000 Genomes Project (1KGP) reference panels 10. Imputed SNPs with a Hardy-Weinberg equilibrium (HWE) test p-value < 1e-6 or a minor allele count < 3 were removed from the analysis.
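The variant filter described above can be illustrated with a minimal Python sketch. It assumes genotype counts per variant are already available and uses a 1-df chi-square test for HWE; the text does not state which HWE test was used, so the test choice and all function names here are illustrative assumptions rather than the pipeline actually run.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square HWE test from genotype counts (assumed test)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic variant: no deviation to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chisq / 2.0))


def minor_allele_count(n_aa, n_ab, n_bb):
    """Number of copies of the less frequent allele."""
    return min(2 * n_aa + n_ab, 2 * n_bb + n_ab)


def passes_qc(n_aa, n_ab, n_bb, hwe_threshold=1e-6, min_mac=3):
    """Drop a variant if HWE p < hwe_threshold or MAC < min_mac, as in the text."""
    if minor_allele_count(n_aa, n_ab, n_bb) < min_mac:
        return False
    return hwe_chisq_pvalue(n_aa, n_ab, n_bb) >= hwe_threshold
```

For example, a variant with genotype counts (945, 1891, 945) in 3,781 individuals is retained, whereas one with a single minor allele copy, or one with no heterozygotes at all, is removed.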
Following Yang et al. 8, to quantify the variation at sequence variants that can be captured by 1KGP-imputed variants, we simulated traits using WGS data and performed the analysis using 1KGP-imputed data. For ease of computation, we used only data on chromosome 1. We simulated a quantitative trait using the GCTA simulation function (50 causal variants with a total heritability of 10%) under two scenarios: I) causal variants were sampled at random from all the sequence variants; II) causal variants were clustered in a few randomly sampled genomic regions (see Online Methods for how clustered causal variants were simulated). We then performed the fastBAT analysis for the simulated trait using the 1KGP-imputed genotypes and compared the result with that using WGS data.
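Scenario I can be sketched in a few lines of Python. The sketch assumes an additive model in which each causal variant's standardized genotype is multiplied by a normally distributed effect, and the residual variance is set to var(g)(1/h2 - 1) so that the expected heritability equals h2. The toy genotype generator and all function names are hypothetical stand-ins for the WGS data and the GCTA simulation function.

```python
import math
import random
import statistics


def toy_genotypes(n_individuals, n_variants, rng):
    """Independent toy genotypes (0/1/2 allele counts), indexed [variant][individual]."""
    genotypes = []
    for _ in range(n_variants):
        freq = rng.uniform(0.05, 0.5)
        genotypes.append(
            [int(rng.random() < freq) + int(rng.random() < freq)
             for _ in range(n_individuals)]
        )
    return genotypes


def simulate_trait(genotypes, n_causal, h2, rng):
    """Scenario I: causal variants sampled at random; returns (phenotype, genetic, causal)."""
    n = len(genotypes[0])
    # Only variants that vary in the sample can be standardized.
    polymorphic = [j for j, col in enumerate(genotypes) if 0 < sum(col) < 2 * n]
    causal = rng.sample(polymorphic, n_causal)
    genetic = [0.0] * n
    for j in causal:
        col = genotypes[j]
        p = sum(col) / (2 * n)            # sample allele frequency
        sd = math.sqrt(2 * p * (1 - p))   # expected genotype SD under HWE
        effect = rng.gauss(0.0, 1.0)      # assumed effect-size distribution
        for i, x in enumerate(col):
            genetic[i] += effect * (x - 2 * p) / sd
    var_g = statistics.pvariance(genetic)
    sd_e = math.sqrt(var_g * (1.0 / h2 - 1.0))  # residual SD giving heritability h2
    phenotype = [g + rng.gauss(0.0, sd_e) for g in genetic]
    return phenotype, genetic, causal
```

Calling `simulate_trait(genotypes, 50, 0.1, random.Random(1))` mirrors the 50 causal variants and 10% heritability used here. Scenario II would differ only in how the causal indices are chosen.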

Benchmarking Performance
We compared the computational performance of three implementations, PLINK-set, VEGAS (offline version) and GCTA-fastBAT, by re-running the analysis presented in Fig. 1a on identical hardware, recording the execution time and maximum memory usage, and reporting the mean of 10 executions. We ran a gene-based PLINK-set test (10^6 permutations) with the individual-level genotype and phenotype data in the ARIC cohort (chromosome 22). The GCTA-fastBAT and VEGAS (command-line version) analyses were performed using the summary statistics. On average, PLINK-set took ~38 hours to complete the analysis (note that the set-based test implemented in PLINK2 is much faster than PLINK-set but still much slower than fastBAT), VEGAS (default parameters) took 36 minutes, and GCTA-fastBAT (using only a single thread) completed in 8 seconds (see the table below).
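A measurement of this kind (wall-clock time and peak memory, averaged over repeated runs) could be scripted as in the sketch below. It is an illustration only, not the harness used for the reported numbers. It assumes a Unix system and Python 3.9 or later, and the function name `benchmark` is hypothetical.

```python
import os
import statistics
import subprocess
import time


def benchmark(command, repeats=10):
    """Run `command` `repeats` times; return (mean seconds, mean peak RSS)."""
    seconds, peak_rss = [], []
    for _ in range(repeats):
        start = time.perf_counter()
        proc = subprocess.Popen(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # os.wait4 (Unix only) reports resource usage for this one child process.
        _, status, usage = os.wait4(proc.pid, 0)
        seconds.append(time.perf_counter() - start)
        # Record the exit code ourselves because we reaped the child directly.
        proc.returncode = os.waitstatus_to_exitcode(status)
        if proc.returncode != 0:
            raise RuntimeError(f"command failed with exit code {proc.returncode}")
        # ru_maxrss is in kilobytes on Linux and in bytes on macOS.
        peak_rss.append(usage.ru_maxrss)
    return statistics.fmean(seconds), statistics.fmean(peak_rss)
```

Passing the argument list of each tool to `benchmark` with `repeats=10` would give per-tool means comparable in spirit to those reported. Because the measured time includes process start-up, very short runs are dominated by that overhead.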
The LD-pruned fastBAT analysis has slightly higher memory requirements than that without LD pruning, but it is still orders of magnitude faster than PLINK-set and VEGAS (fastBAT with LD pruning: 7.9 sec, 424 MB).

Running fastBAT
A complete manual is available at the GCTA website (URLs). The implementation of fastBAT in GCTA uses a PLINK binary file as the reference set for LD estimation. If no reference for LD is available, the HapMap3 or 1KGP data can be used (URLs). A list of gene coordinates is available from the PLINK website (URLs) and mirrored on the GCTA website (URLs).
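As an illustration of how these inputs fit together, the Python sketch below assembles a fastBAT command line. The flag names (`--bfile`, `--fastBAT`, `--fastBAT-gene-list`, `--out`, `--thread-num`) and the `gcta64` executable name are given as we understand them from the GCTA manual and should be checked against the current version. All file names are hypothetical placeholders.

```python
import shlex


def fastbat_command(bfile, assoc, gene_list, out, threads=1, gcta="gcta64"):
    """Build an argument list for a gene-based fastBAT run (flags assumed from the manual)."""
    return [
        gcta,
        "--bfile", bfile,                  # PLINK binary fileset used as LD reference
        "--fastBAT", assoc,                # GWAS summary statistics
        "--fastBAT-gene-list", gene_list,  # gene coordinates
        "--out", out,                      # output file prefix
        "--thread-num", str(threads),
    ]


# Hypothetical file names, for illustration only.
command = fastbat_command("ref_panel", "trait.assoc.txt", "gene_list.txt", "trait_fastbat")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the reference panel, summary statistics, and gene list are in place.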