The rapid increase in loci discovered in genome-wide association studies has created a need to understand the biological implications of these results. Gene-set analysis provides a means of gaining such understanding, but the statistical properties of gene-set analysis are not well understood, which compromises our ability to interpret its results. In this Analysis article, we provide an extensive statistical evaluation of the core structure that is inherent to all gene-set analyses and we examine current implementations in available tools. We show which factors affect valid and successful detection of gene sets and which provide a solid foundation for performing and interpreting gene-set analysis.
Gene-set analysis of GWAS data can best be understood as an analysis using genes as data points, carrying out a test of the relationship between a gene set and the genetic associations of genes with a phenotype.
Self-contained gene-set analysis does not provide information about the gene set itself, but only about the genes it contains. As such it cannot be used to draw biologically meaningful conclusions and is therefore unsuitable for the research questions gene-set analysis is generally used to address.
Competitive gene-set analysis can be biologically informative but is vulnerable to various forms of confounding and biases as a result of linkage disequilibrium. Of evaluated competitive gene-set analysis methods, only INRICH and MAGMA show consistently good statistical performance.
Statistical power for competitive gene-set analysis is strongly dependent on the heritability of a phenotype. Gene-set effect sizes for more strongly heritable phenotypes must be higher to achieve the same level of power as less strongly heritable phenotypes.
As a result of the structure of competitive gene-set analysis, increasing sample size will only improve its statistical power to a limited extent, especially for more strongly heritable phenotypes.
Gene-set analyses of GWAS data and of gene expression data show the same kind of statistical behaviour. Both are instances of a broader framework of gene-level analysis, which also includes other approaches such as gene-network analysis.
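The distinction between the two null hypotheses in the points above can be made concrete with a minimal sketch that treats genes as data points. It is not the method of MAGMA, INRICH or any other tool: the function names, the simulated Z-scores and the assumption that gene-association scores are independent (that is, no linkage disequilibrium between genes) are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, mean, variance


def self_contained_z(set_scores):
    """Combined Z-score for the genes in the set (Stouffer-style sum).

    Null hypothesis: none of the genes in the set is associated.
    Assumes independent gene Z-scores, which real GWAS data violate.
    """
    return sum(set_scores) / math.sqrt(len(set_scores))


def competitive_z(set_scores, other_scores):
    """Two-sample Z comparing genes in the set with all other genes.

    Null hypothesis: genes in the set are no more strongly associated
    than genes outside it.
    """
    se = math.sqrt(variance(set_scores) / len(set_scores)
                   + variance(other_scores) / len(other_scores))
    return (mean(set_scores) - mean(other_scores)) / se


def upper_tail_p(z):
    """One-sided p-value under a standard normal reference."""
    return 1.0 - NormalDist().cdf(z)


# Hypothetical polygenic phenotype: every gene is shifted upward by the
# same amount, so the gene set is not special relative to other genes.
rng = random.Random(1)
other_genes = [rng.gauss(1.0, 1.0) for _ in range(5000)]
gene_set = [rng.gauss(1.0, 1.0) for _ in range(100)]

p_self = upper_tail_p(self_contained_z(gene_set))
p_comp = upper_tail_p(competitive_z(gene_set, other_genes))
print(f"self-contained p = {p_self:.3g}, competitive p = {p_comp:.3g}")
```

In this simulated setting the self-contained test is expected to reject strongly even though the set carries no signal beyond the polygenic background, whereas the competitive test has no systematic reason to reject.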
This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under awards 076113 and 085475. This work was funded by The Netherlands Organization for Scientific Research (NWO VICI 453-14-005, 645-000-003).
The statistical properties of gene-set analysis
Mean significance rates for different types of method at a sample size of about 100,000.
Mean significance rates for self-contained analysis at different levels of polygenicity.
Gene association scores as a function of gene size.
Accounting for gene density in competitive analysis.
Additional simulation results for self-contained and competitive analysis tools.
Results for competitive analysis tools at different settings.
Type 1 error rates for MAGMA and INRICH at lower significance thresholds for analysis of Reactome gene sets.
Result for MHC simulation for MAGMA and INRICH.
Effect of SNP effect distribution on power.
Effect of gene effect distribution on power.
QQ-plots for genes in associated gene sets.
Comparison of power for different types of competitive analysis method.
Comparison of power between self-contained and competitive analysis at 0% background heritability.
Comparison in power for competitive analysis with MAGMA and INRICH.
- Gene sets
Groups of genes that share a particular property, typically their involvement in a particular biological process.
- Gene-association scores
Measures of the strength of the association, or the evidence for that association, between a gene and a phenotype of interest.
- Self-contained GSA
A type of gene-set analysis (GSA) that tests the null hypothesis that none of the genes in the gene set are associated with the phenotype.
- Linkage disequilibrium
The presence of statistical associations between alleles at different loci.
- Competitive GSA
A type of gene-set analysis (GSA) that tests the null hypothesis that the genes in the gene set are no more strongly associated with the phenotype than other genes.
- Genetic architecture
The pattern of genetic variants underlying a phenotype, including the number of variants, the allele frequencies, the effect sizes and the nature of effects (for example, additive or non-additive).
- Heritability
The proportion of the phenotypic variance that can be attributed to genetic differences among individuals.
- Phenotype permutation
A permutation scheme in which phenotypes are permuted, implying that phenotypes are independent of the genotypes.
- Gene permutation
A permutation scheme in which genes are permuted, implying that gene-association scores are independent of the target gene set. This is equivalent to randomly drawing gene sets of the same size.
- Polygenic phenotypes
Phenotypes influenced by large numbers of genetic variants, each with individually small effects.
- Biological confounding
Confounding that occurs in gene-set analysis (GSA) when the biological process according to which a gene set is defined has no causal role in the phenotype but contains many genes also involved with another biological process that does.
- Methodological confounding
Confounding that occurs in gene-set analysis (GSA) as a result of choices in data collection or computation of the gene-association scores, which creates a gene-set association that biologically does not exist.
- Population stratification
The presence of systematic differences in allele frequencies in subpopulations of the sample, possibly as a result of different ancestry. This can lead to inflation of gene-association scores when correlated to the phenotype, potentially resulting in spurious gene-set associations.
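The gene permutation scheme defined in the glossary can be sketched as repeated random draws of gene sets of the same size. This is a minimal illustration, not any tool's implementation: the function name `gene_permutation_p`, the use of the mean score as the test statistic and the toy scores are assumptions, and the sketch makes no correction for gene size, gene density or linkage disequilibrium.

```python
import random
from statistics import mean


def gene_permutation_p(scores, set_genes, n_perm=2000, seed=0):
    """Empirical competitive p-value by gene permutation.

    scores    -- dict mapping gene name to gene-association score
    set_genes -- names of the genes in the target gene set
    Each permutation draws a random gene set of the same size and
    compares its mean score with the observed mean.
    """
    rng = random.Random(seed)
    all_genes = list(scores)
    k = len(set_genes)
    observed = mean(scores[g] for g in set_genes)
    hits = 0
    for _ in range(n_perm):
        draw = rng.sample(all_genes, k)
        if mean(scores[g] for g in draw) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: 100 hypothetical genes with scores 0..99.
toy_scores = {f"g{i}": float(i) for i in range(100)}
top_set = [f"g{i}" for i in range(95, 100)]
bottom_set = [f"g{i}" for i in range(5)]

p_top = gene_permutation_p(toy_scores, top_set)
p_bottom = gene_permutation_p(toy_scores, bottom_set)
```

Because the reference distribution is built from randomly drawn genes, the scheme tests a competitive null hypothesis; a phenotype permutation scheme would instead break the link between genotypes and phenotypes and so corresponds to a self-contained null.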