We thank the reviewers and the bioRxiv community for insightful comments. This study was supported by the Ontario Institute for Cancer Research (Investigator Award to J.R.).
The authors declare no competing financial interests.
Integrated supplementary information
We analyzed the accumulation of knowledge of gene function during 2009–2016 and its impact on the practical analysis of gene lists. Our analysis involved three major steps (I–III). First, we studied the evolution of the vocabulary of biological processes and pathways from Gene Ontology and the Reactome database (panel I). Second, we studied how gene annotations to these pathways and processes have changed over time (panel II). Third, we evaluated the practical impact of knowledge accumulation by performing pathway enrichment analysis, using current and outdated functional resources, on gene lists derived from recent cancer genomics studies (panel III).
(A) The number of human biological processes and molecular pathways has doubled during 2009-2016. Similar trends are apparent among human cell components and molecular functions. We counted the number of GO terms and Reactome pathways with at least one annotated human gene. (B) The numbers of annotated GO terms have also grown rapidly for model organisms.
(A) Histogram shows the mean length of paths in the Gene Ontology connecting a given term to the root term. A significant increase in the depth of the GO hierarchy between 2009 and 2016 (P < 10⁻⁵, permutation test) indicates that the biological vocabulary is increasingly detailed and that terms are becoming more specific. (B) The average number of parents per GO term has increased over time (2009–2016, 1.73 to 2.09; P < 10⁻⁵). We used a permutation test (n = 100,000) to compute P values for the difference between earlier and recent values. Error bars represent 95% confidence intervals from resampling.
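The permutation test named in this legend can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `permutation_pvalue`, the two-sided difference-of-means statistic, and the add-one correction are all assumptions.

```python
import random

def permutation_pvalue(early, recent, n_perm=100_000, seed=0):
    """Two-sided permutation p-value for a difference in means.

    Hypothetical helper: the legend does not state which test statistic
    the authors used, so the difference in means is an assumption.
    """
    rng = random.Random(seed)
    n_early = len(early)
    n_recent = len(recent)
    observed = abs(sum(recent) / n_recent - sum(early) / n_early)

    pooled = list(early) + list(recent)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_early]) / n_early
        mean_b = sum(pooled[n_early:]) / n_recent
        if abs(mean_b - mean_a) >= observed:
            hits += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

With n = 100,000 permutations and no permuted difference as extreme as the observed one, this estimator returns 1/100,001, which is just below 10⁻⁵.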
Violin plots compare pathway size and gene annotation frequency in 2009 and 2016. In the top panels, the median pathway size (total number of genes in a pathway) is shown for every gene on a log2 scale. In the bottom panels, the number of pathways annotated to each gene is shown on a log2 scale. P values were computed using permutation tests (n = 100,000). Genes without annotations were excluded. GO biological processes (left) and Reactome pathways (right) are shown separately. The median value of each plot is shown in boldface.
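The two per-gene quantities plotted here (median pathway size and number of annotated pathways) could be derived along these lines. The input layout, a dictionary from pathway identifier to gene set, and the name `per_gene_stats` are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

def per_gene_stats(pathways):
    """Return {gene: (median pathway size, number of pathways)}.

    `pathways` maps a pathway or GO term id to the set of genes annotated
    to it (hypothetical input format). Genes with no annotations never
    appear in the result, mirroring their exclusion in the legend.
    """
    gene_to_pathways = defaultdict(list)
    for pathway_id, genes in pathways.items():
        for gene in genes:
            gene_to_pathways[gene].append(pathway_id)

    stats = {}
    for gene, ids in gene_to_pathways.items():
        sizes = [len(pathways[pid]) for pid in ids]
        stats[gene] = (median(sizes), len(ids))
    return stats
```

Taking log2 of both values per gene would give the scales described for the violin plots.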
Violin plots show the comparison of pathway size and gene annotation frequency in 2009 and 2016 for human and several model organisms. P-values were computed using permutation tests (n = 100,000). Genes without annotations are excluded.
Supplementary Figure 6 Contemporary gene annotations include a prominent class of small and specific pathways from the manually curated Reactome resource.
Two-dimensional density plots of median pathway size per gene and numbers of pathway annotations per gene (Fig. 1b) reveal a bimodal distribution of pathways in current annotations from 2016. The group of pathways in the bottom left quadrant of the left panel of Figure 1b primarily represents gene annotations of the Reactome resource (98%). The corresponding genes have relatively few annotations to pathways (below median value) and the pathways themselves contain relatively few genes (also below median value). The group of Reactome pathways is not apparent among annotations of 2009.
In 2009, one of eight high-confidence protein-coding genes (12.4%) from the CCDS database had no annotations in Gene Ontology or Reactome while this “dark matter” has decreased to 4.9% in 2016. Dark-matter genes included those with no annotations and also genes that only had root-level annotations in GO. We used the closest earlier release of the CCDS database to count annotated genes (e.g., the 2015 release of CCDS for 2016 annotations of GO, as CCDS of 2016 had not been released at the time of the analysis).
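The "dark matter" definition above (no annotations, or root-level GO annotations only) can be expressed as a short counting routine. This is a sketch under assumptions: the function name and input layout are hypothetical, and the three GO root identifiers are supplied by this example rather than by the legend.

```python
# Root terms of the three GO namespaces (biological process,
# molecular function, cellular component).
GO_ROOTS = frozenset({"GO:0008150", "GO:0003674", "GO:0005575"})

def dark_matter_fraction(gene_set, annotations, root_terms=GO_ROOTS):
    """Fraction of genes with no informative annotation.

    `gene_set` is the reference gene list (e.g. CCDS genes) and
    `annotations` maps gene -> set of term ids; both are hypothetical
    input formats. A gene counts as dark matter when it has no terms
    left after root-level terms are discarded.
    """
    dark = 0
    for gene in gene_set:
        terms = set(annotations.get(gene, ()))
        if not (terms - root_terms):
            dark += 1
    return dark / len(gene_set)
```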
Analysis of human gene lists from current datasets will cause mismatches of gene symbols as standard nomenclature has been updated over the years. We compared the HGNC symbols in the latest CCDS database (2015) to earlier database versions and counted the number of unmatched symbols.
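Counting unmatched symbols between two database versions reduces to a set difference. The sketch below assumes each release is available as a plain collection of symbols; the name `unmatched_symbols` is hypothetical.

```python
def unmatched_symbols(current_symbols, earlier_symbols):
    """Count current symbols that are absent from an earlier release.

    Returns (number unmatched, fraction of current symbols unmatched).
    Matching is exact and case-sensitive, which is an assumption.
    """
    current = set(current_symbols)
    earlier = set(earlier_symbols)
    missing = current - earlier
    return len(missing), len(missing) / len(current)
```

For example, `unmatched_symbols(["GENE1", "GENE2", "GENE3", "GENE4"], ["GENE1", "GENE2", "OLDNAME3"])` reports two unmatched symbols, half of the current list (placeholder symbols, not real data).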
We investigated the number of annotations for genes whose symbols differed between 2010 and 2016 and found that genes with changed symbols had significantly fewer annotations in 2010 than consistently named genes (average number of annotations per gene 3.1 vs. 9.3; permutation test, n = 100,000, P < 10⁻⁵). Error bars represent 95% confidence intervals from resampling.
Supplementary Figure 10 Pathway enrichment analysis of essential genes of breast cancer confirms loss of information in outdated annotations.
(A) We analyzed the top 500 essential genes from each of 77 cancer cell lines derived from recent shRNA screens. We studied (i) annotations from 2010 and (ii) annotations from 2016, and quantified enrichments using Fisher’s exact test and multiple testing correction (FDR P < 0.05). We then compared the resulting enriched terms from both analyses. We found a three-fold increase in detected pathways and processes when data were analyzed with current annotations from 2016 (695 pathways and processes per median cell line) compared to outdated annotations from 2010 (191 per median cell line; 74% missed when accounting for terms appearing only in 2010 annotations). GO biological processes and Reactome pathways were analyzed, and their counts are aggregated in the plot. (B) We repeated the pathway enrichment analysis of breast cancer essential genes using the top 100 essential genes of the same dataset and found a similar difference between outdated and current pathway annotations (143 vs. 455 pathways; 71% missed in earlier annotations).
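The enrichment procedure named in this legend (Fisher's exact test followed by FDR correction) can be sketched with the standard library alone. Several choices here are assumptions, not details from the source: the one-sided (over-representation) form of the test, the Benjamini-Hochberg procedure for FDR, the explicit background gene set, and all function names.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test as a hypergeometric upper tail.

    P(X >= k) when drawing n genes from N background genes, of which K
    belong to the pathway.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def enriched_pathways(gene_list, pathways, background, alpha=0.05):
    """Ids of pathways enriched in gene_list at FDR < alpha."""
    genes = set(gene_list) & background
    ids = list(pathways)
    pvals = []
    for pid in ids:
        members = pathways[pid] & background
        overlap = len(genes & members)
        pvals.append(
            fisher_enrichment_p(overlap, len(genes), len(members), len(background))
        )
    qvals = benjamini_hochberg(pvals)
    return {pid for pid, q in zip(ids, qvals) if q < alpha}
```

Running `enriched_pathways` once with a 2010 annotation snapshot and once with a 2016 snapshot, on the same gene list, yields the two term sets that the legend compares.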
Supplementary Figure 11 Evolution of pathway information affects recently updated and out-of-date software tools.
We analyzed significantly mutated driver genes of glioblastoma using gene annotations of 2009-2016 and compared the results of 2016-era analysis with results of each earlier year. Colors indicate the fraction of commonly detected (yellow), 2016-only (purple) and outdated-only (dark blue) pathways from the Reactome resource with statistically significant enrichment (FDR p<0.05).
Supplementary Figure 12 Most missed GO terms in 2010-era analysis involve known pathways and processes that do not associate significantly with input genes.
We compared results of pathway enrichment analyses that used annotations from 2010 and 2016. The majority of pathways missed with the outdated annotations (∼75%) exist in the 2010 edition of Gene Ontology; however, they are not significantly associated with the input genes. The remaining 25% represent processes added to GO after 2010.
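The split of missed terms described here, into terms that already existed in the older vocabulary and terms added later, amounts to two set operations. The sketch assumes the enriched term sets and the older vocabulary are available as collections of term ids; `classify_missed` is a hypothetical name.

```python
def classify_missed(enriched_current, enriched_outdated, vocabulary_outdated):
    """Split terms found only with current annotations into two groups.

    Returns (present_not_significant, absent): terms that existed in the
    outdated vocabulary but were not significantly enriched there, and
    terms that did not exist in the outdated vocabulary at all.
    """
    missed = set(enriched_current) - set(enriched_outdated)
    old_vocab = set(vocabulary_outdated)
    present_not_significant = missed & old_vocab
    absent = missed - old_vocab
    return present_not_significant, absent
```

Dividing the size of each group by the total number of missed terms gives proportions of the kind reported in the legend.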
Supplementary Figure 13 Most pathway enrichments from outdated annotations from 2010 are based on low-quality information.
We repeated the pathway analysis of frequently mutated glioblastoma genes by only analyzing high-quality gene annotations from 2010 and 2016 (IEA annotations from GO were excluded). We found that 96.5% of results from 2016 analysis were missed when 2010 annotations were used, showing that earlier annotations are largely based on low-confidence information.
Supplementary Figure 14 Some enriched pathways and processes are missed in relatively recent gene annotations.
We compared results of pathway enrichment analyses that used annotations from 2015 and 2016. We focused on the terms that were found in the earlier analysis and missed in the most current annotations, including 89/743 (12%) GO terms and 29/116 (25%) Reactome pathways. The majority of missing pathways were part of the pathway database or GO in the up-to-date analysis although not detected at statistically significant levels (blue tones), while a smaller fraction of terms were entirely missing from the analysis, likely because of restructuring of pathways and processes.
About this article
Cite this article
Wadi, L., Meyer, M., Weiser, J. et al. Impact of outdated gene annotations on pathway enrichment analysis. Nat Methods 13, 705–706 (2016). https://doi.org/10.1038/nmeth.3963