Gene annotation bias impedes biomedical research

We found tremendous inequality across gene and protein annotation resources. We observed that this bias leads biomedical researchers to focus on richly annotated genes instead of those with the strongest molecular data. We advocate that researchers reduce these biases by pursuing data-driven hypotheses.

We examined other gene annotation databases to ensure that the observed phenomena was not specific to the GO. Other pathway databases, including Reactome (Gini = 0.33) 3 and the CTD Pathway (Gini = 0.47) 4 , have a similarly high level of inequality (Fig. 1A). Indeed, every gene annotation resource we examined displayed a similarly high level of annotation inequality, including CTD chemical-gene associations (Gini = 0.63) 4 , PDB 3D protein structures (gini = 0.68) 6 , DrugBank drug-gene associations (Gini = 0.70) 5 , GeneRIF gene publication annotations (Gini = 0.79), and Pubpular disease-gene publication associations (Gini = 0.82) 7,19 . When considering the number of annotations pooled across all these databases, global gene annotation Gini coefficient was 0.63.
Annotation inequality bias affects biomedical research. Next, we explored whether disease research may be affected by the inequality in gene annotation databases. Concerns that most published findings are false 20 , many results are inflated 21 , and research funding is being wasted 22,23 have led to a number of proposals for reproducible and clinically relevant findings [24][25][26] . We have previously described a multi-cohort analysis framework [27][28][29] that leverages biological and technical heterogeneity across multiple independent datasets to identify robust disease signatures. Using this framework, we have repeatedly demonstrated that it can identify robust disease signatures across a broad spectrum of diseases including organ transplant 27 , infections [30][31][32][33] , autoimmune disease 34 , cancer [35][36][37] , vaccination 38 , and neurodegenerative diseases 39 for identifying diagnostic and prognostic markers, novel drug targets, and repurposing FDA-approved drugs.
In our manually curated meta-analyses of 104 distinct human conditions, we have integrated transcriptome data from over 41,000 patients and 619 studies to calculate an effect size for disease-gene associations 28 . Our analyses included diverse classes of human conditions such as cancer, autoimmune disease, viral infection, neurodegenerative and psychiatric disorders, pregnancy, and obesity. For these conditions, we extracted all disease gene associations with at least ten publications 7,19 . Published disease-gene associations exhibited no significant correlation with differential gene expression false discovery rate (FDR) rank (Spearman's correlation = −0.003, p = 0.836, Fig. 2A) Overall, only 19.5% of published disease-gene associations were identified in gene expression analyses at a FDR of 5% that is consistent with previous publications that have successfully replicated between 11-25% of research studies 40,41 ( Figure S2A).
To observe whether this phenomenon was specific to gene expression, we extracted genome wide significant single nucleotide polymorphisms (SNPs) from the GWAS catalog 42 . We observed a non-significant correlation between the number of publications and SNP p-values, indicating a lack of concordance between genetic mutations and disease-gene publications (Spearman's correlation = 0.017, p = 0.836, Figure S2B).
Based on these results, we hypothesized that the lack of correlation with molecular evidence may be an artifact of research bias towards well-characterized genes. Therefore, we examined correspondence between publications about a disease-gene pair and existing knowledge about that gene as indicated by the number of GO annotations. Indeed, the number of GO annotations for a gene of interest was significantly correlated with the published disease-gene associations (Spearman's correlation = 0.110, p = 2.1e-16, Fig. 2B), but not with gene expression effect size FDR rank in disease (Spearman's correlation = −0.023, p = 0.080, Figure S2C Many of the highly published disease-gene associations may have been studied for reasons that would not be directly reflected in gene expression analysis, including BRCA1 in breast cancer and CD4 in human immunodeficiency virus. The more troubling bias occurs when associations with strong molecular evidence have no publication record. Disease-gene associations we have reported in our published meta-analyses were typically novel findings with few Gene Ontology annotations, despite having extremely low false discovery rates and high

Discussion
Collectively, our results provide an evidence of a strong research bias in literature that focuses on well-annotated genes instead of those with the most significant disease relationship in terms of both expression and genetic variation. We show that the inequality follows a "rich-getting-richer" pattern, where annotation growth is biased towards genes that were richly annotated in the initial versions of GO. We believe this stems from the typical experimental design. To illustrate this, consider an omics experiment that generates a list of hundreds or thousands of interesting genes. To interpret these genes, researchers use GO and pathway analysis tools. The researchers then generate targeted hypotheses for validation by interpreting the list of significant GO terms, focusing the genes or proteins annotated with that GO term. The researchers learn more about those targeted genes, leading to additional GO annotations for the already annotated genes. In this process, the list of unannotated genes is simply ignored because pathway analysis tools cannot map them to any GO terms. Hence, the self-perpetuating cycle of inequality continues.
While focusing research on the best characterized genes may be natural because it is easy to formulate a mechanistic hypothesis of the gene's function in disease, we propose that the researchers in the era of omics should instead allow data to drive their hypotheses. We have repeatedly shown that expanding research outside of the streetlight of well characterized genes identifies novel disease-gene relationships [35][36][37] , identifies FDA-approved drugs that can be repurposed for other diseases 27 , and identifies clinically translatable diagnostic and prognostic disease signatures 27,[30][31][32][33][34]39 . For example, we have previously identified PTK7 as causally involved in non-small cell lung cancer 37 . At the time of publication, PTK7 was labelled as an orphan tyrosine kinase receptor. In a very short span, this finding was transformed into an antibody-drug conjugate targeting PTK7 that induced sustained tumor regression, outperformed standard-of-care chemotherapy, and reduced frequency of tumor-initiating cells in a preclinical study 45 . A Phase 1 clinical trial (NCT02222922) of PTK7 antibody-drug conjugate, PF-06647020, has already completed with acceptable and manageable safety profile, and is now being considered for further clinical development. To enable researchers to pursue data-driven hypotheses, we have made our rigorously validated gene expression multicohort analysis data publicly available (http://metasignature.stanford.edu) where it may be explored based on either diseases or genes of interest 29,46 . Focusing on genes with the strongest molecular evidence instead of the most annotations would enable researchers to break the self-perpetuating annotation inequality cycle that results in research bias.  where n is the number of genes and x i is the number of annotations for a gene i 17 . We included all human genes with at least one annotation in the Gini calculations.

Inequality metrics calculations.
List of human gene names. We used the Entrez Gene list downloaded in February 2017 of 20,698 current, protein-coding, human genes as our source of human genes.
Gene Ontology Annotations data. We calculated the number of annotations for each human gene in the Gene Ontology 2 . in every version of GO annotations since 2001 that was available at http://http.ebi.ac.uk/pub/ databases/GO/goa/old. Duplicate annotations that only differ in evidence codes were counted once. We examined the Gini coefficient for the different classes of evidence codes (experimental, computational analysis, author statement, curatorial statement, and automatically assigned) and namespaces (cellular component, biological process, molecular function). We found no substantial differences in the Gini coefficient values and trends regardless of the terms being considered ( Figure S1D). To focus on terms with the strongest evidence, the remainder of our manuscript excluded the evidence codes IEA, ND, and NR 8 . To focus on terms related to functional understanding of genes, we only considered the biological process and molecular function GO namespaces.

Confidence intervals. Using bootstrap resampling, we calculated 95% confidence intervals around our
Gini coefficients based on 1000 permutations of each version of the human Gene Ontology annotation data [ Figure S1B].

Modeling Gini coefficient over time.
We used the first available version of the human GO annotations (http://http.ebi.ac.uk/pub/databases/GO/goa/old/HUMAN/gene_association.goa_human.1.gz) as our baseline measurement in all models. We modeled every release of GO under different growth models, distributing the number of new annotations from that release across genes according to the model. We define our update step as: . Models inequality growth consistent with exponential initial probability.

Cubic initial weight.
. Models inequality growth consistent with cubic initial probability.
. Models inequality growth consistent with squared probability from previous round.
. Models inequality growth consistent with squared initial probability. Initial weight. Other gene annotation database Gini coefficient calculation. Pubpublar. We manually downloaded gene-publication data in August 2016 from Pubpular for 102 of the diseases in our gene expression database 7,19 . "Pubpular Total" refers to the inequality of gene-publication data across all diseases. "Pubpular Median" refers to the median inequality of gene-publication for each disease. Reactome. We downloaded Reactome pathway data from the complete database release 59 3 . We downloaded data in MySQL format and parsed pathways into UniProt identifiers using custom scripts. We converted UniProt SCIenTIfIC REPORTs | (2018) 8:1362 | DOI:10.1038/s41598-018-19333-x identifiers to gene names using the UniProt identifier conversion tool 48 . We calculated the number of pathways including each gene name.
CTD. We downloaded the CTD 4 data in February 2017, with the chemical-gene associations and the gene-pathway associations. We calculated the number of chemical-gene and gene-pathway associations for each gene name.
GeneRIFs. We downloaded GeneRIFs from the NCBI in February 2017. We included all human GeneRIFS (Tax ID: 9606). We calculated the number of GeneRIFs for each gene.
Protein Data Bank. We downloaded the gene names associated with protein structures from the Protein Data Bank 6 in February 2017 and calculated the number of structures per gene name.
DrugBank. We downloaded the DrugBank 5 database version 5.0.5 and identified all drugs with known activities on human genes. We calculated the number of drugs targeting each gene.
Gene expression data collection and multicohort analysis. Gene expression multicohort analysis data was compiled from the MetaSignature database 28 . MetaSignature includes data from manual multicohort analysis of over 41,000 samples, 619 studies, and 104 diseases. Briefly, relevant data were downloaded from Gene Expression Omnibus and ArrayExpress 49,50 . Cases and controls were manually labeled for each disease and multicohort analysis was performed using the MetaIntegrator package 28 . We used the Hedges' g summary effect size, standard error, and false discovery rate which the MetaIntegrator package calculates for every gene.
Data collection for disease-gene publications and SNP data. We downloaded the number of publications for each disease-gene relationship from PubPular and HuGE Navigator in August 2016 for as many of the 104 disease in MetaSignature as were present in the databases (102 in PubPular and 81 in HuGE) 7,19,43 . PubPular gave the top 261 gene associations, and HuGE gave all known associations. For all correlations, we only considered disease-gene associations with at least 10 publications to limit false positive associations.
We downloaded disease-SNP relationships, including gene mappings, odds ratios, and p-values, from the GWAS Catalog and HuGE Navigator for 61 and 54, respectively, of the 103 diseases in MetaSignature 42,44 . From Gene Ontology, we calculated the counts of non-Inferred from Electronic Annotation annotations for all the genes in the MetaSignature database 2 . The Spearman rank correlation was used for all correlations.
Our plots show the top 10,000 gene associations for each disease by effect size FDR rank. Correlation calculations do not include a similar limit.