ISOGO: Functional annotation of protein-coding splice variants

The advent of RNA-seq technologies has switched the paradigm of genetic analysis from a genome to a transcriptome-based perspective. Alternative splicing generates functional diversity in genes, but the precise functions of many individual isoforms are yet to be elucidated. Gene Ontology was developed to annotate gene products according to their biological processes, molecular functions and cellular components. Despite a single gene may have several gene products, most annotations are not isoform-specific and do not distinguish the functions of the different proteins originated from a single gene. Several approaches have tried to automatically annotate ontologies at the isoform level, but this has shown to be a daunting task. We have developed ISOGO (ISOform + GO function imputation), a novel algorithm to predict the function of coding isoforms based on their protein domains and their correlation of expression along 11,373 cancer patients. Combining these two sources of information outperforms previous approaches: it provides an area under precision-recall curve (AUPRC) five times larger than previous attempts and the median AUROC of assigned functions to genes is 0.82. We tested ISOGO predictions on some genes with isoform-specific functions (BRCA1, MADD,VAMP7 and ITSN1) and they were coherent with the literature. Besides, we examined whether the main isoform of each gene -as predicted by APPRIS- was the most likely to have the annotated gene functions and it occurs in 99.4% of the genes. We also evaluated the predictions for isoform-specific functions provided by the CAFA3 challenge and results were also convincing. To make these results available to the scientific community, we have deployed a web application to consult ISOGO predictions (https://biotecnun.unav.es/app/isogo). Initial data, website link, isoform-specific GO function predictions and R code is available at https://gitlab.com/icassol/isogo.

: Input: the isoforms-domains annotation (TxD), the isoforms-genes annotation (GxT), the genes-function annotation (GenesGO) and the expression at isoform level of TCGA cell lines (TxS). Domain-based regression: From TxD and GxT we get the genes-domain annotation (GxD). Then an elastic-net regularized logistic regression model is built using the train set achieving the logit2 matrix and the domain train model. Correlation method: gene expression is calculated from TxS and GxT. We compute a spearman correlation for each gene pair of the train set getting the matrix P. Then, in order to test if a gene "i" has a function "j" we applied a Wilcoxon test with the corresponding row of the matrix P comparing genes annotated to function "j" with the remaining genes achieving the z-scores matrix. Combination method: Previous scores -Logits2 and z-scores matrices-are combined by a logistic regression achieving the final combination train model for train set. Figure S2: Input: the isoforms-domains annotation (TxD), the isoforms-genes annotation (GxT), the genes-function annotation (GenesGO) and the expression at isoform level of TCGA cell lines (TxS). Domain-based regression: From TxD and GxT we get the genes-domain annotation (GxD). Then we applied the Domain train model to the test set achieving the matrix !"#$% & ' . Correlation method: gene expression is calculated from TxS and GxT. Then, we calculate the spearman correlation between each pair of the test and train set achieving the matrix P (size 2,000 x 17,637). In order to test if a gene "I" of the test set has a function "j" we applied a Wilcoxon test with the corresponding row of the matrix P comparing genes annotated to function "j" with the remaining genes achieving the z-scores matrix (size 2,000 x 5,777). Combination method: We applied the combination train model for train set to previous scores -!"#$% & ' and Z matrices obtaining the final matrix !"#$% & ( . Figure S3: Input: the isoforms-domains annotation (TxD), the isoforms-genes annotation (GxT), the genes-function annotation (GenesGO) and the expression at isoform level of TCGA cell lines (TxS). Domain-based regression: From TxD and GxT we get the genes-domain annotation (GxD). Then an elastic-net regularized logistic regression model is built using the complete set achieving the logit2 matrix and the domain complete model. Correlation method: gene expression is calculated from TxS and GxT. Then, we computed the Spearman correlation coefficient of each gene pair resulting on a matrix of correlations P (size 19,637 x 19,637). In order to test if a gene "i" have a particular function "j", we computed a Wilcoxon test with corresponding row of the P comparing the genes annotated to the "j" GO term with genes non-annotated to it -excluding the gene itself-achieving the z-scores matrix (size 2,000 x 5,777). Combination method: Previous scores -Logits2 and z-socres matrices-are combined by a logistic regression achieving the final combination complete model.  864 x 19,637). In order to test if a isoform "i" have a particular function "j", we computed a Wilcoxon test with corresponding row of the P comparing the genes annotated to the "j" GO term with genes non-annotated to it achieving the z-scores matrix (size 79,864 x 5,777). Combination method: We applied the combination complete model to previous scores -!"#$% & ' and Z matrices-obtaining the final matrix !"#$% & ) -The ISOGO matrix-.

APPRIS:
Figure S10: APPRIS [2] indirect transcriptome-wide validation. For each GO term: 1): we selected its annotated genes. In 1) annotated genes are shaded in blue and non-annotated in red ; 2): we selected for these genes its corresponding isoforms; 3):by using BiomaRt R package [3] we distinguished the APPRIS isoforms and the non-APPRIS isoforms. In 2) the splice variants that belong to the selected genes are marked in blue and the major isoforms are distinguished with red diamonds; 4): we tested, among all the isoforms of the selected genes, whether the APPRIS isoforms were the most likely to perform the function. Figure S11: CAFA3 indirect transcriptome-wide validation. For each GO term: 1): we selected its annotated genes. In 1) annotated genes are shaded in blue and non-annotated in red; 2): we selected for these genes its corresponding isoforms; 3):by using BiomaRt R package [3] we related the protein included in the CAFA3 challenge with isoforms, achieving a dataset were protein, function and isoform are merged i.e. specific function are related with specific isoforms; 4): we tested, among all the isoforms of the genes, whether the 'CAFA3' isoforms were the most likely to perform the function.

Coherence technique
In some predictions, the probability assigned to an ancestor is not larger than the ones assigned to any of its descendants. Several works tried to explicitly consider the relationship between functional classes [4][5] [6]. These works propose different methods or heuristics to ensure the coherence [1] [7][8] [9] of the ontology, so that, the predicted probability of any given GO term is equal or smaller than that of its parents.
Our method to ensure coherence uses the variance of the estimate to decide how to update the logits. If variance is low, then the algorithm is confident of the estimated likelihood. Conversely, if variance is high, then the predicted estimate is not as reliable. This intuition can be expressed mathematically in terms of maximizing likelihood. As logistic regression logits can be considered to follow -asymptotically-a normal distribution, the proposed minimization problem is: Subject to " # ≥ " % for all " % that are children of " # in GO hierarchy.
In this formula, n is the number of GO terms. a is a nx1 vector, whose entries are the logits after coherence (i.e. the unknown variables of the optimization problem). b is a nx1 vector whose entries are the estimated logits before applying coherence (i.e. the output of the combination method). σ is a nx1 vector with the standard deviation of the b estimates. The restriction ensures that the logits of the parents must be larger or equal to the logits of its children.
The system of equations contains 1,428 restrictions and 748 functions, for Cellular Component; 13,633 restrictions and 6,074 functions, for Biological Process and 1,474 restrictions and 1,107 functions for Molecular Function. In turn, these three optimizations must be solved for each gene or isoform. This is a convex quadratic programming problem that can be solved by using Rcplex [10], an R package that interface to Cplex [11].

Isoforms of the same gene with opposite functions
ITSN1 is a gene with the annotations to two opposite functions, namely, GO: 0043065 ("positive regulation of apoptosis") and GO:0043066 ("negative regulation of apoptosis"). ITSN1 short isoform has shown to be a negative regulator of apoptosis whereas its long isoform has the opposite function [12,13]. ISOGO predicted this phenomenon assigning an increased likelihood of positive regulation of apoptosis to the long isoform. Similarly, an increased likelihood of negative regulation of apoptosis was assigned to the short isoform ( Figure S12).