A Network of Splice Isoforms for the Mouse

The laboratory mouse is the primary mammalian species used for studying alternative splicing events. Recent studies have generated computational models to predict functions for splice isoforms in the mouse. However, the functional relationship network, describing the probability of splice isoforms participating in the same biological process or pathway, has not yet been studied in the mouse. Here we describe a rich genome-wide resource of mouse networks at the isoform level, which was generated using a unique framework that was originally developed to infer isoform functions. This network was built through integrating heterogeneous genomic and protein data, including RNA-seq, exon array, protein docking and pseudo-amino acid composition. Through simulation and cross-validation studies, we demonstrated the accuracy of the algorithm in predicting isoform-level functional relationships. We showed that this network enables the users to reveal functional differences of the isoforms of the same gene, as illustrated by literature evidence with Anxa6 (annexin a6) as an example. We expect this work will become a useful resource for the mouse genetics community to understand gene functions. The network is publicly available at: http://guanlab.ccmb.med.umich.edu/isoformnetwork.

and experimental technologies has provided multiple types of genomic data sources at the isoform level, including RNA-seq [17][18][19][20] . We also included the computationally predicted isoform-isoform docking scores 21 previously developed by co-authors of this work, which achieved the top performance in the CASP (Critical Assessment of protein Structure Prediction) benchmark study. The availability of these data provided a solution to the first obstacle.
The second challenge facing isoform-level network modeling remains: we do not have a large set of functionally related isoform pairs to serve as the gold standard for evaluating and integrating large-scale genomic data. Of interest, recent studies have analyzed protein-protein interactions at the isoform level 22,23 , providing high-resolution protein interaction data. But the resulting data size is not sufficient to be used as a gold standard to build a genome-wide isoform network. Both biological functions 24 and pathways [25][26][27][28][29] are conventionally documented at the gene level rather than at the isoform level, preventing any classical method developed for building gene-level networks from being directly applied to splice isoforms.
In this paper, we used a Bayesian network-based multiple-instance learning (MIL) algorithm to solve this problem. MIL formulates a gene pair as a bag of multiple isoform pairs of potentially different probabilities to be functionally related, in analogy to a gene considered as a bag of isoforms of different functions 15,[30][31][32] . Our work is focused on constructing isoform-level functional relationship networks (FRN) for the mouse, which significantly differs from previous studies 11,12 in terms of both computational approaches and research content. The SpliceNet focused on building isoform-level coexpression networks using large dimensional trace and identifying differential networks between control and cancer samples 11 . The IIIDB work 12 used a gold standard of physical interaction data from the IntAct database 33 , and built logistic regression models to predict human isoform-level physical interaction networks. Using the proposed approach, we have built an isoform-level network for the mouse through heterogeneous data integration and validated it through both cross-validation and literature evidence 31,34 .

Methods
A multiple-instance learning algorithm for predicting isoform networks. The key challenge facing isoform-level network modeling is the lack of ground-truth functionally related isoform pairs. To solve this problem, following our recent work in predicting isoform functions 15,30 , we assume that: (i) of a functionally related gene pair (a positive bag), at least one of its isoform pairs (the instances) must be functionally related (Fig. 1A); (ii) for an unrelated gene pair (a negative bag), none of its isoform pairs can be functionally related (Fig. 1B). Then, the aim is to identify the truly functionally related isoform pairs ("witnesses") of the positive bags (Fig. 1). Under these two assumptions, the isoform network prediction is formulated as a MIL problem.
MIL can embed any base learner, such as support vector machines (SVM) [35][36][37] and Bayesian network classifiers 7,38 . Because genomic datasets are characterized by large scale, heterogeneity, and missing data, the , gene III with 2 isoforms and gene IV with 2 isoforms. None of the isoform pairs between gene III and IV is functionally related. (C) In the traditional gene-level network prediction, a classification model is built to distinguish positive pairs from negative pairs. (D) In the isoform-level network prediction using MIL, gene pairs are considered as 'bags' , each containing one or several isoform pairs, called 'instances' . A positive bag must have at least one of its instances (isoform pairs) being functionally related, which are called 'witnesses' (pairs in red). All instances (isoform pairs) in a negative bag must be functionally unrelated. MIL gives predictions at both bag and instance level.
Scientific RepoRts | 6:24507 | DOI: 10.1038/srep24507 state-of-the-art naïve Bayesian network was chosen as the base learner of MIL since it has been proven promising for data integration in the functional genomics field [39][40][41][42][43] . An extended version of naïve Bayesian networks is to incorporate conditional dependencies among features into the network model, which was shown to be a powerful approach in predicting eukaryotic transcriptional cooperativity, a type of functional networks between transcriptional factors 44 . One limitation of previously developed MIL algorithms is that they treat all isoform pairs or randomly choose one isoform pair in a positive bag as the witnesses in the initial iteration, which is very likely to introduce false positives 30,35 . To overcome this issue, we developed a single-instance bag MIL (SIB-MIL) algorithm which achieved the best performance in this context compared to the previous methods (Supplementary Figure S1). The algorithm is detailed in the following.
Without loss of generality, the ith gene pair containing m isoform pairs is denoted by = X i {x i1, x i2⋯ x im } with x ij, j = 1, 2 … m denoting the jth isoform pair of the ith gene pair. We assign the class label of the ith gene pair, denoted as y i , based on the hypotheses that for a positive gene pair, at least one of its isoform pairs is functionally related; if a gene pair is negative, none of its isoform pairs should be functionally related, which can be mathematically expressed as follows: if j subject to y for j subject to y where y i indicates the label of the jth isoform pairs of the ith gene pair. We refer to those positive instances (functionally related isoform pairs) of a positive bag (positive gene pair) as the witnesses. The MIL algorithm consists of three steps: (1) Initialization: because there is no existing isoform-level gold standard, a set of isoform pairs in positive gene pair bags needs to be selected as witnesses to build an initial model. Motivated by the fact that the isoform pair in single-instance positive gene pair bags must be positive, all the instances in such bags are labeled as Class 1 (functionally related). All instances in negative bags are labeled as Class 0 (functionally unrelated). In doing so, unlike the algorithms in 35 , no false positives will be introduced in this step. (2) The loop: (2.1) Model building: using the current witness set and the negative isoform pairs, we build a Bayesian network classifier that will be used to re-assign a probability score to all instances in the original training set. (2.2) Witness updating: For each positive bag, reselect the instance with the maximum probability score as the "witness" and label it as Class 1. For each negative bag, only the highest scored instance was chosen and labeled as Class 0; the reason is that a classifier is expected to perform well if it can correctly classify the most difficult examples. (3) Stop criteria and final predictions: The iteration is stopped when cross-validation performance does not change any more. The final classifier, built at the instance (isoform pair) level, will be used to predict the isoform network. Each isoform pair will be assigned with a probability to be functionally related. At the genepair bag level, the score of each bag is defined as the maximum of all scores of its instances.
We used a Bayesian network classifier, as previously described in the work 7,41,45,46 as the base learner. Assuming that each isoform pair is characterized by an n-dimensional feature vector (E 1 , E 2 , … E n ), the probability with which an isoform pair belongs to the positive class can be calculated using the Bayesian formula 7,47 : where P(y = 1) is the prior probability for an isoform pair to be positive, P(E i |y = 1), i = 1, 2, … , n, is the probability of the ith feature value, conditioned that the isoform pair is functionally related, and C is a constant normalization factor.
Isoform-level genomic data processing and gold standard construction. We have processed 65 heterogeneous datasets, including RNA-seq data, exon array, pseudo-amino acid composition and isoform-docking data. For the isoform docking data, the docking score between a protein pair is used as the feature data. For the other three types of datasets, we calculated the correlation between each isoform pair and used it as the feature input. Details for processing these four types of data into pairwise features are described in Supplementary Text S1. The data integrated in this work are listed in Supplementary Table S1. These four types of feature datasets together provide a largely comprehensive characterization of isoform pairs. In our study, a gene is assumed to carry a function if it is annotated to a biological process or a pathway. Two genes are assumed to have a functional relationship if they are co-annotated to the same biological process or pathway. We constructed a gene-level gold standard of functionally related pairs using the Gene Ontology (GO) 24 , KEGG 25 , and BioCyc 27 databases; 675, 124 positive gene pairs were obtained in total (Supplementary Text S1). Functionally related and unrelated gene pairs are called positives and negatives, respectively. Consistent with previous work in this field 7,38 , negative gene pairs are generated randomly.

Results
Simulation study shows accurate prediction of functional relationships at the isoform level. We first tested the performance of this algorithm using simulated data under different scenarios (Supplementary Text 2). Two parameters were tested in the simulation study: (1) the discriminativeness of the input data, measured by the mean difference (MD, see Fig. 2A-C),) of the values between the population of functionally related isoform pairs and the population of functionally unrelated isoform pairs and (2) the multi-isoform gene ratio (MGR), defined as the ratio of multi-isoform genes to the total number of genes (Supplementary Text 2). Though studies have shown that around 95% of multi-exon genes have isoforms 48 , we tested MGR values of 0.2, 0.3 and 0.5 for the reason that the NCBI RefSeq gene build used in this study (version 37.2) contains only high quality isoforms and therefore has many fewer annotated transcripts compared with the alternative. Ensembl gene model. The GTF file downloaded from the cufflinks website has only 3505 multi-isoform genes and 22287 single isoform genes, giving a MGR only 13.5%. For each simulation, we randomly partitioned gene pairs into disjoint training and test set, respectively. We repeated the partitioning 20 times and thus tested this method on 20 randomly generated test sets. For each partition, we built a Bayesian classifier model and predicted a probabilistic functional relationship score for each isoform pair.
We first investigated the influence of the discriminativeness of the input data on the predictive performance of this algorithm at the isoform pair level. We simulated that, for both positive and negative isoform pairs, the input feature values follow a normal distribution, with standard deviation equal to 1. Then, the mean difference (MD) values between the positives and the negatives can vary with the discriminativeness of the feature. With MGR fixed at 0.3, the prediction accuracies at the isoform pair level in terms of AUC with MD = 0.1, 0.2 and 0.3 are shown in Fig. 2D-F. As expected, with increasing MD values the classification performance improved significantly. The median AUC on the 20 test sets for MD = 0.1, 0.2, 0.3 are 0.659, 0.838 and 0.929, respectively. In addition, we also calculated the AUPRC (area under precision recall curve) and found that it also improves with increasing MD values ( Fig. 2G-I). This observation suggests that this algorithm works well with input genomic data of very weak (MD = 0.1 or 0.2) discriminative power.
We further looked at how the predictive performance of this algorithm will change with the multi-isoform gene ratio. To this end, we fixed the value of the mean difference of input data to be 0.2. The prediction accuracy at the isoform level in terms of AUC at MGR = 0.2, 0.3 and 0.5 are 0.831, 0.838 and 0.827, respectively ( Fig. 3A-C). This range of MGR is equivalent to the current MGR ratio in the RefSeq database. In addition, we also calculated the AUPRC (Fig. 3D-F), which is much higher than the baseline AUC (approximately 0.05) since the proportion of simulated functionally related gene pairs is 0.05. Interestingly, we found that the performance gain of the converged model over the model at the first iteration increases with the fraction of multi-isoform genes. These gains for MGR = 0.2, 0.3, 0.5 are 0.0019, 0.0030 and 0.0124 for AUC. Overall, this shows that this algorithm is robust against the percentage of multi-isoform genes among all genes, and will remain to be applicable when new alternatively spliced isoforms are identified and verified.
We also analyzed the effects of different combinations of MD and MGR values, and found that, for assigning isoform pair-level labels, this algorithm is robust to variations of the input data accuracy as well as the fraction of multi-isoform genes ( Supplementary Figs S2-S4).
Modeling and validating the functional relationship network using single-isoform gene pairs as true gold standard. The RefSeq gene build (version 37.2) was used in our study to build the isoform network for the mouse because it is validated and of high quality. First, we evaluated the performance of each type of feature data through 5-fold cross validation. The results in terms of AUC are presented in Table 1 (also Fig. S5). The AUC of all these features range from 0.521 to 0.612 with protein docking data ranking the highest. The reason why docking is most discriminating may be that it directly models protein binding, unlike expression features which captures only co-expression/co-regulation properties. The integrated network achieves the best performance, supporting expected power of integrated data analysis. The gold standard contains a large proportion of genes containing only a single isoform. Gene pairs formed by such single-isoform genes are therefore equal to isoform-level gold standard and therefore can be used for evaluating the performance of isoform networks.
To more reliably assess the performance on the real mouse data, only the experimentally validated (evidence codes: EXP, IMP, IPI, IGI, IEP and IDA) gene annotation in Gene Ontology (79,562 pairs) combined with those from KEGG and BioCyc pathways (106,276 positive pairs in total) were used for computational validation. We  Table 1. Comparison of 5-fold cross-validated prediction for each type of feature data and the integrated network.
randomly generated 5 disjoint training and test sets for cross-validation, and showed the predictive performances of single-isoform gene pairs in Fig. 4A. The results from single-isoform gene pairs (true gold standard) are highly accurate with AUC being 0.656 ± 0.002, demonstrating that the proposed SIB-MIL method works very well also with experimental data. In addition, for multi-isoform gene pairs, we assigned each gene pair a score as the maximum probability of all its isoform pairs (since the isoform-level information is not available-unlike the simulation studies), under the assumption that co-functionality of a gene pair must be carried out by at least one of its isoform pairs. Gene-level prediction results are also provided in Fig. 4B, which are similar to that of single-isoform pairs. In addition, we also tested the performance of alternative validation methods using pre-filtered feature subsets (Fig. S6). These results are also accurate. We further validated the predictions with a set of experimentally verified isoform-isoform interaction data provided in the Corominas data 23 . This dataset is recent, and it is the largest isoform-level, experimentally validated interaction dataset that has been generated. In this study, the authors tested a set of multi-isoform genes against a set of isoforms for interactions. As a result, for each interacting gene pair, one of the isoform pairs may be interacting, while others are not. The uniqueness of this dataset allows us to directly test whether we will be able to differentiate the isoform pairs that are truly interacting, against the isoform pairs that come from the same gene pair but are not interacting. Based on the supplementary materials of this paper, we obtained a total number of 629 isoform pairs that have interactions. Of the 629 pairs, 614 are between genes. These 614 pairs are used as positives. The remaining 15 pairs are interactions between the same isoforms and are excluded from this analysis. Next, we identified the gene pairs that contain these 614 isoform pairs. These gene pairs contained 1304 isoform pairs in total. This means that 690 isoform pairs are not identified as interacting in this experiment. These 690 isoform pairs are used as negatives. This set allows us to test whether we can differentiate which isoforms are actually interacting, against the isoform pairs coming from the same gene pair but not interacting. This is a desired and unique function of the algorithm described in this study.
We found that our predictions are accurate with excellent precision for the top predictions (Fig. 5). The precision, obtained at the isoform level, is comparable to those of our previous studies obtained at the gene level 6,7,45 . This validates that even for the same gene pair, this method is now capable of differentiating the isoform pairs that are truly interacting, versus the ones that are not. This is a capability that is new to this method, and would be valuable to the mouse genetics community.
We have built a genome-wide isoform-level network for the mouse by integrating the isoform-level genomic features and the gene-level gold standard functionally related gene pairs. We also shuffled the gold standard and  computed a random network whose isoform pair scores are closely normally distributed with mean 0.050 and standard deviation 0.023. Based on this randomization control, a score > 0.119 (3 standard deviation away) in the network would be likely non-random.
We next evaluated the accuracy of the newly predicted highest scored isoform pairs (not recorded in GO, KEGG or BioCyc) against public databases, including protein-protein interactions [49][50][51][52][53][54][55] , MSigDB gene sets 56 and Reactome pathways 29 (Table 2). These databases provide a rich resource to test the performance of novel predictions. For multi-isoform gene pairs, we assigned the maximal probability of all its isoform pairs to each gene pair. We identified a list of 680, 624 gene pairs with probability >0.95, representing a set of functionally highly related gene pairs that were newly predicted. We found that 36.0% of top predicted gene pairs (244,957 gene pairs) were supported by co-annotation to the same biological process/pathway or having physical/genetic interactions, which is significant (p < 0.000001), compared to 7.0% of randomly generated gene pairs. This result further supports the high precision of the proposed network prediction model.

The isoform-level network provides a high-resolution view of functional relationships.
We found that this isoform-level functional relationship network for the mouse is capable of identifying differential functional relationships between isoform pairs belonging to the same gene pair. For example, the local gene-level network of Ptbp1 (pyrimidine tract binding protein 1, an element of the spliceosome machinery) describes the functional relationship between any two genes with a single probability value (Fig. 6A, left panel). In fact, many genes in this network have multiple isoforms. For example, both Ptbp1 and Banf1 have two alternatively spliced isoforms. The functional relationship between Ptbp1 and Banf1 in this isoform-level network can therefore be dissected into four functional linkages corresponding to four isoform pairs (Fig. 6A, right panel). Among the four isoform pairs, the functional relationship of the isoform pair [NM_008956.2, NM_011793.2] is predicted to be 0.999, whereas the probabilities of the other three isoform pairs are much lower − 0.233, 0.084, and 0.045, respectively.
Such disparities of connections between isoform pairs of the same gene pair are prevalent. To quantify such differences, we calculated the ratio of the maximum to the minimum of the predicted probability among all isoform pairs of a given gene pair, respectively. For example, this ratio for the Ptbp1 and Banf1 gene pair is calculated as 0.999/0.045 = 22.2, where 0.999 and 0.045 are the maximum and minimum score between this gene pair, respectively (Fig. 6A). From the whole isoform-level network of the mouse, we found that this ratio spans a wide range from 1.0 (no difference) to more than 3500 (3500 fold difference) (Fig. 6B). Notably, 25% of these gene pairs have a fold change value larger than 3.0, implying that a high proportion of the gene pairs are functionally highly differentiated at the isoform level. The significance of the wide ratio span was indicated using a random network built with shuffled gold standard (Fig. 6C). These results suggest that difference is prevalent across isoform pairs coming from the same gene pair and that this isoform network can reveal such variations.
The isoform-level network reveals functional diversity of different isoforms of the same gene. It is known that proteins encoded by isoforms of the same gene can carry out different and even opposite biological functions, such as pro-apoptotic versus anti-apoptotic actions of bclx-L vs bclx-S and of caspase 3 (L vs S) and transcriptional activation versus transcriptional repression for odd-skipped 2 34 . Investigating and revealing the functional diversity of the same gene achieved by alternative splicing is pivotal to biology. Because of its high resolution, this isoform network has the ability to reveal such functional diversity.
To systematically examine the functional diversity represented at the network level, for each of the 3427 validated multi-isoform genes in the RefSeq database, we compared the networks (with the top 25 neighbors) of its isoforms and counted the number of shared functionally related neighbors (Supplementary Fig. S7). We found that the minimum, mean and maximum numbers of shared neighbors are 0, 4 and 24, respectively. These statistics indicate that many isoforms of the same gene have different functional connections and may participate in different biological processes. Based on literature 15,30,34,47 , we collected a list of genes whose isoforms were shown to have different functions, and calculated for each gene the number of shared connections between isoform networks (Table 3). For example, Anxa6 has two alternatively spliced isoforms: NM_001110211.1 and NM_013472.4. Both isoforms have the same N-and C-termini, but the former encodes a shorter protein by six amino acids (525-530) due to the lack of an alternate in-frame exon compared to the latter. As an illustration, we identified the local networks of these two isoforms (Fig. 7). Their local networks share only 13 out of 25 neighboring isoforms, indicating a diverse functional relationship map of these two isoforms despite similar structures.    To investigate the functional differences of the genes in the two local networks, we performed Gene Ontology (biological process terms) enrichment analysis using the top connected genes in each network (Supplementary  Table S2). We found that, while sharing the same GO terms (such as GO:0006944 cellular membrane fusion and GO:0061025 membrane fusion), the two isoform networks are also enriched for genes annotated to different GO biological processes. The isoform network of NM_001110211.1 is enriched for genes associated to vesicle fusion (GO:0006906, p = 0.0096), organelle fusion (GO:0048284, p = 0.044), and amino acid activation (GO:0043038, p = 0.0262), whereas the isoform network of NM_013472.4 is enriched for genes related to regulation of cell shape (GO:0008360, p = 0.0135). These disparate enriched functions strongly support the functional diversity of the two isoforms of the Anxa6 gene. This computational modeling of the folding and conformation of the two isoforms shows a striking difference in likelihood of phosphorylation in the Thr-Pro-Ser (535-537 vs 529-531) sequence 34 . In addition, the alternative splice isoforms of the Anxa6 gene have been reported to have functional differences on catecholamine secretion 57 , which is consistent with this functional enrichment analysis related to the vesicle fusion and organelle fusion (Supplementary Table S2). These results suggest that this isoform-level network is able to reveal functional diversity of different isoforms of the same gene and could therefore become a promising tool for investigating gene functions at the isoform level. To facilitate new isoform function annotation based on this network, we included this enrichment analysis for all local networks of individual isoforms in our website.

Discussion
We have developed a novel Bayesian network-based multiple instance learning approach to probe functional relationships at the isoform level, thus being able to provide a higher resolution view compared to traditional gene-level networks. Determining functional connections between splice isoforms from the same gene is essential to functional genomics, which would help deepen our understanding of gene functions and functional relationships and may provide useful information on diseases caused by alternative splicing. It is widely understood that the topology of molecular pathways varies between transcripts or protein isoforms. The current work also has limitations since it represents a generic network without considering tissue-specific expression of isoforms across tissues or cell types. A major next step is to build tissue, cell-type, and phenotype-specific networks for more refined understanding of functional relationships in a given context both for the mouse and for the extension of this strategy into human studies. As an example, isoform networks for normal brain and brain with Alzheimer's disease (AD) would be of interest for understanding perturbed pathways or networks in AD and probably heterogeneous causes of AD. The established isoform network provides a first systematic attempt to convey isoform-level connection and functional relationships. We expect that isoform-level networks will find wide applications in genomic and biomedical applications, and that the current gene-centered network modeling approach will be expanded to a more refined isoform level.