An integrated network of Arabidopsis growth regulators and its use for gene prioritization

Elucidating the molecular mechanisms that govern plant growth has been an important topic in plant research, and current advances in large-scale data generation call for computational tools that efficiently combine these different data sources to generate novel hypotheses. In this work, we present a novel, integrated network that combines multiple large-scale data sources to characterize growth regulatory genes in Arabidopsis, one of the main plant model organisms. The contributions of this work are twofold: first, we characterized a set of carefully selected growth regulators with respect to their connectivity patterns in the integrated network, and, subsequently, we explored to which extent these connectivity patterns can be used to suggest new growth regulators. Using a large-scale comparative study, we designed new supervised machine learning methods to prioritize growth regulators. Our results show that these methods significantly improve current state-of-the-art prioritization techniques, and are able to suggest meaningful new growth regulators. In addition, the integrated network is made available to the scientific community, providing a rich data source that will be useful for many biological processes, not necessarily restricted to plant growth.


Figure S1 -General topological properties distribution A) Degree distribution of growth regulators versus non growth regulators. The median degree for GR genes is 2259 (black line), while this value for non-GR genes is 1023.
B) Distribution of the number of direct edges to growth regulators for growth regulators themselves, and non-growth regulators. The median for GR genes is 38, while for non GR genes it is 7.

C) Distribution of the number of shared neighbours for growth regulators, and non growth regulators. The median for GR genes is 2673, while it is 335 for non GR genes.
Figure S2 -Edge and node betweenness for the local network of growth regulators. A bigger font size of the nodes (genes) corresponds to a more prominent role in connecting subparts of the network, while an increased thickness of the edges corresponds to the edge betweenness, illustrating paths that are taken more frequently when connecting nodes. Genes are again organized into layers based on their degree. In addition to the leave-one-out crossvalidation (LOOCV) setup, we also explored the effect of using less GR genes for training the model, using a 10-fold crossvalidation setup. Figure S4 shows a comparison between the LOOCV setup and the 10-fold crossvalidation for the model-based approaches. Similar graphs are shown for the comparison of the best models to GeneMANIA ( Figure S5), and the best ensemble models ( Figure S6).     Table S1 gives an overview of all genes that were used as input for the prioritization algorithm. These include both the Intrinsic Yield Genes (IYG) as well as the growth regulators that were identified through a microarray experiment of developing early leaves (GR). Name  AT codes Type  Name  AT codes Type  Name   AT1G01160  IYG  GIF2  AT4G37740  IYG-GR  AtGRF2  AT4G37610  GR  BT5   AT1G07640  IYG  OBP2  AT4G37750  IYG-GR  ANT  AT5G04150  GR  BHLH101   AT1G13710  IYG  CYP78A5  AT4G39400  IYG  ATBRI1  AT5G08330  GR  ATTCP11   AT1G15690  IYG  ATAVP3  AT4G40060  IYG  ATHB-16  AT5G11060  GR  KNAT4   AT1G19270  IYG  DA1  AT5G07310  IYG  -AT5G11260  GR  HY5   AT1G21700  IYG  ATSWI3C  AT5G08790  IYG  anac081  AT3G04730  GR  IAA16   AT1G26770  IYG  AT-EXP10  AT5G10510  IYG  AIL6  AT3G09600  GR  LCL5   AT1G30210  IYG  ATTCP24  AT5G13060  IYG  ABAP1  AT3G13040  GR  -AT1G32310  IYG  SAMBA  AT5G18010  IYG  SAUR19  AT3G15540  GR  IAA19   AT1G50030  IYG  TOR  AT5G28640  IYG-GR  AN3  AT3G16870  GR  GATA17   AT1G53230  IYG  TCP3  AT5G47230  IYG (Yilmaz et al., 2011).

AT codes Type
The AtRegNet database was converted into a network, keeping for each TF its direct target genes.

MaMut genetic modification design network
This network was extracted from the "genetic modification dataset" of CORNET, a publicly available database on gene associations in plants (De Bodt et al., 2012).
This network contains information on differentially expressed genes when comparing wild type plants to transgenic plants. Links in this network represent genes that are either up-or downregulated when knocking out one or more transcription factors.
These differentially expressed genes are assumed to be the target (either direct or indirect) of the transcription factor that was knocked out.

Protein-protein interaction (PPI) network
The PPI network was extracted from CORNET, and includes predicted as well as experimentally identified protein-protein interactions in Arabidopsis from different sources. Some of these interactions were derived from the original AraNet network (De Bodt et al., 2012).

GeneMANIA network
The GeneMANIA network represents a combination of different publicly available data sets, collected from a variety of databases. These include co-expression data, co-localization data, genetic interactions, physical interactions, shared protein domains and predicted interactions, all combined into a single network. A detailed overview of all networks used by GeneMANIA can be found on the GeneMANIA website 1 .

GENIE3 predicted regulatory network
In order to construct computationally predicted transcriptional regulatory networks, we used the GENIE3 algorithm (Huynh-Thu et al, 2009), which achieved the best performance on the DREAM5 network inference challenge (Marbach et al., 2012).
GENIE3 was run on the "leaf" compendium from the CORNET database. The result of this analysis is a list of predicted transcription factortarget relations and an associated confidence score. Using a cutoff of 0.02 we only used the most confident associations to build the network.

Co-expression network using Pearson correlation coefficient (PCC)
Using the "leaf" compendium from the CORNET database we calculated a coexpression network by calculating the Pearson correlation coefficient between the expression patterns of all genes. As the resulting network is huge, we only keep the 5% most significant gene pairs, corresponding to the correlations that have at least an absolute value of about 0.8, thus keeping the most correlated or anti-correlated genes.

Text mining network
A network of gene-gene associations predicted using text mining was extracted from computations related to network properties as well as integration were mostly done using two R packages: "igraph" (Csardi and Nepusz, 2006) and "Matrix" (Bates and Maechler, 2012).

Topological features
Two major classes of features were extracted from the network. General topological properties only capture the topology of genes in the network, and similarity measures compute how much a gene is similar to a set of pre-defined genes. The similarity can be measured either using topology information or using GO terms.

General topological properties
The following general topological properties were used for the prioritization approaches:  Degree The most elementary characteristic of a gene is its degree (or connectivity), k, which tells us how many links the gene has to other genes (Junker and Schreiber, 2008)  Betweenness Centrality Suppose that, in order for gene i to contact gene j, gene k must be used as an intermediate station. Then gene k is such that it has a certain "responsibility" to gene i and j. If we count all of the minimum (short) paths which pass through gene k, then we have a measure of "stress" which gene k must undergo during the activity of the network (Freeman, 1977). We can then calculate the total number of paths passing through the gene k, defined as where n is the number of genes in the graph and ( ) ( ) is the number of shortest paths linking p i and p j , and ( ) is the number of shortest paths linking p i and p j that contain p k .  Kleinberg's hub and authority scores The Kleinberg's hub and authority scores are based on the principle eigenvector of the adjacency matrix A of the network. The authority score is defined as the principle eigenvector of A T A and the hub score is based on the principle eigenvector of AA T (Kleinberg, 1998).

 Closeness
The closeness of a gene is the inverse of the average length of the shortest paths to/from all the other genes in the graph (Freeman, 1978).

Topology-based similarity measures
 Number of shared genes with the known genes This number represents the number of neighbouring genes that a gene has in common with the known genes. This idea is based on the fact that, when two genes share many neighbours with each other, they are likely to be involved in the same biological process.  Number of direct connections to the list of known genes This represents the number of edges of each gene directly connected to class GR genes  Jaccard Similarity Index The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets: In a network, the Jaccard similarity coefficient of two genes is the number of common neighbors divided by the number of genes that are neighbors of at least one of the two genes being considered.  Dice similarity index Related to the Jaccard index, the Dice index expresses the degree to which two different species are associated. The similarity of two sample sets A and B is twice the intersection divided by the sum of the elements in two sets: | | | | | | In a network, the Dice index of two genes is twice the number of common neighbors divided by the sum of the degrees of the genes (Dice, 1945).  Inverted weight score The inverse log-weighted similarity index was proposed to mine information on the internet in order to extract social networks, and is defined as: In a network, the inverse log-weighted similarity index of two genes is the number of their common neighbors, weighted by the inverse logarithm of their degrees. It is based on the assumption that two genes should be considered more similar if they share a low-degree common neighbor, since high-degree common neighbors are more likely to appear even by pure chance. Isolated genes will have zero similarity to any other vertex. Self-similarities are not calculated (Adamic and Adar, 2003).  Shortest path The shortest distance between a gene and all class A genes can be considered a similarity index which can show some relatedness between a gene and the genes in class A.

GO-based similarity measures
For the approach based on GO term overlap, we define three term overlap similarity measures: term overlap between neighbours of a query gene and the seed genes, term overlap between the query gene itself and the seed genes, and finally the combination of two.
The magnitude of the term overlap was measured by the Jaccard similarity coefficient. For the approach based on GO enrichment, we followed the approach proposed in (Rahmani et.al, 2012). In this method, we count relevant GO terms for the set of seed genes S, and select the top ten statistically most overrepresented terms (Table S3). Next, for these terms a two-way table is constructed using the frequency of the terms in the seed genes as well as the query gene and its neighbors.
The reason for including neighbors is the fact that GO annotations of proteins can often be predicted well from the GO annotations of their neighbors (Schwikowski et al., 2000, Rahmani et al., 2011. The p-value of a Fisher exact test, comparing the frequency of terms in the two groups is then used as a similarity measure. The higher the p-value, the more similar the representation of GO terms between the two groups will be. The same procedure was then also applied for the top five and the first GO terms.

Model-based approaches
Model types

Naïve Bayes
A Naïve Bayes classifier uses Bayes' theorem, additionaly making the assumption of independence between features, given the class variable, to reduce the computational cost. Although it may seem that the independence assumption ignores relationships between features, most of the time Naïve Bayes performs well. We computed the conditional posterior probabilities for each class and consider the probability for class S to rank the test genes. The method was implemented by the Rpackage "e1071" (Dimitriadou et al., 2011).

Linear Discriminant Analysis (LDA)
LDA is a statistical classifier which tries to find the best linear combination of features in order to separate two classes of events. In the package "MASS" (Venables and Ripley, 2002), we used the lda function to fit the LDA. The prior probabilities of class membership are the class proportions for the training set. The posterior probability that a gene belongs to class S is the product of the prior probability and the multivariate normal density.

Support Vector Machines (SVM)
SVM, a machine learning algorithm, is a non-probabilistic binary classifier. The basic idea in SVM is to find a maximal margin separation between the two classes. SVM is able to manage non-linear classification problems by means of the so-called kernel trick.
The package "e1071" (Dimitriadou et al., 2011) was used to fit the SVM model. The kernel used is the default one, radial basis, the degree=3, gamma parameter defined 1/(data dimension), tolerance of termination criterion (default: 0.001), epsilon in the insensitive-loss function (default: 0.1), the shrinking option was used as well. To produce probabilistic outputs, we used the probability option in this package.

Lasso and elastic-net regularized generalized linear models (Glmnet)
The elastic net is a regularized regression model that uses two penalization criteria, L1 and L2. It combines these two criteria which come from lasso and ridge methods.
We used the "glmnet" (Friedman et al., 2010) R package to fit a regularized version of logistic regression. This approach fits a generalized linear model by using penalized maximum likelihood. The regularization path is computed for the lasso or elasticnet penalty at a grid of values for the regularization parameter lambda. The lambda parameter is fixed to 0.9.

Random Forests (RF)
Random Forests are an ensemble of many single decision trees. The method uses a voting systems to determine the class of objects based on the output of singles trees.
We used the random forests implementation of the "randomForest" R package (Liaw and Wiener, 2002). The number of trees used was the default value; 500. The number of variables randomly sampled at each split is the square root of the number of features. Sampling was done with replacement, and the sample size was chosen equal to the training set size. We used the vote ratio as a probability that a gene is a member of class S. The vote ratio for class S is the number of trees that predicted class S for a given gene, divided by the total number of trees in the random forests model.

Generalized Boosted Regression Models (GBM)
GBM uses boosting, an iterative process to add new functions in order to reduce the misclassification rate. We used the "gbm" R package (Ridgeway, 2012), which implements extensions to Freund and Schapire's AdaBoost algorithm and J.
Friedman's gradient boosting machine. For classification problems, the distribution was defined for "adaboost" (the AdaBoost exponential loss for 0-1 outcomes). All other parameters were fixed to their default values.

Ensemble methods
To explore to which extent the results of the model based classifiers could be improved we also tested different combinations of classifiers, a technique commonly referred to as ensemble models. Two main approaches to create ensembles were considered. First, by averaging the prediction probability of each classifie and second, by aggregating the ranks of the genes resulting from each classifier. We defined ten ranking ensembles as well as nine probability prediction ensembles. An overview of all combinations of methods that we tested can be found in Table S4.
Overall, no major improvements were noted by combining methods, and the best combination was only able to slightly improve the median ranking from 589 to 520 (RF+Glmnet, Table S5). The result of the best first quartile (rank 127) was not improved upon by any combination of methods.

Network importance
To assess the importance of each subnetwork for the prioritization we explored the impact of leaving out each network, and compared the prioritization results without the network to the original approach using all networks integrated. Results are presented for the model-based methods ( Figure S7) and the best ensemble method ( Figure S8). The first box in each plot shows the ranking when we used the complete integrated network. The other boxes show the ranking when we removed the corresponding network from the integrated one. A decrease in the ranking compared to the first box (Total) shows that the corresponding network has a negative impact on the final ranking since removing the subnetwork allows the GR genes to be better ranked.