Improving Gene Regulatory Network Inference by Incorporating Rates of Transcriptional Changes

Organisms respond to changes in their environment through transcriptional regulatory networks (TRNs). The regulatory hierarchy of these networks can be inferred from expression data. Computational approaches to identify TRNs can be applied in any species where quality RNA can be acquired, However, ChIP-Seq and similar validation methods are challenging to employ in non-model species. Improving the accuracy of computational inference methods can significantly reduce the cost and time of subsequent validation experiments. We have developed ExRANGES, an approach that improves the ability to computationally infer TRN from time series expression data. ExRANGES utilizes both the rate of change in expression and the absolute expression level to identify TRN connections. We evaluated ExRANGES in five data sets from different model systems. ExRANGES improved the identification of experimentally validated transcription factor targets for all species tested, even in unevenly spaced and sparse data sets. This improved ability to predict known regulator-target relationships enhances the utility of network inference approaches in non-model species where experimental validation is challenging. We integrated ExRANGES with two different network construction approaches and it has been implemented as an R package available here: http://github.com/DohertyLab/ExRANGES. To install the package type: devtools::install_github(“DohertyLab/ExRANGES”).


Supplemental Figure 3: TF targets identified differ when using EXPRESSION or ExRANGES as features.
Scatter plots showing targets for the TFs A) NPAS2 B) PER1 C) NR1d1 and D) ARNTL. TF targets identified with similar rank by both approaches are shown in grey. Targets identified as high ranking by RANGES are shown in blue and those identified by EXPRESSION are red. TF targets identified by ChIP-Seq 20,21 are marked as stars. Genes that were not in the ChIP-Seq identified targets are plotted as points. Histogram showing the top 1000 PER1 targets identified by A) EXPRESSION (red) have a higher distribution of expression as measured by hybridization intensities compared to the background distribution of all genes (grey). B) Rate Change (blue) identified targets show a similar expression distribution to the background genes. C) The distribution of expression levels of the top 1000 targets identified by EXPRESSION (red) for all TFs is higher than the background gene expression (grey). D) The distribution of expression levels for the top 1000 targets of each TF identified by Rate Change (blue) is similar to the distribution of the expression levels of all genes (grey). Figure 7: Variance comparison of the viral and circadian data sets. The index of dispersion is calculated by dividing the variance of each gene by its mean expression level and taking the mean of these values over all genes in the dataset. The circadian data set showed a significantly higher Index of Variation than the viral data set (Student's t-test, p-value < 10-15 ).

Supplemental Figure 8: ExRANGES improves TF-TF network reconstruction of the Arabidopsis circadian clock.
Network model of the TF-TF interactions in the Arabidopsis circadian clock as generated using ExRANGES or EXPRESSION as input to GENIE3. The model from literature as reviewed in Greenham and McClung, 2015 54 is shown on the bottom for comparison. Each node is colored by the phase of expression: morning (yellow), afternoon (orange), and evening (blue). Dashed edges are predicted interactions that exist in both the EXPRESSION and the ExRANGES networks; solid edges are unique to either the EXPRESSION or the ExRANGES network. The * indicates that the probe set on the microarray corresponding to this gene can bind transcripts from more than one unique locus. Figure 9: Using Inferelator 4 , ExRANGES as input improves identification of TF targets compared to EXPRESSION for some data sets. ROC curves and Precision-Recall curves showing performance of Inferelator using EXPRESSION (red, solid) or ExRANGES (blue, dashed) as input. From left to right Circadian is CircaDB data set from mouse tissues; Viral is Human viral data set, and Arabidopsis is Arabidopsis time series data sets across different environmental variables.

Supplemental Materials and Methods for "Improving Gene Regulatory Network Inference by Incorporating Rates of Transcriptional Changes"
Jigar S. Desai, Ryan C. Sartor, Lovely Mae Lawas, SV Krishna Jagadish, Colleen J. Doherty

Pseudocode:
LS is a gene by time point matrix with genes as rows and timepoints as columns.
Times is a vector of actual times (usually hours 1 -24) To calculate rate of change (RATE values):  [1,]))]

Circadian Data Set
Normalized expression data from murine sources were downloaded from CircaDB 19 . Microarray-based expression levels from 288 samples were used in this study. The data available was from twelve different tissues that were sampled every 2 h for 48 h.

Viral Data Set
The expression data used for the viral experimental analysis was downloaded from GEO GSE73072. This dataset is composed of seven studies of individuals sampled before and after a respiratory infection. The transcript levels are assayed from blood samples of approximately twenty individuals taken over a seven to nine day period, depending on the individual study. Sampling was not evenly spaced between time points. In total, data from 2372 microarrays were used. The expression datasets used for the analyses described in this manuscript were contributed by Drs. Ephraim Tsalik and Geoffrey Ginsburg from Duke University and the Durham VA Medical Center. They were obtained as part of The Respiratory Viral DREAM Challenge through Synapse ID syn5647810 29,30 .

S. cerevisiae RNA-Seq Data
RNA-Seq based expression data from S. cerevisiae was downloaded from GEO GSE61668 33 . This data set was collected from a study to evaluate phosphate starvation in six genotypes of S. cerevisiae. Transcript expression was measured by RNA-Seq every 15m for six hours after transfer to reduced phosphate media (150 samples total).

Arabidopsis Circadian Data
Normalized microarray expression data for Arabidopsis were obtained from www.mocklerlab.org/diurnal [36][37][38]55,56 . This data set consists of transcript data from Arabidopsis plants of various ages grown under 12 different environmental conditions sampled every 4 h for 48 h for a total of 144 samples.

Oryza sativa Diel Data
Rice variety IR64 was grown in the field at the International Rice Research Institute (Manila, Philippines). When the plants reached 50% flowering, panicle tissue was harvested at dawn, dawn + 3.5h, dawn + 7h, dawn + 10.5h, dusk, dawn + 14h, dawn + 17.5h, and dawn + 21h. Four replicates were harvested for each of these eight time points for a total of 32 samples. The third rachis of the panicle was ground in liquid nitrogen with a metal pestle. The tissue was then lyophilized at -60C°overnight. Total RNA was isolated using RNeasy Plant Mini Kit (Qiagen, Germany) with the recommended RLT lysis buffer. The RNA extraction protocol was modified to include an additional incubation with DNaseI. mRNA was isolated from 2 µg of total RNA using magnetic oligo(dT) (NEB, Ipswich, MA). Directional RNA-Seq libraries were prepared from isolated mRNA. Libraries were quantified using a 2100 Bioanalyzer (Agilent, Santa Clara,