Functional Cohesion

33 Transcriptional regulatory networks (TRNs) can be developed by computational approaches that


76
Transcriptional regulatory networks provide a framework for understanding how signals are 77 propagated throughout the transcriptome of an organism. These regulatory networks are 78 biological computational modules that carry out decision-making processes and, in many cases, 79 determine the ultimate response of an organism to a stimulus [1]. Understanding the regulatory 80 networks that drive responses of an organism to the environment provide access points to 81 modulate these responses through breeding or genetic modifications. The first step in 82 constructing such networks is to identify the primary relationships between transcription factor 83 (TF) regulators and the target genes they control. 84 Experimental approaches such as ChIP-Seq can identify direct targets of transcriptional 85 regulators. However, ChIP-Seq must be optimized to each specific TF and specific antibodies 86 must be used that recognize either the native TF or a tagged version of the protein. This can 87 present a technical challenge particularly for TFs where the tag interferes with function, for 88 species that are not easily transformable, or for tissues that are limited in availability [2]. Since 89 global transcript levels are comparatively easy to measure in most species and tissues, several 90 approaches have been developed to identify connections between regulators and their targets by 91 examining the changes in transcription levels across many samples [3][4][5][6]. The assumption of 92 these approaches is that there is a correspondence between the expression of the regulator gene 93 and its targets that can be discerned from RNA levels. Therefore, given sufficient variation in 94 expression, the targets of a given factor can be predicted based on associated changes in 95 expression. Initial approaches focused on the correlation between regulators and targets such 96 that activators are positively correlated and repressors are negatively correlated with their target 97 expression levels. These approaches have been successful in identifying some relationships [7].

Rate Change Identified Samples with Lower Variation Between Tissues 194
To understand why some targets are identified by EXPRESSION only and others by RANGES 195 only we compared the expression of the top predicted PER1 targets for each method (Fig 2A). 196 Scatter plot of targets of PER1 as identified by EXPRESSION or RANGES approaches. PER1 targets identified with similar rank by both approaches are shown in grey. PER1 targets identified as high ranking by RANGES are shown in blue and those ranking higher by EXPRESSION are red. PER1 targets identified by ChIP-Seq [20] are marked as stars. Genes identified as PER1 targets by each approach that were not identified in the ChIP-Seq identified targets are plotted as points.
We observed that the top hits identified by EXPRESSION showed more variation between each 197 tissue than in those identified by RANGES. We therefore examined the variance between each 198 tissue by calculating the variance of the mean expression for each of the 12 tissue samples for the 199 top 1000 targets for all five of the TFs with ChIP-Seq data available (Fig 2B) [20]. As observed 200 for the top PER1 targets, the targets identified by EXPRESSION generally showed more 201 variation between tissues than the targets identified by RANGES. We also examined the within 202 tissue variation to evaluate how well each approach identified targets that show a range of 203 expression throughout the day within each time series (Fig 2C). The targets identified by 204 RANGES showed more variation in the time series within each tissue suggesting that this 205 approach might be more sensitive to changes that are dependent on the rate of expression as we 206 would expect for this rate-based approach. To evaluate if the increased variance within each 207 tissue observed for top TF targets identified by the RANGES approach is limited to circadian 208 associated TFs, we compared the between tissue and within tissue standard deviation for the top 209 1000 targets identified by EXPRESSION or RANGES for all 1690 TF regulators (Figs 2D and 210 E). As we observed for the circadian TFs, the targets identified by EXPRESSION showed more 211 variation between tissue types ( Fig 2D). The RANGES approach was able to identify targets 212 with increased variation within each tissue time series compared to the EXPRESSION approach 213 ( Fig 2E). 214 We also compared the mean intensity level of the top 1000 predicted targets of the 215 RANGES and EXPRESSION approaches. We observed that the top 1000 targets of PER1 216 identified by EXPRESSION had higher intensity levels compared to the distribution of 217 expression of all transcripts on the microarray (Fig S3A). In contrast, the top 1000 predicted 218 targets of PER1 identified by RANGES resembled the background distribution of intensity for 219 all the transcripts on the array (Fig S3B). Likewise, the hybridization intensity of the genes 220 identified as the top 1000 targets identified by EXPRESSION of all 1690 TFs considered as 221 regulators was shifted higher compared to the background distribution levels ( Fig S3C). While 222 the top 1000 targets of all 1690 TFs identified by RANGES reflected the background distribution 223 of hybridization intensity (Fig S3D). While hybridization intensity cannot directly be translated 224 into expression levels, these observations suggest that there are features of the targets identified 225 by RANGES that are distinct from those identified by EXPRESSION. We hypothesized that 226 combining these two approaches would improve the overall ability to detect true positive targets 227 of each regulator. 228 229

ExRANGES Combines Rate Change with Expression Levels 230
Since many of the true positive targets of the TFs we evaluated identified by RANGES were not 231 identified by EXPRESSION and visa versa, we hypothesized that combining these two would 232 improve the overall ability to predict true positive targets. To combine these approaches we took 233 the product of the expression at time point t by the RANGES p-values for the change in 234 expression from time point t to t+1 for each target (ExRANGES) (Fig 3A). This adjusts each 235 time point by the rate of change in the following time interval. Therefore, the value of the time 236 point preceding a significant change in expression is higher than the value of a time point when 237 the following expression remains unchanged. We anticipate that this will enhance the signal 238 between the regulator and target for the time points where regulation is occurring, thus 239 improving the ability to correctly identify targets of each TF. For the regulators, only the 240 expression value of the TF was provided. For all targets, this ExRANGES value was provided to 241 GENIE3. All TFs were also considered as potential targets and the ExRANGES value was used 242 in the target matrix for all TFs. 243 calculated the area under the ROC curve to compare the identification of true targets attained by 245 EXPRESSION to the combination of expression and p-values using ExRANGES. We observed 246 that for all five TFs there was an improvement in the ability to identify ChIP-Seq targets (Fig  247   3B). 248 A modification of GENIE3 uses a time delay to identify transcriptional changes in the 249 regulator that precedes the effects on the target by a defined time step as incorporation of a delay 250 between regulator expression and target expression has previously been shown to improve the 251 ability to identify regulatory networks [22]. We compared our approach to this modified 252 implementation of GENIE3 that includes the time delay step. As previously reported, we 253 observed that the time step delay improved target identification for some transcription factors, 254 compared to EXPRESSION alone, although in this data set, target identification for CLOCK, 255 PER1, and NR1D2 TFs did not improve. However, for all five TFs, ExRANGES outperformed 256 both the EXPRESSION and time-delay approaches in identifying the true positive targets of each 257 TF; although for CLOCK, this improvement was very small (Fig 3B). 258 259

Components of the Circadian Clock 261
To evaluate the performance of ExRANGES on TFs that are not core components of the 262 circadian clock, we compared the ability to identify targets of additional TFs validated by ChIP-263 Seq. To test ExRANGES performance across tissue types, we selected seven TFs in our 264 regulator list that have available ChIP-Seq data from at least two experimental replicates 265 performed in epithelial cells, a tissue not included in the circadian time series samples. The 266 14 seven TFs that we tested are: ESR1, STAT5A, STAT5B, POL2A, FOXA1, TFAP2A, and CHD4 267 [23]. We observed improvement of the area under the ROC curve for five of the seven TFs 268 (ESR1, POL2A, FOXA1, TFAP2A, and CHD4) by combining expression and rate change 269 information using ExRANGES ( Fig 3C). As we observed above for CLOCK, STAT5A and 270 STAT5B performed equally well, but did not show significant improvement. STAT5A and 271 STAT5B are known to be activated post-transcriptionally perhaps indicating why evaluating the 272 change in expression of these TFs did not lead to improved identification of targets [24][25][26][27][28][29]. observed an overall improvement in the detection of ChIP-Seq identified targets (Fig 4B). The 294 improvement varies by TF (Fig 4C). 295

ExRANGES Improves Functional Cohesion of Identified Targets 296
ChIP-Seq targets are one method to identify true targets of a TF. Another approach is to look at 297 functional enrichment of predicted targets for a given regulator. The true targets of a TF are 298 likely to be involved in the same functional pathways and therefore true targets would be 299 enriched for the same functional categories as measured by enrichment of GO terms. 300 Comparison of functional enrichment of TF targets identified by each approach enables the 301 evaluation of how each approach performs on identifying targets for TFs without available ChIP-302 Seq data. We compared the functional enrichment of the top 1000 targets of each TF predicted 303 by either approach using Homo sapiens GO slim annotation categories. We evaluated the 930 304 TFs on the HGU133 microrarray [19]. Of these, the targets identified by ExRANGES for the 305 majority of the TFs (590) showed improved functional enrichment compared to the targets 306 identified by EXPRESSION (Fig 5A and B). Likewise, when focusing on the 83 TFs with 307 available ChIP-Seq data from blood, the majority of TF targets predicted by ExRANGES were 308 more functionally cohesive compared to EXPRESSION targets as evaluated by GO slim (Fig  309   5C). We observed that the improvement ranking of ExRANGES over EXPRESSION varies 310 between the two validation approaches. For example, targets of the TF JUND identified by 311

ExRANGES show no improvement over EXPRESSION when validated by ChIP-Seq identified 312
targets, yet showed improved functional cohesion (Supplemental Table ST1). C) Enrichment score difference of the 83 TFs with available ChIP-Seq data (Fig 4). Positive values indicate TF targets with a higher enrichments score in ExRANGES compared to EXPRESSION.
identification of the ChIP-based true positive TF targets (Fig 6B). To evaluate a larger range of 334 targets we compared our predicted targets by EXPRESSION or ExRANGES to 307 TFs targets 335 identified by DAP-Seq [39]. We observed that ExRANGES also showed an improved ability to 336 identify targets as validated by DAP-Seq compared to EXPRESSION (Fig 6C). 337 338

Application of ExRANGES to Smaller Data Sets with Limited Validation Resources 339
Time series data offers several advantages, however the expense is also significantly increased. 340 We have shown that using ExRANGES in conjunction with GENIE3 improves performance on 341 large data sets as validated by ChIP-Seq (228 samples in mouse, 2372 in human, and 144 in 342 arabidopsis) (Fig 7). We also compared the use of the ExRANGES approach to EXPRESSION 343 in all three data sets; the largest increase observed was in the Arabidopsis data set, which has the 345 lowest sample number (Fig S4). Since our interest is to develop a tool that can assist with the 346 identification of regulatory networks in non-model species, we wanted to determine if 347 ExRANGES could also improve identification of TF targets in more sparsely sampled data sets 348 where there is only limited validation data available.  ChIP-Seq, ExRANGES showed an improved ability to identify these targets (Fig 8)  Many approaches have been developed to identify regulator targets, but most of these use 361 show that data sets that combine circadian time series in multiple tissues can be a powerful 377 resource for identifying regulatory relationships between TFs and their targets not just for 378 circadian regulators, but also for regulators that are not components of the circadian clock. 379 Targets identified using EXPRESSION as the features were those that showed large variance 380 between tissue, while RANGES identified targets that showed larger variance within each time 381 series. ExRANGES takes advantage of both sources of variation and improves the identification 382 of TF targets for most regulators tested, including for TF-target relationships in tissues not 383 included in the transcriptional analysis. Additionally, ExRANGES simplifies incorporation of 384 replicate samples. 385 As implemented, ExRANGES improves the ability to identify regulator targets, however, 386 there are many aspects that could be further optimized. For example, we tested ExRANGES 387 with the network inference algorithm GENIE3 and demonstrated that it improved the 388 performance of this algorithm. The ExRANGES method can be applied to most other machine 389 learning applications such as Bayesian networks, mutual information networks, or even 390 supervised machine learning tools. In addition, we showed that ExRANGES outperformed a 391 one-step time delay. Conceptually, our method essentially increases the weight of the time point 392 before a major change in expression level. ExRANGES could be further modified to adjust 393 where that weight is placed, a step or more in advance, depending on the time series data. Such 394 incorporation of a time delay optimization into the ExRANGES approach could lead to further 395 improvement for identification of some TF targets, although it would increase the computational 396

cost. 397
Here, we compared ExRANGES based features to EXPRESSION based features by 398 validating against TF targets identified by ChIP-Seq, ChIP-Chip, DAP-Seq, and protein binding 399 microarray. While these experimental approaches identify potential TF targets in a genome-wide 400 manner, they are not perfect as gold-standards for validation of transcriptional regulatory 401 networks. If there are systematic errors in target identification by ChIP-Seq, ExRANGES may 402 perform better than indicated here. Although ChIP-Seq may not be an ideal gold standard, it 403 does provide a benchmark for comparing computational approaches to identifying TF targets. 404 Unfortunately, high quality ChIP-Seq data is not available in most organisms for more than a 405 handful of TFs. For example, validation of this approach in rice was limited to one recently 406 published ChIP-Seq dataset. This lack of experimentally identified targets is a severe hindrance 407 to advancing research in these species. New experimental approaches such as DAP-Seq may 408 provide alternatives for TF target identification in species recalcitrant to ChIP-Seq analysis [39]. 409 Additionally, the authors of this paper improved their recall of ChIP-Seq identified targets by 410 selecting targets that were also supported by DNase-Seq sensitivity assays [43,44]. Likewise, 411 distinguishing between direct and indirect targets predicted computationally could be enhanced economical first pass that can be followed up by experimental analysis of predicted targets, 420 accepting the fact that there will be false positives in the validation pipeline. In this strategy, a 421 small improvement in the ability to identify true targets of a given TF can translate into a 422 reduced number of candidates to test and fewer experiments that must be performed. We hope 423 that the modest improvements to regulatory network algorithms provided by the ExRANGES 424 approach can facilitate research in species where identification of TF targets is experimentally 425 challenging. Additionally, we hope that our finding of how gene expression values are 426 incorporated in a network has a significant effect on the ability to identify regulatory 427 relationships will stimulate evaluation of new approaches that use alternative methods to 428 incorporate time signals into regulatory network analysis. 429 In summary, we demonstrate that consideration of how expression data is incorporated 430 can contribute to the success of transcriptional regulatory network reconstruction. ExRANGES is 431 a first step at evaluating different approaches for how features are supplied to regulatory network 432 inference algorithms. We anticipate that further optimization and other novel methods for 433 integrating expression information will lead to improvements in network reconstruction that 434 ultimately will accelerate biological discovery. 435 436

Sources for Expression Data Sets 438
Circadian Data Set 439 Normalized expression data from murine sources was downloaded from CircaDB [18]. ExRANGES values were 506 determined for each gene by multiplying the expression at t n by the weighted rate calculated 507 from t n to t n+1 . See R package provided in http://github.com/DohertyLab/ExRANGES and linked 508 in OMICS tools https://omictools.com/expression-in-a-rate-normalized-gene-specific-tool. 509

510
To predict regulatory interaction between transcription factor and target gene, GENIE3 was used. 511 GENIE3 script was taken from http://www.montefiore.ulg.ac.be/~huynh-thu/software.html on 512