1 multiSLIDE : a query-driven heatmap visualization tool for 1 multi-omics data 2 3

Abstract We present multiSLIDE, an open-source tool for query-driven visualization of quantitative single- or multiomics data. Using pathways and networks as the basis for data linkage, multiSLIDE provides an interactive platform for querying the multi-omics data by genes, pathways, and intermolecular relationships. Representing individual -omics levels as separate heatmaps, multiSLIDE visualizes quantitative data for selected genes at all omics levels in a single snapshot. The tool also provides functionalities to arrange data both ways, by their phenotypic characteristics and by their molecular interactions or co-membership to common pathways. Built-in statistical tests and clustering methods provide display subsets of interesting genes or rearrange the genes based on expression patterns. All visualization panels are fully customizable, and both the graphics and the analysis workspace can be saved and shared between collaborating parties. We demonstrate the utility of multiSLIDE through two example studies. First, with a time-course data of HeLa cells, subjected to dithiothreitol induced endoplasmic reticulum stress, we visualized different stages of unfolded protein response and identified temporal patterns of gene expression response at the mRNA and protein levels. Second, through joint visualization of mRNA and protein expression of TCGA/CPTAC Invasive Breast Carcinoma data, we explored the estrogen receptor α regulon and prioritized clusters of genes associated with PAM50 basal-like subtype.


Introduction
can help disentangle the crowding of network 83 diagrams without overwhelming the graphical interface. While heatmaps have not been the primary mode 84 of representation in multi-omics data visualization tools, many tools often include heatmaps as an additional 85 visualization. For instance, PaintOmics3 supplements their pathway diagrams with additional heatmaps 86 displaying omics expression profiles, where the interactivity and data handling capacity of these heatmaps 87 are usually restricted even for moderately sized data sets, limiting its utility. 88 Motivated by the difficulties presented in the existing tools, we developed multiSLIDE, an 89 interactive data visualization tool for easy exploration of multi-omics data. In multiSLIDE, quantitative 90 multi-omics data are visualized for genes in specific pathways or gene ontology (GO) categories preselected 91 by the user, simultaneously at different molecular levels. The tool displays the abundance measurements of 92 DNA copy number, mRNA transcripts, and proteins in functional groups (e.g. pathways and GO terms) 93 selected by the user in aligned panels of heatmaps, synchronizing them by gene identifiers and sample 94 names. 95 multiSLIDE also integrates biological networks in queries, including transcription factor (TF) 96 regulatory networks and protein-protein interaction networks, which capture the dependencies across and 97 within molecular levels, respectively. These networks can be quickly queried for genes of interest, which 98 allows the user to add other genes interacting with them to the current heatmap visualization in real time. 99 (TCGA-BRCA) cohort, aiming to visualize key hormone receptor ESR1 and growth factor EGFR proteins 109 in the four intrinsic molecular subtypes (27). 110

112
The Visualization Workflow 113 Figure 1A shows the web-based visualization interface of multiSLIDE. In multiSLIDE, data analysis 114 begins with the user's selection of pathways, GO terms, or individual genes. multiSLIDE provides an 115 intuitive keyword-based search syntax for searching multiple pathways, GO terms, and genes. The user 116 selects relevant genes and gene groups from the search results and those genes are visualized with clicks 117 on the group names, as shown in Figure 1B. In addition, the user can choose to add network neighbors of 118 a target gene on protein-protein interaction (PPI) and TF regulatory networks via a network neighborhood 119 search, all enabled by a simple right-click on the gene of interest. Selected network neighbors are added to 120 the visualization in real time ( Figure 1C). Alternatively, the user can visualize genes in GO terms or 121 pathways, with the side tracks indicating the membership of genes to the functional groups (side bars on 122 the left side of heatmaps, Figure 1A). 123 The scales and dynamic ranges of detection and quantification techniques can be inconsistent for 124 different -omics data. Therefore, we made the graphical parameters customizable in each omics data 125

Software Design 164
Databases: Pathways, GO terms and Molecular Networks. Comprehensive genome-wide annotations and 165 gene ontology databases for mouse and human genes were extracted using libraries in R Bioconductor 166 (28,29). Data is highly structured in these R packages and are routinely used by bioinformaticians to analyze 167 their data. multiSLIDE recognizes Entrez, HUGO  line visualization of single -omics data (41), multiSLIDE is available online and can also be used as a 192 standalone software. Due to its modular design, multiSLIDE can also easily scale to distributed multi-node 193

environments. 194
Supplementary Figure S1 shows a schematic view of multiSLIDE's software architecture. The 195 server side of multiSLIDE consists of an HTTP server and a database server. The client can be any modern 196 web browser. The HTTP server hosts the analytics module, which is the main computation engine, as well 197 as the graphics server. The analytics module carries out the bulk of the computation. The HTTP server and 198 the browser communicate via highly optimized JavaScript Object Notation (JSON) objects. The data tier, 199 implemented using MongoDB, manages the physical storage of all curated gene annotation, regulatory 200 networks, biological pathways and Gene Ontology (GO) tables. In multiSLIDE, individual components 201 within each tier are also highly compartmentalized. At the client, the data and presentation layers are also 202 decoupled. Layouts and graphics can therefore be altered without the need for fetching the data again from 203 the server. User interactions and styles were implemented using TypeScript (TS) and Cascading Style 204 Sheets (CSS). The visualizations are rendered using resolution independent Scalable Vector Graphics 205

(SVG). 206
A key design philosophy in multiSLIDE is to visualize only user queried genes (molecules). As a 207 result of the extensive user interactions available in multiLSIDE, there is frequent communication between 208 client and server. At the same time, owing to multiple -omics datasets, the server has to manage a much 209 larger amount of data. In multiSLIDE, lazy execution combined with memoization is extensively used to 210 balance the server-side memory footprint and response times. The computation intensive modules are 211 developed by leveraging the advantages of Java and Python. Aggressive caching in the browser and using Input Data Format. In multiSLIDE, the user creates an analysis by uploading a set of required input files. 215 Inputs to multiSLIDE should be delimited ASCII text files (one for each -omics dataset), containing 216 quantitative measurements across samples. These files can be created and edited using any text editor and 217 there is no restriction on the number of input files that can be uploaded for visualization. Measurements can 218 be counts, categorical data or continuous data that have already undergone standard pre-processing and 219 transformations. The databases integrated into multiSLIDE contain functionally annotated genes assigned 220 to pathways and GO terms. To fully utilize multiSLIDE's querying capabilities, individual -omics input 221 files should have at least one column with standard gene identifiers available in multiSLIDE. Also, high 222 throughput -omics data, such as DNA methylation, are molecular features at specific genomic coordinates. 223 Since multiSLIDE's search capabilities are gene-based, for omics data with no designated gene identifiers 224 (e.g. sequence variants, CpG islands in DNA methylation), genomic coordinates of the molecular features 225 have to be mapped to the nearest gene and labeled as such in the current version of the software. 226 In addition, a separate file containing sample attributes (e.g. clinical data) is also required, 227 formatted as a separate delimited ASCII text file. The information in this file should map the samples in -228 omics data files to their corresponding phenotype information. The attributes file may include optional 229 sample information such as descriptive sample names, replicate names, and time points. 230 231

Data preprocessing for TCGA/CPTAC breast cancer data 232
Genes that had fewer than 80% samples expressed in Luminal A, Luminal B, and basal-like subtypes in 233 both -omics level were removed. For the HER2-enriched subtype, genes with fewer than ten samples 234 expressed at either -omics levels were removed from the corresponding -omics level. For the remainder of 235 genes, weighted k-nearest neighbor method, where k = 10, (42) was used to impute the missing data prior 236 to importing into multiSLIDE. The mRNA expression values were log-transformed (base 2) and each gene 237 was centered by the mean of their expression levels in multiSLIDE visualization. For the protein level data,

240
We present two case studies using previously published datasets to illustrate multiSLIDE's features. The 241 first case study explores multiple pathways and GO terms simultaneously in a time-course multi-omics 242 dataset. The second case study illustrates exploration of the TF regulatory networks and statistical filtering-243 based gene prioritization functionalities on a paired transcriptomic and proteomic TCGA/CPTAC breast 244 cancer dataset. In addition, we also demonstrate the exploration of the PPI network around a receptor 245 tyrosine kinase HER2/ERBB2 on the same dataset in Supplementary Figure S2. 246

UPR reveals independent gene expression regulation at the mRNA and protein level 248
Using multiSLIDE, we first visualize the time-course mRNA and protein expression data of HeLa cells in 249 response to dithiothreitol (DTT) treatment, which induces ER stress (26). In the study, cells were sampled 250 at eight time points (0, 0.5. 1, 2, 8, 16, 24, and 30h) and their transcriptome and proteome were measured. 251 Both -omics data were normalized by dividing the measurements at post-treatment time points by their 252 respective measurement at 0h. During the ER stress, we expect to observe complex signaling cascades 253 involved in UPR, translation attenuation, ER-associated protein degradation, and cellular apoptosis (43-254

46). 255
Before we explore this multi-omics data, we first visualize the entire data sets separately using the 256 SLIDE tool, another related tool that we had previously developed for full-scale single -omics data 257 visualization (41). We applied hierarchical clustering with Euclidean distance and complete linkage to the One of the direct consequences of ER stress is the aggregation of misfolded, unassembled proteins 265 in the organelle. As a survival mechanism to avert the loss of homeostasis, the ER responds by increasing 266 the protein folding capacity. This activated pro-survival cellular mechanism, otherwise known as the UPR, 267 is involved in extensive reprogramming of the transcriptional and translational regulation (43,47,48). First, 268 to identify action patterns of UPR related genes, the global view in Supplementary Figure S3A was tagged 269 to highlight genes belonging to the GO term 'endoplasmic reticulum unfolded protein response' (green bars 270 to the right of the heatmap). As an adaptive response, a hallmark of UPR is to reduce ER stress and restore 271 homeostasis by the coordinated transcriptional upregulation of ER chaperones and protein folding enzymes. 272 The initial visual inspection of the whole mRNA data confirmed the upward expression trends of the ER 273 chaperones, heat shock protein family A (Hsp70) member 5 (HSP5A/GRP78/BiP) and folding chaperones 274 such as protein disulfide isomerase (PDI) family genes (Supplementary Figure S3B). 275 Cheng et al. also analyzed the dynamics of 1,237 mRNA/protein pairs that passed the filtering for 276 missing data and noise. Supplementary Figure S3C shows the protein abundances of these 1,237 proteins 277 in SLIDE. Most UPR related proteins expressed upward trends in the proteomics data, similar to the mRNA 278 expression regulation, but with a delayed temporal response. The global heatmap clearly shows that this 279 up-regulation persisted even at the late phase of the stress response. 280 To investigate these patterns in the transcriptomic and proteomic data simultaneously, we loaded 281 the data sets into multiSLIDE. Hierarchical clustering of UPR-related genes at the mRNA level in 282 multiSLIDE reveals a cluster of upregulated genes. The master sensor of misfolded proteins in the ER, 283 HSPA5 (also known as GRP78 or BiP), is a member of the heat shock protein 70 family, which is 284 upregulated at both mRNA and protein level. In Figure 2, the mRNA level data shows early up-regulation 285 of HSPA5, at 0.5 h of stress induction. HSPA5 regulates the activation of ER stress transducers, including 286 PERK (protein kinase R-like ER kinase, also known as EIF2AK3/PEK), inositol-requiring 1 (IRE1) and 287 activating transcription factor 6 (ATF6/ACHM7). Under basal conditions, HSPA5 keeps the three ER stress 288 transducers inactive (47). The downstream effectors of these three key UPR regulators converge on 289 promoting ER chaperone synthesis, ER-associated protein degradation and ER membrane biogenesis (49). 290 Observing the dynamics of transcriptome regulation in multiSLIDE in Figure 2, we also notice the 291 early phase activation of ER chaperones, heat shock protein 90 beta family member 1 (HSP90B1), 292 DNAJC3/P58IPK (member of the HSP40 chaperone family) and protein disulfide isomerases (PDIs), 293 PDIA4, PDIA6 and PDIA3, that remains activated in the intermediate and late phases. PDIs are known to 294 be responsible for the oxidation (formation), reduction (break down) and isomerization (rearrangement) of 295 protein disulfide bonds via disulfide interchange activity. The other major role of PDIs is in general 296 chaperone activity and recent studies have also identified PDI for its role as PERK activator (50-52). In 297 Figure 2, it is interesting to observe that mRNA-level upregulation is countered by protein-level down-298 regulation of DnaJ heat shock protein family (Hsp40) member C3 (DNAJC3), which is a known inhibitor 299 of PERK (53). This suggests that an active feedback control defense mechanism is in place to mitigate ER 300 stress, allowing the PERK pathway to remain uninhibited (54,55). 301 The reversal of expression levels in the ER resident proteins SELK and MANF to their pre-302 treatment levels at the very late phase can also be attributed to the ER stress attenuation mechanism of UPR. 303 Up-regulation of selenoprotein K (SELK) and mesencephalic astrocyte-derived neurotrophic factor 304 (MANF) are a protective mechanism to avoid ER stress mediated cell death. 305 In summary, jointly visualizing the mRNA-and protein-level expression data in multiSLIDE helps 306 the user to uncouple the distinct expression regulation patterns at different response phases of UPR. 307 Clustering genes at the mRNA level and applying the same ordering at the protein level helped visualize 308 whether the clusters propagate across -omics levels. An activated UPR initiates an adaptive stress response 309 to regulate downstream effectors and further through a feedback control switches on/off transcriptional 310 regulation and protein synthesis to restore ER homeostasis. A failure to attain homeostasis leads to 311 programmed cell death (56,57). 312 313

Exploring ESR1 regulon in basal-like breast cancer tumors 314
In recent years, genome-scale characterization of breast cancer subtypes has helped us understand its tissue 315 heterogeneity and identify additional therapeutic targets such as human epidermal growth factor receptor 2 (HER2/ERBB2). The four major intrinsic subtypes of breast cancer are defined as luminal A, luminal B, 317

HER2-enriched and basal-like (27). 318
However, immunohistochemistry (IHC)-based surrogate subtyping, which are still routinely used 319 for classification in pathological settings, is based on three key hormone receptors: estrogen receptor 320 (ESR1), progesterone receptor (PGR), and HER2 (58). Basal-like tumors tend to have the worst prognosis, 321 with all three receptors negative, making them unresponsive to endocrine therapy. Specifically, the 322 expression of ESR1 is widely linked to better survival and considered a major regulator of the phenotypic 323 properties of these breast cancers. ESR1 is a hormone-regulated transcription factor, a member of a 324 superfamily of nuclear receptors, and plays a key role in cell proliferation. Typically, cell growth in breast 325 cancer is stimulated either by the hormone estrogen (17β-estradiol) or growth factors such as EGF. 326 Interestingly, transcription of EGF receptor is regulated by ESR1. Therefore, exploring the behavior of TF 327 targets of ESR1, particularly in the basal-like subtype, can reveal genes that bypass ESR1 negativity. 328 We visualized proteomics data from CPTAC (59) and transcriptomics data (60) from TCGA 329 invasive ductal breast carcinoma (TCGA-BRCA) in multiSLIDE. To understand the role of ESR1 as a 330 master regulator of pathways, we first query and visualize all its transcription target genes at the mRNA 331 and protein levels in multiSLIDE. A query for the TF targets of ESR1 in multiSLIDE's databases resulted 332 in a total of 3,787 targets, of which the 2,312 unique targets present in the datasets were included in the 333

analysis. 334
Next, using the PAM50 subtype classification as phenotype, we performed differential expression 335 analysis at the protein level in multiSLIDE. With (unadjusted) p-value < 0.001 from analysis of variance 336 as the statistical filter, we were able to select 172 differentially expressed genes. We applied hierarchical 337 clustering to these genes with Euclidean distance and complete linkage, resulting in four distinct clusters 338 consistent at mRNA and protein level. For visual clarity, we visualized the four clusters in two parts, as 339 shown in Figures 3A and 3B (the side track on top of each heatmap indicating the subtypes). The genes in 340 Figure 3A show two distinct clusters of genes (top-half and bottom-half of the figure) where the ESR-subtypes are mostly up-regulated (red), across both -omics levels. These genes had signatures that were 343 strongly coherent with that of ESR1. The clusters shown in Figure 3B had expression patterns that were 344 opposite to that of ESR1 across subtypes. The latter of the two clusters (bottom-half of Figure 3B) had 345 pronounced up-regulation only in the basal-like subtype but no distinct signature in the other subtypes. 346 In Figure 3A Figure 3B, we also observed 365 CEBPB and NFIX were overexpressed at both -omics levels. Given that these are TFs that in turn affect 366 expression of their target genes, their overexpression in the basal-like subtype is suggestive that they play In summary, concurrent visualization of the curated ESR1 regulon both at the protein and mRNA 374 level reveals target genes that are under-expressed in basal-like subtype due to ESR1 negativity, consistent 375 at both molecular levels. At the same time, we were able to zoom into a set of specific target genes that 376 bypass ESR1 negativity and are activated in the basal-like subtype. The same type of query-driven 377 inspection of multi-omics data can be generically extended to other applications. 378