## Introduction

Transcriptomics and proteomics (omics) signatures in response to cellular perturbations consist of changes in gene or protein expression levels after the perturbation. An omics signature is a high-dimensional readout of cellular state change that provides information about the biological processes affected by the perturbation and perturbation-induced phenotypic changes of the cell. The signature on its own provides information, although not always directly discernible, about the molecular mechanisms by which the perturbation causes observed changes. If we consider a disease to be a perturbation of the homeostatic biological system under normal physiology, then the omics signature of a disease is the set of differences in gene/protein expression levels between diseased and non-diseased tissue samples.

The low cost and effectiveness of transcriptomics assays1,2,3,4 have resulted in an abundance of transcriptomics datasets and signatures. Recent advances in the field of high-throughput proteomics made the generation of large numbers of proteomics signatures a reality5,6. Several recent efforts were directed at the systematic generation of omics signatures of cellular perturbations7 and at generating libraries of signatures by re-analyzing public domain omics datasets8,9. The recently released library of integrated network-based cellular signatures (LINCS)7 L1000 dataset generated transcriptomic signatures at an unprecedented scale2. The availability of resulting libraries of signatures opens exciting new avenues for learning about the mechanisms of diseases and the search for effective therapeutics10.

The analysis and interpretation of omics signatures has been intensely researched. Numerous methods and tools have been developed for identifying changes in molecular phenotypes implicated by transcriptional signatures based on gene set enrichment, pathway, and network analyses approaches11,12,13. Directly matching transcriptional signatures of a disease with negatively correlated transcriptional signatures of chemical perturbations (CP) underlies the Connectivity Map (CMAP) approach to identifying potential drug candidates10,14,15. Similarly, correlating signatures of chemical perturbagens with genetic perturbations of specific genes has been used to identify putative targets of drugs and chemical perturbagens2.

To fully exploit the information contained within omics signature libraries and within the countless omics signatures continually generated by investigators around the world, new user-friendly integrative tools, accessible to a large segment of the biomedical research community, are needed to bring these data together. The integrative LINCS (iLINCS) portal brings together libraries of precomputed signatures, formatted datasets, and connections between signatures, and integrates them with a bioinformatics analysis engine and streamlined user interfaces into a powerful system for omics signature analysis.

## Results

iLINCS (available at http://ilincs.org) is an integrative user-friendly web platform for the analysis of omics (transcriptomic and proteomic) datasets and signatures of cellular perturbations. The key components of iLINCS are: interactive and interconnected analytical workflows for the creation and analysis of omics signatures; the large collection of datasets, precomputed signatures, and their connections; and user-friendly graphical interfaces for executing analytical tasks and workflows.

The central concept in iLINCS is the omics signature, which can be retrieved from the precomputed signature libraries within the iLINCS database, submitted by the user, or constructed using one of the iLINCS datasets (Fig. 1a). The signatures in iLINCS consist of the differential gene or protein expression levels and associated P values between perturbed and baseline samples for all, or any subset of, measured genes/proteins. Signatures submitted by the user can also be in the form of a list of genes/proteins, or a list of up- and downregulated genes/proteins. The iLINCS backend database contains >34,000 processed omics datasets, >220,000 omics signatures and >10^9 statistically significant “connections” between signatures. Omics signatures include transcriptomic signatures of more than 15,000 chemicals and genetic perturbations of more than 4400 genes (Fig. 1a). Omics datasets available for analysis and signature creation cover a wide range of diseases and include transcriptomic (RNA-seq and microarray) and proteomic (Reverse Phase Protein Arrays16 and LINCS-targeted mass spectrometry proteomics5) datasets. Dataset collections include a close-to-complete collection of GEO RNA-seq datasets and various other dataset collections, such as The Cancer Genome Atlas (TCGA) and GEO GDS microarray datasets17. A detailed description of iLINCS omics signatures and datasets is provided in “Methods”. Analysis of 8942 iLINCS datasets from GEO, annotated by MeSH terms18, shows a wide range of disease coverage (Fig. 1a).

iLINCS analytical workflows facilitate systems biology interpretation of the signature (Fig. 1b) and the connectivity analysis of the signature with all iLINCS precomputed signatures (Fig. 1c). Connected signatures can further be analyzed in terms of the patterns of gene/protein expression level changes that underlie the connectivity with the query signature, or through the analysis of gene/protein targets of connected perturbagens (Fig. 1d). Ultimately, the multi-layered systems biology analyses, and the connectivity analyses lead to biological insights, and identification of therapeutic targets and putative therapeutic agents (Fig. 1e).

Interactive analytical workflows in iLINCS facilitate signature construction through differential expression analysis as well as clustering, dimensionality reduction, functional enrichment, signature connectivity analysis, pathway and network analysis, and integrative interactive visualization. Visualizations include interactive scatter plots, volcano and GSEA plots, heatmaps, and pathway and network node-and-stick diagrams (Supplemental Fig. 1). Users can download raw data and signatures, analysis results, and publication-ready graphics. The iLINCS internal analysis and visualization engine uses R19 and open-source visualization tools. iLINCS also facilitates seamless integration with a wide range of task-specific online bioinformatics and systems biology tools and resources including Enrichr20, DAVID21, ToppGene22, Reactome23, KEGG24, GeneMania25, X2K Web26, L1000FWD27, STITCH28, Clustergrammer29, piNET30, LINCS Data Portal31, ScrubChem32, PubChem33 and GEO34. Programmatic access to iLINCS data, workflows and visualizations is facilitated by calls to the iLINCS API, which is documented with the OpenAPI community standard. Examples of utilizing the iLINCS API within data analysis scripts are provided on GitHub (https://github.com/uc-bd2k/ilincsAPI). The iLINCS software architecture is described in Supplemental Fig. 2.

### Use cases

iLINCS workflows facilitate a wide range of possible use cases. Querying iLINCS with user-submitted external signatures enables identification of connected perturbation signatures, and answering in-depth questions about expression patterns of individual genes or gene lists of interest in specific datasets, or across classes of cellular perturbations. Querying iLINCS with individual genes or proteins can identify sets of perturbations that significantly affect their expression. Such analysis leads to a set of chemicals, or genetic perturbations, that can be applied to modulate the expression and activity of the corresponding proteins. Queries with lists of genes representing a hallmark of a specific biological state or process35 can identify a set of perturbations that may accordingly modify cellular phenotype. iLINCS implements complete systematic polypharmacology and drug repurposing36,37 workflows, and has been listed as a bioinformatics resource for cancer immunotherapy studies38 and multi-omics computational oncology39. Most recently, iLINCS has been used in a drug repurposing workflow that combines searching for drug repurposing candidates via CMAP analysis with validation using analysis of Electronic Health Records40. Finally, iLINCS removes technical barriers to re-using any of more than 34,000 preprocessed omics datasets, enabling users to construct and analyze new omics signatures without any data generation and with only a few mouse clicks.

Here, we illustrate the use of iLINCS in detecting and modulating aberrant mTOR pathway signaling, in the analysis of proteogenomic signatures in breast cancer, and in the search for COVID-19 therapeutics. It is important to emphasize that all analyses were performed by navigating the iLINCS GUI within a web browser, and each use case can be completed in less than five minutes. Step-by-step instructions are provided in the Supplemental Materials (Supplemental Workflows 1, 2, and 3). In addition, links to instructional videos that demonstrate how to perform these analyses are provided on the landing page of iLINCS at ilincs.org. The same analyses can also be performed programmatically using the iLINCS API. R notebooks demonstrating this can be found on GitHub (https://github.com/uc-bd2k/ilincsAPI).

### Use case 1: detecting and modulating aberrant mTOR pathway signaling

Aberrant mTOR signaling underlies a wide range of human diseases41. It is associated with age-related diseases such as Alzheimer’s disease42 and the aging process itself41. mTOR inhibitors are currently the only pharmacological treatment shown to extend lifespan in model organisms43, and numerous efforts in designing drugs that modulate the activity of mTOR signaling are under way41. We use mTOR signaling as the prototypical example to demonstrate iLINCS utility in identifying chemical perturbagens capable of modulating a known signaling pathway driving the disease process, in establishing the mechanism of action (MOA) of a chemical perturbagen, and in detecting aberrant signaling in the diseased tissue. Detecting changes in mTOR signaling activity in transcriptomic data is complicated by the fact that it is not reflected in changes in expression of mTOR pathway genes, and standard pathway analysis methods are not effective44. We show that the CMAP analysis approach, facilitated by iLINCS, is essential for the success of these analyses. Step-by-step instructions for performing this analysis in iLINCS are provided in Supplemental Workflow SW1.

Identifying chemicals that can modulate the activity of a specific pathway or a protein in a specific biological context is often the first step in translating insights about disease mechanisms into therapies that can reverse disease processes. Here we demonstrate the use of iLINCS in identifying chemicals that can inhibit mTOR activity. We use the Consensus Gene Signature (CGS) of the CRISPR mTOR genetic loss-of-function perturbation in the MCF-7 cell line as the query signature. The CMAP analysis identifies 258 LINCS CGSes and 831 CP signatures with statistically significant correlation with the query signature. The top 100 most connected CGSes are dominated by the signatures of genetic perturbations of mTOR and PIK3CA genes (Fig. 2a), whereas all of the top 5 most frequent inhibition targets of CPs among the top 100 most connected CP signatures are mTOR and PIK3 proteins (Fig. 2b). These results clearly indicate that the query mTOR CGS is highly specific and sensitive to perturbation of the mTOR pathway and effectively identifies chemical perturbagens capable of inhibiting mTOR signaling. The full list of connected signatures is shown in Supplemental Data SD1. The connected CP signatures also include several chemical perturbagens with highly connected signatures that have not been known to target mTOR signaling, providing new candidate inhibitors.
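The idea of ranking library signatures by their connectivity with a query signature can be illustrated with a minimal sketch. It assumes plain, unweighted Pearson correlation over the genes shared by two signatures, which may differ from the statistic iLINCS actually uses; the function names `pearson` and `connectivity`, the `min_shared` threshold, and all gene names and values are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Unweighted Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def connectivity(
    query: Dict[str, float],
    library: Dict[str, Dict[str, float]],
    min_shared: int = 3,
) -> List[Tuple[str, float]]:
    """Correlate the query with each library signature on shared genes.

    Returns (signature name, correlation) pairs, most positive first.
    """
    results = []
    for name, sig in library.items():
        shared = sorted(set(query) & set(sig))
        if len(shared) < min_shared:
            continue  # too few common genes for a meaningful comparison
        r = pearson([query[g] for g in shared], [sig[g] for g in shared])
        results.append((name, r))
    return sorted(results, key=lambda t: t[1], reverse=True)


# Toy log-scale differential expressions (invented values).
query = {"A": 1.0, "B": -1.0, "C": 0.5, "D": -0.5}
library = {
    "mimic": {"A": 2.0, "B": -2.0, "C": 1.0, "D": -1.0},
    "reverser": {"A": -1.0, "B": 1.0, "C": -0.5, "D": 0.5},
}
ranked = connectivity(query, library)
```

In this toy setting, signatures at the top of `ranked` mimic the query, while those at the bottom are negatively correlated with it, which is the direction of interest when the query is a disease signature to be reversed.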

Identifying proteins and pathways directly targeted by a bioactive chemical using its transcriptional signature is a difficult problem. Transcriptional signatures of a chemical perturbation often carry only an echo of such effects since the proteins directly targeted by a chemical and the interacting signaling proteins are not transcriptionally changed. iLINCS offers a solution for this problem by connecting the CP signatures to LINCS CGSes and facilitating a follow-up systems biology analysis of genes whose CGSes are highly correlated with the CP signature. This is illustrated by the analysis of the perturbation signature of the mTOR inhibitor drug everolimus (Fig. 2c–e). Traditional pathway enrichment analysis of this CP signature via the iLINCS connection to Enrichr (Fig. 2c) fails to identify the mTOR pathway as being affected. In the next step, we first connect the CP signature to LINCS CGSes and then perform pathway enrichment analysis of genes with correlated CGSes. This analysis correctly identifies the mTOR signaling pathway as the top affected pathway (Fig. 2d). Similarly, connectivity analysis with other CP signatures followed by the enrichment analysis of protein targets of the top 100 most connected CPs again identifies the Pi3k-Akt signaling pathway as one of the most enriched (Fig. 2e). In conclusion, both pathway analysis of differentially expressed genes in the everolimus signature and pathway analysis of connected genetic and chemical perturbagens provide us with important information about the effects of everolimus. However, only the analyses of connected perturbagens correctly pinpoint the direct mechanism of action of everolimus, which is the inhibition of mTOR signaling.
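The second step of this approach, testing whether the genes with connected CGSes are over-represented in a pathway, can be sketched with a one-sided hypergeometric test. This is a generic over-representation test chosen for illustration, not necessarily the statistic computed by Enrichr or iLINCS; the function names and the toy gene universe are hypothetical.

```python
from math import comb
from typing import Set


def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N genes, K in pathway, n selected)."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def pathway_enrichment(
    connected_genes: Set[str], pathway: Set[str], universe: Set[str]
) -> float:
    """Enrichment P value of a pathway among genes with connected CGSes."""
    selected = connected_genes & universe  # genes whose CGS is connected
    members = pathway & universe           # pathway genes that were perturbed
    overlap = len(selected & members)
    return hypergeom_upper_tail(overlap, len(members), len(selected), len(universe))


# Toy example: 20 perturbed genes, a 5-gene pathway, 5 connected genes.
universe = {f"g{i}" for i in range(20)}
pathway = {"g0", "g1", "g2", "g3", "g4"}
connected = {"g0", "g1", "g2", "g3", "g10"}
p_value = pathway_enrichment(connected, pathway, universe)
```

The key design point is that the tested gene list comes from the connectivity analysis (genes whose knockdown signatures resemble the CP signature) rather than from the differentially expressed genes of the CP signature itself.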

The connectivity-based pathway analysis shares methodological shortcomings with the standard enrichment/pathway analyses of lists of differentially expressed genes, such as overlapping pathways. While the mTOR pathway shows the strongest association with everolimus, other pathways were also significantly enriched. A closer examination of the results indicates that this is due to the core mTOR signaling cascade being included as a component of other pathways: many of the genes that drive the associations with the other four most enriched pathways are shared with the mTOR pathway (Fig. 3).
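One simple way to quantify this kind of overlap is to ask what fraction of the genes driving the enrichment of a pathway also belong to a reference pathway. The sketch below uses invented gene identifiers, and the helper name `overlap_with_reference` is hypothetical; it is not a function offered by iLINCS.

```python
from typing import Set


def overlap_with_reference(
    driver_genes: Set[str], pathway: Set[str], reference: Set[str]
) -> float:
    """Fraction of a pathway's driver genes that also lie in the reference pathway."""
    drivers = driver_genes & pathway  # genes responsible for this pathway's enrichment
    if not drivers:
        return 0.0
    return len(drivers & reference) / len(drivers)


# Toy gene sets (invented identifiers).
connected = {"g1", "g2", "g3", "g4", "g7"}   # genes with connected signatures
other_pathway = {"g1", "g2", "g3", "g4"}     # a second enriched pathway
core_cascade = {"g1", "g2", "g3", "g9"}      # reference pathway
fraction = overlap_with_reference(connected, other_pathway, core_cascade)
```

A fraction close to 1 suggests that the second pathway is enriched mainly because it contains the reference cascade, rather than because of an independent biological signal.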

One caveat in the results presented above is that the LINCS signatures based on the L1000 platform provide a reduced representation of the global transcriptome consisting of expression levels of about 1000 “landmark” genes2. The landmark genes are selected in such a way that they jointly capture patterns of expression of the majority of genes in the genome, and computational predictions of expression of an additional 12,000 genes are also made. The relatively low number of measured genes could sometimes adversely affect the gene expression enrichment analysis of poorly represented pathways. To establish that this is not the case for mTOR signaling, we repeated the MOA analysis using the whole-genome transcriptional signature of the mTOR inhibitor sirolimus from the original CMAP dataset15, which is also included in the iLINCS signature collection. Results of these analyses closely resemble the results with the L1000 everolimus signature, with connectivity analysis clearly pinpointing the mTOR pathway and enrichment analysis of differentially expressed genes failing to do so (Supplemental Results 4).

To verify that mTOR signaling modulation is also detectable in complex tissues, we used iLINCS to re-analyze the effect of rapamycin in aged rat livers45 (GEO dataset GSE108978). The rapamycin signature was constructed by comparing expression profiles of livers in eight rapamycin-treated rats to the nine vehicle controls at 24 months of age (Fig. 4, heatmap). The signature correlated strongly with CP signatures of chemicals targeting mTOR pathway genes (Fig. 4, bar plot).

### Use case 2: proteo-genomics analysis of cancer driver events in breast cancer

Contrasting transcriptional and proteomic profiles of different molecular cancer subtypes has long been a hallmark of cancer omics data analysis when seeking targets for intervention46. Constructing signatures by comparing cancer with normal tissue controls usually results in a vast array of differences that are characteristic of any cancer (proliferation, invasion, etc.)47 and are not specific to the driver mechanisms of the cancer samples at hand. On the other hand, comparing different cancer subtypes, as illustrated here, is effective in eliciting key driver mechanisms by factoring out generic molecular properties of a cancer48. Here, we demonstrate the use of matched preprocessed proteomic (RPPA) and transcriptomic (RNA-seq) breast cancer datasets to identify driver events which can serve as targets for pharmacological intervention in two different breast cancer subtypes. The analysis of proteomic data can directly identify affected signaling pathways by assessing differences in the abundance of activated (e.g., phosphorylated) signaling proteins. By contrasting proteomic and transcriptomic signatures of the same biological samples, we can distinguish between transcriptionally and post-translationally regulated proteins, and transcriptional signatures facilitate pathway enrichment and CMAP analysis. Step-by-step instructions for performing this analysis in iLINCS are provided in Supplemental Workflow SW2.

We analyzed TCGA breast cancer RNA-seq and RPPA data using the iLINCS “Datasets” workflow to construct the differential gene and protein expression signatures contrasting 174 Luminal-A and 50 Her2-enriched (Her2E) breast tumors. The results of the iLINCS analysis closely track the original analysis performed by the TCGA consortium48. The protein expression signature immediately implicated known driver events, which are also the canonical therapeutic targets for the two subtypes: the abnormal activity of the estrogen receptor in Luminal-A tumors, and the increased expression and activity of the Her2 protein in Her2E tumors (Fig. 5a). To further validate the strategy of directly comparing two subtypes of tumors, we compared our results with the results obtained when the different subtypes are compared to normal breast tissue controls (Supplemental Results 5). The results of these analyses are much more equivocal, with Her2 and ERalpha proteins now superseded in significance by several more generic cancer-related proteome alterations common to both subtypes (Supplemental Results 5).

The corresponding Luminal-A vs Her2E RNA-seq signature, constructed by differential gene expression analysis between 201 Luminal-A and 47 Her2E samples, showed similar patterns of expression of key genes (Fig. 5b). All genes were differentially expressed (Bonferroni-adjusted P value < 0.01) except for EGFR, indicating that the difference in expression levels of the EGFR protein with the phosphorylated Y1068 tyrosine residue (EGFR_pY1068) may be a consequence of post-translational modifications rather than increased transcription rates of the EGFR gene.
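For readers less familiar with the Bonferroni adjustment used for this threshold, the following minimal sketch shows the calculation: each raw P value is multiplied by the number of tests and capped at 1. The raw P values below are invented and do not correspond to the genes in Fig. 5b; the function name is ours.

```python
from typing import List


def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni-adjusted P values: p * (number of tests), capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy raw P values for three genes (invented for illustration).
raw = [1e-6, 0.004, 0.2]
adjusted = bonferroni(raw)
significant = [p < 0.01 for p in adjusted]
```

Note that the second gene passes the 0.01 threshold before adjustment but not after it, which illustrates how conservative the correction becomes as the number of tested genes grows.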

Following down the iLINCS workflow, the pathway analysis of 734 most significantly upregulated genes in Luminal-A tumors (P value < 1e-10) (Fig. 5c) identified the Hallmark gene sets35 indicative of Estrogen Response to be the most significantly enriched (Fig. 5d) (See Supplemental Data SD2 for all results). Conversely, the enrichment analysis of 665 genes upregulated in Her2E tumors identified the Hallmark gene sets of proliferation (E2F Targets, G2-M Checkpoint) and the markers of increased mTOR signaling (mTORC1 signaling). This reflects a known increased proliferation of Her2E tumors in comparison to Luminal-A tumors49. The increase in mTOR signaling is consistent with the increased levels of the phosphorylated 4E-BP protein, a common marker of mTOR signaling50.

The CMAP analysis of the RNA-seq signature with LINCS CP signatures (Fig. 6) shows that treating several different cancer cell lines with inhibitors of PI3K, mTOR, CDK, and inhibitors of some other more generic proliferation targets (e.g., TOP21, AURKA) (see Supplemental Data SD3 for complete results) produces signatures that are positively correlated with RNA-seq Luminal A vs Her2E signature, suggesting that such treatments may counteract the Her2E tumor driving events.

The detailed analysis of the 100 most connected CP signatures showed that all signatures reflected proliferation inhibition, as indicated by the enrichment of the genes in the KEGG Cell cycle pathway among the genes downregulated across all signatures (Fig. 6a). However, the analysis also showed that a subset of the signatures selectively inhibited expression of the mTORC1 signaling Hallmark gene set, and the same set of signatures exhibited increased upregulation of Apoptosis gene sets in comparison to the rest of the signatures. This indicates that the increased proliferation in Her2E tumors may be partly driven by the upregulation of mTOR signaling.

We also used iLINCS to identify de novo all signatures enriched for the mTOR-associated genes from Fig. 6a. The most enriched signatures (top 100) were completely dominated by signatures of mTOR inhibitors (Fig. 6b). The most highly enriched signature was generated by WYE-125132, a highly specific and potent mTOR inhibitor51. Using the iLINCS signature group analysis workflow we also summarized the drug-target relationships for the top 100 signatures (Fig. 6c) which recapitulate the dominance of mTOR inhibitors along with proliferation inhibitors targeting CDK proteins (Palbociclib and Milciclib).

### Use case 3: drug repurposing for COVID-19

The ongoing COVID-19 pandemic has underscored the importance of rapid drug discovery and repurposing to treat and prevent diseases caused by emerging novel pathogens, such as SARS-CoV-2. As part of the community-wide efforts to identify novel targets and treatment options, the transcriptional landscape of SARS-CoV-2 infections has been characterized extensively, including the identification of transcriptional signatures from patients as well as model systems52,53. The CMAP approach has been extensively used to explore the space of potential therapeutic agents, with a Google Scholar search for covid AND “connectivity map” listing 662 studies. In iLINCS, 105 COVID-19-related datasets are organized into a COVID-19 collection, facilitating signature connectivity-based drug discovery and repurposing in this context.

We used iLINCS to construct a SARS-CoV-2 infection signature by re-analyzing the dataset profiling the response of various in vitro models to SARS-CoV-2 infection52 (GEO dataset GSE147507). The use of multiple models, which respond differently to the virus infection, would make the signature created by direct comparison of all “Infected” vs all “Mock infected” samples too noisy. The main mechanism implemented in iLINCS for dealing with various confounding factors is filtering samples by levels of possible confounding factors, which is the most commonly used approach in omics data analysis. In this case, we filtered samples to construct a signature by differential gene expression analysis of infected vs mock-infected samples of the A549 cell line, which was genetically modified to express the ACE2 gene to facilitate viral entry into the cell. This left us with the comparison of three “Infected” and three “Mock infected” samples. Filtering of samples and the analysis using the iLINCS GUI are demonstrated in Supplemental Workflow SW3.
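The filtering strategy can be sketched as follows, assuming expression values are already on a log2 scale and samples carry metadata for the model system and infection status. The sample table, field names, and helper functions are hypothetical, and the sketch computes only log fold changes; it omits the statistical testing that produces the P values of an iLINCS signature.

```python
from statistics import mean
from typing import Dict, List

# Toy sample table; all identifiers and values are invented for illustration.
samples = [
    {"id": "s1", "model": "A549-ACE2", "status": "Infected", "expr": {"G1": 8.0, "G2": 5.0}},
    {"id": "s2", "model": "A549-ACE2", "status": "Infected", "expr": {"G1": 9.0, "G2": 5.5}},
    {"id": "s3", "model": "A549-ACE2", "status": "Mock", "expr": {"G1": 6.0, "G2": 5.0}},
    {"id": "s4", "model": "A549-ACE2", "status": "Mock", "expr": {"G1": 7.0, "G2": 5.5}},
    {"id": "s5", "model": "other", "status": "Infected", "expr": {"G1": 3.0, "G2": 9.0}},
    {"id": "s6", "model": "other", "status": "Mock", "expr": {"G1": 3.0, "G2": 2.0}},
]


def filter_samples(samples: List[dict], **levels: str) -> List[dict]:
    """Keep samples whose metadata match every requested factor level."""
    return [s for s in samples if all(s.get(k) == v for k, v in levels.items())]


def log_fold_changes(case: List[dict], control: List[dict]) -> Dict[str, float]:
    """Difference of mean log2 expression (case minus control) per gene."""
    genes = case[0]["expr"].keys()
    return {
        g: mean(s["expr"][g] for s in case) - mean(s["expr"][g] for s in control)
        for g in genes
    }


# Restrict the comparison to a single model system to remove the confounder.
infected = filter_samples(samples, model="A549-ACE2", status="Infected")
mock = filter_samples(samples, model="A549-ACE2", status="Mock")
lfc = log_fold_changes(infected, mock)
```

In this toy table, pooling all models would attribute a large change to gene G2 that comes entirely from the second model, whereas the filtered comparison shows no change for G2 in the A549-ACE2 samples.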

The resulting signature comprises many upregulated chemokines and other immune system-related genes, including the EGR1 transcription factor that regulates inflammation and immune system response54,55, and the pathway analysis implicates TNF signaling and NF-kappa B signaling as the two pathways most enriched for upregulated genes (Fig. 7a). CMAP analysis against the LINCS gene overexpression signatures in the A549 cell line identified the signature of LYN tyrosine kinase as the most positively correlated with the SARS-CoV-2 infection signature (Fig. 7b). LYN, a member of the SRC/FYN family of tyrosine kinases, has been shown to be required for effective MERS-CoV replication56. The enrichment analysis of genes with positively correlated overexpression signatures in the A549 cell line identified the NF-kappa B signaling pathway as the most enriched (Fig. 7b), confirming mechanistically the role of NF-kappa B signaling in inducing the infection signature. Finally, CMAP analysis identified CDK inhibitors and the drug alvocidib as potential therapeutic strategies based on their ability to reverse the infection signature (Fig. 7c). Alvocidib is a CDK9 inhibitor with broad antiviral activity, and it has been suggested as a potential candidate for COVID-19 drug repurposing57.

These results agree with the previous study utilizing iLINCS to prioritize candidate FDA-approved or investigative drugs for COVID-19 treatment58. Of the top 20 candidates identified in that study as reversing SARS-CoV-2 transcriptome signatures, 8 were already under trial for the treatment of COVID-19, while the remaining 12 had antiviral properties and 6 had antiviral efficacy against coronaviruses specifically. Our analysis illustrates the ease with which iLINCS can be used to quickly provide credible drug candidates for an emerging disease.

## Discussion

iLINCS is a unique integrated platform for the analysis of omics signatures. Several canonical use cases described here only scratch the surface of the wide range of possible analyses facilitated by the interconnected analytical workflows and the large collections of omics datasets, signatures, and their connections. All presented use cases were executed using only a mouse to navigate iLINCS GUI. Each use case can be completed in less than 5 min, as illustrated in the online help and video tutorials. The published studies to date used iLINCS in many different ways, and to study a wide range of diseases (Supplemental Results 3).

In addition to facilitating standard analyses, iLINCS implements innovative workflows for biological interpretation of omics signatures via CMAP analysis. In Use case 1, we show how CMAP analysis coupled with pathway and gene set enrichment analysis can implicate the mechanism of action of a chemical perturbagen when standard enrichment analysis applied to the differentially expressed genes fails to recover targeted signaling pathways. In a similar vein, iLINCS has been successfully used to identify putative therapeutic agents by connecting changes in proteomic profiles in neurons from patients with schizophrenia, first with the LINCS CGSes of the corresponding genes, and then with LINCS CP signatures59. These analyses led to the identification of PPAR agonists as promising therapeutic agents capable of reversing the bioenergetic signature of schizophrenia, which were subsequently shown to modulate behavioral phenotypes in a rat model of schizophrenia60.

The iLINCS platform was built with the flexibility to incorporate future extensions in mind. The combination of an optimized database representation and the R analysis engine provides ample opportunities to implement additional analysis workflows. At the same time, collections of omics datasets and signatures can be extended by simply adding data to the backend databases. One of the important directions for improving iLINCS functionality will be the development of workflows for fully integrated analysis of multiple datasets and different omics data types. In terms of integrative analysis of matched transcriptomic and proteomic data, iLINCS facilitates the integration where results from one omics dataset inform the set of genes/proteins analyzed in the other dataset (Use case 2). However, the direct integrative analysis of both data types may result in more informative signatures61. At the same time, the addition of more proteomics datasets, such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC) collection62, will extend the scope of such integrative analyses.

Many complex diseases, including cancer, consist of multiple, molecularly distinct, subtypes63,64,65. Accounting for these differences is essential for constructing effective disease signatures for CMAP analysis. In Use case 2, we demonstrate how to use iLINCS in contrasting molecular subtypes when the information about the subtypes is included in sample metadata. An iLINCS extension that allows for de novo creation of molecular subtypes using cluster analysis, as is the common practice in analysis of cancer samples63, is currently under development. Another future extension under development is the workflow for constructing disease signatures using single cell datasets66. iLINCS contains a number of single cell RNA-seq (scRNA-seq) datasets, but their analysis is currently handled in the same way as the bulk RNA-seq data. A specialized workflow for extracting disease signatures from scRNA-seq data will lead to more precise signatures and more powerful CMAP analysis66.

With many signatures used in CMAP analysis, and a large number of genes perturbed by either genetic or chemical perturbations, one has to carefully scrutinize the results of CMAP-based pathway analyses to avoid false positive results and identify the most relevant affected pathways. Limitations of standard gene enrichment pathway analysis related to overlapping pathways are important to keep in mind in the signature-similarity-based pathway analysis, as discussed in Use case 1. In addition, the hierarchical nature of gene expression regulation may lead to similar transcriptional signatures being generated by perturbing genes at different levels of the regulatory programs (e.g., signaling proteins vs transcription factors). Perturbations of distinct signaling pathways lead to modulation of the proliferation rates in cancer cell lines, and it is expected that the resulting transcriptional signatures share some similarities related to up- and down-regulation of proliferation drivers and markers. At the same time, signatures corresponding to perturbation of proteins regulating the same sets of biological processes are likely to exhibit a higher level of similarity. The analysis of the top 100 chemical perturbagen signatures negatively correlated with the Her2E breast cancer signature in Use case 2 reveals that they all contain the “proliferation” component. However, a subset of the most highly correlated signatures is more specifically associated with mTOR inhibition, indicating that proliferation is affected in part by modulating mTOR signaling. The association with perturbations that modulate cellular proliferation, while real, could also be considered spurious as it is relatively non-specific. The association with mTOR signaling is more specific and provides a higher-level mechanistic explanation for differences in proliferation rates. iLINCS provides mechanisms for scrutinizing expression profiles of genes in signatures identified in CMAP analysis, which are required for assessing these fine points (Use case 2) and are essential for interpreting the results of a CMAP analysis.

Several online tools have been developed for the analysis and mining of LINCS L1000 signature libraries. They facilitate online queries of L1000 signatures67,68,69 and the construction of scripted pipelines for in-depth analysis of transcriptomics data and signatures70. The LINCS Transcriptomic Center at the Broad Institute developed the clue.io query tool, deployed by the Broad Connectivity Map team, which facilitates connectivity analysis of user-submitted signatures2. iLINCS replicates the connectivity analysis functionality, and indeed, equivalent queries of the two systems may return qualitatively similar results (see Supplemental Results 1 for a use case comparison). However, the scope of iLINCS is much broader. It provides connectivity analysis with signatures beyond Connectivity Map datasets and provides many primary omics datasets for users to construct their own signatures. Furthermore, analytical workflows in iLINCS facilitate deep systems biology analysis and knowledge discovery for both omics signatures and the gene and protein targets identified through connectivity analysis. Comparisons to several other web resources that partially cover different aspects of iLINCS functionality are summarized in Supplemental Results 2.

iLINCS removes technical roadblocks that prevent users without a programming background from re-using publicly available omics datasets and signatures. The user interfaces are streamlined and strive to be self-explanatory to most scientists with a conceptual understanding of omics data analysis. Recent efforts in standardizing71 and indexing72 are improving the findability and re-usability of public domain omics data. iLINCS is taking the next logical step by integrating public domain data and signatures with a user-friendly analysis toolbox. Furthermore, all analysis steps behind the iLINCS GUI are driven by an API, which can be used within computational pipelines based on scripting languages73, such as R, Python and JavaScript, and to power the functionality of other web analysis tools30,69. This makes iLINCS a natural tool for the analysis and interpretation of omics signatures for scientists preferring point-and-click GUIs as well as data scientists using scripted analytical pipelines.

## Methods

### Statistics

All differential expression and signature creation analyses are performed on measurements obtained from distinct samples. All P values are calculated using two-sided hypothesis tests. The specific tests depend on the data type and are described in detail in the rest of the Methods and the Supplemental Methods document. The accuracy of the iLINCS datasets, signatures and analysis procedures was ascertained as described in the Supplemental Quality Control document. The versions of all R packages utilized by iLINCS are provided in the Supplemental Data SD4.

### Perturbation signatures

All precomputed perturbation signatures in iLINCS, as well as signatures created using an iLINCS dataset, consist of two vectors: the vector of log-scale differential expressions between the perturbed samples and baseline samples d = (d1,…,dN), and the vector of associated P values p = (p1,…,pN), where N is the number of genes or proteins in the signature. Signatures submitted by the user can also consist of only log-scale differential expressions without P values, of lists of up- and downregulated genes, or of a single list of genes.

### Signature connectivity analysis

Depending on the exact type of the query signature, the connectivity with libraries of precomputed iLINCS signatures is computed using different connectivity metrics. The choice of the similarity metric to be used in different contexts was driven by benchmarking six different methods (Supplementary Result 2).

If the query signature is selected from iLINCS libraries of precomputed signatures, the connectivity with all other iLINCS signatures is precomputed using the extreme Pearson’s correlation74,75 of signed significances of all genes. The signed significance of the ith gene is defined as

$$s_i = \mathrm{sign}(d_i) \cdot \left(-\log_{10}(p_i)\right), \quad \text{for } i = 1, \ldots, N, \tag{1}$$

and the signed significance signature is s = (s1,…,sN). The extreme signed signature e = (e1,…,eN) is then constructed by setting the signed significances of all genes other than the top 100 and bottom 100 to zero:

$$e_i = \begin{cases} s_i, & \text{if } s_i \ge s^{100} \text{ or } s_i \le s^{-100} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where s100 is the 100th most positive si and s−100 is the 100th most negative si. The extreme Pearson correlation between two signatures is then calculated as the standard Pearson’s correlation between their extreme signed significance signatures.
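As an illustration of Eqs. 1 and 2, the following is a minimal Python sketch of the extreme Pearson correlation, not the iLINCS implementation. The function names are hypothetical, the two signatures are assumed to be aligned over the same N genes with P values strictly greater than zero, and ties at the thresholds are simply retained.

```python
import math

def signed_significance(d, p):
    """Eq. 1: s_i = sign(d_i) * (-log10(p_i)); sign(0) is taken as 0."""
    return [math.copysign(1.0, di) * -math.log10(pi) if di != 0 else 0.0
            for di, pi in zip(d, p)]

def extreme_signature(s, k=100):
    """Eq. 2: keep values at or beyond the k-th most positive and
    k-th most negative entries; set everything else to zero.
    Assumes len(s) >= k."""
    ranked = sorted(s)
    lo, hi = ranked[k - 1], ranked[-k]  # s^{-k} and s^{k}
    return [x if (x >= hi or x <= lo) else 0.0 for x in s]

def pearson(x, y):
    """Standard Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def extreme_pearson(d1, p1, d2, p2, k=100):
    """Pearson correlation between two extreme signed significance signatures."""
    e1 = extreme_signature(signed_significance(d1, p1), k)
    e2 = extreme_signature(signed_significance(d2, p2), k)
    return pearson(e1, e2)
```

With k = 100 this follows the definition in the text; a smaller k is convenient only for toy inputs.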

If the query signature is created from an iLINCS dataset, or directly uploaded by the user, the connectivity with all iLINCS signatures is calculated as the weighted correlation between the two vectors of log-differential expressions, with the vector of weights equal to [−log10(P value of the query) − log10(P value of the iLINCS signature)]76. When the user-uploaded signature consists of only log-differential expression levels without P values, the weights are based only on the P values of the iLINCS signatures [−log10(P values of the iLINCS signatures)].
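The weighting scheme described above can be sketched as follows. This is a hedged illustration that assumes the standard weighted Pearson formula (weighted means, weighted covariance, and weighted variances); the text specifies the weights but not the exact estimator, and all function names are hypothetical.

```python
import math

def weighted_pearson(x, y, w):
    """Weighted Pearson correlation with non-negative weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)

def connectivity_weights(p_library, p_query=None):
    """Per-gene weights: -log10(p_query) - log10(p_library), or
    -log10(p_library) alone when the query has no P values."""
    if p_query is None:
        return [-math.log10(pl) for pl in p_library]
    return [-math.log10(pq) - math.log10(pl)
            for pq, pl in zip(p_query, p_library)]

def weighted_connectivity(d_query, d_library, p_library, p_query=None):
    """Weighted correlation of two aligned log-differential expression vectors."""
    w = connectivity_weights(p_library, p_query)
    return weighted_pearson(d_query, d_library, w)
```

Genes with large P values in both signatures receive weights close to zero, so they contribute little to the correlation.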

If the query signature uploaded by the user consists of lists of up- and downregulated genes, connectivity is calculated by assigning −1 to downregulated and +1 to upregulated genes and calculating Pearson’s correlation between this vector and iLINCS signatures. The calculated statistical significance of the correlation in this case is equivalent to the t test for the difference between differential expression measures of iLINCS signatures between up- and downregulated genes.
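The equivalence between the correlation and the t test can be checked numerically. The sketch below assumes that the correlation is computed only over genes present in both the query lists and the library signature, and that the t test is the pooled-variance two-sample test; both are assumptions, and the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Standard Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def updown_connectivity(up, down, library_signature):
    """library_signature maps gene -> log-differential expression.
    Returns (r, t), where t is the t statistic implied by r with n - 2 df."""
    up_set = set(up)
    genes = [g for g in list(up) + list(down) if g in library_signature]
    labels = [1.0 if g in up_set else -1.0 for g in genes]
    values = [library_signature[g] for g in genes]
    r = pearson(labels, values)
    n = len(genes)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t

def pooled_t(a, b):
    """Pooled-variance two-sample t statistic for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    sp2 = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))
```

For any input, the t statistic returned by `updown_connectivity` matches `pooled_t` applied to the library values of the upregulated and downregulated genes, which is the equivalence stated in the text.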

If the query signature is uploaded by the user in the form of a gene list, the connectivity with iLINCS signatures is calculated as the enrichment of highly significant differential expression levels in the iLINCS signature within the submitted gene list using the Random Set analysis77.

### Perturbagen connectivity analysis

The connectivity between a query signature and a “perturbagen” is established using the enrichment analysis of individual connectivity scores between the query signature and the set of all L1000 signatures of the perturbagen (for all cell lines, time points, and concentrations). The analysis establishes whether the connectivity scores as a set are “unusually” high based on the Random Set analysis77.
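A minimal sketch of a Random Set z-score for this setting is given below. It assumes the basic form of the method: the mean score of the set is compared with its expectation and variance under random sampling of equally sized sets without replacement, and a normal approximation gives a two-sided P value. Whether iLINCS applies additional transformations to the scores is not stated here, and the function names are hypothetical.

```python
import math

def random_set_z(all_scores, set_scores):
    """Z-score of the set mean against random sets of the same size.

    all_scores: connectivity scores of the query with every signature.
    set_scores: the subset of scores belonging to one perturbagen.
    Requires 1 <= len(set_scores) < len(all_scores).
    """
    n_all, m = len(all_scores), len(set_scores)
    mu = sum(all_scores) / n_all
    var_pop = sum((s - mu) ** 2 for s in all_scores) / n_all
    # Variance of the mean of m draws without replacement from n_all values.
    var_mean = (var_pop / m) * (n_all - m) / (n_all - 1)
    return (sum(set_scores) / m - mu) / math.sqrt(var_mean)

def two_sided_p(z):
    """Two-sided P value from the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

A large positive z indicates that the signatures of the perturbagen are, as a set, more positively connected to the query than expected for a random set of signatures of the same size.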

### iLINCS signature libraries

LINCS L1000 signature libraries (Consensus gene knockdown signatures (CGS), Overexpression gene signatures and Chemical perturbation signatures): for all LINCS L1000 signature libraries, the signatures are constructed by combining the Level 4, population control signature replicates from two released GEO datasets (GSE92742 and GSE70138) into the Level 5 moderated Z scores (MODZ) by calculating weighted averages as described in the primary publication for the L1000 Connectivity Map dataset2. For CP signatures, only signatures showing evidence of reproducibility, defined as the 75th quantile of pairwise Spearman correlations of Level 4 replicates (Broad Institute distil_cc_q75 quality control metric2) being greater than 0.2, are included. The corresponding P values were calculated by comparing the MODZ of each gene to zero using the Empirical Bayes weighted t test with the same weights used for calculating MODZs. The shRNA and CRISPR knockdown signatures targeting the same gene were further aggregated into Consensus gene signatures (CGSes)2 by the same procedure used to calculate MODZs and associated P values.
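The reproducibility filter and the weighted averaging can be illustrated with a rough sketch. The 75th quantile of pairwise Spearman correlations and the 0.2 cutoff follow the text. The replicate weights (sum of each replicate's correlations with the others, normalized to one), the flooring of low correlations, and the floor value are illustrative assumptions; the exact procedure is the one described in the primary L1000 publication. All function names are hypothetical.

```python
import math
from itertools import combinations

def ranks(x):
    """Ranks starting at 1, with ties assigned their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def q75_pairwise_spearman(replicates):
    """75th quantile (linear interpolation) of pairwise replicate correlations."""
    cc = sorted(spearman(a, b) for a, b in combinations(replicates, 2))
    pos = 0.75 * (len(cc) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(cc) - 1)
    return cc[lo] + (pos - lo) * (cc[hi] - cc[lo])

def passes_qc(replicates, cutoff=0.2):
    return q75_pairwise_spearman(replicates) > cutoff

def modz(replicates, floor=0.01):
    """Correlation-weighted average of replicate Z-score vectors.
    The floor on correlations is an illustrative assumption."""
    n = len(replicates)
    if n == 1:
        return list(replicates[0])
    raw = [sum(max(spearman(replicates[i], replicates[j]), floor)
               for j in range(n) if j != i) for i in range(n)]
    total = sum(raw)
    weights = [v / total for v in raw]
    return [sum(w * rep[g] for w, rep in zip(weights, replicates))
            for g in range(len(replicates[0]))]
```

In this sketch a replicate that disagrees with the others receives a small weight, so it has little influence on the combined signature.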

#### LINCS-targeted proteomics signatures

Signatures of chemical perturbations assayed by the quantitative targeted mass spectrometry proteomics P100 assay, measuring levels of 96 phosphopeptides, and the GCP assay against ~60 probes that monitor combinations of post-translational modifications on histones5.

#### Disease-related signatures

Transcriptional signatures constructed by comparing sample groups within the collection of curated public domain transcriptional datasets (GEO DataSets collection)34. Each signature consists of differential expressions and associated P values for all genes calculated using the Empirical Bayes linear model implemented in the limma package.

#### ENCODE transcription factor-binding signatures

Genome-wide transcription factor (TF) binding signatures constructed by applying the TREG methodology to ENCODE ChIP-seq78. Each signature consists of scores and probabilities of regulation by the given TF in the specific context (cell line and treatment) for each gene in the genome.

#### Connectivity map signatures

Transcriptional signatures of perturbagen activity constructed based on version 2 of the original Connectivity Map dataset using Affymetrix expression arrays17. Each signature consists of differential expressions and associated P values for all genes when comparing perturbagen-treated cell lines with appropriate controls.

#### DrugMatrix signatures

Toxicogenomic signatures of over 600 different compounds79 maintained by the National Toxicology Program80 consisting of genome-wide differential gene expression levels and associated P values.

#### Transcriptional signatures from EBI Expression Atlas

All mouse, rat and human differential expression signatures and associated P values from manually curated comparisons in the Expression Atlas8.

#### Cancer therapeutics response signatures

These signatures were created by combining transcriptional data with drug sensitivity data from the Cancer Therapeutics Response Portal (CTRP) project81. Signatures were created separately for each tissue/cell lineage in the dataset by comparing gene expression between the five cell lines of that lineage that were most sensitive and the five that were least sensitive to a given drug, as measured by the area under the concentration-response curve (AUC), using a two-sample t test.
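The construction of a single drug-by-lineage signature can be sketched as follows. The sketch assumes that a lower AUC means higher sensitivity, that differences are reported as sensitive minus resistant, and that the t test is the pooled-variance version; none of these details are stated in the text. The function names and the input layout are hypothetical.

```python
import math

def pooled_t(a, b):
    """Pooled-variance two-sample t statistic for mean(a) - mean(b).
    Assumes the pooled variance is non-zero."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    sp2 = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))

def sensitivity_signature(auc, expression, k=5):
    """auc: cell line -> AUC for one drug within one lineage.
    expression: gene -> {cell line -> log expression}.
    Returns gene -> (mean difference, t statistic) comparing the k most
    sensitive with the k least sensitive cell lines."""
    ranked = sorted(auc, key=auc.get)        # ascending AUC
    sensitive, resistant = ranked[:k], ranked[-k:]
    signature = {}
    for gene, values in expression.items():
        s = [values[c] for c in sensitive]
        r = [values[c] for c in resistant]
        diff = sum(s) / k - sum(r) / k
        signature[gene] = (diff, pooled_t(s, r))
    return signature
```

The P value for each gene would follow from the t distribution with 2k − 2 degrees of freedom, which is omitted here to keep the sketch free of external dependencies.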

#### Pharmacogenomics transcriptional signatures

These signatures were created by calculating differential gene expression levels and associated P values between cell lines treated with anti-cancer drugs and the corresponding controls in two separate projects: the NCI Transcriptional Pharmacodynamics Workbench (NCI-TPW)82 and the Plate-seq project dataset4.

### Constructing signatures from iLINCS datasets

The transcriptomics or proteomics signature is constructed by comparing expression levels of two groups of samples (treatment group and baseline group) using the Empirical Bayes linear model implemented in the limma package83. For the GREIN collection of GEO RNA-seq datasets84, the signatures are constructed using the negative binomial generalized linear model as implemented in the edgeR package85.

### Analytical tools, web applications, and web resources

Signature analytics in iLINCS is facilitated via native R, Java, JavaScript, and Shiny applications, and via API connections to external web applications and services. A brief listing of analysis and visualization tools is provided here. The overall structure of iLINCS is described in Supplemental Fig. 2.

Gene list enrichment analysis is facilitated by directly submitting lists of genes to any of the three prominent enrichment analysis web tools: Enrichr20, DAVID21, ToppGene22. The manipulation and selection of lists of signature genes is facilitated via an interactive volcano plot JavaScript application.

Pathway analysis is facilitated through general-purpose enrichment tools (Enrichr, DAVID, ToppGene), the enrichment analysis of Reactome pathways via the Reactome online tool23, and internal R routines for SPIA analysis86 of KEGG pathways and general visualization of signatures in the context of KEGG pathways using the KEGG API24.

Network analysis is facilitated by submitting lists of genes to GeneMANIA25 and by the internal iLINCS Shiny Signature Network Analysis (SigNetA) application.

Heatmap visualizations are facilitated by native iLINCS applications: the Java-based FTreeView87, a modified version of the JavaScript-based Morpheus88 and a Shiny-based HeatMap application, and by connection to the web application Clustergrammer29.

Dimensionality reduction analysis (PCA and t-SNE89) and visualization of high-dimensional relationships via interactive 2D and 3D scatter plots is facilitated via internal iLINCS Shiny applications.

Interactive boxplots, scatter plots, GSEA plots, bar charts, and pie charts used throughout iLINCS are implemented using R ggplot90 and plotly91.

Additional analysis is provided by connecting to X2K Web26 (to identify upstream regulatory networks from signature genes), L1000FWD27 (to connect signatures with signatures constructed using the Characteristic Direction methodology92), STITCH28 (for visualization of drug-target networks), and piNET30 (for visualization of gene-to-pathway relationships for signature genes).

Additional information about drugs, genes, and proteins is provided by links to the LINCS Data Portal31, ScrubChem32, PubChem33, Harmonizome93, GeneCards94, and several other databases.

### Gene and protein expression dataset collections

iLINCS backend databases provide access to more than 34,000 preprocessed gene and protein expression datasets that can be used to create and analyze gene and protein expression signatures. Datasets are thematically organized into eight collections, with some datasets assigned to multiple collections. Users can search all datasets or browse datasets by collection.

#### LINCS collection

Datasets generated by the LINCS data and signature generation centers7.

#### TCGA collection

Gene expression (RNASeqV2), protein expression (RPPA), and copy number variation data generated by the TCGA project63.

#### GDS collection

A curated collection of GEO DataSets (GDS)34.

#### Cancer collection

An ad hoc collection of cancer-related genomic and proteomic datasets.

#### Toxicogenomics collection

An ad hoc collection of toxicogenomics datasets.

#### RPPA collection

An ad hoc collection of proteomic datasets generated by Reverse Phase Protein Array assay95.

#### GREIN collection

Complete collection of preprocessed human, mouse, and rat RNA-seq data in GEO provided by the GEO RNA-seq Experiments Interactive Navigator (GREIN)84.

#### Reference collection

An ad hoc collection of important gene expression datasets.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.