## Introduction

The completion of the Human Genome Project1 enabled the quantification of mRNA expression at the genome-wide scale, initially with cDNA microarray technology2, but now more commonly via RNA-sequencing (RNA-seq) (Fig. 1). RNA-seq is replacing cDNA microarrays as the dominant technology because it offers reduced cost, increased sensitivity, ability to quantify splice variants and perform mutation analyses, improved quantification at the transcript level, identification of novel transcripts, and improved reproducibility3. Genome-wide gene expression data from thousands of studies have been accumulating and made available for exploration and reuse through public repositories such as the Gene Expression Omnibus (GEO)4 and ArrayExpress5. Since the late 1990s software for the analysis of cDNA microarray data has matured toward established community accepted computational procedures, whereas analyses methods to process RNA-seq data are still actively refined and developed.

The quality of RNA-seq data depends on the sequencing depth whereby more reads per sample can reduce technical noise. Modern sequencing platforms such as Illumina HiSeq produce tens of millions of paired-end reads of up to 150 base pairs in length per sample. The raw reads are aligned to a reference genome by mapping reads to known gene sequences. The alignment step is computationally demanding, and the various alignment algorithms implemented in software packages are continually improving6,7,8,9,10,11,12. Bowtie6 is one of the first alignment methods that gained wide spread popularity. More efficient solutions were later implemented by improving memory utilization with faster execution time. One of the currently leading alignment methods, Spliced Transcripts Alignment to a Reference (STAR)8, can map more than 200 million reads per hour. As a trade-off for increased computational speed, STAR requires heavy memory consumption, particularly for large genomes such as human or mouse. For mammalian genomes, STAR requires more than 30 GB of random access memory (RAM). This requirement limits its application to high performance computing (HPC) platforms. This introduces a barrier for the typical experimental biologist who generates the data. Additionally, knowledge in programming and a series of choices in regards to the alignment software parameter settings, are commonly required to covert raw reads to a quantified expression matrix of processed RNA-seq data.

Retrospective analyses of large collections of previously published RNA-seq data can illuminate new biology and accelerate drug discovery13. However, many post hoc studies rely on large data sets that are interoperable with data analysis workflows whereby gene expression data is provided in processed form. For example, the Genotype-Tissue Expression project (GTEx)14 and The Cancer Genome Atlas (TCGA)15 RNA-seq data sets are frequently reused in post hoc projects mainly because such data are provided in a useful processed format. GTEx currently contains 9662 RNA-seq samples from 53 human tissues collected from over 250 individuals, whereas TCGA contains at least 11,077 RNA-seq samples created from a diverse collection of tumors.

Recent efforts attempted to simplify the access to gene expression data collected via RNA-seq to create more unified resources from fragmented repositories16,17,18,19,20. Currently, as of February 2018, there are 187,946 RNA-seq samples, collected from human or mouse cells and tissues, that are accessible from the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA), making this resource the most comprehensive repository for RNA-seq data collected from mammals. This large collection of samples from diverse institutions, laboratories, studies, and projects is comprehensive, but not homogeneous compared with RNA-seq data collected for large projects such as GTEx and TCGA. The data within GEO/SRA is provided mostly in raw sequence form. While some studies provide aligned reads files, these are few and processed non-uniformly. This shortcoming makes it difficult to query and integrate this data at a global scale. To bridge the gap that currently exists between RNA-seq data generation and RNA-seq data processing, we developed the resource all RNA-seq and ChIP-seq sample and signature search (ARCHS4). The ARCHS4 pipeline (Fig. 2) processes RNA-seq data from GEO/SRA to support retrospective data analyses and reuse. ARCHS4 caters to users with different levels of computational expertise and can be employed for many post hoc analyses and projects. The goal is to provide users with direct access to the data through a web-based user interface, while implementing a scalable and cost-effective solution for the raw data processing task. The usefulness of ARCHS4 is exemplified through case studies that show how the data assembled can be applied to predict gene function and protein–protein interactions (PPI).

## Results

### The ARCHS4 website

The ARCHS4 three-dimensional viewer also supports manual lasso selection of samples through a snipping tool. The colors used to highlight selected samples can be changed by the user. The genes view of ARCHS4 provides the same manual selection feature as with the samples view. Selected gene lists can be downloaded directly from ARCHS4, or submitted for gene list enrichment analysis with Enrichr28,29. Additionally, individual genes can be queried to locate an ARCHS4-dedicated gene-landing page. These single gene landing pages contain predicted biological functions based on correlations with genes assigned to GO categories; predicted upstream transcription factors based on correlation with identified targets as determined by ChIP-seq data from the ChEA and ENCODE gene set libraries; predicted knockout mouse phenotypes based on annotated MGI mammalian phenotypes; predicted human phenotypes based on co-expression correlation with genes that have assigned human phenotypes in the Human Phenotype Ontology30; predicted upstream protein kinases based on known kinase-substrates from KEA; and membership in pathways based on co-expression with pathways from KEGG. The single gene landing pages also list the top 100 most co-expressed genes for each individual gene. Additionally, for 53 distinct tissues and 67 cell lines, expression levels are visualized for each gene. These are visualized as two hierarchical trees with tissues and cells grouped by system and organ.

In addition, ARCHS4 processed data can be accessed via the ARCHS4 Chrome extension, which is freely available from the Chrome Web Store. The Chrome extension detects GEO series landing pages and then inserts a Series Matrix File (SMF) for download for each series that has been processed by the ARCHS4 pipeline. Each SMF contains read counts for all available samples in the series. The sample expression is also visualized as a heatmap using the Clustergrammer plugin31. Clustergrammer loads JSON files containing the z-score normalized gene expression of the top 500 most variable genes across the series and embeds the interactive heatmap directly into the GEO series landing page.

The ARCHS4 pipeline (Fig. 2) speed is measured by the elapsed time from job submission until completion for 31,825 samples. The timed process includes: downloading the SRA file, extracting the FASTQ file from the SRA data format, alignment to the reference genome, mapping the transcript counts to the gene level, and writing the final result to the database. Processing time of a single FASTQ file takes on average ~11 min. This benchmark is applied using Amazon EC2 on-demand m4.large instances with 8 GB of memory and 2 vCPUs, running with 200 GB of hard-drive storage. Each instance can run 2 Dockerized alignment pipeline containers in parallel. At the time of the benchmark, the cost of the on-demand m4.large instances was $0.1/h. This results in an average compute cost for one processed SRA file to be$0.00982. For samples with at least 1,000,000 aligned reads and paired reads with 200 bp, the alignment cost was $0.973 per billion reads or$0.00486 per billion base pairs. For these specific samples, this averages to 0.025 per sample. Most samples, however, have unpaired reads and lower read counts, resulting in the lower cost of less than a cent. The alignment time correlates with the number of reads (Spearman’s correlation coefficient r = 0.881), and the processing time increases linearly with the number of spots aligned with some variance (r = 0.901, paired reads) due to performance differences between cloud computing instances (Fig. 3c, d). Paired-end read RNA-seq experiments require more time during the alignment process due to the increased number of spots that have to be processed. The ARCHS4 pipeline is, to our knowledge, the most cost-effective cloud-based RNA-seq alignment infrastructure published to date. ### STAR vs. Kallisto comparison To achieve its fast and cost-effective solution, ARCHS4 utilizes the Kallisto aligner9. However, it is not clear whether the improved speed and cost provided by Kallisto comes with a cost of drop in the output quality. To benchmark Kallisto against other aligners, a subset of 1708 human samples processed by ARCHS4 were also aligned with STAR8. While Kallisto and STAR return similar gene expression profiles, there are profound differences between the output produced by the two methods. In general, Kallisto detects more genes than STAR (Fig. 3e, f). The average Pearson correlation of the z-score transformed samples between the Kallisto and STAR outputs is 0.77. However, the number of detected genes does not directly translate to a qualitative advantage of Kallisto over STAR. To test the quality of the generated gene expression matrices and their gene correlation structure, we tested the ability of the processed data sets by STAR or Kallisto to predict GO biological processes for single genes, as described in detail in the Methods section. The quality of the predictions is almost identical for the two compared data sets with an average area under the curve (AUC) of 0.69 for predictions made by processed data generated by the two separate methods (Fig. 3f). ### Comparison to existing RNA-seq pipelines and resources Multiple efforts attempted to uniformly reprocess large collections of RNA-seq data16,17,18,19,20. Table 1 provides an overview comparison of several resources with respect to size, cost per sample, and other attributes. The total sample size and cost for data processing are visualized in Fig. 4. Even though ARCHS4 contains more than double the number of RNA-seq samples than other resources, the estimated costs compared with Recount and Toil Recompute is an order of magnitude lower with1745, $44,785, and$25,910, respectively. All approaches rely on the use of either private or public high-performance computer clusters. Toil Recompute16 was applied to re-compute the transcript level counts of 19,931 RNA-seq samples. The UCSC pipeline architecture was run on an Amazon Web Services (AWS) cluster and averaged 1.30 per sample. The data set contains 11,194 samples from TCGA, 8002 from GTEx, and 734 additional samples. The processed data is made available through a web interface called the Xena browser. Expression Atlas17 provides processed RNA-seq and microarray gene expression data for multiple species. Expression data is processed by a pipeline named iRAP18. The total number of assays in Expression Atlas is 118,209 from 3035 experiments. From these, only 565 are RNA-seq. All data is reported at the gene level and is accessible as a bulk zip download. The Recount project19 performed sequence alignment with a pipeline termed Rail-RNA. The reported cost per sample is0.72. The data in Recount contains 9662 samples from GTEx, 11,350 from TCGA, and ~50,000 human SRA samples. The data is available as bulk download or through an R package. Expression is reported at the gene and transcript levels. The RNAseqDB20 also contains all GTEx and TCGA processed samples. The alignment for RNAseqDB was performed on an internal cluster and no cost analysis is available. The data is provided as bulk download files with FPKM normalized transcripts. The data is deposited in a GitHub repository. In contrast with ARCHS4, the reported cost for all similar efforts is about two orders of magnitude more expensive for processing a sample. Cost per sample is a critical factor in processing RNA-seq data because of the rapid growth in data production. Compute cost for ARCHS4 is almost negligible compared to the cost of comparable efforts. The number of samples already available from the ARCHS4 resource is by far the largest collection of processed RNA-seq to date, and the low-cost pipeline enables a rolling update as more samples become available. In contrast with other resources, ARCHS4 provides multiple methods for data accessibility. While Recount and Expression Atlas support programmatic access through R packages, only the Expression Atlas supports enrichment analysis on signatures derived from the expression profiles. A unique feature of ARCHS4 is the real-time data and metadata querying support that allows the identification and selection of relevant subsets of samples.

The percentage of aligned reads over total reads for each FASTQ file varies significantly across labs, projects, and sequencing cores due to various reasons. Since each sample from GEO/SRA is annotated with the producers of the data, the percent of aligned reads by institution can be plotted (Fig. 5). The 34 institutions that so far produced more than 100 unique samples from more than 20 gene expression series of RNA-seq samples available on the GEO/SRA database show that the highest percentage of successful aligned reads is by the University of Minnesota with a median of 87%. The 429 samples that originated from the Jackson Laboratory come from 23 distinct gene expression series. It should be noted that observed differences in the fraction of aligned reads is not necessarily an indicator of the performance of the sequencing core within an institution but can be attributed to the quality of the samples. For example, samples from formalin-fixated tissues will suffer from RNA degradation which will result in lower percent of aligned reads. In addition, it should also be noted that in general investigators from most research institutions frequently use various external sequencing core services. On average, 63% of reads were aligned across all 84,863 of the processed human RNA-seq samples, whereas 59% of all mouse RNA-seq reads from 103,083 samples were aligned to matching genes.

### Prediction of biological functions and protein interactions

Gene function and PPI can be potentially predicted using co-expression data, whereas the data that is processed for ARCHS4 provides a rich resource for generating gene co-expression correlations. Evaluating the quality of co-expression correlation networks to predict protein interactions and biological functions can also provide an unbiased benchmark to compare the ARCHS4 resource with other major RNA-seq and microarray repositories. The hypothesis is that gene function and protein interactions can be predicted using co-expression data, and this implies that co-expressed genes tend to share their function and physically interact. It means that genes are assigned predicted biological functions when they are highly correlated with a set of genes that are already annotated to have some biological function. Similarly, a gene product is predicted to interact with another protein if the known direct protein interactors for that other protein are highly co-expressed with the gene product protein. We evaluate the human and mouse ARCHS4 data sets by comparing them to co-expression matrices created in the same way from the Cancer Cell Line Encyclopedia (CCLE) and GTEx resources. All gene expression data sets produce, on average, significant ability to predict both biological functions and protein interactions. This suggests that gene expression correlations derived from large-scale expression data sets are predictive of biological function and protein interactions. In almost all the tested categories, the ARCHS4 mouse and human data sets significantly outperformed the predictions made with co-expression data created from the CCLE and GTEx data sets (Table 2). The most accurate predictions for GO biological processes, GO molecular functions, KEGG pathways, Human Phenotype Ontology terms, predicted upstream kinases, and MGI Mammalian Phenotype terms are achieved with the ARCHS4 mouse gene co-expression data followed by the ARCHS4 human data. The co-expression data from GTEx outperforms the co-expression data created from the CCLE for GO biological processes and the phenotype libraries, whereas the predictability of using GTEx data in the same way is lower than CCLE for the upstream regulatory transcription factor predictions. P-values are calculated for the ∆ mean between methods. For the ARCHS4 mouse co-expression data, the AUC distributions for predicting gene function are significant across all categories, but most successfully in predicting GO biological processes with median AUC of 0.745, and membership in KEGG pathways with a median AUC of 0.797 (Fig. 6a).

While predicting protein function with co-expression data has been attempted successfully by many before, it is less established whether co-expression data can be used to also predict PPI. A similar strategy was employed to predict PPI using prior knowledge about PPI. PPI data is fetched from three PPI resources: hu.MAP32, BioGRID33, and BioPLEX34. These three PPI resources are unique. After filtering, the PPI from BioGRID are the composition of interactions from thousands of publications that report only few interactions. The PPI from BioPLEX are bait-prey interactions from a massive mass-spectrometry experiment. The hu.MAP PPI consist of data from three mass-spectrometry experiments integrated with sophisticated computational methods. hu.MAP also considers prey-prey interactions to boost interaction confidence. Importantly, none of these three resources utilize knowledge from mRNA co-expression data to confirm PPI. The overlap of shared interactions between the three PPI networks is relatively low, with hu.MAP and BioPLEX sharing more than 10% of their interactions (Fig. 6b). This is likely because a part of BioPLEX is contained within hu.MAP. Predicting PPI using knowledge from these three PPI resources, the ARCHS4 mouse co-expression data is the most predictive, with median AUCs of 0.85, 0.66 and 0.64 for hu.MAP, BioGRID, and BioPLEX respectively (Fig. 6c).

The fact that PPI from hu.MAP can be predicted at a much higher accuracy compared to the other two networks may suggest that PPI within hu.MAP are more correct. The 75% quantile of interaction correlation in hu.MAP is 0.174 compared to BioGRID (0.0915) and BioPLEX (0.0478); whereas the intersections between the PPI networks (I, II, III, IV) tend to have a higher 75% quantile of correlations, with 0.198, 0.217, 0.131 and 0.132 suggesting that aggregating evidence from experiments that detect PPI, is most likely to boost confidence of real interactions (Fig. 6d). This suggestion further supports that mRNA co-expression data can be used to predict PPI. The predicted PPI and predicted biological functions provide a plethora of computational hypotheses that could be further validated experimentally.

## Methods

### Post-processing to make the RNA-seq data accessible

To integrate new samples into the ARCHS4 resource, gene and transcript count files are extracted from the database and saved into HDF5 files42. The HDF5 files for human and mouse contain metadata describing each sample retrieved from GEO with GEOquery. The files are then deployed to Amazon S3 and made accessible for download. The 3D visualization of all samples on the ARCHS4 website is calculated with t-SNE43 after quantile normalization and log2 transformation of the human and mouse samples separately. The t-SNE procedure uses a perplexity of 50 for the sample centric embedding, and a perplexity of 30 for the gene-centric embedding using the Rtsne package in R44. The integration of the processed data into GEO series landing pages is achieved through the ARCHS4 Chrome extension. The extension, freely available at the Chrome Web Store, first detects whether a GEO GSE landing page is currently open in the browser. It then requests the matching GSE series matrices from ARCHS4 containing the gene expression counts and metadata information for the GSE. Additionally, the ARCHS4 Chrome extension requests JSON objects with pre-computed clustered gene expression for visualizing the samples with Clustergrammer31. Summary statistics of sample counts and tissue-specific samples are saved in a dedicated database table to be accessed by the ARCHS4 website landing page for display.

### Sample search with reduced dimensionality

To enable reliable similarity search of signatures within the ARCHS4 data matrix, the matrix is compressed into a lower dimensional representation. A projection that maintains pairwise distances and correlations between samples is computed with the JL method21. The JL-transform reduces the original gene expression matrix $$E \in N \times M$$ where N is the number of genes and M is the number samples, into a matrix $$\hat E \in S \times M$$, with $$S < N$$. A subspace of 1000 dimensions captures the original correlation structure with a correlation coefficient of 0.99 (Fig. 3a). For implementing the ARCHS4 signature search, a projection matrix $$D_{\rm JL} \in 1000 \times N$$ is used to calculate $$\hat E = D_{\rm JL} \times E$$. The human and mouse matrices are handled separately. For user queries, input signatures $$\vec s = [s_1,s_2, \ldots ,s_n]$$ are projected onto a lower dimension $$\widehat {\vec s} = D_{\rm JL} \times \vec s$$. Since $${\mathrm{cov}}(\widehat {\vec s},\hat E) \approx {\mathrm{cov}}(\vec s,E)$$, this method enables responsive real-time signature similarity search with low error (Fig. 3b).

### The ARCHS4 interactive website

The front-end of ARCHS4 is hosted on a web server derived from a tutum/lamp Docker image that is pulled from Docker Hub. It is a web service stack running on a UNIX-based operating system with an Apache HTTP server and a MySQL database. ARCHS4 is an AJAX application implemented with PHP and JavaScript. All visual data representations are implemented in JavaScript. The sample statistics overview of the landing page is implemented using D3.JS45. On the data view page, the sample and gene three-dimensional embedding is visualized using Three.js and WebGL46, which enable the responsive visualization of thousands of data points in 3D. Data-driven queries such as signature similarity searches are performed in an R environment hosted on a dedicated Dockerized Rook web server. On start-up, the Rook server automatically retrieves all necessary data files, including the JT-transformed gene expression table, as well as the transformation matrix, and loads them into memory for fast access from an S3 cloud repository. All Docker containers can be load-balanced and run on a Mesosphere computer cluster with redundant hardware. The load balancing and port mapping is controlled through a HAProxy service. The MySQL database is hosted as an RDS Amazon web service.

### Prediction of biological function and protein interactions

Gene–gene co-expression correlations across all human genes can be utilized to predict gene function and PPI by exploiting the fact that genes that co-express have the tendency to also share function and physically interact. First, expression matrices from ARCHS4 mouse, ARCHS4 human, GTEx14, and CCLE47 were organized into genes as the rows and samples as the columns. For the ARCHS4 data matrices, 10,000 samples were randomly selected to construct gene expression correlation matrices for mouse and human separately. For GTEx and CCLE, all available samples (9662 and 1037, respectively) were used to build the co-expression correlation matrix for all human genes. For ARCHS4 data, functional prediction accuracy increases with the number of samples included, while gains become marginal with more than 10,000 samples. Interestingly, even with a subset of 100 randomly selected samples, functional prediction accuracy is high. The quantile normalization function from the Bioconductor package preprocessCore48 was used to normalize gene counts across samples. From the extracted expression matrices, all pairwise gene correlations were calculated. For each gene set $$g_j \in {\rm GS}$$ and each gene g i the mean correlation of the genes in the gene set to g i was calculated. Self-correlations when $$g_i \in {\rm gs}_j$$ were excluded. Hence, the resulting gene set membership prediction matrix $${\rm GM} \in M \times N$$ for M genes and N gene sets is generated by the following procedure:

$${\rm GM}_{ij} = {\mathrm{mean}}\left( {{\mathrm{cor}}\left( {g_i,g_j} \right)} \right).$$

GM i is then sorted from high to low based on correlation level. For each row i within GM, a vector $$\vec s = [s_{1{\rm GS}1},s_{2{\rm GS}2}, \ldots ,s_{n{\rm GS}n}]$$ is then constructed where $$s_{n{\rm GS}n} \in \{ 0,1\}$$ and $$s_{j{\rm GS}j}$$ is 1 if gene g i is already known to be in the gene set GS. This vector is sorted and used to compute the AUC from the cumulative sum of $$\vec s_i$$ using trapezoidal integration. To predict PPI, the three PPI networks, hu.MAP32, BioGRID33, and BioPLEX34, are first converted to a gene set library as described by Ma’ayan et al.49. Then, to predict PPIs, the same procedure as described for functional predictions was applied.

### Data availability

The ARCHS4 website is accessible at https://amp.pharm.mssm.edu/archs4. On the site all processed data is available at the Download tab. Source code for the project is provided at https://github.com/MaayanLab/archs4. Source code is available under the Apache Licence 2.0. Provided gene expression files are made available under the Creative Commons Attribution 4.0 International License (Creative Commons License). All data is free to use for non-commercial purposes.