GREIN: An Interactive Web Platform for Re-analyzing GEO RNA-seq Data

The vast amount of RNA-seq data deposited in Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) is still a grossly underutilized resource for biomedical research. To remove technical roadblocks for reusing these data, we have developed a web-application GREIN (GEO RNA-seq Experiments Interactive Navigator) which provides user-friendly interfaces to manipulate and analyze GEO RNA-seq data. GREIN is powered by the back-end computational pipeline for uniform processing of RNA-seq data and the large number (>6,500) of already processed datasets. The front-end user interfaces provide a wealth of user-analytics options including sub-setting and downloading processed data, interactive visualization, statistical power analyses, construction of differential gene expression signatures and their comprehensive functional characterization, and connectivity analysis with LINCS L1000 data. The combination of the massive amount of back-end data and front-end analytics options driven by user-friendly interfaces makes GREIN a unique open-source resource for re-using GEO RNA-seq data. GREIN is accessible at: https://shiny.ilincs.org/grein, the source code at: https://github.com/uc-bd2k/grein, and the Docker container at: https://hub.docker.com/r/ucbd2k/grein.


Supplementary Tables
Step-by-step guide of GREIN with an example dataset

Landing page (GEO datasets)
To illustrate the usability and efficacy of GREIN, we will walk through the available features for exploring and analyzing data sets with an example. The interpretation of the results need further bioinformatics expertise. GREIN is accessible at: https://shiny.ilincs.org/grein.

Section 1
The first section (Supplementary Figure S2) provides information regarding the sample size distribution of the processed datasets, total number of data sets already processed or waiting to be processed, and the number of processed human, mouse, or rat samples by our GREP2 pipeline.

Section 2
The first panel in section 2 provides the option to process GEO dataset of interest if it is not already processed. GEO RNA-seq datasets can be searched using a GEO series accession to see if it exists in the dataset table (Section 3). If not, then 'Start processing' button will appear right below this box and the processing can be initialized by clicking this button which opens the 'Processing console' window (See Supplementary Figure S3). Requested data set id can be seen in the 'Currently processing' or at the bottom Supplementary Figure S2. GREIN landing page. The first section provides data summaries, second section allows search options along with the list of user requested datasets and their processing status, and the processed datasets are shown in the third section.
of the 'Waiting to process' menu. This window also shows the logs of the currently processing dataset requested by a user. A single server processing pipeline is continuously running in the back-end to process datasets whenever requested. This pipeline is dedicated to process the user requested datasets only. Depending on the size of the data and queue, the requested data sets are automatically uploaded to the portal as soon as they are processed.
GREIN also provides search options for biomedical ontologies (for example, cancer, basal cell, kidney, etc.) in the second panel of this section. We use ontology terms mapped to GEO samples by MetaSRA project 17 (http://metasra.biostat.wisc.edu/). User search term associated ontologies can be found in the 'Metadata' under 'Explore dataset' tab.
The right most panel in this section shows the user requested data sets. If a dataset is requested for processing, the dataset id (GEO series accession) will show up here. Also, the status of the processing queue can be opened by pressing the 'Processing console' button.

Section 3
The list of processed data sets with additional information in the data table is shown in section 4 (See Supplementary Figure S2). Two types of search options are available in this

Description
This tab panel provides descriptive information including study link, number of GEO samples, number of SRA runs, title, and study summary of the corresponding dataset.

Metadata
GEO metadata contains a lot of information, although not all of these are useful for analysis or visualization purpose. So, we provide a filtered version of the metadata besides the full metadata.
We filter metadata based on the following criteria: 1. Columns that contain a single value.
2. Columns with incoherent information regarding analysis and visualization such as dates, time, download path and so on.

Supplementary Figure 4. Description tab panel.
The study link will take the user to GEO web page.
Supplementary Figure S5. Metadata tab panel. Both the full and filtered metadata are downloadable in CSV format.
This dataset (GSE60450) has two cell types and three developmental stages and each combination has two biological replicates. User can also download both the filtered and full metadata.

Counts table
This table shows gene wise estimated read abundance (rounded to the nearest integer) for each sample both in raw and normalized format. We use Salmon to quantify transcript abundances for each sample. These transcript level estimates are then summarized to gene level using Bioconductor package tximport which gives estimated counts scaled up to library size while taking into account for transcript length. We obtained gene annotation for Homo sapiens (GRCh38), Mus musculus (GRCm38), and Rattus norvegicus (Rnor_6.0) from Ensemble (release-91). Both gene and transcript level expression data are downloadable. Also, each gene can be visualized via interactive boxplots.

Quality control (QC) report
After running FastQC and Salmon, we generate a combined quality control report of all the samples using MultiQC. This downloadable report contains information regarding read mapping and quality scores of the FastQ files. In the general statistics table, each sample corresponds to two rows, the first one for the Salmon read mapping and the second one for FastQC (See Supplementary Figure S7).

Visualization
This section provides access to four different types of interactive exploratory plots. These plots are important in order to uncover underlying relationship of the samples and gain deeper insight of the data structure. We leverage several state-of-the-art R and Bioconductor packages for this purpose.

Correlation and density plot
Sample-wise Pearson correlation heatmap and density plot are generated using Plotly. User can hover over the plots to see expression values or zoom in to any specific area and double click to zoom out. Group wise annotation is available for correlation heatmap. Distribution of the data on the $ ( ) scale is shown in the density plot.

Heatmap
Heatmap is displayed based on the top most highly variable genes (sorted by median absolute deviation values of $ ( ) and data is centered to the mean) in this section. We use Bioconductor packages ComplexHeatmap and R package iheatmapr for static and interactive heatmaps respectively. User can select either Pearson correlation, Euclidean distance, or group by properties option for hierarchical clustering of both genes and samples. User can also pick any number of highly variable genes. Both the plots and heatmap data are downloadable.

Principal component analysis (PCA) and t-distributed stochastic neighboring embedding (t-SNE) plots
GREIN provides the options for visualizing data in reduced dimension using both linear and non-linear approaches. PCA and t-SNE plots are available in both two and three-dimensional plane. User can subset the data and mouse hover on each data points to see the labels.

Analyze dataset
The 'Analyze dataset' tab at the very top tab panel consists of 'Power analysis' and 'Create a signature' tabs.

Power analysis
This section is dedicated to assist users in power analysis which is an essential step in designing an RNAseq experiment with a goal to achieve the desired power to detect differentially expressed genes. This section is comprised of three sub-sections: metadata, power curve and detectability of genes. User will have to select two groups for power analysis.

Power curve
We use Bioconductor package RNASeqPower to calculate power using the following parameters: 1. Biological coefficient of variation calculated as the squared root of the common dispersion (We use Bioconductor package edgeR to calculate common dispersion). 2. Number of samples in each group. 3. Fold change as the effect size. The default value is 2. 4. Level of significance or alpha. The default value is 0.01. 5. Average sequencing depth in million. The default is calculated as the average column sums in million.

Detectability of genes
The plot of biological coefficient of variation (BCOV) vs. average $ ( ) gives an idea regarding the detectability of each of the genes as differentially expressed based on the selected groups. User can modify the parameters as per their interest. Also, user can search for a gene to see the power of the gene or clicking on the points will display the values in a table below the plot. Figure S11. Power curve tab panel. This plot is based on a single gene.

Create a signature
Generating differential expression signature is one of the most important segments of GREIN. This section begins with selecting a variable of interest to test for differential expression between the groups of this variable. We would like to see transcriptional changes between lactating and pregnant samples from the basal population only. So, we select 'developmental stage' as our factor of interest, select '2 day lactation' in the experimental group and '18.5 day pregnancy' in the control group. Depending on the number of available properties and levels, two different types of comparisons are available: two group without covariate and two group with covariate. Then we select 'Yes' for the 'Subset samples' which provides the option to select basal population only. User can see the selected groups in the 'Metadata' table (Supplementary Figure S13). The variable 'Selected groups' in this table is created on the fly based on the selected groups. A signature table will be generated once the 'Generate signature' button is clicked (Supplementary Figure S14). The analysis pipeline starts by filtering genes with very low counts. Genes that have counts per million (CPM) values of more than 0 in at least the minimum number of samples in any of the comparison groups are kept for the downstream analysis. We apply trimmed mean of M values (TMM) for normalizing libraries which is a built-in normalization method in edgeR. A design matrix is constructed based on the selected variable and groups. We use gene-wise negative binomial generalized linear models with quasi-likelihood tests and gene-wise exact tests from Bioconductor package edgeR to calculate differential expression between groups with and without covariates respectively. P-values are adjusted for multiple testing correction using Benjamini-Hochberg method. A gene is considered upregulated in the '2 day lactation' group if $ ( ℎ ) (logFC) is positive and a gene is downregulated if $ ( ℎ ) is negative. Figure S12. Detectability of genes tab panel. User can search for any gene symbol to see its detectability power.

Supplementary
Gene detectability plot shows the effect of BCOV or average depth on the power and identifies genes that might act as false negatives (See Supplementary Figure S17). User can download signature data based on their selection (DE, NDE&DT, or NDE&NDT).
Supplementary Figure S17. Gene detectability plot. User can select genes based on the cutoff or search for genes of interest to see their power. User can also select any specific area in the plot to see the signature table of the selected genes.
Supplementary Figure S18. Uploaded signature to iLINCS portal. User can further modify the list of genes.