Enabling cross-study analysis of RNA-Sequencing data

Driven by the recent advances of next generation sequencing (NGS) technologies and an urgent need to decode complex human diseases, a multitude of large-scale studies were conducted recently that have resulted in an unprecedented volume of whole transcriptome sequencing (RNA-seq) data. While these data offer new opportunities to identify the mechanisms underlying disease, the comparison of data from different sources poses a great challenge, due to differences in sample and data processing. Here, we present a pipeline that processes and unifies RNA-seq data from different studies, which includes uniform realignment and gene expression quantification as well as batch effect removal. We find that uniform alignment and quantification is not sufficient when combining RNA-seq data from different sources and that the removal of other batch effects is essential to facilitate data comparison. We have processed data from the Genotype Tissue Expression project (GTEx) and The Cancer Genome Atlas (TCGA) and have successfully corrected for study-specific biases, enabling comparative analysis across studies. The normalized data are available for download via GitHub (at https://github.com/mskcc/RNAseqDB).


Introduction
RNA sequencing (RNA-seq) is an important tool for understanding the genetic mechanisms underlying human diseases. A multitude of large-scale studies have recently generated an unprecedented volume of RNA-seq data. For example, The Cancer Genome Atlas (TCGA) has quantified gene expression levels in >8000 samples from >30 cancer types. On a similar scale, the Genotype Tissue Expression (GTEx) project [1], [2], has catalogued gene expression in >9,000 samples across 53 tissues from 544 healthy individuals.
These resources offer a unique opportunity to gain better insight into complex human diseases. However, the integrative analysis of these data across studies poses great challenges, due to differences in sample handling and processing, such as sequencing platform and chemistry, personnel, details in the analysis pipeline, etc. For example, the RNA-seq expression levels of the majority of genes quantified are in the range of 4-10 (log2 of normalized_count) for TCGA, and 0-4 (log2 of RPKM) for GTEx (Fig. S1A), a consequence of the use of different analysis pipelines. This makes gene expression levels from the two projects not directly comparable.  [20]. A recently published pipeline, Toil [10], attempts to unify RNA-seq data from different sources by uniformly processing raw sequencing reads. However, Toil does not remove batch effects that are introduced by sources other than the differences in read alignment and quantification. To take full advantage of the large volume of available RNA-seq data, an integrative RNA-seq resource is still urgently required.
Here, we present a pipeline for processing and unifying RNA-seq data from different studies. By unifying data from GTEx and TCGA, we provide reference expression levels across the human body for comparison with the expression levels found in human cancer. Our method removes batch effects by uniformly reprocessing RNA-seq data. Specifically, we used raw sequencing reads of the RNA-seq samples downloaded from GTEx and TCGA, realigned them, re-quantified gene expression, and then removed biases specific to each study. R e s u l t s Our analysis pipeline included realignment of raw reads, removal of degraded samples, expression quantification, and batch effect processing (Fig. 1, see Methods).
To allow proper batch bias correction, we processed only samples from tissues that were studied by both GTEx and TCGA (Table 1). Tissues with no or insufficient numbers of normal samples available in TCGA (e.g., sarcoma, ovarian cancer, melanoma) were not processed (Table S1).
We downloaded and processed raw paired-end RNA-seq data from 10,366 samples, including 2,790 from GTEx and 7,576 from the TCGA project (Table 1). 831 samples (8%) exhibited 5' degradation (as described previously [11]) and were excluded from further analysis. We also discarded samples with low alignment rates and samples not used in the final GTEx study, resulting in a total of 9109 (89%) high-quality samples for further analysis.
To correct for batch biases, we first created a sample-gene matrix for each tissue-tumor pair by merging gene expression levels of the corresponding GTEx and TCGA samples. Regardless of the actual batch that a sample belonged to in an RNA-seq experiment, we treated all GTEx samples as one batch and TCGA samples as another. Then, we ran ComBat in the R package SVAseq [12] to correct for non-biological variation accounting for unwanted differences between GTEx and TCGA samples of a particular tissue type (see Methods).
To examine how well our pipeline was able to correct study-specific batch effects, we systematically compared the effects of uniform realignment, expression quantification, and batch effect correction for three tissues: bladder, prostate and thyroid. When using expression levels reported by the TCGA and GTEx projects, even after applying upperquartile normalization to bring expression levels into comparable ranges (Fig. S1B), samples from the same study were more similar to each other than samples from the same tissue, as shown by PCA analysis ( Fig. 2A). This result indicates the necessity to uniformly reprocess RNA-seq samples.
However, uniform realignment and expression quantification using our pipeline did not fully resolve these differences; while the first principal component was now the tissue, the second principal component was still defined by the source (Fig. 2B), indicating that study-specific biases still accounted for significant variation in RNA-seq expression levels within each tissue type. This result shows that consistent realignment and expression quantification alone are not sufficient, and that further study-specific batch effects need to be removed in order to be able to compare expression data from TCGA and GTEx.
To this end, we next added a batch-effect correction step to our pipeline, using ComBat [12] (see Methods), which successfully corrected our example data and resulted in clustering by tissue type (Fig. 2C).
To determine whether uniform alignment and expression quantification was an essential step, or whether batch effect removal via ComBat by itself was sufficient, we also applied ComBat directly to the level 3 data from GTEx and TCGA (GTEx-quantified data was rescaled using quantile normalization). We found that batch effect removal by itself is not sufficient, and that the combination of uniform processing of sequencing reads followed by additional batch effect removal is required to make data from the TCGA and GTEx projects comparable (Fig. S2). We validated the expression similarities observed in the principal component analysis through hierarchical clustering (Fig. 3).
Our results demonstrate that uniform realignment and expression quantification, together with explicit correction for study-specific biases, are not only effective, but also necessary for removing batch effects and making samples from different studies comparable.
Finally, we examined the expression levels of three cancer driver genes, ERBB2, IGF2, and TP53, in our batch-effect corrected data (Fig. 4). ERBB2 expression was significantly higher in a subset of tumor samples, consistent with the frequent amplifications observed in various tumor types. IGF2 showed a similar pattern, with a subset of tumor samples expressing the gene at levels several orders of magnitude higher than those in normal samples. TP53, on the other hand, is often affected by truncating mutations in cancer, which leads to decreased levels of RNA due to nonsense-mediated decay, an effect that is visible in the normalized RNA data.
The data generated using our pipeline has been deposited into GitHub at https://github.com/mskcc/RNAseqDB.
Recent large-scale studies, such as TCGA and GTEx, have resulted in an unprecedented volume of RNA-seq data. While these data offer new opportunities to identify the mechanisms underlying disease, the comparison of data from different sources poses a great challenge, due to batch effects inherent in the data from these studies.
Here, we describe a pipeline for correcting and integrating RNA-seq data across studies. Our pipeline starts with raw sequencing reads, performs uniform alignment and read quantification, and then removes study-specific batch effects. We have applied our pipeline to two of the largest studies in the field: GTEx and TCGA, processing 2790 normal RNA-seq samples from GTEx as well as 7576 samples from TCGA. Our results show that our pipeline is able to correct the biases specific to GTEx and TCGA, and, thus, make the samples from the two projects comparable.
Further efforts will be required to process samples for which there were not sufficient normal samples in the TCGA project, as well as GTEX samples for which there is no corresponding tumor type in the TCGA project. Our approach currently relies on the presence of normal samples in both studies that need to be integrated.
The data created has been deposited to GitHub and will be made accessible through the cBioPortal for Cancer Genomics [13], [14], so that investigators can conveniently mine the data and conduct integrative analyses. The resulting resource will benefit the research of cancer and other human diseases, as the pipeline can be used for the integration of RNA-seq data from other sources.

RNA-seq data
Raw paired-end reads of the RNA-seq samples for the TCGA project were retrieved from the Cancer Genomics Hub (CGHub, https://cghub.ucsc.edu). When FASTQ files were not available, e.g. for stomach adenocarcinoma, we downloaded aligned sequence reads (in BAM format) and extracted reads from BAM files with the Java program ubu.jar (https://github.com/mozack/ubu) before processing samples using our pipeline. GTEx samples were downloaded from the Database of Genotypes and Phenotypes (dbGaP, http://www.ncbi.nlm.nih.gov/gap), which hosts >9,000 RNA-seq samples (in SRA format) for the GTEx study.

Analysis pipeline
We employed STAR aligner [15], a fast accurate alignment software used widely in the NGS community, to map reads to UCSC human reference genome hg19 and reference transcriptome GENCODE (v19), using recommended parameters, e.g. '--outFilterType BySJout' and '--outFilterMultimapNmax 20', etc., which are also standard options of the ENCODE project for long RNA-seq pipeline. Samples with alignment rates less than 40% were excluded from further analysis. The detailed parameters we used to run STAR and the codes of our pipeline are available at GitHub (https://github.com/mskcc/RNAseqDB). To verify mRIN's performance on other tissues, we manually examined coverage uniformity over gene bodies for other tissues using the tool RseQC [16] and compared it with mRIN scores. We calculated the number of reads covering each nucleotide position and the average coverage for all long genes (>4000nt). Fig. S4 shows the average coverage for TCGA prostate and bladder samples, each curve representing gene body coverage of a sample. In Fig. S4A, the 4 samples with the most uneven coverage are the ones deemed degraded in Fig. S3. We made similar observations in the other tissues examined, e.g. bladder in Fig. S4B, where the samples with the most imbalanced gene body coverage were the ones with the lowest mRIN scores. These results confirmed that mRIN is capable of measuring degradation for other tissues.
When running STAR, we specified an option '--quantMode TranscriptomeSAM' to make STAR output a file, Aligned.toTranscriptome.out.bam, which contains alignments translated into transcript coordinates. This file was then used with RSEM [18] to quantify gene expression. The program "rsem-calculate-expression" in the RSEM package requires strand specificity of the RNA-seq sample, which is estimated using RseQC [16].
We also used another transcript quantification tool FeatureCounts[19] to generate integerbased read counts. Overall, the output from FeatureCounts was highly consistent with that of RSEM (Spearman correlation > 0.95). However, for genes with multi-mapping reads (i.e., reads mapped to multiple genes), FeatureCounts differs from RSEM and tends to underestimate expression levels in comparison with RSEM (because it discards multimapping reads). For example, the transcript from the PGA3 gene, which encodes human pepsinogen A enzyme that is highly abundant in the stomach, is identical to the transcripts of two other genes, PGA4 and PGA5. Its measurement in stomach by FeatureCounts (in default settings) is generally lower than that by RSEM (see Fig. S5). In our analysis below, we primarily used results by RSEM.
To ensure that TCGA normal samples remain comparable with TCGA tumors after removing batch biases from the normals, we also included TCGA tumors in our samplegene matrix, which were processed in the same way as the normals using our pipeline from raw sequencing reads. In Table S2, we used bladder and lung as examples to show the parameters we used to run ComBat. As indicated in Table S2, we treated all TCGA samples, both tumors and normals (of the same tissue type) as one batch. To prevent ComBat from suppressing tumor-specific signals, we created a model to specify each sample to be either 'normal' or 'tumor', with which to run ComBat (see configuration file at https://github.com/mskcc/RNAseqDB/blob/master/configuration/tissue-conf.txt).
To perform principal components analysis, we firstly remove genes with invariant expression levels and then log 2 -transform sample-gene matrix. Next, we utilize an R function 'prcomp' (with the 'center' option set to TRUE) to do principal components analysis. The two-dimensional PCA plot is created using an R function 'autoplot'.

Hierarchical clustering
For hierarchical clustering of expression data, we used the R function Heatmap.3 using default parameters (e.g., distance: euclidean, hierarchical clustering method: ward, etc.) as well as the top 1000 most variable genes in the data matrix.      Figure S2. PCA plot after applying quantile normalization and ComBat to the level 3 data of the 3 tissues, bladder, prostate, and thyroid, from GTEx and TCGA Figure S3. Comparing mRIN with RNA Integrity Number (RIN) calculated by TCGA for prostate cancer samples. The horizontal line at -0.11 is our cutoff. A sample with mRIN < -0.11 is deemed degraded. TCGA prostate cancer group used 7.0 (vertical line) as cutoff. Figure S4. Gene body coverage of the TCGA prostate and bladder samples. Each curve in the figure represents average coverage of genes (from 5' to 3') in a sample. To ease visual examination, only long genes (>4000nt) were used in the calculation of the coverage and only the normal samples were plotted. Figure S5. Expression of gene PGA3 in six tissues. Gene expression in (a) and (b) were quantified using FeatureCounts and RSEM, respectively. The same set of GTEx and TCGA (both tumor and normal) samples was used to compare FeatureCounts and RSEM for each tissue type.   T  E  x  C  o  n  s  o  r  t  i  u  m  ,  "  T  h  e  G  e  n  o  t  y  p  e  -T  i  s  s  u  e  E  x  p  r  e  s  s  i  o  n  (  G  T  E  x  )  p  i  l  o  t  a  n  a  l  y  s  i  s  :  M  u  l  t  i  t  i  s  s  u  e  g  e  n  e  r  e  g  u  l  a  t  i  o  n  i  n  h  u  m  a  n  s  ,  "   S  c  i  e  n  c  e   ,  v  o  l  .  3  4  8  ,  p  p  .  6  4  8  -6  6  0  ,  2  0  1  5  .  [  2  ]  G  T  E  x  C  o  n  s  o  r  t  i  u  m  ,  "  T  h  e  G  e  n  o  t  y  p  e  -T  i  s  s  u  e  E  x  p  r  e  s  s  i  o  n  (  G  T  E