Uncertainty in RNA-seq gene expression data

RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets. Here, using data from five RNA-seq processing pipelines applied to 6,690 human tumor and normal tissues, we show that for >12% of protein-coding genes, in at least 1% of samples, current best-in-class RNA-seq processing pipelines differ in their abundance estimates by more than four-fold using the same samples and the same set of RNA-seq reads, raising clinical concern.

Several of the above data sources are only available in units of fragments per kilobase of transcript per million mapped reads (FPKM) 1 . However, FPKM counts do not take transcript lengths into account and are thus sensitive to differences in reference transcriptome annotations (Supplementary Table 1), as illustrated in Fig. 1a,b. To overcome this issue, we converted all data sets to units of transcripts per million (TPM) for a common set of genes 1 (see Online Methods).
After the above unit conversion, the different versions of TCGA and GTEx data appear highly similar at the whole-genome level (Fig. 1c,d and Supplementary Fig. 1,2), and inter-tissue differences become more pronounced than data-source effects (Fig. 1e,f).
Moreover, inter-batch differences do not appear to confound comparisons within either TCGA or GTEx data sets, irrespective of data source (Supplementary Fig. 3,4), although comparisons between TCGA and GTEx samples may still be subject to batch effects 2 ( Supplementary Fig. 5).
To quantify inter-pipeline differences not related to batch-effects, we analyzed data from the TCGA and GTEx projects separately and compared only pipelines that do not include batch-effect correction (five sources for TCGA; four sources for GTEx). We searched for genes whose expression in a given sample is >32 TPM according to one pipeline (suggesting high expression), but <8 TPM (i.e. >4-fold lower) in the same sample according to at least one other pipeline. To ensure that such large differences between abundance estimates are not due to a small number of outlier samples, we further required that the expression of a gene should be discordant (i.e. meet the above criteria) in at least 1% of all samples (48 samples for TCGA, 19 samples for GTEx). We refer to genes that meet all the preceding criteria as "discordantly quantified" genes.
Across all TCGA and GTEx data sets, of 16,738 genes analyzed, 2,068 genes (12.36%) are discordantly quantified (Supplementary Table 2a). It should be noted that we imposed very stringent criteria in our identification of discordantly quantified genes. In particular, three and two-fold expression differences between pipelines affect many more genes (Supplementary Tables 2b,c).
The relative expression of a given gene between two samples may be expected to be more comparable across pipelines 3 . However, of the above 2,068 discordantly quantified genes, 1,958 genes (~ 95%) have >2-fold inter-pipeline differences in fold-change estimates for the same sample pairs (Fig. 2b and  Our findings are consistent with those of the Sequencing Quality Control Consortium, which found that absolute RNA-seq abundance estimates were generally not trustworthy, and that reliable differential expression analysis was only feasible in less than two-thirds of the genome 3 . However, whereas SEQC compared expression data generated using multiple experimental and computational protocols, the discordant expression levels and fold differences reported here are for exactly the same RNA-seq reads, and arise entirely from differences in data processing pipelines. Of note, many discordantly quantified genes have divergent expression values in large numbers of samples (>500, Fig. 2c,d and Supplementary Table 4). Importantly, the observed discrepancies are not attributable to a particular subset of processing pipelines or a particular subset of samples. Even the two pipelines with the greatest level of agreement (MSKCC and Xena/TOIL in the figure), still include multiple genes that are discordantly quantified in large numbers of samples.
For most discordantly quantified genes, differences between pipelines arise in a variety of ways. As an example, Fig. 2e shows the expression pattern of an example discordantly quantified gene (the splicing regulator U2 small nuclear RNA auxiliary factor 1 (U2AF1), which is frequently mutated in Myelodyplastic Syndrome 4 ) in five versions of TCGA data (see Supplementary Fig. 7a,b for additional examples). We note that in some cases, estimates differ by a scaling factor. In other cases, U2AF1 is essentially not expressed according to one pipeline, but highly expressed according to another pipeline. In yet other cases, abundance estimates from two pipelines are simply poorly correlated. These large uncertainties in abundance estimates pose a significant challenge to biomedical research.
For example, more complex genome annotations can increase the numbers of unmapped and multi-mapped reads 5 . Indeed, 240 of our 2,068 discordantly quantified genes (11.61%) are known to be frequently affected by multi-mapping reads (Supplementary In addition to poor inter-pipeline correlations in mRNA abundance (Supplementary Fig.   8a,b), for approximately half of the TCGA genes with available protein abundance data, mRNA and protein levels show remarkably low levels of correlation (Supplementary Fig.   8c, Supplementary Tables 6a,b,c) raising further concerns regarding the use of RNAseq data for biomarker and target discovery.
The bioinformatics pipelines compared here represent best-in-class efforts by leading research teams, and utilize well-established, widely used methods. The differences among these pipelines arise from diverse implementation choices including statistical and algorithmic methods, software versions, and run-time parameters.
The discordant abundance and fold-change estimates revealed here do not imply any technical errors. Rather they highlight inherent uncertainty in processing noisy and complex data. Nonetheless, the end result is that for the discordantly quantified genes reported here (Supplementary Table 2a), we can have little confidence in the abundance estimates produced by any RNA-seq processing pipeline. For critical applications such as biomedical research and clinical practice, a concerted, community-wide effort will be needed to develop gold-standards for estimating the mRNA abundance of these genes. Author Contributions: All analyses were performed by SA and HB and discussed with SSP and ECH. All authors contributed to the final manuscript.

Competing Interests: Authors declare no competing interests.
Data and materials availability: all data and scripts used in this manuscript are freely available via: https://github.com/sonali-bioc/UncertaintyRNA .

Online Methods
All analyses were performed in R (https://www.r-project.org/) using Bioconductor (https://www.bioconductor.org/) packages. To maximize transparency and reproducibility, we have deposited all scripts, associated data, and a large number of additional plots in a Github repository: https://github.com/sonali-bioc/UncertaintyRNA. Data sources RNA Seq gene expression data was downloaded from five and four different sources for TCGA and GTEx respectively (Supplemental Table 1). Each source contained different numbers of genes and samples, thus we included only shared protein coding genes and samples found in every source in our analysis. The batch information for Sequencing Center, Tissue Source Site (TSS) and Plate ID were downloaded from https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables. The nucleic acid isolation batch, genotype and expression batch data for GTEx samples were downloaded from the GTEx website (https://gtexportal.org/home/datasets).

Conversion of abundance estimates to Transcripts Per Million (TPM):
All data sources except Xena/Toil provided FPKM RNA Seq gene expression data. For consistency, we converted all FPKM gene expression data to TPM data using the formula in 1 .

Principle Component Analysis (PCA):
Principle Component values were generated using the R function prcomp() using all genes and visualized with R package ggplot2.

Discordantly quantified samples:
For a gene to be discordant, its expression in at least one data set should be more than 32 TPM (i.e. log2 TPM more than 5) and the log2 fold change should be more than 2 (i.e. >4-fold difference in expression).

Discordant fold changes:
For all sample pairs within each data source, expression fold changes were calculated for discordant genes (as defined above) and compared with fold change differences across other data sources.
TCGA-GTEx batch effects comparison (Suppl. Fig. 5): Four pipelines provide both TCGA and GTEx data. To visualize potential batch effects between GTEx and TCGA data, we applied Principal Component Analysis to expression data for Stomach/STAD, Liver/LIHC and Thyroid/THCA samples from each pipeline in a manner similar to 2 . In addition to these within pipeline comparisons, we also compared the original versions of the TCGA GDC data and the GTEx (v6) data using the same approach. protein levels and gene expression using rcorr() for only those genes for which we had both protein and gene expression data.