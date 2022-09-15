TCGA RNA-seq datasets

The TCGA Research Network generated RNA-seq data from ~11,000 tumor and normal tissue samples obtained from 33 cancer types. Several steps in sample handling are potential sources of unwanted variation: fresh-frozen tissue samples were collected from tissue source sites (TSSs), allocated to 96-well sequencing plates (hereafter called plates) and processed at various times (Supplementary Table 1). Some TCGA RNA-seq datasets, such as uveal melanoma and kidney chromophobe, were generated using a single plate. In general, plates are completely confounded with times, making it difficult to distinguish plate effects from time effects. There are also formalin-fixed, paraffin-embedded samples among the TCGA RNA-seq samples, and these were excluded from the data discussed here. Low-quality samples and lowly expressed genes were also excluded from individual datasets before the analyses in this paper (Methods). The TCGA RNA-seq datasets are available in the form of raw gene counts, fragments per kilobase of transcript per million mapped reads (FPKM) and FPKM followed by upper-quartile normalization (FPKM.UQ).

Library size, tumor purity and plate effects are major sources of unwanted variation across TCGA RNA-seq datasets

We first considered the role of sample RNA-seq library size as a source of unwanted variation. Ideally, the gene-level counts should have no significant association with library size variation in a well-normalized dataset (Fig. 1a). Consequently, any downstream analysis, including dimensional reduction, gene co-expression and differential expression, should also not be influenced by library size variation.

Fig. 1: Unwanted variation in individual TCGA RNA-seq datasets. a, Illustrative examples showing data with and without unwanted variation. Data with unwanted variation exhibit high correlation between the first five PCs and this variation (top left). Data without unwanted variation have low correlation with unwanted variation (bottom left). The histograms show Spearman correlations and log 2 F-statistics between individual genes and different sources of unwanted variation. Data with large library size and tumor purity variation show high Spearman correlations between individual gene expression and this variation. Data with plate effects exhibit high F-statistics obtained from ANOVA between individual gene expression and plates as factor. In contrast, data without such unwanted variation show low Spearman correlations and F-statistics. b, Distribution of (log 2 ) library size colored by years for the individual TCGA cancer types. The year information was not available for the LAML RNA-seq study. The library sizes are calculated after removing lowly expressed genes for each cancer type. c, R2 obtained from linear regression between the first, first and second, and so on, cumulatively to the fifth PC and library size (first panel), tumor purity (second panel) and RLE medians (third panel) in the raw count, FPKM and FPKM.UQ normalized datasets. The fourth panel shows the vector correlation between the first five PCs cumulatively and plates in the datasets. Ideally, we should see no significant associations between PCs and sources of unwanted variation. Gray color indicates that samples were profiled across a single plate. d, Spearman correlation coefficients between individual gene expression levels and library size (first panel), tumor purity (second panel) and the RLE medians (third panel) in the datasets. The fourth panel shows log 2 F-statistics obtained from ANOVA of gene expression levels by the factor: plate variable. 
Plates with fewer than three samples were excluded from the analyses. ANOVA was not possible for cancer types whose samples were profiled using a single plate.

For most TCGA RNA-seq studies, library sizes vary greatly both within and between years (Fig. 1b). The first five principal components (PCs) cumulatively are strongly associated with (log) library size in the raw gene counts (Fig. 1c, first panel). The FPKM and FPKM.UQ normalizations reduced the effects of library size, but they showed shortcomings (high correlation between PCs and library size) in several cancer types (Fig. 1c, first panel). For each cancer type, the association between individual gene-level counts and library size was quantified using Spearman correlation (Fig. 1d, first panel, and Supplementary Fig. 2a). The results show that a large proportion of genes have high positive correlations with library size in the raw gene count datasets. However, in these datasets, there are reasonable numbers of genes whose expression levels have no correlation or a negative correlation with library size (Fig. 1d, first panel), and these genes present a challenge for the standard RNA-seq normalizations. Supplementary Fig. 1 shows that the association between gene-level raw counts and library size is partially explained by average gene expression level and is never constant. The FPKM and FPKM.UQ normalizations introduce or exacerbate library size effects in genes whose expression has no or negative association with this variation. This will be discussed in more detail for the rectum adenocarcinoma (READ) and colon adenocarcinoma (COAD) RNA-seq datasets.
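The gene-level assessment described above can be sketched as follows. This is a minimal illustration, not the analysis code used for Fig. 1d: the gene labels and counts are invented, library size is taken as the per-sample sum over only three toy genes, and the helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented raw counts for three genes across five samples.
counts = {
    "gene_up": [10, 22, 29, 41, 52],
    "gene_flat": [7, 5, 8, 5, 6],
    "gene_down": [30, 24, 19, 11, 4],
}
n_samples = 5
# Toy library size: sum of counts per sample.
library_size = [sum(c[j] for c in counts.values()) for j in range(n_samples)]

rho = {gene: spearman(c, library_size) for gene, c in counts.items()}
for gene, value in rho.items():
    print(f"{gene}: rho = {value:+.2f}")
```

In this toy example one gene tracks library size, one is essentially unrelated to it and one moves in the opposite direction, which mirrors the three kinds of behavior reported for the raw counts.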

Next, we used linear regression and Spearman correlation analyses to quantify the variation in tumor purity in the TCGA RNA-seq datasets (Fig. 1c, second panel, and Fig. 1d, second panel). The results indicate the presence of substantial variation in tumor purity, and FPKM and FPKM.UQ normalizations cannot correct for this in the datasets (Fig. 1c, second panel, and Supplementary Fig. 2b). We discuss how the tumor purity variation can compromise downstream analyses, including gene co-expression and subtype identification, as was observed in the TCGA breast invasive carcinoma (BRCA) RNA-seq data.

In most TCGA RNA-seq studies, biospecimens were necessarily profiled across multiple plates, which can affect gene expression levels. Vector correlation and analysis of variance (ANOVA) (Methods) reveal the presence of plate effects in the raw gene counts, FPKM and FPKM.UQ normalized datasets (Fig. 1c, fourth panel, and Fig. 1d, fourth panel). We found that the major known biological populations are well-distributed across plates in TCGA READ, COAD, lung adenocarcinoma and BRCA RNA-seq data, showing the absence of large confounding effects in the data.
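As a rough sketch of the per-gene ANOVA used to detect plate effects, the following computes a one-way F-statistic of one gene's expression by plate. The plate labels and expression values are invented, the function name `anova_f` is hypothetical, and the `min_samples` filter simply mirrors the statement in the Fig. 1 legend that plates with fewer than three samples were excluded.

```python
from math import log2
from statistics import mean


def anova_f(values, groups, min_samples=3):
    """One-way ANOVA F-statistic of `values` by the labels in `groups`.

    Groups with fewer than `min_samples` observations are dropped.
    At least two groups must remain, otherwise the statistic is undefined.
    """
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    kept = {g: vs for g, vs in by_group.items() if len(vs) >= min_samples}
    all_values = [v for vs in kept.values() for v in vs]
    grand = mean(all_values)
    k, n = len(kept), len(all_values)
    ss_between = sum(len(vs) * (mean(vs) - grand) ** 2 for vs in kept.values())
    ss_within = sum((v - mean(vs)) ** 2 for vs in kept.values() for v in vs)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


plates = ["P1"] * 3 + ["P2"] * 3 + ["P3"] * 3

# Invented log-expression values: the first gene shifts with plate,
# the second varies just as much but independently of plate.
gene_plate_effect = [5.0, 5.2, 4.8, 7.0, 7.1, 6.9, 9.0, 9.2, 8.8]
gene_no_effect = [5.0, 7.0, 9.0, 5.2, 7.1, 9.2, 4.8, 6.9, 8.8]

f_effect = anova_f(gene_plate_effect, plates)
f_none = anova_f(gene_no_effect, plates)
print(f"log2 F with plate effect:    {log2(f_effect):.2f}")
print(f"log2 F without plate effect: {log2(f_none):.2f}")
```

Repeating this for every gene gives the distribution of log2 F-statistics summarized per dataset; a distribution shifted toward large values indicates plate effects.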

Finally, we examined the medians of relative log expression (RLE)23 for the raw count and TCGA normalized datasets (Methods). In the absence of unwanted variation, the RLE medians should be centered around zero, so any deviation from zero indicates the presence of unwanted variation in the data. Supplementary Fig. 3 illustrates that the RLE medians of the raw count datasets deviate greatly from zero, which further confirms the presence of unwanted variation. We then investigated the associations between the first five PCs cumulatively and the RLE medians (Fig. 1c, third panel) and also computed the Spearman correlation between individual gene expression with the RLE medians for each cancer type (Fig. 1d, third panel). Ideally, we should see no associations; however, we see many associations in the raw counts and the FPKM and FPKM.UQ normalized datasets. We will demonstrate the importance of scrutinizing the association between the RLE medians and both principal component analysis (PCA) and individual gene expression in the TCGA breast cancer RNA-seq data.
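A minimal sketch of the RLE median computation is given below, assuming the usual definition: each gene's log expression is centered on its median across samples, and the median of these deviations is then taken within each sample. The counts are invented, and any pseudo-count or filtering used in the Methods is omitted here.

```python
from math import log2
from statistics import median


def rle_medians(log_expr):
    """Per-sample RLE medians.

    log_expr: list of gene rows, each a list of log2 expression per sample.
    """
    deviations = []
    for row in log_expr:
        gene_median = median(row)
        deviations.append([v - gene_median for v in row])
    n_samples = len(log_expr[0])
    return [median([dev[j] for dev in deviations]) for j in range(n_samples)]


# Invented counts for three genes across four samples; the fourth sample
# was sequenced twice as deeply, so every count is doubled.
raw_counts = [
    [8, 8, 8, 16],
    [32, 32, 32, 64],
    [4, 4, 4, 8],
]
log_expr = [[log2(c) for c in row] for row in raw_counts]

medians = rle_medians(log_expr)
print(medians)  # the deeper sample deviates from zero by one log2 unit
```

The deviation of the fourth sample from zero reflects only its larger library size, which is the kind of signal the RLE medians are meant to expose.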

Taken together, our results show that all the TCGA RNA-seq datasets, both raw and normalized, are greatly affected by the three major sources of unwanted variation. Next, we used the READ, COAD and BRCA RNA-seq datasets to illustrate the effects of unwanted variation on certain downstream analyses and show the performance and effectiveness of RUV-III with PRPS for these datasets. The details of each study are provided separately below.

TCGA READ RNA-seq study

Study outline

The READ RNA-seq study involved 176 assays generated using 14 plates over 4 years. The RNA-seq library sizes vary greatly between samples profiled in 2010 and the other samples (Supplementary Fig. 4). The major gene-expression-based biological populations—consensus molecular subtypes (CMSs)24—were identified using the R package CMScaller25 (Methods) in the data normalized by different methods. See Supplementary Figs. 5 and 6 and the Supplementary File for further details. These subtypes will be used for both assessing the performance of normalization methods and creating PRPS for RUV-III normalization.

RUV-III removes substantial library size variation and plate effects from the data

Substantial library size variation between samples profiled in 2010 and the other samples is clearly visible in the RLE and PCA plots (Supplementary Fig. 7a and Fig. 2a, top row) of the raw count data. Although the FPKM and FPKM.UQ normalizations reduced this variation, both methods exhibited shortcomings, for example by not fully mixing samples with large library size differences (Fig. 2a, top row).

Fig. 2: Performance assessment of different normalizations on the TCGA READ RNA-seq data. a, Top row: scatter plots of first two PCs for raw counts, FPKM, FPKM.UQ and RUV-III normalized data colored by key time intervals (2010 versus 2011–2014). Bottom row: same as the top row colored by the CMS. The CMSs were obtained for each dataset separately. b, Top: a plot showing the R² of linear regression between library size and up to the first five PCs (taken cumulatively). Bottom: violin plots of Spearman correlation coefficients between the gene expression levels and library size for individual data. c, Top: the frequency of P < 0.05 obtained from DE analysis between samples with low and high library size. Bottom: a scatter plot showing silhouette coefficients and ARI for mixing samples from two different key time intervals. d, Top: a plot showing the vector correlation coefficient between plates and the first five PCs within each time interval. Bottom: box plots of log2 F-statistics obtained from ANOVA within each key time interval for gene expression with plate as a factor. e, Top: a plot showing the vector correlation coefficient between CMS subtypes and up to the first five PCs. Bottom: a scatter plot showing silhouette coefficients and ARI for measuring the separation of CMS subtypes.

PCA plots and linear regression between the first five PCs cumulatively and library size clearly illustrate that RUV-III with PRPS improved upon the FPKM and FPKM.UQ normalizations in removing the variation in library size from the data (Fig. 2a, top row, and Fig. 2b, top plot).

Spearman correlation analyses between the individual gene expression values and library size reveal a large proportion of genes showing strong positive or negative correlations with library size in the FPKM and FPKM.UQ normalized datasets, whereas this correlation was significantly reduced in the RUV-III normalized data (Fig. 2b, bottom). Furthermore, differential expression (DE) analysis (Methods) was performed between samples with high and low library size. Ideally, we should see little evidence of differential gene expression, whereas we see a lot in the FPKM and FPKM.UQ datasets, far more than in the RUV-III normalized data (Fig. 2c, top). Finally, the silhouette coefficient and adjusted Rand index (ARI) analyses (Methods) showed that RUV-III performs better in mixing the samples with large library size differences (Fig. 2c, bottom).

To examine plate effects and separate this variation from the large library size variation in the data, we performed our evaluation within each key time interval. The results showed that RUV-III clearly improves over the FPKM and FPKM.UQ normalizations in removing plate effects from the data (Fig. 2d).

Note that here we have not attempted to remove variation caused by tumor purity. Consequently, the tumor purity estimates obtained from the RUV-III and FPKM.UQ normalized data were highly correlated (Supplementary Fig. 7b). This illustrates the ability of RUV-III to remove only the variation that the user wants to remove and no more, that is, to retain other variation that is of biological origin.

We next explored the relationship between RLE medians and both library size and tumor purity—the two major variations in the data—for the different normalizations (Supplementary Fig. 7c). The library size variation is the largest variation in the raw counts data, and RLE medians are strongly associated with this variation. The TCGA and RUV-III normalizations reduced the variation in the library size; therefore, the tumor purity became the largest variation in these datasets. Then, the RLE medians of the TCGA and RUV-III normalized data show a strong association with tumor purity (Supplementary Fig. 7d). These results were further supported by comparisons of the Spearman correlation analyses between the individual gene expression levels and RLE medians with the same analyses between the individual gene expression levels and library size and with tumor purity (Supplementary Fig. 8). Together, these results show the value of exploring the association of the RLE medians with known sources of unwanted variation in the data. Later, we will show that the RLE medians have no correlation with gene expression in the TCGA BRCA RUV-III normalized data when variations in both library size and tumor purity are removed.

RUV-III improves the separation between consensus molecular subtypes

Colorectal cancers are classified into four transcriptomic-based subtypes—CMSs—with distinct features24. PCA plots of the RUV-III normalized data show distinct clusters of the CMSs for the READ RNA-seq samples, whereas these subtypes are not as clearly separated in the TCGA normalized datasets (Fig. 2a, bottom row). To confirm the pattern of the CMS clusters in the PCA plots of the RUV-III normalized data, we applied PCA within the key time intervals in the FPKM and FPKM.UQ normalized datasets. The results show that the CMS clustering within each time interval in the FPKM.UQ data is highly consistent with that obtained with RUV-III using the full set of data (Supplementary Fig. 9).

Furthermore, the vector correlation analysis between the first five PCs cumulatively and the CMS confirmed that the RUV-III normalization leads to a better separation of the CMS clusters than the TCGA normalizations (Fig. 2e, top). These results were strengthened by silhouette coefficient and ARI analyses (Fig. 2e, bottom). Additionally, gene set enrichment analyses showed that the CMSs obtained from the RUV-III normalized data are associated with known gene signatures25 (Supplementary Fig. 6b). Supplementary Fig. 10 shows the Kaplan–Meier survival plots of the CMSs identified by different normalization methods. The survival outcome difference between CMS2 and CMS4 obtained from the RUV-III normalized data is clearer than that in the TCGA normalized datasets (Supplementary Fig. 10).

RUV-III improves gene co-expression and gene-level survival analyses

Unwanted variation introduced by the large sample library size differences can compromise downstream analyses, such as gene co-expression and gene-level survival analyses, in the TCGA READ RNA-seq data. This variation can have two effects on gene co-expression analysis. First, it can lead to apparent correlations between genes that are most likely uncorrelated. For example, the correlation between the TMF1 (TATA element modulatory factor 1) and BCLAF1 (Bcl-2-associated transcription factor 1) genes is ρ = 0.8 and ρ = 0.7 in the TCGA FPKM and FPKM.UQ normalized data, respectively. The role of the TMF1 gene has not been characterized in COAD, although the BCLAF1 gene shows a pro-tumorigenic role in this cancer type26. One might suggest that the TMF1 gene expression may have a role in tumorigenesis in colon cancer due to its high correlation with the BCLAF1 gene expression. However, we see no such correlation in the RUV-III normalized data, which is consistent with the correlation obtained from an independent platform, namely the TCGA READ microarray data (Fig. 3a). Second, the unwanted variation can obscure correlations between the expression levels of genes that are likely to be truly correlated. For example, the overall correlation between the MDH2 (malate dehydrogenase 2) and EIF4H (eukaryotic translation initiation factor 4H) genes is ρ = −0.05, whereas they exhibit a high correlation within each key time interval in the TCGA normalized data (Fig. 3a). The overall correlation of these genes was 0.7 in the RUV-III normalized data, consistent with what was seen in the TCGA READ microarray data (Fig. 3a). The MDH2 and EIF4H genes show important roles in cancer growth and metastasis; thus, they are of clinical importance for cancer treatment27,28. The high correlation between these two genes revealed by RUV-III may suggest that they are involved in a co-expression network, which has not been previously reported.

Fig. 3: Gene co-expression analyses of TCGA READ RNA-seq data using different normalizations. a, First row: scatter plots of the gene expression levels of the TMF1 and BCLAF1 genes in the TCGA READ raw counts and differently normalized datasets. The red line shows overall association, and the cyan and olive lines show associations between the gene expression within 2010 samples and within the rest of the samples, respectively. Second row: same as the first row, for MDH2 and EIF4H gene expression. b, The correlation matrix of expression levels of the genes with the 500 highest correlations with library size in the FPKM.UQ. The first plot is obtained using FPKM.UQ, and the second plot is obtained using the RUV-III normalized data. The colored bar along the top shows the correlation of individual genes with library size. The order of rows and columns is the same in both correlation matrices. c, Differences (ρ_microarray – ρ_RNA-seq) of Spearman correlation coefficients for all possible gene–gene pairs calculated using the TCGA READ microarray and both the FPKM.UQ and RUV-III normalized RNA-seq data.

We extended this analysis to all possible gene–gene correlations of the genes that have the highest correlation with library size in the FPKM.UQ normalized data (Fig. 3b). Strikingly, the results show numerous strong but likely spurious correlations between gene pairs in the FPKM.UQ normalized data, whereas using RUV-III significantly reduced these correlations (Fig. 3b).

Figure 3c depicts the differences (ρ_microarray – ρ_RNA-seq) between all possible gene–gene Spearman correlations ρ using the TCGA READ microarray data and the FPKM.UQ and RUV-III normalized data.

Association between gene expression and survival outcomes of patients is another downstream analysis that can be influenced by the library size variation in the TCGA READ RNA-seq data. For example, the RUV-III normalized data, as opposed to the TCGA normalized data, revealed that the expression of the RAB18 (Ras-related in brain 18) and FBXL14 (F-box and leucine-rich repeat protein 14) genes is highly associated with overall survival outcome of patients in the data (Fig. 4). The reason is clear from the expression patterns across time: dividing samples based on median expression mainly resulted in two groups with low and high library size, which was not biologically meaningful for the TCGA normalizations (Fig. 4). RAB18 gene expression plays pivotal roles in cell proliferation and metastasis, and high expression is associated with poor survival in different cancer types29. FBXL14 gene expression mediates the epithelial–mesenchymal transition (EMT) in cancer, which indicates that FBXL14 could function as an EMT inhibitor to suppress metastasis in human cancers30. Other examples are PTPN14 and CSGALNACT2, whose associations with survival have been previously shown in colorectal cancer (Supplementary Fig. 11)31. We found a remarkable number of genes whose expression levels were associated with survival using the RUV-III normalized data, which were not found using the FPKM and FPKM.UQ normalized data.

Fig. 4: Association between gene expression and overall survival in differently normalized TCGA READ datasets. a, Upper part: plots of the expression levels of the RAB18 gene across samples. The dashed lines represent the median expression level of the RAB18 gene. Lower part: Kaplan–Meier curves for samples with low (below median) and high (above median) expression of the RAB18 gene. b, As in a for the FBXL14 gene.

Gene-level counts are not proportional to library size

The FPKM and FPKM.UQ normalizations rely on global scale factors computed based on library size or upper quartiles of samples in the raw count data (Fig. 5a) to remove library size effects. These methods assume that gene-level counts are all proportional to the global scale factors. However, we show that, in the READ raw count data, different groups of genes exhibit different relationships to the global scale factors used in the FPKM and FPKM.UQ normalizations (Fig. 5b).

Fig. 5: Relationship between gene-level (log2) counts and (log2) library size in the TCGA READ RNA-seq data. a, Global scale factors obtained by sample library sizes (LS) (left) and upper quartiles (UQ) (right) of READ raw counts versus time. b, Scatter plots of log2 fold change obtained from DE analyses of gene expression levels with the major time variation: 2010 versus 2011–2014; (log2) raw READ counts on the horizontal axes of all plots and differently normalized counts vertically. c, Expression patterns of four genes (DDX23, LARP7, ALKBH7 and TMEM160) whose counts have different relationships with the global scaling factors calculated from the TCGA READ raw count data.

The first group consists of genes whose counts are proportional to the global scale factors. For these genes, the FPKM and FPKM.UQ normalizations are adequate to remove the association between library size variation and gene expression. The DDX23 (DEAD-box helicase 23) gene is an example from this group (Fig. 5c, first row). The second group includes genes whose expression levels are greater than those expected using the global scaling factors, and so those factors are insufficient for adjusting their expression levels to be independent of library size. The LARP7 (La ribonucleoprotein 7) gene represents the behavior of genes in this group (Fig. 5c, second row). The third group contains genes such as ALKBH7 (AlkB homolog 7), whose expression levels are not associated with library size in the raw count data. Then, the FPKM and FPKM.UQ normalizations introduce the library size variation to the expression levels of genes in this group (Fig. 5c, third row). Finally, there are genes such as TMEM160 (transmembrane protein 160) whose expression levels relate to library size in a manner opposite to that motivating the use of global scaling factors. Applying scaling factors to such genes exacerbates, rather than removes, variation associated with library size (Fig. 5c, fourth row).
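The four behaviors described above can be reproduced with a small numerical sketch. All counts and library sizes below are invented, the gene labels are placeholders rather than the genes in Fig. 5c, and the scaling is a simple counts-per-million adjustment standing in for the library-size part of FPKM (gene length is constant across samples and so does not change the comparison).

```python
from math import log2
from statistics import mean

# Two samples with small libraries, two with libraries four times larger.
library_size = [1_000_000, 1_000_000, 4_000_000, 4_000_000]
is_large = [False, False, True, True]

# Invented raw counts, one gene per group described in the text.
raw = {
    "proportional": [100, 100, 400, 400],
    "over_proportional": [100, 100, 1600, 1600],
    "unrelated": [200, 200, 200, 200],
    "opposite": [400, 400, 100, 100],
}


def scale(counts):
    """Divide by library size and rescale to counts per million."""
    return [c * 1_000_000 / ls for c, ls in zip(counts, library_size)]


def log2_fold_change(values):
    """log2 ratio of mean expression: large-library vs small-library samples."""
    large = mean([v for v, flag in zip(values, is_large) if flag])
    small = mean([v for v, flag in zip(values, is_large) if not flag])
    return log2(large / small)


lfc_raw = {g: log2_fold_change(c) for g, c in raw.items()}
lfc_scaled = {g: log2_fold_change(scale(c)) for g, c in raw.items()}

for gene in raw:
    print(f"{gene:18s} raw: {lfc_raw[gene]:+.1f}  scaled: {lfc_scaled[gene]:+.1f}")
```

Only the proportional gene ends up with a log2 fold change of zero after scaling. The over-proportional gene keeps a residual effect, the unrelated gene acquires an effect it did not have, and the gene with the opposite relationship ends up with a larger effect than before.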

Note that we found the same issue in the TCGA RNA-seq datasets, such as kidney chromophobe and uveal melanoma, where samples were profiled using a single plate (Fig. 1c, first panel, and Supplementary Fig. 12).

TCGA COAD RNA-seq study

The COAD RNA-seq study involved 479 assays generated across 4 years. As with the READ RNA-seq data, there are large library size differences between samples profiled in 2010 and the other samples. The FPKM and FPKM.UQ normalizations removed library size effects from the data more effectively than was the case for the READ RNA-seq data, but these also had shortcomings.

It should be noted that the first two PCs of the FPKM and FPKM.UQ data did not reveal that the library size effects have not been properly removed. This highlights the importance of gene-level assessment, such as correlation between individual gene expression and library size or DE analysis between batches, to assess the performance of normalizations. See the Supplementary File and Supplementary Figs. 13–25 for full details of this dataset and results analogous to those just presented for the READ data.

TCGA BRCA RNA-seq study

Study outline

The BRCA RNA-seq study involved 1,180 assays that were carried out on samples from 40 TSSs, distributed across 38 plates, and profiled over 5 years from 2010 to 2014 (Supplementary Fig. 26). The samples collected in 2010 and 2011 were profiled using one flow cell chemistry, and the remaining samples were profiled using a different flow cell chemistry (personal communication from TCGA). There were 94 adjacent normal breast tissue samples and seven paired primary-metastatic samples in the study (Supplementary Fig. 26). The major intrinsic biological populations, prediction analysis of microarray 50 (PAM50) of the TCGA BRCA RNA-seq samples, were identified using different approaches. See the Supplementary File and Supplementary Figs. 27 and 28 for full details.

RUV-III removes the effects of tumor purity, flow cell chemistries and library size

As with most of the other TCGA RNA-seq studies (Fig. 1), tumor purity is one of the major sources of variation in the BRCA study. For this dataset, we designed our PRPS to remove the effects of tumor purity as well as other technical variation (Methods).

Linear regression between the first five PCs cumulatively and tumor purity within the individual PAM50 subtypes showed that the RUV-III normalization substantially removed this variation from the data (Fig. 6a). These results were supported by Spearman correlation analyses between individual gene expression levels and tumor purity within each of the PAM50 subtypes and a DE analysis between samples with low and high tumor purity (Fig. 6b,c). The variation of tumor purity estimated using the RUV-III normalized data was significantly smaller than that observed in the corresponding measurements on the FPKM.UQ normalized data (Fig. 6d).

Fig. 6: RUV-III removes tumor purity and flow cell chemistry variation from the TCGA BRCA RNA-seq data. a, R2 obtained from linear regression between the first five PCs (cumulatively) and tumor purity within individual PAM50 subtypes in the differently normalized datasets. The numbers of samples for each subtype and normalization are shown in Supplementary Fig. 27a. b, Box plots of Spearman correlation coefficients between individual gene expression and tumor purity levels in the differently normalized datasets (n = 16,537 genes). c, Unadjusted P value histograms of DE analysis between samples with low and high tumor purity within the four main PAM50 subtypes in the FPKM.UQ and the RUV-III normalized datasets. P values were obtained using Wilcoxon signed-rank test. d, Distributions of tumor purity scores in the FPKM.UQ and RUV-III normalized datasets. e, Vector correlation between the first five PCs (cumulatively) and flow cell chemistry in the normalized datasets. f, Box plots of log 2 F-statistics obtained from ANOVA between individual gene expression levels and the flow cell chemistry factor in the differently normalized datasets (n = 16,537 genes). g, Bar charts of silhouette coefficients and ARIs showing the performance of different normalization methods in mixing samples from the two flow cell chemistries. h, Gene expression heat map of the 400 genes that are highly affected by the flow cell chemistries in the TCGA FPKM.UQ data (rows are clustered; columns are in chronological order of sample processing). i, Batch scores across samples in the FPKM.UQ (left) and RUV-III (right) normalized datasets. The batch scores were calculated by the singscore method using the 400 genes described in h. Samples were divided into four groups based on their batch scores. j, Spearman correlation coefficients between the batch scores and individual gene expression levels in the FPKM and RUV-III normalized datasets. 
In the box plots (b and f), the heavy middle line represents the median; the box shows the IQR; the upper and lower whiskers extend from the hinges no further than 1.5× IQR; and any outliers beyond the whiskers are shown as points.

As mentioned above, the TCGA BRCA RNA-seq samples were profiled over two batches of flow cell chemistries. PCA plots of the FPKM and FPKM.UQ normalized datasets showed noticeable variation due to the use of two flow cell chemistries, whereas RUV-III effectively removed this variation from the data (Supplementary Fig. 29a). This conclusion was supported by a vector correlation analysis between the first five PCs cumulatively and the binary flow cell chemistry variable, silhouette analyses, the ARI and ANOVA between individual gene expression measurements and the flow cell chemistry factor (Fig. 6e–g and Supplementary Fig. 29b,c).

An expression heat map of the genes most highly affected by the flow cell chemistries showed that different genes are affected in different ways (Fig. 6h). Interestingly, the heat map also revealed two clusters within the samples processed by the first flow cell chemistry. This suggests that there are additional sources of unwanted variation of unknown origin within each flow cell chemistry. To explore this more fully, we took the set of genes most highly affected by the flow cell chemistries and scored samples against this gene set (hereafter called the batch score) using the R/Bioconductor package singscore32 on the FPKM.UQ normalized dataset. Batch scores clearly distinguished samples from the flow cell chemistry batches and separated the samples into clusters within each flow cell chemistry (Fig. 6i). We then used cutoffs to divide the samples into four groups based on their batch scores. These groups were not visible in the batch scores obtained from the RUV-III normalized data (Fig. 6i). Spearman correlation analyses showed that a surprising number of genes had either high positive or high negative correlations with the batch scores in the FPKM.UQ normalized data (Fig. 6j), whereas these correlations were much lower in the RUV-III normalized data.

Tumor purity and flow cell chemistries effects compromise gene co-expression and survival analysis

Just as we saw above with library size, tumor purity variation can affect downstream analyses, such as gene co-expression and the association between gene expression levels and survival outcomes of patients in the data. As with library size, this variation can introduce correlation between genes that are probably uncorrelated. For example, Fig. 7a shows that the gene expression levels of ZEB2 (zinc finger E-box-binding homeobox 2) and ETS1 are both highly correlated with tumor purity. The ZEB2 gene is one of the regulators of the EMT process that induces invasion of cancer cells33,34. ETS1 is a member of a large family of transcription factors characterized by their ETS DNA‐binding domain. The gene appears to have dichotomous roles as an oncogene and a tumor suppressor gene in different cancer types35,36. The high correlation of ETS1 with ZEB2 in the TCGA BRCA RNA-seq data may appear to confirm its oncogene role, but this is most likely a consequence of their correlations with tumor purity. The RUV-III normalized data and the breast cancer laser microdissection microarray data37 showed that the expression levels of these two genes are uncorrelated (Fig. 7b).

Fig. 7: Impact of tumor purity and flow cell chemistry variation on gene co-expression and survival analysis in the TCGA BRCA RNA-seq data. a, Relationship between tumor purity scores and the ZEB2 and ETS1 gene expression in the FPKM data. b, Scatter plots exhibit relationship between the ZEB2 and ETS1 gene expression in the FPKM data (left), the RUV-III normalized data (middle) and the LCM microarray data (right). c, Scatter plots show the Spearman correlation coefficients and partial correlation coefficients for all possible pairs of the genes that have the 1,300 highest correlations with tumor purity in the TCGA FPKM.UQ (left) and RUV-III normalized data (right). d, Kaplan–Meier survival analysis shows the association between the ZEB2 gene expression and overall survival in the FPKM.UQ (left) and the RUV-III normalized data (right). e, Relationship between the ESRRA and MAP3K2 gene expression with the batch scores in the FPKM.UQ data. f, Scatter plots show the relationship between the ESRRA and MAP3K2 gene expression in the FPKM.UQ (left), the RUV-III normalized data (middle) and the TCGA BRCA microarray data (right). g, Scatter plots display Spearman correlation coefficients of all possible pairs of genes that are highly affected by flow cell chemistries in the FPKM.UQ and the RUV-III normalized data. h, Kaplan–Meier survival analysis shows the association between the ESRRA gene expression and overall survival in the FPKM.UQ (left) and the RUV-III normalized data (right). i, Scatter plots exhibit the relationship between the E2F4 and CNOT1 gene expression in the FPKM.UQ (left), the RUV-III normalized data (middle) and the TCGA BRCA microarray data (right).

To extend this observation, we selected 1,300 genes whose gene expression levels are highly correlated with tumor purity and then calculated Spearman correlations between all possible pairs of these genes. In a matching analysis, we computed partial correlations between these pairs adjusting for tumor purity (Methods). Figure 7c shows that there are many gene pairs that have high correlations, but these are most likely a consequence of their correlation with tumor purity.
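The contrast between a raw and a purity-adjusted correlation can be illustrated with a minimal sketch. It assumes the standard first-order partial correlation formula applied to Spearman correlations; the procedure actually used is described in Methods and may differ. The function names and the simulated data are hypothetical and serve only to show two genes that are correlated solely through a shared dependence on tumor purity.

```python
import random

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def partial_spearman(x, y, z):
    """Spearman correlation of x and y adjusted for z (first-order formula)."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5

# Simulated data (assumption): both genes track purity plus independent noise.
rng = random.Random(0)
purity = [rng.uniform(0.2, 1.0) for _ in range(200)]
gene_a = [p + rng.gauss(0, 0.1) for p in purity]
gene_b = [p + rng.gauss(0, 0.1) for p in purity]

rho = spearman(gene_a, gene_b)
partial_rho = partial_spearman(gene_a, gene_b, purity)
print(f"Spearman: {rho:.2f}, partial (adjusted for purity): {partial_rho:.2f}")
```

In this simulation the raw correlation is high while the partial correlation is close to zero, which is the pattern expected when a gene pair is linked only through tumor purity.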

Variation in tumor purity can also affect the association between gene expression levels and survival outcomes. For example, the expression of the ZEB2 gene has been shown to be associated with cancer progression and survival outcome in different cancer types38,39. The RUV-III normalization revealed that high expression of the ZEB2 gene is associated with a poor outcome in the TCGA BRCA RNA-seq data, but this was obscured by variation in tumor purity in the FPKM.UQ normalized data.

Another example is the STAB1 (stabilin 1) gene, whose expression levels are associated with survival in several cancer types, including breast cancer40,41,42. However, this association was only evident in the present data after removing variation in tumor purity. We found many more examples of such genes using the RUV-III normalized data.

The complex unwanted variation arising from the change in flow cell chemistry and the unknown source noted above clearly compromises estimates of gene co-expression in the FPKM.UQ normalized dataset. It introduces correlations between pairs of genes that are most likely not correlated. For example, the expression levels of the ESRRA (estrogen-related receptor alpha) and MAP3K2 (mitogen-activated protein kinase kinase kinase 2) genes are positively correlated in this dataset; however, this correlation seems to be a consequence of the unwanted variation in the data (Fig. 7e), for we do not see it in either the RUV-III normalized data or the TCGA BRCA microarray data (Fig. 7f).

To extend this analysis, we first selected the genes that had the 1,000 highest correlations with the batch scores in the FPKM.UQ normalized data and calculated all gene–gene correlations between them in both the FPKM.UQ and RUV-III normalized datasets. Figure 7g shows that a large number of gene pairs have high correlations in the FPKM.UQ normalized data, something we do not see in the RUV-III normalized data.

Interestingly, in the FPKM.UQ normalized data, the overall correlation between expression of the E2F4 (E2F transcription factor 4) and CNOT1 (CCR4-NOT transcription complex subunit 1) genes is only ρ = 0.1, whereas the average of the correlations of these genes within each of groups 1–4 of the unknown source of unwanted variation is ρ = 0.4 (Fig. 7i). Both the RUV-III normalized and the TCGA microarray data show a high positive correlation between the expression levels of the E2F4 and CNOT1 genes.
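The difference between an overall correlation and the average of within-group correlations can be made concrete with a small sketch. The data below are invented and deliberately more extreme than the E2F4 and CNOT1 case: within each hypothetical group the two genes rise together, but group-specific shifts in opposite directions mask this when all samples are pooled. The helper names are assumptions for illustration.

```python
def ranks(values):
    """1-based ranks; this simple version assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def mean_within_group_spearman(x, y, groups):
    """Average of the Spearman correlations computed separately per group."""
    rhos = []
    for g in sorted(set(groups)):
        idx = [i for i, label in enumerate(groups) if label == g]
        rhos.append(spearman([x[i] for i in idx], [y[i] for i in idx]))
    return sum(rhos) / len(rhos)

# Invented example: four groups of five samples, shifted in opposite directions.
groups = [g for g in range(1, 5) for _ in range(5)]
gene_x = [10 * g + t for g in range(1, 5) for t in range(5)]
gene_y = [-10 * g + t for g in range(1, 5) for t in range(5)]

overall = spearman(gene_x, gene_y)
within = mean_within_group_spearman(gene_x, gene_y, groups)
print(f"overall: {overall:.2f}, mean within-group: {within:.2f}")
```

Here the pooled correlation is negative even though the two genes are perfectly concordant inside every group, showing how a group-level shift can hide a genuine relationship.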

Supplementary Fig. 30 shows that the RUV-III normalization removed library size effects from this dataset more effectively than was the case with the FPKM and FPKM.UQ normalizations.

RUV-III improves the separation of the PAM50 clusters

Breast cancer intrinsic subtypes, including HER2-enriched, basal-like, luminal A, luminal B and normal-like43,44, are based on a 50-gene expression signature (PAM50)45. PCA plots, vector correlation between the first ten PCs cumulatively and the PAM50 subtypes, silhouette coefficients and ARI (Extended Data Fig. 1a–c) all show that the RUV-III normalization led to better separation of PAM50 subtypes in the BRCA RNA-seq data. Kaplan–Meier survival analysis shows that the PAM50 calls obtained using RUV-III normalized data exhibit significant associations with overall survival outcomes of TCGA BRCA patients (Supplementary Fig. 27b,c).
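As an illustration of one of these metrics, the adjusted Rand index (ARI) between a set of cluster assignments and the known subtype labels can be computed from pair counts. The sketch below implements the standard ARI formula using only the Python standard library; the toy labels and cluster assignments are invented, and how the clusters were obtained in the analysis is described in Methods.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two partitions of the same samples (1.0 = identical)."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))      # contingency table cells
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)         # agreement expected by chance
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:                       # degenerate partitions
        return 1.0
    return (sum_joint - expected) / (maximum - expected)

# Toy example (invented): six samples from three subtypes.
subtypes = ["LumA", "LumA", "LumB", "LumB", "Basal", "Basal"]
good_clusters = [0, 0, 1, 1, 2, 2]    # recovers the subtypes exactly
poor_clusters = [0, 0, 0, 1, 1, 1]    # merges and splits subtypes

ari_good = adjusted_rand_index(subtypes, good_clusters)
ari_poor = adjusted_rand_index(subtypes, poor_clusters)
print(ari_good, round(ari_poor, 3))
```

A higher ARI indicates that the clustering agrees more closely with the subtype labels, which is the sense in which a normalization is said to give better separation.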

It should be noted that the PAM50 subtypes identified using the TCGA normalized datasets are compromised by tumor purity, particularly samples from the normal-like subtype, which show very low tumor purity. We applied the PAM50 classifier to the breast cancer laser capture microdissection (LCM) gene expression data and found no normal-like subtype in the dataset. These results are consistent with previous studies showing that the normal-like subtype is due to the low tumor purity of samples46,47,48.

Additionally, Spearman correlation analysis showed that several of the PAM50 genes exhibit high correlation with tumor purity in the FPKM.UQ normalized data (Extended Data Fig. 1d). For example, expression of FOXA1 (forkhead box A1) is highly associated with tumor purity in the HER2-enriched, luminal A and luminal B subtypes in the FPKM.UQ normalized data (Extended Data Fig. 1e). This observation suggests that variation in tumor purity might compromise the identification of PAM50 subtypes. In addition, this might also explain the differences between the PAM50 calls obtained from RUV-III normalized data, where the variation of tumor purity has been removed, and those obtained from the FPKM and FPKM.UQ normalized datasets (Supplementary Fig. 27a).

We explored the association between the expression levels of the PAM50 genes and survival within each of the PAM50 subtypes using both the FPKM.UQ and RUV-III normalized data. Interestingly, we found with the RUV-III normalized data that higher expression of the FOXA1 gene is associated with poorer outcome in the luminal B subtype, a conclusion that was obscured by the variation in tumor purity of the TCGA RNA-seq data (Extended Data Fig. 1f).

Normalization of multiple RNA-seq studies

We assessed the performance of RUV-III with PRPS on the normalization of multiple RNA-seq studies. In this analysis, we normalized three large breast cancer RNA-seq datasets: TCGA and two cohorts from the studies of Brueffer et al.49,50. We did not have access to the raw counts data of the Brueffer et al. studies, so we performed our normalization on the FPKM values of all three studies. The lowly expressed genes were identified using the TCGA BRCA raw counts and removed from the other datasets. The PCA and RLE plots of the combined datasets show large variation between the TCGA and the other two studies (Supplementary Fig. 32a,b). As discussed above, we first need to identify sources of unwanted variation to create PRPS for RUV-III normalization. We used plates as batches for the TCGA BRCA RNA-seq data and the RLE medians (Methods) within each of the other two studies to identify batches. Their medians were clustered into three groups within each study. We performed PCA within each study using a set of RNA-seq housekeeping genes as negative control genes to explore the batches that were identified using the RLE medians. Supplementary Fig. 32c shows that the first and third PCs capture those batches. Then, the PAM50 subtypes were used as known major biological populations to produce five sets of PRPS (Supplementary Fig. 33a). The results demonstrated that RUV-III with PRPS leads to a satisfactory normalization by removing between-study and within-study variations and preserving the PAM50 clusters (Supplementary Fig. 33), whereas the other normalizations, quantile and upper quartile, show visible shortcomings. Furthermore, Supplementary Fig. 33d shows that several well-known gene–gene correlations51 have been preserved in the RUV-III normalized data. We also explored the correlation between the two pairs of genes, CNOT1_E2F4 and MAP3K2_ESRRA, that were discussed in the TCGA BRCA RNA-seq data (Fig. 7). The true correlation between these two pairs of genes was preserved in the RUV-III normalized data (Supplementary Fig. 33d). The results demonstrate that RUV-III with PRPS is applicable to normalizing RNA-seq data from multiple studies. Note that we would have preferred to use RUV-III on the raw counts without any further normalization, but we were unable to do so here.
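For readers unfamiliar with RLE medians, the following minimal sketch shows the usual relative log expression calculation: each gene's log expression is centered on its median across samples, and the median of these deviations is then taken within each sample. The input layout (a dictionary from gene name to per-sample log2 values), the gene names and the numbers are assumptions for illustration; the grouping of the medians into batches is described in Methods and is not reproduced here.

```python
from statistics import median

def rle_medians(log_expr):
    """Per-sample RLE medians.

    log_expr maps each gene to a list of log2 expression values,
    one per sample, with samples in the same order for every gene.
    """
    n_samples = len(next(iter(log_expr.values())))
    deviations = []
    for values in log_expr.values():
        gene_median = median(values)                  # median across samples
        deviations.append([v - gene_median for v in values])
    # Median across genes of the deviations, taken sample by sample.
    return [median(row[s] for row in deviations) for s in range(n_samples)]

# Invented example: the fourth sample is shifted upward for every gene.
log_expr = {
    "geneA": [5, 5, 5, 7],
    "geneB": [3, 4, 3, 5],
    "geneC": [8, 8, 9, 10],
}
medians = rle_medians(log_expr)
print(medians)
```

Samples whose RLE median lies far from zero, such as the fourth sample in this example, are candidates for belonging to a distinct batch.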

Performance of RUV-III with poorly chosen PRPS

We evaluated the performance of RUV-III with poorly chosen PRPS on the TCGA READ and BRCA RNA-seq studies. To simulate poorly chosen PRPS, we randomly shuffled 20%, 40%, 60% and 80% of the biological labels, including the CMS and PAM50 subtypes, that were originally used to create PRPS for RUV-III normalization. The shuffling steps were repeated ten times for each proportion, and the results were averaged for normalization performance assessments.
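One way to simulate such label corruption is sketched below. It assumes that "shuffling" a proportion of labels means selecting that fraction of samples at random and permuting the labels among the selected samples; the exact procedure may differ, and the function name, seed and toy labels are hypothetical. Note that under this scheme some selected samples can receive their original label again, so the proportion of changed labels is at most the nominal proportion.

```python
import random

def shuffle_fraction(labels, fraction, rng):
    """Permute the labels of a randomly chosen fraction of the samples."""
    n_pick = round(fraction * len(labels))
    picked = rng.sample(range(len(labels)), n_pick)   # positions to disturb
    targets = picked[:]
    rng.shuffle(targets)                              # random reassignment
    shuffled = list(labels)
    for source, target in zip(picked, targets):
        shuffled[target] = labels[source]
    return shuffled

# Toy labels (invented): 100 samples spread evenly over four CMS subtypes.
rng = random.Random(0)
labels = ["CMS1", "CMS2", "CMS3", "CMS4"] * 25

# Ten shuffled replicates for each proportion, as in the design described above.
replicates = {
    p: [shuffle_fraction(labels, p, rng) for _ in range(10)]
    for p in (0.2, 0.4, 0.6, 0.8)
}
for p, reps in replicates.items():
    changed = [sum(a != b for a, b in zip(labels, r)) for r in reps]
    print(p, sum(changed) / len(changed))
```

Each replicate would then be used to build PRPS and run the normalization, with the performance metrics averaged over the ten replicates.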

The results show that, even with poorly chosen PRPS, RUV-III outperforms the FPKM and FPKM.UQ normalization in terms of removing large library size differences and preserving the CMS clusters in the TCGA READ RNA-seq data (Supplementary Fig. 34a,b). The correlations between two pairs of genes, MDH2_EIF4H and TMF1_BCLAF1 (Fig. 3), were also preserved in the RUV-III datasets with poorly chosen PRPS (Supplementary Fig. 34c). Furthermore, the association between RAB18 gene expression and the survival outcome (Fig. 4) was identified in all the RUV-III datasets with poorly chosen PRPS. However, we found this association for the FBXL14 gene only in RUV-III with 20% shuffled labels (Supplementary Fig. 34d).

We performed a similar analysis on the TCGA BRCA RNA-seq data. Our results showed that the RUV-III normalizations with poorly chosen PRPS also show satisfactory performance compared to both FPKM and FPKM.UQ in terms of removing the flow cell chemistry and tumor purity effects. However, RUV-III with 60% and 80% shuffled labels shows slightly lower performance than the FPKM and FPKM.UQ normalizations regarding the separation of the PAM50 subtypes (Supplementary Fig. 35). The gene–gene correlations and the association between gene expression and survival outcomes demonstrated that the RUV-III normalizations with poorly chosen PRPS result in satisfactory normalization (Supplementary Fig. 35d–f).

Overall, our results illustrate that RUV-III shows a very satisfactory performance in a situation where PRPS is poorly chosen.

Performance of RUV-III with partially known biological labels

We assessed the performance of RUV-III with PRPS in situations where the biological labels are partially known (hereafter called RUV-III-P). To simulate such situations, we used one of the CMS subtypes, CMS4, to create PRPS for RUV-III normalization of the TCGA READ RNA-seq data. Note that this subtype is not present across all the plates. Our results clearly show that RUV-III-P normalization led to very satisfactory normalization by removing the large library size differences and plate effects and also preserving the CMS clusters (Supplementary Fig. 36). RUV-III-P also preserved the association between RAB18 gene expression and survival outcomes in the TCGA READ RNA-seq data. However, this normalization did not show the same result for the FBXL14 gene. This might be explained by the presence of the CMS4 subtype in only eight out of 14 plates in the TCGA READ RNA-seq data.
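The consequence of using a single subtype can be seen in a small sketch of pseudo-sample construction. It assumes that a pseudo-sample is formed by averaging the expression of samples that share a biological label within one batch (here a plate), which is how we read the PRPS idea; the full procedure, including the scale on which averaging is done, is given in Methods. The function name, the `min_samples` threshold and the toy data are hypothetical.

```python
from collections import defaultdict

def make_pseudo_samples(expr, plates, subtypes, use_subtypes, min_samples=2):
    """Average samples sharing a subtype within a plate.

    expr is a list of per-sample expression vectors. Only subtypes listed
    in use_subtypes are considered (the partially known labels). Returns a
    dict mapping (subtype, plate) to the averaged expression vector.
    """
    groups = defaultdict(list)
    for vector, plate, subtype in zip(expr, plates, subtypes):
        if subtype in use_subtypes:
            groups[(subtype, plate)].append(vector)
    pseudo = {}
    for key, vectors in groups.items():
        if len(vectors) >= min_samples:               # hypothetical threshold
            pseudo[key] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return pseudo

# Toy data (invented): two genes, three plates, CMS4 absent from plate C.
expr = [[2, 4], [4, 6], [1, 1], [10, 0], [12, 2], [5, 5], [7, 7]]
plates = ["A", "A", "A", "B", "B", "C", "C"]
subtypes = ["CMS4", "CMS4", "CMS2", "CMS4", "CMS4", "CMS2", "CMS2"]

pseudo = make_pseudo_samples(expr, plates, subtypes, use_subtypes={"CMS4"})
covered = sorted(plate for _, plate in pseudo)
print(pseudo, covered)
```

In this toy example plate C contributes no pseudo-sample because it contains no CMS4 samples, so the pseudo-replicate set carries no direct information about unwanted variation specific to that plate.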

Similar analyses were performed on the TCGA BRCA RNA-seq data. We used the basal and luminal A subtypes to create PRPS. The results demonstrated that performance of RUV-III-P was largely similar to the initial RUV-III normalization, in which all the PAM50 subtypes were used for producing PRPS (Supplementary Fig. 37).

These analyses show that RUV-III with PRPS can be used for normalization of RNA-seq data in situations where the biological labels of samples are only partially known.