Date: 2022-06-13

Additional details and results are described in the Supplementary Notes. Here, we summarize the key aspects of the analysis.

Total mRNA expression in scRNA-seq data

Dataset

We collected scRNA-seq data from ten patients, comprising three with colorectal adenocarcinoma, three with hepatocellular carcinoma, two with lung adenocarcinoma and two with pancreatic adenocarcinoma (Supplementary Table 1). A full description is provided in Supplementary Note 1.1. The three colorectal adenocarcinoma patient samples were obtained with informed consent and were approved by the Human Subjects Protection Office, the Clinical Research Committee as well as five separate institutional review boards at MD Anderson Cancer Center, in accordance with the Declaration of Helsinki.

Quality control, clustering, cell type annotation and normalized UMI

For each sample, we first filtered out cells based on number of genes expressed, total UMI counts and proportion of total UMI counts derived from mitochondrial genes. We also removed cells that were detected as doublets. After the quality control, 48,913 cells remained from the ten human tumor samples. Within each patient sample, highly variable genes were detected and used for principal component analysis (PCA). Cells were then clustered with the Seurat package68. Cell type was annotated using known marker genes34,35,69,70,71. Tumor cells were identified based on the inferred presence of somatic CNAs by inferCNV72. We further merged Seurat68-identified clusters that were not significantly different in gene counts, which is the total number of expressed genes (Wilcoxon rank-sum test, α = 0.001; Fig. 1b). A full description is provided in Supplementary Note 1.2.1.

To enable comparison among different scRNA-seq samples within the same study, we performed scale normalization to ensure that the total UMI count per cell was comparable across different samples from the same study. A full description is provided in Supplementary Note 1.2.2.

Trajectory and gene set enrichment analyses

We applied Monocle 2 (version 2.14.0)44,45,46 to construct single-cell trajectories and used the CytoTRACE (version 0.3.3) score to measure the differentiation state of tumor cells16. To compare CytoTRACE scores among the tumor cell clusters from patient samples within the same cancer type, we integrated tumor cells from patients 1, 2 and 3 from colorectal cancer and patients 1 and 2 from each of the lung and pancreatic cancers using ComBat (version 3.20.0)73 embedded in CytoTRACE, which corrects for batch effects. We quantified gene set enrichment for the high-UMI versus low-UMI tumor cell clusters using the GeneOverlap R package (version 1.24.0)74. A comprehensive set of signatures with 18,617 human gene sets (containing at least four genes) was compiled from the Molecular Signatures Database (version 6.2)47 and CellMarker48. A full description is provided in Supplementary Note 1.2.4.

Pseudo-bulk analysis

We pooled normalized scRNA-seq data to form pseudo-bulk samples and estimated the ratio of the mean total UMI counts of tumor cells to that of the non-tumor cells for each sample. The 95% CIs were constructed by bootstrapping the same numbers of tumor and non-tumor cells with 1,000 repetitions.
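
The ratio and its bootstrap interval can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, each group is resampled with replacement at its original size (one reading of "the same numbers of tumor and non-tumor cells"), and a percentile interval is assumed because the text does not state which bootstrap interval was used.

```python
import random
from statistics import mean


def umi_ratio(tumor_umis, nontumor_umis):
    """Ratio of mean total UMI count per cell: tumor over non-tumor."""
    return mean(tumor_umis) / mean(nontumor_umis)


def bootstrap_ci(tumor_umis, nontumor_umis, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the tumor/non-tumor UMI ratio (assumed method)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        t = rng.choices(tumor_umis, k=len(tumor_umis))
        n = rng.choices(nontumor_umis, k=len(nontumor_umis))
        ratios.append(umi_ratio(t, n))
    ratios.sort()
    lo_idx = max(0, round((1 - level) / 2 * n_boot))
    hi_idx = min(n_boot - 1, round((1 + level) / 2 * n_boot) - 1)
    return ratios[lo_idx], ratios[hi_idx]
```

Calling `bootstrap_ci(tumor, nontumor)` with two lists of per-cell normalized total UMI counts returns the lower and upper bounds; the fixed seed only makes the sketch reproducible.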

Tumor-specific total mRNA expression in bulk sequencing data

A mathematical model for tumor-specific total mRNA expression estimation

For any group of cells, we use S to denote the average global mRNA transcript level per cell per haploid genome, which follows \(S = \sum\nolimits_{c = 1}^C \left( \sum\nolimits_{g = 1}^G u_{gc}/p_c \right)/C\). Here, \(u_{gc}\) denotes the number of mRNA transcripts of gene g in cell c; G is the total number of genes; C is the number of cells; and \(p_c\) is the ploidy, that is, the number of copies of the haploid genome in cell c. However, the cell-level ploidy \(p_c\) is usually not measurable. Hence, in practice, we use the average ploidy Ψ of the corresponding cell group to approximate it: \(S \approx \sum\nolimits_{c = 1}^C \sum\nolimits_{g = 1}^G u_{gc}/(C\Psi )\). For non-tumor cells, which are commonly diploid, this approximation holds.
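
To make the two expressions for S concrete, the sketch below computes both the exact per-cell form and the average-ploidy approximation. The function names and the nested-list layout (`u[c][g]` for the transcript count of gene g in cell c) are assumptions made for illustration only.

```python
def s_exact(u, ploidy):
    """S = (1/C) * sum_c (sum_g u[c][g] / p_c), with one ploidy value per cell."""
    n_cells = len(u)
    return sum(sum(row) / p for row, p in zip(u, ploidy)) / n_cells


def s_approx(u, mean_ploidy):
    """S ~ sum_c sum_g u[c][g] / (C * Psi), using the group's average ploidy."""
    n_cells = len(u)
    return sum(sum(row) for row in u) / (n_cells * mean_ploidy)
```

When every cell in the group has the same ploidy, as is commonly the case for diploid non-tumor cells, the two functions return the same value.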

In the analysis of bulk RNA sequencing data from mixed tumor samples, we are interested in comparing tumor to non-tumor cell groups. We let T denote tumor cells and N denote non-tumor cells. We define TmS as the ratio of the total mRNA transcript level per haploid genome of tumor cells to that of the surrounding non-tumor cells, that is, \(\mathrm{TmS}_{\mathrm{tumor}} = S_T/S_N\), simplified as TmS from here forward. Taking this ratio is necessary to cancel out technical effects present in sequencing data that confound both \(S_T\) and \(S_N\). Let \(T_g = \sum\nolimits_{c = 1}^{C_T} u_{gc}\) and \(N_g = \sum\nolimits_{c = 1}^{C_N} u_{gc}\) denote the total number of mRNA transcripts of gene g across all tumor and non-tumor cells, respectively; let \(T_+ = \sum\nolimits_{g = 1}^G T_g\) and \(N_+ = \sum\nolimits_{g = 1}^G N_g\); let \(C_T\) and \(C_N\) denote the total number of tumor and non-tumor cells; and let \(\Psi_T\) and \(\Psi_N\) represent the average ploidy of tumor and non-tumor cells, respectively. Under the assumption that the tumor cells have a similar ploidy, we can derive TmS without using single-cell-specific parameters as

$${\mathrm{TmS}} = [T_ + /(C_T\Psi _T)]/[N_ + /(C_N\Psi _N)] = [T_ + /N_ + ]/[(C_T\Psi _T)/(C_N\Psi _N)]$$ (1)

We further introduce the proportion of total bulk mRNA expression derived from tumor cells (hereafter ‘tumor-specific mRNA proportion’) \(\pi = \left( \sum\nolimits_{g = 1}^G T_g \right)/\left( \sum\nolimits_{g = 1}^G T_g + \sum\nolimits_{g = 1}^G N_g \right)\) and the tumor cell proportion (hereafter ‘tumor purity’) \(\rho = C_T/(C_T + C_N)\). We, thus, have

$$\begin{array}{lll} \mathrm{TmS} & = & \left[ \pi /(1 - \pi ) \right]/\left[ \left( \rho /(1 - \rho ) \right)\left( \Psi_T/\Psi_N \right) \right] \\ & = & \left[ \pi (1 - \rho )\Psi_N \right]/\left[ \rho (1 - \pi )\Psi_T \right] \end{array}$$ (2)
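
The step from Eq. 1 to Eq. 2 follows from the definitions of π and ρ alone. A short worked derivation, written out here for convenience and using only the symbols defined above:

```latex
\begin{aligned}
\pi = \frac{T_+}{T_+ + N_+} \;&\Rightarrow\; 1 - \pi = \frac{N_+}{T_+ + N_+} \;\Rightarrow\; \frac{T_+}{N_+} = \frac{\pi}{1 - \pi},\\
\rho = \frac{C_T}{C_T + C_N} \;&\Rightarrow\; 1 - \rho = \frac{C_N}{C_T + C_N} \;\Rightarrow\; \frac{C_T}{C_N} = \frac{\rho}{1 - \rho},\\
\mathrm{TmS} = \frac{T_+/N_+}{(C_T\Psi_T)/(C_N\Psi_N)} &= \frac{\pi/(1 - \pi)}{\left[\rho/(1 - \rho)\right]\left(\Psi_T/\Psi_N\right)} = \frac{\pi(1 - \rho)\Psi_N}{\rho(1 - \pi)\Psi_T}.
\end{aligned}
```

For instance, with the illustrative values π = 0.6, ρ = 0.5, \(\Psi_T = 3\) and \(\Psi_N = 2\) (not taken from the study), TmS = (0.6 × 0.5 × 2)/(0.5 × 0.4 × 3) = 1.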

The tumor-specific mRNA proportion π derived from the tumor can be estimated using DeMixT31 as \(\hat \pi\); the tumor purity ρ and ploidy \(\Psi_T\) can be estimated using ASCAT32, ABSOLUTE33 or Sequenza49 based on the matched DNA sequencing data as \(\hat \rho\) and \(\widehat {{\Psi }}_T\), respectively; and the ploidy of non-tumor cells \(\Psi_N\) was assumed to be 2 (refs. 32,33). Hence, we have

$$\widehat {\mathrm{TmS}} = \frac{{\hat \pi (1 - \hat \rho )\Psi _N}}{{\hat \rho (1 - \hat \pi )\hat \Psi _T}}$$ (3)

In what follows, we use TmS to represent \({\widehat {\mathrm{TmS}}}\) for simplicity. A full description is provided in Supplementary Note 2.1.
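
Eq. 3 translates directly into a small function. The sketch below is illustrative only: the function name and argument names are hypothetical, and the inputs are assumed to be estimates already obtained from the RNA-based and DNA-based deconvolution tools.

```python
def tms(pi_hat, rho_hat, ploidy_tumor, ploidy_normal=2.0):
    """TmS from Eq. 3: pi(1 - rho) * Psi_N / (rho(1 - pi) * Psi_T)."""
    if not (0.0 < pi_hat < 1.0 and 0.0 < rho_hat < 1.0):
        raise ValueError("pi_hat and rho_hat must lie strictly between 0 and 1")
    if ploidy_tumor <= 0 or ploidy_normal <= 0:
        raise ValueError("ploidy values must be positive")
    numerator = pi_hat * (1.0 - rho_hat) * ploidy_normal
    denominator = rho_hat * (1.0 - pi_hat) * ploidy_tumor
    return numerator / denominator
```

A diploid tumor whose mRNA proportion equals its purity gives TmS = 1; a larger mRNA proportion at the same purity and ploidy gives TmS > 1.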

Consensus of tumor purity and ploidy estimation

For DNA-based deconvolution methods such as ASCAT and ABSOLUTE, there could be multiple tumor purity ρ and ploidy \(\Psi_T\) pairs that have similar likelihoods. Both ASCAT and ABSOLUTE can accurately estimate the product of purity and ploidy \(\rho \Psi_T\); however, they sometimes lack power to identify ρ and \(\Psi_T\) separately. TmS is derived from the product of tumor ploidy and the odds of tumor purity. Hence, it is potentially more robust to ambiguity in the tumor purity and ploidy estimation. We illustrate this robustness by showing that the agreement between TmS values calculated from ASCAT and ABSOLUTE is substantially improved compared with the agreement between the ploidy values calculated from the two methods, which was low among 20% of TCGA samples (Extended Data Fig. 3f,g). To calculate one final set of TmS values for a maximum number of samples, we take a consensus strategy. We first calculate TmS values with tumor purity and ploidy estimates derived from both ABSOLUTE and ASCAT and then fit a linear regression model on the \(\log_2\)-transformed \(\mathrm{TmS}_{\mathrm{ASCAT}}\) using the \(\log_2\)-transformed \(\mathrm{TmS}_{\mathrm{ABSOLUTE}}\) as the predictor variable. We remove samples with Cook’s distance ≥ 4/n (n = 5,295; Extended Data Fig. 3h) and calculate the final \({\mathrm{TmS}} = \sqrt {{{\mathrm{TmS}}_{\mathrm{ASCAT}}} \times {{\mathrm{TmS}}_{\mathrm{ABSOLUTE}}}}\).
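
The consensus step can be sketched as below, assuming a simple least-squares fit with an intercept and the standard Cook's distance formula for that fit. The function name is hypothetical, and returning `None` for removed samples is an illustrative convention rather than the authors' implementation.

```python
import math


def consensus_tms(tms_ascat, tms_absolute):
    """Geometric-mean TmS per sample; None where Cook's distance >= 4/n."""
    x = [math.log2(v) for v in tms_absolute]  # predictor
    y = [math.log2(v) for v in tms_ascat]     # response
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # fitted parameters: intercept and slope
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance
    out = []
    for xi, e, a, b in zip(x, resid, tms_ascat, tms_absolute):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx  # leverage of this sample
        cook = 0.0 if s2 == 0 else (e * e / (p * s2)) * h / (1.0 - h) ** 2
        out.append(None if cook >= 4.0 / n else math.sqrt(a * b))
    return out
```

The inputs are two equally long lists of positive TmS values for the same samples; at least three samples with distinct ABSOLUTE-based values are needed for the fit to be defined.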

Improved estimation of tumor-specific mRNA proportion

The identifiability of model parameters is a major issue for high-dimensional models. With the DeMixT model, there is a hierarchy in model identifiability in which the cell-type-specific mRNA proportions are the most identifiable parameters, requiring only a subset of genes with identifiable expression distributions. Therefore, our goal is to select an appropriate set of genes as input to DeMixT that optimizes the estimation of the tumor-specific mRNA proportions (π). In general, genes expressed at different numerical ranges can affect estimation of π. We found that large biases in the estimated π can be introduced by including genes that (1) are not differentially expressed between the tumor and non-tumor components, (2) are differentially expressed across tumor subtypes in different samples or (3) have large variance in expression within the non-tumor component. On the other hand, the tumor component is hidden in the mixed tumor samples, which prevents a differential expression analysis between mixed and normal samples from finding the best genes. By applying a profile-likelihood-based approach to detect the identifiability of model parameters75, we systematically selected the top-ranking identifiable genes for the estimation of π. As a general method, the profile-likelihood-based gene selection strategy can be extended to any method that uses maximum likelihood estimation. We also employed a virtual ‘normal’ spike-in strategy to balance proportion distributions, which further improved the deconvolution performance. A full description is provided in Supplementary Note 2.2.

Profile-likelihood-based gene selection

In brief, in the DeMixT model, for sample \(i \in (1,2, \ldots ,M)\) and gene \(g \in (1,2, \ldots ,G)\), we have

$$Y_{ig} = \pi _iT_{ig}^\prime + \left( {1 - \pi _i} \right)N_{ig}^\prime$$ (4)

where \(Y_{ig}\) represents the scale-normalized expression count matrix observed from mixed tumor samples, and \(T_{ig}^\prime\) and \(N_{ig}^\prime\) represent the normalized relative expression of gene g within tumor and surrounding non-tumor cells, respectively. The estimated tumor-specific mRNA proportion \(\hat \pi\) is the quantity required for Eq. 3. We assume each hidden component follows the \(\log_2\)-normal distribution, that is, \(T_{ig}^\prime \sim LN\!\left( {\mu _{Tg},\sigma _{Tg}^2} \right)\) and \(N_{ig}^\prime \sim LN\!\left( {\mu _{Ng},\sigma _{Ng}^2} \right)\). We will use notation T and N and drop the ′ sign from now on. The identifiability of a gene k in the DeMixT model is measured by the CI \([\mu _{Tk}^ - ,\mu _{Tk}^ + ]\) around the mean expression \(\mu_{Tk}\). The definition of the profile likelihood function of \(\mu_{Tk}\) is

$$\begin{array}{lll}l_{\mu _{Tk}}\!\left( {\mu _{Tk} = x|\pi ,\mu _T,\sigma _T} \right) \\= \mathop {{\max }}\limits_{\pi _i,\mu _{Tg},\sigma _{Tg},\sigma _{Tk}} \left\{ {\mathop {\sum}\limits_{i = 1}^M {\left[ {\mathop {\sum}\limits_{g \ne k}^G {\log \left( {f\left( {\pi _i,\mu _{Tg},\sigma _{Tg}} \right)} \right) + \log \left( {f\left( {\pi _i,\mu _{Tk} = x,\sigma _{Tk}} \right)} \right)} } \right]} } \right\}\end{array}$$ (5)

where

$$\begin{array}{lll}f\!\left( {Y_{ig}|\pi _i,\mu _{Tg},\sigma _{Tg}} \right) = \frac{1}{{2\pi \sigma _{Ng}\sigma _{Tg}}}\\ \times {\int}_0^{Y_{ig}} {\frac{1}{{t(Y_{ig} - t)}}} \exp \! \left( { - \frac{{\left( {\log_2\left( t \right) - \mu _{Ng} - \log_2\left( {1 - \pi _i} \right)} \right)^2}}{{2\sigma _{Ng}^2}} - \frac{{\left( {\log_2(Y_{ig} - t) - \mu _{Tg} - \log_2(\pi _i)} \right)^2}}{{2\sigma _{Tg}^2}}} \right)dt\end{array}$$

is the likelihood function of the DeMixT model.

The CI of a profile likelihood function can be constructed through inverting a likelihood-ratio test76. However, calculating the actual profile likelihood function of all genes (~20,000) is generally infeasible due to computational limits. We adopted an asymptotic approximation to quickly evaluate the profile likelihood function75, using the observed Fisher information of the log-likelihood, denoted as \(H(\hat \pi ,\hat \mu _T,\hat \sigma _T)\). Then, the asymptotic α-level CI of \(\mu_{Tk}\) can be written as75

$$\mu _{Tk}^ \pm = \widehat {\mu _{Tk}} \pm \sqrt {2\chi _{1 - \alpha }^2(1)H\left( {\hat \pi ,\hat \mu _T,\hat \sigma _T} \right)_{k,k}^{ - 1}}$$ (6)

We hereby introduce a gene selection score to represent the length of an asymptotic profile-likelihood-based 95% CI of \(\mu_{Tk}\) for gene k,

$${\mathrm{gene}}\,{\mathrm{selection}}\,{\mathrm{score}}_k = 2\sqrt {2\chi _{0.05}^2(1)H\left( {\hat \pi ,\hat \mu _T,\hat \sigma _T} \right)_{k,k}^{ - 1}}$$ (7)

Genes with a lower score have a smaller CI and hence higher identifiability for their corresponding parameters in the DeMixT model. Genes are ranked based on the gene selection scores from the smallest to the largest. A subset of genes that are ranked on top will be used for parameter estimation. In the DeMixT R package, our proposed profile-likelihood-based gene selection approach is included as function ‘DeMixT_GS’. A full description is provided in Supplementary Note 2.2.2. We performed a simulation study, mimicking the TCGA prostate adenocarcinoma dataset, to validate the proposed gene selection method. A full description is provided in Supplementary Note 2.2.3. The implementation of virtual ‘normal’ spike-ins and a simulation study is provided in Supplementary Note 2.2.4.
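
The scoring and ranking in Eq. 7 can be sketched as follows. This is not the DeMixT_GS implementation: the function names are hypothetical, the diagonal entries of the inverse observed Fisher information are taken as given inputs, and the chi-square term is assumed to be the critical value for a 95% interval with one degree of freedom (about 3.84). The ranking itself does not depend on that constant, because it scales every score equally.

```python
import math

# Assumed constant: 0.95 quantile of the chi-square distribution with 1 d.f.
CHI2_95_DF1 = 3.841458820694124


def gene_selection_scores(inv_info_diag, chi2_value=CHI2_95_DF1):
    """Length of the asymptotic CI of mu_Tk for each gene k (Eq. 7)."""
    return [2.0 * math.sqrt(2.0 * chi2_value * v) for v in inv_info_diag]


def top_genes(gene_names, inv_info_diag, n_top):
    """Rank genes from smallest to largest score and keep the first n_top."""
    scores = gene_selection_scores(inv_info_diag)
    ranked = sorted(zip(scores, gene_names))
    return [name for _, name in ranked[:n_top]]
```

Here `inv_info_diag[k]` stands for \(H(\hat \pi ,\hat \mu _T,\hat \sigma _T)_{k,k}^{-1}\) for gene k, and must be non-negative.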

TmS validation using bulk RNA sequencing data from mixed cell lines

We validated TmS estimates using an experimental dataset from a previous mixed cell line study (GSE121127)31 and selected a subset of 18 mixed samples with negligible RNA content from the immune component. Lung adenocarcinoma in humans (H1092) and CAF cells were mixed at different cell count proportions (Supplementary Table 3) to generate each bulk sample, plus three additional samples of 100% H1092 or 100% CAF. The raw reads were generated from paired-end total RNA Illumina sequencing and mapped to the human reference genome build 37.2 from the National Center for Biotechnology Information through TopHat77. SAMtools78 was applied to remove improperly mapped and duplicated reads. Picard tools were used to sort the cleaned SAM files according to their reference sequence names and create an index for the reads. The gene-level expression was quantified using the R packages GenomicFeatures and GenomicRanges.

For each cell line, we measured total RNA amount (in ng µl−1) for 1 million cells in three repeats using the Qubit RNA Broad Range Assay Kit (Life Technologies). The true TmS values of H1092 or CAF were then derived as a ratio of the total RNA amount per cell between the two cell types, specifically, \({\mathrm{TmS}}_{{\mathrm{H}}1092} = \frac{\mathrm{total\,RNA\,amount\,per\,cell\,of\,H1092}}{\mathrm{total\,RNA\,amount\,per\,cell\,of\,CAF}} = 0.87\) and \({\mathrm{TmS}}_{\mathrm{CAF}} = \frac{\mathrm{total\,RNA\,amount\,per\,cell\,of\,CAF}}{\mathrm{total\,RNA\,amount\,per\,cell\,of\,H1092}} = 1.2\). We estimated the RNA proportion of H1092 and CAFs using DeMixT (DeMixT_GS function with 4,000 genes selected) under two scenarios: (1) three pure CAF samples were used as reference; and (2) three pure H1092 samples were used as reference. To estimate TmS values, we used the known cell counts to calculate ρ values.

TmS estimation in patient cohorts

A full description of all datasets is provided in Supplementary Note 2.3.1.

TCGA datasets

Raw read counts of high-throughput mRNA sequencing data, clinical data and somatic mutations from 7,054 tumor samples across 15 TCGA cancer types (breast carcinoma, bladder urothelial carcinoma, colorectal cancer (colon adenocarcinoma + rectum adenocarcinoma), head and neck squamous cell carcinoma, kidney chromophobe, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, pancreatic adenocarcinoma, prostate adenocarcinoma, stomach adenocarcinoma, thyroid carcinoma and uterine corpus endometrial carcinoma) were downloaded from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/). ATAC-seq data52, tumor purity and ploidy data79,80 and annotations of driver mutation and indels81 were downloaded for these samples.

Estimation of tumor-specific mRNA proportions from RNA sequencing data

For each cancer type, we filtered out poor-quality tumor and normal samples that were likely misclassified. We then selected available adjacent normal samples as reference for the tumor deconvolution using DeMixT. Based on simulation studies (Supplementary Note 2.2.3) and observed distributions of gene selection scores in real data, we chose the top 1,500 or 2,500 genes (varies across cancer types) to estimate tumor-specific mRNA proportions (π). For each cancer type, the selected 1,500 or 2,500 genes are defined as intrinsic tumor signature genes. We added varying numbers of virtual spike-in samples depending on cancer types. We additionally removed samples with extreme estimates of π, >85% or ranked at the top 2.5 percentile of all samples within each cancer type to mitigate the remaining underestimation when π is close to 1. A full description is provided in Supplementary Note 2.3.2.1.

Consensus TmS estimation

We calculated a consensus TmS as \({\mathrm{TmS}} = \sqrt {{\mathrm{TmS}}_{\mathrm{ASCAT}} \times {\mathrm{TmS}}_{\mathrm{ABSOLUTE}}}\) and removed 264 of 5,295 TCGA samples that deviated from our consensus model, as described previously. A full description on sample exclusions is provided in Supplementary Note 2.3.2.2.

Intrinsic tumor signature genes

For each cancer type, the selected genes used for estimating π are called intrinsic tumor signature genes. We conducted gene set enrichment analyses (GSEAs) on hallmark pathways and KEGG pathways47 for these genes ranked with their gene selection scores from small to large using GSEA82 and g:Profiler83. We further evaluated the chromatin accessibility of intrinsic tumor signature genes using ATAC-seq data from TCGA samples52. For each sample, we calculated the mean of the peak scores of selected genes and compared it with the corresponding permuted null distribution for each cancer type. A full description is provided in Supplementary Note 2.3.2.3.

Association of TmS with genetic alterations and metabolism

We searched among driver mutations (including nonsense, missense and splice-site single-nucleotide variants (SNVs) and indels)81 as well as all non-synonymous mutations (including SNVs and indels) over all genes for the 15 cancer types to identify those that were significantly associated with TmS. We investigated 24 cancer–gene pairs for the driver mutation analysis and 32,894 cancer–gene pairs for the non-synonymous mutation analysis. We applied a Wilcoxon rank-sum test to each candidate gene to compare the distributions of TmS of the samples with mutations versus without mutations. We also fitted a linear regression model on TmS to adjust for TMB. The P values of each gene were adjusted for multiple testing using Benjamini–Hochberg correction across all candidate genes within the corresponding cancer type. See Supplementary Note 2.3.2.4 for further details.
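
For readers unfamiliar with the Benjamini–Hochberg adjustment used here, a minimal standard-library sketch is given below. It follows the usual step-up definition and is not the authors' code; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted P values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In the setting described above, the function would be applied once per cancer type to the P values of all candidate genes in that cancer type.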

TMB was calculated by counting the total number of somatic mutations based on the consensus mutation calls (MC3)84. Chromosomal instability (CIN) scores were calculated as the ploidy-adjusted percent of genome with an aberrant copy number state. ASCAT was used to calculate allele-specific copy numbers32. For samples present in both TCGA and Pan-Cancer Analysis of Whole Genomes (PCAWG), the consensus copy number was derived from published results85. Tumor samples that had undergone whole-genome duplication (WGD) were identified based on homologous copy number information33.

For each cancer type from TCGA, we conducted GSEAs82 on the metabolism of carbohydrate pathways (the Reactome database86). The genes were ranked by the Spearman correlation coefficient between their expression levels and TmS across samples; they were then put through GSEA in the ‘pre-ranked’ mode. For GSEA, we adopted permutation tests (1,000 times) to generate a normalized enrichment score (NES) for each candidate pathway. A hierarchical clustering on the expression levels of the Reactome pentose phosphate pathway (15 genes total, of which two genes were removed due to high-frequency zero counts across samples) for the tumor samples was performed using Euclidean distance and Ward linkage. The samples were then separated into two groups using the ‘cutree’ function. For each cancer type, a Wilcoxon rank-sum test was used to compare the distributions of TmS estimates between the two tumor sample groups. P values were adjusted for multiple testing using Benjamini–Hochberg correction across all cancer types.
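
The gene ranking that feeds the pre-ranked GSEA can be sketched as follows. The helper names and the dictionary layout are assumptions for illustration; ties receive average ranks, and a gene with constant expression is given a coefficient of 0 here purely as a convention of this sketch.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # convention of this sketch for constant inputs
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_genes_by_tms(expr_by_gene, tms_values):
    """Genes sorted by Spearman correlation with TmS, highest first."""
    scores = {g: spearman(v, tms_values) for g, v in expr_by_gene.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The resulting list of (gene, coefficient) pairs is the kind of ranked input that a pre-ranked enrichment analysis expects.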

ICGC-EOPC dataset

In this cohort, matched mRNA sequencing data and whole-genome sequencing data, as well as clinical data including biochemical recurrence, Gleason score and pathologic stage, from 121 tumor samples and nine adjacent normal samples from 96 patients (age at treatment <55 years) were downloaded from Gerhauser et al.37 We used the nine available adjacent normal samples as the normal reference. The mRNA sequencing data came from three batches: batch 1 (17 patients and 25 samples), batch 2 (42 patients and 52 samples) and batch 3 (37 patients and 44 samples). We observed consistency and robustness of DeMixT results with or without batch effect correction. See Supplementary Notes 2.3.1 and 2.3.3 for further details.

METABRIC dataset

This dataset included 1,992 pairs of expression arrays and Affymetrix SNP 6.0 arrays profiled for tumor samples from 1,992 patients, which was divided into a discovery set (997 patients) and a validation set (995 patients)38. A total of 144 expression arrays for adjacent normal tissues were provided.

We applied the DeMixT deconvolution pipeline to the expression arrays of the combined discovery and validation sets, after batch effect correction, to estimate tumor-specific proportions using the adjacent normal samples as the reference. Affymetrix CEL files were processed by PennCNV87 to obtain the LogR and B allele frequency (BAF) data, followed by both ASCAT32 and Sequenza49 to estimate tumor purity and ploidy for each sample. The consensus TmS strategy was applied to obtain robust TmS estimations. In total, 1,664 patient samples with TmS remained after the above steps. We additionally removed 118 patient samples due to missing follow-up information of biochemical recurrence intervals or the PAM50 subtypes. A final cohort of 1,546 patient samples from both the discovery and validation sets was kept for downstream analyses. See Supplementary Notes 2.3.1 and 2.3.4 for further details.

TRACERx dataset

A total of 159 tumor samples from 64 patients with matched RNA sequencing data and WES data were downloaded39,40,88 (see Supplementary Note 2.3.1 for further details). Tumor purity and ploidy were estimated from WES data by Sequenza49. We used RNA sequencing data from normal lung samples without significant pathology in the corresponding tissue types in the GTEx study as the reference for the deconvolution of tumor samples in this dataset (see Supplementary Note 2.3.5 for further details). Focusing on tumor samples with tumor purity > 0.15, we calculated TmS for 116 regions from 52 patient samples, among which 30 patients had at least two regions. We further performed association analysis of regional and sample-specific TmS with measures of chromosomal instability. We defined a subclonal CNA as a CNA present only in a subset of regions. We further defined the evolutionary relationship between two regions from the same patient as either linear or branched. For each evolutionary relationship per patient, we defined the ‘range of TmS’ as \(\log_2(\mathrm{TmS}_{\max}) - \log_2(\mathrm{TmS}_{\min})\) across regions. We fitted linear regression models by taking \(\log_2(\mathrm{TmS}_{\max})\) as the response variable and the percentage of subclonal CNA, number of regions, range of TmS, evolutionary relationship and their interactions as predictors. The best model was selected by stepwise selection based on the Bayesian information criterion (BIC)89. See Supplementary Note 3.3 for further details.
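
As a small illustration of the ‘range of TmS’ defined above, the sketch below computes it for one patient from a list of regional TmS values. The function name is hypothetical.

```python
import math


def tms_range(regional_tms):
    """log2(TmS_max) - log2(TmS_min) across a patient's tumor regions."""
    if len(regional_tms) < 2:
        raise ValueError("at least two regions are needed to define a range")
    return math.log2(max(regional_tms)) - math.log2(min(regional_tms))
```

A range of 1 therefore means that the highest regional TmS is twice the lowest.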

Statistical analysis

Batch effect correction

For RNA sequencing data from multiple batches, we applied batch effect correction using ComBat73 and limma90 to combine RNA sequencing data in one pool before estimating tumor-specific mRNA proportions. See Supplementary Note 3.1 for further details on the robustness of TmS estimation.

Association with clinical variables

Kruskal–Wallis tests were used to compare the distribution of TmS between subgroups defined by each clinical variable. The P values from the Kruskal–Wallis tests were adjusted using Benjamini–Hochberg correction across all available clinical variables within the corresponding cancer type.

Association with survival outcomes

Associations with TmS were assessed in terms of OS, PFI and DFS depending on cancer type and study cohort. For TCGA, we used outcome measures that are recommended by Liu et al.61. If both OS and PFI were recommended, we used the more clinically relevant outcomes for an individual cancer type. We dichotomized pathologic stages into two categories: early (I/II) and advanced (III/IV). For prostate cancers, we used the Gleason score (Gleason score = 7 versus 8+) instead of early and advanced stages. Furthermore, we followed clinical guidelines and physician recommendations to identify tumor samples that were treated without systemic therapy (surgery only) in TCGA and used the corresponding meaningful outcome measures for the selected populations. For all association analyses with clinical outcomes across datasets, we used a recursive partitioning survival tree model, rpart91, to find the optimal TmS cutoff (high versus low) separating different survival outcomes within each of the two stages defined above in each cancer type. Splits were assessed using the Gini index, and the maximum tree depth was set to 2. Log-rank tests between high- and low-TmS groups within early or advanced pathologic stages were performed. We performed sensitivity analysis on the TmS cutoff to confirm that a similar trend can be observed with other values. See Supplementary Note 3.2 for further details on the survival analysis and the identification of patients without systemic therapy.

Cox regression with model selection

We fitted multivariate Cox proportional hazard models with age, stage, TmS (high versus low) and other variables as predictors of OS, PFI or DFS for each dataset and calculated HRs and 95% CIs. We used stepwise model selection with BIC89, where the baseline model included age, stage and TmS as predictors, and the additional candidate variables included the TmS × stage interaction term.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.