Estimation of tumor cell total mRNA expression in 15 cancer types predicts disease progression

Single-cell RNA sequencing studies have suggested that total mRNA content correlates with tumor phenotypes. Technical and analytical challenges, however, have so far impeded at-scale pan-cancer examination of total mRNA content. Here we present a method to quantify tumor-specific total mRNA expression (TmS) from bulk sequencing data, taking into account tumor transcript proportion, purity and ploidy, which are estimated through transcriptomic/genomic deconvolution. We estimate and validate TmS in 6,590 patient tumors across 15 cancer types, identifying significant inter-tumor variability. Across cancers, high TmS is associated with increased risk of disease progression and death. TmS is influenced by cancer-specific patterns of gene alteration and intra-tumor genetic heterogeneity as well as by pan-cancer trends in metabolic dysregulation. Taken together, our results indicate that measuring cell-type-specific total mRNA expression in tumor cells predicts tumor phenotypes and clinical outcomes.


A mathematical model for tumor
). Single-cell data was generated using the Chromium Single Cell 3' Library, Gel Bead & Multiplex Kit, and Chip Kit (v3, 10x Genomics). Libraries were sequenced on an Illumina NovaSeq6000. Alignment, tagging, and gene and transcript counting were conducted using the 10x Genomic Cell Ranger pipeline (version 3.0).
Liver cancer single-cell RNA sequencing data 1 Three fresh hepatocellular carcinoma samples of primary tumor were collected at the NIH Clinical Center for immune checkpoint inhibition studies (NCT01313442) (Supplementary Table 1). Two of them (patient 1 and patient 2) were from patients who were receiving immunotherapies by needle biopsy, and the other was collected from an untreated patient by surgical resection. Single-cell data was generated using the Chromium Single Cell 3' Library, Gel Bead & Multiplex Kit, and Chip Kit (v2, 10x Genomics).
Libraries were sequenced on an Illumina NextSeq500. Alignment, tagging, and gene and transcript counting were conducted using the 10x Genomic Cell Ranger pipeline (version 2.0.2).
Lung cancer single-cell RNA sequencing data 2 Two fresh lung adenocarcinoma samples of primary, non-metastatic lung tumor were collected from untreated patients by surgical resection at University Hospital Leuven (Supplementary Table 1). Singlecell data was generated using the Chromium Single Cell 3' Library, Gel Bead & Multiplex Kit, and Chip Kit (v1, 10x Genomics). Libraries were sequenced on Illumina HiSeq4000. Alignment, tagging, and gene and transcript counting were conducted using the 10x Genomic Cell Ranger pipeline (version 2.0.0).
Pancreatic cancer single-cell RNA sequencing data 3 Two untreated patients with primary pancreatic cancer were recruited at the University of Texas MD Anderson Cancer Center and informed written consents following institutional review board approval were obtained (Lab00-396 and PA15-0014). Fresh biopsies were collected from the tumors by fine needle aspiration (Supplementary Table 1). Single-cell data was generated using the Chromium Single Cell 3' Library, Gel Bead & Multiplex Kit, and Chip Kit (v1, 10x Genomics). Libraries were sequenced on an Illumina NextSeq500. Alignment, tagging, and gene and transcript counting were conducted by using the 10x Genomic Cell Ranger pipeline (version 3.1).

Single-cell RNA sequencing data processing
In this section, we first introduce the preprocessing for the single-cell RNA sequencing (scRNA-seq) datasets described above (including quality control, cell clustering, cell type annotation), followed by the introduction of a method to group cell clusters within a cell type based on gene counts, i.e., the total number of expressed genes, to simplify the characterization of heterogeneity within the cell type. We then introduce a scale normalization method to correct for sequencing or experimental biases on total UMI counts, and finally the trajectory and cell cycle analyses for the scRNA-seq data.

Quality control, clustering and cell type annotation
For each of the three colorectal adenocarcinoma scRNA-seq samples generated at MD Anderson, genes expressed in less than three cells were removed. Cells with either fewer than 500 total UMIs, below 200 expressed genes, or more than 50% total UMI counts derived from mitochondrial genes were excluded.
The total number of transcripts in each cell was normalized to 10,000, which was followed by a natural log transformation. Highly variable genes were detected and used for principal component analysis (PCA).
Cells were then clustered with the Seurat package 4 . The cell type for each cell was annotated based on known marker genes 5 (Supplementary Note Figure 1, Supplementary Note Table 1). Initial somatic copy number variation (CNV) estimates were made using inferCNV 6 , which was used to calculate CNV scores and CNV correlation scores 1 . The CNV score of a single cell was defined as the sum of the squared copy number variants across all gene positions. The CNV correlation score was calculated as the correlation between the copy number variations of a single cell and the average copy number variation of the top 2% cells ranked by CNV scores from the same sample. Tumor cells were identified as epithelial cells with an average CNV score greater than 0.0015. The three samples from patient 1, patient 2 and patient 3 had 5,444, 7,462 and 2,445 cells remaining, respectively, after data pre-processing.
The quality control of the three hepatocellular carcinoma scRNA-seq patient samples was conducted following the method described by Ma, L. et al 1 . For each sample, genes expressed in less than 0.1% of Supplementary Note Table 1. Marker genes used to annotate cell types in scRNA-seq patient samples from four cancer types.
the cells were removed. Cells with fewer than 700 total UMIs, fewer than 500 expressed genes, or more than 20% total UMI counts derived from mitochondrial genes were excluded. An additional quality control step of doublet removal was performed based on the number of cells loaded and recovered. The total number of transcripts in each cell was normalized to 10,000, followed by a natural log transformation.
Highly variable genes were detected and used for PCA. Cells were then clustered with the Seurat  Table 1). Tumor cells were identified as epithelial cells with CNV scores above the 80th percentile and CNV correlation scores above 0.4. The three samples of patient 1, patient 2 and patient 3 had 83, 761 and 796 cells remaining, respectively, after data pre-processing.
The quality control of the two lung adenocarcinoma scRNA-seq patient samples was conducted following the method described by Lambrechts, D. et al 2 . For each sample, genes expressed in less than 0.5% of the cells were removed. Any cell with either fewer than 201 total UMI counts, below 101 or over 6,000 expressed genes, or more than 10% total UMI counts derived from mitochondrial genes were filtered out from downstream analysis. The total number of transcripts in each cell was normalized to 10,000, followed by a natural log transformation. Highly variable genes were detected and used for PCA. Cells were then clustered with the Seurat package 4 . Cell type (including tumor cell) of each cell was annotated based on known marker genes 2 (Supplementary Note Figure 1, Supplementary Note Table 1). The two samples (patient 1 and patient 2) had 8,845 and 13,658 cells remaining, respectively, after data preprocessing.
For each of the two pancreatic adenocarcinoma scRNA-seq samples, genes expressed in less than three cells were removed. Cells with either fewer than 500 total UMIs, below 200 expressed genes, or more than 50% total UMI counts derived from mitochondrial genes were filtered out. The total number of transcripts in each cell was normalized to 10,000, followed by a natural log transformation. Highly variable genes were detected and used for PCA. Cells were then clustered with the Seurat package 4 . Cell type of each cell was annotated based on known marker genes 7 (Supplementary Note Figure 1, Table 1). Tumor cells were identified as epithelial cells with CNV score above 0.015 and CNV correlation above 0.4. The two samples (patient 1 and patient 2) had 2,404 and 7,037 cells remaining after QC, respectively, after data pre-processing.
Within each cell type, we further merged clusters that did not significantly differ in gene counts (two-sided

Normalized total UMI counts
We performed scale normalization on the raw count data to ensure the total UMI count per cell across all cells are comparable for different samples. Specifically, let UMIi = {UMIigc}GxCi be a matrix of raw UMI counts for the scRNA-seq data for sample i being investigated, with genes g on the rows and cells c on the columns. G denotes the total number of genes, Ci is the number of cells in sample i. Then, the normalized UMI matrix UMIi, denoted as UMI i norm , is calculated as Given a cell cluster, we let ugc denote the amount of mRNA of gene g in cell c. The average total mRNA )/C. For scRNA-seq data, we assume the UMIgc from gene g, cell c is proportional to the total mRNA u gc of gene g in that cell, with a constant kg that represents technical effects: UMI gc =k g * u gc . The constant kg is introduced because every single-cell sequencing platform presents a <100% capture efficiency for mRNA, and such efficiency varies across different platforms 9 .
Under the assumption that the technical effect kg remains constant across cells and is often evaluated as an average effect across genes within the same platform, we can evaluate total mRNA expression in the scRNA-seq data using the average total UMI counts, which is ∑ ( ∑ UMI gc )/C. Notably, we observed Supplementary Note Figure 2. An example of merging cell clusters by gene counts. Tumor cells in patient 2 of colorectal adenocarcinoma are used. The initial 4 clusters were determined by Seurat clustering (resolution=0.5). Two-sided Wilcoxon rank-sum tests comparing gene counts were performed between clusters and those that did not pass the significance level of 0.001 were merged. The resulting two tumor cell clusters had n=1,715 cells (high UMI cluster, e.g. 1, 2, and 4) and n=365 cells (low UMI cluster, e.g., 3), respectively. We repeated this process based on the initial Seurat clustering with resolution=1.0. There were still two tumor cell clusters after merging. The differences of tumor cells in the high UMI cluster and in the low UMI cluster based on the two resolutions were only n=12 cells and n=13 cells, respectively. strong correlations between gene counts and total UMI across cells in each cell cluster across all cell types and cancer types (Supplementary Note Figure 3). This observation supports our assumption of a stable technical effect kg within each study, and that the average total UMI counts serve as a reasonable surrogate to compare total mRNA expression across cells that are generated from the same experiment.
The average gene counts and average total UMI counts for both individual cell clusters and all the clusters pooled within a cell type are summarized in Supplementary Note Table 2.
The observed fold changes in total UMI counts between tumor cell clusters were significantly higher than those expected from expression dosage response from genome ploidy changes alone (at 2-3 folds 10,11 ) among tumor cells (Supplementary Note  Figure 3. Correlations between gene counts and total UMI counts. Smoothed scatter plots show the correlations between gene counts and total UMI counts in cell clusters from each patient sample. In each smoothed scatter plot, the Spearman correlation coefficient is labeled on the top (r).

Cell cycle states of tumor cells
For each patient sample, we calculated the S score and G2M score for each tumor cell using the CellCycleScoring function in the Seurat package. Cells with either S score or G2M score > 0.2 are considered as cycling cells. Among the rest, cells with either 0 < S score ≤ 0.2 or 0 < G2M score ≤ 0.

Gene set enrichment analysis in scRNA-seq data
We performed gene set enrichment analyses for the high and low UMI tumor cell clusters in scRNA-seq data. We first compiled a comprehensive set of signatures with 18,617 human gene sets (containing at least 4 genes) from the Molecular Signatures Database (MSigDB, v6.2) 47 and CellMarker 48 . Among them, 341 gene sets were annotated as stemness signatures based on the key word '_stem_' in their names.
We quantified enrichment for high and low UMI tumor cells using the GeneOverlap R package (v1.24.0) 75 .
GeneOverlap took the DE genes, a gene set and the background genome size (number of expressed genes in the scRNA-seq data expressed in ≥ 10 cells) as input, and gave a P value for the enrichment significance and Jaccard Index, which was calculated by the number of common genes in the DE gene list and the signature gene set divided by the union of them. P values were adjusted for multiple comparisons using the Benjamini-Hochberg (BH) correction. The DE genes between high UMI and low Supplementary Note Table 3. Two-sided t-tests between the total UMI counts of high UMI tumor cell cluster and 3 times of the total UMI counts of low UMI tumor cluster within each patient across four cancer types. P values are adjusted by the Benjamini-Hochberg (BH) method. tumor cells were obtained by the "FindMarkers" function from Seurat with Wilcoxon rank-sum test, based on criteria of adjusted P value < 0.1, genes expressed in ≥ 10 cells, and absolute log2(fold change) > 0.585 (1.5 fold change). haploid genome in cell c. However, the cell level ploidy pc is usually not measurable. Hence, in practice, we use average ploidy of the corresponding cell group to approximate it:

A mathematical model for tumor-specific total mRNA expression
. For non-tumor cells, which are commonly diploid, this assumption is assured.

Model
In the analysis of bulk RNAseq data from mixed tumor samples, we are interested in comparing tumor with non-tumor cell groups. We denote tumor cells by group T, and non-tumor cells by group N. Therefore, we define a tumor-specific total mRNA expression score ( Under the assumption that the tumor cells have a similar ploidy, we can derive TmS without using singlecell-specific parameters as Here we further introduce a tumor-specific mRNA proportion π = ( ∑ T g G g=1 ) and a tumor cell proportion of tumor cells within a sample (termed "tumor purity") r = CT / (CT + CN). Note that deconvolution of just the gene expression data will not provide information on the total number of tumor cells and non-tumor cells, but only the sum of total mRNA expression across all cells of each cell type.
Using these parameters, we rewrite Eq.S1 as Additionally, we can define a ploidy-unadjusted TmS by removing the ploidy terms.

Estimation
It is common practice to assume the ploidy of non-tumor cells N equals to 2 14,15 . Hence, we have In what follows, we use TmS to represent TmS * , for the sake of simplicity. Eq.S2 . Eq.S1 Eq.S3

14
Estimation of tumor-specific mRNA proportion π using high-throughput RNA sequencing has not been possible due to several technical and analytical factors including: 1) the need to account for technical artifacts introduced by varied library size, which currently involves normalization procedures across samples; 2) total mRNA transcripts per cell are confounded with technical artefacts so that normalization procedures adjust for both effects at once, consequently losing the ability to evaluate the downstream global transcriptome feature 16 ; and 3) a limited focus on estimating cell proportions by popular methods [17][18][19] .
Using deconvolution to partition tumor and non-tumor cells within the same sample under the same experimental conditions provides a mathematical means to cancel out the effect of technical artefacts while maintaining the effect of cell-type-specific total mRNA counts. We use the DeMixT model 20 to estimate π. For sample i and across any gene g, we have where Yig represents the scale normalized expression matrix from mixed tumor samples, T'ig and N'ig represent the normalized relative expression of gene g within tumor and surrounding non-tumor cells, respectively. The estimated πis the quantity desired for Eq.S3.
Computational deconvolution methods, e.g., ASCAT 14 and ABSOLUTE 15 , have been developed to perform allele-specific copy number analysis and to estimate tumor purity r and ploidy T from tumor DNA sequencing data. Such statistical methods jointly model the distribution of logR and B allele (or variant allele) frequency (BAF) across germline SNPs, with tumor purity and allele-specific copy number as parameters of interest. Then the tumor purity and ploidy (the average tumor copy number) can be estimated through minimizing the loss function or maximizing the likelihood. Below, we provide a detailed description for these methods using the ASCAT model as an example.
Sequence read counts at known SNP loci were computed from tumor DNA sequencing data. The logR i can be computed from the total read counts in the tumor versus normal for the ith SNP, which provides information on the ratio of total copy number between the tumor and the normal. Specifically, logR i can be expressed as 14 where ρ is the tumor purity, y T is the tumor ploidy, is a constant depends on which DNA sequencing technology is used. n A,i and n B,i stand for the allele-specific copy number of A allele and B allele for the ith SNP in tumor cells, respectively.
On the other hand, allelic imbalance can be inferred from the BAF i for ith SNP. The BAF i can be expressed as 14 Based on Eq.S5 and Eq.S6, the allele-specific copy number can be expressed as a function of the tumor purity and ploidy. Specifically, we have Allele-specific piecewise constant fitting (ASPCF) 21 was then applied to both logR i and BAF i simultaneously, which enforced the change points to occur at the same genomic locations. Consequently, a segmentation of the genome was obtained, each segment corresponding to a genomic region between two adjacent change points. Using the ASPCF smoothed logR i and BAF i , the final values for 3 and 4 T were obtained through the optimization, such that the allele-specific copy number estimates n 3 A,i and n 3 B,i were as close to nonnegative integers as possible for germline heterozygous SNPs.

Improved estimation using DeMixT
Many computational deconvolution methods have been developed to estimate cell type proportions through transcriptome data; however, most of them focus on the cellular proportion and not the global gene expression level of each cell type, due to lack of appropriate normalization approaches. The requiring only a subset of genes with identifiable expression distributions. Therefore, our goal is to select an appropriate set of genes as input to DeMixT that optimizes the estimation of π. In general, genes are expressed at different levels, which, due to different numerical ranges, can affect estimation of π. We found that including genes that are not differentially expressed between the tumor and non-tumor components within the bulk sample, or genes with large variance in expression within the non-tumor component, can introduce large biases into the estimated π. On the other hand, the tumor component is hidden in the mixed tumor samples, hence preventing a DE analysis between mixed and normal samples from finding the best genes. By applying a profile-likelihood based approach to detect the identifiability of model parameters 24 , we systematically evaluated the identifiability for all available genes based on the data, and selected the most identifiable genes for the estimation of π. As a result, the accuracy of the estimated proportions has been improved. As a general method, the profile-likelihood based gene selection strategy can be extended to any method that uses maximum likelihood estimation. We also employed an additional virtual spike-in strategy to balance proportion distributions which further improved model identifiability.

Likelihood model for DeMixT
In the DeMixT model 20 (Eq.S4), we assumed that the observed expression level Yig is a linear combination of two hidden components Tig (tumor, in place of T'ig from now on) and Nig (non-tumor, in place of N'ig from now on), where gene g = 1,2,…,G, sample i = 1,2,…,M , and π i is the tumor-specific total mRNA expression proportions. We assume each hidden component follows the log2-normal distribution, i.e., Fitting the deconvolution model in Eq.S1 can be formally defined as an optimization problem that seeks to identify optimal estimates for sample-level, tumor-specific mRNA proportions π i , and gene-level The full log-likelihood of the DeMixT model can be written as The DeMixT model applies an optimization method, iterated conditional modes (ICM) 25 , to maximize the full log-likelihood function and estimate all distribution parameters (μ T ,σ T ) and proportions π.

Optimized model identifiability
Based on the most stringent definition, for a parametric model However, this rigorous identifiability is difficult to validate for a general high-dimensional and non-convex model, which is the case for the DeMixT model. Thus, for a parameter θ, we use the confidence interval [θ -, θ + ] to measure its identifiability 24 .
In the DeMixT model, if we select genes with small confidence intervals of μ Tg based on profile likelihood, which indicate high identifiability, the corresponding gene g will be more stable and reliable, so will the inferred tumor-specific mRNA proportions (π). As a result, the length of confidence interval of μ Tg serves as an estimable quantity with which we can evaluate the gene g's identifiability and prioritize genes to increase the estimation quality of π i , μ Tg , σ Tg .
The profile likelihood is preferred to compute confidence intervals of parameters that often have better small-sample properties than those based on asymptotic standard errors calculated from the full likelihood 26 . Assume the kth gene's mean parameter μ Tk is the parameter of interest. The definition of the profile log-likelihood function of μ Tk is: The confidence interval of a profile likelihood function can be constructed through inverting a likelihoodratio test 27 . Assume the null hypothesis as H0: μ Tk = x, and the maximum likelihood estimator of (π i , μ Tg , The null hypothesis will not be rejected at the α level of significance if and only if distribution with degrees of freedom equal to 1. Since the maximized likelihood ?π -, μ -T , σ -T @ and model parameters π -, μ -T , σ -T can be estimated by running the DeMixT model on all available gene sets, for any given x, we are able to investigate the profile log-likelihood function μ Tk ?μ Tk = x | π -,μ -T ,σ -T @. Consequently, we can estimate the lower and upper bounds of the confidence interval [μ Tk -, μ Tk + ] as Following the same procedure, we can derive the confidence interval of μ Tk for all available genes. In real data analysis, calculating the actual profile likelihood function of μ Tk across all 20,000 genes is generally infeasible due to computational limits. An asymptotic approximation is necessary in order to quickly evaluate the profile likelihood function. If the measurement noise is small and the sample size is large enough, asymptotic confidence intervals are good approximations of the actual confidence intervals 24 . The asymptotic profile likelihood function can be derived from the observed Fisher information of the log likelihood, denoted as H(π -,μ -T ,σ -T ). Then the asymptotic α level confidence interval of μ Tk can be written as follows 24 Validation. We compared the actual profile likelihood function with the asymptotic profile likelihood function for a random set of 20 genes in real data (the TCGA prostate adenocarcinoma dataset) and observed good performance of the approximate profile likelihoods (Supplementary Note Figure 5). With 20 randomly selected genes, we calculated the root mean squared error (RMSE) between the confidence intervals from the true and asymptotic profile likelihoods is 0.05.
Gene selection score. We now introduce a metric, the gene selection score, which for gene is the width of the asymptotic 95% profile likelihood-based confidence interval of μ Tk for gene Genes with a lower score have a smaller confidence interval, hence higher identifiability in their corresponding parameters. Genes are ranked based on the gene selection score from the smallest to the largest. A subset of genes that are ranked on top will be used for parameter estimation. In the DeMixT R Eq.S8 package (freely available from Bioconductor), our proposed profile-likelihood based gene selection approach is included as function "DeMixT_GS".

A simulation study for profile-likelihood based gene selection
The DeMixT model assumes every gene g has a shared mean (μ Tg ) and variance (σ Tg ) parameters across all tumor samples. However, in real data, this assumption will be violated in selected genes, due to the fact that some genes are significantly differentially expressed in different subtypes of the cancer.
For example, the PAM50 genes are known to be differentially expressed in different molecular subtypes in breast cancer, e.g., Basal, Her2, LumA, and LumB subtypes. Therefore, our simulation aimed to assess the performance of the proposed gene selection method in finding genes that best follow the DeMixT model. The detailed simulation design is described below.
We simulated gene expression of 269 mixed samples and 100 normal references with 10,000 genes, mimicking the real data scenario presented in the TCGA prostate adenocarcinoma dataset. The true π were set as the tumor cell proportions derived from ASCAT. We generated the expression of 10,000 genes under four scenarios: 1) genes that are consistently differentially expressed between the tumor Supplementary Note Figure 5. Asymptotic profile likelihoods for 4 genes using 259 samples from the TCGA prostate cancer dataset. Comparison of asymptotic and actual profile likelihoods of μ T for 4 randomly selected genes in the TCGA prostate adenocarcinoma data. The red curve shows the true profile likelihood of the corresponding parameter. The blue curve shows an asymptotic approximation of the profile likelihood of the corresponding parameter. Asymptotic likelihood Profile Likelihood and normal components; 2) genes that are differentially expressed across subtypes of tumor samples; 3) genes that are consistently expressed similarly between the tumor and normal components; 4) genes that are expressed with large variances. Specifically, for scenario 1), we generated 6,000 genes for the pure tumor T ig and normal references N ig with distributions log 2 ?T ig @ ∼ N 5μ Tg , σ Tg 2 6 and σ Tg were sampled with replacement from the observed standard deviations from the normal samples of TCGA prostate adenocarcinoma. For scenario 2), we generated additional 2,000 subtype genes with samples spit into equal-sized subgroups M 1 , M 2 , and M 3 with corresponding μ T 1 g ~ N?5,1.5 Then we generate the expression with  successfully ranked genes from scenario 1) much higher than others, whereas a routine DE analysis using the two-sided t-test statistic between mixed tumor and normal samples failed to identify these genes (Extended Data Fig. 4d). Across simulations where we selected 100, 250, 1500, 2500, and 8000 genes, "DeMixT_GS" always outperformed "DeMixT_DE" in estimating proportions (Supplementary Note Figure 6a). The dip test 28 was used to measure the unimodality of the distribution of gene expression.
This test statistic was designed to test multimodality of a random variable based on the maximum difference between the empirical distribution and the unimodal distribution of all observed data points (Supplementary Note Figure 6b). It suggests that the proposed gene selection method successfully ranked subtype-specific genes lower than the DE method.
Optimal selection of genes. We also observed that the number of genes selected by "DeMixT_GS" influences the performance of DeMixT. The accuracies of π estimation based on 100, 250, 1,500, 2,500 and 8,000 genes selected by the proposed "DeMixT_GS" were compared. (Supplementary Note Figure   6a, c). Accurate π estimation, as measured by the RMSE, was achieved with 1,500 or 2,500 genes. In real data, we used either the top 1,500 or top 2,500 genes to estimate π.
Supplementary Note Figure 6. Profile-likelihood based gene selection (DeMixT_GS) improves tumor-specific total mRNA expression proportions estimation. a, Root Mean Square Error (RMSE) was calculated for 5 simulated scenario. The center points represent the mean and the bound of error bars represent the mean +/-one standard error of RMSE. b, Density of P values based on a dip test for selected genes by "DeMixT_GS" and "DeMixT_DE" methods. The dip test was applied to indicate the distribution of gene expression for selected genes based on the "DeMixT_GS" and "DeMixT_DE" methods, respectively. A small P value of the dip test suggests the corresponding gene is not unimodally distributed, which violates the model assumption of log2normal distribution across samples. c, Scatter plot of true versus estimated tumor-specific total mRNA expression proportions using "DeMixT_GS" method with different numbers of top-ranking genes with the smallest gene selection score.

Virtual spike-ins to improve identifiability and a simulation study
When the true proportions are skewed towards the high end (i.e., median above 0.5), which is expected to occur frequently in real data (tumor samples with a low percentage of tumor cells are already discarded), the DeMixT estimation procedure, after careful gene selection, tends to slightly underestimate the high proportions. We hypothesize that by shifting the mode of the π distribution close to 0.5, the issue of underestimation will be alleviated. To achieve this, we simulate additional "mixed tumor" samples, i.e.
spike-ins, with close to 0% of π, so that there are roughly the same number of samples with tumor proportions below and above 50%, i.e., S P + P"i P ρ i <0.5#P @ |{i | ρ i ≥0.5}|, where S P represents the number of spike-ins, ρ i represent tumor purity of sample i, and the | • | represent cardinality of a set. For the cancer type whose median tumor purity is below 0.5, we set S P at 5. The spike-ins are generated based on gene expression profiles observed from the input data of normal reference samples.
We where the x-axis represents the rank of genes sorted by gene selection score from low to high and the y-axis represents the estimated gene selection score for the corresponding genes.   Table 4).

Supplementary Note Figure 8. Hierarchical clustering of read count data from adjacent normal and mixed tumor samples for breast and prostate adenocarcinoma in TCGA as examples.
Dendrograms of the hierarchical clustering results based on the top 1,000 differentially expressed feature genes for breast carcinoma (a) and prostate adenocarcinoma (b), respectively. Circles represent adjacent normal samples and dots represent mixed tumor samples. Scale normalization at the seventy-fifth percentile based on the DSS package 50 was then applied to the post quality-control tumor and normal samples. Next, we applied two criteria to filter out spurious genes.
First, we filtered out genes with a zero count in either the mixed tumor or normal samples. Second, we filtered out genes with a large variance (σ -Ng 2 > 0.6) in the normal reference samples. Here, the standard deviation of a gene is calculated as σ -Ng 2 = sd(log 2 ?R .g @), where R .g is the normalized expression of gene g for normal reference samples.
For each cancer type, we applied the "DeMixT_DE" to the quality-controlled expression data together with simulated spike-ins as input data to generate initial tumor-specific mRNA proportions π 0 . We used ASCAT estimated tumor purities as an informed prior to calculate a reasonable number for S P . With other datasets in general, we set S P = max(50, 0.3*Sample size), as the default option of the "DeMixT_GS" function. Results from the TCGA datasets across 15 cancer types were largely consistent with small to moderate changes from the addition of spike-ins (Supplementary Note Figure 9).
We then used these π 0 K as initial values in the profile likelihood calculation on all genes to calculate gene selection scores. We ranked all genes based on their gene selection scores from smallest to largest.
Based on a simulation study and observed distributions of gene selection scores in real data, we chose the top 1,500 or 2,500 genes to ensure accuracy in proportion estimation (Supplementary Note Figure   Supplementary Note 6a). Within each cancer type, we used the spike-ins as benchmarking samples and evaluated the RMSE of the estimated proportions of the spike-ins with either the top 1,500 or top 2,500 genes (π -1500 (Sp) and π -2500 (Sp)). If RMSE(π -1500 (Sp)-0)<RMSE(π -2500 (Sp)-0), we used the results of the top 1,500 genes, i.e., the tumor proportions π=π 1500 ; otherwise, π=π 2500 . In general, the RMSEs were small (median = 0.02 across 15 cancer types), and the two sets of tumor proportions, π -1500 and π -2500 , were consistent within each cancer type. We additionally removed samples with extreme estimates of π, >85% or ranked at the

Consensus TmS estimation
We first calculated TmS values for 5,295 TCGA samples with matched tumor-specific mRNA proportions and ABSOLUTE or ASCAT derived tumor purity and ploidy estimates. We then fitted a linear regression model on log2-transformed TmS calculated by ASCAT using log2-transformed TmS calculated by ABSOLUTE as a predictor variable. We removed samples with a Cook's distance ≥ 4/n (n=5,295) (Extended Data Fig. 3f-h), and for the remaining samples, which were the majority, we calculated the final TmS as: TmS=STmS ASCAT × TmS ABSOLUTE . These TmS estimates were used throughout the paper (Supplementary Note Table 5

Intrinsic tumor signature genes
For each cancer type, we selected the top 1,500 or 2,500 genes based on a gene selection score (Eq.S8), ranked from smallest to largest, as the intrinsic tumor signature gene (Supplementary Note Figure 11a).  We further evaluated the chromatin accessibility of signature genes using ATAC-seq data TCGA samples 33 . For each sample, peak scores (-log10(p-value)) were scaled by dividing each individual peak score by the sum of all of the peak scores in the given sample divided by 1 million. These scaling values ranged from 1.4 to 67.4 across cancer types. The 75th percentile of normalized peak scores across all peaks within the promoter region were selected for each gene as representative peak scores, and genes with normalized peak scores less than 1 were excluded. A total of 7.1% to 20.4% of genes were excluded across cancer types (Supplementary Note Figure 11d). For each sample, we calculated the mean of the peak scores of all signature genes. A null distribution of mean peak scores was generated by calculating means from 1,000 random subsets of genes with the matching number of the signature genes from all genes. P values for signature genes were calculated as the percentile of the permuted means being greater than or equal to the observed mean. P values were adjusted for multiple testing by the BH method.

Association of TmS with genetic alterations
For each cancer type within TCGA, we searched among driver mutations (including nonsense, missense and splice-site SNVs and indels) 32 over all genes for the 15 cancer types to identify those that were significantly associated with TmS. For each cancer type, we considered genes which had driver mutations in at least 10 samples. For each of these genes, samples were labelled as "with driver mutation" if they carried at least one driver mutation in that gene or "without driver mutation" otherwise. We investigated 24 cancer-gene pairs for the driver mutation analysis. We applied a Wilcoxon rank-sum test to each candidate gene to compare the distributions of TmS of the samples with driver mutations versus without driver mutations. The P values of each gene were adjusted for multiple testing using BH correction across all candidate genes within the corresponding cancer type.
We also implemented an agnostic search among non-synonymous mutations (including SNVs and indels) over all genes for the 15 cancer types to identify those that were significantly associated with TmS. For each cancer type, we considered a gene as a candidate gene if there were at least 10 samples containing non-synonymous mutations in that gene. We investigated 32,894 cancer-gene pairs for non-synonymous mutation analysis. We applied two statistical tests to evaluate the difference between the "with nonsynonymous mutation" and "without non-synonymous mutation" samples. We first applied a Wilcoxon rank-sum test for each candidate gene to evaluate the difference between the distribution of TmS of the two group of samples. We then fitted a linear regression model using log2-transformed TmS as the dependent variable and mutation status as a predictor: log 2 (TmS)=b 0 +b 1 log 2 (TMB) +b 2 MUT, where TMB represents tumor mutation burden. MUT=1 if the sample has at least one non-synonymous mutation in the candidate gene, and MUT=0 otherwise. The P values were calculated by a t-test of the regression The P values of each gene based on Wilcoxon rank-sum test and t-test were adjusted by BH correction based on the number of candidate genes within the corresponding cancer type.
We find 5 overlapping pairs out of 6 statistically significant pairs produced from each interrogation (adjusted P values < 0.01). The same significant associations with PIK3CA and TP53 mutations in TmS were found in the TCGA breast cancer study (adjusted P values <0.001). The additional pair found through the agnostic search (FGFR3 in bladder carcinoma in TCGA) was not identified in the driver mutation analysis due to a limited sample size. These associations in breast, lung, thyroid, and bladder cancers show that TmS can capture changes in tumor phenotypes induced by driver mutations in a cancer type-specific manner 32 . This is consistent with previous findings that the same driver mutations may not have the same prognostic effect across cancers 56,57 , and their effects are modified by additional tumor and/or treatment-related factors. In this context, our results indicate that TmS can be used to prognostically stratify tumors beyond driver mutation status.
Tumor mutation burden (TMB) was calculated by counting the total number of all somatic mutations based on the consensus mutations calls (MC3) 58 . Chromosomal Instability (CIN) scores were calculated as the ploidy-adjusted percent of genome with an aberrant copy number state. ASCAT was used to calculate allele-specific copy numbers 21 . For samples present in both TCGA and PCAWG, the consensus copy number was derived from published results 59 . We calculated the Spearman correlation coefficients between TmS and /TMB/CIN scores for each cancer type in TCGA (Supplementary Note Table 6).

Association of TmS with expressions of pluripotency and proliferation genes, and patient characteristics
We found that TmS is mostly uncorrelated with the expression levels of canonical pluripotency genes SOX2, MYC, KLF4 and POU5F1 60 in bulk tissue samples (Supplementary Note Figure 12). The corresponding correlation coefficients are much lower than those of TmS with the proliferation marker genes (MKI67 and PCNA) (Supplementary Note Figure 12). This observation does not necessarily rule out any close relationship between these genes and tumor-cell transcriptome variations, as their expression levels from non-tumor cells, which could be major confounders, were also profiled in the bulk samples.

Supplementary Note
We also observed that TmS is unassociated with clinical characteristics of patients, including age and sex, across the 15 TCGA cancer types (Supplementary Note Figure 13a). We investigated the correlation between TmS and the Tumor-Node-Metastasis (TNM) stage which was dichotomized into early and advanced groups. Since prostate adenocarcinoma, endometrial carcinoma and renal chromophobe did not have TNM stage information, and pancreatic adenocarcinoma had very imbalanced sample distribution in early (n=91) vs. advanced (n=4) stages, they were removed from this analysis. In

ICGC-EOPC
For this dataset, DNAseq-based purity and ploidy estimates for 113 samples from 89 patients were determined by Sequenza 61 . We used the 9 available adjacent normal samples as the normal reference to run DeMixT. The RNAseq data came from three batches -batch 1 (17 patients, 25 samples), batch 2 (42 patients, 52 samples), and batch 3 (37 patients, 44 samples). We have conducted a comprehensive comparison for deconvolution with and without batch effect correction, and concluded that we will report both TmS values estimated with and without batch effect correction (see details in Supplementary Note

3.1.1).
A CONSORT diagram is provided for this dataset to demonstrate the sample filtering steps (Supplementary Note Figure 14).

METABRIC
Tumor purity and ploidy for each of the 1,992 patient samples was estimated using both ASCAT 14 and

TRACERx
This dataset does not contain RNAseq data from adjacent normal samples, which is required for running DeMixT. Instead, we used RNAseq data from normal lung samples which are available in the GTEx 64 study. To mitigate the technical artefacts, such as batch effects, scale normalization was applied before

Adjustment for focal copy number alterations
TmS may potentially be influenced by the boost in RNA content arising from focal amplifications and loss of RNA content from focal deletions. To evaluate this, we re-estimated TmS of 4,897 tumor samples across 15 cancer types in TCGA after excluding the expression data from genes with focal amplifications and homozygous deletions in the estimation of TmS. First, we identified all focal amplifications and homozygous deletion events in each tumor sample using the tumor-specific copy number profiles estimated by ASCAT (Supplementary Note Figure 21). We define a focal event as a CNA smaller than 10MB. For samples without or with a whole genome duplication event, a gene is defined as amplified if the tumor-specific copy number is greater than or equal to 5 or 9, respectively. For genes with amplification or homozygous deletion events occurring in more than 10 samples, we removed their corresponding expression data from all samples. For genes presenting focal CNA events occurring in less than 10 samples, we replaced their expression data in these samples with the median expression from all samples. After making these adjustments, we re-estimated TmS from the TCGA datasets across

Association analysis of TmS in survival outcomes
For the TCGA datasets, we used clinical data that passed at least one of the three quality control steps introduced from the TCGA pan-cancer clinical paper 66 . We used two survival outcomes, the OS and PFI.
To ensure sufficient sample size in each category, we combined the pathologic stages into two categories:  Figure 23. Forest plots of hazard ratios (center point) and 95% of CIs (error bar) of multivariate Cox proportional hazard models using re-estimated TmS in comparison with Fig. 4h. P values of two-sided Wald tests for the covariates are indicated by asterisks (* P < 0.05, ** < 0.01, *** < 0.001). For each cancer type, the number of samples is indicated on the top.
Due to the potential nonlinear relationship between TmS and survival outcomes, we used a recursive partitioning survival tree model, rpart 67 , to find an optimized TmS cutoff that best differentiates survival outcomes within each of the two stages as defined above in each cancer type. The splitting criteria were  Figure 25, for simplification, only the hazard ratios of TmS are shown). Endometrial cancer was excluded in the main analysis due to the lack of stage information, although we also performed Cox regression without Stage for this cancer type (Supplementary Table   5). Using the criteria that the number of samples within the early and the advanced-stage groups respectively should be greater than 30 and the number of events/number of samples should be greater than 10%, we further excluded renal chromophobe and renal papillary carcinoma from the main analysis as presented in Fig. 4h.
We observed the direction of hazard ratios (HRs) of the four types of binarized  on the characteristics of the patient population of interest, tertile or median cutoffs may also be used to study the prognostication effect of TmS.
Using the TCGA datasets from 12 cancer types, we fitted multivariate CoxPH models with Age, TmS, and Stage as the baseline model, and the interaction of TmS and Stage, a cell cycle score 68  We further tested the hypothesis whether the varying sample sizes indeed biased the clinical outcomes associated with TmS. Across the 15 cancer types, we set a consistent sample size around its median value, 300, bootstrapped 300 samples within each cancer type, and evaluated the prognostication effect of TmS in these 4,500 pan-cancer samples. This procedure was repeated 1,000 times. Supplementary Note Fig. 26 shows the pan-cancer results from one bootstrapped set, supporting our finding that high TmS is associated with worse prognosis across cancers. Across all iterations, we obtain the hazard ratios

Regional TmS analysis in TRACERx
We calculated the percentage of copy number alteration burden per region, the percentage of subclonal copy number alteration (CNA) per region, and the percentage of subclonal copy number alteration per patient. For each chromosomal segment i in tumor region k, we use an indicator function I ik to represent the copy number alteration (gain and loss) event 43 : and cnTotal ik is the integer total copy number of this segment 43 .
We then define the percentage of CNA burden for each region as the percentage of genome affected by copy number alterations, where nS and D i denotes the number of shared segments, i.e., segments of the genome where copy number status is available across all regions, and the length of shared segment i across regions, respectively.
Further, for each region, whether the segment i has a subclonal CNA event is defined as where K is the total number of regions for a given tumor sample.
Further, we define T i as an indicator function which denotes whether there is an CNA event (including clonal and subclonal) on shared segment i.
Therefore, the percentage of subclonal CNA for region k (percentage of subclonal CNA per region) is defined as We then introduce S i as an indicator function representing the union of subclonal CNA events on shared segment i across regions: . Correspondingly, the percentage of subclonal CNA for each patient is defined as Across regions, the Spearman correlation coefficient between log2(TmS) and percentage of subclonal CNA per region is 0.44; the Spearman correlation coefficient between log2(TmS) and copy number aberration burden per region is 0.26. The difference between these two correlation coefficients between is statistically significant (bootstrapping 1,000 times, mean difference = 0.2, 95% confidence interval: