De novo compartment deconvolution and weight estimation of tumor samples (DECODER)

Tumors are mixtures of different compartments. While global gene expression analysis profiles the average expression of all compartments in a sample, identifying the specific contribution of each compartment remains a challenge. With the increasing recognition of the importance of non-neoplastic components, the ability to break down the gene expression contribution of each is critical. To this end, we developed DECODER, an integrated framework that performs de novo deconvolution and compartment weight estimation for a single sample. We use DECODER to deconvolve 33 TCGA tumor RNA-seq datasets and show that it may be applied to other data types, including ATAC-seq. We demonstrate that it can be used to reproducibly estimate cellular compartment weights in pancreatic cancer that are clinically meaningful. Application of DECODER across cancer types advances the capability of identifying cellular compartments in an unknown sample and may have implications for identifying the tumor of origin for cancers of unknown primary.

proportions. In addition, like many other quadratic programming-based algorithms, DSA 15 has a minimum sample size requirement to perform the estimation and therefore needs larger datasets.

Pancreatic ductal adenocarcinoma (PDAC) is characterized by relatively low tumor purity and large amounts of desmoplastic stroma. Therefore, identifying tumor-specific alterations in PDAC is a continuing challenge. To perform virtual microdissection and study compartment-specific signatures, we previously deconvolved bulk PDAC samples by adapting the non-negative matrix factorization (NMF) algorithm 16. As a result, we identified two tumor-specific (Basal-like and Classical) and two stroma-specific (Activated and Normal) subtypes, together with exocrine, endocrine and immune compartments.

Like other standard NMF applications, the number of factors (K) that the input matrix may be decomposed into must be determined a priori. Although the performance of NMF at different K may be evaluated by silhouette and cophenetic correlation coefficient, this evaluation assumes the exclusive classification of each sample into one of the K clusters 17,18, which may not be as biologically clear cut. In our previous study, we empirically determined K by dedicated manual association of biological relevance to each compartment at multiple trial runs of K, which can be time consuming and resource intensive 16. Thus, a streamlined framework that automatically determines K is very appealing and would be potentially applicable to the bulk tumor sample deconvolution of any cancer type.

Here we present de novo compartment deconvolution and weight estimation of tumor samples (DECODER), an NMF-based integrated and automatic framework for the de novo deconvolution of tumor mixture samples for compartments, and estimation of compartment weight for samples (Fig. 1a)

compartments, DECODER iterates through multiple runs of a carefully designed NMF framework at increasing K. In each run, the NMF framework trains a gene weight seed (W') by R iterations of the NMF algorithm, followed by a final NMF and a non-negative least squares (NNLS) projection to ensure a robust and reproducible output of W and H at each K (Fig. 1c). With the deconvolved factors at multiple runs of increasing K, factor

Fig. S1a,b). The high level of agreement is reassuring and suggests that DECODER may be used to enable the automated identification of compartments instead of involving labor-intensive manual annotation and empirical determination of K at multiple separate runs.
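The per-K procedure described above (seed training, a final NMF, then a non-negative projection) can be illustrated with a minimal sketch. This is not the DECODER implementation. It assumes that the R seed-training iterations are chained, each starting from the previous run's W, and it replaces a dedicated NNLS solver with multiplicative updates on H with W held fixed. All function names and default parameters are hypothetical.

```python
import numpy as np

EPS = 1e-9  # guards against division by zero in the multiplicative updates


def nmf(X, k, W0=None, n_iter=200, seed=0):
    """Factor X (genes x samples) into non-negative W (genes x k) and H (k x samples)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + EPS if W0 is None else W0.copy()
    H = rng.random((k, X.shape[1])) + EPS
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for the Frobenius-norm objective
        H *= (W.T @ X) / (W.T @ W @ H + EPS)
        W *= (X @ H.T) / (W @ H @ H.T + EPS)
    return W, H


def train_seed(X, k, R=5):
    """ASSUMPTION: chain R NMF runs, each seeded with the previous run's W."""
    W = None
    for r in range(R):
        W, _ = nmf(X, k, W0=W, seed=r)
    return W


def project_weights(X, W, n_iter=500):
    """Estimate H >= 0 with W held fixed (multiplicative-update stand-in for NNLS)."""
    H = np.ones((W.shape[1], X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + EPS)
    return H


def deconvolve_at_k(X, k, R=5):
    """Seed training, final NMF started from the seed, then projection of sample weights."""
    W_seed = train_seed(X, k, R)
    W, _ = nmf(X, k, W0=W_seed)
    H = project_weights(X, W)
    return W, H
```

Calling `deconvolve_at_k(X, k)` for a range of K values would give one (W, H) pair per run. Starting the final factorization from a trained seed rather than a fresh random matrix is what is meant to make the output less sensitive to initialization.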

Deconvolution of 33 TCGA cancer types

We then applied DECODER to TCGA RNA-seq datasets of 33 cancer types separately

Interestingly, even for compartments that were annotated by the same terms, the top genes for different cancer types were found to be different, underscoring the importance of tumor type-specific deconvolution.

For the TCGA PAAD dataset, nine primary compartments were identified. We analyzed

better associated with outcome than B.-C., we then determined a threshold for the ratio (B./C. = 1) to reach optimal accuracy to call subtypes (Fig. 2m), and optimal significance to differentiate patient outcome (Fig. 2n), similar to previous subtype calls (Fig. 2o).

Fig. S3a,b). We applied this algorithm to calculate the

and iClusters (Fig. 4a). We then annotated the compartments by examining the associations of sample weights with cancer types, organ systems and previous cluster-based calls, and successfully associated 19 of them (Fig. 4b-d).
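The subtype call based on the ratio threshold B./C. = 1 mentioned above can be written as a short sketch. The direction of the call (ratio above the threshold gives Basal-like), the treatment of ties, and the handling of a zero classical weight are assumptions made for illustration. The function name is hypothetical.

```python
def call_subtype(basal_weight, classical_weight, threshold=1.0):
    """Call a tumor subtype from the basal/classical compartment weight ratio.

    ASSUMPTIONS: a ratio above the threshold is called Basal-like, anything
    else Classical; samples with no classical weight are handled separately.
    """
    if classical_weight <= 0:
        # The ratio is undefined here; fall back on the basal weight alone.
        return "Basal-like" if basal_weight > 0 else "Undetermined"
    ratio = basal_weight / classical_weight
    return "Basal-like" if ratio > threshold else "Classical"
```

For example, a sample with a basal weight of 0.6 and a classical weight of 0.3 has a ratio of 2 and would be called Basal-like under these assumptions.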

Ten compartments were found to show uniquely higher weights in single cancer types,

(Fig. 3b). Similarly, we found the sample weights for these compartments to be dominantly highest for ten respective ATAC-seq clusters and iClusters (Fig. 4c,d). This suggests that DECODER identified these compartments to be specifically associated with ten cancers, similar to the conclusion in

LGG, reflecting the fact that they are more anatomically similar. Compared with ATAC-seq clusters and iClusters, D29.13 is associated with A5:Brain, and C11:LGG(IDH1 mut) and

compartments present in one may be absent in another.

We applied DECODER to deconvolve each of the 33 cancer types in the TCGA RNA-seq datasets. Our results will facilitate the acquisition of cancer type-specific marker genes.

This may lead to more accurate estimation of certain compartment fractions, since as far

A summary for datasets involved in this study is provided in Supplementary Table S1.

To handle the stochastic nature of the NMF, a stable gene weight seed was trained before

The resultant primary compartments were colored, with the dropped factors, secondary and unstable compartments denoted by black circles. Each factor is labeled by the number of factors (K) at the current run, a dot and the algorithm-assigned random factor number from 1 to K.

tumor purity, ESTIMATE immune score and ESTIMATE stromal score respectively. e Exocrine, immune, classical and basal compartment weights in ADEX, Immunogenic, Pancreatic Progenitor and Squamous subtypes in the ICGC dataset.

hierarchical clustering on samples (columns on the heat map) using compartment weights. Cancer types, ATAC-seq-based cluster calls and iCluster calls for the samples are shown as tracks. Compartments (rows on the heat map) were manually ordered so that the enriched weights are shown on the diagonal. The same order of the compartments is maintained in b-d. b Compartment weights for each cancer type. c & d The means of DECODER compartment weights relative to ATAC-seq-based clusters and iClusters.