Multi-omic signatures identify pan-cancer classes of tumors beyond tissue of origin

Despite recent advances in treatment, cancer continues to be one of the most lethal human maladies. One of the challenges of cancer treatment is the diversity among similar tumors that exhibit different clinical outcomes. Most of this variability comes from wide-spread molecular alterations that can be summarized by omic integration. Here, we have identified eight novel tumor groups (C1-8) via omic integration, characterized by unique cancer signatures and clinical characteristics. C3 had the best clinical outcomes, while C2 and C5 had poorest. C1, C7, and C8 were upregulated for cellular and mitochondrial translation, and relatively low proliferation. C6 and C4 were also downregulated for cellular and mitochondrial translation, and had high proliferation rates. C4 was represented by copy losses on chromosome 6, and had the highest number of metastatic samples. C8 was characterized by copy losses on chromosome 11, having also the lowest lymphocytic infiltration rate. C6 had the lowest natural killer infiltration rate and was represented by copy gains of genes in chromosome 11. C7 was represented by copy gains on chromosome 6, and had the highest upregulation in mitochondrial translation. We believe that, since molecularly alike tumors could respond similarly to treatment, our results could inform therapeutic action.

In spite of recent advances that have improved the treatment of cancer, it continues to reign as one of the most lethal human diseases. More than 1,700,000 new cancer cases and more than 60,000 deaths are estimated to occur in the year 2019, in the United States alone 1 . Cancer can be considered a highly heterogeneous set of diseases: while some tumors may have a good prognosis and are treatable, others are quite aggressive, lethal, or may not have a standard of care [2][3][4] . Cancer can also defy standard classification: a well classified tumor may not respond to standard therapy, as expected, and may behave as a different cancer type [5][6][7] . Fortunately, with the advances of sequencing technologies, data has become available for research as never before. The Cancer Genome Atlas (TCGA), for instance, offers clinical and omic (e.g. genomic, transcriptiomic, and epigenenomic data) information from more than 10,000 tumors across 33 different cancer types 8 . Much of this omic data has the potential to enable us to classify tumors and to explain the striking variation observed in clinical phenotypes [9][10][11][12] .
Omic integration has been successfully applied in previous classification efforts [13][14][15][16] . These classifications have highlighted how molecular groups of tumors highly agree with human cell types. Alternatively, we hypothesize the existence of internal subtypes hidden by cell type and tissue characteristics influencing cell behavior. These subtypes could be distinguished by molecular alterations unlocking cancerous cell-transformation events. To test this hypothesis, we have developed a statistical framework that summarizes omic patterns in main axes of variation describing the molecular variability among tumors. Key features characterizing each axis (i.e. features contributing the most to inter-tumor variability) are retained, while irrelevant ones are filtered. Retained features are then used to cluster tumors by molecular similarities and find specific molecular features representing each group.
Here we show that, after removing all tissue-specific effects, the cancer signal immediately emerges. The new molecular aggrupation, emphasizing on shared tumor biology, has the potential of providing new insights of cancer phenotypes. We expect this novel classification to contribute to the treatment of tumors without a current standard of care, by for example, borrowing therapies from molecularly similar cases. Step 1) Singular value decomposition of a concatenated list of omic blocks and identification of major axes of variation.
Step 2) Identification of omic features (expression of genes, methylation intensities, copy gains/losses) influencing the axes and mapping them onto genes and functional classes (e.g. pathways, ontologies, targets of micro RNA).
Step 3) Mapping major axes of variation via tSNE and cluster definition by DBSCAN.
Step 4) Phenotypic characterization of each cluster of subjects.
www.nature.com/scientificreports www.nature.com/scientificreports/ Re-classification of pan-cancer tumors based on similarities between omics after removing tissue specific signals. Once tissue signal was identified, it was removed from the extended omic matrix.
Next, sparsity constraints were imposed on the omic features in order to zero-out the features with irrelevant contribution to axes of variation and cluster formation. The selected features (i.e. with non-zero effects) across the three omics corresponded with the 18 th , 25 th , 33 th , and 38 th axes (sorted from more to less variance explained) and mapped onto a total of 1200 genes. The cluster identification and projection onto two dimensions revealed eight classes (Fig. 2). As a consequence of removing the effects of tissue localization, all clusters were formed by samples coming from multiple cancer types. Some clusters differed statistically from their cancer types composition (Table 2). However, all cancer types overlapped with more than one cluster ( Fig. 2; Table 2, bottom). Furthermore, this overlap was not influenced by previously reported subtypes (Fig. S2).

Clinical and demographical characterization of tumor clusters. Clusters differed statistically in
terms of patient age (with Cluster 3 and 8 containing samples from slightly younger patients) and sex (with Clusters 2 and 7 having significantly more females than Cluster 8, due to their slightly higher composition of gynecological cancers) ( Table 2). None of the clusters were significantly associated with ethnicity ( Table 2). The most notorious distinctions between clusters were their differences in prognosis and severity traits (Fig. S3). Cluster 3 (the largest cluster in Fig. 2) was distinguished by better prognosis/less severity cancers than the remaining clusters, followed by Clusters 2, 5, 6 and 7. Clusters 4 and 8 were in general the ones with worst prognosis and more aggressive tumors (Table 2). Cluster 3 was also the one with fewest metastatic samples (Fig. S4), higher survival rates, highest tumor-free fraction, lowest stage, lowest intra-tumor heterogeneity (ITH, that estimates the fraction of subclonal and clonal genomes in each sample 18 ), and lowest proliferation rates ( Table 2, Fig S3). By comparison, Clusters 4 and 8 had significantly more metastatic samples than Cluster 3. Cluster 8 had also higher ITH rates than Cluster 3. The highest ITH rates were found in Cluster 5.
Cluster 3 had also the lowest rates of non-silent mutations, aneuploidies, and homologous recombination dysfunction (HRD). The remaining clusters were very similar in terms of genome instability indicators, except for Cluster 2. This cluster had significantly higher rates of HRD than Cluster 3, but significantly lower rates than every other cluster ( Table 2). In terms of immune infiltration, Cluster 3 was characterized by the highest rates of tumor suppressive immune cells and tumor infiltrating lymphocytes ( Table 2). In addition, Cluster 6 had the lowest infiltration of activated natural killer (ANK) cells. Cluster 8 had also the lowest lymphocytic and highest Th2 CD4+ infiltrations, respectively ( Table 2).
Gene signatures characterizing tumor clusters. The clusters were also characterized by distinct sets of omic features, significantly enriched for functions involved in cell cycle (DNA replication, DNA synthesis, and targets of hsa-mir-615-b, a micro RNA involved in cell proliferation) and mitochondrial translation (initiation, elongation, and termination) ( Table 2). To study the pairwise differences across clusters, these gene sets were projected onto scores for each gene, as linear combinations between the features' values mapping onto the gene (i.e. its expression, methylation, and copy number values) and their corresponding activities (i.e. the features effects arising from the sparsity constraints) (see Materials and Methods section). In general, Cluster 3 was characterized by intermediate values of these scores, while the remaining clusters were characterized by higher (i.e. gene set with higher expression than Cluster 3) or lower (gene sets with lower expression than in Cluster 3) gene set scores. Clusters 2, 4, and 6 had significantly higher scores for cell proliferation, and significantly lower for mitochondrial translation. Clusters 1, 7 and 8, on the other hand, had significantly lower scores of proliferation and higher for mitochondrial translation.
We then interrogated all pair-wise comparisons between the scores of each one of the 441 significant genes using Tukey tests (Supplementary Table S2). We identified a subgroup of 123 significant genes that distinguished (percent of non-Hispanic Whites, Afro-descendants, and Asians), Age (at the moment of diagnosis, in years), type of sample (TS%, as percent of normal -N-and metastatic -M-samples), and survival (Surv, as expected time to 50% survival, in years). Age and Surv are represented by median values, with first and third quartiles as measurements of dispersion. Data corresponded to the alignment and intersection of all samples with information of gene expression (GE), methylation (METH), and copy number variants (CNV). *Only the three most abundant ethnicities in the data set were considered to calculate the percent. **Survival quantiles for cancer types with less than five death events were not calculated. each cluster from the rest (for example, POLH had significantly higher scores in Cluster 4 than in every other cluster). The genes characterizing each individual cluster were then used to define signatures. With this criterion, only Clusters 1, 4, 6, 7, and 8 were characterized by distinct signatures of 57, 4, 23, 24, and 15 genes each, respectively. Since the gene scores are combinations of omic features, we looked at the gene expression in each signature and the potential role of copy numbers and methylation in regulating it (Figs. 3 and 4).
Cluster 4′s signature was composed by four genes mapping onto chromosome 6: TDRD6, POLH, PAQR8 and GUCA1A. All these genes exhibited significant copy losses in Cluster 4, and all of them except GUCA1A, were also downregulated. Additionally, POLH was hypo-methylated, while PAQR8 was hyper-methylated ( Fig. 3).

Figure 2.
Pan-cancer clustering of tumor samples: tissue effects correction a selection of omic features. Tumor clusters were obtained by sequential application of tSNE and DBSCAN algorithm for 5,408 samples across 33 cancer types. The contours reflect cluster membership, and the points' colors and shapes represent similar anatomical site and cancer type, respectively. The two-dimensional tSNE projection was obtained from the four deep principal axes of the extended omic matrix projected outside the tissue specific effects, after performing sSVD and removing the first two axes. After re-classifying tumors, the few samples coming from Kidney chromophobe tumors (KICH) did not map in any of the eight clusters obtained. (2020) 10

Discussion
Most pan-cancer classifications rely on molecular alterations that clearly discriminate between tissue of origin 13,15,16,19,20 . However, as soon as tissue effects were removed, we have found that the cancer signal immediately emerged. Distinct cancer classes were formed, containing tumors from different cancer types. These classes were also characterized by very specific functional groups of omic features. A SVD of the original matrix with incidence of omics features can result on a multitude of axes of variation. Such axes have the potential of explaining different patterns of variability across subjects. In this study, we preceded our cluster analysis by selecting axes of variation (i.e. basis vectors spanning the features space of the concatenated omics) having features loadings different from zero (each axis of variation has an accompanying vector of loadings representing features activities). We have obtained the cluster display in Fig. 2 as a result of this selection criterion. Furthermore, most of the variability between clusters of tumors associates with canonical relationship between gene expression and copy number. According to this, the main source of co-variability among features seemed to be dominated by positive covariation of expression and copy number (i.e. copy losses match with lower expression levels, and vice versa, Supplementary Fig. S5). The expression of regulatory elements within the group of selected features (including www.nature.com/scientificreports www.nature.com/scientificreports/ transcription factors and the micro RNA hsa-mir-615b) was, on the other hand, not associated with the expression of their predicted targets. These observations support the role of copy numbers as a major force affecting tumor progression [21][22][23] . Experimental evidence have shown large effects of methylation at characterizing both normal and tumor tissues [24][25][26][27] . Contrarily, epigenetics has an important role during tissue differentiation, as well as in cancer. However, our analysis might suggest a minor role in leading the cancer cluster differences. We believe that this minor role could be the result of an intense correction for tissue specific effects. Other possible explanations include artifacts of data processing, such as summarizing methylation at the CpG island level. Although the map at the CGI level covered both genic and non-genic regions, and facilitated computations, this summary could have come at the cost of washing out CpG site specific effects on cancer. A third possibility is that the abnormal methylation patterns are important, but shared by two or more cancer clusters. Our features highlighted are the ones that differentiate clusters between them. Regardless, we still observed abnormal methylation patterns, that might suggest role in the expression of some genes characterizing tumor classes (e.g. expression of LRN4 and GUCA1A negatively correlated with promoter CpG islands average methylation).
The tumor clusters C1, C4, C6, C7, and C8 had exclusive signatures (i.e. different of every other cluster). Interestingly, the clusters without distinct individual signatures were the ones with more favorable outcomes (C3, C2, and C5). One possible explanation for this is the frequent correspondence between more dramatic molecular alterations and worse clinical outcomes 28,29 . To gain insights about possible biological interactions within each www.nature.com/scientificreports www.nature.com/scientificreports/ signature, we used the accompanying bibliographic results provided by the STRING database 30 (see Material and Methods section). The literature suggests a wide overlap between signatures in terms of gene functions (cell growth, division, small RNA metabolism, protein synthesis, maturation and transport, and mitochondrial dysfunction). In the case of signature C1 (most genes down-regulated), the literature suggested NOP56 (a core component of the small nucleolar ribonucleic protein) as a central element in the signature; interacting with MKKS, NAA20 and PTPRA (genes with roles on mitotic division); ESF1, SNRPB, SNRPB2, POLR3F and CRNKL1 (involved in small RNA processing), PCNA and ITPA (involved correct DNA replication and repair), UBOX 5, RRBP1, RBCK1 and NRSN2 (protein synthesis, maturation and antigen presentation), RBBPP9 (resistance to growth inhibition of TGF); SIRPA and DSTN (cell adhesion) [31][32][33][34] . In the signature C1, NOP56 could be a candidate for future therapeutic intervention. Tumor suppressors NRSN2 and RBCK1 could also be considered.
The three downregulated genes from signature C4 were involved in small RNA maturation (TDRD6, micro RNA expression and maturation), cell proliferation (PAQR8, plasma membrane progesterone receptor), and DNA repair (POLH, DNA polymerase involved in DNA repair). From these groups, PAQR8 and TDRD6 could represent potential targets of therapy. Although neither of them has been directly related to cancer, other members of the PAQR family of progesterone receptors are known tumor suppressors, while TDRD6 has been reported as frequently down-regulated in breast cancer, suggesting its potential use as biomarker 35 . In the case of signature C6 (most genes upregulated), the literature suggests CTTN as interacting with two groups of genes within the signature, either by co-expression or co-localization in amplicons. One group consisted of invasion and anti-apoptotic related genes (e.g. SHANK, PAK1, PPFIA1) and ion transport (ANO1 and TPCN2) 36,37 . The other group consisted of CCND1 (cell cycle check points), LRPS (protein synthesis), RSF1 (chromatin remodeling), and USP35 (protein turnover; through amplicon-mediated overexpression in breast and gynecological cancers) 38,39 . Patients with signature C6 could perhaps benefit by ANO1 inhibitory therapy 37 .
Signature C7 was characterized by multiple genes co-expressing with KLHDC3 (involved in homologous recombination): MEA1 (spermatogenesis), CNPY3 (protein folding, antigen presentation), PPP2R5D (direct catalytic activity), RRP36 (small RNA synthesis), CCND3 (cyclin, cell cycle checks points), and MED20 (transcription). KLHDC3 also belongs to the protein turnover and antigen presentation pathway, together with CUL7 and UBR2. The literature also suggests another group of co-expressing genes within signature C7, consisting of RPL7L (ribosome), MRPL2 and MRPS10 (mitochondrial ribosome). These genes have also been found to physically interact in cell culture 40,41 . Signature C8 genes remarkably overlapped with signature C6 genes, but exhibited opposite regulation (i.e. down-instead of up-regulated). Additionally, the literature suggests interaction between CCND1, NADSYN1 and MRPL20 in signature C8 42,43 . NADSYN1 has been proposed as target of inhibitory therapy in cancer 44 , while MRPL20 has been suggested as biomarker for gastric cancers 45,46 .
The molecular classification of tumors generated clusters with clear differences in prognosis and severity, with C3 exhibiting better outcomes than the remaining clusters. C3 also resembled a previously reported "inflammatory" type, in terms of immune infiltration and cancer type composition (enriched for prostate adenocarcinoma, thyroid, and pancreatic carcinomas and having elevated values of markers for CD4 + Th17 and Th1 cells and low genomic instability) 18 . Although the remaining clusters were clearly distinguished in terms of altered molecular processes, they were highly similar in terms of clinical and demographic characteristics. C3 also differed from the remaining clusters by lacking large CNV. In C3 we do not observe drastic genome alterations been systematically linked with worse cancer outcomes, either by causing loss of tumor-suppressing activities (e.g. loss of mitotic check points, DNA instability sensing, pro-apoptotic activity, etc.), or gain of oncogenic function (e.g. duplication of mitotic factors). In either case, large CNV have been associated with worsen clinical outcomes, in contrast with the ones characterizing C3. This observation is somewhat supported by the fact that less aggressive cancers lying on C3 (e.g. high frequency of prostate and thyroid cancers), co-located with low severity cases of more aggressive tumor types. Another example of less aggressive tumors in C3 are Her2+ breast cancer, and proximal inflammatory lung adenocarcinomas, tumors of less severe outcomes than their luminal/basal and proximal proliferative subtypes, respectively (Collisson et al. 2014) 47 . Since similar signaling deregulation can arise in different cancers (e.g. dysregulated PI3K/AKT/mTOR pathway in gynecologic cancer) 48 , further research on the link between shared molecular signatures within tumors in the same cluster could shade light on the development of novel therapies, or the repurpose and combination of existing ones. Given their small molecular weights, targeting oncogenes with common monoclonal antibodies and small molecule tyrosine kinase inhibitors could aid in the treatment of tumors with overexpressed oncogenes 49 . For instance, tumors with signature C6 could benefit of combined therapy with indirubin and Ani1, inhibitors of CCND1 and ANO1 50,51 . On the other side of the spectrum, targeting tumor suppressor on signatures of downregulated genes also presents exciting opportunities. For instance, tumors with signature C1 could benefit of target therapy for tumor suppressors NRSN2 and RBCK1. Classic approaches for targeting of tumor suppressor genes include re-activation, by either re-introducing a functional copy (e.g. gene therapy), or diminishing the repressive action of other players through small molecule inhibition 52 . Nevertheless, given the technical challenges of targeting loss of tumor-suppressing function, signatures exhibiting up-regulation could have more pharmacological potential. Similarly, signatures could also rapidly address differences in tumor heterogeneity (e.g. C8 and C5 were notoriously more heterogeneous than the rest). Differences in immune infiltration (C6 with the lowest activated natural killers' infiltration and C8 with the lowest lymphocytic one) could also imply the potential use of signatures to aid in immunotherapeutic decisions.
Given the possibility of unveiling different biological channels altered in tumors of similar clinical and molecular characteristics, we believe this novel pan-cancer classification could aid in the identification of therapies for cancers without standard of care. Extrapolation of results herein should be exerted with the following caution. Although our data included information from multiple studies, sexes, ages, and ethnicity, our results could be strongly influenced by factors such as country of origin of each study and biased on demographic characteristics. Further application of our methods to tumors from different country of origin and/or participants from different ages would be essential for an effective generalization of our results.

Material and Methods
pan-cancer data. The TCGA offers a demographically diverse sample with comprehensive and modern multi-omic data. We retrieved data from 5,408 from 33 cancer types made available by the Genome Data Commons (GDC) repository 53 , via the TCG-Assembler R package 54 . Omic data consisted of curated level-three data of genome-wide gene expression (GE), DNA methylation (METH), and copy number variants (CNV) profiles by tumor sample. GE profiles by sample corresponded with the logarithm of RNA-Seq counts by gene (Illumina HiSeq RNA V2 platform). METH profiles corresponded with CpG sites B-values from the Illumina HM450 platform, summarized at the CpG island level, using the maximum connectivity approach from the WGCNA R package 55 , and further transformed into M-values (M = β/(1-β); 56 ). CNV profiles corresponded to gene-level copy number intensity derived from Affymetrix SNP Array 6.0 platform, using human genome V19 as reference. The quality-control filtering process included the exclusion of features with all zeros, or coefficient of variation less than 1%. Samples or features with a disproportion of missing data (>20%) and/or single-sample batches were also excluded. Within the remaining samples, missing values were imputed by k-near neighbors, with k = 3. Each omic block was adjusted by batch effects using ComBat 57 . Final sample size after retaining subjects with information for all three omics was n = 5,408. Demographic information included gender, self-reported race and ethnicity, and patient's age at the moment diagnosis (Table 1). Clinical information consisted of overall survival time and vital status at the final follow up, type of sample (from primary tumor, metastases, or normal tissue), tumor free fraction. We also used previously information from "The Immune Landscape of Cancer" 18 with significant differences between clusters addressed via Kruskal-Wallis tests 58 . These covariates included: intra-tumor heterogeneity fraction (as subclonal genome fraction), and rates of non-silent mutations, aneuploidy, homologous recombination defects (all three derived as deviations from the normal genome), proliferation (normalized difference between number of dividing and non-dividing cells), and information from immune infiltrations (including scores for CD4 + cells, macrophages, lymphocytes, and natural killers) (See supplementary material in 18 for a detailed description of the scores calculation). Briefly, immune infiltration fractions in 18 were derived by CIBERSORT 59 , assigned to different cell classes, and multiplied by the leukocyte fraction derived from methylation data 18 .

Omic integration, clustering and features selection.
Our method can be conceptually described by the following four steps.

(Step 1) Identification of major axes of variation and features selection.
Integrative methods should be able to capture combined effects across omic sites that could either span across omic layers (e.g. epigenetics, gene expression, etc.) or extend genome wide (e.g. considering concomitantly contiguous CpG sites or even separated away sites). Let, where X l l:{1,…,L} is a matrix representing the l-th omic, which row i th contains information representing a sample on one subject, and column j th represents an omic feature (e.g., a feature could be the expression of a specific gene, or the methylation level for a given CpG site). Each group of features coming from a different omic block is centered, standardized, and divided by p l , where p l is the number of features from the l-th omic block. This is done so larger groups of features do not dominate the data integration step. Next, we conduct a sparse Singular Value Decomposition (sSVD) of X to generate one factor that collapses the redundancies in the omics (by creating independent columns representing the independent signals across omic features) and one that collapses redundancies across samples, grouping subjects with similar signaling. This linear factorization can be represented as X ZW = , where Z represents (linearly) independent axes of variability across subjects (i.e. a lower rank approximation), while W represents loadings representing the contribution of each omic feature to this variability. This representation is common to many unsupervised omic integration methods, but is independent of distributional assumptions on each element. In this formulation, Z and W can obtained by minimizing: To the left of the plus sign is the Frobenius norm (a matrix analogous of Euclidean distance) of the difference between X and the product of Z and W. To the right of the plus sign is a penalty on the elements of W to impose sparsity. The purpose of this penalty is to zero-out those features with minor contributions to the columns of Z.
To remove the effect of tissues, or other covariates that can influence the selection of features, we pre-multiplied X by I -Q(Q'Q) −1 Q', where I is a diagonal matrix of ones, and Q is an indicator matrix to represent the membership to a given organ or tissue.

(
Step 2) Identification of omic features (expression of genes, methylation intensities, copy gains/losses) influencing the axes. The linear decomposition achieved by SVD is an intuitive and straightforward way of integrating omics. However, the variability across omics can be governed by just a few features (i.e. highly sparse data) or by groups of interdependent features (i.e. very redundant data). To handle these limitations, we chose P W ( ) , λ α to be the Elastic Net penalty 60 2 , where α balances the regularization between LASSO and ridge regression types of regularization, and λ is associated with the degree of sparsity (i.e. how many features enter in the model?). Unlike LASSO, EN can select groups of correlated features, while zeroing out the irrelevant ones 61 . Equation 1 is solved by obtaining z 1 w 1 (where z 1 is the first column of Z and w 1 is the first row of W) with coordinate descent for given values of λ and α, following the algorithm of 62 , as implemented in 63 , but with the following thresholding operator: sign(w 1 )| |w 1 Consecutive layers are then obtained by subtracting the previous ones from X and repeating the same procedure, as many times as the number of desired axes of variation. The optimal value for λ was empirically determined, as suggested by 62 . We start by 1) calculating W over a dense grid of values for λ (lower λ yields less sparsity), 2) calculating the proportion of variance of X explained by ZW (PVX) for each λ, and 3) choosing the λ at which PVX has its minimum second derivative. Since PVX decreases monotonically with λ, this point represents a drastic drop on PVX, suggesting that the most relevant features accounting for the data variability are already incorporated 62 . The value α was fixed to 0.5 to have an equal contribution of LASSO and Ridge penalties. Once a subset of features was selected, we mapped them onto genes using annotation data of genomic position downloaded from the USCE web browser tool (GRCh38 64 ). The enrichment of functional classes (ontologies, pathways, complexes, etc.) among these genes was tested using the Enrichr package 65 . ( Step 3) Mapping major axes of variation via tSNE and cluster definition by DBSCAN. Additionally, SVD can be coupled with non-linear embedding methods to deal with highly heterogeneous data. Here, we applied t -Stochastic Neighbor Embedding (tSNE) on Z 14 . tSNE is a technique that efficiently takes on local neighborhoods present in high dimension (eventually representing clusters of data), and conserves them while projecting onto a lower dimensional display 66 . This makes tSNE a very powerful technique to reveal clusters, even in very heterogeneous and convoluted data settings 67 . The algorithm has two fundamental parameters: perplexity (which accounts for the effective number of local neighbors), and cost (related to the difference between the neighborhood's distribution in the higher and lower dimensional spaces). Since low cost is an indication of displays more likely to reveal clusters, we selected the maps corresponding with the lowest costs among perplexities of 50 and 100, using 100 thousand iterations to ensure convergence. We applied Density-Based Spatial Clustering of Applications with Noise (DBSCAN 68 ) to identify clusters. DBSCAN is one of the most powerful clustering techniques to delimit clusters of irregular shape, such as the ones tSNE produces 69 . Essentially, DBSCAN identifies groups of densely packed points, without the need of specifying the number of clusters a priori 68 . Neighborhoods of nearby points can then be tuned by evaluating different cluster partitions over a grid of possible neighborhood sizes. We tuned this parameter by maximizing the Silhouette score, as in Taskensen et al. 2016.

(Step 4) Molecular and clinical characterization of clusters.
The association between clusters and scores representing genes and functional classes selected, was studied to define the signatures representing each cluster. Scores were calculated by tacking the columns of X mapping onto a gene, or functional class, and post-multiplying it by the corresponding elements of W′. Due to the transformations of features values within each omic block (e.g. logarithm of standardized RPKM counts, Beta to M-values for CpG islands), scores can be considered to be approximately normal. Using the scores of each gene and functional class as response, and the clusters as explanatory variables, we then conducted a series of ANOVA tests to determine what genes or functional classes were significant in at least one cluster. All pairwise comparisons between significant genes and functional classes were studied via Tukey tests. Gene signatures were defined based on those genes significantly deregulated in a single cluster. For both types of tests, we used a Bonferroni multiple-test correction with P(type I -error) = 0.05 /{#selected genes and functional classes}.
To discuss the possibility of physical or functional relationships between the genes in each signature, we used the STRING data base of protein-protein interactions 30 . We considered an interaction as biologically meaningful whenever it was backed up by empirical data, such as immune precipitation, microarrays, curated databases, etc. Interactions suggested by text-mining (two genes reported in the same scientific publication) were not considered, except in the cases when a publication's results gave evidence of interaction (e.g. genes co-expressing, co-locating, etc.).
The association between clusters and phenotypes (e.g. clinical, demographic, and immunologic covariates) was evaluated via Kruskal-Wallis test 58 (non-parametric analogous of ANOVA). All significant results were further evaluated by Dunn test 70 for pairwise differences (non-parametric analogous of Tukey tests). All steps of our method were implemented in the R programming language 71 , using irlba 72 , dbscan 68 , and Rtsne 73 packages.

Data availability
Clinical and omic data used here can be retrieved from the International Cancer Genome Consortium repository (https://dcc.icgc.org/).