Malignant pleural mesothelioma (MPM) is a rare and aggressive disease associated with asbestos exposure1. The World Health Organization (WHO) histological classification distinguishes three major types with prognostic value: epithelioid (MME), biphasic (MMB) and sarcomatoid (MMS)2. In the past decade, genomic studies uncovered molecular profiles (clusters) related to MPM’s histopathological classification, each enriched for somatic alterations in known cancer genes (for example, BAP1 in MME and TP53 in MMS)3,4,5. We and others undertook unsupervised analyses of these data, revealing a molecular continuum of types that explained the prognosis of the disease more accurately than any reported discrete cluster6,7. MPM interpatient heterogeneity at the biological and clinical level is therefore expected to be sufficiently explained by the histopathological classification, with phenotypes ranging from MME to MMS8,9.

Nevertheless, the full extent of MPM phenotypes and the mechanisms by which they evolved are poorly understood. Histopathological features (such as architectural subtypes) and molecular features (such as aneuploidy and immune infiltration) were shown to be independent of histopathological type8,9, suggesting that there are additional sources of heterogeneity that remain unexplained. In addition, although malignant transformation and cancer development can depend on a wide range of genomic aberrations10,11,12, genomic events have not been fully described in MPM as previous efforts have been restricted to profiling only exomes or a reduced representation of genomes3,4,5,13. As a result, biological functions performed by tumor cells, and the role of genomic events in shaping these functions, remain largely unknown, hindering any meaningful progress in the diagnosis, classification and treatment of the disease8.

We designed the MESOMICS study to uncover the main sources of molecular variation explaining MPM intertumoral heterogeneity, and to identify the underlying biological functions. Using multiomic analyses combining genomic, transcriptomic and epigenomic data on a novel cohort of 120 MPM tumors (Supplementary Tables 1–3), we show that the current histopathological classification only explains a fraction of the molecular heterogeneity of the disease, while ploidy, adaptive immune response and CpG island methylation are as important. Taking advantage of a large cohort of whole-genome sequencing (WGS) data, we map the molecular landscape of 120 MPMs and elucidate the link between genotype and phenotype.


Multiomic analyses uncover four axes of molecular variation

We first found that the current histopathological classification only accounts for up to 10% of the interpatient molecular differences (2–10%, depending on the molecular layer, with an average of 6%), leaving at least 90% unexplained (Fig. 1a). We then undertook an unsupervised decomposition of the interpatient molecular heterogeneity using Multi-Omics Factor Analysis (MOFA)14, integrating genomic, transcriptomic and epigenomic data. We identified four independent and reproducible latent factors individually explaining more than 10% of molecular variation in at least one molecular layer, and collectively up to 61% of interpatient differences (19–61%, depending on the molecular layer, with an average of 33%; Fig. 1a, Extended Data Figs. 1–3, Supplementary Fig. 1 and Supplementary Tables 4–7). Only latent factor 2 (LF2) was associated with the histopathological classification, the recent artificial intelligence score based on digital pathology15 and the previously proposed molecular classifications3,4,5,6,7 (median q value = 6.94 × 10−11; Fig. 1b). Therefore, LF1, LF3 and LF4 capture three prominent sources of biological variation overlooked by previous histopathological and genomic studies.
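To make the notion of "variance explained by histopathological type" concrete, the following is a minimal sketch, assuming a simple one-way decomposition (between-group sum of squares over total sum of squares, pooled over the features of one layer). It is not the MOFA procedure and may differ from the exact computation used in the study; the function names, feature names and toy values are hypothetical.

```python
from collections import defaultdict


def sums_of_squares(values, labels):
    """Return (between-group SS, total SS) for one feature."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    between_ss = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss, total_ss


def layer_variance_explained(layer, labels):
    """Fraction of a layer's total variance explained by the grouping.

    `layer` maps each feature name to its per-sample values, in the same
    sample order as `labels`.
    """
    between = total = 0.0
    for values in layer.values():
        b, t = sums_of_squares(values, labels)
        between += b
        total += t
    return between / total if total > 0 else 0.0


# Toy layer (hypothetical values): one feature tracks the type, one does not.
toy_labels = ["MME", "MME", "MMS", "MMS"]
toy_layer = {"gene_a": [1.0, 1.0, 3.0, 3.0], "gene_b": [1.0, 3.0, 1.0, 3.0]}
fraction = layer_variance_explained(toy_layer, toy_labels)
```

In this toy layer, the grouping explains all of the variance of one feature and none of the other, so the pooled fraction is one half.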

Fig. 1: MOFA of whole genomes, transcriptomes and methylomes of the MESOMICS cohort (n = 120).
a, Proportion of interpatient variance within a molecular layer explained by WHO-defined histopathological type (left) and MOFA latent factors 1–4 (right). For example, 7% of variation present in RNA expression can be explained by mesothelioma types, in contrast with 20% explained by integrative MOFA. CN, segmental copy number; DNA alt, rearrangements and mutations; MethBod, DNA methylation level at body regions; MethEnh, DNA methylation level at enhancer regions; MethPro, DNA methylation level at promoter regions; RNA, gene expression level. b, Network of the correlations between latent factors, tumor histopathology and previously published molecular scores. The arc colors, widths and transparency correspond to Pearson correlation coefficients. Features uncorrelated with any other features are highlighted in bold. AI, artificial intelligence; C/V ratio, log2 ratio of CLDN15 to VIM gene expression; S score, sarcomatoid gene expression score; E score, epithelioid gene expression score. c, Interpretation of MOFA latent factors. Plus signs indicate positive correlations and minus signs indicate negative correlations. d, Correlation between the ploidy factor (LF1) and ploidy. e, Correlation between the CIMP factor (LF4) and CIMP index. The samples are colored by histological type. f, Forest plot of the hazard ratios of MOFA latent factors for overall survival. The squares correspond to estimated hazard ratios and the segments correspond to their 95% confidence intervals. In b–e, P values, q values and r coefficients were determined by two-sided Pearson correlation tests. In d and e, the gray bands represent 95% confidence intervals.

LF1 (the ploidy factor) is largely explained by tumor ploidy (r = 0.87; Fig. 1c,d). LF2 (the morphology factor) separates the main histopathological types and thus summarizes the morphological and related molecular classifications (Fig. 1a–c). LF3 (the adaptive response factor) summarizes immune infiltration with adaptive response effectors (lymphocytes) (Fig. 1c). For LF2 and LF3, enhancer methylation was the major molecular layer captured (Fig. 1a), partly explained by its implication in the tumor–immune interaction phenotype captured by LF3, and its variability in MPM samples is probably driven by cell-type heterogeneity (Supplementary Fig. 2 and Supplementary Tables 5, 6 and 8). The major feature captured by LF4 (the CpG island methylator phenotype (CIMP) factor) was methylation at gene body and promoter regions, and most of its molecular variation was strongly associated with the CIMP index (r = 0.92; Fig. 1c,e). We then identified proxies to facilitate the interpretation of the latent factors and their implementation in the clinical setting: aneuploidy for LF1; the percentage of sarcomatoid component as reported by pathologists for LF2; an adaptive versus innate immune response score (Methods) for LF3; and a five-gene CIMP index proxy (Methods) for LF4. LF1, LF3, LF4 and their proxies were statistically independent of histopathological type (that is, all histological types can be either high or low ploidy, have high or low adaptive immune responses and have a high or low CIMP index), further confirming that these latent factors represent independent sources of molecular variation (Extended Data Fig. 4a–c).
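The associations between latent factors and their proxies are reported as Pearson correlation coefficients. As a minimal sketch of that statistic, the function below computes r for two per-sample vectors; the function name and the toy values are hypothetical and do not come from the cohort.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Centered cross-product and sums of squares.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical factor values and ploidies for four samples.
toy_factor = [-1.0, 0.0, 1.0, 2.0]
toy_ploidy = [1.5, 2.0, 2.5, 3.0]
r = pearson_r(toy_factor, toy_ploidy)
```

Squaring r gives the shared variance under a linear fit: r = 0.87 corresponds to roughly 76% and r = 0.92 to roughly 85%.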

In line with our previous observations6, tumor samples did not form clusters in MOFA but rather gradients between extreme molecular profiles (Fig. 1d,e). The ploidy factor ranged between a genomic near-haploidization (GNH) and a whole-genome doubling (WGD) profile, with a gradient of intermediate ploidies due to various levels of chromosome arm and focal amplifications and deletions (Fig. 1d). In contrast with the features found associated with the GNH subtype identified in The Cancer Genome Atlas (TCGA) cohort4, the single near-haploid sample, MESO_108, had a ploidy of 1.10, almost no copy-neutral loss of heterozygosity (LOH) (<1%) and no SETDB1/TP53 mutations and did not undergo WGD. Therefore, this sample does not correspond to the GNH subtype as described by Hmeljak and colleagues4, but to another possible genomic trajectory, where genomic instability is driven by alternative pathways. Differential gene expression analyses showed that, as reported in other tumor types12, the most upregulated enriched pathway in WGD-positive (WGD+) versus WGD-negative (WGD−) cases was E2F targets (q value = 0.048; Supplementary Tables 9 and 10), although we could not replicate this result in the TCGA cohort4, possibly due to the difficulty of replicating such findings in low-sample-size series (n = 11 WGD+ samples). The CIMP factor also ranged between two extreme profiles: CIMP-low and CIMP-high (Fig. 1e). A well-known effect of the CIMP-high phenotype is epigenetic silencing of tumor suppressor genes16. In line with this, we identified five Catalogue of Somatic Mutations in Cancer (COSMIC) tumor suppressor genes17, whose expression was negatively correlated with both the CIMP index and the methylation level of their CpG island(s): CBFA2T3, FBLN2, PRF1, SLC34A2 and WT1 (median q value = 2.6 × 10−3; Supplementary Table 11).

We trained latent factor-based survival models and tested their performance over previously proposed prognostic factors to evaluate to what extent each latent factor captured variability predictive of prognosis (Methods). While individually they provided a prediction value similar to each other, when combining the four latent factors there was an increase in their area under the receiver operating characteristic curve value, suggesting that they capture molecular characteristics with independent prognostic value, being informative of MPM progression in a complementary manner (Extended Data Fig. 5, Supplementary Fig. 3 and Supplementary Tables 12–20). In line with evidence from multiple cancer types12, survival was lowest for the greatest ploidy (Fig. 1f). As expected, samples in the lower extreme of the morphology factor, enriched for sarcomatoid tumors, presented the worst prognosis. The adaptive response factor linked hot tumors (tumors with a high level of immune infiltration) with better survival, whereas CIMP-low tumors had better survival than CIMP-high tumors (Fig. 1f). The previously described proxies also demonstrated prognostic value in the MESOMICS cohort, and allowed for validation of the prognostic value of the latent factors in the validation cohorts (Extended Data Fig. 4d–g). Probably due to the limited power and a potential effect of histology, the prognostic value of the ploidy and CIMP factors was not statistically significant when analyzing MME samples only; however, their respective effect sizes remained similar to those identified in the entire cohort (Supplementary Fig. 3). We additionally validated the existence of the four dimensions as well as their prognostic values in previously published cohorts (Supplementary Tables 21 and 22).
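The forest plot in Fig. 1f reports hazard ratios with 95% confidence intervals. As a minimal sketch, assuming a proportional hazards model in which each latent factor has a fitted log-hazard coefficient and standard error (the model family is an assumption here, and the numbers below are hypothetical), the hazard ratio and a Wald-type interval follow by exponentiation:

```python
import math


def hazard_ratio_with_ci(beta, se, z=1.96):
    """Convert a log-hazard coefficient and its standard error into a
    hazard ratio with a Wald confidence interval (z = 1.96 for 95%)."""
    hr = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return hr, lower, upper


# Hypothetical coefficient for one latent factor.
hr, lower, upper = hazard_ratio_with_ci(beta=0.0, se=0.1)
```

An interval that contains 1 indicates that the factor's association with survival is not significant at the corresponding level.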

Finally, combining molecular and drug response data for 59 MPM cell lines from Iorio et al.18, de Reyniès et al.5 and Blum et al.7, we were able to evaluate the therapeutic value of the ploidy, morphology and CIMP factors (the lack of microenvironment in cell culture models did not allow for replication of the adaptive response factor), by assessing the impact that cell line position along each latent factor had on the response to candidate drugs (Extended Data Fig. 6, Supplementary Fig. 4 and Supplementary Tables 23–26). Significant drug responses associated with the different factors were entirely orthogonal (Extended Data Fig. 6a), highlighting the fact that MOFA latent factors capture independent axes of heterogeneity in both tumoral mechanisms and therapeutic responses. Therefore, both survival and cell line analyses showed that these axes of variation are clinically relevant and have the potential for translation into clinical practice.

Task specialization analyses reveal diverse tumor strategies

Samples along the interdependent morphology and adaptive response factors formed a triangular shape delimited by three extremes (Fig. 2a and Supplementary Fig. 5). The well-established Pareto optimum theory19 (ParetoTI method) predicted that this pattern results from natural selection for cancer tasks, with specialist tumors close to the vertices of the triangle and generalists in the center (triangle fit P value = 0.001; Fig. 2b). Integrative gene set enrichment analysis (IGSEA) pointed to the following cancer tasks and tumor phenotypes: cell division, tumor–immune interaction and acinar phenotype (Fig. 2c and Supplementary Tables 27–30 for archetypes, IGSEA significant pathways and q values).
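One way to read a sample's position inside such a triangle is as a mixture of the three archetypes. The sketch below computes barycentric coordinates of a point in the (LF2, LF3) plane with respect to three vertices. It only illustrates how specialists (one weight near 1) and generalists (weights near 1/3 each) can be told apart; it is not the ParetoTI fitting procedure, and the archetype coordinates are hypothetical.

```python
def archetype_proportions(point, a, b, c):
    """Barycentric coordinates of a 2D point relative to triangle (a, b, c).

    The three weights sum to 1; all are in [0, 1] when the point lies
    inside the triangle.
    """
    (px, py), (ax, ay), (bx, by), (cx, cy) = point, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("archetypes are collinear")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    wc = 1.0 - wa - wb
    return wa, wb, wc


# Hypothetical archetype positions in the (LF2, LF3) plane.
arch_a, arch_b, arch_c = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
specialist = archetype_proportions((0.0, 0.0), arch_a, arch_b, arch_c)
generalist = archetype_proportions((1 / 3, 1 / 3), arch_a, arch_b, arch_c)
```

A sample sitting on a vertex is assigned entirely to that archetype, whereas the centroid of the triangle receives equal weights.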

Fig. 2: Cancer task inference from the morphology and adaptive response factors (n = 120).
a, Sample positions along the morphological (LF2) and adaptive response factors (LF3) are contained within a triangle formed by three phenotypic archetypes (colored vertices). The P value corresponds to a one-sided test from the Pareto fit. b, Ternary plot representing the sample’s distance from the three specialized profiles. The bar plots represent the association between archetypes and histopathological types. c, Summary table of the main phenotypes, features and overexpressed pathways (columns) identified in each profile (rows). Left, arrows indicate the focal profile of each row. Middle left, ternary plots with a color-filled background representing key features for each profile. NRC, normalized read count. Middle right, lollipop plots presenting the correlation between RNA-seq-estimated immune cell infiltration and the proportion of archetypes. Right, expression heatmaps of cancer tasks inferred from each phenotype. The rows represent enriched pathways and the columns represent the samples, ordered by increasing phenotype proportion. The heatmap color scale corresponds to the averaged z score of each gene set. The colored tiles on the right annotate the gene sets that belong to the hyper-pathways inferred from each phenotype.

Tumors specialized in the cell division task displayed upregulation of these pathways, as reported by Hausser et al. in multiple tumor types20. This phenotype was enriched for nonepithelioid tumors and presented higher levels of necrosis, higher grade and a greater percentage of infiltrating innate immune response cells (neutrophils) (median q value = 0.005). Cell division specialization was supported by high expression levels of the proliferation marker MKI67 and increased genomic instability (estimated from genomic, transcriptomic and epigenomic data; median q value = 1.97 × 10−4). Tumors specialized in the tumor–immune interaction task carried upregulated immune-related pathways, high expression of immune checkpoint genes and high immune infiltration with an enrichment for adaptive response cells: B lymphocytes, CD8+ T cells and regulatory T cells (median q value = 2.73 × 10−3). The cell division and tumor–immune interaction specialists also showed high expression of hypoxia response pathways and common enrichment for pathways in the invasion and tissue remodeling universal cancer task. Indeed, we found a higher epithelial-to-mesenchymal transition (EMT) score among tumors in this area of the Pareto triangle, driven by upregulation of mesenchymal genes and hypomethylation of their associated enhancers (median q value = 1.61 × 10−6). In line with in vitro studies showing that asbestos may induce EMT in MPM21, we found a positive correlation between the expression of mesenchymal genes and asbestos exposure score (r = 0.44 and q value = 0.01) and a negative correlation between this score and enhancer methylation of mesenchymal genes (r = −0.33 and q value = 0.02). We also observed overexpression of neoangiogenesis-related genes, corroborating the ability of these tumors to remodel their environment.

The last extreme phenotype was characterized by samples with acinar morphology, presenting a very structured tissue organization with epithelial cells tightly linked into tubular structures, and correlated with the presence of monocytes and natural killer cells (innate immune response cells) (median q value = 0.022). This phenotype presented the lowest EMT score, with overexpression of epithelial markers such as cell adhesion molecules (median q value = 1.21 × 10−3), corroborating the importance of tissue organization in this phenotype, and also low levels of MKI67 expression, indicating slow growth. This phenotype showed no particular tumoral specialization in any task based on the few IGSEA upregulated pathways. In line with the better prognosis reported for this subtype8, the acinar phenotype is characterized by the highest levels of global methylation22 (q value = 5.58 × 10−10). Altogether, these data provide a biological understanding of the molecular and phenotypic heterogeneity characteristic of MPM tumors.

WGS uncovers a diverse genomic landscape

We found 97% (111/115) of MPM tumors harboring at least one large genomic event (copy number variant (CNV), amplicon, homologous recombination deficiency (HRD), chromothripsis or aneuploidy; Fig. 3a). As captured by the ploidy factor, MPM samples ranged from haploid to tetraploid (Fig. 1d). The average CNV profile was highly consistent between cohorts (Supplementary Fig. 6), with several recurrent chromosome arm-level CNVs, as well as focal alterations encompassing known cancer genes (Fig. 3b and Supplementary Tables 31–35). As previously reported23, all of the MTAP alterations co-occurred with CDKN2A/B (Fig. 3a and Supplementary Tables 36 and 37). We also found recurrent deletions of a prominent immune recognition gene, B2M (chr15q14; Fig. 3b).

Fig. 3: Genomic characterization of MPM from the MESOMICS cohort.
a, Recurrent large genomic events. Top, clinical, epidemiological, morphological and technical features per sample. T only represents samples with WGS on the tumor sample only. Bottom left, oncoplot describing the genomic events per sample. amp, amplification; del, deletion; ND, HRD type not determined. Bottom middle, barplot of the frequency of each event within the cohort. Bottom right, comparison of the gene expression of cancer-relevant genes belonging to frequent deletions detected by GISTIC, with regards to their copy number (CN) status. Wild-type (WT) cases correspond to samples without copy number, structural variant or single-nucleotide variant events detected. The box plots represent the median and interquartile range and the whiskers the maximum and minimum values, excluding outliers. The n number above represents the number of biologically independent samples for each test. *0.01 < q value ≤ 0.05; **0.001 < q value ≤ 0.01; ***q value ≤ 0.001. NRC, normalized read count. b, Cohort-level copy number profile (top), with significantly altered regions identified by GISTIC in focal peaks (middle) and at the chromosome (chr.) arm level (bottom). cnLOH, copy-neutral LOH. c, Data from a patient with oncogene amplification due to a chromothripsis event (MESO_019). Left, chromosomes involved in the chromothripsis event (outer circle, shattered regions; intermediate circle, copy number gain and loss; inner circle, structural variants (SVs)). Middle, reconstructed ecDNA structure. Right, gene expression in MESO_019 relative to the expression in other tumors of the cohort (quantile). Oncogenes found within the ecDNA region are represented in red. The P value was determined by two-sided Wilcoxon rank-sum test. kb, kilobases.

A comprehensive analysis of mutational signatures, encompassing single-base substitutions, CNVs and structural variants24,25, allowed us to identify the processes leading to particular somatic alteration patterns (Extended Data Fig. 7). A total of ten active single-base substitution signatures were detected in MPM genomes (Extended Data Fig. 7b); all corresponded to known COSMIC signatures and none was associated with asbestos exposure, as was previously reported3,4. Six tumors were found to have extrachromosomal DNA (ecDNA) (Supplementary Fig. 7 and Supplementary Table 38), and in the one sample with transcriptomic data we found increased expression of the genes predicted to be present on the ecDNA, including the known oncogene BRIP1 (Fig. 3c). We observed that the aforementioned ecDNA sample co-occurred with, and may be fueled by, kataegis26 (Supplementary Fig. 8). Overall, kataegis was rarely seen in our cohort, contributing to only 2% of the MPM clustered mutations (Supplementary Tables 39 and 40). The identified complex mutational processes included a pattern compatible with chromothripsis. This was observed in 20% of the samples (Fig. 3a, Supplementary Fig. 9 and Supplementary Table 39) and also at the transcriptomic level, as fusion transcripts, in half of the positive samples (Supplementary Fig. 10a and Supplementary Tables 41–43). A signature of clustered structural variants was detected and significantly associated with a high structural variant load and chromothripsis (Supplementary Fig. 10b,c and Supplementary Tables 41 and 42). For one sample (MESO_019), the chromothripsis region overlapped with an ecDNA region, suggesting that chromothripsis may have been the source of the circular amplification (Fig. 3c). Finally, 23% of the samples showed a HRD phenotype, identified either by copy number signatures25 or structural variant pattern-based methods27 (Supplementary Fig. 11 and Supplementary Table 40). Among these samples, five harbored pathogenic germline mutations (from the ClinVar database) in one of 26 genes known to be involved in homologous recombination28, significantly more than the two mutations reported in the 77% of samples without HRD (Fisher’s exact test, P value = 0.00587).
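The enrichment of germline mutations among HRD samples is assessed with Fisher's exact test on a 2 × 2 table (HRD status by germline carrier status). The sketch below is a minimal two-sided implementation based on the hypergeometric distribution. The per-group sample counts needed to reproduce the reported P value are not restated here, so the toy table is hypothetical and serves only to show the calculation.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


# Hypothetical table: rows are two groups, columns are carriers / non-carriers.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

With the actual cohort counts, the first row would hold the HRD samples with and without a pathogenic germline mutation, and the second row the non-HRD samples.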

We detected an HRD signature in nine out of 21 MPM cell lines from Iorio et al.18, thus validating the high rate of this pattern in MPM. In addition, the sensitivity of these cell lines to the clinically approved olaparib showed a tendency toward higher sensitivity in HRD samples compared with non-HRD samples (Supplementary Fig. 12). This may be linked with the results of a clinical trial suggesting a highly complex mechanism between the response to this drug and markers for DNA repair pathway activity29. Indeed, in contrast with their original hypothesis, patients with BAP1 mutations had poorer survival when treated with olaparib than wild-type patients. In line with this observation, the olaparib response was positively associated with the prognostic CIMP index factor (r = 0.65; Extended Data Fig. 6), meaning that CIMP-low samples were more sensitive to this poly-ADP ribose polymerase inhibitor than CIMP-high samples (which are enriched for BAP1 alterations (Fig. 5a) and associated with poorer survival (Supplementary Fig. 3)).

Despite the low mutational rate (0.98 nonsynonymous small variants per megabase; Supplementary Fig. 13a and Supplementary Tables 44–46), MPM tumors carry a particularly high number of structural variants relative to tumors with similarly low mutational burden (Fig. 4 and Supplementary Fig. 13b). The top genes altered by structural variants (≥5%) were RBFOX1, NF2, BAP1, MTAP and PCDH15 (Supplementary Fig. 14a). For RBFOX1, 13 out of 39 samples have two separate events, with most deleting part of the RNA-binding protein domain (Supplementary Fig. 14b). Many of these genomic rearrangements resulted in fusion transcripts detected at the transcriptomic level (Supplementary Figs. 10a and 15).

Fig. 4: MPM driver genes in the MESOMICS cohort.
Top, tumor mutational burden (TMB), number of segments or copy number burden (CNB) and structural variant burden (SVB) of each sample. Main, oncoplot describing genomic alterations in IntOGen and structural variant MPM driver genes per sample. These genomic events can co-occur with copy number changes. Large indels and translocations refer to structural variant events detected by structural variant callers while fusion transcripts are detected at the transcriptomic level. Each gene is also annotated as belonging to one focal or arm-level GISTIC event, as well as for being regulated by DNA methylation (right bars). Right, frequency of alterations within the cohort. For each gene, the dark green dot represents the frequency of structural variants. In the legend, ERG indicates whether the sample has one or more alterations in an ERG. Key clinical, epidemiological, morphological and technical features are given for each sample. PCAWG, Pan-Cancer Analysis of Whole Genomes; SNV, single-nucleotide variant; SV, structural variant.

Combining the MESOMICS dataset with the two other large datasets from Bueno et al.3 and the TCGA4, we reached the sample size (n ≈ 300) needed to detect rare driver alterations (1%). The IntOGen pipeline30 discovered 30 MPM driver genes based on small variants (Supplementary Fig. 14c). BAP1, NF2, SETD2, TP53 and LATS2 are all known MPM driver genes. Among the other 25 genes, some were previously reported as recurrently mutated in MPM (PBRM1, KMT2D, DDX3X, PIK3CA, FBXW7, MGA, NF1, SETDB1, MYH9, PTCH1, RHOA and TRAF7)31,32,33 or altered by structural variants (PTPRD and LRP1B)34, two were found overexpressed in MPM cell lines (DNMT3B and EZH2)35 and, for another two, germline mutations have been discovered, suggesting genetic susceptibility (NCOR1 (ref. 36) and MYO5A37). The remaining seven driver genes have, to our knowledge, not been previously reported in MPM, but they are all known cancer genes, as reported in COSMIC: FAT3, NIN, ARHGAP5, HLA-A, NCOR2, SRGAP3 and WNK2. Of note, NF2 and MYH9 (IntOGen drivers) are located within the significantly deleted chr22q region, along with TTC28, a gene frequently altered by structural variants (Figs. 3a,b and 4). Beyond extending the list of putative MPM drivers, combining point mutations with structural variants allowed for refinement of the frequency of alterations in key MPM genes (Fig. 4 and Supplementary Tables 41–46).

Genomic alterations tune the molecular profiles of MPM

Genomic events were associated with all MOFA latent factors and the extreme profiles that they encapsulated, as well as with the phenotypic specialists captured by the morphology and adaptive response factors (Fig. 5a and Supplementary Tables 47 and 48). Associated alterations significantly tuned tumor specialization (P value = 0.003; Methods and Extended Data Fig. 8). In addition to ploidy, NCOR2 alterations and TERT amplification were associated with the ploidy factor (q values = 4.3 × 10−18 and 3.3 × 10−4, respectively; Fig. 5a). Thirty-six samples (31%) displayed TERT amplification, resulting in a significant increase in TERT expression (P value = 1.8 × 10−5; Supplementary Fig. 16a,b). TERT amplification was accompanied by an underlying amplification of chr5p in 81% of the positive cases. While no association was previously detected between TERT promoter mutations and WGD38, here we found that both TERT amplification and its increased expression were associated with WGD events (P value = 1.6 × 10−10 and 0.009, respectively; Supplementary Fig. 16c).

Fig. 5: Impact of genomic events on MPM molecular profiles.
a, Association between genomic events and MOFA factors. For each event, the ALT (altered) versus wild-type difference corresponds to the difference between the mean factor value of wild-type samples and that of altered samples. The q values correspond to an adjusted analysis of variance P value; the dashed horizontal line represents the q value threshold of 0.05. AMP, amplification; Decr., decrease; DEL, deletion. b, Association between CIMP index, EZH2 expression (n = 109 samples) and PRC2 target gene methylation (n = 119 samples). Left, heatmap of EZH2 gene expression (NRC) and CpG island methylation (z score) of PRC2 target genes whose methylation level was significantly positively correlated with CIMP index (q < 0.05), for samples ordered by CIMP index. Right, correlation between WT1 expression and CIMP index. The q value was determined by Pearson correlation test and the gray band corresponds to the 95% confidence interval. c, Effect vector of key alterations affecting specialization in the tumor tasks from Fig. 2. The effect vector corresponds to the difference in position on the Pareto front between the centroid of altered samples and the centroid of wild-type samples. d, Comparison of the timing of large-scale amplifications in the MESOMICS and PCAWG cohorts. The points represent estimates of the timing of genomic events. The empirical P values (red data points) were determined by one-sided outlier tests.

Genomic alterations in epigenetic regulatory genes (ERGs) have previously been shown to drive CIMP in cancer39. In line with this, we found enrichment for ERGs (P value = 3.4 × 10−3; Methods and Supplementary Fig. 17), including the mesothelioma drivers NCOR2 and EZH2, among the genes highly expressed in CIMP-high tumors, and more generally in the list of MPM drivers (q value = 2.1 × 10−5). Chr7q36.1del, encompassing EZH2, further tuned the position of the samples along the CIMP factor (q value = 5.2 × 10−3; Fig. 5a). EZH2 (enhancer of zeste homolog 2) is a histone methyltransferase that functions as part of the Polycomb repressive complex 2 (PRC2) complex to promote gene silencing of specific targets40. Indeed, genes whose CpG island methylation level was highest in CIMP-high tumors were enriched for PRC2 target genes (P value = 0.01; Fig. 5b). WT1, which is found downregulated in CIMP-high tumors, is particularly interesting and a vaccine against this PRC2 target is currently being assessed in clinical trials for mesothelioma41. Cancers frequently associated with a CIMP-high phenotype include colorectal cancer (CRC) and glioma42,43, with BRAF (CRC) and IDH1 (glioma) mutations also associated with this phenotype, as well as with microsatellite instability in CRC42. Microsatellite instability and BRAF/IDH1 mutations were rare or absent events in our series and unrelated to the CIMP phenotype (Supplementary Tables 7, 44 and 49), suggesting that the mutational processes linked with CIMP phenotype in MPM may differ from those of other cancers.

WGD and chromothripsis seemed to push tumors away from the tumor–immune interaction phenotype (q values = 0.042 and 0.012, respectively; Fig. 5c); indeed, both cell division and acinar phenotypes were characterized by low immune cell infiltration (cold tumors), which may be explained by the downregulation of the interferon response pathway and B2M expression seen in WGD+ MPM tumors (q value = 7.4 × 10−17; Supplementary Fig. 18a,b,e and Supplementary Tables 9 and 10). These may represent important mechanisms for WGD+ tumors to avoid the immune response12,44. Chromothripsis has also been associated with low immune infiltration as part of the chromosomal chaos that silences immune surveillance45.

CDKN2A, MTAP and NF2 alterations also converged on cold tumors (median q value = 0.003). Within this cold phenotype, TERT amplification and alterations in TTC28, involved in the mitotic cell cycle, moved tumors towards cell division specialization (q values = 1.6 × 10−4 and 7.4 × 10−4, respectively; Fig. 5c), whereas chr3p21.1del (BAP1, DNAH1 and PBRM1) and BAP1 mutations moved tumors toward the better-prognosis acinar phenotype (q values = 0.021 and 7.1 × 10−4, respectively; Fig. 5c), as expected given the previously reported association between BAP1 alterations and better survival in MPM36. A loss of BAP1 (BRCA1-associated protein-1) expression, measured by immunohistochemistry, was also associated with this phenotype (r = −0.38 and q value = 4.61 × 10−5; Supplementary Fig. 19). Interestingly, an analysis of splicing variation found that the morphology factor and acinar phenotype were significantly associated with alternative splicing events (Supplementary Fig. 20a–f). Major contributions came from events in cell adhesion genes, and neuronal progenitor BAF, neuron-specific BAF and SWI/SNF complexes, potentially affecting the alternative splicing pattern of genes such as BCL11A and SMARCE1 (Supplementary Fig. 20g,h). The fact that these genes (just like BAP1) have important roles in chromatin remodeling suggests that disruption of chromatin remodeling pathways may molecularly define the acinar phenotype.
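The direction in which an alteration "moves" tumors is summarized in Fig. 5c by an effect vector, defined as the difference between the centroid of altered samples and the centroid of wild-type samples on the Pareto front. A minimal sketch of that definition follows; the function names and the sample positions are hypothetical, and the significance testing is not covered.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def effect_vector(positions, altered):
    """Centroid of altered samples minus centroid of wild-type samples.

    `positions` holds per-sample coordinates (for example LF2 and LF3);
    `altered` holds one boolean per sample, in the same order.
    """
    alt = [p for p, flag in zip(positions, altered) if flag]
    wild = [p for p, flag in zip(positions, altered) if not flag]
    if not alt or not wild:
        raise ValueError("need at least one altered and one wild-type sample")
    c_alt, c_wild = centroid(alt), centroid(wild)
    return tuple(a - w for a, w in zip(c_alt, c_wild))


# Hypothetical positions of four samples in the (LF2, LF3) plane.
toy_positions = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (3.0, 1.0)]
toy_altered = [False, False, True, True]
vector = effect_vector(toy_positions, toy_altered)
```

The vector points from the wild-type centroid toward the altered centroid, so its orientation relative to the archetypes indicates which specialization the alteration favors.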

The specialization of tumors can be influenced by early genomic events. Estimates of the timing of WGD, TERT amplification and copy-neutral LOH in the few samples (n = 6) with such events where a subclonal deconvolution was possible showed that our samples fall well within the values observed across >2,500 tumors of the Pan-Cancer Analysis of Whole Genomes Consortium46 (empirical P values = 0.16–0.79; Fig. 5d and Supplementary Fig. 21). Thus, these genomic events may indeed have occurred more than 10 years before diagnosis. Three out of the six patients were exposed to asbestos (of the other three patients, two had no known exposure and one had unknown exposure), among whom two had well-documented periods of exposure, from 56 to 21 years before diagnosis for MESO_048 (including the estimated timing of LOH) and from 54 to 50 years before diagnosis for MESO_057, more than 50 years before the estimated timing of TERT amplification. This suggests that genomic events can occur both concomitantly with and subsequent to asbestos exposure, although conclusive evidence of the timing of these alterations will need to be investigated in hypothesis-driven studies. Using a multiregional subcohort from 13 patients, we found intratumor heterogeneity in all factors except the ploidy factor, further suggesting that genomic events are mostly early and thus do not vary much across regions (Extended Data Fig. 9, Supplementary Fig. 22 and Supplementary Tables 50–52). Finally, we detected neutral tumor evolution close to the acinar phenotype (P value = 0.0024; Supplementary Fig. 23) at extreme values of the morphology and adaptive response factor, suggesting that tumors with this profile were even less influenced by recent genomic events.


The MESOMICS project represents a substantial advancement toward the comprehensive molecular characterization of MPM, made possible by inclusion of a large WGS dataset3,4,34 and by the depth of the multiomic integrative analyses undertaken. We demonstrated that ploidy, adaptive immune response and CpG island methylation constitute independent sources of molecular variation with quantitatively similar impacts on interpatient MPM heterogeneity as the histological classification. Despite some individual observations made in previous studies6,7,13, these three sources of molecular variation have been mostly unexplored or unknown because of the major focus that was put on refining the histological groups, and the lack of comprehensive analysis of a large multiomics dataset. In this sense, the unifying framework aspect of our research approach allowed us to capture the entire molecular landscape of MPM, summarized in four dimensions.

Aneuploidy is one of the morphology-independent features previously reported in MPM4 but poorly characterized. The ploidy factor identified tumors that underwent WGD, previously described in multiple cancer types as an early transformative event that dramatically destabilizes cell genetics and fuels tumor development47. WGD tends to be favored along the evolutionary course of low-mutational-burden tumors like MPM12 and is suspected to serve as a genetic spare tire in case of lethal alterations48. As a consequence, this event shapes the cellular phenotype associated with specific vulnerabilities12.

The CIMP has been reported in several cancer types, most notably CRC and glioblastoma, with inconsistent associations with survival49,50,51. Here we add to the evidence of Blum et al.7 for distinct variation in the CIMP index within mesothelioma tumors, and show that a high CIMP index is independent of morphology and predictive of poorer outcome. While a universal cause for a CIMP-high phenotype has not been established, it has been previously associated with alterations in ERGs39,52. Indeed, our data suggest that some mesothelioma tumors may acquire a CIMP-high phenotype through the activity of the ERG EZH2, to hypermethylate and silence specific target genes. Such a strategy may be warranted to promote malignant transformation in a lowly mutated tumor such as mesothelioma35.

Pareto task inference uncovered three specialized tumor profiles in the space delimited by the interdependent morphology and adaptive response factors, presumably resulting from pressures of the microenvironment, each selecting for adaptive alterations and phenotypic traits. Cell division specialists adopted a fast reproduction strategy that was expected to result from unfavorable and unpredictable environments53, with their genomic instability suggesting adaptation through evolutionary leaps54,55. Immune interaction specialists adopted an immune evasion or camouflage strategy. Both phenotypes also presented characteristics of invasion and tissue remodeling specialists20. These tumors tended to occur in intensely asbestos-exposed individuals, suggesting that chronic inflammation (promoted by asbestos exposure56) may have created the unfavorable environment responsible for selective pressure. Finally, acinar phenotype specialists adopted a structured tissue organization and slow growth strategy. This suggests an equilibrium strategy that is expected to be favorable in stable, resource-rich environments with limited predation57, in line with the lower level of asbestos exposure and limited inflammation and immune infiltration observed in these tumors. Consistent with limited environmental pressures, acinar tumors were enriched for neutral evolution and BAP1 alterations—an event that, when combined with weak asbestos exposure in mice, greatly increased mesothelioma occurrence over weak asbestos exposure alone58.

Overall, the four molecular factors are highly informative and capture specific profiles that are complementary in predicting tumor phenotype and aggressiveness. The fact that they are all independent and mostly unrelated to the morphology factor (histology) means that disregarding them might not only jeopardize the success of any treatment but also miss opportunities to stratify patients based on their molecular profile (Fig. 6). The tightly correlated proxies that we have identified could serve as biomarkers for response to specific therapies (such as immunotherapy for LF3) and could be easily tested in a hypothesis-driven study design. Subsequently, integrating these complementary factors would help to stratify patients for preselected-cohort clinical trials59, a process that has proven to be beneficial in small-cell lung cancer, another aggressive recalcitrant cancer60,61,62. The results of the MESOMICS project pave the way for the establishment of a more clinically relevant morphomolecular classification of MPM tumors.

Fig. 6: Added value of the four-factor molecular classification in understanding intertumor heterogeneity in three example patients.

a, Patients MESO_019, MESO_079 and MESO_085 had nearly identical clinical characteristics. b, The three patients had vastly different profiles based on our four-factor morphomolecular classification: different WGD status (left), opposite positions on the Pareto front (middle) and variable levels of CpG island methylation (right).



This section briefly describes the main methods (see Supplementary Information for details on the data, processing and analyses).


All of the methods were carried out in accordance with relevant guidelines and regulations. This study is part of a larger study, the MESOMICS project, aiming to perform comprehensive molecular characterization of MPM, and was approved by the International Agency for Research on Cancer (IARC) Ethics Committee (project number 15-17). The samples used in this study belong to the virtual biorepository French MESOBANK. Written, informed consent was obtained from all participants and no participant compensation was provided.

Clinical data

Age at diagnosis (in years), sex (male or female), smoking status (nonsmoker, ex-smoker or smoker), asbestos exposure (exposed or nonexposed), previous treatment with chemotherapy drugs (yes or no), treatment information (surgery, chemotherapy, radiotherapy, immunotherapy or cancer history) and survival data (calculated in months from surgery to the last day of follow-up or death) were collected for all 123 patients. The median age at diagnosis was 67.5 years and 73.3% of patients were male.


The MESOMICS cohort includes biological material from 123 patients with MPM (including three nonchemonaive patients who were excluded from all analyses unless explicitly mentioned) kindly provided by the French MESOBANK and annotated with detailed clinical, epidemiological and morphological data. Samples were collected from chemonaive surgically resected tumors, applying local regulations and rules at the collecting site, and included patient consent for molecular analyses, as well as the collection of de-identified data. Samples underwent an independent pathological review by the French MESOPATH reference panel, who determined that of the 120 MPM tumor samples, 79 belonged to the MME type, 26 were MMB and 15 were MMS. Of the 105 samples with an epithelioid component (79 MME and 26 MMB), solid, acinar, trabecular and tubulopapillary architectural patterns were the most frequent in the series (n = 37, 31, 16 and 14, respectively).

Discovery and intratumoral heterogeneity cohorts

Among the 123 patients with MPM, 13 had two tumor specimens collected for the study of intratumoral heterogeneity (ITH). The specimen with the highest tumor content, estimated by pathological review, was selected for this descriptive study and is reported in Supplementary Tables 1–3, and the other region is described in Supplementary Tables 50–52. Additionally, three patients were reported as nonchemonaive and were excluded from the analyses unless explicitly mentioned otherwise in the Methods.

Pathological review

For all 136 samples (123 tumors plus 13 additional regions), a hematoxylin and eosin stain from a representative formalin-fixed, paraffin-embedded block was collected for pathological review. Our pathologist (F.G.-S.) performed a detailed pathological review and classified all tumors according to the 2015 WHO classification63,64. The hematoxylin and eosin stain was also used to assess the quality of the frozen material selected for molecular analyses and to confirm that all frozen samples were at least 70% tumor cells.

Artificial intelligence analysis

Whole-slide image-based artificial intelligence prognostic scores were computed using the artificial intelligence MesoNet model based on morphological features, developed by Owkin—an artificial intelligence for medical research company15.

Statistical analyses

All analyses were performed in R version 4.1.2. All tests involving multiple comparisons were adjusted using the Benjamini–Hochberg procedure, controlling the false discovery rate using the p.adjust R function (stats package version 3.4.4). To limit false discoveries, we took a conservative q value threshold of 0.05. In addition, in line with the American Statistical Association statement on the misuse of P values65, which intends to ‘steer research into a “post P < 0.05 era”’, we report all P and q values, even those that may be closer to arbitrary thresholds such as the 5% threshold. To improve the reproducibility of our results, we summarize in Supplementary Tables 21 and 22 all P and q values reported in the text and main figures, along with details about the tests performed (hypothesis, model and sample size) and replication performed with additional cohorts.
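As an illustration of the multiple-testing adjustment described above, the following minimal Python sketch computes Benjamini–Hochberg q values in the same way as the BH method of p.adjust. The function name benjamini_hochberg and the example P values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q values, in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that q values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical P values for five tests.
example_p = [0.001, 0.008, 0.039, 0.041, 0.6]
example_q = benjamini_hochberg(example_p)
# Apply the q value threshold of 0.05 used in the text.
significant = [q <= 0.05 for q in example_q]
```

Note that the third and fourth tests have P < 0.05 but q > 0.05, which is the situation the conservative q value threshold is meant to catch.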

Survival analysis

Survival analysis has been performed using Cox’s proportional hazard model from which the significance of the hazard ratio between the reference and the other levels has been evaluated using Wald tests. We assessed the global significance of the model using the logrank test statistic (R package survival version 2.41-3) and drew Kaplan–Meier and forest plots using the R package survminer (version 0.4.2).
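For readers unfamiliar with the Kaplan–Meier curves mentioned above, the following minimal Python sketch shows the product-limit estimator on which they are based. It is not the survival or survminer implementation; the function name kaplan_meier and the follow-up times are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up times (for example, months from surgery).
    events: 1 if death was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # Count patients leaving the risk set at time t, and deaths among them.
        n_at_t = sum(1 for tt, _ in data if tt == t)
        deaths = sum(1 for tt, e in data if tt == t and e)
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_at_t
        i += n_at_t
    return curve


# Hypothetical follow-up data for five patients.
example_curve = kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0])
```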

DNA extraction

Included samples were extracted using the Gentra Puregene Tissue Kit (4 g) (158667; Qiagen), following the manufacturer’s instructions. All DNA samples were quantified using the fluorometric method (Quant-iT PicoGreen dsDNA Assay; Life Technologies) and assessed for purity by NanoDrop (Thermo Scientific) 260/280 and 260/230 ratio measurements. The DNA integrity of the fresh frozen samples was checked with a TapeStation system (Agilent Biotechnologies) using Genomic DNA ScreenTape (Agilent Biotechnologies).

RNA extraction

Included samples were extracted using the AllPrep DNA/RNA extraction kit (Qiagen) following the manufacturer’s instructions. All RNA samples were treated with DNAse I for 15 min at 30 °C. The RNA integrity of the frozen samples was checked with a TapeStation system (Agilent Biotechnologies) using RNA ScreenTape (Agilent Biotechnologies).

Because of unsuccessful extraction (impacting either the quality or the quantity), we obtained different numbers of MPM samples for which WGS, DNA methylation or RNA sequencing (RNA-seq) data are available (Supplementary Tables 1–3).

DNA sequencing


WGS was performed by the Centre National de Recherche en Génomique Humaine (Institut de Biologie François Jacob, CEA) on 130 fresh frozen MPMs, 54 of which had matched normal tissue or blood samples. We used an Illumina TruSeq DNA PCR-Free Library Preparation Kit (20015963; Illumina) according to the manufacturer’s instructions and sequenced them on a HiSeq X Five platform (Illumina) as paired-end 150-base pair reads. Samples paired with matched normal tissue or blood had a target sequencing depth of 60× and other samples had a target depth of 30×.

Data processing

WGS reads were mapped to the reference genome GRCh38 (with ALT and decoy contigs) using our in-house workflow (release version 1.0)66. In summary, this workflow relies on the Nextflow domain-specific language67 and consists of four steps: read mapping (software BWA68; version 0.7.15), duplicate marking (software samblaster69; version 0.1.24), read sorting (software sambamba70; version 0.6.6) and base quality score recalibration using GATK71 (version 4.0.12).

Variant calling and filtering on DNA

We performed somatic variant calling using the software Mutect2 (ref. 72) from GATK, as implemented in our Nextflow workflow (release version 2.2b). Multiregion samples were processed jointly using the multisample calling mode of Mutect2. We called germline variants using Strelka2 (ref. 73) version 2.9.10-0 using our Nextflow workflow (release version 1.2a). Annotation was performed with ANNOVAR74 (16 April 2018) using the GENCODE version 33 annotation, COSMIC version 90 and REVEL databases. To call somatic variants on tumor-only samples (72/115), a similar procedure was performed (Mutect2 tumor-only mode) but including further germline-filtering steps using a random forest classifier.

CNV calling

Somatic CNVs were called using the PURPLE software75 version 2.52, as implemented in our Nextflow workflow (version 1.0). We used a total of 57 matched WGS samples of MPM (including multiregion samples) for benchmarking the tumor-only mode of PURPLE. We ran PURPLE twice for each matched sample: first using the matched WGS normal/tumor pair as input and second using only the tumor WGS sample as input.

Structural variant calling

To identify somatic structural variants, including insertions, deletions, duplications, inversions and translocations, we built a consensus structural variants call set by integrating SvABA76 version 1.1.0, Manta77 version 1.6.0 and DELLY78 version 0.8.3 calls with SURVIVOR79 version 1.0.7. Somatic structural variants (minimum structural variant size = 50 base pairs) identified by at least two callers and single-caller predictions with a minimum read support of 15 pairs (including paired-end and split-read evidence) were included in the consensus set of each matched sample.
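The inclusion rule for the consensus set can be expressed as a simple decision function. The Python sketch below covers only that rule and does not reproduce the breakpoint merging performed by SURVIVOR; the record layout (keys size, callers and read_support) is a hypothetical representation chosen for illustration.

```python
MIN_SV_SIZE = 50                 # minimum structural variant size (base pairs)
MIN_SINGLE_CALLER_SUPPORT = 15   # paired-end plus split-read support (pairs)


def keep_in_consensus(sv):
    """Decide whether a merged structural variant enters the consensus set.

    sv is a dict with hypothetical keys:
      size         - event size in base pairs
      callers      - set of caller names that reported the event
      read_support - supporting read pairs (paired-end and split-read)
    """
    if sv["size"] < MIN_SV_SIZE:
        return False
    # Events reported by at least two callers are kept.
    if len(sv["callers"]) >= 2:
        return True
    # Single-caller events need sufficient read support.
    return len(sv["callers"]) == 1 and sv["read_support"] >= MIN_SINGLE_CALLER_SUPPORT


example_sv = {"size": 1200, "callers": {"Manta", "DELLY"}, "read_support": 6}
example_kept = keep_in_consensus(example_sv)
```

Interchromosomal translocations have no meaningful size, so a real implementation would need to treat them separately; that case is left out here.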



RNA-seq

RNA-seq was performed on 126 fresh frozen MPM samples in the Cologne Center for Genomics, of which 109 MPM samples belonged to the discovery cohort (Supplementary Tables 1–3). Libraries were prepared using the Illumina TruSeq Stranded mRNA Sample Preparation Kit (20020595; Illumina) and the pool was sequenced using an Illumina NovaSeq 6000 sequencing device and a paired-end 100-nucleotide protocol.

Data processing

The 126 raw read files from the MESOMICS cohort and the 21 files from the Iorio and colleagues18 mesothelioma cohort (downloaded from the European Genome-phenome Archive (EGA) and Sequence Read Archive websites; datasets EGAS00001000828 and PRJNA523380, respectively) were processed in three steps using the RNA-seq processing workflow based on the Nextflow language (release version 2.3)66. Then, reads were realigned locally using ABRA2 (ref. 80; workflow release version 3.0) and base quality scores were recalibrated using GATK (workflow release version 1.1). Once processed, expression was quantified using StringTie software (version 2.1.2; Nextflow pipeline release version 2.2).

The raw read counts of the 59,607 genes in the expression data matrix, from the MESOMICS, TCGA and Bueno cohorts3,4, from which we removed nonchemonaive samples, were normalized using the variance-stabilizing transform (vst function from R package DESeq2 version 1.14.1); this transformation enables comparisons between samples with different library sizes and different variances in expression across genes.

DNA methylation

EPIC 850K methylation array

Epigenome analysis was performed on 119 MPMs (Extended Data Fig. 1 and Supplementary Tables 1–3), two technical replicates and three adjacent normal tissues. Epigenomic studies were performed at the IARC with the Infinium EPIC DNA methylation beadchip platform (Illumina) used for the interrogation of over 850,000 CpG sites (dinucleotides that are the main target for methylation).

Data processing

The resulting IDAT raw data files were preprocessed using the R packages minfi (version 1.34.0) and ENmix (version 1.25.1). Raw data were then normalized using functional normalization (function preprocessFunnorm; minfi), to reduce technical variation within the data, and probe removal steps were performed to ensure reliability and accuracy of the final dataset. This resulted in a normalized, filtered dataset of 781,245 probes for 139 samples. Finally, beta and M values were extracted (functions getBeta and getM; minfi). Nine probes recorded M values of −∞ for at least one sample, and these values were replaced with the next lowest M value in the dataset. The three normal tissues and one remaining technical replicate were then removed from the beta and M matrices for the subsequent analyses. This resulted in 135 samples: 122 for discovery and an additional 13 for ITH analyses.
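The handling of infinite M values can be illustrated with a short Python sketch. It assumes the usual logit relationship M = log2(beta / (1 − beta)), without an offset, and interprets "the next lowest M value in the dataset" as the lowest finite M value. Both points are assumptions; the function names are hypothetical and this is not the minfi code.

```python
import math

NEG_INF = float("-inf")


def beta_to_m(beta):
    """Convert a beta value in [0, 1] to an M value (logit base 2, no offset)."""
    if beta <= 0.0:
        return NEG_INF
    if beta >= 1.0:
        return float("inf")
    return math.log2(beta / (1.0 - beta))


def replace_neg_inf(m_matrix):
    """Replace -inf entries with the lowest finite M value in the matrix."""
    finite = [v for row in m_matrix for v in row if math.isfinite(v)]
    lowest = min(finite)
    return [[lowest if v == NEG_INF else v for v in row] for row in m_matrix]


# Hypothetical beta matrix: rows are probes, columns are samples.
example_betas = [[0.5, 0.0], [0.2, 0.8]]
example_m = [[beta_to_m(b) for b in row] for row in example_betas]
example_m_clean = replace_neg_inf(example_m)
```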

CIMP index

A CIMP index value was calculated for all samples as follows. The mean beta value across all probes located within each CpG island was calculated per sample, resulting in beta values for 24,891 CpG islands for MESOMICS (EPIC array) and 24,924 CpG islands for the TCGA4 and Iorio and colleagues18 cell line datasets (HM450K array). The CIMP index was then calculated as the proportion of these 24,891 or 24,924 islands with ≥30% methylation (beta value ≥ 0.3) per sample.
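The CIMP index calculation can be sketched in a few lines of Python. The input layout (a dictionary mapping each CpG island to the beta values of its probes in one sample) and the island names are hypothetical; only the 0.3 threshold comes from the text.

```python
def cimp_index(island_betas, threshold=0.3):
    """Proportion of CpG islands whose mean beta value is >= threshold.

    island_betas: dict mapping island identifier -> list of probe beta
    values for a single sample (hypothetical input layout).
    """
    # Mean beta value per island, ignoring islands without probes.
    island_means = [sum(b) / len(b) for b in island_betas.values() if b]
    n_methylated = sum(1 for mean in island_means if mean >= threshold)
    return n_methylated / len(island_means)


# Hypothetical sample with four CpG islands.
example_sample = {
    "island_1": [0.05, 0.15],   # mean 0.10
    "island_2": [0.25, 0.45],   # mean 0.35
    "island_3": [0.80, 0.80],   # mean 0.80
    "island_4": [0.05],         # mean 0.05
}
example_cimp = cimp_index(example_sample)
```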

Integrative unsupervised analyses

We performed five series of analyses with different subsets of samples: (1) discovery analyses with all of our discovery cohort (MESOMICS cohort; 120 samples), for which WGS, RNA-seq and/or 850K methylation array data were available; (2) and (3) replication analyses with the already published data from Bueno3 (181 samples after exclusion of nonchemonaive samples) and Hmeljak and colleagues4 (TCGA cohort; 73 samples in the curated list), respectively; (4) combined analyses integrating the MESOMICS, Bueno and TCGA cohorts3,4 with a total of 374 samples; and (5) replication combining cell lines from the Iorio study18 (for which whole-exome sequencing, expression arrays and RNA-seq, 450K methylation arrays and drug responses in the form of half-maximum inhibitory concentration scores are available (21 samples; 265 drugs)) and the de Reyniès5 and Blum et al.7 datasets (for which expression arrays and drug responses are available (38 samples; three drugs)). In addition, some single-omic analyses are also described in this section.

Preprocessing of expression data

We used normalized read count matrices (see the section ‘RNA-seq’) for subsets (1)–(4), encompassing 59,607 genes. Among these genes, those having less than one fragment per kilobase of exon per million mapped fragments (FPKM) difference across the samples were excluded from the unsupervised analyses. Also, to mitigate sex influence on the expression profiles, we removed genes from the sex chromosomes. For each analysis, the top 5,000 most variable genes were selected. Similarly, the 5,000 most variable genes from the normalized array expression of cell lines (see the section ‘Processing of publicly available expression array processing’ in Supplementary Methods) were selected. Whenever several probes were available for the same gene, the one with the highest intensity was selected.
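The feature selection described above can be sketched as follows in Python. The sketch assumes that "less than one FPKM difference across the samples" means a range (maximum minus minimum) below 1 FPKM, and that variability is ranked by the sample variance of the normalized expression; the function name, argument names and toy values are hypothetical.

```python
from statistics import variance


def select_variable_genes(expr, fpkm, sex_genes, n_top=5000, min_fpkm_range=1.0):
    """Return the n_top most variable autosomal genes.

    expr: dict gene -> normalized expression values across samples.
    fpkm: dict gene -> FPKM values across the same samples.
    sex_genes: set of genes located on sex chromosomes.
    """
    scored = []
    for gene, values in expr.items():
        if gene in sex_genes:
            continue  # mitigate the influence of sex on expression profiles
        if max(fpkm[gene]) - min(fpkm[gene]) < min_fpkm_range:
            continue  # nearly constant expression across samples
        scored.append((variance(values), gene))
    scored.sort(reverse=True)
    return [gene for _, gene in scored[:n_top]]


# Toy example with three samples.
example_expr = {"A": [1, 5, 9], "B": [4, 4, 5], "C": [0, 10, 20], "XIST": [0, 50, 0]}
example_fpkm = {"A": [1, 5, 9], "B": [4.0, 4.2, 4.5], "C": [0, 10, 20], "XIST": [0, 50, 0]}
example_top = select_variable_genes(example_expr, example_fpkm, {"XIST"}, n_top=2)
```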

Preprocessing of methylation data

DNA methylation was available for both the MESOMICS and TCGA cohorts. First, we extracted the M values of the CpGs from the MESOMICS, TCGA4, combined MESOMICS/TCGA and Iorio18 cell line cohorts, respectively81. We excluded sex chromosome CpGs, CpGs that did not pass quality control (see the section ‘DNA methylation’ in Supplementary Methods) and those having less than 0.1 beta value difference across the (1) 119, (3) 73, (4) 192 and (5) 59 samples. Based on this annotation, the CpG list representing the methylation data was divided according to their association with promoters, enhancers or the gene body using the EPIC 850K array manifest B5 (see the section ‘Regional methylation analysis’ in Supplementary Methods), resulting in three datasets, respectively named MethPro, MethEnh and MethBod. For each analysis and dataset, the top 5,000 most variable CpGs (calculated from M values) were selected.

Preprocessing of copy number changes

Copy number change data were available for the MESOMICS, TCGA and MPM cell line cohorts. We assessed the global (total) and minor (minor) allele copy number states at the gene level using, respectively, the total (total) and minor (minor) copy number estimate given by PURPLE (see the section ‘CNV calling’) on the hg38 genome for the MESOMICS cohort and SNP array estimates downloaded from the Genomic Data Commons portal for the TCGA–MESO cohort4 and from the Cell Model Passports portal for the MPM cell lines.

For the three analyses, the resulting value assigned to each gene is an average of the copy number estimate of the tumor by taking into account the tumor purity (purity) estimated by PURPLE. To avoid redundancy, genes with exactly the same resulting copy number value in all samples (because of their genome location proximity) were grouped as one single feature in the dataset. Only the genes or groups of genes altered in at least three samples were selected. To ensure continuity of the data, which is technically necessary for the algorithm, the copy number estimates were centered and scaled before being integrated into the MOFA algorithm. For consistency, somatic CNVs occurring on sex chromosomes were removed and the top 5,000 most variable genes or groups of genes were selected to be integrated.
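The grouping, filtering and scaling steps described above can be sketched in Python as follows. The sketch assumes that a gene is "altered" in a sample when its copy number differs from a neutral value of 2, and that scaling uses the sample standard deviation, as R's scale function does; both are assumptions, and the function and gene names are hypothetical.

```python
from statistics import mean, stdev


def group_and_scale(cn_by_gene, min_altered=3, neutral=2.0):
    """Group genes with identical copy number profiles, filter and scale.

    cn_by_gene: dict gene -> list of purity-adjusted copy number values,
    one per sample. Returns dict feature name -> centered, scaled values.
    """
    # Genes with exactly the same profile become one feature.
    groups = {}
    for gene, profile in cn_by_gene.items():
        groups.setdefault(tuple(profile), []).append(gene)

    features = {}
    for profile, genes in groups.items():
        n_altered = sum(1 for v in profile if v != neutral)
        if n_altered < min_altered:
            continue  # altered in too few samples
        mu, sd = mean(profile), stdev(profile)
        if sd == 0:
            continue  # constant profile cannot be scaled
        features["|".join(sorted(genes))] = [(v - mu) / sd for v in profile]
    return features


# Toy example with five samples.
example_cn = {
    "G1": [2, 1, 1, 1, 2],
    "G2": [2, 1, 1, 1, 2],   # same profile as G1, so grouped with it
    "G3": [2, 2, 2, 3, 2],   # altered in only one sample, so dropped
    "G4": [3, 3, 3, 3, 3],   # constant profile, so dropped
}
example_features = group_and_scale(example_cn)
```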

Preprocessing of genomic alterations data

Somatic structural variants data were used only for integrative analyses (1) and (4), while somatic mutations were used in all analyses. Each gene, altered by somatic splicing, structural variants or exonic, damaging mutations (see the section ‘Damaging variants and driver detection’ in Supplementary Methods) was integrated in a common dataset. Of note, for missense mutations, we used the REVEL annotation included in ANNOVAR for predicting the pathogenicity of these variants and we used a 0.5 cut-off to restrict to the most likely damaging missense events. We also removed genes altered in fewer than three samples. For consistency, we selected genes in non-sex chromosomes, protein-coding or long noncoding RNA genes, and with expression greater than or equal to 0.01 fragment per kilobase of exon per million mapped fragments (FPKM) in at least one sample of the cohort, to be sure to include genes expressed in mesothelioma. We integrated the resulting datasets as a Boolean variable in the following analyses.
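The construction of the Boolean alteration matrix can be sketched in Python as follows. The event record layout is hypothetical, non-missense events are assumed to have already passed the damaging-variant filters, and the REVEL cut-off is assumed to be inclusive (score ≥ 0.5), which the text does not specify.

```python
REVEL_CUTOFF = 0.5  # cut-off from the text; inclusive comparison is an assumption


def is_damaging(event):
    """Keep missense events only when REVEL predicts them to be damaging."""
    if event["type"] == "missense":
        score = event.get("revel")
        return score is not None and score >= REVEL_CUTOFF
    # Splicing, structural and other exonic events: assumed prefiltered.
    return True


def alteration_matrix(events, samples, min_samples=3):
    """Return dict gene -> list of booleans (one per sample)."""
    altered = {}
    for event in events:
        if is_damaging(event):
            altered.setdefault(event["gene"], set()).add(event["sample"])
    return {
        gene: [s in hit for s in samples]
        for gene, hit in altered.items()
        if len(hit) >= min_samples  # drop genes altered in fewer than three samples
    }


# Hypothetical events in four samples.
example_samples = ["S1", "S2", "S3", "S4"]
example_events = [
    {"gene": "BAP1", "sample": "S1", "type": "missense", "revel": 0.9},
    {"gene": "BAP1", "sample": "S2", "type": "structural_variant"},
    {"gene": "BAP1", "sample": "S3", "type": "splicing"},
    {"gene": "NF2", "sample": "S1", "type": "structural_variant"},
    {"gene": "NF2", "sample": "S2", "type": "splicing"},
    {"gene": "TP53", "sample": "S1", "type": "missense", "revel": 0.2},
    {"gene": "TP53", "sample": "S2", "type": "missense", "revel": 0.6},
    {"gene": "TP53", "sample": "S3", "type": "missense", "revel": 0.7},
]
example_matrix = alteration_matrix(example_events, example_samples)
```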

Multiomic integrative analyses

To provide an integrative low-dimensional summary of the molecular variation across the samples, we performed continuous latent factors identification using the software MOFA (R package MOFA2, version 1.7.0). Indeed, MOFA is able to integrate different molecular datasets (layers) by generating independent continuous variables, named latent factors, that explain most variation from the joint datasets. In total, we performed five analyses: (1) MOFA–MESOMICS (n = 120; Fig. 1 and Extended Data Fig. 1a); (2) MOFA–Bueno (n = 181; Extended Data Fig. 1c); (3) MOFA–TCGA (n = 73; Extended Data Fig. 1b); (4) MOFA–3 cohorts (n = 374; Extended Data Fig. 1d) and (5) MOFA–cell lines, as described above (n = 59; Supplementary Fig. 4). Additionally, we ran MOFA on our discovery cohort, including the ITH samples (MOFA–ITH; n = 134) to evaluate the ITH within MPM samples.

MOFA was performed independently for each analysis, setting the number of latent factors to ten (function runMOFA from the R package MOFA2). A summary of all of these runs is given in Extended Data Figs. 1 and 2, Fig. 1 and Supplementary Figs. 1 and 4 and coordinates and proportions of variance explained for models (1)–(4) are given in Supplementary Tables 4–8, while those for MOFA–ITH are given in Supplementary Tables 50–52 and those for the cell lines (model (5)) are given in Supplementary Tables 23–26. A comparison with other multiomic methods is provided in Extended Data Fig. 10 (see section ‘Multiomic integrative analyses details’ in Supplementary Methods).

Evolutionary tumor trade-off analyses

Pareto task identification

The Pareto front model was fitted to different sets of samples using the ParetoTI R package (release version 0.1.13), following the above-mentioned analyses (1)–(4), and additionally on two different kinds of molecular maps: using MOFA (restricting to LF1, LF2, LF3 and LF4) and using expression principal component analysis as technical validation (see the section ‘RNA-seq’). In brief, the algorithm tries to find polyhedra by testing successively 1 to n axes, adding them one after another in decreasing order of transcriptomic variance explained. For this technical reason, the MOFA latent factors were ordered as follows by decreasing transcriptomic variance explained: morphology factor (LF2), adaptive response factor (LF3), CIMP factor (LF4) and ploidy factor (LF1). For each number n of axes used, ParetoTI identifies the position of the n + 1 = k vertices (archetypes) in the molecular map defined, and we used 200 bootstraps, each taking 75% of the data to measure the variability in archetype position and infer archetype positions robust to outliers (function fit_pch_bootstrap with the parameters bootstrap = T and bootstrap_N = 200; see our code).

Interpretation of tumor archetypes

To further characterize the phenotype of each archetype, we used the proportion of each archetype for each sample estimated by ParetoTI. These proportions were used as continuous variables to further test the association between each archetype and clinical, epidemiological and morphological variables, as well as molecular data (Supplementary Tables 27–30).
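To give an intuition for what an archetype proportion is, the following Python sketch computes the barycentric coordinates of a sample within a triangle of three archetypes in a two-dimensional molecular map. This is a geometric illustration under simplifying assumptions, not the ParetoTI estimation procedure; the function name and all coordinates are hypothetical.

```python
def archetype_proportions(point, a, b, c):
    """Barycentric weights of a 2D point relative to archetypes a, b and c.

    The weights sum to 1 and satisfy point = w1*a + w2*b + w3*c. Points
    outside the triangle receive at least one negative weight.
    """
    (x, y), (x1, y1), (x2, y2), (x3, y3) = point, a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("archetypes are collinear")
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return w1, w2, 1.0 - w1 - w2


# Hypothetical archetype positions on two latent factors.
cell_division = (0.0, 0.0)
immune_interaction = (1.0, 0.0)
acinar = (0.0, 1.0)
example_weights = archetype_proportions(
    (0.25, 0.25), cell_division, immune_interaction, acinar
)
```

In this toy example the sample is half cell division specialist and one quarter each of the other two archetypes; such weights are the kind of continuous variable that can then be tested for association with other data.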

More specifically, we inferred each archetype phenotype by performing IGSEA on the expression data. To do so, we used the ActivePathways R package (release version 1.1.0), which is a tool able to integrate different sources of molecular variation to assess the enrichment of Gene Ontology terms by combining P values from different association tests between sources and gene-level data. Here we integrated these proportions as different axes of molecular variation. We restricted the Gene Ontology terms to a minimum size of 20 genes and a maximum size of 1,000 genes as the default parameters of ActivePathways. To infer the pathways specifically altered in each archetype, we integrated the P value of the Pearson correlation of each gene from the expression matrix of 59,607 genes with the proportion from each archetype and we selected the pathways for which the enrichment source only corresponded to the tested archetype. We performed two kinds of analyses: one restricted to the genes positively correlated with the proportion (to obtain the upregulated pathways) and the other restricted to the negatively correlated genes (to identify the downregulated pathways).
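The idea of combining per-gene P values from several sources can be illustrated with Fisher's method, sketched below in Python. This is a simplified stand-in that assumes independent tests; it is not the ActivePathways implementation, and the function name and input values are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent P values (each > 0) with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose upper tail has a closed form for even
    degrees of freedom: exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical P values for one gene from two association tests.
example_combined = fisher_combine([0.05, 0.05])
```

Two moderately small P values combine into a smaller one, which is why a gene weakly associated with several sources can still contribute to a pathway enrichment.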

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.