MethCORR modelling of methylomes from formalin-fixed paraffin-embedded tissue enables characterization and prognostication of colorectal cancer.

Transcriptional characterization and classification have the potential to resolve the inter-tumor heterogeneity of colorectal cancer and improve patient management. Yet robust transcriptional profiling is difficult in formalin-fixed, paraffin-embedded (FFPE) samples, which complicates testing in clinical and archival material. We present MethCORR, an approach that allows uniform molecular characterization and classification of fresh-frozen and FFPE samples. MethCORR identifies genome-wide correlations between RNA expression and DNA methylation in fresh-frozen samples. This information is then used to infer gene expression in FFPE samples from their methylation profiles. Here, MethCORR is applied to methylation profiles from 877 fresh-frozen/FFPE samples, and comparative analysis identifies the same two subtypes in four independent cohorts. Furthermore, subtype-specific prognostic biomarkers that predict relapse-free survival better (HR = 2.66, 95% CI [1.67–4.22], P value < 0.001 (log-rank test)) than UICC tumor, node, metastasis (TNM) staging and microsatellite instability status are identified and validated using DNA methylation-specific PCR. The MethCORR approach is general and may be similarly successful for other cancer types.

Software and code
Policy information about availability of computer code

Claus Lindbjerg Andersen, Jesper Bertram Bramsen
Mar 18, 2020

Data collection
The DNA methylation and RNA sequencing data sets from the CRC patient cohort TCGA COREAD were acquired from public databases, whereas the data sets from the CRC patient cohort SYSCOL were established by the authors through RNA sequencing and DNA methylation profiling of tissue biopsies taken from Danish individuals with CRC. The generated SYSCOL data sets are deposited at EGA for controlled access according to Danish law, as described in the "Data availability" section of the manuscript.

Data analysis
Processing of 450K/EPIC BeadChip methylation raw data: β-values for each CpG site were derived with the publicly available ChAMP R package, using the champ.import and champ.norm functions. Missing β-values were imputed using the R package impute.
Processing of RNA sequencing raw data: sequencing reads were mapped to the human genome assembly hg19 using the publicly available TopHat2 mapper (TopHat: v2.0.10), and fragments per kilobase of exon per million fragments mapped (FPKM) values were estimated for Ensembl genes using the publicly available software Cufflinks (Cufflinks: v2.2.1; Gencode v15 annotation w/o pseudogenes).
Correlations (Spearman) between RNA expression and DNA methylation were calculated using the publicly available R function "cor".
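As an illustration of the correlation step, the following is a minimal Python sketch of a Spearman correlation between one gene's expression and one CpG site's methylation across samples. It is not the authors' code (they used the R function "cor"); the function names and the toy values are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: expression (FPKM) and methylation (beta-values)
# for one gene/CpG pair measured in five samples.
expression = [1.2, 3.4, 2.2, 8.9, 5.0]
methylation = [0.91, 0.55, 0.60, 0.05, 0.30]
rho = spearman(expression, methylation)
```

In this toy example the methylation ranks are the exact reverse of the expression ranks, so the coefficient is -1, the pattern expected for a CpG whose methylation is inversely associated with expression.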
Calculations of MethCORR methylation scores (MCSs) from DNA methylation were performed using the formula provided in Figure 1b.
Modelling of RNA expression from MethCORR methylation scores (MCSs): the publicly available caret R package was used to perform linear regression modelling with 10x10-fold cross-validation and to provide R2, RMSE, and MAE measures.
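The cross-validation scheme can be sketched as follows, assuming that "10x10-fold" means 10-fold cross-validation repeated 10 times and that each gene's expression is regressed on a single MCS. This is a plain-Python stand-in rather than the caret workflow itself; R2 is computed here as the squared correlation between observed and predicted values, which is an assumption about the metric definition, and the data are simulated.

```python
import random
from math import sqrt

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def squared_correlation(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    soo = sum((o - mo) ** 2 for o in obs)
    spp = sum((p - mp) ** 2 for p in pred)
    return sop * sop / (soo * spp)

def repeated_kfold_cv(x, y, k=10, repeats=10, seed=0):
    """Average R2, RMSE and MAE over `repeats` runs of k-fold CV."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    r2s, rmses, maes = [], [], []
    for _ in range(repeats):
        rng.shuffle(idx)  # new random fold assignment for each repeat
        folds = [idx[f::k] for f in range(k)]
        for test in folds:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            slope, intercept = fit_line([x[i] for i in train],
                                        [y[i] for i in train])
            obs = [y[i] for i in test]
            pred = [slope * x[i] + intercept for i in test]
            err = [p - o for p, o in zip(pred, obs)]
            rmses.append(sqrt(sum(e * e for e in err) / len(err)))
            maes.append(sum(abs(e) for e in err) / len(err))
            r2s.append(squared_correlation(obs, pred))
    mean = lambda v: sum(v) / len(v)
    return {"R2": mean(r2s), "RMSE": mean(rmses), "MAE": mean(maes)}

# Simulated data (hypothetical): expression depends linearly on the MCS.
noise_rng = random.Random(1)
mcs = [i / 10 for i in range(50)]
expr = [2.0 * m + 1.0 + noise_rng.uniform(-0.1, 0.1) for m in mcs]
metrics = repeated_kfold_cv(mcs, expr)
```

Each of the 100 held-out folds is scored with a model fitted on the remaining samples only, so the averaged measures estimate out-of-sample performance.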
Correlations (Pearson's, Spearman's, and R2) between measured RNA expression and inferred RNA expression were calculated using the publicly available caret R package or the R function "cor".

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Establishment of the MethCORR map: MethCORR genes were clustered according to their overlap in expression-correlated CpGs using the publicly available software Cytoscape v3.2.0 and the publicly available application EnrichmentMap (Jaccard+Overlap filtering cutoff 0.126).
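To make the Jaccard+Overlap filtering concrete, the sketch below scores pairs of genes by the overlap of their expression-correlated CpG sets and keeps pairs above the 0.126 cutoff stated above. The equal weighting of the two coefficients is an assumption made for illustration, as are the gene names, CpG identifiers and function names; the actual computation was done in EnrichmentMap.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared elements divided by the size of the union."""
    return len(a & b) / len(a | b)

def overlap_coefficient(a, b):
    """Shared elements divided by the size of the smaller set."""
    return len(a & b) / min(len(a), len(b))

def combined_similarity(a, b, weight=0.5):
    """Weighted mean of the two coefficients (equal weights are assumed)."""
    return weight * jaccard(a, b) + (1 - weight) * overlap_coefficient(a, b)

def build_edges(gene_to_cpgs, cutoff=0.126):
    """Return {(gene1, gene2): similarity} for pairs at or above the cutoff."""
    edges = {}
    for g1, g2 in combinations(sorted(gene_to_cpgs), 2):
        score = combined_similarity(gene_to_cpgs[g1], gene_to_cpgs[g2])
        if score >= cutoff:
            edges[(g1, g2)] = score
    return edges

# Hypothetical genes and their expression-correlated CpG sets.
gene_to_cpgs = {
    "GENE_A": {"cg1", "cg2", "cg3", "cg4"},
    "GENE_B": {"cg3", "cg4", "cg5"},
    "GENE_C": {"cg9"},
}
edges = build_edges(gene_to_cpgs)
```

Here GENE_A and GENE_B share two CpGs (Jaccard 2/5, overlap 2/3) and are connected, whereas GENE_C shares none and remains unconnected. Connected genes would then be laid out as neighbouring nodes in the network.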
NMF: Non-negative matrix factorization (NMF) consensus clustering was performed using the publicly available R package NMF. The similarity of independent subtype predictions was analyzed using the publicly available GenePattern SubMap module (v3).
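The idea behind NMF-based clustering can be sketched as follows. This is a single factorization using multiplicative updates that minimize the squared (Frobenius) error, shown as a stand-in only: the R package NMF and its consensus procedure over many runs were used in the study, and their algorithm settings are not given here. The input matrix is assumed to be non-negative, and all values are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Sum of squared differences between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (features x samples) into non-negative W and H."""
    rng = random.Random(seed)
    n_features, n_samples = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_features)]
    h = [[rng.random() + 0.1 for _ in range(n_samples)] for _ in range(rank)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(n_iter):
        # Update H: H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[r][j] * num[r][j] / (den[r][j] + eps)
              for j in range(n_samples)] for r in range(rank)]
        # Update W: W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][r] * num[i][r] / (den[i][r] + eps)
              for r in range(rank)] for i in range(n_features)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors

def assign_clusters(h):
    """Assign each sample to the factor with the largest coefficient."""
    return [max(range(len(h)), key=lambda r: h[r][j]) for j in range(len(h[0]))]

# Hypothetical non-negative score matrix: 4 features x 6 samples.
scores = [[5, 5, 5, 0, 0, 0],
          [4, 4, 4, 0, 0, 0],
          [0, 0, 0, 3, 3, 3],
          [0, 0, 0, 6, 6, 6]]
w, h, errors = nmf(scores, rank=2)
labels = assign_clusters(h)
```

Consensus clustering repeats such factorizations from different random starts and records how often each pair of samples lands in the same cluster; that repetition is omitted here for brevity.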
CMS/CRIS classification: Consensus molecular subtype (CMS) classification was performed with the publicly available R-package CMSclassifier using the single sample method and nearest CMS as predicted subtype. CRC intrinsic subtype (CRIS) classification was performed using the publicly available R package CRISclassifier.
CIN scores: CIN scores were derived from copy number data extracted from the HM-450K/EPIC methylome BeadChips using the champ.CNA module of the publicly available ChAMP R-package.
Stroma/Immune scores were calculated using the publicly available R-package ESTIMATE using default parameters.
GSEA was performed using the publicly available GSEA 3.0 tool using default settings.
Gene list enrichment analysis was performed using the publicly available Enrichr software. eFORGE analysis was performed using the publicly available eFORGE software.

Sample size
No sample-size calculation was performed.

Data exclusions
In the MethCORR method development and validation phase we did not exclude any samples/data and included all samples with matching RNA sequencing and 450K DNA methylation data. During biological characterization and survival analysis we used only CRC TNM stage II-III samples with good clinical annotation and a minimum of 2 years of follow-up (as these are the most relevant for identification of prognostic biomarkers). To avoid confounders in the prognostic analyses we excluded patients diagnosed with synchronous cancers and patients who were diagnosed with another cancer within 3 years of the CRC diagnosis. Likewise, we excluded patients who were diagnosed with local recurrence during follow-up.
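The exclusion criteria for the survival analysis can be expressed as a simple filter. The sketch below is illustrative only: the record fields are hypothetical, the "good clinical annotation" criterion is not modelled, and the handling of boundary cases (exactly 2 years of follow-up, exactly 3 years to another cancer) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Patient:
    # All field names are hypothetical; they mirror the criteria in the text.
    patient_id: str
    tnm_stage: str                          # e.g. "I", "II", "III", "IV"
    followup_years: float
    synchronous_cancer: bool
    years_to_other_cancer: Optional[float]  # None if no other cancer
    local_recurrence: bool

def eligible_for_survival_analysis(p: Patient) -> bool:
    """Apply the stage, follow-up and confounder criteria described above."""
    if p.tnm_stage not in {"II", "III"}:
        return False
    if p.followup_years < 2:
        return False
    if p.synchronous_cancer:
        return False
    # Assumption: "within 3 years" covers both before and after diagnosis.
    if p.years_to_other_cancer is not None and abs(p.years_to_other_cancer) <= 3:
        return False
    if p.local_recurrence:
        return False
    return True

patients = [
    Patient("P1", "II", 5.0, False, None, False),   # kept
    Patient("P2", "IV", 5.0, False, None, False),   # wrong stage
    Patient("P3", "III", 1.5, False, None, False),  # follow-up too short
    Patient("P4", "II", 4.0, True, None, False),    # synchronous cancer
    Patient("P5", "III", 6.0, False, 2.0, False),   # other cancer within 3 y
    Patient("P6", "II", 3.0, False, None, True),    # local recurrence
    Patient("P7", "III", 8.0, False, 6.5, False),   # other cancer after 3 y
]
kept_ids = [p.patient_id for p in patients if eligible_for_survival_analysis(p)]
```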

Replication
In the MethCORR method development phase we used the publicly available normalized data for the TCGA COREAD cohort taken from the UCSC XENA database. The cohort was divided into a discovery part (4/5), which was used for development of the method, and a validation part (1/5), which was used for validation of the method. Furthermore, we used the independent SYSCOL cohort (profiled in this study) to replicate/validate the MethCORR method. Finally, we validated our analysis by analyzing the TCGA COREAD dataset provided by the NCI GDC database.
The existence of the CRC1 and CRC2 molecular CRC subtypes was replicated in independent cohorts. The prognostic value of the DNA methylation-based biomarkers was replicated in independent cohorts using the 450K DNA methylation data and validated using an orthogonal method, quantitative methylation-sensitive PCR (QMSP). QMSP was performed on a random subset of CRC samples for which sample DNA was still available.
Randomization
The division of the sample sets used for development of the MethCORR method (into discovery and validation subsets) was performed using stratified randomization to ensure a balanced distribution of the following variables: gender, age, UICC stage, microsatellite instability and recurrence status.
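One way to implement such a stratified 4/5 versus 1/5 split is sketched below. The exact procedure used in the study is not described, so this is an assumption-laden illustration: samples are grouped by the combination of stratification variables, shuffled within each group, and every fifth sample is assigned to validation. Field names, the binning of age into groups and the toy cohort are hypothetical.

```python
import random
from collections import defaultdict

STRATA_KEYS = ("gender", "age_group", "uicc_stage", "msi", "recurrence")

def stratified_split(samples, strata_keys=STRATA_KEYS, every=5, seed=0):
    """Split samples into discovery (about 4/5) and validation (about 1/5)
    sets while keeping each stratum's proportions similar in both."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in samples:
        strata[tuple(s[k] for k in strata_keys)].append(s)
    discovery, validation = [], []
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)  # random order within the stratum
        for i, s in enumerate(members):
            # Every `every`-th sample of the stratum goes to validation.
            if i % every == every - 1:
                validation.append(s)
            else:
                discovery.append(s)
    return discovery, validation

# Hypothetical cohort: two strata of ten samples each.
cohort = (
    [{"id": f"A{i}", "gender": "F", "age_group": "<70", "uicc_stage": "II",
      "msi": "MSS", "recurrence": "no"} for i in range(10)]
    + [{"id": f"B{i}", "gender": "M", "age_group": ">=70", "uicc_stage": "III",
        "msi": "MSI", "recurrence": "yes"} for i in range(10)]
)
discovery, validation = stratified_split(cohort)
```

With many stratification variables, some strata contain fewer than five samples and contribute nothing to validation under this simple rule; a real implementation would need to pool or otherwise handle such small strata.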
Blinding
The analysts generating the in-house data (450K methylation array data and QMSP data) were blinded to group allocation. Data generation was performed so as to avoid batch effects. We cannot account for the publicly available data. The NMF clustering analysis was performed unsupervised, i.e. blinded to group allocation. The discovery of the prognostic markers was performed unblinded, and the biomarkers were validated in independent patient cohorts.