Minimal methylation classifier (MIMIC): A novel method for derivation and rapid diagnostic detection of disease-associated DNA methylation signatures

Rapid and reliable detection of disease-associated DNA methylation patterns has major potential to advance molecular diagnostics and underpin research investigations. We describe the development and validation of minimal methylation classifier (MIMIC), combining CpG signature design from genome-wide datasets, multiplex-PCR and detection by single-base extension and MALDI-TOF mass spectrometry, in a novel method to assess multi-locus DNA methylation profiles within routine clinically-applicable assays. We illustrate the application of MIMIC to successfully identify the methylation-dependent diagnostic molecular subgroups of medulloblastoma (the most common malignant childhood brain tumour), using scant/low-quality samples remaining from the most recently completed pan-European medulloblastoma clinical trial, which were refractory to conventional genome-wide DNA methylation analysis. Using this approach, we identify critical DNA methylation patterns from previously inaccessible cohorts, and reveal novel survival differences between the medulloblastoma disease subgroups with significant potential for clinical exploitation.

We focussed assay development on medulloblastoma, the most common malignant brain tumour of childhood 12, where DNA methylation signatures have clear potential for use in routine clinical sub-classification 13,14. Medulloblastoma comprises four primary molecular subgroups (WNT, SHH, Grp3 and Grp4) defined by distinct methylomic, transcriptomic and genomic features 13-15. These subgroups display characteristic clinical features, drug targets and outcomes, and have significantly contributed to the 2016 World Health Organisation (WHO) classification of brain tumours 16. Following design and validation of a MIMIC assay for molecular subgrouping, we assessed its efficacy in limited archival tumour biopsies previously refractory to subgrouping using current research methods, taken from the pan-European HIT-SIOP-PNET4 medulloblastoma clinical trial (2000-2006) 17,18. This trial enrolled patients negative for all established clinico-molecular risk-factors (termed 'standard-risk (SR)' disease 12), a group for which there is an urgent unmet need to develop biomarker-driven treatment strategies.

Results
Derivation of minimal DNA methylation signatures to identify the four medulloblastoma molecular subgroups. We first identified a DNA methylation signature of 17 CpG loci, established detection methods and developed a Support Vector Machine (SVM) classification model for distinction of the four medulloblastoma molecular subgroups (WNT, SHH, Grp3 and Grp4; Fig. 1a). Non-negative matrix factorisation (NMF) consensus clustering 13,19,20 was used to identify subgroup membership of a training cohort comprising genome-scale Illumina 450k DNA methylation microarray data for 220 medulloblastomas (Fig. 1b; Fig. 2). The 50 most discriminatory CpG loci for each subgroup (i.e. 200 in total) were considered as signature candidates. These were triaged using (i) a 10-fold cross-validated classification fusion algorithm, (ii) a reiterative primer design process where amenability to primer design and multiplex bisulfite PCR was assessed in silico (Supplemental experimental methods), and (iii) in vitro PCR validation (Fig. 1a; Fig. 3). Candidate signature CpG loci were assayed using a novel application of the Agena iPLEX assay 21, whereby methylation-dependent SNPs representative of CpG methylation status were induced by initial treatment of DNA with sodium bisulfite 10, followed by multiplexed PCR and single base extension of probe oligonucleotides. The resultant products were quantified by MALDI-TOF MS (matrix-assisted laser desorption/ionization-time of flight mass spectrometry; Supplementary Fig. 1). The accuracy and precision of methylation estimates from multiplexed extension reactions were tested using incremental proportions of bisulfite-treated methylated:unmethylated DNA (Supplementary Fig. 2). Using these techniques, our optimal, multiply-redundant 17-CpG locus signature was generated. Finally, the training cohort was used to generate an optimised SVM classifier for the signature using 450k DNA methylation array data.
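The first triage step, shortlisting the most discriminatory CpG loci for each subgroup, can be illustrated with a minimal sketch. It ranks loci by the absolute difference in mean β-value between one subgroup and all other samples; this simple score is an assumption made for illustration and stands in for the moderated statistics used in the study. The function name, locus identifiers and toy values are hypothetical.

```python
from statistics import mean

def top_loci_per_subgroup(beta, labels, n_top=50):
    """Shortlist the n_top most discriminatory loci for each subgroup.

    beta:   dict mapping locus id -> list of beta values (one per sample)
    labels: list of subgroup labels, in the same sample order
    Score (illustrative assumption): |mean beta inside the subgroup
    minus mean beta in all other samples|.
    """
    candidates = {}
    for group in sorted(set(labels)):
        inside = [i for i, lab in enumerate(labels) if lab == group]
        outside = [i for i, lab in enumerate(labels) if lab != group]
        scores = {}
        for locus, values in beta.items():
            diff = mean(values[i] for i in inside) - mean(values[i] for i in outside)
            scores[locus] = abs(diff)
        ranked = sorted(scores, key=scores.get, reverse=True)
        candidates[group] = ranked[:n_top]
    return candidates

# Toy data: three hypothetical loci measured in two WNT and two SHH samples.
beta = {
    "cg_a": [0.9, 0.8, 0.1, 0.2],
    "cg_b": [0.5, 0.5, 0.5, 0.6],
    "cg_c": [0.2, 0.3, 0.7, 0.6],
}
labels = ["WNT", "WNT", "SHH", "SHH"]
selected = top_loci_per_subgroup(beta, labels, n_top=1)
```

With four subgroups and `n_top=50`, a shortlist of this kind would contain up to 200 candidates, matching the candidate pool size described above.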

Discussion
We have provided a blueprint for defining minimal, multiply-redundant disease-associated DNA methylation signatures from genome-wide datasets, and have developed MS-MIMIC as a validated assay for their assessment, including open-source classification tools for data interpretation (http://medulloblastomadiagnostics.ncl.ac.uk; Supplementary Fig. 6). Unlike research methodologies (e.g. Illumina 450k and MethylationEPIC arrays) which require batched assessments (≥8 samples per run), MS-MIMIC exploits detection technologies in common clinical use (MALDI-TOF) to enable rapid (<3 days from DNA extraction to result), low-cost (<$200 per sample in 2017), routine assessment in single or multiple samples. The assay format allows a flexible, modular approach, in which multiplex PCRs can be straightforwardly added or removed, offering the ability to adapt or extend panels to evolving clinical needs. Moreover, its low DNA input requirements and applicability to archival sample collections have the potential to unlock previously inaccessible molecular information from informative cohorts, as demonstrated for HIT-SIOP-PNET4. This assessment of DNA methylation signatures in the clinical setting holds rich promise for molecular sub-classification and prognostication across diverse diseases.

Methods
Cohorts and sample collection. Three cohorts (training, n = 220; validation, n = 106; and test, n = 153) were used in this study and are described in Supplementary Table 1. The training and validation cohorts comprised archival non-trial medulloblastoma DNA samples and included tumour samples provided by the UK Children's Cancer and Leukaemia Group (CCLG) biobank as part of CCLG-approved biological study BS-2007-04. The validation cohort consisted of samples with varying DNA quality, to assess assay performance. The test cohort included samples from the HIT-SIOP-PNET4 clinical trial (2001-2006) 17. All tumours assayed had a confirmed histopathological diagnosis of medulloblastoma, with a high tumour cell content. Informed consent was obtained from all participants and/or their legal guardians. All experiments were performed in accordance with relevant guidelines and regulations.
Identification of a minimal methylation signature for discrimination of medulloblastoma molecular subgroups. Non-negative matrix factorisation (NMF) consensus clustering 19 was used to identify the four recognised medulloblastoma consensus molecular subgroups 13 using 220 training cohort samples run on the Illumina 450k methylation microarray platform (Fig. 2). The 50 most differentially-methylated CpG loci for each subgroup were selected as potential signature candidates using limma 22. An iterative CpG locus selection algorithm was used to select signature candidates. To optimise signature loci redundancy in each level, up to 6 loci were repeatedly removed at random and classification performance evaluated. The 17 signature loci with the highest ranking in classification were identified (Fig. 3).
Signature loci were assayed using a novel application of the Agena iPLEX assay 21. In order to determine methylation status at each signature CpG locus, bisulfite treatment of DNA was used to induce methylation-dependent SNPs. These regions were amplified by multiplex PCR, followed by single base extension using mass-modified dideoxynucleotide terminators. MALDI-TOF mass spectrometry was then used to quantify the proportions of the induced-SNP alleles, from which methylation status was inferred.
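The inference of methylation status from induced-SNP allele proportions can be illustrated with a short sketch. It assumes that each locus yields two extension-product signals, one for the allele retained at methylated cytosines (read as C) and one for the converted allele (read as T), and that the methylated fraction is the C share of the combined signal. The function and parameter names are hypothetical, and the alleles actually read depend on the strand targeted by the extension primer.

```python
def methylation_fraction(peak_c, peak_t):
    """Estimate the methylated fraction at one CpG locus.

    peak_c: signal of the extension product for the unconverted (methylated) allele
    peak_t: signal of the extension product for the converted (unmethylated) allele
    Returns a value between 0 and 1, or None when neither product is detected.
    """
    total = peak_c + peak_t
    if total <= 0:
        return None  # locus failed: no usable signal
    return peak_c / total

# Illustrative signals in arbitrary units, not measured data.
example = methylation_fraction(30.0, 10.0)
failed = methylation_fraction(0.0, 0.0)
```

Returning `None` for a locus without signal keeps failed loci distinguishable from loci that are genuinely unmethylated, which matters when failed loci are counted for quality control.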
Primer design and validation. PCR and extension primers were designed for multiplex assessment of methylation in 17 signature loci (Supplemental experimental methods) across three multiplexes. Plex 1 contained an additional bisulfite conversion control, targeting an invariably unmethylated locus which undergoes complete conversion to uracil. The multiplexes were validated in vitro using a triplicate mixture series of control DNAs ranging from 0-100% methylation (Supplementary Fig. 2). The correlation between the input and estimated DNA methylation was calculated, and amplicons with a poor correlation were discarded and replaced with a new CpG locus as part of the iterative redesign process (Figs 1a and 3). All signature loci had good linear correlation (average correlation coefficient R² = 0.86).
Proportion of attempted samples (153/161; 95%) successfully subgrouped (107/153; 70%) or failing the assay due to failure to (1) bisulfite convert (9/153; 6%), (2) meet the QC criterion for CpG-locus specific failure (>6/17 CpG-locus fails; 24/153 samples; 16%), or (3) meet the probability threshold for classification confidence (13/153; 8%). (f) Subgroup assignment and probability estimates (dots) along with 95% confidence intervals (boxplots) for 120 standard-risk HIT-SIOP-PNET4 tumours after applying the confidence probability threshold (red line). (g) All samples which were CTNNB1 mutated (CTNNB1 mut; a well-established marker of WNT medulloblastoma) were assigned as WNT by MS-MIMIC. No non-WNT tumours were CTNNB1 mut. (h) Progression-free survival (PFS) Kaplan-Meier curves for MS-MIMIC derived subgroups reveal that standard-risk Grp4 medulloblastomas show a significantly worse disease outcome compared to other subgroups (p = 0.038, log-rank test). Numbers below the x-axis represent patients at risk of event.
Assay implementation. Where possible, 100 ng of DNA was bisulfite converted and purified using the Qiagen EpiTect Bisulfite kit, according to the manufacturer's protocol. To ensure that template was not too fragmented for analysis, a test bisulfite PCR, targeting a 200 bp amplicon, was performed (Supplementary Table 3). Reaction mixtures and thermal cycling parameters used to amplify the 17 signature loci in multiplex are shown in Supplementary Table 4. Successful amplification was confirmed by gel electrophoresis. The multiplex primer iPLEX extension assay was performed as previously described 23. Primer sequences for multiplex signature loci PCR and iPLEX extension PCR are shown in Supplementary Tables 5 and 6. Mass spectra for the multiplexes were acquired on a MALDI-TOF mass spectrometer (Voyager DE; PerSeptive Biosystems).
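The linearity check applied to the control DNA mixture series (input versus estimated methylation) can be sketched as follows. The sketch assumes that R² is the squared Pearson correlation between the known input percentage and the assay estimate; the text does not state how the coefficient was computed, and the function name and example values are hypothetical.

```python
from statistics import mean

def r_squared(expected, observed):
    """Squared Pearson correlation between input and estimated methylation."""
    mx, my = mean(expected), mean(observed)
    sxy = sum((x - mx) * (y - my) for x, y in zip(expected, observed))
    sxx = sum((x - mx) ** 2 for x in expected)
    syy = sum((y - my) ** 2 for y in observed)
    if sxx == 0 or syy == 0:
        raise ValueError("r_squared is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)

# Hypothetical mixture series: % methylated control DNA and assay estimates.
input_pct = [0, 25, 50, 75, 100]
estimated_pct = [5.0, 27.5, 50.0, 72.5, 95.0]  # perfectly linear but compressed
linearity = r_squared(input_pct, estimated_pct)
```

A locus can score well on this measure even when its estimates are compressed towards the middle of the range, as in the example, because correlation captures linearity rather than agreement in absolute value.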
Classifier design and validation. Two support vector machine (SVM) classifier models were created to assign subgroup and corresponding probability 24: one trained on 450k array data using the 10,000 most variably methylated CpG loci, the second with the 17-CpG locus signature from the training cohort (Fig. 1c). Subsequently, 101/106 validation cohort samples were used to assess MS-MIMIC concordance with 450k-derived data, at the level of molecular subgroup call (Fig. 1d) and estimates of methylation β-value (Fig. 1e). It was anticipated that when applying MS-MIMIC to poor-quality samples, certain loci would not be assessable. Using bootstrapped datasets (n = 10,000), a threshold of 6 was established for the maximum acceptable number of missing loci. Missing loci were imputed using expectation maximisation. Subgroup assignments using the MS-MIMIC classifier were compared against corresponding subgroup calls from the 450k classifier. A threshold for probability of assignment by the MS-MIMIC classifier was empirically set to 0.69, below which samples were non-classifiable (Supplementary Fig. 5).
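The two acceptance rules described above (no more than 6 missing loci, and an assignment probability of at least 0.69) can be combined into a simple decision step. The sketch below is illustrative only: it takes per-subgroup probabilities as given, so the SVM itself and the imputation of missing loci are out of scope, and the function name, return format and example values are hypothetical.

```python
MAX_MISSING_LOCI = 6    # from the text: more than 6 failed loci fails QC
MIN_PROBABILITY = 0.69  # from the text: below this, a sample is non-classifiable

def call_subgroup(locus_betas, probabilities):
    """Apply the QC and confidence thresholds to one sample.

    locus_betas:   dict locus id -> beta value, or None for a failed locus
    probabilities: dict subgroup -> classifier probability for this sample
    Returns (subgroup or None, reason).
    """
    missing = sum(1 for value in locus_betas.values() if value is None)
    if missing > MAX_MISSING_LOCI:
        return None, "too many failed loci"
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] < MIN_PROBABILITY:
        return None, "below probability threshold"
    return best, "classified"

# Hypothetical sample: 17 loci, 6 of which failed, with a confident WNT call.
betas = {f"locus_{i}": (None if i < 6 else 0.5) for i in range(17)}
probs = {"WNT": 0.90, "SHH": 0.05, "Grp3": 0.03, "Grp4": 0.02}
result = call_subgroup(betas, probs)
```

Checking the locus count before the probability means a sample with too few assessable loci is reported as a QC failure rather than as a low-confidence call, mirroring the separate failure categories reported for the trial cohort.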
Application to HIT-SIOP-PNET4 clinical trial cohort. Following successful assay development and validation, we applied MS-MIMIC to remnant, poor-quality archival biopsies from the HIT-SIOP-PNET4 clinical trial of risk-factor negative medulloblastoma (Fig. 4a, Supplementary Table 2). Only FFPE sections (n = 42/153 available tumour samples) and cytospin nuclear preparations (approximately 30,000 nuclei isolated and centrifuged onto microscope slides 18; n = 111/153) remained, whose DNA derivatives all fell below the quality and quantity thresholds (>200 ng double-stranded DNA (dsDNA)) required for methylation profiling using conventional research methods (Illumina 450k and MethylationEPIC arrays 14). We assessed differential survival of the MS-MIMIC subgroup assignments using log-rank tests.
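For readers unfamiliar with the log-rank test, a minimal two-group version is sketched below. It is a simplification: the study compares molecular subgroups and does not specify the exact grouping or software used, so the two-group form, the function name and the toy follow-up data are all assumptions made for illustration.

```python
from math import erfc, sqrt

def logrank_test(group_a, group_b):
    """Two-group log-rank test.

    Each group is a list of (time, event) pairs, where event is 1 for an
    observed progression and 0 for a censored observation.
    Returns (chi-square statistic, p-value on 1 degree of freedom).
    """
    event_times = sorted({t for t, e in group_a + group_b if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Numbers still at risk just before time t in each group.
        n_a = sum(1 for time, _ in group_a if time >= t)
        n_b = sum(1 for time, _ in group_b if time >= t)
        # Events occurring exactly at time t in each group.
        d_a = sum(1 for time, e in group_a if time == t and e == 1)
        d_b = sum(1 for time, e in group_b if time == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    statistic = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return statistic, erfc(sqrt(statistic / 2))

# Toy follow-up times in years (hypothetical, not trial data).
early = [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
late = [(6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
stat, p_value = logrank_test(early, late)
```

The test compares, at every event time, the number of events observed in one group with the number expected if both groups shared the same hazard, so it uses the whole follow-up period rather than survival at a single time point.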
Further technical details are provided in the Supplemental experimental methods.
Data availability. The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.