DNA methylation patterns are altered in numerous diseases and often correlate with clinically relevant information such as disease subtypes, prognosis and drug response. With suitable assays and after validation in large cohorts, such associations can be exploited for clinical diagnostics and personalized treatment decisions. Here we describe the results of a community-wide benchmarking study comparing the performance of all widely used methods for DNA methylation analysis that are compatible with routine clinical use. We shipped 32 reference samples to 18 laboratories in seven different countries. Researchers in those laboratories collectively contributed 21 locus-specific assays for an average of 27 predefined genomic regions, as well as six global assays. We evaluated assay sensitivity on low-input samples and assessed the assays' ability to discriminate between cell types. Good agreement was observed across all tested methods, with amplicon bisulfite sequencing and bisulfite pyrosequencing showing the best all-round performance. Our technology comparison can inform the selection, optimization and use of DNA methylation assays in large-scale validation studies, biomarker development and clinical diagnostics.
DNA methylation is an epigenetic mark widely studied for its association with diseases such as cancer1 and autoimmune disorders2, with environmental exposures3 and with other biological phenomena4,5. Strong associations between DNA methylation patterns and clinical phenotypes can be used as biomarkers for diagnosing diseases and guiding treatment6,7. For example, DNA methylation biomarkers have been shown to support clinical decisions in various cancers8,9,10,11,12,13,14 and are also used for noninvasive prenatal testing15, for quality control of cultured cells16 and for forensic applications17,18.
DNA methylation biomarkers have several advantages that qualify them for broad use as in vitro diagnostics: (i) DNA methylation is cell-type-specific but robust toward transient perturbations, thus complementing static DNA-sequence-based biomarkers and volatile RNA-expression-based biomarkers. (ii) DNA methylation is a binary mark (i.e., for a single cell and allele, each CpG is either methylated or unmethylated), which facilitates reliable measurements on heterogeneous and degraded samples. (iii) The infrastructure for assaying DNA methylation biomarkers is already present in many clinical diagnostics laboratories, as the assays are similar to those used for DNA-sequence-based biomarkers. (iv) DNA methylation biomarkers are straightforward to integrate into routine clinical workflows because DNA is more stable than RNA and does not require any special handling. (v) DNA methylation patterns are faithfully retained during long-term storage as fresh-frozen or formalin-fixed, paraffin-embedded (FFPE) samples.
Genome-wide mapping and analysis of DNA methylation has become feasible for patient cohorts with thousands of samples19,20, and epigenome-wide association studies have been conducted for numerous biomedically relevant phenotypes21,22. To translate relevant epigenome associations into clinically useful biomarkers, it is necessary to select a manageable set of highly informative genomic regions, to target these loci with DNA methylation assays that are sufficiently fast, cheap, robust and widely available to be useful for routine clinical diagnostics23,24,25, and to confirm their predictive value in large validation cohorts.
Here we systematically compared and evaluated the most promising assays for measuring DNA methylation in large cohorts, clinical diagnostics and biomarker development. This multicenter study included research groups from seven countries across three continents, organized by the BLUEPRINT project26 in the context of the International Human Epigenome Consortium27 and as a follow-up to a previous comparison of genome-wide DNA methylation assays28,29,30. Overall, our results show that most assays provide high accuracy and robustness, although we observed some differences between assay types and laboratories. We provide detailed documentation of all contributed assays (Supplementary Data 1), such that this study can be used not only to guide assay selection but also as a resource of validated DNA methylation protocols.
Study design and assay selection
We selected assays based on comprehensive literature review, and for each promising assay we selected at least one research group that had extensive prior experience using that particular assay (Fig. 1a). In total, we invited 25 research groups, of which 19 agreed to participate. All participants received DNA aliquots for 32 reference samples, together with a list of 48 preselected genomic regions to be targeted. They designed the assays independently, analyzed the reference samples with their assays of choice and submitted the final results together with a detailed assay report for centralized benchmarking analysis by the study coordinator. Ultimately, 18 of the 19 participating research groups submitted complete analysis reports for a total of 27 assays (Table 1 and Supplementary Table 1). All contributed assays can be classified into one of three categories, which we summarize below.
First, absolute DNA methylation assays provide a quantitative measure of DNA methylation levels at single-CpG resolution. We included 16 absolute assays based on four technologies: (i) amplicon bisulfite sequencing (AmpliconBS) uses next-generation sequencing (NGS) of pooled PCR amplicons derived from bisulfite-converted DNA31,32,33. (ii) Enrichment bisulfite sequencing (EnrichmentBS) is similar to AmpliconBS in its use of bisulfite conversion and NGS, but it uses highly scalable techniques such as padlock probes or microdroplet-based amplification to enrich many genomic regions in parallel rather than relying on separate PCRs for each individual region34,35. (iii) Mass spectrometric analysis of DNA methylation (EpiTyper) combines bisulfite conversion, in vitro transcription and uracil-specific cleavage with mass-spectrometry-based quantification of fragment lengths36,37,38. (iv) Bisulfite pyrosequencing (Pyroseq) applies sequencing by synthesis39 to single PCR amplicons obtained from bisulfite-converted DNA40,41.
Second, relative DNA methylation assays measure DNA methylation by comparing samples to a suitable reference. This approach is mainly used for detecting methylated DNA fragments in an excess of unmethylated fragments, but it also provides rough estimates of absolute DNA methylation levels. We included five relative DNA methylation assays based on three alternative technologies: (v) MethyLight uses PCR amplification of bisulfite-converted DNA in combination with fluorescently labeled probes that hybridize specifically to a predefined DNA methylation pattern, typically that of fully methylated DNA42,43. (vi) Methylation-specific melting assays, including methylation-sensitive high-resolution melting (MS-HRM) and methylation-specific melting curve analysis (MS-MCA), apply melting curve analysis to amplicons obtained from bisulfite-converted DNA, which provides a semiquantitative measure of cytosines that have been converted to thymines44,45,46. (vii) Quantitative methylation-specific PCR (qMSP) uses DNA-methylation-specific primers in combination with real-time PCR to compare the prevalence of a specific DNA methylation pattern with that of a suitable reference47,48.
Third, global DNA methylation assays measure a sample's total DNA methylation content, which can be useful for measuring hypomethylation in cancer49 and the response to drugs that inhibit DNA methylation50. We included six global assays based on three alternative technologies: (viii) High-performance liquid chromatography followed by mass spectrometry (HPLC-MS) quantifies the amount of 5-methylcytosine based on its mass difference compared to unmethylated cytosine51. (ix) Immunoquantification of global DNA methylation (Immunoquant) uses a modified enzyme-linked immunosorbent assay (ELISA) with an antibody against 5-methylcytosine to quantify the total amount of methylated DNA in a given DNA sample52. (x) Bisulfite pyrosequencing of repetitive DNA elements (Pyroseq AluYb8/D4Z4/LINE/NBL2) applies pyrosequencing to amplicons obtained from bisulfite-converted DNA using primers that amplify multiple instances of the selected type of repeat53,54,55,56, which assumes that averaged local DNA methylation levels across specific repetitive regions correlate with global DNA methylation levels.
Given the study's focus on clinically applicable assays, we did not include emerging technologies that have not yet been shown to be practically useful in large-scale studies, e.g., nanopores, nanowire transistors, quantum dots, single-molecule real-time sequencing and atomic force spectroscopy57. We also did not include genome-wide assays such as whole-genome bisulfite sequencing, reduced-representation bisulfite sequencing, methylated DNA immunoprecipitation sequencing or methyl-CpG binding domain enriched sequencing, given that these assays have been benchmarked previously28,29,30, and are currently too cumbersome and expensive for routine clinical diagnostics. However, we did include the Infinium 450k assay58, which we also used to select the target regions, and we performed a limited amount of clonal bisulfite sequencing59,60, given that this assay was until recently considered the gold standard but has been largely superseded by less labor-intensive assays. Our benchmarking did not explicitly address non-CpG methylation nor DNA methylation variants (5hmC, 5fC and 5caC), but most of the included assays can be used to measure non-CpG methylation as well as CpG methylation, and they can also be adapted to distinguish between DNA methylation variants61,62,63. Finally, we note that all contributed locus-specific assays were bisulfite-based, although we had invited four research groups that had expertise in alternative technologies.
Reference samples and target regions
We prepared 32 reference samples that mimic typical applications of DNA methylation assays in biomedical research and clinical diagnostics (Supplementary Table 2). This sample set included DNA extracted from six pairs of primary colon tumor and adjacent normal colon tissue samples ('tumor/normal'), DNA from two cell lines before and after treatment with a demethylation-inducing drug ('drug/control'), a titration series with partially methylated DNA spiked into unmethylated DNA ('titration 1'), another titration series with DNA from a cancer cell line spiked into whole blood DNA ('titration 2'), and DNA from two matched pairs of fresh-frozen and FFPE xenograft tumors ('frozen/FFPE').
To establish suitable targets for the locus-specific assays, we performed genome-scale DNA methylation analysis with the Infinium 450k assay and selected 48 differentially methylated CpGs that cover a broad range of technical challenges encountered in biomarker development (Supplementary Table 3). For example, we included genomic regions with high and low CpG density, GC content and repetitive DNA overlap. As an additional challenge, we included a single-nucleotide polymorphism (SNP) that replaces a potentially methylated CpG by an always unmethylated TpG dinucleotide in some of the reference samples.
To each contributing laboratory we sent aliquots of ∼1 μg DNA for each of the 32 reference samples. In addition, we provided a standardized information package comprising general instructions, documentation templates and the list of the 48 target genomic regions (Supplementary Data 2). Each region had one designated target CpG for which the DNA methylation level was to be measured, and we asked the contributing research groups to return DNA methylation measurements for each of the reference samples. We gave no further instructions on how to design the assays or how to derive the DNA methylation measurements for the target CpG from the raw data. Moreover, we asked research groups not to exchange any information among each other, and they did not have access to the Infinium 450k data used for region selection.
We designated 16 of the 48 target regions as mandatory ('core regions') and let scientists in each contributing research group decide for themselves how many of the remaining 32 regions they would cover in addition to the core regions. On average, assay design was attempted for 30 genomic regions and was successful in 95% of cases (Supplementary Table 1). The known SNP at one of the target CpGs was detected and reported by 9 of the 17 research groups who contributed locus-specific assays. We removed the SNP-containing region from further analysis to avoid bias, but we emphasize the importance of double-checking for known SNPs during assay design. In total, scientists from 18 laboratories contributed 16 absolute, 5 relative and 6 global assays (Supplementary Fig. 1), giving rise to a benchmarking data set with 16,435 locus-specific and 192 global DNA methylation measurements (Supplementary Data 3).
Performance of absolute DNA methylation assays
All absolute assays detected the expected bimodal pattern of DNA methylation, with most regions being either highly or lowly methylated (Fig. 1b). NGS-based assays (i.e., AmpliconBS and EnrichmentBS) reported extreme values of 0% and 100% more frequently than the other assays, which can be explained by their digital counting of methylated and unmethylated cytosines. The distribution plots confirmed the expected differences among the 32 reference samples (Fig. 1b), with higher DNA methylation levels in colon tumors than in matched normal tissue in the target regions, lower DNA methylation in the drug-treated leukemia cell lines, decreasing DNA methylation with decreasing concentrations of in vitro methylated DNA (titration 1) and cancer cell line DNA (titration 2), and similar DNA methylation levels for DNA extracted from fresh-frozen vs. FFPE xenografts. These plots also illustrate the broad range of different DNA methylation distributions among the selected target regions (Fig. 1b).
To assess global similarity among the absolute DNA methylation assays, we calculated Pearson correlation coefficients across all measurements for each pair of assays (Supplementary Fig. 2) and performed hierarchical clustering (Fig. 1c). 85% of between-assay comparisons resulted in correlations above 0.8, and 46% even exceeded 0.9, indicating an excellent overall agreement between many of the tested assays. Correlations were high for technical replicates in the same laboratory (Pyroseq 1 vs. Pyroseq 1 (replicate): r = 0.996), the same technology between laboratories (e.g., Pyroseq 1 vs. Pyroseq 2: r = 0.98) and between assays of different types (e.g., Pyroseq 1 vs. EpiTyper 3: r = 0.95; Pyroseq 1 vs. AmpliconBS 1: r = 0.97). However, not all assays agreed equally well (Fig. 1d and Supplementary Fig. 2). For instance, the Infinium assay reported higher DNA methylation levels for CpGs that other assays identified as lowly methylated, while reporting slightly reduced DNA methylation levels for highly methylated CpGs; and the EnrichmentBS 1 assay gave rise to a substantial number of outliers when compared to any of the other assays.
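The pairwise comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measurement values are hypothetical, missing measurements are represented as None and skipped pairwise, and the hierarchical clustering step is not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None  # not enough shared measurements
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for constant input
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(assays):
    """Map each unordered pair of assay names to its Pearson correlation."""
    return {
        (a, b): pearson(assays[a], assays[b])
        for a, b in combinations(sorted(assays), 2)
    }


# Hypothetical measurements (% methylation); each list position is the same
# sample/region combination across assays, None marks a missing measurement.
assays = {
    "Pyroseq 1": [2.0, 95.0, 48.0, 10.0, None],
    "AmpliconBS 1": [0.0, 100.0, 50.0, 12.0, 77.0],
    "EpiTyper 3": [5.0, 90.0, None, 15.0, 70.0],
}
correlations = pairwise_correlations(assays)

# Share of between-assay comparisons with a correlation above 0.8.
share_above_08 = sum(
    r is not None and r > 0.8 for r in correlations.values()
) / len(correlations)
```

Restricting each comparison to the measurements shared by both assays is one possible way to handle assays that covered different subsets of regions; the study's own handling of missing values may differ.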
To quantify the accuracy of individual assays, a reference is needed against which to evaluate the measurements. Synthesized DNA with predefined DNA methylation patterns would be one option, but this is currently feasible only for fully methylated DNA spiked into fully unmethylated DNA, thus ignoring the challenges posed by heterogeneous DNA methylation patterns64. For this reason, we chose two alternative approaches for quantifying assay performance in the presence of epigenetic heterogeneity.
First, we combined data from several assays into high-confidence consensus estimates to establish target DNA methylation levels for the reference samples (Fig. 2a and Supplementary Fig. 3a). For each sample and genomic region, we identified the smallest interval comprising measurements by at least three of the five technologies (AmpliconBS, EnrichmentBS, EpiTyper, Infinium and Pyroseq), which minimizes the impact of outliers and technology-specific artifacts. Moreover, we extended these intervals with flanking windows of five percentage points on either side to account for small deviations (Fig. 2a). We used the resulting 'consensus corridor' as a surrogate for the true DNA methylation level (which is unknown) of each target CpG in each reference sample. All assays contributed to the consensus corridor (Supplementary Fig. 3b,c), and sensitivity analysis confirmed that the ranking of assay performance was robust to the exact definition of the consensus corridor (Supplementary Note and Supplementary Fig. 4).
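As a rough illustration of the corridor construction, the following Python sketch searches for the narrowest interval that contains measurements from at least three technologies and then widens it by five percentage points on each side. The function name, the treatment of several assays per technology (a technology counts once if any of its assays falls inside the interval) and the clipping to the 0 to 100% range are assumptions made for this example, not details taken from the study.

```python
def consensus_corridor(measurements, min_technologies=3, flank=5.0):
    """Return (low, high) of the corridor, or None if too few technologies.

    measurements: list of (technology, value) tuples for one sample and
    one genomic region, with values given in percent methylation.
    """
    points = sorted(measurements, key=lambda m: m[1])
    best = None
    for i in range(len(points)):
        seen = set()
        for j in range(i, len(points)):
            seen.add(points[j][0])
            if len(seen) >= min_technologies:
                width = points[j][1] - points[i][1]
                if best is None or width < best[1] - best[0]:
                    best = (points[i][1], points[j][1])
                break  # extending further from this start only widens it
    if best is None:
        return None
    # Assumption: keep the flanked corridor inside the valid 0-100% range.
    return (max(0.0, best[0] - flank), min(100.0, best[1] + flank))


# Hypothetical measurements for a single sample/region.
example = [
    ("AmpliconBS", 10.0),
    ("AmpliconBS", 11.0),
    ("Pyroseq", 12.0),
    ("EpiTyper", 15.0),
    ("Infinium", 40.0),
    ("EnrichmentBS", 80.0),
]
corridor = consensus_corridor(example)  # narrowest interval is 11-15
```

In this toy example the two outlying measurements (40 and 80) do not influence the corridor, which is the property the three-of-five rule is meant to provide.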
Evaluating each assay against this corridor (Fig. 2b and Supplementary Table 1), we observed the lowest mean absolute deviation for Pyroseq 1 and its replicate (1.1 and 1.2, respectively), closely followed by AmpliconBS 1 (1.6), Pyroseq 2 (2.5), AmpliconBS 2 (2.8) and Pyroseq 5 (3.3). The highest mean absolute deviation was observed for EpiTyper 2 and EnrichmentBS 1 (6.8 and 11.4, respectively). We also assessed the bias of each assay, which we defined as the directional (rather than absolute) deviation from the consensus corridor. For the Infinium assay we observed an overall tendency to overestimate DNA methylation levels, whereas all Pyroseq assays tended to underestimate DNA methylation levels. For AmpliconBS, EnrichmentBS and EpiTyper, the average direction of the deviation depended on the laboratory.
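The two summary statistics can be illustrated with a short Python sketch. It assumes that a measurement inside the corridor has a deviation of zero and that a measurement outside it deviates by its signed distance to the nearest corridor boundary; this definition, the function names and the example values are assumptions for illustration.

```python
def signed_deviation(value, corridor):
    """Signed distance from a measurement to the corridor (0 if inside)."""
    low, high = corridor
    if value < low:
        return value - low  # negative: below the corridor
    if value > high:
        return value - high  # positive: above the corridor
    return 0.0


def summarize_assay(values, corridors):
    """Return (mean absolute deviation, bias) for one assay.

    values and corridors are aligned lists; None marks a missing value.
    """
    deviations = [
        signed_deviation(v, c)
        for v, c in zip(values, corridors)
        if v is not None
    ]
    if not deviations:
        raise ValueError("assay has no measurements")
    n = len(deviations)
    mean_abs_deviation = sum(abs(d) for d in deviations) / n
    bias = sum(deviations) / n  # directional: the sign is kept
    return mean_abs_deviation, bias


# Hypothetical measurements and corridors in percentage points.
values = [25.0, 2.0, 10.0, None]
corridors = [(5.0, 20.0), (5.0, 20.0), (5.0, 20.0), (0.0, 10.0)]
mean_abs_deviation, bias = summarize_assay(values, corridors)
```

A positive bias then corresponds to overestimation relative to the consensus and a negative bias to underestimation, while the mean absolute deviation ignores the direction.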
Furthermore, to understand which properties make genomic regions difficult to measure, we fitted a linear model that predicts the deviation from the consensus based on each region's estimated DNA methylation level, GC content, CpG observed vs. expected ratio and content of repetitive DNA (Supplementary Fig. 5). Four assays (AmpliconBS 4, EnrichmentBS 1, Pyroseq 4 and Pyroseq 5) showed significantly increased deviation in highly methylated regions, whereas the Infinium assay was comparably more accurate in highly methylated regions. GC content, CpG density and repetitive DNA also affected the deviation in some cases, but the best-performing assays did not show any significant biases (Supplementary Fig. 5). Finally, we compared assay performance between matched fresh-frozen and FFPE samples (Supplementary Fig. 6). We obtained highly similar results, showing that all tested assays are compatible with DNA from FFPE material.
Second, as a complementary approach to consensus corridors, we assessed the performance of each assay using two titration series with known ratios. The titration samples included heterogeneous DNA methylation patterns, which make them more challenging than titrations of fully methylated DNA used for assay calibration. For the first titration series, we created partially and heterogeneously methylated DNA by incomplete in vitro methylation (less than 20% methylated cytosines) and combined it with unmethylated DNA at ratios of 100%, 75%, 50%, 10%, 1% and 0%. The second titration series mimics the diagnostic task of detecting hypermethylated cancer DNA against a background of blood-derived DNA. To that end, we spiked DNA from a colon cancer cell line (HCT-116) at ratios of 100%, 10%, 1%, 0.1%, 0.01% and 0% into DNA extracted from whole blood. We then fitted linear models to the DNA methylation measurements (Fig. 2c and Supplementary Fig. 7) to assess consistency with the known titration ratios. The agreement was high for most assays and for three alternative metrics (Fig. 2d and Supplementary Table 1). Best results were achieved by AmpliconBS 1 with median Pearson correlation coefficients of 0.99 and 0.93 in the two titration series (Fig. 2c). By contrast, EnrichmentBS 2 is an example of an assay that showed more variability, with correlation coefficients of 0.77 and 0.87 (Fig. 2c). The results for the titration series were in good agreement with the assay performance in the consensus-based validation (Fig. 2b), with AmpliconBS 1, AmpliconBS 2, Pyroseq 1 and Pyroseq 3 being among the best in both analyses.
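For a single genomic region, the comparison of measurements with the known titration ratios can be sketched as an ordinary least-squares fit. The code below is a minimal illustration: the measured values are hypothetical, and the three metrics actually used in the study are not reproduced here.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares fit y = slope * x + intercept, plus Pearson r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0:
        raise ValueError("titration ratios must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy) if syy > 0 else None  # undefined if y is flat
    return slope, intercept, r


# Known ratios of titration 1 (percent partially methylated DNA).
ratios = [100.0, 75.0, 50.0, 10.0, 1.0, 0.0]
# Hypothetical measurements for one region and one assay (% methylation).
measured = [58.0, 47.0, 31.0, 8.0, 2.0, 1.0]

slope, intercept, r = fit_line(ratios, measured)
```

Repeating this per region and taking the median of r across regions would give a per-assay summary comparable in spirit to the median correlation coefficients quoted above.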
Finally, we assessed how clonal bisulfite sequencing59,60 would fare in our benchmarking, given that it was previously considered the gold standard for locus-specific DNA methylation mapping. At a target coverage of 10−20 Sanger sequencing clones, fully unmethylated and fully methylated CpGs gave rise to consistent measurements between replicates, but regions with intermediate DNA methylation levels agreed less well (Supplementary Fig. 8a). Diverging measurements appeared to be caused by random noise resulting from sequencing few clones, and both replicates clustered similarly well with other assays (Pearson correlation above 0.9 for all but one assay; Supplementary Fig. 8b). We did not observe any directional deviation from the consensus corridor (Supplementary Fig. 8c), and Pearson correlation coefficients in comparison to other assays were in the range of 0.7 to 0.9 (Supplementary Fig. 8d). Overall, clonal bisulfite sequencing performed reasonably well in our analysis but did not reach the accuracy and reproducibility of the top-ranking assays.
Performance of relative DNA methylation assays
Relative DNA methylation assays detect DNA molecules with a predefined DNA methylation pattern, e.g., identifying fully methylated, tumor-derived DNA fragments in an excess of blood DNA. This approach is less suited for measuring quantitative DNA methylation levels at single-CpG resolution, which prompted two of the research groups contributing relative assays to report their measurements as ranges (e.g., 0.1–1%, 1–10%, or 10–100%). Indeed, the observed correlations were generally lower for relative assays than for absolute assays, ranging from 0.5 to 0.7 in most comparisons (Supplementary Fig. 9).
To benchmark the relative assays in a way that accounts for their strengths and characteristics, we assessed their ability to detect differences in DNA methylation between pairs of samples. For each assay and each pairwise comparison we discretized the measurements into three categories ('+', higher DNA methylation in first sample; '−', lower DNA methylation in first sample and '=', no detectable difference) and calculated the agreement between the different assays (Fig. 3a). Percent concordance values were high for closely related assays (68% for MS-HRM vs. MS-MCA and 75% for qMSP preamp vs. qMSP standard), but substantially lower between technologies (Fig. 3b). Furthermore, when we compared these results with the concordance values observed for absolute assays using suitable thresholds (Online Methods), the relative assays were as similar to some of the absolute assays as they were to each other, but they did not reach the high concordance (90% and above) observed among the best-performing absolute assays (Fig. 3b).
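The discretization and concordance calculation can be illustrated as follows. This is a sketch under assumptions: the threshold of five percentage points is a placeholder (the thresholds actually used are given in the Online Methods), the function names are hypothetical, and an ASCII hyphen stands in for the minus sign.

```python
def classify_difference(first, second, threshold=5.0):
    """Discretize the difference between two samples into '+', '-' or '='."""
    diff = first - second
    if diff > threshold:
        return "+"  # higher DNA methylation in the first sample
    if diff < -threshold:
        return "-"  # lower DNA methylation in the first sample
    return "="  # no detectable difference


def percent_concordance(calls_a, calls_b):
    """Percentage of comparisons on which two assays make the same call."""
    pairs = [
        (a, b) for a, b in zip(calls_a, calls_b)
        if a is not None and b is not None
    ]
    if not pairs:
        return None
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical measurements for four pairwise sample comparisons.
assay_a = [(60.0, 20.0), (10.0, 12.0), (5.0, 40.0), (80.0, 30.0)]
assay_b = [(55.0, 30.0), (30.0, 10.0), (8.0, 35.0), (50.0, 49.0)]

calls_a = [classify_difference(x, y) for x, y in assay_a]
calls_b = [classify_difference(x, y) for x, y in assay_b]
concordance = percent_concordance(calls_a, calls_b)
```

Working on discretized calls rather than raw values makes assays that report ranges or semiquantitative values comparable with assays that report percentages.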
We also assessed the discriminatory power of the relative assays for DNA methylation differences identified by the consensus corridor, and for the known ratios in the two titration series (Fig. 3c). In these analyses, all relative assays accurately detected DNA methylation differences that exceeded 25%, whereas the performance for smaller differences varied between assays. MethyLight and qMSP detected many small differences with the expected direction, but reported the opposite direction for a considerable number of genomic regions. By contrast, MS-MCA and MS-HRM were more conservative and yielded very few mistakes regarding direction, but at the cost of detecting fewer differences.
Finally, we asked how well the relative DNA methylation assays captured quantitative differences in DNA methylation between samples. To that end, we took the quantitative differences reported by the relative assays for regions that were correctly classified and plotted them against the difference in consensus corridor estimates (Fig. 3d). The differences in the consensus corridor were most accurately recapitulated by the MethyLight assay. By contrast, the measurements of the other relative assays did not correlate well with the difference obtained from the consensus corridor, supporting the notion that MS-MCA, MS-HRM and qMSP should only be used for the type of qualitative comparisons that they were originally developed for.
Finally, given that sample material is scarce in some applications of DNA methylation biomarkers (e.g., needle biopsies and blood-plasma-derived DNA), we also analyzed the effect of input DNA amounts on assay performance. This analysis confirmed that DNA amounts were not limiting the assay performance in the main part of our comparison, but only the AmpliconBS and Pyroseq technologies were able to cope with severely reduced amounts and/or high fragmentation of input DNA (Supplementary Note and Supplementary Figs. 10, 11, 12).
Performance of global DNA methylation assays
Global DNA methylation assays report a single measurement value for each sample, indicative of its total DNA methylation content (Fig. 4a). For HPLC-MS, the results were generally consistent with expectations, showing global hypomethylation for the tumor samples (as opposed to locus-specific hypermethylation in the target regions of the absolute and relative assays) and for the drug-treated cell lines (Fig. 4a), similar values for fresh-frozen and FFPE samples from the same xenograft, and gradually decreasing DNA methylation from left to right in the two titration series (with relatively small differences and one strong outlier). The data for the Immunoquant assay did not agree well with the characteristics of the reference samples. Finally, the pyrosequencing assays, which focused on different types of repetitive elements (AluYb8, D4Z4, LINE1 and NBL2), detected the drug-induced demethylation and, to a lesser extent, also the tumor hypomethylation (LINE1 and NBL2) and the differences in the titration series (NBL2).
With correlations of 0.37 to 0.82 between the three technologies (Fig. 4b), there was less agreement among the global DNA methylation assays than we had observed for the locus-specific DNA methylation assays. This result prompted us to explore whether global DNA methylation levels could be inferred from locus-specific data, as a potential alternative to measuring them with global assays. We defined the 'global target' as the outlier-corrected mean of the two best-performing global assays (HPLC-MS and Pyroseq NBL2), and we tested several approaches for predicting the sample-specific global target values from the locus-specific data. Averaging across locus-specific measurements did not provide an accurate prediction (correlations of 0.37 to 0.77, Fig. 4b), likely because the target regions were enriched for regulatory elements with different DNA methylation dynamics compared to the bulk of the genome. By contrast, machine learning methods such as the generalized linear model, support vector regression and random forest regression compensated for these differences and predicted the global target values much more accurately (Fig. 4c,d). These results suggest that locus-specific assays in combination with statistical methods can be used to detect sample-specific differences in global DNA methylation (Fig. 4e).
Accuracy and robustness as epigenetic biomarkers
Epigenetic biomarker development is an important application of DNA methylation assays, requiring robust discrimination between cell types or disease states. We observed good separation between the different cell types using unsupervised methods (Supplementary Fig. 13), and we sought to quantify the assays' discriminatory power by supervised analysis focusing on the colon tumor and adjacent normal samples (Fig. 5). To that end, we trained support vector machines to distinguish between tumor and normal samples based on the data of each assay. We used four tumor-normal pairs for training, and evaluated the prediction performance on test sets consisting of the two remaining pairs, constituting a threefold cross-validation. Receiver operating characteristic (ROC) curves show excellent prediction performance for most assays (Fig. 5a and Supplementary Fig. 14a), which is not unexpected because DNA methylation patterns are known to be different between colon tumor and adjacent normal tissue, and because we selected several target regions based on their differential DNA methylation in colon cancer.
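The pair-wise structure of this cross-validation can be made explicit with a small Python sketch. It only shows how six tumor-normal pairs could be split into three folds so that the tumor and normal sample of a pair are never separated; how pairs were actually assigned to folds is not stated here, the pair identifiers are hypothetical, and the classifier training itself is not shown.

```python
def paired_folds(pair_ids, n_folds=3):
    """Yield (train_pairs, test_pairs) with whole pairs kept together."""
    # Assumption: pairs are assigned to folds in round-robin order.
    folds = [pair_ids[i::n_folds] for i in range(n_folds)]
    for k, test_pairs in enumerate(folds):
        train_pairs = [
            pair
            for j, fold in enumerate(folds)
            if j != k
            for pair in fold
        ]
        yield train_pairs, test_pairs


# Hypothetical identifiers for the six tumor-normal pairs.
pairs = ["pair1", "pair2", "pair3", "pair4", "pair5", "pair6"]
splits = list(paired_folds(pairs))

for train_pairs, test_pairs in splits:
    # Each pair contributes one tumor and one normal sample, so every
    # split trains on 8 samples and tests on 4.
    n_train_samples = 2 * len(train_pairs)
    n_test_samples = 2 * len(test_pairs)
```

Keeping both samples of a pair in the same fold avoids evaluating a classifier on a patient whose matched sample was seen during training.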
To simulate the complications of working with clinical samples of varying quality, we added noise to the data and assessed how the prediction performance was affected. Two types of noise were introduced (Online Methods): erroneous measurements were simulated by randomly replacing a fraction of DNA methylation measurements with other measurements (random error), and inaccurate measurements were simulated by adding random noise to each measurement (uniform noise) (Fig. 5a,b). Using the ROC area under curve to assess performance, we found that low noise levels were well tolerated by all assays, but at higher noise levels some of the absolute assays (AmpliconBS 1, AmpliconBS 2, EpiTyper 1, EpiTyper 2 and Pyroseq 1) outperformed the other assays.
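The two noise models can be sketched as simple perturbation functions. The exact procedure is described in the Online Methods; the version below is an assumption-laden illustration in which random error replaces a fixed fraction of measurements with values drawn from the same data set, uniform noise adds a value drawn from a symmetric interval, and results are clipped to the 0 to 100% range.

```python
import random


def add_random_error(values, fraction, rng):
    """Replace a random fraction of measurements with other measurements."""
    noisy = list(values)
    n_replace = round(fraction * len(noisy))
    for idx in rng.sample(range(len(noisy)), n_replace):
        # Assumption: replacements are drawn from the same set of values.
        noisy[idx] = rng.choice(values)
    return noisy


def add_uniform_noise(values, amplitude, rng):
    """Add noise drawn uniformly from [-amplitude, +amplitude] to each value."""
    return [
        # Assumption: keep perturbed values inside the 0-100% range.
        min(100.0, max(0.0, v + rng.uniform(-amplitude, amplitude)))
        for v in values
    ]


rng = random.Random(0)  # fixed seed so that the simulation is repeatable
# Hypothetical DNA methylation measurements in percent.
values = [0.0, 5.0, 12.0, 30.0, 48.0, 55.0, 70.0, 88.0, 95.0, 100.0]

with_errors = add_random_error(values, fraction=0.2, rng=rng)
with_noise = add_uniform_noise(values, amplitude=25.0, rng=rng)
```

Applying these functions at increasing noise levels before training and testing a classifier gives the kind of robustness curve summarized by the ROC area under curve.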
We also assessed the effect of reducing the number of genomic regions contributing to the analysis (Supplementary Fig. 14b). When we trained and evaluated each classifier on the one, three or five most discriminatory genomic regions at a constant level of 25% uniform noise (Fig. 5c,d), the prediction accuracy remained high for most assays (in some cases it even increased because the removal of less informative regions reduced noise in the data set). However, when we repeated the analysis with randomly sampled regions (Fig. 5c,d), we observed major differences: for many of those assays that performed well at high noise levels (Fig. 5b), the drop in performance was low, and satisfactory results were achieved using just three randomly selected regions. By contrast, other assays (in particular the relative DNA methylation assays) performed well only with combined data for many different genomic regions.
Finally, we selected two assays (EpiTyper 3 and Infinium) and tested their predictive performance on a much larger cohort comprising 160 prostate tumor and 8 normal prostate samples. The concordance between the two assays was high for nine of ten tested CpGs, with Pearson correlations of 0.75 to 0.93, whereas one CpG showed little biological variation within the cohort and a low correlation coefficient (Supplementary Fig. 15a). When we trained and evaluated support vector machines for distinguishing between tumor and normal samples, we observed higher accuracy using the EpiTyper data than using the Infinium data, indicating that the locus-specific assay outperforms the Infinium assay in terms of accuracy and discriminatory power (Supplementary Fig. 15b).
We conducted a multicenter benchmarking study evaluating all DNA methylation assays that are strong candidates for clinical use. Most assays proved to be accurate and reproducible. The results also agreed well between laboratories and between technologies, which is notable because assay design (e.g., selection of primer sites and protocol parameters), execution (e.g., bisulfite conversion and sequencing) and data processing (e.g., normalization and quality control) were conducted independently by members of each contributing laboratory. Overall, our study demonstrates that locus-specific DNA methylation assays can be considered a mature technology ready for widespread use in biomarker development and clinical applications.
Despite generally consistent results, we observed characteristic strengths and weaknesses of the tested assays. The relative assays were generally less accurate and less concordant with each other than the absolute assays. This observation is not unexpected given that relative assays work best for detecting fully methylated regions, whereas many of the selected target regions were heterogeneously methylated. Despite their lower quantitative accuracy, the relative assays distinguished robustly between methylated and unmethylated regions, and they discriminated well between tumor and normal samples. Among the global assays, the HPLC-MS measurements most accurately reflected the expected differences in global DNA methylation levels, whereas the Immunoquant assay did not provide reliable results. Bisulfite pyrosequencing of repetitive DNA gave rise to highly reproducible results, but these repetitive DNA methylation levels did not correlate well with the expected differences in global DNA methylation. By contrast, good results were obtained when predicting global DNA methylation from locus-specific measurements, which may become a viable alternative to measuring global DNA methylation directly.
To capture not only the quantitative performance but also other relevant aspects of each assay, members of the contributing laboratories wrote detailed reports (Supplementary Data 1). These reports include protocol descriptions, comments on the practical strengths and limitations of each assay, and detailed time and cost calculations for running the assays in the respective laboratories. Drawing upon the cumulative experiences of our study, we arrive at the following conclusions and recommendations.
(i) Absolute DNA methylation assays are the method of choice when validating DNA methylation differences in large cohorts, and they are also an excellent technology for developing epigenetic biomarkers. (ii) Relative DNA methylation assays are not a good replacement for absolute assays. However, experiences of scientists in the contributing laboratories suggest that carefully selected, designed and validated relative assays can cost-effectively detect minimal traces of methylated DNA against an excess of unmethylated DNA. (iii) Global DNA methylation assays suffer from noisy data and divergent results between technologies. Locus-specific assays (possibly combined with prediction) provide a more robust alternative. (iv) Among the absolute DNA methylation assays, AmpliconBS and Pyroseq showed the best all-round performance, closely followed by EpiTyper. AmpliconBS is the best choice for assaying dozens of genomic regions in parallel, EpiTyper provides the highest sample throughput, and Pyroseq can work well even on minute amounts of highly fragmented DNA. (v) EnrichmentBS and Infinium can measure many more CpGs simultaneously than the other tested assays, but this comes at the cost of lower accuracy and higher cost per sample. (vi) Clonal bisulfite sequencing suffers from a high level of technical noise when sequencing 10–20 clones per sample. Given its high labor intensity and the availability of alternative assays with equal or better performance (as demonstrated in this study), clonal bisulfite sequencing is not recommended for large-scale validation and biomarker development.
We conclude that the accuracy and robustness, discriminatory power, cost structure and practical feasibility of current DNA methylation assays are sufficient for large-scale validation studies and epigenetic biomarker development. We expect that DNA methylation assays will become widely useful for clinical diagnostics and personalized therapies, as companion diagnostics of targeted drugs, in forensic testing of tissue types and in many other applications. Our study may serve as a starting point for broader standardization efforts involving academic and clinical laboratories as well as the commercial sector and regulatory agencies, to fully embrace the potential of DNA methylation biomarkers for precision medicine.
Preparation of reference DNA samples.
Six pairs of fresh-frozen colon tumor and adjacent normal colon tissue samples were obtained from the IDIBELL Tissue Biobank following approval by the corresponding ethics committee. For DNA extraction, frozen tumors were incubated overnight at 37 °C with DNA lysis buffer (10 mM Tris pH 8.0, 100 mM NaCl, 10 mM EDTA pH 8.0, 10% SDS) and proteinase K, and phenol-chloroform extraction was carried out.
KG1 (ref. 65) and KG1a (ref. 66) leukemia cell lines were obtained from the German Collection of Microorganisms and Cell Cultures (DSMZ). The cells were cultured in RPMI medium (Sigma-Aldrich) supplemented with 15% FCS (PAA) and 1% penicillin/streptomycin (Sigma-Aldrich) at 37 °C and 5% CO2. DNA demethylation was induced by treatment with 5-aza-2′-deoxycytidine (Sigma-Aldrich) at the half-maximal inhibitory concentration (IC50) determined by MTT viability tests for each cell line (80 nM for KG1 and 300 nM for KG1a). Stocks were prepared fresh for each experiment in 0.1% phosphate-buffered saline and kept at −20 °C until use. KG1 and KG1a cell lines were seeded in 75 cm² tissue culture flasks (Sarstedt) at a concentration of 2.5 × 10⁵ cells/ml 24 h before treatment. Cells were treated for three days at 24-h intervals. Every 24 h, medium change was performed by centrifugation of cells at 900 r.p.m. for 5 min and resuspension of the cell pellet in fresh warm medium. Cells were allowed to recover for 1 h before addition of the drug. Cells were harvested by centrifugation 96 h after seeding, and the cell pellet was washed twice with 1% PBS.
For the first titration series, the REPLI-g Whole Genome Amplification kit (Qiagen) was used to produce unmethylated DNA following the manufacturer's protocol. Part of the whole genome amplified DNA was incompletely methylated in vitro using M.SssI (New England BioLabs) such that ∼20% of all cytosines became methylated. For the second titration series, DNA was extracted from HCT-116 cells grown in vitro (as described below) and from whole blood of a healthy donor using GenElute Mammalian Genomic DNA Miniprep Kit (Sigma-Aldrich) according to the manufacturer's protocol.
For the comparison of fresh-frozen and FFPE material, xenografts were prepared as follows. Two colon cancer cell lines (HCT-15, derived from a Dukes' type C colorectal adenocarcinoma, and HCT-116, derived from a colorectal carcinoma, both obtained from ATCC) were grown in DMEM with 10% FBS supplemented with antibiotic (37 °C, 5% CO2). Cells were harvested, filtered and aliquoted in PBS, and two SCID mice were subcutaneously injected in the flank with 3 × 10⁶ tumor cells per animal and maintained over four weeks. Tumors were extracted, and each tumor was split in two. One part was formalin-fixed and embedded in paraffin, whereas the other part was frozen at −80 °C. For the FFPE samples, DNA was extracted using the QIAamp DNA FFPE Tissue kit (Qiagen) following the manufacturer's protocol. DNA was resuspended in TE buffer pH 7.5 and treated with RNase (Sigma-Aldrich) for 45 min at 37 °C. DNA from the fresh-frozen samples was extracted in the same way as for the primary colon tumors.
All reference DNA samples were quantified using Qubit 2.0 (Invitrogen) and quality-checked by gel electrophoresis. Homogeneous aliquots of equal volume corresponding to a target DNA amount of 1 μg were prepared for all reference samples and shipped on dry ice to the contributing laboratories. Scientists in each laboratory confirmed DNA quality by gel electrophoresis and DNA amounts by one of four alternative technologies (Qubit, Quant-iT, NanoDrop, TapeStation), and documented the results in the analysis reports.
Selection of target genomic regions.
To select informative and challenging target regions, we initially ran the Infinium 450k assay on all reference samples excluding the titration series and the FFPE samples. The resulting data were preprocessed and analyzed with RnBeads67, and we selected 1,072 genomic regions (corresponding to 16 core regions, 32 additional regions, and 1,024 further regions for the EnrichmentBS assays) from the list of Infinium probes that passed quality control. Each of the regions was 122 base pairs wide, with the designated target CpG at the center and a window of 60 base pairs in both directions. 25% of these regions were selected based on differential DNA methylation for colon tumor vs. normal, 25% based on differential DNA methylation for drug treatment vs. control, 25% based on differential DNA methylation between the colon cancer cell lines and whole blood, and 25% were randomly selected. Within each block of differentially methylated CpGs we selected: (i) a subset of the most strongly differentially methylated CpGs (ranks 1, 2, 3, 5, 7, 10, 13, 17, 21 and 25); (ii) the 1% highest and lowest regions in terms of repetitive DNA content, GC content, CpG content, CpA content and CpG observed vs. expected ratio; and (iii) the 1% CpGs with DNA methylation values in the first group that were closest to 0%, 25%, 50%, 75% and 100%, respectively. We also included CpGs in genomic regions that had previously been described as epigenetic biomarkers for colon cancer68. From this list of 1,072 genomic regions, we manually selected 16 core regions and 32 additional regions such that the selected regions capture as much as possible of the technical challenges in assay design and analysis with locus-specific DNA methylation assays. One of the core regions (region 11) was chosen such that the target CpG overlapped with a common SNP, in order to provide an additional challenge during assay design.
Furthermore, some of the remaining 1,024 genomic regions were covered by the two assays that easily scale to hundreds of genomic regions (EnrichmentBS 1 and EnrichmentBS 2). These measurements are available from Supplementary Data 3, but for reasons of comparability across assays they were not included in the benchmarking.
Documentation of DNA methylation assays, results and analysis workflows.
The details for all contributed DNA methylation assays are available in Supplementary Data 1. These reports include a short assay summary, quality control data for the received reference DNA samples, and detailed descriptions of the design and execution of each contributed assay. They follow the standardized reporting template from the information package that was sent to all contributing laboratories (Supplementary Data 2). DNA methylation measurements for each assay, genomic region and reference sample are available in Supplementary Data 3. Illumina 450k microarray data are available at the NCBI Gene Expression Omnibus under the accession number GSE77965. Finally, the source code (written in R) underlying the bioinformatic analysis is available in a public repository (http://biomarker-benchmark.computational-epigenetics.org/), to foster transparency and reuse in the spirit of open science and reproducible research69.
Bioinformatic analysis of absolute DNA methylation assays.
To quantify assay performance without a priori knowledge of the true DNA methylation values in the reference samples, we defined target DNA methylation values by consensus. The consensus corridor was calculated as the narrowest interval containing measurements from three different technologies, extended by an additional flanking region of five percentage points in both directions. We chose this corridor (rather than, e.g., the arithmetic or geometric mean between all measurements) to minimize bias toward overrepresented assays. Based on the consensus corridor we calculated the absolute difference $|d_{a,r,s}|$ between the DNA methylation value $v_{a,r,s}$ (as measured by an assay $a$ for a region $r$ in a sample $s$) and the closest boundary of the corresponding consensus corridor $c_{r,s}$, where $d_{a,r,s} = v_{a,r,s} - c_{r,s}$. According to this metric, assays with smaller absolute differences are in better agreement with the consensus. To assess whether certain assays tend to systematically overestimate or underestimate DNA methylation values, we also calculated the directional deviation $b_a$ as the mean of all differences:

$$b_a = \frac{1}{N_a} \sum_{r,s} d_{a,r,s}$$

where $N_a$ is the number of region and sample combinations measured by assay $a$.
Finally, we performed a sensitivity analysis for the consensus corridor (Supplementary Note and Supplementary Fig. 4), which confirmed that our choice of parameters (i.e., using three different technologies and flanking regions of five percentage points to constitute the consensus corridor) was appropriate for robustly ranking the assays by their performance.
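The consensus corridor and the deviation metrics can be illustrated with a minimal Python sketch. This is not the study's R code; the function names, the example values and the choice to assign a deviation of zero to measurements that fall inside the corridor are assumptions made for illustration.

```python
def consensus_corridor(measurements, n_tech=3, flank=5.0):
    """Narrowest interval covering n_tech distinct technologies, plus flanks.

    measurements: list of (technology, methylation value in percent).
    Returns (lower, upper), or None if fewer than n_tech technologies exist.
    """
    pts = sorted(measurements, key=lambda m: m[1])
    best = None
    for i in range(len(pts)):
        techs = set()
        for j in range(i, len(pts)):
            techs.add(pts[j][0])
            if len(techs) >= n_tech:
                # Shortest interval starting at point i that spans n_tech technologies
                if best is None or pts[j][1] - pts[i][1] < best[1] - best[0]:
                    best = (pts[i][1], pts[j][1])
                break
    if best is None:
        return None
    return (best[0] - flank, best[1] + flank)


def signed_deviation(value, corridor):
    """Signed distance to the closest corridor boundary.

    Assumption: values inside the corridor get a deviation of zero.
    """
    lower, upper = corridor
    if value < lower:
        return value - lower
    if value > upper:
        return value - upper
    return 0.0


def directional_deviation(deviations):
    """Mean of the signed deviations of one assay (b_a in the text)."""
    return sum(deviations) / len(deviations)


# Hypothetical measurements for one region in one sample
example = [("AmpliconBS", 40), ("AmpliconBS", 42), ("Pyroseq", 44),
           ("EpiTyper", 47), ("Infinium", 60)]
corridor = consensus_corridor(example)
deviation = signed_deviation(60, corridor)
```

Because the corridor requires three distinct technologies rather than three measurements, repeated assays of the same technology cannot define the consensus on their own.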
We also quantified the absolute assay performance in an alternative way, which does not rely on any consensus values but makes use of the two titration series. The DNA methylation values in both titration series are expected to be proportional to the titration ratios, which are known. In contrast, the DNA methylation values at the two extreme points of the titration series are different between regions and a priori unknown. Therefore, as outlined in Supplementary Fig. 7, we first calculated the difference between the median of the consensus corridors for each titration series and each region at the 0% and 100% titration ratios. We then removed all regions that did not change by at least five percentage points to focus the analysis on regions with a clear-cut change in DNA methylation over the titration series. Next, regions with a negative change between the 0% and 100% consensus values were inverted by subtracting their measured DNA methylation value from the maximum corresponding to complete DNA methylation. This procedure reversed directionality for the particular region and therefore standardized the direction across all regions. Finally, we adjusted for different offsets of DNA methylation levels by fitting a linear model to the values of each region and then subtracting the linear model offset (intercept) from these values. Using the adjusted DNA methylation values we then evaluated the Pearson correlation of the measured values to the titration ratios, which is the titration-based estimate of the correct value. To evaluate how well the assays captured the linearity of the DNA methylation values along the titration series, we also fitted a second intercept-free linear model to the adjusted DNA methylation values across all regions and samples, and we recorded the adjusted r² and residual standard error of the fitted model.
Assays with higher adjusted r² values and lower residual standard error were considered in better agreement with the expectation that was based on the known titration ratios.
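The adjustment steps of the titration-based analysis can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the study's R code: methylation values are taken to be percentages (so that complete methylation equals 100), all function names are hypothetical, and the adjusted r² is left out because its definition for intercept-free models depends on the software convention.

```python
import math


def ols(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjust_region(ratios, values, consensus_change):
    """Invert regions with a negative consensus change, then remove the offset."""
    if consensus_change < 0:
        values = [100.0 - v for v in values]  # assumes a percent scale
    _, intercept = ols(ratios, values)
    return [v - intercept for v in values]


def fit_through_origin(x, y):
    """Intercept-free linear fit; returns slope and residual standard error."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    rss = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    return slope, math.sqrt(rss / (len(x) - 1))


# Hypothetical region whose methylation decreases along the titration series
ratios = [0, 25, 50, 75, 100]
adjusted = adjust_region(ratios, [90, 70, 50, 30, 10], consensus_change=-80)
correlation = pearson(ratios, adjusted)
slope, rse = fit_through_origin(ratios, adjusted)
```

In the study the intercept-free model was fitted across all regions and samples of an assay; the sketch applies it to a single region only to keep the example short.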
Bioinformatic analysis of relative DNA methylation assays.
We compared the relative assays among each other by calculating pairwise 3-by-3 contingency tables for the differences between each pair of samples recorded by each assay. Measurements that agreed on the direction of change in both assays appear on the diagonal of the contingency table, and the higher the percentage of measurements on the diagonal, the more concordant both assays are. We formalize the agreement between assays as a numeric value, the percent concordance:

$$\text{concordance} = \frac{100}{n} \sum_{i=1}^{3} c_{i,i}$$

where $c_{i,j}$ is the value in the $i$th row and $j$th column of the contingency table and $n = \sum_{i=1}^{3} \sum_{j=1}^{3} c_{i,j}$ is the total number of comparisons in the table.
This approach readily generalizes to the absolute assays, where we considered samples with an absolute difference of less than five percentage points as concordant.
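As a minimal Python sketch, the contingency table and the percent concordance could be computed as shown below. The function names and the per-assay tolerance parameters are hypothetical; for absolute assays we read the five-percentage-point rule as treating smaller differences as "no change", which is our interpretation of the text.

```python
from itertools import combinations


def direction(delta, tol=0.0):
    """Map a difference to 0 (decrease), 1 (no change) or 2 (increase)."""
    if delta > tol:
        return 2
    if delta < -tol:
        return 0
    return 1


def contingency(assay_a, assay_b, tol_a=0.0, tol_b=0.0):
    """3-by-3 table of directions of change over all pairs of samples."""
    table = [[0] * 3 for _ in range(3)]
    for s, t in combinations(range(len(assay_a)), 2):
        i = direction(assay_a[t] - assay_a[s], tol_a)
        j = direction(assay_b[t] - assay_b[s], tol_b)
        table[i][j] += 1
    return table


def percent_concordance(table):
    """Percentage of comparisons on the diagonal of the table."""
    n = sum(sum(row) for row in table)
    return 100.0 * sum(table[i][i] for i in range(3)) / n


# Hypothetical data: a relative assay (scores) and an absolute assay (percent)
relative = [0, 1, 1, 2]
absolute = [10, 40, 30, 80]
table = contingency(relative, absolute, tol_a=0.0, tol_b=5.0)
concordance = percent_concordance(table)
```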
In a separate and complementary analysis, we evaluated the ability of the relative assays to detect the correct direction of change between any two samples by using the median of the three DNA methylation values spanning the previously defined consensus corridor as reference. For each pair of samples, we determined the target direction and magnitude of change as the difference between the two median values, and we checked for each relative assay whether the difference between the corresponding measurements had the same or opposite direction of change. If no difference was detected in the relative assays, this was also recorded. The differences in the medians were divided into four bins: marginal change (absolute difference below five percentage points), small change (5–25 percentage points), medium change (25–50 percentage points), and strong change (above 50 percentage points). Finally, we also evaluated the relative assays based on the titration series, including only those regions with a difference above five percentage points between the two extreme points according to the consensus corridor. Results were regarded as consistent with the titration series if the direction of change observed for the relative assay was the same as the direction of the change in the titration ratio, taking into account the two extreme points according to the consensus corridor.
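The binning and the direction check can be written down compactly. The following Python sketch uses hypothetical function names; the text does not state which bin receives a difference of exactly 25 or 50 percentage points, so the boundary handling here is an assumption.

```python
def change_bin(reference_delta):
    """Assign the reference change (in percentage points) to one of four bins.

    Assumption: 25 falls into 'medium' and 50 into 'medium' as well.
    """
    magnitude = abs(reference_delta)
    if magnitude < 5:
        return "marginal"
    if magnitude < 25:
        return "small"
    if magnitude <= 50:
        return "medium"
    return "strong"


def agreement(reference_delta, relative_delta):
    """Compare the direction of change of a relative assay with the reference."""
    if relative_delta == 0:
        return "no difference"
    if reference_delta * relative_delta > 0:
        return "same"
    return "opposite"


# Hypothetical pair of samples: reference medians differ by 30 percentage points
result = (change_bin(30.0), agreement(30.0, 1.5))
```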
Correlation between input DNA amounts and assay performance.
Two alternative approaches were used to assess the effect of DNA amounts on assay performance (Supplementary Note). First, owing to normal variation in the extracted DNA quality/quantity and in the initial quantification, the DNA amounts varied slightly between reference samples, e.g., ranging from 875 ng to 1,843 ng in the primary tumor/normal samples (Supplementary Fig. 10a). Each laboratory was provided with the exact same volume of homogeneous aliquots for these samples, such that these differences between samples did not result in differences between laboratories. To correlate input DNA amounts with assay performance, we fitted a linear model predicting the deviation from the consensus corridor for each sample and assay using two alternative measures of input DNA amounts: the first value based on the median of concentration measurements across all laboratories multiplied by the volume of DNA used for a given assay, and the second value based on the DNA amounts that each research group reported to have used according to their own concentration measurements. For each assay and each of the two measurements of DNA amount, P values were calculated with linear models and adjusted for multiple testing using the Benjamini-Hochberg method. We used an adjusted P-value threshold of 0.05 to call assays significantly influenced by DNA amount, but no associations were significant at this level.
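The multiple-testing step can be illustrated with a short Python sketch of the Benjamini-Hochberg adjustment. The P values in the example are invented, and the sketch assumes that one P value per assay has already been obtained from the corresponding linear model.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest to the smallest P value and enforce monotonicity
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-assay P values for the association with DNA amount
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
```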
Second, to assess the impact of DNA amounts in a much lower range (0.3 ng to 100 ng), we established a titration series with target DNA amounts of 100 ng, 30 ng, 10 ng, 3 ng, 1 ng and 0.3 ng from one of the tumor reference samples (CRC 2 tumor). To exclude differences owing to variation among different bisulfite conversion protocols and reagents70, these samples were bisulfite-converted centrally, using the EZ DNA Methylation-Direct Kit (Zymo Research, D5020) with the following deviations from the manufacturer's protocol: Conversion reagent was applied at 0.9× concentration, reactions incubated for 20 cycles of 1 min at 95 °C and 10 min at 60 °C, and the desulphonation time was extended to 30 min. The converted DNA was shipped on dry ice to nine laboratories that repeated their assays on these samples. We also analyzed the impact of reductions in DNA quality by fragmenting DNA from one of the tumor reference samples (CRC 1 tumor) to an average fragment length of 200 base pairs. To that end, batches of 600 ng DNA were digested with NEBNext dsDNA Fragmentase (New England BioLabs, M0348L) for exactly 60 min at 37 °C, stopping the fragmentation reactions by addition of 5 μl of 0.5 M EDTA stop solution. The fragmented batches were combined, titrated to the same amounts as above, bisulfite-converted and shipped to the contributing laboratories.
Prediction of global DNA methylation levels.
The global DNA methylation assays give rise to one single value per sample, which made it possible to plot all data points into one diagram (Fig. 4a) and to assess the overall consistency of the results by visual inspection. In addition, we explored whether we could predict global DNA methylation values from the results of the locus-specific DNA methylation assays, either by using the mean or median of the DNA methylation levels or by more complex machine learning methods such as generalized linear models, support vector regression (linear and polynomial kernels) and random forest regression. To compensate for the fact that not all assays were run on all samples, we first imputed missing values by filling in the values of the most closely related other assay based on Pearson correlation. We trained the regression models using leave-one-out cross-validation to make optimal use of the limited data set. For each method and each analysis, we recorded the root mean square error (RMSE) between the prediction and the target value. As no single global assay gave fully consistent results, we chose as global target the mean of the two best-performing assays (HPLC-MS and Pyroseq NBL2), and we replaced the four mean values that were inconsistent with the known change in concentration in the titration series by imputed values that were calculated as the mean of the two neighboring values in the titration series. The e1071 R package was used for support vector regression, randomForest for random forest regression and DMwR for cross-validation.
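The leave-one-out evaluation can be sketched in Python for the simplest case, in which the global value is predicted from a single summary of the locus-specific measurements (for example their mean) by a linear model. The function names and numbers are hypothetical, and this sketch stands in for, rather than reproduces, the R packages named above.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def loocv_rmse(x, y):
    """Root mean square error of leave-one-out predictions."""
    squared_errors = []
    for k in range(len(x)):
        # Train on all samples except sample k, then predict sample k
        x_train, y_train = x[:k] + x[k + 1:], y[:k] + y[k + 1:]
        slope, intercept = fit_line(x_train, y_train)
        squared_errors.append((slope * x[k] + intercept - y[k]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data: mean locus-specific methylation vs. global target value
locus_mean = [10.0, 20.0, 30.0, 40.0, 50.0]
global_target = [21.0, 39.0, 62.0, 80.0, 101.0]
rmse = loocv_rmse(locus_mean, global_target)
```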
Analysis of discriminatory power.
We trained linear support vector machines using patient-stratified cross-validation, such that each prediction used four tumor/normal pairs for training and left two pairs out for test-set validation. The e1071 R package was used to train the classifiers and the ROCR package71 to calculate the ROC area under curve as the main performance metric. We further examined the robustness of the classifiers in the presence of two different error models: (i) random error and (ii) uniform noise.
Random error. We simulated faulty measurements by replacing a defined fraction of measurements by random numbers drawn from the pool of all measurements of that assay. In this manner, we ensured that the simulated erroneous measurements were drawn from the same distribution as the correct measurements without making assumptions about the statistical distribution of the data.
Uniform noise. We simulated inaccurate measurements by adding a random number to each measurement. At any given noise level n, this random number was sampled uniformly from the interval [−n × r; n × r], where r is the range spanned by all DNA methylation values for the same assay. To assess the prediction performance, we tested each classifier in a stratified threefold cross-validation: for each error model, noise/error level, assay, and selection of training and test set, we performed 1,000 repetitions of the analysis with randomized noise/error. To assess the robustness toward fewer measurements, we repeated the analysis with 25% uniform noise after removing the majority of regions from the training and test sets. The choice of regions retained (either 1, 3 or 5) was either entirely random or guided by the information content of each region for the prediction. We calculated the information content separately for each assay and region as the F score72. As before, we performed patient-stratified cross-validation with random repetitions. Finally, we analyzed a much larger cohort with 160 primary prostate tumor samples and 8 nonmatched normal prostate samples, comparing the EpiTyper 3 and Infinium assays with each other in terms of their correlation and discriminatory power.
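The two error models described above can be sketched in Python as follows. The function names and example values are hypothetical; the sketch rounds the number of replaced measurements to the nearest integer and does not clip noisy values to the 0–100% range, because the text specifies neither.

```python
import random


def add_random_error(values, fraction, rng):
    """Replace a fraction of measurements by draws from the pool of all values."""
    out = list(values)
    n_replace = round(fraction * len(out))
    for i in rng.sample(range(len(out)), n_replace):
        out[i] = rng.choice(values)  # same empirical distribution as the data
    return out


def add_uniform_noise(values, level, rng):
    """Add noise drawn uniformly from [-level * r, level * r] to each value.

    r is the range spanned by all values of the assay.
    """
    r = max(values) - min(values)
    return [v + rng.uniform(-level * r, level * r) for v in values]


# Hypothetical methylation values (percent) of one assay
rng = random.Random(0)  # fixed seed so that the simulation is repeatable
measurements = [5.0, 20.0, 35.0, 60.0, 95.0, 80.0, 10.0, 50.0]
with_errors = add_random_error(measurements, 0.5, rng)
with_noise = add_uniform_noise(measurements, 0.25, rng)
```

Drawing replacement values from the pool of observed measurements keeps the simulated errors on the same scale as the data without assuming a parametric distribution, which mirrors the rationale given for the random error model.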
We thank J. Hadler, T. Penz, C. Lo Porto, B. Schmitt, M. Bähr, M. Helf, O. Mücke, N. Mazaleyrat, C.V. Wong, T.E. Kjeldsen, A. Janosch, P. Dhami, E. Flores and H. Gohlke for technical assistance, and H. Stunnenberg as well as the scientific advisory board of BLUEPRINT for their advice and support. This work was performed in the context of the BLUEPRINT project (European Union's Seventh Framework Programme grant agreement 282510), which funded the study logistics and the integrative data analysis. The assay costs were paid by the contributing laboratories using institutional funds and the following grants: BBSRC BB/G020930/1, BMBF 01KU1001A, BMBF 01KU1002A, BMBF 01KU1216F, EU-FP7 282510, FWF I 1575-B19, NHMRC 1063559, NHMRC 1088144 and the DKFZ Graduate School.