DNA methylation patterns are altered in numerous diseases and often correlate with clinically relevant information such as disease subtypes, prognosis and drug response. With suitable assays and after validation in large cohorts, such associations can be exploited for clinical diagnostics and personalized treatment decisions. Here we describe the results of a community-wide benchmarking study comparing the performance of all widely used methods for DNA methylation analysis that are compatible with routine clinical use. We shipped 32 reference samples to 18 laboratories in seven different countries. Researchers in those laboratories collectively contributed 21 locus-specific assays for an average of 27 predefined genomic regions, as well as six global assays. We evaluated assay sensitivity on low-input samples and assessed the assays' ability to discriminate between cell types. Good agreement was observed across all tested methods, with amplicon bisulfite sequencing and bisulfite pyrosequencing showing the best all-round performance. Our technology comparison can inform the selection, optimization and use of DNA methylation assays in large-scale validation studies, biomarker development and clinical diagnostics.
DNA methylation is an epigenetic mark widely studied for its association with diseases such as cancer1 and autoimmune disorders2, with environmental exposures3 and with other biological phenomena4, 5. Strong associations between DNA methylation patterns and clinical phenotypes can be used as biomarkers for diagnosing diseases and guiding treatment6, 7. For example, DNA methylation biomarkers have been shown to support clinical decisions in various cancers8, 9, 10, 11, 12, 13, 14 and are also used for noninvasive prenatal testing15, for quality control of cultured cells16 and for forensic applications17, 18.
DNA methylation biomarkers have several advantages that qualify them for broad use as in vitro diagnostics: (i) DNA methylation is cell-type-specific but robust toward transient perturbations, thus complementing static DNA-sequence-based biomarkers and volatile RNA-expression-based biomarkers. (ii) DNA methylation is a binary mark (i.e., for a single cell and allele, each CpG is either methylated or unmethylated), which facilitates reliable measurements on heterogeneous and degraded samples. (iii) The infrastructure for assaying DNA methylation biomarkers is already present in many clinical diagnostics laboratories, as the assays are similar to those used for DNA-sequence-based biomarkers. (iv) DNA methylation biomarkers are straightforward to integrate into routine clinical workflows because DNA is more stable than RNA and does not require any special handling. (v) DNA methylation patterns are faithfully retained during long-term storage as fresh-frozen or formalin-fixed, paraffin-embedded (FFPE) samples.
Genome-wide mapping and analysis of DNA methylation has become feasible for patient cohorts with thousands of samples19, 20, and epigenome-wide association studies have been conducted for numerous biomedically relevant phenotypes21, 22. To translate relevant epigenome associations into clinically useful biomarkers, it is necessary to select a manageable set of highly informative genomic regions, to target these loci with DNA methylation assays that are sufficiently fast, cheap, robust and widely available to be useful for routine clinical diagnostics23, 24, 25, and to confirm their predictive value in large validation cohorts.
Here we systematically compared and evaluated the most promising assays for measuring DNA methylation in large cohorts, clinical diagnostics and biomarker development. This multicenter study included research groups from seven countries across three continents, organized by the BLUEPRINT project26 in the context of the International Human Epigenome Consortium27 and as a follow-up to a previous comparison of genome-wide DNA methylation assays28, 29, 30. Overall, our results show that most assays provide high accuracy and robustness, although we observed some differences between assay types and laboratories. We provide detailed documentation of all contributed assays (Supplementary Data 1), such that this study can be used not only to guide assay selection but also as a resource of validated DNA methylation protocols.
Study design and assay selection
We selected assays based on comprehensive literature review, and for each promising assay we selected at least one research group that had extensive prior experience using that particular assay (Fig. 1a). In total, we invited 25 research groups, of which 19 agreed to participate. All participants received DNA aliquots for 32 reference samples, together with a list of 48 preselected genomic regions to be targeted. They designed the assays independently, analyzed the reference samples with their assays of choice and submitted the final results together with a detailed assay report for centralized benchmarking analysis by the study coordinator. Ultimately, 18 of the 19 participating research groups submitted complete analysis reports for a total of 27 assays (Table 1 and Supplementary Table 1). All contributed assays can be classified into one of three categories, which we summarize below.
First, absolute DNA methylation assays provide a quantitative measure of DNA methylation levels at single-CpG resolution. We included 16 absolute assays based on four technologies: (i) amplicon bisulfite sequencing (AmpliconBS) uses next-generation sequencing (NGS) of pooled PCR amplicons derived from bisulfite-converted DNA31, 32, 33. (ii) Enrichment bisulfite sequencing (EnrichmentBS) is similar to AmpliconBS in its use of bisulfite conversion and NGS, but it uses highly scalable techniques such as padlock probes or microdroplet-based amplification to enrich many genomic regions in parallel rather than relying on separate PCRs for each individual region34, 35. (iii) Mass spectrometric analysis of DNA methylation (EpiTyper) combines bisulfite conversion, in vitro transcription and uracil-specific cleavage with mass-spectrometry-based quantification of fragment lengths36, 37, 38. (iv) Bisulfite pyrosequencing (Pyroseq) applies sequencing by synthesis39 to single PCR amplicons obtained from bisulfite-converted DNA40, 41.
Second, relative DNA methylation assays measure DNA methylation by comparing samples to a suitable reference. This approach is mainly used for detecting methylated DNA fragments in an excess of unmethylated fragments, but it also provides rough estimates of absolute DNA methylation levels. We included five relative DNA methylation assays based on three alternative technologies: (v) MethyLight uses PCR amplification of bisulfite-converted DNA in combination with fluorescently labeled probes that hybridize specifically to a predefined DNA methylation pattern, typically that of fully methylated DNA42, 43. (vi) Methylation-specific melting assays, including methylation-sensitive high-resolution melting (MS-HRM) and methylation-specific melting curve analysis (MS-MCA), apply melting curve analysis to amplicons obtained from bisulfite-converted DNA, which provides a semiquantitative measure of cytosines that have been converted to thymines44, 45, 46. (vii) Quantitative methylation-specific PCR (qMSP) uses DNA-methylation-specific primers in combination with real-time PCR to compare the prevalence of a specific DNA methylation pattern with that of a suitable reference47, 48.
Third, global DNA methylation assays measure a sample's total DNA methylation content, which can be useful for measuring hypomethylation in cancer49 and the response to drugs that inhibit DNA methylation50. We included six global assays based on three alternative technologies: (viii) High-performance liquid chromatography followed by mass spectrometry (HPLC-MS) quantifies the amount of 5-methylcytosine based on its mass difference compared to unmethylated cytosine51. (ix) Immunoquantification of global DNA methylation (Immunoquant) uses a modified enzyme-linked immunosorbent assay (ELISA) with an antibody against 5-methylcytosine to quantify the total amount of methylated DNA in a given DNA sample52. (x) Bisulfite pyrosequencing of repetitive DNA elements (Pyroseq AluYb8/D4Z4/LINE/NBL2) applies pyrosequencing to amplicons obtained from bisulfite-converted DNA using primers that amplify multiple instances of the selected type of repeat53, 54, 55, 56, which assumes that averaged local DNA methylation levels across specific repetitive regions correlate with global DNA methylation levels.
Given the study's focus on clinically applicable assays, we did not include emerging technologies that have not yet been shown to be practically useful in large-scale studies, e.g., nanopores, nanowire transistors, quantum dots, single-molecule real-time sequencing and atomic force spectroscopy57. We also did not include genome-wide assays such as whole-genome bisulfite sequencing, reduced-representation bisulfite sequencing, methylated DNA immunoprecipitation sequencing or methyl-CpG binding domain enriched sequencing, given that these assays have been benchmarked previously28, 29, 30, and are currently too cumbersome and expensive for routine clinical diagnostics. However, we did include the Infinium 450k assay58, which we also used to select the target regions, and we performed a limited amount of clonal bisulfite sequencing59, 60, given that this assay was until recently considered the gold standard but has been largely superseded by less labor-intensive assays. Our benchmarking did not explicitly address non-CpG methylation nor DNA methylation variants (5hmC, 5fC and 5caC), but most of the included assays can be used to measure non-CpG methylation as well as CpG methylation, and they can also be adapted to distinguish between DNA methylation variants61, 62, 63. Finally, we note that all contributed locus-specific assays were bisulfite-based, although we had invited four research groups that had expertise in alternative technologies.
Reference samples and target regions
We prepared 32 reference samples that mimic typical applications of DNA methylation assays in biomedical research and clinical diagnostics (Supplementary Table 2). This sample set included DNA extracted from six pairs of primary colon tumor and adjacent normal colon tissue samples ('tumor/normal'), DNA from two cell lines before and after treatment with a demethylation-inducing drug ('drug/control'), a titration series with partially methylated DNA spiked into unmethylated DNA ('titration 1'), another titration series with DNA from a cancer cell line spiked into whole blood DNA ('titration 2'), and DNA from two matched pairs of fresh-frozen and FFPE xenograft tumors ('frozen/FFPE').
To establish suitable targets for the locus-specific assays, we performed genome-scale DNA methylation analysis with the Infinium 450k assay and selected 48 differentially methylated CpGs that cover a broad range of technical challenges encountered in biomarker development (Supplementary Table 3). For example, we included genomic regions with high and low CpG density, GC content and repetitive DNA overlap. As an additional challenge, we included a single-nucleotide polymorphism (SNP) that replaces a potentially methylated CpG by an always unmethylated TpG dinucleotide in some of the reference samples.
To each contributing laboratory we sent aliquots of ~1 μg DNA for each of the 32 reference samples. In addition, we provided a standardized information package comprising general instructions, documentation templates and the list of the 48 target genomic regions (Supplementary Data 2). Each region had one designated target CpG for which the DNA methylation level was to be measured, and we asked the contributing research groups to return DNA methylation measurements for each of the reference samples. We gave no further instructions on how to design the assays or how to derive the DNA methylation measurements for the target CpG from the raw data. Moreover, we asked research groups not to exchange any information among each other, and they did not have access to the Infinium 450k data used for region selection.
We designated 16 of the 48 target regions as mandatory ('core regions') and let scientists in each contributing research group decide for themselves how many of the remaining 32 regions they would cover in addition to the core regions. On average, assay design was attempted for 30 genomic regions and was successful in 95% of cases (Supplementary Table 1). The known SNP at one of the target CpGs was detected and reported by 9 of the 17 research groups that contributed locus-specific assays. We removed the SNP-containing region from further analysis to avoid bias, but we emphasize the importance of double-checking for known SNPs during assay design. In total, scientists from 18 laboratories contributed 16 absolute, 5 relative and 6 global assays (Supplementary Fig. 1), giving rise to a benchmarking data set with 16,435 locus-specific and 192 global DNA methylation measurements (Supplementary Data 3).
Performance of absolute DNA methylation assays
All absolute assays detected the expected bimodal pattern of DNA methylation, with most regions being either highly or lowly methylated (Fig. 1b). NGS-based assays (i.e., AmpliconBS and EnrichmentBS) reported extreme values of 0% and 100% more frequently than the other assays, which can be explained by their digital counting of methylated and unmethylated cytosines. The distribution plots confirmed the expected differences among the 32 reference samples (Fig. 1b), with higher DNA methylation levels for colon tumors than in matched normal tissue in the target regions, lower DNA methylation in the drug-treated leukemia cell lines, decreasing DNA methylation with decreasing concentrations of in vitro methylated DNA (titration 1) and cancer cell line DNA (titration 2), and similar DNA methylation levels for DNA extracted from fresh-frozen vs. FFPE xenografts. These plots also illustrate the broad range of different DNA methylation distributions among the selected target regions (Fig. 1b).
To assess global similarity among the absolute DNA methylation assays, we calculated Pearson correlation coefficients across all measurements for each pair of assays (Supplementary Fig. 2) and performed hierarchical clustering (Fig. 1c). 85% of between-assay comparisons resulted in correlations above 0.8, and 46% even exceeded 0.9, indicating an excellent overall agreement between many of the tested assays. Correlations were high for technical replicates in the same laboratory (Pyroseq 1 vs. Pyroseq 1 (replicate): r = 0.996), the same technology between laboratories (e.g., Pyroseq 1 vs. Pyroseq 2: r = 0.98) and between assays of different types (e.g., Pyroseq 1 vs. EpiTyper 3: r = 0.95; Pyroseq 1 vs. AmpliconBS 1: r = 0.97). However, not all assays agreed equally well (Fig. 1d and Supplementary Fig. 2). For instance, the Infinium assay reported higher DNA methylation levels for CpGs that other assays identified as lowly methylated, while reporting slightly reduced DNA methylation levels for highly methylated CpGs; and the EnrichmentBS 1 assay gave rise to a substantial number of outliers when compared to any of the other assays.
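The pairwise comparison described above can be illustrated with a minimal Python sketch. It assumes that each assay is stored as a mapping from (sample, region) to a methylation percentage and that each pair of assays is correlated over the measurements both of them report. The data layout, the function names and the toy values are illustrative assumptions, not the analysis code of the study.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(assays):
    """Correlate every pair of assays over the measurements both report.

    `assays` maps an assay name to {(sample, region): methylation in %}.
    """
    result = {}
    for a, b in combinations(sorted(assays), 2):
        shared = sorted(set(assays[a]) & set(assays[b]))
        xs = [assays[a][key] for key in shared]
        ys = [assays[b][key] for key in shared]
        result[(a, b)] = pearson(xs, ys)
    return result


# Toy values, invented for illustration only.
toy = {
    "assay_A": {("s1", "r1"): 5.0, ("s1", "r2"): 50.0, ("s2", "r1"): 95.0},
    "assay_B": {("s1", "r1"): 10.0, ("s1", "r2"): 55.0, ("s2", "r1"): 100.0},
    "assay_C": {("s1", "r1"): 90.0, ("s1", "r2"): 50.0, ("s2", "r1"): 10.0},
}
correlations = pairwise_correlations(toy)
```

The resulting matrix of coefficients (or one minus each coefficient, as a distance) could then be passed to any standard hierarchical clustering routine.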
To quantify the accuracy of individual assays, a reference is needed against which to evaluate the measurements. Synthesized DNA with predefined DNA methylation patterns would be one option, but this is currently feasible only for fully methylated DNA spiked into fully unmethylated DNA, thus ignoring the challenges posed by heterogeneous DNA methylation patterns64. For this reason, we chose two alternative approaches for quantifying assay performance in the presence of epigenetic heterogeneity.
First, we combined data from several assays into high-confidence consensus estimates to establish target DNA methylation levels for the reference samples (Fig. 2a and Supplementary Fig. 3a). For each sample and genomic region, we identified the smallest interval comprising measurements by at least three of the five technologies (AmpliconBS, EnrichmentBS, EpiTyper, Infinium and Pyroseq), which minimizes the impact of outliers and technology-specific artifacts. Moreover, we extended these intervals with flanking windows of five percentage points on either side to account for small deviations (Fig. 2a). We used the resulting 'consensus corridor' as a surrogate for the true DNA methylation level (which is unknown) of each target CpG in each reference sample. All assays contributed to the consensus corridor (Supplementary Fig. 3b,c), and sensitivity analysis confirmed that the ranking of assay performance was robust to the exact definition of the consensus corridor (Supplementary Note and Supplementary Fig. 4).
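One possible reading of the consensus corridor is sketched below in Python for a single sample and genomic region. The sketch assumes that an interval "comprises" a technology as soon as it contains at least one measurement from any assay of that technology; this interpretation, the function name and the toy values are assumptions made for illustration. The corridor is not clipped to the 0–100% range here.

```python
def consensus_corridor(measurements, min_technologies=3, flank=5.0):
    """Return (low, high) of the corridor for one sample and region.

    `measurements` is a list of (technology, methylation in %) tuples.
    The narrowest interval covering at least `min_technologies` distinct
    technologies is widened by `flank` percentage points on either side.
    """
    points = sorted((value, tech) for tech, value in measurements)
    best = None
    for i in range(len(points)):
        seen = set()
        for j in range(i, len(points)):
            seen.add(points[j][1])
            if len(seen) >= min_technologies:
                width = points[j][0] - points[i][0]
                if best is None or width < best[0]:
                    best = (width, points[i][0], points[j][0])
                break  # extending j further can only widen the interval
    if best is None:
        raise ValueError("too few technologies reported a measurement")
    _, low, high = best
    return low - flank, high + flank


# Toy values, invented for illustration only.
toy = [
    ("AmpliconBS", 40.0),
    ("Pyroseq", 42.0),
    ("EpiTyper", 45.0),
    ("Infinium", 60.0),
    ("EnrichmentBS", 10.0),
]
corridor = consensus_corridor(toy)
```

In this toy case the two outlying values (10% and 60%) do not influence the corridor, which is the intended effect of requiring only three of the five technologies.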
Evaluating each assay against this corridor (Fig. 2b and Supplementary Table 1), we observed the lowest mean absolute deviation for Pyroseq 1 and its replicate (1.1 and 1.2, respectively), closely followed by AmpliconBS 1 (1.6), Pyroseq 2 (2.5), AmpliconBS 2 (2.8) and Pyroseq 5 (3.3). The highest mean absolute deviation was observed for EpiTyper 2 and EnrichmentBS 1 (6.8 and 11.4, respectively). We also assessed the bias of each assay, which we defined as the directional (rather than absolute) deviation from the consensus corridor. For the Infinium assay we observed an overall tendency to overestimate DNA methylation levels, whereas all Pyroseq assays tended to underestimate DNA methylation levels. For AmpliconBS, EnrichmentBS and EpiTyper, the average direction of the deviation depended on the laboratory.
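The two summary statistics used here, absolute deviation and bias, can be sketched as follows. The sketch assumes that a measurement inside the corridor has a deviation of zero and that a measurement outside is scored by its signed distance to the nearest corridor boundary; this scoring rule, the function names and the toy values are illustrative assumptions.

```python
def signed_deviation(value, corridor):
    """Signed distance (percentage points) from a value to the corridor."""
    low, high = corridor
    if value < low:
        return value - low    # negative: below the corridor
    if value > high:
        return value - high   # positive: above the corridor
    return 0.0                # inside the corridor


def summarize_assay(values, corridors):
    """Return (mean absolute deviation, bias) for one assay."""
    deviations = [signed_deviation(v, c) for v, c in zip(values, corridors)]
    mean_abs = sum(abs(d) for d in deviations) / len(deviations)
    bias = sum(deviations) / len(deviations)
    return mean_abs, bias


# Toy values, invented for illustration only.
values = [30.0, 45.0, 58.0, 90.0]
corridors = [(35.0, 50.0), (35.0, 50.0), (35.0, 50.0), (80.0, 100.0)]
mean_abs, bias = summarize_assay(values, corridors)
```

A positive bias corresponds to a tendency to overestimate and a negative bias to a tendency to underestimate, whereas the mean absolute deviation ignores direction.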
Furthermore, to understand which properties make genomic regions difficult to measure, we fitted a linear model that predicts the deviation from the consensus based on each region's estimated DNA methylation level, GC content, CpG observed vs. expected ratio and content of repetitive DNA (Supplementary Fig. 5). Four assays (AmpliconBS 4, EnrichmentBS 1, Pyroseq 4 and Pyroseq 5) showed significantly increased deviation in highly methylated regions, whereas the Infinium assay was comparatively more accurate in highly methylated regions. GC content, CpG density and repetitive DNA also affected the deviation in some cases, but the best-performing assays did not show any significant biases (Supplementary Fig. 5). Finally, we compared assay performance between matched fresh-frozen and FFPE samples (Supplementary Fig. 6). We obtained highly similar results, showing that all tested assays are compatible with DNA from FFPE material.
Second, as a complementary approach to consensus corridors, we assessed the performance of each assay using two titration series with known ratios. The titration samples included heterogeneous DNA methylation patterns, which make them more challenging than titrations of fully methylated DNA used for assay calibration. For the first titration series, we created partially and heterogeneously methylated DNA by incomplete in vitro methylation (less than 20% methylated cytosines) and combined it with unmethylated DNA at ratios of 100%, 75%, 50%, 10%, 1% and 0%. The second titration series mimics the diagnostic task of detecting hypermethylated cancer DNA against a background of blood-derived DNA. To that end, we spiked DNA from a colon cancer cell line (HCT-116) at ratios of 100%, 10%, 1%, 0.1%, 0.01% and 0% into DNA extracted from whole blood. We then fitted linear models to the DNA methylation measurements (Fig. 2c and Supplementary Fig. 7) to assess consistency with the known titration ratios. The agreement was high for most assays and for three alternative metrics (Fig. 2d and Supplementary Table 1). Best results were achieved by AmpliconBS 1 with median Pearson correlation coefficients of 0.99 and 0.93 in the two titration series (Fig. 2c). By contrast, EnrichmentBS 2 is an example of an assay that showed more variability, with correlation coefficients of 0.77 and 0.87 (Fig. 2c). The results for the titration series were in good agreement with the assay performance in the consensus-based validation (Fig. 2b), with AmpliconBS 1, AmpliconBS 2, Pyroseq 1 and Pyroseq 3 being among the best in both analyses.
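As a rough illustration of the titration analysis, the Python sketch below fits an ordinary least-squares line of measured DNA methylation against the known titration ratio for one genomic region and reports the Pearson correlation. The text does not spell out the three metrics that were used, so the choice of slope, intercept and correlation here is an assumption, as are the function name and the synthetic measurements.

```python
from math import sqrt


def fit_titration(ratios, measured):
    """Least-squares fit of measured methylation against titration ratio.

    Returns (slope, intercept, Pearson correlation coefficient).
    Assumes neither input is constant.
    """
    n = len(ratios)
    mean_x = sum(ratios) / n
    mean_y = sum(measured) / n
    sxx = sum((x - mean_x) ** 2 for x in ratios)
    syy = sum((y - mean_y) ** 2 for y in measured)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ratios, measured))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Ratios of the first titration series as stated in the text (in %).
ratios = [100.0, 75.0, 50.0, 10.0, 1.0, 0.0]
# Synthetic, perfectly linear measurements, invented for illustration only.
measured = [0.8 * x + 2.0 for x in ratios]
slope, intercept, r = fit_titration(ratios, measured)
```

For real data the fit would be repeated per region and per assay, and a summary such as the median correlation across regions could be reported.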
Finally, we assessed how clonal bisulfite sequencing59, 60 would fare in our benchmarking, given that it was previously considered the gold standard for locus-specific DNA methylation mapping. At a target coverage of 10–20 Sanger sequencing clones, fully unmethylated and fully methylated CpGs gave rise to consistent measurements between replicates, but regions with intermediate DNA methylation levels agreed less well (Supplementary Fig. 8a). Diverging measurements appeared to be caused by random noise resulting from sequencing few clones, and both replicates clustered similarly well with other assays (Pearson correlation above 0.9 for all but one assay; Supplementary Fig. 8b). We did not observe any directional deviation from the consensus corridor (Supplementary Fig. 8c), and Pearson correlation coefficients in comparison to other assays were in the range of 0.7 to 0.9 (Supplementary Fig. 8d). Overall, clonal bisulfite sequencing performed reasonably well in our analysis but did not reach the accuracy and reproducibility of the top-ranking assays.
Performance of relative DNA methylation assays
Relative DNA methylation assays detect DNA molecules with a predefined DNA methylation pattern, e.g., identifying fully methylated, tumor-derived DNA fragments in an excess of blood DNA. This approach is less suited for measuring quantitative DNA methylation levels at single-CpG resolution, which prompted two of the research groups contributing relative assays to report their measurements as ranges (e.g., 0.1–1%, 1–10%, or 10–100%). Indeed, the observed correlations were generally lower for relative assays than for absolute assays, ranging from 0.5 to 0.7 in most comparisons (Supplementary Fig. 9).
To benchmark the relative assays in a way that accounts for their strengths and characteristics, we assessed their ability to detect differences in DNA methylation between pairs of samples. For each assay and each pairwise comparison we discretized the measurements into three categories ('+', higher DNA methylation in first sample; '−', lower DNA methylation in first sample and '=', no detectable difference) and calculated the agreement between the different assays (Fig. 3a). Percent concordance values were high for closely related assays (68% for MS-HRM vs. MS-MCA and 75% for qMSP preamp vs. qMSP standard), but substantially lower between technologies (Fig. 3b). Furthermore, when we compared these results with the concordance values observed for absolute assays using suitable thresholds (Online Methods), the relative assays were as similar to some of the absolute assays as they were to each other, but they did not reach the high concordance (90% and above) observed among the best-performing absolute assays (Fig. 3b).
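The discretization and concordance calculation can be sketched in a few lines of Python. The threshold below which two samples count as showing no detectable difference is a hypothetical parameter in this sketch (the thresholds actually used are given in the Online Methods), and the function names and toy calls are likewise illustrative.

```python
def classify_difference(first, second, threshold):
    """Discretize a pairwise comparison into '+', '-' or '='."""
    diff = first - second
    if diff > threshold:
        return "+"   # higher DNA methylation in the first sample
    if diff < -threshold:
        return "-"   # lower DNA methylation in the first sample
    return "="       # no detectable difference


def percent_concordance(calls_a, calls_b):
    """Percentage of shared comparisons on which two assays agree.

    Each argument maps a comparison identifier to '+', '-' or '='.
    """
    shared = set(calls_a) & set(calls_b)
    agreeing = sum(calls_a[key] == calls_b[key] for key in shared)
    return 100.0 * agreeing / len(shared)


# Toy calls, invented for illustration only.
calls_a = {"r1": "+", "r2": "-", "r3": "=", "r4": "+"}
calls_b = {"r1": "+", "r2": "-", "r3": "+", "r4": "+"}
concordance = percent_concordance(calls_a, calls_b)
```

Because only the direction of each difference is compared, this measure does not penalize relative assays for their limited quantitative resolution.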
We also assessed the discriminatory power of the relative assays for DNA methylation differences identified by the consensus corridor, and for the known ratios in the two titration series (Fig. 3c). In these analyses, all relative assays accurately detected DNA methylation differences that exceeded 25%, whereas the performance for smaller differences varied between assays. MethyLight and qMSP detected many small differences with the expected direction, but reported the opposite direction for a considerable number of genomic regions. By contrast, MS-MCA and MS-HRM were more conservative and yielded very few mistakes regarding direction, but at the cost of detecting fewer differences.
Finally, we asked how well the relative DNA methylation assays captured quantitative differences in DNA methylation between samples. To that end, we took the quantitative differences reported by the relative assays for regions that were correctly classified and plotted them against the difference in consensus corridor estimates (Fig. 3d). The differences in the consensus corridor were most accurately recapitulated by the MethyLight assay. By contrast, the measurements of the other relative assays did not correlate well with the difference obtained from the consensus corridor, supporting the notion that MS-MCA, MS-HRM and qMSP should only be used for the type of qualitative comparisons that they were originally developed for.
Given that sample material is scarce in some applications of DNA methylation biomarkers (e.g., needle biopsies and blood-plasma-derived DNA), we also analyzed the effect of input DNA amounts on assay performance. This analysis confirmed that DNA amounts were not limiting the assay performance in the main part of our comparison, but only the AmpliconBS and Pyroseq technologies were able to cope with severely reduced amounts and/or high fragmentation of input DNA (Supplementary Note and Supplementary Figs. 10–12).
Performance of global DNA methylation assays
Global DNA methylation assays report a single measurement value for each sample, indicative of its total DNA methylation content (Fig. 4a). For HPLC-MS, the results were generally consistent with expectations, showing global hypomethylation for the tumor samples (as opposed to locus-specific hypermethylation in the target regions of the absolute and relative assays) and for the drug-treated cell lines (Fig. 4a), similar values for fresh-frozen and FFPE samples from the same xenograft, and gradually decreasing DNA methylation from left to right in the two titration series (with relatively small differences and one strong outlier). The data for the Immunoquant assay did not agree well with the characteristics of the reference samples. Finally, the pyrosequencing assays, which focused on different types of repetitive elements (AluYb8, D4Z4, LINE1 and NBL2), detected the drug-induced demethylation and, to a lesser extent, also the tumor hypomethylation (LINE1 and NBL2) and the differences in the titration series (NBL2).
With correlations of 0.37 to 0.82 between the three technologies (Fig. 4b), there was less agreement among the global DNA methylation assays than we had observed for the locus-specific DNA methylation assays. This result prompted us to explore whether global DNA methylation levels could be inferred from locus-specific data, as a potential alternative to measuring them with global assays. We defined the 'global target' as the outlier-corrected mean of the two best-performing global assays (HPLC-MS and Pyroseq NBL2), and we tested several approaches for predicting the sample-specific global target values from the locus-specific data. Averaging across locus-specific measurements did not provide an accurate prediction (correlations of 0.37 to 0.77, Fig. 4b), likely because the target regions were enriched for regulatory elements with different DNA methylation dynamics compared to the bulk of the genome. By contrast, machine learning methods such as the generalized linear model, support vector regression and random forest regression compensated for these differences and predicted the global target values much more accurately (Fig. 4c,d). These results suggest that locus-specific assays in combination with statistical methods can be used to detect sample-specific differences in global DNA methylation (Fig. 4e).
Accuracy and robustness as epigenetic biomarkers
Epigenetic biomarker development is an important application of DNA methylation assays, requiring robust discrimination between cell types or disease states. We observed good separation between the different cell types using unsupervised methods (Supplementary Fig. 13), and we sought to quantify the assays' discriminatory power by supervised analysis focusing on the colon tumor and adjacent normal samples (Fig. 5). To that end, we trained support vector machines to distinguish between tumor and normal samples based on the data of each assay. We used four tumor-normal pairs for training, and evaluated the prediction performance on test sets consisting of the two remaining pairs, constituting a threefold cross-validation. Receiver operating characteristic (ROC) curves show excellent prediction performance for most assays (Fig. 5a and Supplementary Fig. 14a), which is not unexpected because DNA methylation patterns are known to be different between colon tumor and adjacent normal tissue, and because we selected several target regions based on their differential DNA methylation in colon cancer.
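The cross-validation layout described above (six tumor-normal pairs, four for training and two for testing in each of three folds) can be sketched as follows. The sketch only builds the splits and keeps the tumor and normal samples of a pair together; how pairs were actually assigned to folds is not stated, so the round-robin assignment, the function name and the pair identifiers are assumptions. Classifier training is omitted.

```python
def paired_folds(pair_ids, n_folds=3):
    """Split tumor-normal pairs into folds without separating a pair.

    Returns a list of (training pairs, test pairs) tuples.
    """
    folds = []
    for k in range(n_folds):
        test = pair_ids[k::n_folds]  # every n_folds-th pair, starting at k
        train = [pair for pair in pair_ids if pair not in test]
        folds.append((train, test))
    return folds


# Hypothetical identifiers for the six tumor-normal pairs.
pairs = ["pair1", "pair2", "pair3", "pair4", "pair5", "pair6"]
folds = paired_folds(pairs)
```

Splitting by pair rather than by individual sample avoids having the tumor of one patient in the training set and the matched normal tissue in the test set.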
To simulate the complications of working with clinical samples of varying quality, we added noise to the data and assessed how the prediction performance was affected. Two types of noise were introduced (Online Methods): erroneous measurements were simulated by randomly replacing a fraction of DNA methylation measurements with other measurements (random error), and inaccurate measurements were simulated by adding random noise to each measurement (uniform noise) (Fig. 5a,b). Using the ROC area under curve to assess performance, we found that low noise levels were well tolerated by all assays, but at higher noise levels some of the absolute assays (AmpliconBS 1, AmpliconBS 2, EpiTyper 1, EpiTyper 2 and Pyroseq 1) outperformed the other assays.
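The two noise models can be illustrated with a short Python sketch. Details such as drawing replacement values from the same set of measurements, expressing the uniform noise level in percentage points and clipping results to the 0–100% range are assumptions of this sketch; the exact procedure is described in the Online Methods.

```python
import random


def add_random_error(values, fraction, rng):
    """Replace a fraction of measurements with other measurements."""
    noisy = list(values)
    n_replace = round(fraction * len(values))
    for index in rng.sample(range(len(values)), n_replace):
        noisy[index] = rng.choice(values)  # drawn from the same data
    return noisy


def add_uniform_noise(values, level, rng):
    """Add uniform noise of up to +/- `level` percentage points."""
    return [
        min(100.0, max(0.0, value + rng.uniform(-level, level)))
        for value in values
    ]


# Toy measurements (in %), invented for illustration only.
rng = random.Random(0)  # fixed seed so the example is reproducible
values = [0.0, 10.0, 50.0, 90.0, 100.0, 35.0, 65.0, 20.0]
with_errors = add_random_error(values, 0.25, rng)
with_noise = add_uniform_noise(values, 25.0, rng)
```

Random error leaves most measurements untouched but makes a few of them arbitrary, whereas uniform noise perturbs every measurement by a bounded amount.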
We also assessed the effect of reducing the number of genomic regions contributing to the analysis (Supplementary Fig. 14b). When we trained and evaluated each classifier on the one, three or five most discriminatory genomic regions at a constant level of 25% uniform noise (Fig. 5c,d), the prediction accuracy remained high for most assays (in some cases it even increased because the removal of less informative regions reduced noise in the data set). However, when we repeated the analysis with randomly sampled regions (Fig. 5c,d), we observed major differences: for many of those assays that performed well at high noise levels (Fig. 5b), the drop in performance was low, and satisfactory results were achieved using just three randomly selected regions. By contrast, other assays (in particular the relative DNA methylation assays) performed well only with combined data for many different genomic regions.
Finally, we selected two assays (EpiTyper 3 and Infinium) and tested their predictive performance on a much larger cohort comprising 160 prostate tumor and 8 normal prostate samples. The concordance between the two assays was high for nine of ten tested CpGs, with Pearson correlations of 0.75 to 0.93, whereas one CpG showed little biological variation within the cohort and a low correlation coefficient (Supplementary Fig. 15a). When we trained and evaluated support vector machines for distinguishing between tumor and normal samples, we observed higher accuracy using the EpiTyper data than using the Infinium data, indicating that the locus-specific assay outperforms the Infinium assay in terms of accuracy and discriminatory power (Supplementary Fig. 15b).
Discussion
We conducted a multicenter benchmarking study evaluating all DNA methylation assays that are strong candidates for clinical use. Most assays proved to be accurate and reproducible. The results also agreed well between laboratories and between technologies, which is notable because assay design (e.g., selection of primer sites and protocol parameters), execution (e.g., bisulfite conversion and sequencing) and data processing (e.g., normalization and quality control) were conducted independently by members of each contributing laboratory. Overall, our study demonstrates that locus-specific DNA methylation assays can be considered a mature technology ready for widespread use in biomarker development and clinical applications.
Despite generally consistent results, we observed characteristic strengths and weaknesses of the tested assays. The relative assays were generally less accurate and less concordant with each other than the absolute assays. This observation is not unexpected given that relative assays work best for detecting fully methylated regions, whereas many of the selected target regions were heterogeneously methylated. Despite their lower quantitative accuracy, the relative assays distinguished robustly between methylated and unmethylated regions, and they discriminated well between tumor and normal samples. Among the global assays, the HPLC-MS measurements most accurately reflected the expected differences in global DNA methylation levels, whereas the Immunoquant assay did not provide reliable results. Bisulfite pyrosequencing of repetitive DNA gave rise to highly reproducible results, but these repetitive DNA methylation levels did not correlate well with the expected differences in global DNA methylation. By contrast, good results were obtained when predicting global DNA methylation from locus-specific measurements, which may become a viable alternative to measuring global DNA methylation directly.
To capture not only the quantitative performance but also other relevant aspects of each assay, members of the contributing laboratories wrote detailed reports (Supplementary Data 1). These reports include protocol descriptions, comments on the practical strengths and limitations of each assay, and detailed time and cost calculations for running the assays in the respective laboratories. Drawing upon the cumulative experiences of our study, we arrive at the following conclusions and recommendations.
(i) Absolute DNA methylation assays are the method of choice when validating DNA methylation differences in large cohorts, and they are also an excellent technology for developing epigenetic biomarkers. (ii) Relative DNA methylation assays are not a good replacement for absolute assays. However, experiences of scientists in the contributing laboratories suggest that carefully selected, designed and validated relative assays can cost-effectively detect minimal traces of methylated DNA against an excess of unmethylated DNA. (iii) Global DNA methylation assays suffer from noisy data and divergent results between technologies. Locus-specific assays (possibly combined with prediction) provide a more robust alternative. (iv) Among the absolute DNA methylation assays, AmpliconBS and Pyroseq showed the best all-round performance, closely followed by EpiTyper. AmpliconBS is the best choice for assaying dozens of genomic regions in parallel, EpiTyper provides the highest sample throughput, and Pyroseq can work well even on minute amounts of highly fragmented DNA. (v) EnrichmentBS and Infinium can measure many more CpGs simultaneously than the other tested assays, but this comes at the cost of lower accuracy and higher cost per sample. (vi) Clonal bisulfite sequencing suffers from a high level of technical noise when sequencing 10−20 clones per sample. Given its high labor intensity and the availability of alternative assays with equal or better performance (as demonstrated in this study), clonal bisulfite sequencing is not recommended for large-scale validation and biomarker development.
We conclude that the accuracy and robustness, discriminatory power, cost structure and practical feasibility of current DNA methylation assays are sufficient for large-scale validation studies and epigenetic biomarker development. We expect that DNA methylation assays will become widely useful for clinical diagnostics and personalized therapies, as companion diagnostics of targeted drugs, in forensic testing of tissue types and in many other applications. Our study may serve as a starting point for broader standardization efforts involving academic and clinical laboratories as well as the commercial sector and regulatory agencies, to fully embrace the potential of DNA methylation biomarkers for precision medicine.
Preparation of reference DNA samples.
Six pairs of fresh-frozen colon tumor and adjacent normal colon tissue samples were obtained from the IDIBELL Tissue Biobank following approval by the corresponding ethics committee. For DNA extraction, frozen tumors were incubated overnight at 37 °C with DNA lysis buffer (10 mM Tris pH 8.0, 100 mM NaCl, 10 mM EDTA pH 8.0, 10% SDS) and proteinase K, and phenol-chloroform extraction was carried out.
KG1 (ref. 65) and KG1a (ref. 66) leukemia cell lines were obtained from the German Collection of Microorganisms and Cell Cultures (DSMZ). The cells were cultured in RPMI medium (Sigma-Aldrich) supplemented with 15% FCS (PAA) and 1% penicillin/streptomycin (Sigma-Aldrich) at 37 °C and 5% CO2. DNA demethylation was induced by treatment with 5-aza-2′-deoxycytidine (Sigma-Aldrich) at the half-maximal inhibitory concentration (IC50) determined by MTT viability tests for each cell line (80 nM for KG1 and 300 nM for KG1a). Stocks were prepared fresh for each experiment in 0.1% phosphate-buffered saline and kept at −20 °C until use. KG1 and KG1a cell lines were seeded in 75 cm² tissue culture flasks (Sarstedt) at a concentration of 2.5 × 10⁵ cells/ml 24 h before treatment. Cells were treated for three days at 24-h intervals. Every 24 h, the medium was changed by centrifuging the cells at 900 r.p.m. for 5 min and resuspending the cell pellet in fresh warm medium. Cells were allowed to recover for 1 h before addition of the drug. Cells were harvested by centrifugation 96 h after seeding, and the cell pellet was washed twice with 1% PBS.
For the first titration series, the REPLI-g Whole Genome Amplification kit (Qiagen) was used to produce unmethylated DNA following the manufacturer's protocol. Part of the whole genome amplified DNA was incompletely methylated in vitro using M.SssI (New England BioLabs) such that ~20% of all cytosines became methylated. For the second titration series, DNA was extracted from HCT-116 cells grown in vitro (as described below) and from whole blood of a healthy donor using GenElute Mammalian Genomic DNA Miniprep Kit (Sigma-Aldrich) according to the manufacturer's protocol.
For the comparison of fresh-frozen and FFPE material, xenografts were prepared as follows. Two colon cancer cell lines (HCT-15, derived from a Dukes' type C colorectal adenocarcinoma, and HCT-116, derived from a colorectal carcinoma, both obtained from ATCC) were grown in DMEM with 10% FBS supplemented with antibiotic (37 °C, 5% CO2). Cells were harvested, filtered and aliquoted in PBS, and two SCID mice were subcutaneously injected in the flank with 3 × 10⁶ tumor cells per animal and maintained over four weeks. Tumors were extracted, and each tumor was split in two. One part was formalin-fixed and embedded in paraffin, whereas the other part was frozen at −80 °C. For the FFPE samples, DNA was extracted using the QIAamp DNA FFPE Tissue kit (Qiagen) following the manufacturer's protocol. DNA was resuspended in TE buffer pH 7.5 and treated with RNase (Sigma-Aldrich) for 45 min at 37 °C. DNA from the fresh-frozen samples was extracted in the same way as for the primary colon tumors.
All reference DNA samples were quantified using Qubit 2.0 (Invitrogen) and quality-checked by gel electrophoresis. Homogeneous aliquots of equal volume corresponding to a target DNA amount of 1 μg were prepared for all reference samples and shipped on dry ice to the contributing laboratories. Scientists in each laboratory confirmed DNA quality by gel electrophoresis and DNA amounts by one of four alternative technologies (Qubit, Quant-iT, NanoDrop, TapeStation), and documented the results in the analysis reports.
Selection of target genomic regions.
To select informative and challenging target regions, we initially ran the Infinium 450k assay on all reference samples excluding the titration series and the FFPE samples. The resulting data were preprocessed and analyzed with RnBeads67, and we selected 1,072 genomic regions (corresponding to 16 core regions, 32 additional regions, and 1,024 further regions for the EnrichmentBS assays) from the list of Infinium probes that passed quality control. Each of the regions was 122 base pairs wide, with the designated target CpG at the center and a window of 60 base pairs in both directions. 25% of these regions were selected based on differential DNA methylation for colon tumor vs. normal, 25% based on differential DNA methylation for drug treatment vs. control, 25% based on differential DNA methylation between the colon cancer cell lines and whole blood, and 25% were randomly selected. Within each block of differentially methylated CpGs we selected: (i) a subset of the most strongly differentially methylated CpGs (ranks 1, 2, 3, 5, 7, 10, 13, 17, 21 and 25); (ii) the 1% highest and lowest regions in terms of repetitive DNA content, GC content, CpG content, CpA content and CpG observed vs. expected ratio; and (iii) the 1% CpGs with DNA methylation values in the first group that were closest to 0%, 25%, 50%, 75% and 100%, respectively. We also included CpGs in genomic regions that had previously been described as epigenetic biomarkers for colon cancer68. From this list of 1,072 genomic regions, we manually selected 16 core regions and 32 additional regions such that the selected regions capture as many as possible of the technical challenges in assay design and analysis with locus-specific DNA methylation assays. One of the core regions (region 11) was chosen such that the target CpG overlapped with a common SNP, in order to provide an additional challenge during assay design.
Furthermore, some of the remaining 1,024 genomic regions were covered by the two assays that easily scale to hundreds of genome regions (EnrichmentBS 1 and EnrichmentBS 2). These measurements are available from Supplementary Data 3, but for reasons of comparability across assays they were not included in the benchmarking.
Documentation of DNA methylation assays, results and analysis workflows.
The details for all contributed DNA methylation assays are available in Supplementary Data 1. These reports include a short assay summary, quality control data for the received reference DNA samples, and detailed descriptions of the design and execution of each contributed assay. They follow the standardized reporting template from the information package that was sent to all contributing laboratories (Supplementary Data 2). DNA methylation measurements for each assay, genomic region and reference sample are available in Supplementary Data 3. Illumina 450k microarray data are available at the NCBI Gene Expression Omnibus under the accession number GSE77965. Finally, the source code (written in R) underlying the bioinformatic analysis is available in a public repository (http://biomarker-benchmark.computational-epigenetics.org/), to foster transparency and reuse in the spirit of open science and reproducible research69.
Bioinformatic analysis of absolute DNA methylation assays.
To quantify assay performance without a priori knowledge of the true DNA methylation values in the reference samples, we defined target DNA methylation values by consensus. The consensus corridor was calculated as the narrowest interval containing measurements from three different technologies, extended by an additional flanking region of five percentage points in both directions. We chose this corridor (rather than, e.g., the arithmetic or geometric mean between all measurements) to minimize bias toward overrepresented assays. Based on the consensus corridor we calculated the absolute difference $|d_{a,r,s}|$ between the DNA methylation value $v_{a,r,s}$ (as measured by an assay $a$ for a region $r$ in a sample $s$) and the closest boundary $c_{r,s}$ of the corresponding consensus corridor: $|d_{a,r,s}| = |v_{a,r,s} - c_{r,s}|$. According to this metric, assays with smaller absolute differences are in better agreement with the consensus. To assess whether certain assays tend to systematically overestimate or underestimate DNA methylation values, we also calculated the directional deviation $b_a$ as the mean of all differences:

$$b_a = \frac{1}{N_a} \sum_{r,s} d_{a,r,s}, \qquad d_{a,r,s} = v_{a,r,s} - c_{r,s},$$

where $N_a$ is the number of region/sample combinations measured by assay $a$.
Finally, we performed a sensitivity analysis for the consensus corridor (Supplementary Note and Supplementary Fig. 4), which confirmed that our choice of parameters (i.e., using three different technologies and flanking regions of five percentage points to constitute the consensus corridor) was appropriate for robustly ranking the assays by their performance.
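The consensus corridor and the deviation metrics described above can be illustrated with a minimal Python sketch. It is not the study's R code: the function names, the example measurements and the convention that values inside the corridor have a deviation of zero are assumptions made here for illustration.

```python
def consensus_corridor(measurements, n_tech=3, flank=5.0):
    """Return (low, high) for one region/sample, or None if too few technologies.

    measurements: list of (technology_name, methylation_percent) tuples.
    """
    points = sorted(measurements, key=lambda m: m[1])
    best = None
    for i in range(len(points)):
        techs = set()
        for j in range(i, len(points)):
            techs.add(points[j][0])
            if len(techs) >= n_tech:
                low, high = points[i][1], points[j][1]
                if best is None or high - low < best[1] - best[0]:
                    best = (low, high)
                break  # extending further can only widen the interval
    if best is None:
        return None
    # Extend the narrowest interval by the flanking region on both sides.
    return (best[0] - flank, best[1] + flank)


def signed_deviation(value, corridor):
    """Signed distance to the closest corridor boundary.

    Assumption: measurements inside the corridor count as zero deviation.
    """
    low, high = corridor
    if value < low:
        return value - low
    if value > high:
        return value - high
    return 0.0


def assay_bias(deviations):
    """Directional deviation of one assay: mean of its signed deviations."""
    return sum(deviations) / len(deviations)


# Hypothetical measurements for a single region in a single sample.
example = [("AmpliconBS", 40.0), ("AmpliconBS", 41.0), ("Pyroseq", 42.0),
           ("EpiTyper", 45.0), ("Infinium", 60.0)]
corridor = consensus_corridor(example)
deviations = [signed_deviation(v, corridor) for _, v in example]
print(corridor, deviations, assay_bias(deviations))
```

Requiring distinct technologies rather than distinct assays is what keeps an overrepresented technology from defining the corridor on its own.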
We also quantified the absolute assay performance in an alternative way, which does not rely on any consensus values but makes use of the two titration series. The DNA methylation values in both titration series are expected to be proportional to the titration ratios, which are known. In contrast, the DNA methylation values at the two extreme points of the titration series are different between regions and a priori unknown. Therefore, as outlined in Supplementary Figure 7, we first calculated the difference between the median of the consensus corridors for each titration series and each region at the 0% and 100% titration ratios. We then removed all regions that did not change by at least five percentage points to focus the analysis on regions with a clear-cut change in DNA methylation over the titration series. Next, regions with a negative change between the 0% and 100% consensus values were inverted by subtracting their measured DNA methylation value from the maximum corresponding to complete DNA methylation. This procedure reversed directionality for the particular region and therefore standardized the direction across all regions. Finally, we adjusted for different offsets of DNA methylation levels by fitting a linear model to the values of each region and then subtracting the linear model offset (intercept) from these values. Using the adjusted DNA methylation values we then evaluated the Pearson correlation of the measured values to the titration ratios, which is the titration-based estimate of the correct value. To evaluate how well the assays captured the linearity of the DNA methylation values along the titration series, we also fitted a second intercept-free linear model to the adjusted DNA methylation values across all regions and samples, and we recorded the adjusted r² and residual standard error of the fitted model. Assays with higher adjusted r² values and lower residual standard error were considered in better agreement with the expectation that was based on the known titration ratios.
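The titration-based procedure (filtering, inversion, offset adjustment, correlation and intercept-free fit) can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions, not the authors' R code: all function names are hypothetical, the example values are invented, and the adjusted r² of the intercept-free model is computed with the uncentered total sum of squares, which is assumed here to mirror the convention of R's `lm` for models without an intercept.

```python
from math import sqrt


def linear_fit(x, y):
    """Ordinary least squares with intercept; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def adjust_region(ratios, values, consensus_change, min_change=5.0, max_meth=100.0):
    """Standardize one region; returns adjusted values or None if filtered out."""
    if abs(consensus_change) < min_change:
        return None  # no clear-cut change over the titration series
    if consensus_change < 0:
        values = [max_meth - v for v in values]  # reverse directionality
    _, intercept = linear_fit(ratios, values)
    return [v - intercept for v in values]  # remove the region-specific offset


def origin_fit(x, y):
    """Intercept-free fit; returns (slope, adjusted r2, residual standard error)."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    rss = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    df = len(x) - 1  # one fitted parameter
    tss = sum(b * b for b in y)  # uncentered, as there is no intercept
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * len(x) / df
    return slope, adj_r2, sqrt(rss / df)


# Hypothetical region whose methylation decreases along the titration series.
ratios = [0.0, 25.0, 50.0, 75.0, 100.0]
measured = [90.0, 70.0, 50.0, 30.0, 10.0]
adjusted = adjust_region(ratios, measured, consensus_change=-80.0)
print(pearson(ratios, adjusted), origin_fit(ratios, adjusted))
```

In the study the intercept-free model was fitted across all regions and samples of an assay; the sketch applies it to a single region only to keep the example short.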
Bioinformatic analysis of relative DNA methylation assays.
We compared the relative assays among each other by calculating pairwise 3-by-3 contingency tables for the differences between each pair of samples recorded by each assay. Measurements that agreed on the direction of change in both assays appear on the diagonal of the contingency table, and the higher the percentage of measurements on the diagonal, the more concordant both assays are. We formalize the agreement between assays as a numeric value, the percent concordance:

$$\text{concordance} = 100 \times \frac{\sum_{i=1}^{3} c_{i,i}}{N}$$

where $c_{i,j}$ is the value in the $i$th row and $j$th column of the contingency table and $N = \sum_{i=1}^{3} \sum_{j=1}^{3} c_{i,j}$ is the total number of entries in the table.
This approach readily generalizes to the absolute assays, where we considered samples with an absolute difference of less than five percentage points as concordant.
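As a minimal sketch of this concordance measure, the following Python code builds the 3-by-3 contingency table (decrease, no change, increase) for two assays and computes the percentage of sample pairs on its diagonal. The function names and example values are hypothetical, and the `tol` parameter is one possible reading of the five-percentage-point rule for absolute assays, namely that smaller differences are treated as no change.

```python
from itertools import combinations


def direction(diff, tol=0.0):
    """Classify a difference as -1 (decrease), 0 (no change) or +1 (increase)."""
    if diff > tol:
        return 1
    if diff < -tol:
        return -1
    return 0


def contingency_table(assay_a, assay_b, tol=0.0):
    """3-by-3 table of directions of change over all pairs of shared samples.

    assay_a, assay_b: dicts mapping sample name to measurement for one region.
    """
    table = [[0] * 3 for _ in range(3)]
    shared = sorted(set(assay_a) & set(assay_b))
    for s1, s2 in combinations(shared, 2):
        row = direction(assay_a[s1] - assay_a[s2], tol) + 1
        col = direction(assay_b[s1] - assay_b[s2], tol) + 1
        table[row][col] += 1
    return table


def percent_concordance(table):
    """Percentage of sample pairs on the diagonal of the contingency table."""
    total = sum(sum(row) for row in table)
    diagonal = sum(table[i][i] for i in range(3))
    return 100.0 * diagonal / total


# Hypothetical measurements of one region in four samples by two assays.
a = {"A": 10.0, "B": 50.0, "C": 50.0, "D": 90.0}
b = {"A": 20.0, "B": 40.0, "C": 60.0, "D": 80.0}
print(percent_concordance(contingency_table(a, b)))
```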
In a separate and complementary analysis, we evaluated the ability of the relative assays to detect the correct direction of change between any two samples by using the median of the three DNA methylation values spanning the previously defined consensus corridor as reference. For each pair of samples, we determined the target direction and magnitude of change as the difference between the two median values, and we checked for each relative assay whether the difference between the corresponding measurements had the same or opposite direction of change. If no difference was detected in the relative assays, this was also recorded. The differences in the medians were divided into four bins: marginal change (absolute difference below five percentage points), small change (5−25 percentage points), medium change (25−50 percentage points), and strong change (above 50 percentage points). Finally, we also evaluated the relative assays based on the titration series, including only those regions with a difference above five percentage points between the two extreme points according to the consensus corridor. Results were regarded as consistent with the titration series if the direction of change observed for the relative assay was the same as the direction of the change in the titration ratio, taking into account the two extreme points according to the consensus corridor.
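The binning and direction check can be sketched in a few lines of Python. The function names are hypothetical, and the handling of values that fall exactly on a bin boundary (25 and 50 percentage points) is an assumption, since the text does not specify it.

```python
from statistics import median


def reference_change(corridor_values_1, corridor_values_2):
    """Target change between two samples: difference of the corridor medians."""
    return median(corridor_values_1) - median(corridor_values_2)


def change_bin(reference_delta):
    """Assign the reference change to one of the four magnitude bins."""
    magnitude = abs(reference_delta)
    if magnitude < 5:
        return "marginal"
    if magnitude < 25:
        return "small"
    if magnitude <= 50:  # assumption: exactly 50 counts as medium
        return "medium"
    return "strong"


def direction_call(reference_delta, assay_delta):
    """Compare the direction reported by a relative assay with the reference."""
    if assay_delta == 0:
        return "no difference"
    if reference_delta == 0:
        return "undefined"  # the reference itself shows no change
    same = (reference_delta > 0) == (assay_delta > 0)
    return "same" if same else "opposite"


# Hypothetical corridor-spanning values for two samples and one relative assay.
delta = reference_change([70.0, 72.0, 75.0], [38.0, 40.0, 41.0])
print(delta, change_bin(delta), direction_call(delta, assay_delta=1.0))
```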
Correlation between input DNA amounts and assay performance.
Two alternative approaches were used to assess the effect of DNA amounts on assay performance (Supplementary Note). First, owing to normal variation in the extracted DNA quality/quantity and in the initial quantification, the DNA amounts varied slightly between reference samples, e.g., ranging from 875 ng to 1,843 ng in the primary tumor/normal samples (Supplementary Fig. 10a). Each laboratory was provided with the exact same volume of homogeneous aliquots for these samples, such that these differences between samples did not result in differences between laboratories. To correlate input DNA amounts with assay performance, we fitted a linear model predicting the deviation from the consensus corridor for each sample and assay using two alternative measures of input DNA amounts: the first value based on the median of concentration measurements across all laboratories multiplied by the volume of DNA used for a given assay, and the second value based on the DNA amounts that each research group reported to have used according to their own concentration measurements. For each assay and each of the two measurements of DNA amount, P values were calculated with linear models and adjusted for multiple testing using the Benjamini-Hochberg method. We used an adjusted P-value threshold of 0.05 to call assays significantly influenced by DNA amount, but no associations were significant at this level.
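For readers unfamiliar with the multiple-testing correction used here, the Benjamini-Hochberg adjustment can be written as a short, self-contained Python function. The function name and the example P values are illustrative only and are not taken from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value and enforce monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Invented per-assay P values for the association with input DNA amount.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

With these invented values only the smallest P value stays below the 0.05 threshold after adjustment, which shows how several nominally significant associations can disappear once multiple testing is taken into account.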
Second, to assess the impact of DNA amounts in a much lower range (0.3 ng to 100 ng), we established a titration series with target DNA amounts of 100 ng, 30 ng, 10 ng, 3 ng, 1 ng and 0.3 ng from one of the tumor reference samples (CRC 2 tumor). To exclude differences owing to variation among different bisulfite conversion protocols and reagents70, these samples were bisulfite-converted centrally, using the EZ DNA Methylation-Direct Kit (Zymo Research, D5020) with the following deviations from the manufacturer's protocol: conversion reagent was applied at 0.9× concentration, reactions were incubated for 20 cycles of 1 min at 95 °C and 10 min at 60 °C, and the desulphonation time was extended to 30 min. The converted DNA was shipped on dry ice to nine laboratories that repeated their assays on these samples. We also analyzed the impact of reductions in DNA quality by fragmenting DNA from one of the tumor reference samples (CRC 1 tumor) to an average fragment length of 200 base pairs. To that end, batches of 600 ng DNA were digested with NEBNext dsDNA Fragmentase (New England BioLabs, M0348L) for exactly 60 min at 37 °C, and the fragmentation reactions were stopped by addition of 5 μl of 0.5 M EDTA stop solution. The fragmented batches were combined, titrated to the same amounts as above, bisulfite-converted and shipped to the contributing laboratories.
Prediction of global DNA methylation levels.
The global DNA methylation assays give rise to one single value per sample, which made it possible to plot all data points into one diagram (Fig. 4a) and to assess the overall consistency of the results by visual inspection. In addition, we explored whether we could predict global DNA methylation values from the results of the locus-specific DNA methylation assays, either by using the mean or median of the DNA methylation levels or by more complex machine learning methods such as generalized linear models, support vector regression (linear and polynomial kernels) and random forest regression. To compensate for the fact that not all assays were run on all samples, we first imputed missing values by filling in the values of the most closely related other assay based on Pearson correlation. We trained the regression models using leave-one-out cross-validation to make optimal use of the limited data set. For each method and each analysis, we recorded the root mean square error (RMSE) between the prediction and the target value. As no single global assay gave fully consistent results, we chose as global target the mean of the two best-performing assays (HPLC-MS and Pyroseq NBL2), and we replaced the four mean values that were inconsistent with the known change in concentration in the titration series by imputed values that were calculated as the mean of the two neighboring values in the titration series. The e1071 R package was used for support vector regression, randomForest for random forest regression and DMwR for cross-validation.
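The evaluation scheme (leave-one-out cross-validation scored by RMSE) can be illustrated with the simplest of the predictors mentioned above, a linear model on the mean locus-specific DNA methylation level of each sample. This Python sketch is a stand-in for the R packages named in the text; the function names and the toy data are hypothetical, and the support vector and random forest models are not reproduced.

```python
from math import sqrt
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares with intercept; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def loocv_rmse(features, targets):
    """Leave-one-out cross-validated RMSE of a one-feature linear model."""
    squared_errors = []
    for i in range(len(targets)):
        # Train on all samples except sample i, then predict sample i.
        train_x = features[:i] + features[i + 1:]
        train_y = targets[:i] + targets[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * features[i] + intercept
        squared_errors.append((prediction - targets[i]) ** 2)
    return sqrt(mean(squared_errors))


# Toy data: locus-specific measurements (rows are samples) and global targets.
locus_values = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0],
                [70.0, 80.0, 90.0], [20.0, 40.0, 60.0]]
global_targets = [2.1, 3.4, 5.0, 3.1]
sample_means = [mean(row) for row in locus_values]
print(loocv_rmse(sample_means, global_targets))
```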
Analysis of discriminatory power.
We trained linear support vector machines using patient-stratified cross-validation, such that each prediction used four tumor/normal pairs for training and left two pairs out for test-set validation. The e1071 R package was used to train the classifiers and the ROCR package71 to calculate the ROC area under curve as the main performance metric. We further examined the robustness of the classifiers in presence of two different error models: (i) random error and (ii) uniform noise.
Random error. We simulated faulty measurements by replacing a defined fraction of measurements by random numbers drawn from the pool of all measurements of that assay. In this manner, we ensured that the simulated erroneous measurements were drawn from the same distribution as the correct measurements without making assumptions about the statistical distribution of the data.
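A minimal Python sketch of this random-error model is shown below. The function name, the `seed` parameter and the example values are assumptions for illustration; the study's own implementation was written in R and may differ in detail, for example in whether a replacement can coincide with the original value.

```python
import random


def add_random_error(values, fraction, seed=None):
    """Replace a fraction of measurements by draws from the pool of all values."""
    rng = random.Random(seed)
    n_replace = round(fraction * len(values))
    corrupted = list(values)
    # Choose which positions become faulty, then resample them from the pool.
    for index in rng.sample(range(len(values)), n_replace):
        corrupted[index] = rng.choice(values)
    return corrupted


# Hypothetical measurements of one assay, with 30% simulated faulty values.
measurements = [5.0, 12.0, 33.0, 47.0, 58.0, 61.0, 70.0, 82.0, 90.0, 96.0]
print(add_random_error(measurements, fraction=0.3, seed=1))
```

Because replacements are drawn from the assay's own measurements, the corrupted data follow the same empirical distribution as the original data, as described above.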
Uniform noise. We simulated inaccurate measurements by adding a random number to each measurement. At any given noise level n, this random number was sampled uniformly from the interval [−n × r; n × r], where r is the range spanned by all DNA methylation values for the same assay. To assess the prediction performance, we tested each classifier in a stratified threefold cross-validation: for each error model, noise/error level, assay, and selection of training and test set, we performed 1,000 repetitions of the analysis with randomized noise/error. To assess the robustness toward fewer measurements, we repeated the analysis with 25% uniform noise after removing the majority of regions from the training and test sets. The choice of regions retained (either 1, 3 or 5) was either entirely random or guided by the information content of each region for the prediction. We calculated the information content separately for each assay and region as the F score72. As before, we performed patient-stratified cross-validation with random repetitions. Finally, we analyzed a much larger cohort with 160 primary prostate tumor samples and 8 nonmatched normal prostate samples, comparing the EpiTyper 3 and Infinium assays with each other in terms of their correlation and discriminatory power.
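The uniform-noise model can be sketched in Python as follows. The function name, the `seed` parameter and the example values are hypothetical, and the sketch does not clip noisy values to the 0 to 100% range because the text does not state whether this was done.

```python
import random


def add_uniform_noise(values, noise_level, seed=None):
    """Add noise drawn uniformly from [-n * r, n * r] to every measurement.

    n is the noise level and r the range spanned by all values of the assay.
    """
    rng = random.Random(seed)
    value_range = max(values) - min(values)
    bound = noise_level * value_range
    return [v + rng.uniform(-bound, bound) for v in values]


# Hypothetical measurements of one assay, perturbed with 25% uniform noise.
measurements = [5.0, 12.0, 33.0, 47.0, 58.0, 61.0, 70.0, 82.0, 90.0, 96.0]
print(add_uniform_noise(measurements, noise_level=0.25, seed=1))
```

Scaling the noise by the range of each assay keeps the noise levels comparable between assays that report values on different scales.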
- A decade of exploring the cancer epigenome - biological and translational implications. Nat. Rev. Cancer 11, 726–734 (2011).
- Epigenetic modifications and human disease. Nat. Biotechnol. 28, 1057–1068 (2010).
- Epigenetics and the environment: emerging patterns and implications. Nat. Rev. Genet. 13, 97–109 (2011).
- Epigenetic regulation of ageing: linking environmental inputs to genomic stability. Nat. Rev. Mol. Cell Biol. 16, 593–610 (2015).
- Stability and flexibility of epigenetic gene regulation in mammalian development. Nature 447, 425–432 (2007).
- The power and the promise of DNA methylation markers. Nat. Rev. Cancer 3, 253–266 (2003).
- Epigenetic biomarker development. Epigenomics 1, 99–110 (2009).
- Validation of ZAP-70 methylation and its relative significance in predicting outcome in chronic lymphocytic leukemia. Blood 124, 42–48 (2014).
- Circulating methylated SEPT9 DNA in plasma is a biomarker for colorectal cancer. Clin. Chem. 55, 1337–1346 (2009).
- TFAP2E-DKK4 and chemoresistance in colorectal cancer. N. Engl. J. Med. 366, 44–53 (2012).
- A DNA methylation fingerprint of 1628 human samples. Genome Res. 22, 407–419 (2012).
- MGMT gene silencing and benefit from temozolomide in glioblastoma. N. Engl. J. Med. 352, 997–1003 (2005).
- PRIMe consortium. Methylated Glutathione S-transferase 1 (mGSTP1) is a potential plasma free DNA epigenetic marker of prognosis and response to chemotherapy in castrate-resistant prostate cancer. Br. J. Cancer 111, 1802–1809 (2014).
- A B-cell epigenetic signature defines three biologic subgroups of chronic lymphocytic leukemia with clinical impact. Leukemia 29, 598–605 (2015).
- Fetal-specific DNA methylation ratio permits noninvasive prenatal diagnosis of trisomy 21. Nat. Med. 17, 510–513 (2011).
- Reference Maps of human ES and iPS cell variation enable high-throughput characterization of pluripotent cell lines. Cell 144, 439–452 (2011).
- Identification of body fluid-specific DNA methylation markers for use in forensic science. Forensic Sci. Int. Genet. 13, 147–153 (2014).
- Isolation and identification of age-related DNA methylation markers for forensic age-prediction. Forensic Sci. Int. Genet. 11, 117–125 (2014).
- Analysing and interpreting DNA methylation data. Nat. Rev. Genet. 13, 705–719 (2012).
- Principles and challenges of genomewide DNA methylation analysis. Nat. Rev. Genet. 11, 191–203 (2010).
- Epigenome-wide association studies for common human diseases. Nat. Rev. Genet. 12, 529–541 (2011).
- Recommendations for the design and analysis of epigenome-wide association studies. Nat. Methods 10, 949–955 (2013).
- PCR-based methods for detecting single-locus DNA methylation biomarkers in cancer diagnostics, prognostics, and response to treatment. Clin. Chem. 55, 1471–1483 (2009).
- DNA methylation biomarkers in cancer: progress towards clinical implementation. Expert Rev. Mol. Diagn. 12, 473–487 (2012).
- Deciphering the epigenetic code: an overview of DNA methylation analysis methods. Antioxid. Redox Signal. 18, 1972–1986 (2013).
- BLUEPRINT to decode the epigenetic signature written in blood. Nat. Biotechnol. 30, 224–226 (2012).
- Tackling the epigenome: challenges and opportunities for collaboration. Nat. Biotechnol. 28, 1039–1044 (2010).
- Taking the measure of the methylome. Nat. Biotechnol. 28, 1026–1028 (2010).
- Quantitative comparison of genome-wide DNA methylation mapping technologies. Nat. Biotechnol. 28, 1106–1114 (2010).
- Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat. Biotechnol. 28, 1097–1105 (2010).
- Bi-PROF: bisulfite profiling of target regions using 454 GS FLX Titanium technology. Epigenetics 8, 765–771 (2013).
- Focused, high accuracy 5-methylcytosine quantitation with base resolution by benchtop next-generation sequencing. Epigenetics Chromatin 6, 33 (2013).
- Bisulfite Patch PCR enables multiplexed sequencing of promoter methylation across cancer samples. Genome Res. 20, 1279–1287 (2010).
- Library-free methylation sequencing with bisulfite padlock probes. Nat. Methods 9, 270–272 (2012).
- Assessment of RainDrop BS-seq as a method for large-scale, targeted bisulfite sequencing. Epigenetics 9, 678–684 (2014).
- Genomic profiling of CpG methylation and allelic specificity using quantitative high-throughput mass spectrometry: critical evaluation and improvements. Nucleic Acids Res. 35, e119 (2007).
- Quantitative high-throughput analysis of DNA methylation patterns by base-specific cleavage and mass spectrometry. Proc. Natl. Acad. Sci. USA 102, 15785–15790 (2005).
- Cytosine methylation profiling of cancer cell lines. Proc. Natl. Acad. Sci. USA 105, 4844–4849 (2008).
- A sequencing method based on real-time pyrophosphate. Science 281, 363–365 (1998).
- Analysis and quantification of multiple methylation variable positions in CpG islands by Pyrosequencing. Biotechniques 35, 152–156 (2003).
- DNA methylation analysis by pyrosequencing. Nat. Protoc. 2, 2265–2275 (2007).
- MethyLight. Methods Mol. Biol. 507, 325–337 (2009).
- MethyLight: a high-throughput assay to measure DNA methylation. Nucleic Acids Res. 28, E32 (2000).
- Profiling DNA methylation by melting analysis. Methods 27, 121–127 (2002).
- Methylation-sensitive high-resolution melting. Nat. Protoc. 3, 1903–1908 (2008).
- In-tube DNA methylation profiling by fluorescence melting curve analysis. Clin. Chem. 47, 1183–1189 (2001).
- Methylation-specific PCR: a novel PCR assay for methylation status of CpG islands. Proc. Natl. Acad. Sci. USA 93, 9821–9826 (1996).
- DNA methylation profiling defines clinically relevant biological subsets of non-small cell lung cancer. Clin. Cancer Res. 18, 2360–2373 (2012).
- DNA hypomethylation in cancer cells. Epigenomics 1, 239–259 (2009).
- LINE-1 methylation in plasma DNA as a biomarker of activity of DNA methylation inhibitors in patients with solid tumors. Epigenetics 4, 176–184 (2009).
- Quantification of global DNA methylation by capillary electrophoresis and mass spectrometry. Methods Mol. Biol. 507, 23–34 (2009).
- Quantitative measurement of genome-wide DNA methylation by a reliable and cost-efficient enzyme-linked immunosorbent assay technique. Anal. Biochem. 422, 74–78 (2012).
- Changes in DNA methylation patterns in subjects exposed to low-dose benzene. Cancer Res. 67, 876–880 (2007).
- Changes in DNA methylation of tandem DNA repeats are different from interspersed repeats in cancer. Int. J. Cancer 125, 723–729 (2009).
- Hypomethylation of LINE-1, and not centromeric SAT-α, is associated with centromeric instability in head and neck squamous cell carcinoma. Cell Oncol. (Dordr.) 35, 259–267 (2012).
- A simple method for estimating global DNA methylation using bisulfite PCR of repetitive DNA elements. Nucleic Acids Res. 32, e38 (2004).
- Conventional and nanotechniques for DNA methylation profiling. J. Mol. Diagn. 15, 17–26 (2013).
- Genome-wide DNA methylation profiling using Infinium® assay. Epigenomics 1, 177–200 (2009).
- High sensitivity mapping of methylated cytosines. Nucleic Acids Res. 22, 2990–2997 (1994).
- DNA methylation: bisulphite modification and analysis. Nat. Protoc. 1, 2353–2364 (2006).
- Oxidative bisulfite sequencing of 5-methylcytosine and 5-hydroxymethylcytosine. Nat. Protoc. 8, 1841–1851 (2013).
- Chemical modification-assisted bisulfite sequencing (CAB-Seq) for 5-carboxylcytosine detection in DNA. J. Am. Chem. Soc. 135, 9315–9317 (2013).
- Genome-wide profiling of 5-formylcytosine reveals its roles in epigenetic priming. Cell 153, 678–691 (2013).
- The implications of heterogeneous DNA methylation for the accurate quantification of methylation. Epigenomics 2, 561–573 (2010).
- Acute myelogenous leukemia: a human cell line responsive to colony-stimulating activity. Science 200, 1153–1154 (1978).
- An undifferentiated variant derived from the human acute myelogenous leukemia cell line (KG-1). Blood 56, 265–273 (1980).
- Comprehensive analysis of DNA methylation data with RnBeads. Nat. Methods 11, 1138–1140 (2014).
- DNA methylation profiling in the clinic: applications and challenges. Nat. Rev. Genet. 13, 679–692 (2012).
- Statistical analyses and reproducible research. Bioconductor project working papers. Working paper 2. http://biostats.bepress.com/bioconductor/paper2 (2004).
- Performance evaluation of kits for bisulfite-conversion of DNA from tissues, cell lines, FFPE tissues, aspirates, lavages, effusions, plasma, serum, and urine. PLoS One 9, e93933 (2014).
- ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940–3941 (2005).
- Feature ranking using linear SVM. JMLR Workshop Conf. Proc. 3, 53–64 (2008).
We thank J. Hadler, T. Penz, C. Lo Porto, B. Schmitt, M. Bähr, M. Helf, O. Mücke, N. Mazaleyrat, C.V. Wong, T.E. Kjeldsen, A. Janosch, P. Dhami, E. Flores and H. Gohlke for technical assistance, and H. Stunnenberg as well as the scientific advisory board of BLUEPRINT for their advice and support. This work was performed in the context of the BLUEPRINT project (European Union's Seventh Framework Programme grant agreement 282510), which funded the study logistics and the integrative data analysis. The assay costs were paid by the contributing laboratories using institutional funds and the following grants: BBSRC BB/G020930/1, BBSRC BB/G020930/1, BMBF 01KU1001A, BMBF 01KU1002A, BMBF 01KU1216F, EU-FP7 282510, FWF I 1575-B19, NHMRC 1063559, NHMRC 1088144 and the DKFZ Graduate School.
- Supplementary Figure 1: Summary of contributed assay data (480 KB)
For each assay and reference sample, the table shows the number of genomic regions for which DNA methylation measurements were submitted. Sixteen regions had been designated as mandatory, and each contributing research group attempted to measure DNA methylation for these core regions. One core region was later discarded from the analysis because a deliberately included SNP was detected by only about half of the research groups and would have biased the benchmarking. As a result, the maximum number of core regions listed in this summary is 15. Light colors indicate cases where DNA methylation measurements could not be obtained for all of these 15 core regions, typically because of failed assay design or because of technical problems running the assay. For the 32 additional regions, it was at the discretion of the contributing research groups how many they were able to include in their experiments.
- Supplementary Figure 2: Pairwise comparison of measurements for the absolute assays (563 KB)
Scatterplots comparing the measurements for all pairs of absolute DNA methylation assays. Each dot corresponds to one region in one sample. The numbers (r) in the top-right triangle are Pearson correlation coefficients.
- Supplementary Figure 3: Consensus corridor for estimating true DNA methylation levels (663 KB)
(a) Distribution of DNA methylation measurements obtained with 16 absolute DNA methylation assays for genomic regions (sub-panels) and sample types (y axis). Colors indicate assay technologies as defined in Figure 1b. Gray boxes denote the corresponding consensus corridors, which are defined as the smallest corridor spanned by three technologically different assays extended by five percentage points to either side. (b) Illustration of which assays (rows) contributed to which consensus corridors (region/sample combinations, columns). (c) Number of consensus corridors that each type of assay contributed to (solid color), compared to the total number of measurements available for the respective assays (total bar height).
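The corridor definition in panel a can be sketched in a few lines of Python. This is a minimal illustration rather than the study's own code: the function name, the input layout (a list of technology/value pairs for one region in one sample) and the example values are assumptions made here. It searches for the narrowest interval containing measurements from at least three different technologies and then widens it by five percentage points on each side.

```python
def consensus_corridor(measurements, min_technologies=3, flank=5.0):
    """Return (lower, upper) of the consensus corridor, or None.

    measurements: list of (technology, methylation_percent) tuples for one
    region/sample combination (hypothetical input layout).
    """
    points = sorted(measurements, key=lambda m: m[1])
    best = None
    for i in range(len(points)):
        seen = set()
        for j in range(i, len(points)):
            seen.add(points[j][0])
            if len(seen) >= min_technologies:
                # Narrowest window starting at i that spans enough technologies
                width = points[j][1] - points[i][1]
                if best is None or width < best[1] - best[0]:
                    best = (points[i][1], points[j][1])
                break
    if best is None:
        return None  # fewer than three technologies available
    return (best[0] - flank, best[1] + flank)


# Illustrative values only, not data from the study
example = [
    ("EnrichmentBS", 10.0),
    ("AmpliconBS", 40.0),
    ("AmpliconBS", 42.0),
    ("Pyroseq", 45.0),
    ("EpiTyper", 47.0),
    ("Infinium", 80.0),
]
print(consensus_corridor(example))  # (37.0, 52.0)
```

Two assays of the same technology do not count twice, so outliers from a single platform cannot define the corridor on their own. Whether the corridor should be clipped to the 0–100% range is not stated in the caption, so the sketch leaves it unclipped.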
- Supplementary Figure 4: Sensitivity analysis for the consensus corridor (376 KB)
(a) Line plots showing the average deviation from the consensus corridor for each assay, given different choices of the corridor parameters. The order of assays corresponds to the rank order obtained with default parameters (indicated by the red line). (b) Plots showing the average deviation from the consensus corridor for Latin hypercube sampling of the parameter space. (c) Plots showing the Spearman correlation between the rank order of corridor deviations for the sampled parameter sets compared to the deviations with the default parameters used in the main analysis (indicated with a red X-mark). (d) Best-performing and worst-performing assays for a broad spectrum of parameter sets, shown as colored tiles.
- Supplementary Figure 5: Impact of genomic region characteristics on assay performance (258 KB)
Linear models were fitted to predict the absolute deviation from the consensus corridor based on the following characteristics of the target genomic regions: estimated DNA methylation level (based on the consensus corridor), GC content, CpG observed/expected ratio and repetitive DNA content. The resulting P-values (y axis) were corrected for multiple testing using the Benjamini-Hochberg method and transformed such that positive values denote a direct relationship between the region characteristic and the absolute deviation, whereas negative values denote an inverse relationship. For each plot, the most significantly affected assay is marked with an asterisk (*), and scatterplots (gray boxes on the right) show the numeric value of the genomic region characteristic (x axis) plotted against the corresponding absolute deviation from the consensus corridor (y axis).
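The multiple-testing step can be illustrated as follows. The Benjamini-Hochberg adjustment is the standard step-up procedure; the signed transform is an assumption, since the caption only says that the sign encodes the direction of the relationship and does not give the exact formula. A signed negative log10 is used here as one plausible choice, and all names are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value to enforce monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def signed_log_p(adjusted_p, slope):
    """Assumed transform: -log10(p), negated when the fitted slope is negative.

    adjusted_p must be greater than zero.
    """
    magnitude = -math.log10(adjusted_p)
    return magnitude if slope >= 0 else -magnitude


# Illustrative P-values only
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

With this convention, a strongly significant inverse relationship appears as a large negative value and a direct relationship as a large positive one.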
- Supplementary Figure 6: Assay performance for fresh-frozen and FFPE samples (155 KB)
Boxplots summarizing the distribution of absolute deviation (top) and directional deviation (bottom) across four xenograft samples, two of which were stored as fresh-frozen and two as FFPE material. The measurements were evaluated against the consensus corridor for the corresponding fresh-frozen sample. AmpliconBS 3 and Infinium were not run on the fresh-frozen and/or FFPE samples and are therefore not included in the plot.
- Supplementary Figure 7: Correction for diverging offsets in the titration analysis (351 KB)
To eliminate the effect of different DNA methylation levels at the extreme points of the titration series, we proceeded in three steps: First, for each titration series and each genomic region we determined the difference between the median of the consensus corridors for the 0% and 100% titration ratios. Where these two extreme points differed by less than five percentage points, we discarded the corresponding region because of insufficient change in DNA methylation levels. Second, for regions with a negative change between the consensus values at the 0% and 100% titration ratio, the measurements were inverted by subtracting their measured DNA methylation value from the maximum corresponding to complete DNA methylation, which standardizes directions across all regions. Third, we adjusted for different DNA methylation levels at the 0% titration ratio by fitting a linear model to the measurements of each region and then subtracting the linear model offset (the intercept) from the measurements. These adjusted DNA methylation values were used for benchmarking the assays based on their Pearson correlation with expected DNA methylation levels and based on a second round of linear model fitting to assess linearity of the corrected DNA methylation values.
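The three steps can be sketched for a single region and a single assay as shown below. This is a simplified illustration under stated assumptions: the function names and argument layout are hypothetical, methylation is expressed in percent so that complete methylation is 100, and the linear model is an ordinary least-squares line of measured methylation against titration ratio.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def correct_titration(ratios, values, consensus_at_0, consensus_at_100,
                      min_change=5.0, full_methylation=100.0):
    """Return offset-corrected values, or None if the region is discarded."""
    # Step 1: require a sufficient change between the extreme points
    change = consensus_at_100 - consensus_at_0
    if abs(change) < min_change:
        return None
    # Step 2: invert regions whose methylation decreases along the series
    if change < 0:
        values = [full_methylation - v for v in values]
    # Step 3: subtract the fitted intercept (the level at the 0% ratio)
    _, intercept = fit_line(ratios, values)
    return [v - intercept for v in values]


# Illustrative series only: methylation falls from 90% to 10%
ratios = [0, 25, 50, 75, 100]
corrected = correct_titration(ratios, [90, 70, 50, 30, 10], 90.0, 10.0)
# corrected is approximately [0, 20, 40, 60, 80]
```

After this correction, every retained region increases along the titration series and starts near zero, which makes the regions comparable for the correlation and linearity checks.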
- Supplementary Figure 8: Comparison with clonal bisulfite sequencing (701 KB)
(a) Scatterplot illustrating the concordance between two replicates of clonal bisulfite sequencing for the same samples and target CpGs, done in different laboratories. Raw sequencing data shown as a BiQ Analyzer plot for one target CpG (Region 08) in one reference sample (CRC 6 Normal). (b) Heatmap and hierarchical clustering of the Pearson correlation matrix for all assays based on the DNA methylation measurements for regions 07 and 08 (for which two replicates of clonal bisulfite measurements were available) in the tumor/normal samples. Lighter colors indicate higher correlation. Comparisons with the two replicates for clonal bisulfite sequencing are highlighted by black borders. (c) Boxplots summarizing the distribution of absolute deviation (left) and directional deviation (right) across the tumor/normal samples and all core regions, comparing each assay’s DNA methylation measurements to the closest boundary of the corresponding consensus corridor. Measurements falling inside the corridor were assigned a deviation of zero. (d) Scatterplots illustrating the correlation between the ClonalBS 1 assay and the other absolute assays across the tumor/normal samples and covered regions. The blue lines indicate fitted linear models, and the reported numbers (r) are Pearson correlation coefficients.
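The deviation measure described in panel c is simple to state in code. The sketch below assumes a corridor given as a (lower, upper) pair in percentage points. The sign convention for the directional deviation (negative below the corridor, positive above it) is an assumption, as the caption does not spell it out.

```python
def directional_deviation(value, corridor):
    """Signed distance to the closest corridor boundary; zero inside."""
    lower, upper = corridor
    if value < lower:
        return value - lower  # negative: measurement below the corridor
    if value > upper:
        return value - upper  # positive: measurement above the corridor
    return 0.0


def absolute_deviation(value, corridor):
    """Unsigned distance to the closest corridor boundary; zero inside."""
    return abs(directional_deviation(value, corridor))


# Illustrative corridor and measurements only
corridor = (37.0, 52.0)
print(directional_deviation(30.0, corridor))  # -7.0
print(absolute_deviation(60.0, corridor))     # 8.0
```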
- Supplementary Figure 9: Pairwise comparison of measurements for the relative assays (714 KB)
(a) Scatterplots comparing the measurements for all pairs of relative DNA methylation assays (top triangle) and all pairs of one relative and one absolute assay (bottom rectangle). Each dot corresponds to one region in one sample. (b) Heatmap representation of the Pearson correlation matrix for all assays across all DNA methylation measurements. Lighter colors indicate higher correlation.
- Supplementary Figure 10: Variation among DNA amounts and its effect on assay performance (277 KB)
(a) Distribution of DNA concentration measurements for the tumor/normal sample aliquots sent to the participants. Each dot represents one sample measured in one laboratory, and the colors correspond to the technology used to obtain the measurements. (b) Lack of association between varying DNA amounts and assay performance. Linear models were fitted to predict the absolute deviation from the consensus corridor, based on the amount of input DNA according to the measurement in the corresponding laboratory (bottom) and the estimated DNA amount based on the median of all concentration measurements for the specific sample (top). The resulting P-values (y axis) were corrected for multiple testing using the Benjamini-Hochberg method and transformed such that positive values denote a direct relationship between the measured DNA amount and the absolute deviation, whereas negative values denote an inverse relationship. No significant associations were found for an adjusted P-value threshold of 0.05. (c) Example scatterplots demonstrating the lack of correlation between input DNA amounts (x axis), according to each laboratory's own measurements (top) and the consensus concentration (bottom), and the assay performance (y axis).
- Supplementary Figure 11: Assay performance for low-input titration series (317 KB)
(a) Bar plots showing the ratio of successful measurements (i.e., values that passed quality control) for each input DNA amount, over a titration series of high molecular weight DNA with amounts ranging from 100 ng to 0.3 ng. (b) Dot plots summarizing the deviation of DNA methylation measurements from the respective consensus corridor. (c) Dot plots summarizing the deviation of DNA methylation measurements from the target value provided by the same assay’s measurement on the corresponding reference sample with ~1 µg input DNA. Colored areas range from the minimum to the maximum of observed deviations at each DNA quantity level.
- Supplementary Figure 12: Assay performance for fragmented DNA titration series (305 KB)
(a) Bar plots showing the ratio of successful measurements (i.e., values that passed quality control) for each input DNA amount, over a titration series of highly fragmented DNA with amounts ranging from 100 ng to 0.3 ng. (b) Dot plots summarizing the deviation of DNA methylation measurements from the respective consensus corridor. (c) Dot plots summarizing the deviation of DNA methylation measurements from the target value provided by the same assay’s measurement on the corresponding reference sample with ~1 µg input DNA. Colored areas range from the minimum to the maximum of observed deviations at each DNA quantity level.
- Supplementary Figure 13: Unsupervised analysis of similarity between samples (378 KB)
For each of the locus-specific DNA methylation assays, multidimensional scaling diagrams visualize the relative similarity among the reference samples in two dimensions. The analysis was based on Euclidean distances calculated across all genomic regions for any given sample pair and DNA methylation assay. Point color indicates sample type. Note that the 100% titration sample in the second titration series is based on DNA from a colon cancer cell line, which explains why it often clusters with either the colon tumor/normal samples or with the fresh-frozen vs. FFPE xenografts (which were also derived from colon cancer cell lines).
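The distance calculation that feeds the multidimensional scaling can be sketched as below. Only the pairwise Euclidean distances are computed; the scaling step itself is left out. The input layout (one list of per-region methylation values per sample, with no missing values) and all names are assumptions made for illustration, since the caption does not say how missing measurements were handled.

```python
import math
from itertools import combinations


def euclidean_distances(profiles):
    """Pairwise Euclidean distances between sample methylation profiles.

    profiles: dict mapping sample name -> list of methylation values, one per
    genomic region, in the same region order for every sample (hypothetical).
    Returns a dict mapping (sample_a, sample_b) -> distance.
    """
    distances = {}
    for a, b in combinations(sorted(profiles), 2):
        squared = sum((x - y) ** 2 for x, y in zip(profiles[a], profiles[b]))
        distances[(a, b)] = math.sqrt(squared)
    return distances


# Illustrative profiles only (two regions per sample)
profiles = {
    "normal_1": [10.0, 20.0],
    "tumor_1": [13.0, 24.0],
    "tumor_2": [13.0, 24.0],
}
print(euclidean_distances(profiles)[("normal_1", "tumor_1")])  # 5.0
```

A distance matrix of this kind, computed separately for each assay, would then be passed to a multidimensional scaling routine to obtain the two-dimensional layout.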
- Supplementary Figure 14: ROC curves for classifying tumor vs. normal samples (431 KB)
(a) ROC curves visualizing the classifier performance for varying levels of random error (orange) or uniform noise (purple). Each plot compares the true positive rate (y axis; sensitivity) with the false positive rate (x axis; 1 - specificity) at different classification thresholds for a support vector machine trained to discriminate between colon tumor vs. normal samples. All ROC curves are based on cross-validation and show test set performance. (b) ROC curves as in panel a but based on a restricted set of regions, namely the top 1, 3 or 5 most informative regions (green) or on 1, 3 or 5 randomly selected regions (pink).
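How a ROC curve is derived from classifier output can be shown with a short, library-free sketch. It assumes that each test sample already has a decision score (for example, from a support vector machine evaluated in cross-validation) and a true label; the classifier itself is not reproduced here, and the function names and example scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    scores: higher values mean more tumor-like (assumed convention).
    labels: True for tumor, False for normal.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative scores only: one normal sample outranks one tumor sample
points = roc_points([0.9, 0.4, 0.6, 0.2], [True, True, False, False])
print(area_under_curve(points))  # 0.75
```

Each point corresponds to one classification threshold, so sweeping the threshold from high to low traces the curve from the bottom left to the top right of the plot.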
- Supplementary Figure 15: Assay validation in a prostate cohort (132 KB)
(a) Pairwise comparison of DNA methylation measurements obtained by the EpiTyper 3 and Infinium assays for 160 prostate tumor and 8 normal prostate samples across 10 target genomic regions. (b) ROC curve comparing the true positive rate (y axis; sensitivity) with the false positive rate (x axis; 1 - specificity) at different classification thresholds for a support vector machine trained to discriminate between prostate cancer and normal prostate samples based on measurements obtained with either EpiTyper 3 (green) or Infinium (purple).
- Supplementary Text and Figures (7,985 KB)
Supplementary Figures 1–15 and Supplementary Note 1
- Supplementary Table 1 (23 KB)
Summary of the evaluated DNA methylation assays
- Supplementary Table 2 (14 KB)
Summary of the reference samples used for benchmarking
- Supplementary Table 3 (324 KB)
Summary of the target genomic regions used for benchmarking