Simultaneous sequencing of genetic and epigenetic bases in DNA

Füllgrabe, Jens; Gosal, Walraj S.; Creed, Páidí; Liu, Sidong; Lumby, Casper K.; Morley, David J.; Ost, Tobias W. B.; Vilella, Albert J.; Yu, Shirong; Bignell, Helen; Burns, Philippa; Charlesworth, Tom; Fu, Beiyuan; Fordham, Howerd; Harding, Nicolas J.; Gandelman, Olga; Golder, Paula; Hodson, Christopher; Li, Mengjie; Lila, Marjana; Liu, Yang; Mason, Joanne; Mellad, Jason; Monahan, Jack M.; Nentwich, Oliver; Palmer, Alexandra; Steward, Michael; Taipale, Minna; Vandomme, Audrey; San-Bento, Rita Santo; Singhal, Ankita; Vivian, Julia; Wójtowicz, Natalia; Williams, Nathan; Walker, Nicolas J.; Wong, Nicola C. H.; Yalloway, Gary N.; Holbrook, Joanna D.; Balasubramanian, Shankar

doi:10.1038/s41587-022-01652-0

Download PDF

Article
Open access
Published: 06 February 2023

Simultaneous sequencing of genetic and epigenetic bases in DNA

Jens Füllgrabe¹^na1,
Walraj S. Gosal¹^na1,
Páidí Creed¹,
Sidong Liu¹,
Casper K. Lumby ORCID: orcid.org/0000-0001-8329-9228¹,
David J. Morley¹,
Tobias W. B. Ost¹,
Albert J. Vilella ORCID: orcid.org/0000-0002-2005-2516¹,
Shirong Yu¹,
Helen Bignell¹,
Philippa Burns¹,
Tom Charlesworth¹,
Beiyuan Fu¹,
Howerd Fordham¹,
Nicolas J. Harding¹,
Olga Gandelman¹,
Paula Golder¹,
Christopher Hodson¹,
Mengjie Li¹,
Marjana Lila¹,
Yang Liu¹,
Joanne Mason¹,
Jason Mellad¹,
Jack M. Monahan ORCID: orcid.org/0000-0002-0635-0015¹,
Oliver Nentwich¹,
Alexandra Palmer¹,
Michael Steward¹,
Minna Taipale¹,
Audrey Vandomme¹,
Rita Santo San-Bento¹,
Ankita Singhal¹,
Julia Vivian ORCID: orcid.org/0000-0001-6688-7714¹,
Natalia Wójtowicz¹,
Nathan Williams¹,
Nicolas J. Walker ORCID: orcid.org/0000-0002-0498-4356¹,
Nicola C. H. Wong¹,
Gary N. Yalloway¹,
Joanna D. Holbrook ORCID: orcid.org/0000-0003-1791-6894¹ &
…
Shankar Balasubramanian ORCID: orcid.org/0000-0002-0281-5815^2,3

Nature Biotechnology volume 41, pages 1457–1464 (2023)Cite this article

60k Accesses
21 Citations
730 Altmetric
Metrics details

Subjects

Abstract

DNA comprises molecular information stored in genetic and epigenetic bases, both of which are vital to our understanding of biology. Most DNA sequencing approaches address either genetics or epigenetics and thus capture incomplete information. Methods widely used to detect epigenetic DNA bases fail to capture common C-to-T mutations or distinguish 5-methylcytosine from 5-hydroxymethylcytosine. We present a single base-resolution sequencing methodology that sequences complete genetics and the two most common cytosine modifications in a single workflow. DNA is copied and bases are enzymatically converted. Coupled decoding of bases across the original and copy strand provides a phased digital readout. Methods are demonstrated on human genomic DNA and cell-free DNA from a blood sample of a patient with cancer. The approach is accurate, requires low DNA input and has a simple workflow and analysis pipeline. Simultaneous, phased reading of genetic and epigenetic bases provides a more complete picture of the information stored in genomes and has applications throughout biomedicine.

Single duplex DNA sequencing with CODEC detects mutations with high sensitivity

Article Open access 27 April 2023

Jin H. Bae, Ruolin Liu, … Viktor A. Adalsteinsson

Direct enzymatic sequencing of 5-methylcytosine at single-base resolution

Article 15 June 2023

Tong Wang, Johanna M. Fowler, … Rahul M. Kohli

Digital RNA sequencing using unique molecular identifiers enables ultrasensitive RNA mutation analysis

Article Open access 01 March 2024

Manuel Luna Santamaría, Daniel Andersson, … Anders Ståhlberg

Main

Information encoded in nucleic acids is fundamental to the biology of living systems. There are multiple dimensions of information stored within DNA. Genetic sequencing of the DNA bases G, C, T and A has been transformed by high-throughput sequencing approaches in the past two decades. Epigenetic information in DNA provides insights into dynamic changes in biology that are closely associated with transcriptional programs¹ and cell fate². The combination of genetic and epigenetic information provides a more comprehensive view of biology. For instance, the analysis of somatic genetic mutations, together with DNA methylation marks, from blood DNA gave a substantially more accurate prediction of mortality than either genetics or DNA methylation alone³. Both DNA methylation and genotype are required to determine the pluripotency of induced stem cells⁴ and their maturation capacity⁵. Germline genetic alterations cause changes in DNA methylation that ultimately dictate predisposition for disease^6,7,8. Combining information on DNA methylation with genetic sequence in cell-free DNA (cfDNA) from blood has been shown to substantially increase sensitivity to detect tumor DNA⁹. DNA methylation information can also inform on the tissue of origin of the circulating tumor DNA¹⁰. Noninvasive prenatal diagnostic analysis has also demonstrated that DNA methylation signal can determine fetal origin of DNA fragments¹¹. Epigenetic information in DNA has been retrieved principally via sequencing 5-methylcytosine (5mC). More recently, 5-hydroxymethylcytosine (5hmC) has emerged as an important base modification that can provide information that goes beyond 5mC and genetics^12,13. Hitherto, researchers have accessed either genetic or epigenetic information, without resolving 5mC from 5hmC.

Commonly used sequencing approaches do not capture full information from both genetics and epigenetics. Next-generation sequencing directly captures the canonical bases G, C, T and A in its readout¹⁴. A number of base-conversion chemistries have been developed to help differentiate unmodified C from its epigenetic variants, 5mC or 5hmC. These include bisulfite-based approaches such as whole-genome bisulfite sequencing (WGBS)¹⁵ and bisulfite-free approaches such as enzymatic-methyl sequencing (EM-seq)¹⁶ and TET-assisted pyridine borane sequencing¹⁷. An important shortfall of all such methods is that conversion of either the C base, or one of its epigenetic derivatives, to a U (read as T) compromises the direct detection of genetic C-to-T changes, which is the most common mutation in the mammalian genome¹⁸ and in cancer¹⁹. Furthermore, the ambiguity caused by C-to-T conversions in the sequenced reads being mapped against either C or T in the reference genome increases false-positive matches in the search space, consequently making computational alignment and mapping of converted reads slower, more expensive and less accurate²⁰. Also, these existing methods cannot distinguish 5mC from 5hmC in a single workflow. Methods to distinguish 5hmC from 5mC by exclusively converting only one base have been developed for example oxidative bisulfite sequencing²¹, TET-assisted pyridine borane sequencing-beta²², Tet-assisted bisulfite sequencing²³ and APOBEC-coupled epigenetic sequencing²⁴ or by selectively copying 5mC across strands of DNA^25,26. However, some of these can involve separate, parallel workflows and sequencing to yield full information, which may increase sample requirement, cost and time taken and/or yield data that lack phased information. Combining separate datasets is fraught with difficulties that lead to additive measurement error and coverage gaps across workflows (Supplementary Fig. 1).

In this study, we present a whole-genome sequencing methodology capable of sequencing the four genetic letters in addition to 5mC and 5hmC to provide an accurate six-letter digital readout in a single workflow. The processing of the DNA sample is entirely enzymatic and avoids the DNA degradation and genome coverage biases of bisulfite treatment^27,28. The method uses all four genetic letters for genomic alignment and encompasses an intrinsic error suppression capability, which leads to high accuracy for both genetic and epigenetic base calling. The approach is versatile and we demonstrate its application to different sample formats, including human genomic DNA and a cfDNA sample from a human patient with cancer.

Results

We have developed a method to sequence beyond the four canonical DNA bases and include 5mC and 5hmC (Fig. 1a). The approach is compatible with any sequencer platform. Watson–Crick base pairing provides an explicit digital molecular mechanism for reading up to four letters (or states) (Fig. 1b). To read an epigenetic letter, one can carry out a base-conversion transformation, such as a C-to-T conversion (for example bisulfite seq/EM-seq), where modified Cs are not converted (Fig. 1b). Here, the genetic information is compromised and importantly masks the commonly occurring C-to-T single-base variant. More generally, when sequencing DNA using a single-base coding system with a four-state readout (that is G, C, T, A), one can unambiguously report on a maximum of four states of information in a given run, be those genetic or epigenetic. A system that uses a two-base coding approach, whereby combinations of two bases relay the information for a state, enables up to 16 states to be decoded unambiguously (Fig. 1c). This makes it possible to read all four genetic states and multiple epigenetic states in a single run. We have reduced this to practice for simultaneous five- and six-letter sequencing (seq).

We first describe the five-letter seq workflow that unambiguously resolves the four genetic bases and the epigenetic modifications, 5mC or 5hmC, termed hitherto as modC (unmodified cytosine is referred to as unmodC). In this workflow (Fig. 1d), the sample DNA is first fragmented via sonication and then ligated to short, synthetic DNA hairpin adapters at both ends. The construct is then split to separate the strands. For each original sample strand, a complementary copy strand is synthesized by DNA polymerase extension of the 3′-end to generate a hairpin construct with the original sample DNA strand connected to its complementary strand, lacking epigenetic modifications, via a synthetic hairpin. Sequencing adapters are then ligated to the end. ModCs are enzymatically protected via enzymatic oxidation of 5mC to 5hmC or 5fC or 5caC, by a ten-eleven translocation (TET) methylcytosine dioxygenase 2 mutant²⁹ and subsequent enzymatic glycosylation of all 5hmCs, by beta-glycosyltransferase³⁰. The unprotected Cs are then deaminated to uracil, which is subsequently read as thymine, via the action of APOBEC3A cytosine deaminase (A3A)³¹ (Fig. 1d). UvrD helicase³² is added to the deamination reaction to generate a single-stranded DNA substrate for A3A. The deaminated constructs are no longer fully complementary and have substantially reduced or no duplex stability, and as such can be readily amplified by polymerase chain reaction (PCR). The constructs can be sequenced in paired-end format whereby read 1 (P7 primed) is the original strand and read 2 (P5 primed) is the copy strand. The read data are pairwise aligned so read 1 is aligned to its complementary read 2. Cognate residues from both reads are computationally resolved to produce a single genetic or epigenetic letter (Fig. 1e). Pairings of cognate bases that differ from the permissible five (Fig. 1f) are the result of incomplete fidelity at some stage(s) comprising sample preparation, amplification or erroneous base calling during sequencing. Because these errors occur independently to cognate bases on each strand, most (18 from 24 possible) substitutions result in a nonpermissible pair. Exceptions are changes between A and G on either strand, which could result in genetic (four possible substitutions) or epigenetic (two possible substitutions) miscalls (Supplementary Fig. 2). Nonpermissible pairs are masked (marked as N) within the resolved read and the read itself is retained, leading to minimal information loss and high accuracy at read level. The resolved read is aligned to the reference genome. Genetic variants and methylation counts are produced by read-counting at base level.

The illustrative examples we provide below have used the sequence adapters for the Illumina Novaseq NGS platform; however, other adapters can be readily substituted, and the method is compatible with any sequence reader capable of discriminating at least the four genetic bases.

We performed the five-letter seq workflow on a mixed sample comprising 80 ng of sonicated human genomic DNA from a B lymphoblast cell line (NA12878), obtained from the Genomes in a Bottle project³³, 0.5 ng bacteriophage λ DNA enzymatically methylated at all cytosines in CpG context by a CpG methyltransferase (M.SssI) and 0.5 ng pUC19 isolated from a methylation-negative strain of Escherichia coli (Dam–, Dcm–). DNA was prepared, in duplicate, using the workflow outlined in Fig. 1d and sequenced on an Illumina Novaseq 6000 to produce approximately 550 million paired-end reads. These reads were computationally resolved as outlined in Fig. 1e,f. On average, 98.4% of reads obtained were resolved and of these 89.8% were aligned to the genome with mapq > 0. The average PCR and cluster duplication rate in five-letter seq data was 8.5% with 90.2% of the genome covered by at least one read, and a 15× average coverage of the genome.

Resolved reads contain the four-state genetic information (with the epigenetic information stored as sequence alignment/map format (SAM) tags). Full genetic information enables genomic alignment using standard tools and reduced execution times compared to the three-state alignment necessitated by techniques such as WGBS and EM-seq. The execution time, measured on a single central processing unit (CPU), for genomic alignment of one million resolved single-end 16-state reads using Burrows-Wheeler Aligner-minimal exact match (BWA-MEM) was 7.5 min and the genomic alignment time for one million three-state reads (in 500 K paired-end pairs) using Burrows-Wheeler Aligner-methylation (bwa-meth) was 16.5 min.

We next compared the data quality of both the epigenetic and genetic components of five-letter seq with best-practice methods used to sequence either epigenetics only, or genetics only. To compare the epigenetic component of our method, the same sample mix (80 ng NA12878, 0.5 ng λ and pUC19), was interrogated in duplicate by WGBS (Epitect, Qiagen) and by EM-seq (NEB). For EM-seq and WGBS we used 275 million paired-end reads to generate data at a comparable coverage to the 275 million resolved five-letter seq reads (each paired-end read in five-letter seq yields one single-ended read of equivalent length). Reads were aligned using bwa-meth and methylation was called using MethylDackel, yielding 89.5% and 92% aligned reads (with mapq > 0), and 15× and 17× deduplicated average coverage, for WGBS and EM-seq, respectively. All three technologies had similar coverage and uniformity across the genome. Half-mean coverage was achieved for 87.82% of the bases in the NA12878 genome in five-letter seq, which compares to 85.91% for WGBS and 87.48% EM-seq (Supplementary Fig. 3 and Supplementary Table 1). We observed a small drop in coverage of CpGs near transcription start sites relative to the rest of the genome. Dips and peaks in coverage near transcription start sites have been observed for other technologies³⁴. We observed a negligible difference (0.01×) between mean coverage of CpGs in regions with <20% methylation and CpGs in regions with >80% (Supplementary Fig. 4).

To compare the accuracy of the genetic sequencing component of our method, the same sample mix was interrogated by Illumina sequencing with standard library preparation using KAPA HyperPrep kit (Illumina) and processed in a similar pipeline. We calculated sensitivity to detect modC (expressed as a percentage) by considering all CpG-context Cs in the lambda genome and evaluating the ratio of modCs to the total number of observed cytosines (modC or unmodC). Similarly, specificity was calculated as the ratio of unmodCs to the total number of cytosines for CpG contexts in the pUC19 reference. Sensitivity of five-letter seq was 98.55% and specificity was 99.95%. This compared well with EM-seq, which had lower sensitivity (97.89%) and lower specificity (99.5%), and to WGBS, which was also less sensitive (95.69%) and less specific (99.92%) (Fig. 2a, bottom). Across the NA12878 human genome, average levels of modC observed at CHG and CHH sites were 0.07% as measured by five-letter seq, 0.14% as measured by WGBS and 0.33% as measured by EM-seq. In contrast, average modC levels at CpG sites was highest as measured by five-letter seq (54.05%), 51.10% as measured by EM-seq and 49.38% as measured by WGBS (Fig. 2a, top). Quantification of modC at all CpGs across reads and at genome level was highly concordant at single-base level between five-letter seq and WGBS (Pearson’s P < 0.01, r = 0.94) (Fig. 2b). There was a high level of agreement between modC estimates generated from the two methods (Fig. 2c). For the comparisons in Fig. 2b,c, we pooled counts of modified and unmodified C calls at CpGs across both strands in two technical replicates and only considered CpGs, which were covered by at least three reads (94.24% of all CpGs).

**Fig. 2: Performance of five-letter seq on genomic DNA.**

To evaluate the genetic component, called bases were compared with the reference at nonvariant sites according to published variant call files for NA12878. Genetic accuracy was calculated as the ratio of correct base calls to total base calls (disregarding N calls). For five-letter seq and Illumina sequencing, the genetic accuracy was consistent across all base types (99.95–99.98%). For WGBS and EM-seq, however, accuracy was high (99.69–99.90%) for C/G bases, but low (around 72.5%) for A/T bases (Fig. 2d). This is caused by unmodC to T deamination in WGBS and EM-seq technologies that results in T bases being called instead of true C bases for the first read in each paired-end read, and in A bases being called instead of true G bases for the second read. This phenomenon has the consequence of masking actual C>T or A>G transitions without additional processing and requires relatively high genomic coverage³⁵. A feature of the technology is suppression of PCR and sequencing errors, these suppressed errors constituted 0.45% of bases in the resolved reads (Supplementary Table 1). The accuracy of variant calling was investigated by applying GATK HaplotypeCaller to mapped read files of varying depths (211 million–842 million reads, resulting in mean coverages of 7×–29×). Overall performance (single nucleotide polymorphisms (SNPs) and indels) was determined through comparison with published variant call files (small variants) for NA12878. Sensitivity varied from 84.86% (211 million reads) to 95.62% (842 million reads) with precision consistently above 97.79% (Fig. 2e).

It is important to note that five-letter seq simultaneously determines genetics and epigenetics on the same read. Therefore, combinations of genetic and epigenetic marks in cis on the same DNA molecule are detected. This property can be used to deconvolute genetic and epigenetic properties of heterologous samples. For example, we detected allele-specific methylation in the NA12878 sample. Allele-specific methylation is a wide-spread phenomenon in the human genome whereby DNA methylation is differential between alleles. It can identify regulatory sequence variants that underlie genome-wide association studies signals for common diseases^36,37. Reads covering a polymorphism were interrogated for methylation levels that differed between alleles. Loci with the strongest effects included human PLAG1, a well-known imprinted gene³⁸ (Fig. 2f,g).

The analysis of cfDNA is a burgeoning aspect of diagnostics and application areas include enhanced noninvasive prenatal diagnosis, early cancer detection and disease monitoring³⁹. A practical challenge is to work with the limiting amount of cfDNA that can be extracted from a standard blood draw, which is typically around 10 ng/ml (ref. ⁴⁰). An accurate sequencing method that can simultaneously detect genetic and epigenetic information from the same sample could transform cfDNA analysis. We deployed the five-letter seq workflow (Fig. 1d,e) for the analysis of cfDNA from an individual with stage III colon cancer. Input DNA quantity was 2 or 10 ng of cfDNA or 80 ng of genomic DNA (gDNA), in each case mixing the unsonicated sample with λ and pUC19 DNA spike-in controls, as previously described. PCR and cluster duplication rates ranged from 38.2% to 14.5% as input DNA increased from 2 to 80 ng (Fig. 3a), with 90.85%, 90.89% and 90.35% of the genome covered with at least one read, at mean read depth of 21.8×, 27.6× and 30.4×, for 2, 10 and 80 ng input, respectively (Fig. 3b and Supplementary Table 2). The accuracy of methylation detection, determined for the λ and pUC19 DNA controls, remained high achieving above 98% sensitivity and specificity 99.89%, 99.94% and 99.98% (Fig. 3c), even at 0.05 ng of control DNA in the 2 ng mixed sample. The fragment length distribution typical of cfDNA is retained in the five-letter seq library suggestive of profiling of the mono and dinucleosomal fractions (Supplementary Fig. 5).

**Fig. 3: Application of five-letter seq to cfDNA.**

The enzymatic oxidation of 5mC generates 5hmC⁴¹, which has been shown to have value as a marker of biological states and disease, including early cancer detection from cfDNA^42,43. We have adapted our platform to enable six-letter seq of DNA that comprises G, C, T, A, mC and hmC (Fig. 4a). A critical requirement is to disambiguate 5mC from 5hmC without compromising genetic base calling within the same sample fragment. The first three steps of the workflow are identical to five-letter seq shown in Fig. 1d to generate the adapter ligated sample fragment with the synthetic copy strand. To enable six-letter seq, critically, methylation at 5mC is enzymatically copied across the CpG unit to the C on the copy strand using DNA methyltransferase 5 (DNMT5), whereas 5hmC is glycosylated via beta-glycosyltransferase to prevent such a copy. DNMT5 has been selected to perform the copy step because of its specificity for copy methylation over de novo methylation⁴⁴. The unmodCs are then deaminated to uracil, which is subsequently read as thymine via the joint action of TET2, A3A and UvrD helicase. Glycosylation still protects 5hmC. The DNA is subjected to PCR amplification and sequencing as described earlier (Fig. 1e). The reads are pairwise aligned and resolved using the two-base code shown in Fig. 4b. Each of unmodC, 5mC and 5hmC can be resolved as the three CpG units have distinct sequencing readouts of the two-base code (Fig. 4b). In this workflow modC can also be detected in non-CpG contexts, although in such contexts 5mC cannot be distinguished from 5hmC. The evidence suggests this is not such a substantial limitation as conversion of 5mC to 5hmC by TET, has been shown to be minimal outside of CpG context⁴⁵ and so it would be a reasonable assumption that the majority of modC calls at CHH sites are 5mC.

We estimated the accuracy of base-modification status calls produced by six-letter seq using ground-truth control sequences with known modifications. We used the same fully unmethylated pUC19 and fully methylated lambda as were used in the evaluation of the five-letter seq protocol and an additional short synthetic oligonucleotide with 5hmC present at specific CpGs. Six-letter seq correctly identified 95.24% 5mC and 97.93% 5hmC status calls, across all reads aligning to those CpGs (Fig. 4c) and retained a very high (99.93%) accuracy of detecting unmodC across all reads. Extrapolation of this accuracy to genome-scale is detailed in Supplementary Fig. 6.

Discussion

We have described a sequencing platform that will deliver the full complement of genetics in addition to the epigenetic bases 5mC and 5hmC at base resolution in a single workflow and data pipeline. The platform operates via a two-base coding mechanism at the molecular level coupled with decoding software. It also improves the accuracy of genetic sequencing and variant calling alongside high-accuracy epigenetic base calling. The platform comprises an all-enzyme workflow with extremely high conversion efficiencies, thus enabling accurate data from valuable, low-input biological samples or from clinical cfDNA. The readout of genetic or epigenetic bases is digital and context independent. The alignment uses a four-base system that is faster and more accurate than three-base alignments used for bisulfite sequencing or EM-seq. The acquisition of accurate, phased genetic and epigenetic information, in a single experimental and data workflow is likely to offer numerous advantages in research and in diagnostics by providing more comprehensive biological information, with ease of use and at reduced cost.

Third-generation sequencing systems^46,47 also measure epigenetic and genetic modifications in the same workflow by detecting signal differences directly caused by epigenetic modifications. As such, they are incompatible with PCR amplification of the starting material, as DNA modifications such as methyl and hydroxymethylcytosine are lost during PCR. Therefore, they are severely limited by the read depth achievable at low DNA inputs. For instance, Katsman et al.⁴⁸ described nanopore sequencing of 15–60 ng of isolated cfDNA, achieving an average sequence depth of 0.2×. This low sequence depth precludes any sampling of the circulating tumor DNA (which is a small fraction of cfDNA) and quantitation of epigenetic modifications at specific loci beyond 0% or 100% methylated. It also requires reliance on read-level genetic accuracy without consensus across multiple reads, this is estimated on both the promethION and minION as <99%⁴⁹.

The method presented in this study, due to its use of base conversion, is compatible with PCR. In Supplementary Table 2 we show 22× sequencing depth achieved from 2 ng of cfDNA with only 38% duplication rate, further depth could be achieved before the library is saturated. The 22× enables sampling of cfDNA (including genetic states and epigenetic states) present as <5% of total library and quantification of epigenetic state (resolving differences of <5%). Read-level genetic accuracy in this technology is high (99.97%) due to the inherent error suppression technology, and the ability to produce high depth at low input, allows consensus variant calling at least equal to Illumina accuracy (Fig. 2e).

Conclusions

The most prevalent epigenetic modifications to DNA in mammalian biology are 5mC and 5hmC. There are other, less frequent, modifications such as 5-formylcytosine, 5-carboxycytosine and N⁶-methyladenine, in various organisms. The measurement of additional modifications by exploiting additional information states inherent in the technology will be the subject of future work, beyond the scope of this manuscript. As the platform fundamentally uses Watson–Crick base pairing to decode information, it can be made readily compatible with any sequencer platform and we see opportunities for its future application to long-read sequencing and single-cell analysis.

Methods

Samples

Genomic DNA was obtained from the Coriell Institute (https://www.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=NA12878).

Double spun plasma was obtained from Geneticist Inc. from a single woman patient with a gastric cancer diagnosis. cfDNA was extracted using the NextPrep-Ma kit on the Chemagic Prime platform (Perkin Elmer chemagen Technologie GmbH).

Ground truth controls

An 80-base pair (bp) oligonucleotide with balanced CG content and no homology to human or lambda genome or pUC19, and symmetrical 5hmC at one CpG and asymmetrical 5hmC at another, was designed and purchased from ATDBio Ltd. (Supplementary Fig. 7). The oligonucleotide pair was diluted in 100 mM potassium acetate, 30 mM HEPES, pH 7.5 (IDT) and annealed by heating to 94 °C for two minutes and cooling gradually to 4 °C.

For unmethylated pUC19 control, the plasmid (NEB) was transformed into dcm-/dam- E. coli chemically competent cells (NEB). Plasmid was isolated from culture using a QIAprep Spin Miniprep Kit (Qiagen). Methylated lambda DNA was prepared using unmethylated lambda DNA (Promega). Briefly, 1 μg DNA was incubated with eight units of CpG methyltransferase MSssI (NEB), NEB buffer 2, 320 μM S-adenosyl-methionine (NEB) at 37 °C. After four hours, a further two units of MSssI was added and the reaction incubated for a further four hours at 37 °C. The DNA was purified using a Zymo clean and concentrator column. Complete methylation was checked using methylation sensitive HpaII (NEB) and methylation insensitive MspI (NEB). A negative control using unmethylated lambda DNA was also prepared and used to check digestion with MspI and HpaII. Both methylated lambda DNA and unmethylated pUC19 were diluted in 10 mM Tris-HCl pH 8.0, 0.1 mM EDTA and fragmented to ~250 bp using a Covaris M220.

Laboratory processing

For five- and six-letter seq, variable amounts of input material were used according to the manufacturer’s protocol (Cambridge Epigenetix). Genomic DNA was sonicated using the Covaris M220 Sonicator set to a target size of 250 bp. Other target sizes are compatible with the method and achieve similar yields (Supplementary Table 3). The fragment length profile of cfDNA is conserved in the sequencing libraries (Supplementary Fig. 5).

For five-letter seq, cfDNA or sheared gDNA was mixed with 1 µl of spike-in control (at 0.5 ng/μl for main DNA inputs of ≥80 ng and above and 0.05 ng/μl for inputs ≤10 ng), 3.5 µl End Prep reaction buffer and 1.5 µl End Prep Enzyme Mix for end repair and A-tailing (catalog no. E7647, NEB). The reaction was incubated at 20 °C for 30 min followed by 65 °C for 30 min. In the same mix, Adapter 1 (ATGACGATGCGTTCGAGCATCGUCAUT, all Cs are methylated, Biomers.net GmbH) was ligated using 3.75 µl of adapter 1, 0.5 µl of ligation enhancer and 15 µl of ligation master mix (both catalog no. E7647, NEB) and incubated at 20 °C for 15 min. Afterwards SPRIselect magnetic beads (catalog no. B23319, Beckman Coulter) were added to the solution and a clean-up performed according to the manufacturer’s protocol. The library was eluted in 23.75 µl nuclease-free water (catalog no. W4502, Sigma) and 3 µl of 10× rCutsmart buffer (catalog no. B6004S, NEB) and 3.25 µl USER (catalog no. M5505, NEB) were added. This was incubated for 30 min at 37 °C. Next the DNA was purified using SPRI magnetic beads (catalog no. B23319, Beckman Coulter) using the manufacturer’s protocol. DNA was eluted in 12 µl Library Elution buffer (10 mM Tris-HCl pH 8.0) and used subsequently in strand synthesis. Here, 2 µl of each of 10× NEB buffer 4 (catalog no. B7004S, NEB), dNTPs (10 mM each, catalog no. R0192, Thermo Fisher Scientific), Klenow (exo-) (catalog no. P7010-LC, Thermo Fisher Scientific) and T4 PNK (catalog no. EK0032, Thermo Fisher Scientific) were added and incubated for 30 min at 37 °C. The reaction was then exposed to denaturing at 95 °C for two minutes and gradually cooled (at −0.1 ºC/second) to enable reannealing. Immediately afterwards, 2.5 µl adapter 2 was ligated (forward: ACACTCTTTCCCTACACGACGCTCTTCCGATC*T, *indicates phosphorothioate, reverse: GATCGGAAGAGCACACGTCTGAACTCCAGTCA, all Cs are mC, Biomers.net GmbH), 0.5 µl of ligation enhancer, 15 µl of ligation master mix (both catalog no. E7647, NEB) and 12 µl nuclease-free water (catalog no. W4502, Sigma). This was incubated for 15 min at 20 °C. Next, the DNA was purified by another SPRIselect bead purification. Elution from beads was performed in 30 µl and then 10 µl of reconstituted TET2 5× supplement buffer (10 mM a-Ketoglutarate, 0.25 M Tris-HCl pH 8.0, 10 mM ATP), 1 µl of UDP glucose, 1 µl of T4 beta-glucosyltransferase (both catalog no. EO0831, Thermo Fisher Scientific), 1 µl of 100 mM DTT and 2 µl of TET2 (Cambridge Epigenetix) were added. After mixing 500 mM Fe(II) sulfate hexahydrate (Sigma) 1:1250 with nuclease-free water (catalog no. W4502, Sigma), 5 µl of this dilution was added to the DNA and incubated for 60 min at 37 °C. Next, the converted DNA was purified by a SPRIselect bead clean-up according to the manufacturer’s protocol, eluted in 31 µl of nuclease-free water and exposed to another enzymatic conversation reaction. For this, 12.75 µl nuclease-free water, 17.5 µl 4X APOBEC buffer (200 mM BisTris pH6.1, 0.4% Tween), 1.75 µl 100 mM ATP (catalog no. A6559, Sigma), 3.5 µl 100 mM MgCl2 (catalog no. M1028, Sigma), 2 µl APOBEC-A3A (Cambridge Epigenetix) and 2.5 µl of UvrD Helicase (Cambridge Epigenetix) were added. The reaction was incubated for 1.5 h at 37 °C and cleaned up afterwards with another SPRIselect bead-mediated purification according to the manufacturer’s protocol. The DNA was eluted in 20 µl nuclease-free water and amplified by PCR using 5 µl of Abclonal Unique Dual Index Primers for Illumina (Abclonal no. RK21624_SetA) and 25 µl of 2X KAPA HiFi U⁺ Polymerase (no. KK2802). For 80 ng of gDNA, six cycles of PCR were used, whereas 10 ng of cfDNA used seven cycles of PCR and 2 ng cfDNA used eight cycles (PCR program: 30 seconds at 98 °C for initial denaturation, 10 seconds at 98 °C for denaturation, 30 seconds at 62 °C for annealing, 60 seconds at 65 °C for extension and five minutes at 65 °C for final extension. After PCR the final libraries were purified by SPRIselect bead, eluted in 15 µl Library Elution buffer (10 mM Tris-HCl pH 8.0) and quantified using TapeStation D5000 reagents (Agilent).

For six-letter seq, the five-letter seq protocol has been used with the following modifications. For the PNK/Klenow step, two additional components are added to the enzyme mix, 2.5 µl UDP glucose and 1 µl T4 beta-glucosyltransferase (both catalog no. EO0831, Thermo Fisher Scientific). After adapter 2 ligation, an additional protocol step is performed. In this additional step, 15 µl of DNA is combined with 3.95 µl of nuclease-free water (catalog no. W4502 Sigma), 0.75 µl of 100 mM ATP (catalog no. A6559, Sigma), 1.5 µl of 100 mM MgCl₂ (catalog no. M1028, Sigma), 0.4 µl of S-adenosylmethionin (catalog no. B9003S, NEB), 6 µl of 5X DNMT buffer (250 mM Tris-HCl pH 8.0, 10 mM DTT) and 2.4 µl of DNMT5 (Cambridge Epigenetix). This was incubated for 3 h at 37 °C.

To generate standard genomic libraries the KAPA Hyper Prep kit (catalog no. KK8500, Roche) was used according to the manufacturer’s protocol with the following modifications. Using a Covaris M220 sonicator, 100 ng of genomic DNA (Coriell, catalog no. NA12878) was sonicated to ~250 bp in 50 µl sonication tubes (Covaris) and used in the experiment. For adapter ligation, standard TRUSeq adapters were used (15 µM, Illumina). Ampure beads (Beckman Coulter) were used instead of KAPA clean-up beads and the elution volume was reduced to 21 µl using library dilution buffer (10 mM Tris-HCl, pH 8.0). For library amplification the KAPA Hifi Hot start PCR kit was used (catalog no. KK2500, Roche) according to the manufacturer’s protocol using four PCR cycles and 1 µM final primer concentration. Final libraries were quantified using D1000 HS screen tapes (Agilent).

EM-seq libraries were produced using the New England Biolabs EM-Seq kit (catalog no. E7120, NEB) according to the manufacturer’s protocol with the following modifications. Instead of using the EM-seq control DNA, the ground-truth spike-in DNAs described in above were used to allow a direct comparison. Denaturation was performed using Formamide (catalog no. F9037, Sigma). For the final PCR, seven cycles of PCR were used (80 ng gDNA input). Libraries were quantified using D1000 HS screen tapes (Agilent) on a Tapestation (catalog no. 4200, Agilent).

Before WGBS, the libraries were prepared using the EM-seq kit (catalog no. E7120, NEB) and the EpiTect Plus DNA Bisulfite kit (catalog no. 59124, Qiagen) with the following modifications. Instead of using the EM-seq control DNA, the ground-truth spike-in DNAs described above were used to allow a direct comparison. The EM-seq protocol was followed until Clean-Up of Adapter Ligated DNA and the eluate was then used as input for the EpiTect Plus DNA Bisulfite kit. The high-concentration sample setup was used for bisulfite conversion. To improve bisulfite conversion, the 20 µl elution was used for another round of conversion using the EpiTect Plus DNA Bisulfite kit (catalog no. 59124) again. After the second round of conversion, the DNA was PCR amplified using the EM-seq kit according to the manufacturer’s protocol starting from PCR amplification using eight cycles.

Sequencing was performed using the S4 Standard workflow on the Illumina NovaSeq. Libraries were quantified using the Tapestation D5000 (Agilent) and subsequently diluted to 1.2–1.8 nM for loading on the sequencer. To balance out the low cytosine content in deaminated libraries (WGBS, EM-seq, five-letter seq and six-letter seq), 8% PhiX (catalog no. FC-110-3001, Illumina) was added according to the manufacturer’s protocol. All libraries were run in a paired-end set up using either 200 or 300 cycle kits in a 111/8/8/111 or 151/8/8/151 base-reads setup.

Downsampling

To obtain data that were comparable across the different technologies, five-letter seq data was downsampled to 550 million paired-end reads and EM-seq and WGBS data was downsampled to 275 million paired-end reads. Downsampling was achieved using seqtk (https://github.com/lh3/seqtk).

Additionally, five-letter seq data were trimmed using flexbar (https://github.com/seqan/flexbar) from PE151 to PE111 to match the read length of the EM-seq and WGBS data.

Data processing of five- and six-letter seq

FASTQ files were trimmed and quality-filtered using fastp⁵⁰ before being processed through a resolution algorithm designed by Cambridge Epigenetix. The resolution algorithm corrects for any misalignment of the original and copy strands using a modified Needleman–Wunsch pairwise alignment. Errors identified by unexpected pairings of bases between the original and copy strand are suppressed in the resolved FASTQ file by being converted to N. Phred scores for resolved bases are determined using empirically calculated tables of quality scores that are both instrument and read-length specific. Read-pairs that failed to resolve, defined as having more than 5% unexpected base pairing, indicating they did not derive from the expected hairpin-connected original and copy strand constructs, were filtered out. The resolution approaches for five- and six-letter seq libraries differ merely by the resolution rules within this algorithm; resolution rules are described in Figs. 2 and 4.

Resolved FASTQ files were aligned using BWA-MEM⁵¹ to a standard four-letter reference genome comprising of both GRCh38 and spiked-in control sequences. Epigenetic information encoded in tags in the resolved FASTQ files was passed on into the aligned BAM files and stored using the MM tag. The aligned BAM files were then split into reads aligning to the genome and reads aligning to the controls; unmapped reads were filtered out. Reads aligning to the genome, to the methylated bacteriophage lambda control and to the unmethylated pUC19 control were deduplicated using Picard MarkDuplicates⁵². Reads aligned to the controls were downsampled to a mean coverage of 200× on each control genome before being deduplicated. A range of standard metrics were calculated on the genome-aligned reads using samtools⁵³, Qualimap⁵⁴, deepTools⁵⁵ and Picard⁵⁶. Accuracy of the genetic base calling was calculated relative to the known genotype of high-confidence regions of chromosome 20 of the NA12878 sample. Quantification of epigenetic modifications was calculated at each CpG, CHG and CHH site that was present in the reference genome and covered in the sequencing. This was performed using software developed by Cambridge Epigenetix. Likewise, quantification of epigenetic modifications was calculated at each CpG site in the methylated bacteriophage lambda control and the unmethylated pUC19 control. Sensitivity of modification calling was calculated from the methylated bacteriophage lambda control and specificity of modification calling was calculated from the unmethylated pUC19 control.

All of the processing was performed using a software pipeline developed by Cambridge Epigenetix, written in the Nextflow orchestration language and processed on the Google Cloud Platform (GCP).

Data processing of EM-seq and WGBS samples

Processing of EM-seq and WGBS samples was performed using a software pipeline written in the Nextflow orchestration language and processed on the GCP. Trimming of FASTQ files was performed using Trim Galore!⁵⁶. Alignment to a deaminated reference genome comprising of both GRCh38 and spiked-in control sequences was performed using bwa-meth⁵⁷ and deduplication was performed using Picard MarkDuplicates. Modification calling at CpG, CHG and CHH sites was performed using MethylDackel⁵⁸.

Data processing of Illumina sequencing samples

Processing of Illumina samples was performed using a software pipeline written in the Nextflow orchestration language and processed on the GCP. Trimming was performed using fastp. Alignment to a GRCh38 reference genome was performed using BWA-MEM and deduplication was performed using Picard MarkDuplicates.

Alignment runtimes were compared by first subsampling five-letter seq samples to one million reads and subsampling EM-seq, WGBS and Illumina sequencing reads to 500,000 reads. Alignment runtimes were then calculated by timing an alignment running on a single central processing unit (CPU).

Genetic accuracy metrics

Genetic accuracy metrics were computed using a table of empirical Phred scores. An empirical Phred score is a measure of genetic accuracy, expressed as a Q-score, evaluated by comparing observed read data to a truth set. In the table of Phred scores, empirically computed Phred scores and numbers of raw counts of correct and incorrect observations are stratified by base and nominal, that is instrument reported, Phred score. Empirical Phred scores were computed for five-letter seq, WGBS, EM-seq and Illumina sequencing, by considering NA12878 sequence data from chr20 for each of the technologies. The truth set was derived by masking the hg38 reference genome fasta file by the gold-standard Genome in a Bottle variant call data⁵⁸ using the Bedtools maskfasta command (v.2.25.0). Additionally, within chr20, only high-confidence regions, as defined by Genome in a Bottle, were considered. To compute empirical Phred scores, a Python script was used to query the read data in a pileup-oriented fashion using pysam (v.0.19.1). For each pileup, a reference base was defined (unless masked) and correct (matching) and incorrect (nonmatching) observations were tallied. N bases, both in the reference and in the observed read data, were ignored. For completeness, the observations were stratified by observed base and nominal Phred. Finally, Phred scores were computed using the equation −10 × log10(1−(no. correct/(no. correct + no. incorrect))) and rounded down to the nearest integer. In the case where no incorrect observations were made, a maximal Phred score of 60 was assigned.

Genetic accuracies were computed for each of the four technologies by considering the base and nominal Phred stratified correct and incorrect base-call counts from the empirical Phred score table. To this end, genetic accuracies were computed for each base type (A, C, G, T) by considering only count data from bases with a nominal Phred score greater than or equal to 25. This threshold allowed for evaluating genetic accuracies while avoiding data that either of the technologies classified as poor. Genetic accuracy was defined by the equation no. correct/(no.correct + no. incorrect). Overall genetic accuracies were computed by tallying counts across base types.

To examine variant calling performance, the five-letter seq reads from two 80 ng gDNA (NA12878) samples were pooled using samtools merge (v.1.15.1). The combined.bam file had a total read count of 842 million reads, equivalent to a mean coverage of 28.6×. The combined.bam file was then downsampled into four separate.bam files representing fractions of 0.25, 0.50, 0.75 and 1.0 times the original depth. This was achieved using samtools view–subsample {fraction}–subsample-seed 1 and resulted in.bam files of 211 million reads (7.1×), 421 million reads (14.3×), 632 million reads (21.5×) and 842 million reads (28.6×) respectively. Variant calling was then performed for each of these samples by GATK HaplotypeCaller (v.4.2.5.0) with default settings. For efficiency, HaplotypeCaller was run in parallel for each chromosome and the resulting VCF files were merged using GATK MergeVCFs. Finally, RTG Tools’ vcfeval function (v.3.12.1) was used to compare the five-letter seq variant call data to a ground-truth set defined by the Genome in a Bottle VCF file for NA12878 (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh38/). Evaluation was constrained to the high-confidence regions defined by the associated NA12878.bed file (same location). Overall variant calling performance, representing performance across both SNPs and indels, was extracted from the summary.txt output file. Reported sensitivity and precision metrics were associated with receiver operator curve score thresholds resulting in maximal F-measure.

Call-rate matrix for six-letter seq

A call-rate matrix is used to measure the accuracy of modC calls. Each cell in the matrix M represents the rate at which the method calls a particular modification X when the true modification is Y, for example, the cell M(unmodC, mC) represents the rate at which a method calls unmodC when the true modification status is modC. Each column of the matrix, corresponding to the rate at which we call each modification status for a particular true modification status, is estimated using a different spike-in control: a fully unmethylated pUC19 for the column corresponding to a true state of unmodC, a fully methylated lambda for the column corresponding to a true state of 5mC and a synthetic oligonucleotide for the column corresponding to a true state of 5hmC. For a given column, the rate is calculated as the proportion of bases with each modification status in the set of bases for which the genetic base call is C and which are aligned to CpGs in the given control sequence.

Allele-specific methylation calling

SNP calling was done using GATK (v.4.2.6) HaplotypeCaller. The resulting VCF file and the BAM file generated from the five-letter seq pipeline described above were parsed using pysam (v.0.19.1). For each heterozygous SNP site, a script counted the number of times a modC or unmodC call was associated with a CpG in a read containing each allele. Reads that overlapped with the variant site but did not contain a base aligned with the variant site or for which the base call did not match either of the two alleles were not included in these counts. Only sites that had at least six reads containing each allele were considered for ASM calling. The significance of the association was determined using Fisher’s exact test based on a contingency table of counts (variant alleles on one axis and unmodC/modC on the other). After calculating P values for all heterozygous variant sites that had sufficient reads for each allele, Benjamini–Hochberg’s correction was applied to minimize the false discovery rate; this was calculated using statsmodels (v.0.13.2).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Data described in this study are available from National Center for Biotechnology Information National Library of Medicine, Gene Expression Omnibus with no restrictions or conditions on access GSE208549 (ref. ⁵⁹).

Code availability

Source code for procedures described in this study is available in a public repository on github with no restrictions or conditions on access, https://github.com/cegx-ds/couplet.

References

He, L. et al. DNA methylation-free Arabidopsis reveals crucial roles of DNA methylation in regulating gene expression and development. Nat. Commun. 13, 1335 (2022).
CAS PubMed Central PubMed Google Scholar
Mazid, M. A. et al. Rolling back human pluripotent stem cells to an eight-cell embryo-like stage. Nature 605, 315–324 (2022).
CAS PubMed Google Scholar
Nachun, D. et al. Clonal hematopoiesis associated with epigenetic aging and clinical outcomes. Aging Cell 20, e13366 (2021).
CAS PubMed Central PubMed Google Scholar
Yokobayashi, S. et al. Inherent genomic properties underlie the epigenomic heterogeneity of human induced pluripotent stem cells. Cell Rep. 37, 109909 (2021).
CAS PubMed Google Scholar
Nishizawa, M. et al. Epigenetic variation between human induced pluripotent stem cell lines is an indicator of differentiation capacity. Cell Stem Cell 19, 341–354 (2016).
CAS PubMed Google Scholar
Gaunt, T. R. et al. Systematic identification of genetic influences on methylation across the human life course. Genome Biol. 17, 61 (2016).
PubMed Central PubMed Google Scholar
Ahmed, M. et al. CRISPRi screens reveal a DNA methylation-mediated 3D genome dependent causal mechanism in prostate cancer. Nat. Commun. 12, 1781 (2021).
CAS PubMed Central PubMed Google Scholar
Carreras-Gallo, N. et al. The early-life exposome modulates the effect of polymorphic inversions on DNA methylation. Commun. Biol. 5, 1–13 (2022).
Google Scholar
Kim, S.-T. et al. Abstract 916: combined genomic and epigenomic assessment of cell-free circulating tumor DNA (ctDNA) improves assay sensitivity in early-stage colorectal cancer (CRC). Cancer Res. 79, 916 (2019).
Google Scholar
Klein, E. A. et al. Clinical validation of a targeted methylation-based multi-cancer early detection test using an independent validation set. Ann. Oncol. 32, 1167–1177 (2021).
CAS PubMed Google Scholar
Gai, W. et al. Applications of genetic-epigenetic tissue mapping for plasma DNA in prenatal testing, transplantation and oncology. eLife 10, e64356 (2021).
CAS PubMed Central PubMed Google Scholar
Spruijt, C. G. et al. Dynamic readers for 5-(hydroxy)methylcytosine and its oxidized derivatives. Cell 152, 1146–1159 (2013).
CAS PubMed Google Scholar
Mellén, M., Ayata, P. & Heintz, N. 5-hydroxymethylcytosine accumulation in postmitotic neurons results in functional demethylation of expressed genes. PNAS 114, E7812–E7821 (2017).
PubMed Central PubMed Google Scholar
Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
CAS PubMed Central PubMed Google Scholar
Frommer, M. et al. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. PNAS 89, 1827–1831 (1992).
CAS PubMed Central PubMed Google Scholar
Vaisvila, R. et al. Enzymatic methyl sequencing detects DNA methylation at single-base resolution from picograms of DNA. Genome Res. https://www.genome.org/cgi/doi/10.1101/gr.266551.120 (2021).
Liu, Y. et al. Bisulfite-free direct detection of 5-methylcytosine and 5-hydroxymethylcytosine at base resolution. Nat. Biotechnol. 37, 424–429 (2019).
CAS PubMed Google Scholar
Cagan, A. et al. Somatic mutation rates scale with lifespan across mammals. Nature 604, 517–524 (2022).
CAS PubMed Central PubMed Google Scholar
Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
CAS PubMed Central PubMed Google Scholar
Xi, Y. & Li, W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinf. 10, 232 (2009).
Google Scholar
Booth, M. J. et al. Oxidative bisulfite sequencing of 5-methylcytosine and 5-hydroxymethylcytosine. Nat. Protoc. 8, 1841–1851 (2013).
CAS PubMed Central PubMed Google Scholar
Liu, Y. et al. Subtraction-free and bisulfite-free specific sequencing of 5-methylcytosine and its oxidized derivatives at base resolution. Nat. Commun. 12, 618 (2021).
PubMed Central PubMed Google Scholar
Yu, M. et al. Tet-assisted bisulfite sequencing of 5-hydroxymethylcytosine. Nat. Protoc. 7, 2159–2170 (2012).
CAS PubMed Central PubMed Google Scholar
Schutsky, E. K. et al. Nondestructive, base-resolution sequencing of 5-hydroxymethylcytosine using a DNA deaminase. Nat. Biotechnol. 36, 1083–1090 (2018).
CAS Google Scholar
Dahl, J. A. et al. Methods and kits for detection of methylation status. WO2013090588, USA (2013).
Kawasaki, Y. et al. A novel method for the simultaneous identification of methylcytosine and hydroxymethylcytosine at a single base resolution. Nucleic Acids Res. 45, e24 (2017).
PubMed Google Scholar
Tanaka, K. & Okamoto, A. Degradation of DNA by bisulfite treatment. Bioorg. Med. Chem. Lett. 17, 1912–1915 (2007).
CAS PubMed Google Scholar
Olova, N. et al. Comparison of whole-genome bisulfite sequencing library preparation strategies identifies sources of biases affecting DNA methylation data. Genome Biol. 19, 33 (2018).
PubMed Central PubMed Google Scholar
Liu, M. Y. et al. Mutations along a TET2 active site scaffold stall oxidation at 5-hydroxymethylcytosine. Nat. Chem. Biol. 13, 181–187 (2017).
CAS PubMed Google Scholar
Moréra et al. T4 phage beta-glucosyltransferase: substrate binding and proposed catalytic mechanism. J. Mol. Biol. 292, 717–730 (1999).
PubMed Google Scholar
Schutsky et al. APOBEC3A efficiently deaminates methylated, but not TET-oxidized, cytosine bases in DNA. Nucleic Acids Res. 45, 7655–7665 (2017).
CAS PubMed Central PubMed Google Scholar
Manelyte, L. et al. The unstructured C-terminal extension of UvrD interacts with UvrB, but is dispensable for nucleotide excision repair. DNA Repair (Amst) 8, 1300–1310 (2009).
CAS PubMed Google Scholar
Zook, J. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32, 246–251 (2014).
CAS PubMed Google Scholar
Foox, J. et al. The SEQC2 epigenomics quality control (EpiQC) study. Genome Biol. 22, 332 (2021).
CAS PubMed Central PubMed Google Scholar
Gao, S. et al. BS-SNPer: SNP calling in bisulfite-seq data. Bioinformatics 31, 4006–4008 (2015).
CAS PubMed Central PubMed Google Scholar
Do, C. et al. Genetic-epigenetic interactions in cis: a major focus in the post-GWAS era. Genome Biol. 18, 120 (2017).
PubMed Central PubMed Google Scholar
Onuchic, V. et al. Allele-specific epigenome maps reveal sequence-dependent stochastic switching at regulatory loci. Science 361, eaar3146 (2018).
PubMed Central PubMed Google Scholar
Pilvar, D. et al. Parent-of-origin-specific allelic expression in the human placenta is limited to established imprinted loci and it is stably maintained across pregnancy. Clin. Epigenetics 11, 94 (2019).
PubMed Central PubMed Google Scholar
Lianidou, E. Detection and relevance of epigenetic markers on ctDNA: recent advances and future outlook. Mol. Oncol. 15, 1683–1700 (2021).
CAS PubMed Central PubMed Google Scholar
Janku, F. et al. A novel method for liquid-phase extraction of cell-free DNA for detection of circulating tumor DNA. Sci. Rep. 11, 19653 (2021).
CAS PubMed Central PubMed Google Scholar
Tahiliani, M. et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 15, 930–935 (2009).
Google Scholar
Guler, G. D. et al. Detection of early stage pancreatic cancer using 5-hydroxymethylcytosine signatures in circulating cell free DNA. Nat. Commun. 11, 5270 (2020).
CAS PubMed Central PubMed Google Scholar
Walker et al. Hydroxymethylation profile of cell-free DNA is a biomarker for early colorectal cancer. Sci. Rep. 12, 16566 (2022).
CAS PubMed Central PubMed Google Scholar
Wang, J. et al. Structural insights into DNMT5-mediated ATP-dependent high-fidelity epigenome maintenance. Mol. Cell. 82, 1186–1198 (2022).
CAS PubMed Central PubMed Google Scholar
DeNizio, J. E. et al. TET-TDG active DNA demethylation at CpG and non-CpG sites. J. Mol. Biol. 433, 166877 (2021).
CAS PubMed Central PubMed Google Scholar
Tse, O. Y. O. et al. Genome-wide detection of cytosine methylation by single molecule real-time sequencing. PNAS 118, e2019768118 (2021).
PubMed Central PubMed Google Scholar
Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407–410 (2017).
CAS PubMed Google Scholar
Katsman, E. et al. Detecting cell-of-origin and cancer-specific methylation features of cell-free DNA from Nanopore sequencing. Genome Biol. 23, 158 (2022).
CAS PubMed Central PubMed Google Scholar
Foox, J. et al. Performance assessment of DNA sequencing platforms in the ABRF Next-Generation Sequencing Study. Nat. Biotechnol. 39, 1129–1140 (2021).
CAS PubMed Central PubMed Google Scholar
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
PubMed Central PubMed Google Scholar
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. https://doi.org/10.48550/arXiv.1303.3997 (2013).
Picard v.2.23.8. & v.2.25.4. (The Broad Institute Picard, 2020 & 2021). http://broadinstitute.github.io/picard/
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
PubMed Central PubMed Google Scholar
Okonechnikov, K., Conesa, A. & García-Alcalde, F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 32, 292–294 (2016).
CAS PubMed Google Scholar
Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
PubMed Central PubMed Google Scholar
Kruger, F. TrimGalore, v.0.0.6. (The Babraham Institute, 2020). https://github.com/FelixKrueger/TrimGalore
Pedersen, B. S., Eyring, K., De, S., Yang, I. V. & Schwartz, D. A. Fast and accurate alignment of long bisulfite-seq reads. https://doi.org/10.48550/arXiv.1401.1129 (2014).
Ryan D. MethylDackyl, v.0.5.2. (Ryan, 2021). https://github.com/dpryan79/MethylDackel
Creed, P., Lumby C. K., Morley, D. M., Holbrook, J. D. Accurate simultaneous sequencing of genetic and epigenetic bases in DNA GSE208549 Gene Expression Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE208549 (2022).

Download references

Acknowledgements

We would like to acknowledge the practical assistance and helpful comments of current and former employees of Cambridge Epigenetix.

Author information

These authors contributed equally: Jens Füllgrabe, Walraj S Gosal.

Authors and Affiliations

Cambridge Epigenetix Ltd, The Trinity Building, Chesterford Research Park, Cambridge, UK
Jens Füllgrabe, Walraj S. Gosal, Páidí Creed, Sidong Liu, Casper K. Lumby, David J. Morley, Tobias W. B. Ost, Albert J. Vilella, Shirong Yu, Helen Bignell, Philippa Burns, Tom Charlesworth, Beiyuan Fu, Howerd Fordham, Nicolas J. Harding, Olga Gandelman, Paula Golder, Christopher Hodson, Mengjie Li, Marjana Lila, Yang Liu, Joanne Mason, Jason Mellad, Jack M. Monahan, Oliver Nentwich, Alexandra Palmer, Michael Steward, Minna Taipale, Audrey Vandomme, Rita Santo San-Bento, Ankita Singhal, Julia Vivian, Natalia Wójtowicz, Nathan Williams, Nicolas J. Walker, Nicola C. H. Wong, Gary N. Yalloway & Joanna D. Holbrook
Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
Shankar Balasubramanian
Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
Shankar Balasubramanian

Authors

Jens Füllgrabe
View author publications
You can also search for this author in PubMed Google Scholar
Walraj S. Gosal
View author publications
You can also search for this author in PubMed Google Scholar
Páidí Creed
View author publications
You can also search for this author in PubMed Google Scholar
Sidong Liu
View author publications
You can also search for this author in PubMed Google Scholar
Casper K. Lumby
View author publications
You can also search for this author in PubMed Google Scholar
David J. Morley
View author publications
You can also search for this author in PubMed Google Scholar
Tobias W. B. Ost
View author publications
You can also search for this author in PubMed Google Scholar
Albert J. Vilella
View author publications
You can also search for this author in PubMed Google Scholar
Shirong Yu
View author publications
You can also search for this author in PubMed Google Scholar
Helen Bignell
View author publications
You can also search for this author in PubMed Google Scholar
Philippa Burns
View author publications
You can also search for this author in PubMed Google Scholar
Tom Charlesworth
View author publications
You can also search for this author in PubMed Google Scholar
Beiyuan Fu
View author publications
You can also search for this author in PubMed Google Scholar
Howerd Fordham
View author publications
You can also search for this author in PubMed Google Scholar
Nicolas J. Harding
View author publications
You can also search for this author in PubMed Google Scholar
Olga Gandelman
View author publications
You can also search for this author in PubMed Google Scholar
Paula Golder
View author publications
You can also search for this author in PubMed Google Scholar
Christopher Hodson
View author publications
You can also search for this author in PubMed Google Scholar
Mengjie Li
View author publications
You can also search for this author in PubMed Google Scholar
Marjana Lila
View author publications
You can also search for this author in PubMed Google Scholar
Yang Liu
View author publications
You can also search for this author in PubMed Google Scholar
Joanne Mason
View author publications
You can also search for this author in PubMed Google Scholar
Jason Mellad
View author publications
You can also search for this author in PubMed Google Scholar
Jack M. Monahan
View author publications
You can also search for this author in PubMed Google Scholar
Oliver Nentwich
View author publications
You can also search for this author in PubMed Google Scholar
Alexandra Palmer
View author publications
You can also search for this author in PubMed Google Scholar
Michael Steward
View author publications
You can also search for this author in PubMed Google Scholar
Minna Taipale
View author publications
You can also search for this author in PubMed Google Scholar
Audrey Vandomme
View author publications
You can also search for this author in PubMed Google Scholar
Rita Santo San-Bento
View author publications
You can also search for this author in PubMed Google Scholar
Ankita Singhal
View author publications
You can also search for this author in PubMed Google Scholar
Julia Vivian
View author publications
You can also search for this author in PubMed Google Scholar
Natalia Wójtowicz
View author publications
You can also search for this author in PubMed Google Scholar
Nathan Williams
View author publications
You can also search for this author in PubMed Google Scholar
Nicolas J. Walker
View author publications
You can also search for this author in PubMed Google Scholar
Nicola C. H. Wong
View author publications
You can also search for this author in PubMed Google Scholar
Gary N. Yalloway
View author publications
You can also search for this author in PubMed Google Scholar
Joanna D. Holbrook
View author publications
You can also search for this author in PubMed Google Scholar
Shankar Balasubramanian
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

J.F., W.S.G., S.L., D.J.M., O.N., T.W.B.O., M.S., A.V., N.J.W., S.Y., H.B., R.S.S., J.D.H. and S.B. conceptualized the presented ideas. J.F., W.S.G., S.L., T.W.B.O., S.Y., H.B., P.B., H.F., O.G., J. Mason, O.N., A.P., P.G., M.T., A.V., M. Liu, M. Lila and J. Mellad conceptualized, planned and carried out the experiments. P.C., C.K.L., D.J.M., A.V., N.J.H., J. Monahan, N. Wójtowicz, N. Williams, N.J.W., N.C.H.W. and J.D.H. planned and carried out the analyses. W.S.G., T.W.B.O., T.C., P.B., B.F., C.H., Y.L., M.S., A.S., J.V., G.N.Y., J. Mason and J. Mellad conceptualized, planned and carried out reagent preparation. All authors contributed to the interpretation of the results. J.D.H. and S.B. took the lead in writing the manuscript. P.C. took the lead in creating figures. All authors provided critical feedback and helped shape the research, analysis and manuscript.

Corresponding authors

Correspondence to Joanna D. Holbrook or Shankar Balasubramanian.

Ethics declarations

Competing interests

S.B. is a founder, adviser and shareholder of Cambridge Epigenetix and of Inflex. All the other authors are current or former employees and hold share options. Patents covering this work and the methodologies described in this manuscript have been filed by Cambridge Epigenetix (patent applicant), inventors are S.B., J.F., W.S.G., J.D.H., S.L., D.M., O.N., T.O., M.S., A.V., N.J.W., S.Y, H.R.B. and R.S.S.-B. The application numbers are WO2022023753A1 (published), US20220298551A1 (pending), US20220290215A1 (issued), EP4034676A1(pending) and EP4083231A1 (pending).

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–7 and Tables 1–3.

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Füllgrabe, J., Gosal, W.S., Creed, P. et al. Simultaneous sequencing of genetic and epigenetic bases in DNA. Nat Biotechnol 41, 1457–1464 (2023). https://doi.org/10.1038/s41587-022-01652-0

Download citation

Received: 12 August 2022
Accepted: 16 December 2022
Published: 06 February 2023
Issue Date: October 2023
DOI: https://doi.org/10.1038/s41587-022-01652-0

This article is cited by

Simultaneous single-cell analysis of 5mC and 5hmC with SIMPLE-seq
- Dongsheng Bai
- Xiaoting Zhang
- Chengqi Yi
Nature Biotechnology (2024)
Nanopore DNA sequencing technologies and their applications towards single-molecule proteomics
- Adam Dorey
- Stefan Howorka
Nature Chemistry (2024)
Direct methylation sequencing
- Abdulkadir Abakir
- Wolf Reik
Nature Chemical Biology (2023)
Speed reading the epigenome and genome
- James M. George
- Arul M. Chinnaiyan
Nature Biotechnology (2023)
Mapping epigenetic modifications by sequencing technologies
- Xiufei Chen
- Haiqi Xu
- Chun-Xiao Song
Cell Death & Differentiation (2023)