Samples

Genomic DNA was obtained from the Coriell Institute (https://www.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=NA12878).

Double-spun plasma was obtained from Geneticist Inc. from a single female patient with a gastric cancer diagnosis. cfDNA was extracted using the NextPrep-Ma kit on the Chemagic Prime platform (Perkin Elmer chemagen Technologie GmbH).

Ground truth controls

An 80-base pair (bp) oligonucleotide with balanced CG content and no homology to human or lambda genome or pUC19, and symmetrical 5hmC at one CpG and asymmetrical 5hmC at another, was designed and purchased from ATDBio Ltd. (Supplementary Fig. 7). The oligonucleotide pair was diluted in 100 mM potassium acetate, 30 mM HEPES, pH 7.5 (IDT) and annealed by heating to 94 °C for two minutes and cooling gradually to 4 °C.

For the unmethylated pUC19 control, the plasmid (NEB) was transformed into dcm-/dam- E. coli chemically competent cells (NEB). Plasmid was isolated from culture using a QIAprep Spin Miniprep Kit (Qiagen). Methylated lambda DNA was prepared from unmethylated lambda DNA (Promega). Briefly, 1 μg DNA was incubated with eight units of CpG methyltransferase M.SssI (NEB), NEB buffer 2 and 320 μM S-adenosyl-methionine (NEB) at 37 °C. After four hours, a further two units of M.SssI were added and the reaction was incubated for a further four hours at 37 °C. The DNA was purified using a Zymo clean and concentrator column. Complete methylation was checked by digestion with methylation-sensitive HpaII (NEB) and methylation-insensitive MspI (NEB). A negative control using unmethylated lambda DNA was also prepared and used to check digestion with MspI and HpaII. Both methylated lambda DNA and unmethylated pUC19 were diluted in 10 mM Tris-HCl pH 8.0, 0.1 mM EDTA and fragmented to ~250 bp using a Covaris M220.

Laboratory processing

For five- and six-letter seq, variable amounts of input material were used according to the manufacturer’s protocol (Cambridge Epigenetix). Genomic DNA was sonicated using the Covaris M220 Sonicator set to a target size of 250 bp. Other target sizes are compatible with the method and achieve similar yields (Supplementary Table 3). The fragment length profile of cfDNA is conserved in the sequencing libraries (Supplementary Fig. 5).

For five-letter seq, cfDNA or sheared gDNA was mixed with 1 µl of spike-in control (at 0.5 ng/μl for main DNA inputs of ≥80 ng and 0.05 ng/μl for inputs ≤10 ng), 3.5 µl End Prep reaction buffer and 1.5 µl End Prep Enzyme Mix for end repair and A-tailing (catalog no. E7647, NEB). The reaction was incubated at 20 °C for 30 min followed by 65 °C for 30 min. In the same mix, adapter 1 (ATGACGATGCGTTCGAGCATCGUCAUT, all Cs are methylated, Biomers.net GmbH) was ligated using 3.75 µl of adapter 1, 0.5 µl of ligation enhancer and 15 µl of ligation master mix (both catalog no. E7647, NEB) and incubated at 20 °C for 15 min. Afterwards, SPRIselect magnetic beads (catalog no. B23319, Beckman Coulter) were added to the solution and a clean-up was performed according to the manufacturer’s protocol. The library was eluted in 23.75 µl nuclease-free water (catalog no. W4502, Sigma), and 3 µl of 10× rCutSmart buffer (catalog no. B6004S, NEB) and 3.25 µl USER (catalog no. M5505, NEB) were added. This was incubated for 30 min at 37 °C. Next, the DNA was purified using SPRI magnetic beads (catalog no. B23319, Beckman Coulter) according to the manufacturer’s protocol. DNA was eluted in 12 µl Library Elution buffer (10 mM Tris-HCl pH 8.0) and used subsequently in strand synthesis. Here, 2 µl of each of 10× NEB buffer 4 (catalog no. B7004S, NEB), dNTPs (10 mM each, catalog no. R0192, Thermo Fisher Scientific), Klenow (exo-) (catalog no. P7010-LC, Thermo Fisher Scientific) and T4 PNK (catalog no. EK0032, Thermo Fisher Scientific) were added and incubated for 30 min at 37 °C. The reaction was then denatured at 95 °C for two minutes and gradually cooled (at −0.1 °C/second) to enable reannealing. Immediately afterwards, adapter 2 (forward: ACACTCTTTCCCTACACGACGCTCTTCCGATC*T, * indicates phosphorothioate; reverse: GATCGGAAGAGCACACGTCTGAACTCCAGTCA; all Cs are mC, Biomers.net GmbH) was ligated by adding 2.5 µl of adapter 2, 0.5 µl of ligation enhancer, 15 µl of ligation master mix (both catalog no. E7647, NEB) and 12 µl nuclease-free water (catalog no. W4502, Sigma). This was incubated for 15 min at 20 °C. Next, the DNA was purified by another SPRIselect bead purification. Elution from beads was performed in 30 µl, and then 10 µl of reconstituted TET2 5× supplement buffer (10 mM α-ketoglutarate, 0.25 M Tris-HCl pH 8.0, 10 mM ATP), 1 µl of UDP glucose, 1 µl of T4 beta-glucosyltransferase (both catalog no. EO0831, Thermo Fisher Scientific), 1 µl of 100 mM DTT and 2 µl of TET2 (Cambridge Epigenetix) were added. After diluting 500 mM Fe(II) sulfate hexahydrate (Sigma) 1:1250 in nuclease-free water (catalog no. W4502, Sigma), 5 µl of this dilution was added to the DNA and incubated for 60 min at 37 °C. Next, the converted DNA was purified by a SPRIselect bead clean-up according to the manufacturer’s protocol, eluted in 31 µl of nuclease-free water and subjected to a second enzymatic conversion reaction. For this, 12.75 µl nuclease-free water, 17.5 µl 4× APOBEC buffer (200 mM BisTris pH 6.1, 0.4% Tween), 1.75 µl 100 mM ATP (catalog no. A6559, Sigma), 3.5 µl 100 mM MgCl2 (catalog no. M1028, Sigma), 2 µl APOBEC-A3A (Cambridge Epigenetix) and 2.5 µl of UvrD Helicase (Cambridge Epigenetix) were added. The reaction was incubated for 1.5 h at 37 °C and cleaned up afterwards with another SPRIselect bead-mediated purification according to the manufacturer’s protocol. The DNA was eluted in 20 µl nuclease-free water and amplified by PCR using 5 µl of Abclonal Unique Dual Index Primers for Illumina (Abclonal no. RK21624_SetA) and 25 µl of 2× KAPA HiFi U+ Polymerase (no. KK2802). For 80 ng of gDNA, six cycles of PCR were used, whereas 10 ng of cfDNA used seven cycles of PCR and 2 ng cfDNA used eight cycles (PCR program: 30 seconds at 98 °C for initial denaturation, 10 seconds at 98 °C for denaturation, 30 seconds at 62 °C for annealing, 60 seconds at 65 °C for extension and five minutes at 65 °C for final extension).
After PCR, the final libraries were purified using SPRIselect beads, eluted in 15 µl Library Elution buffer (10 mM Tris-HCl pH 8.0) and quantified using TapeStation D5000 reagents (Agilent).

For six-letter seq, the five-letter seq protocol was used with the following modifications. For the PNK/Klenow step, two additional components were added to the enzyme mix: 2.5 µl UDP glucose and 1 µl T4 beta-glucosyltransferase (both catalog no. EO0831, Thermo Fisher Scientific). After adapter 2 ligation, an additional protocol step was performed. In this additional step, 15 µl of DNA was combined with 3.95 µl of nuclease-free water (catalog no. W4502, Sigma), 0.75 µl of 100 mM ATP (catalog no. A6559, Sigma), 1.5 µl of 100 mM MgCl2 (catalog no. M1028, Sigma), 0.4 µl of S-adenosylmethionine (catalog no. B9003S, NEB), 6 µl of 5× DNMT buffer (250 mM Tris-HCl pH 8.0, 10 mM DTT) and 2.4 µl of DNMT5 (Cambridge Epigenetix). This was incubated for 3 h at 37 °C.

To generate standard genomic libraries, the KAPA Hyper Prep kit (catalog no. KK8500, Roche) was used according to the manufacturer’s protocol with the following modifications. Using a Covaris M220 sonicator, 100 ng of genomic DNA (Coriell, catalog no. NA12878) was sonicated to ~250 bp in 50 µl sonication tubes (Covaris) and used in the experiment. For adapter ligation, standard TruSeq adapters were used (15 µM, Illumina). AMPure beads (Beckman Coulter) were used instead of KAPA clean-up beads and the elution volume was reduced to 21 µl using library dilution buffer (10 mM Tris-HCl, pH 8.0). For library amplification, the KAPA HiFi HotStart PCR kit was used (catalog no. KK2500, Roche) according to the manufacturer’s protocol using four PCR cycles and 1 µM final primer concentration. Final libraries were quantified using D1000 HS screen tapes (Agilent).

EM-seq libraries were produced using the New England Biolabs EM-seq kit (catalog no. E7120, NEB) according to the manufacturer’s protocol with the following modifications. Instead of using the EM-seq control DNA, the ground-truth spike-in DNAs described above were used to allow a direct comparison. Denaturation was performed using formamide (catalog no. F9037, Sigma). For the final PCR, seven cycles of PCR were used (80 ng gDNA input). Libraries were quantified using D1000 HS screen tapes (Agilent) on a TapeStation (catalog no. 4200, Agilent).

For WGBS, libraries were prepared using the EM-seq kit (catalog no. E7120, NEB) and the EpiTect Plus DNA Bisulfite kit (catalog no. 59124, Qiagen) with the following modifications. Instead of using the EM-seq control DNA, the ground-truth spike-in DNAs described above were used to allow a direct comparison. The EM-seq protocol was followed until the Clean-Up of Adapter Ligated DNA step and the eluate was then used as input for the EpiTect Plus DNA Bisulfite kit. The high-concentration sample setup was used for bisulfite conversion. To improve bisulfite conversion, the 20 µl elution was put through a second round of conversion using the same EpiTect Plus DNA Bisulfite kit. After the second round of conversion, the DNA was PCR amplified using the EM-seq kit according to the manufacturer’s protocol, starting from the PCR amplification step and using eight cycles.

Sequencing was performed using the S4 Standard workflow on the Illumina NovaSeq. Libraries were quantified using the TapeStation D5000 (Agilent) and subsequently diluted to 1.2–1.8 nM for loading on the sequencer. To balance out the low cytosine content in deaminated libraries (WGBS, EM-seq, five-letter seq and six-letter seq), 8% PhiX (catalog no. FC-110-3001, Illumina) was added according to the manufacturer’s protocol. All libraries were run in a paired-end setup using either 200- or 300-cycle kits in a 111/8/8/111 or 151/8/8/151 base-read configuration.

Downsampling

To obtain data that were comparable across the different technologies, five-letter seq data were downsampled to 550 million paired-end reads and EM-seq and WGBS data were downsampled to 275 million paired-end reads. Downsampling was achieved using seqtk (https://github.com/lh3/seqtk).
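
As a rough illustration of seeded paired-end downsampling, the sketch below applies reservoir sampling to read-pair indices so that both mates of a pair are kept or dropped together. It is not the seqtk implementation; the function names and the seed value are hypothetical.

```python
import random


def sample_pair_indices(n_pairs, n_keep, seed=11):
    """Choose n_keep read-pair indices out of n_pairs by reservoir sampling."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    reservoir = []
    for i in range(n_pairs):
        if i < n_keep:
            reservoir.append(i)
        else:
            j = rng.randint(0, i)  # bounds are inclusive
            if j < n_keep:
                reservoir[j] = i
    return sorted(reservoir)


def downsample_pairs(r1_records, r2_records, n_keep, seed=11):
    """Apply one index selection to both mates so that pairs stay in sync."""
    if len(r1_records) != len(r2_records):
        raise ValueError("R1 and R2 must contain the same number of records")
    keep = sample_pair_indices(len(r1_records), n_keep, seed)
    return [r1_records[i] for i in keep], [r2_records[i] for i in keep]
```

Reusing the same seed for the R1 and R2 files is what keeps mates paired when the two files are sampled separately.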

Additionally, five-letter seq data were trimmed using flexbar (https://github.com/seqan/flexbar) from PE151 to PE111 to match the read length of the EM-seq and WGBS data.
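
The length trimming amounts to truncating each read and its quality string to the first 111 cycles. A minimal sketch, assuming each FASTQ record is held as a four-field tuple (the record layout and the function name are illustrative, and flexbar itself offers many more options):

```python
def trim_record(record, max_len=111):
    """Truncate one FASTQ record (name, sequence, separator, qualities) to max_len cycles."""
    name, seq, sep, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    return (name, seq[:max_len], sep, qual[:max_len])
```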

Data processing of five- and six-letter seq

FASTQ files were trimmed and quality-filtered using fastp (ref. 50) before being processed through a resolution algorithm designed by Cambridge Epigenetix. The resolution algorithm corrects for any misalignment of the original and copy strands using a modified Needleman–Wunsch pairwise alignment. Errors identified by unexpected pairings of bases between the original and copy strand are suppressed in the resolved FASTQ file by being converted to N. Phred scores for resolved bases are determined using empirically calculated tables of quality scores that are both instrument and read-length specific. Read pairs that failed to resolve, defined as having more than 5% unexpected base pairing (indicating that they did not derive from the expected hairpin-connected original and copy strand constructs), were filtered out. The resolution approaches for five- and six-letter seq libraries differ only in the resolution rules within this algorithm; resolution rules are described in Figs. 2 and 4.
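
The masking and filtering logic can be sketched as follows for two reads that have already been aligned to equal length. This is not the Cambridge Epigenetix algorithm: the pairing table is an assumed, illustrative stand-in for the rules in Figs. 2 and 4, modification status would be tracked separately in practice, and all names are hypothetical.

```python
# Assumed pairing table: (original-strand base, copy-strand-derived base) -> resolved base.
# Any pairing that is missing from the table is treated as an error.
EXAMPLE_RULES = {
    ("A", "A"): "A",
    ("T", "T"): "T",
    ("G", "A"): "G",
    ("C", "C"): "C",  # cytosine protected from deamination
    ("T", "C"): "C",  # cytosine deaminated on the original strand
}


def resolve_pair(original, copy, rules=EXAMPLE_RULES, max_unexpected=0.05):
    """Resolve two pre-aligned reads base by base.

    Returns (resolved_sequence, passed). Unexpected pairings become N, and
    passed is False when their fraction exceeds max_unexpected.
    """
    if len(original) != len(copy):
        raise ValueError("reads must be aligned to equal length first")
    resolved = [rules.get(pair, "N") for pair in zip(original, copy)]
    fraction = resolved.count("N") / len(resolved) if resolved else 0.0
    return "".join(resolved), fraction <= max_unexpected
```

For example, `resolve_pair("ATGCT", "ATACC")` resolves every position, whereas a read pair with one unexpected pairing in ten bases exceeds the 5% limit and is flagged for removal.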

Resolved FASTQ files were aligned using BWA-MEM (ref. 51) to a standard four-letter reference genome comprising both GRCh38 and spiked-in control sequences. Epigenetic information encoded in tags in the resolved FASTQ files was passed on into the aligned BAM files and stored using the MM tag. The aligned BAM files were then split into reads aligning to the genome and reads aligning to the controls; unmapped reads were filtered out. Reads aligning to the genome, to the methylated bacteriophage lambda control and to the unmethylated pUC19 control were deduplicated using Picard MarkDuplicates (ref. 52). Reads aligned to the controls were downsampled to a mean coverage of 200× on each control genome before being deduplicated. A range of standard metrics were calculated on the genome-aligned reads using samtools (ref. 53), Qualimap (ref. 54), deepTools (ref. 55) and Picard (ref. 56). Accuracy of the genetic base calling was calculated relative to the known genotype of high-confidence regions of chromosome 20 of the NA12878 sample. Quantification of epigenetic modifications was calculated at each CpG, CHG and CHH site that was present in the reference genome and covered in the sequencing. This was performed using software developed by Cambridge Epigenetix. Likewise, quantification of epigenetic modifications was calculated at each CpG site in the methylated bacteriophage lambda control and the unmethylated pUC19 control. Sensitivity of modification calling was calculated from the methylated bacteriophage lambda control and specificity of modification calling was calculated from the unmethylated pUC19 control.
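
One plausible reading of these two control-based metrics is sketched below. The exact definitions used by the pipeline are not given here, so the formulas are assumptions: sensitivity is taken as the fraction of CpG cytosine calls on the fully methylated lambda control that are called modified, and specificity as the fraction on the unmethylated pUC19 control that are called unmodified.

```python
def _fraction(numerator, denominator):
    """Return numerator / denominator, rejecting an empty denominator."""
    if denominator <= 0:
        raise ValueError("no cytosine calls available for this control")
    return numerator / denominator


def sensitivity(lambda_modified_calls, lambda_total_calls):
    """Fraction of calls on methylated lambda CpGs that are called modified (assumed definition)."""
    return _fraction(lambda_modified_calls, lambda_total_calls)


def specificity(puc19_unmodified_calls, puc19_total_calls):
    """Fraction of calls on unmethylated pUC19 CpGs that are called unmodified (assumed definition)."""
    return _fraction(puc19_unmodified_calls, puc19_total_calls)
```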

All of the processing was performed using a software pipeline developed by Cambridge Epigenetix, written in the Nextflow orchestration language and processed on the Google Cloud Platform (GCP).

Data processing of EM-seq and WGBS samples

Processing of EM-seq and WGBS samples was performed using a software pipeline written in the Nextflow orchestration language and processed on the GCP. Trimming of FASTQ files was performed using Trim Galore! (ref. 56). Alignment to a deaminated reference genome comprising both GRCh38 and spiked-in control sequences was performed using bwa-meth (ref. 57) and deduplication was performed using Picard MarkDuplicates. Modification calling at CpG, CHG and CHH sites was performed using MethylDackel (ref. 58).

Data processing of Illumina sequencing samples

Processing of Illumina samples was performed using a software pipeline written in the Nextflow orchestration language and processed on the GCP. Trimming was performed using fastp. Alignment to a GRCh38 reference genome was performed using BWA-MEM and deduplication was performed using Picard MarkDuplicates.

Alignment runtimes were compared by first subsampling five-letter seq samples to one million reads and subsampling EM-seq, WGBS and Illumina sequencing reads to 500,000 reads. Alignment runtimes were then calculated by timing an alignment running on a single central processing unit (CPU).

Genetic accuracy metrics

Genetic accuracy metrics were computed using a table of empirical Phred scores. An empirical Phred score is a measure of genetic accuracy, expressed as a Q-score, evaluated by comparing observed read data to a truth set. In the table of Phred scores, empirically computed Phred scores and raw counts of correct and incorrect observations are stratified by base and by nominal (that is, instrument-reported) Phred score. Empirical Phred scores were computed for five-letter seq, WGBS, EM-seq and Illumina sequencing by considering NA12878 sequence data from chr20 for each of the technologies. The truth set was derived by masking the hg38 reference genome fasta file by the gold-standard Genome in a Bottle variant call data (ref. 58) using the Bedtools maskfasta command (v.2.25.0). Additionally, within chr20, only high-confidence regions, as defined by Genome in a Bottle, were considered. To compute empirical Phred scores, a Python script was used to query the read data in a pileup-oriented fashion using pysam (v.0.19.1). For each pileup, a reference base was defined (unless masked) and correct (matching) and incorrect (nonmatching) observations were tallied. N bases, both in the reference and in the observed read data, were ignored. For completeness, the observations were stratified by observed base and nominal Phred score. Finally, Phred scores were computed using the equation −10 × log10(1 − (no. correct/(no. correct + no. incorrect))) and rounded down to the nearest integer. In the case where no incorrect observations were made, a maximal Phred score of 60 was assigned.
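
The final step can be written as a short function. This is a sketch rather than the script used in the study; it computes the error rate directly as no. incorrect/(no. correct + no. incorrect), which is algebraically the same as 1 − accuracy but avoids a floating-point subtraction. The function name is illustrative.

```python
import math


def empirical_phred(n_correct, n_incorrect, max_q=60):
    """Empirical Phred score from tallies of matching and non-matching bases."""
    total = n_correct + n_incorrect
    if total == 0:
        raise ValueError("no observations")
    if n_incorrect == 0:
        return max_q  # no errors observed: assign the maximal score
    error_rate = n_incorrect / total  # equals 1 - accuracy
    return math.floor(-10 * math.log10(error_rate))
```

With 9,995 correct and 5 incorrect observations the error rate is 5 × 10^−4, giving −10 × log10(5 × 10^−4) ≈ 33.01, which rounds down to Q33.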

Genetic accuracies were computed for each of the four technologies from the correct and incorrect base-call counts in the empirical Phred score table, which are stratified by base and nominal Phred score. Genetic accuracies were computed for each base type (A, C, G, T) by considering only count data from bases with a nominal Phred score greater than or equal to 25. This threshold allowed genetic accuracies to be evaluated while avoiding data that any of the technologies classified as poor. Genetic accuracy was defined by the equation no. correct/(no. correct + no. incorrect). Overall genetic accuracies were computed by tallying counts across base types.
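
A minimal sketch of this aggregation, assuming the empirical Phred table is available as rows of (base, nominal Phred score, no. correct, no. incorrect); the row layout and function name are illustrative.

```python
def genetic_accuracy(rows, min_nominal_q=25):
    """Per-base and overall accuracy from stratified correct/incorrect counts.

    rows: iterable of (base, nominal_q, n_correct, n_incorrect).
    Returns (per_base_accuracy, overall_accuracy).
    """
    counts = {}
    for base, nominal_q, n_correct, n_incorrect in rows:
        if nominal_q < min_nominal_q:
            continue  # skip strata below the nominal quality threshold
        tally = counts.setdefault(base, [0, 0])
        tally[0] += n_correct
        tally[1] += n_incorrect
    per_base = {
        base: ok / (ok + bad) for base, (ok, bad) in counts.items() if ok + bad > 0
    }
    total_ok = sum(ok for ok, _ in counts.values())
    total_bad = sum(bad for _, bad in counts.values())
    if total_ok + total_bad == 0:
        raise ValueError("no observations at or above the threshold")
    return per_base, total_ok / (total_ok + total_bad)
```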

To examine variant calling performance, the five-letter seq reads from two 80 ng gDNA (NA12878) samples were pooled using samtools merge (v.1.15.1). The combined .bam file had a total read count of 842 million reads, equivalent to a mean coverage of 28.6×. The combined .bam file was then downsampled into four separate .bam files representing fractions of 0.25, 0.50, 0.75 and 1.0 times the original depth. This was achieved using samtools view --subsample {fraction} --subsample-seed 1 and resulted in .bam files of 211 million reads (7.1×), 421 million reads (14.3×), 632 million reads (21.5×) and 842 million reads (28.6×), respectively. Variant calling was then performed for each of these samples by GATK HaplotypeCaller (v.4.2.5.0) with default settings. For efficiency, HaplotypeCaller was run in parallel for each chromosome and the resulting VCF files were merged using GATK MergeVCFs. Finally, RTG Tools’ vcfeval function (v.3.12.1) was used to compare the five-letter seq variant call data to a ground-truth set defined by the Genome in a Bottle VCF file for NA12878 (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh38/). Evaluation was constrained to the high-confidence regions defined by the associated NA12878 .bed file (same location). Overall variant calling performance, representing performance across both SNPs and indels, was extracted from the summary.txt output file. Reported sensitivity and precision metrics were those associated with the receiver operating characteristic curve score threshold that resulted in the maximal F-measure.
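
The F-measure is the harmonic mean of precision and sensitivity, so selecting the reported operating point amounts to picking the score threshold that maximizes it. A small sketch, assuming the per-threshold metrics have already been parsed into (threshold, precision, sensitivity) tuples; this is not a parser for the vcfeval output, and the example values in the test are made up.

```python
def f_measure(precision, sensitivity):
    """Harmonic mean of precision and sensitivity."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


def best_operating_point(rows):
    """Return the (threshold, precision, sensitivity) row with the maximal F-measure."""
    rows = list(rows)
    if not rows:
        raise ValueError("no thresholds supplied")
    return max(rows, key=lambda row: f_measure(row[1], row[2]))
```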

Call-rate matrix for six-letter seq

A call-rate matrix is used to measure the accuracy of modC calls. Each cell in the matrix M represents the rate at which the method calls a particular modification X when the true modification is Y; for example, the cell M(unmodC, mC) represents the rate at which a method calls unmodC when the true modification status is mC. Each column of the matrix, corresponding to the rate at which we call each modification status for a particular true modification status, is estimated using a different spike-in control: a fully unmethylated pUC19 for the column corresponding to a true state of unmodC, a fully methylated lambda for the column corresponding to a true state of 5mC and a synthetic oligonucleotide for the column corresponding to a true state of 5hmC. For a given column, the rate is calculated as the proportion of bases with each modification status in the set of bases for which the genetic base call is C and which are aligned to CpGs in the given control sequence.
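
In code, each column is a set of call counts from one control, normalized to sum to one. A minimal sketch, assuming the counts of C base calls at control CpGs have already been tallied per true state; the state labels, data layout and function name are illustrative.

```python
STATES = ("unmodC", "5mC", "5hmC")


def call_rate_matrix(counts):
    """Build M[(called, true)] from counts[true_state][called_state].

    Each true state corresponds to one spike-in control, and each column
    (fixed true state) is normalized by that control's total C calls.
    """
    matrix = {}
    for true_state, called_counts in counts.items():
        total = sum(called_counts.get(state, 0) for state in STATES)
        if total == 0:
            raise ValueError(f"no calls for control with true state {true_state}")
        for called in STATES:
            matrix[(called, true_state)] = called_counts.get(called, 0) / total
    return matrix
```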

Allele-specific methylation calling

SNP calling was performed using GATK (v.4.2.6) HaplotypeCaller. The resulting VCF file and the BAM file generated from the five-letter seq pipeline described above were parsed using pysam (v.0.19.1). For each heterozygous SNP site, a script counted the number of times a modC or unmodC call was associated with a CpG in a read containing each allele. Reads that overlapped with the variant site but did not contain a base aligned with the variant site, or for which the base call did not match either of the two alleles, were not included in these counts. Only sites that had at least six reads containing each allele were considered for ASM calling. The significance of the association was determined using Fisher’s exact test based on a contingency table of counts (variant alleles on one axis and unmodC/modC on the other). After calculating P values for all heterozygous variant sites that had sufficient reads for each allele, the Benjamini–Hochberg correction was applied to control the false discovery rate; this was calculated using statsmodels (v.0.13.2).
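
The statistical part of this procedure can be sketched with the standard library alone. The functions below are dependency-free stand-ins written for illustration (a two-sided Fisher's exact test and a Benjamini–Hochberg adjustment), not the study's script and not the statsmodels implementation; all names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of a table with x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one
    total = sum(
        p for p in map(prob, range(low, high + 1)) if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


def asm_p_value(allele1, allele2, reads1, reads2, min_reads=6):
    """P value for one site, or None if either allele has too few reads.

    allele1 and allele2 are (modC, unmodC) call counts for each allele.
    """
    if reads1 < min_reads or reads2 < min_reads:
        return None
    return fisher_exact_two_sided(allele1[0], allele1[1], allele2[0], allele2[1])


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A site where one allele carries ten modC calls and no unmodC calls while the other shows the reverse gives P = 2/184,756, about 1.1 × 10^−5, before correction.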

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.