DNA 5-methylcytosine detection and methylation phasing using PacBio circular consensus sequencing

Long single-molecular sequencing technologies, such as PacBio circular consensus sequencing (CCS) and nanopore sequencing, are advantageous in detecting DNA 5-methylcytosine in CpGs (5mCpGs), especially in repetitive genomic regions. However, existing methods for detecting 5mCpGs using PacBio CCS are less accurate and robust. Here, we present ccsmeth, a deep-learning method to detect DNA 5mCpGs using CCS reads. We sequence polymerase-chain-reaction treated and M.SssI-methyltransferase treated DNA of one human sample using PacBio CCS for training ccsmeth. Using long (≥10 Kb) CCS reads, ccsmeth achieves 0.90 accuracy and 0.97 Area Under the Curve on 5mCpG detection at single-molecule resolution. At the genome-wide site level, ccsmeth achieves >0.90 correlations with bisulfite sequencing and nanopore sequencing using only 10× reads. Furthermore, we develop a Nextflow pipeline, ccsmethphase, to detect haplotype-aware methylation using CCS reads, and then sequence a Chinese family trio to validate it. ccsmeth and ccsmethphase can be robust and accurate tools for detecting DNA 5-methylcytosines.


Supplementary Figures
Supplementary Fig. 1 Illustration of inferring methylation frequency of CpGs at site level using count mode and model mode.  Fig. 7 Comparison of the count mode and model mode of ccsmeth using 71.0× HG002 CCS reads. a Distribution of the number of CpGs in terms of measuring whether the methylation frequencies of model mode or count mode is closer to that of BS-seq. Rb, Rc, Rm represent the methylation frequencies of a CpG calculated by BS-seq, count mode of ccsmeth, and model mode of ccsmeth, respectively. Gc contains CpGs whose | Rb -Rc | -| Rb -Rm | < -0.1, while Gm contains CpGs whose | Rb -Rc | -| Rb -Rm | > 0.1. b Distribution of the "True" methylation frequencies (calculated by BS-seq) of the CpGs in the whole genome, Gc, and Gm, respectively. c Comparison of genome-wide per-site methylation frequency between the count mode and model mode of ccsmeth. Fig. 12 Methylation phasing of ccsmethphase on SD0651_P1 CCS data. a Distribution of methylation differences of known imprinted intervals calculated using CCS data between two haplotypes of SD0651_P1. 93 out of 102 "well-characterized" intervals, and 91 out of 102 "other" intervals which have at least 5 CpGs covered by CCS reads in each haplotype are analyzed. The boxes inside the violin plots indicate 50th percentile (middle line), 25th and 75th percentile (box), the smallest value within 1.5 times interquatile range below 25th percentile and largest value within 1.5 times interquatile range above 75th percentile (whiskers). Source data are provided as a Source Data file. b Screenshot of Integrative Genomics Viewer (chr20:60,671,001-60,673,750) on a DMR of SD0651_P1 near the maternally imprinted gene GNAS. Red and blue dots represent CpGs with high and low methylation probabilities, respectively.

Supplementary
Supplementary Fig. 13 Distribution of the number of CCS-generated DMRs in terms of distance to the closest BS-seq-generated (a) and ONT-generated DMR (b) in HG002. ONT: nanopore sequencing. Source data are provided as a Source Data file. Supplementary Fig. 14 Distribution of the number of known imprinted intervals in terms of distance to the closest CCS-generated DMR in HG002 and SD0651_P1. Source data are provided as a Source Data file. Supplementary Fig. 15 5mCpG detection and methylation phasing of the HN0641 family trio using ccsmethphase. a Distribution of methylation frequencies of CpGs in HN0641_FA (father), HN0641_MO (mother), and HN0641_S1 (son). b Distribution of methylation differences of known imprinted intervals between two haplotypes of HN0641_S1. 93 out of 102 "well-characterized" intervals, and 93 out of 102 "other" intervals which have at least 5 CpGs covered by CCS reads in each haplotype are analyzed. The boxes inside the violin plots indicate 50th percentile (middle line), 25th and 75th percentile (box), the smallest value within 1.5 times interquatile range below 25th percentile and largest value within 1.5 times interquatile range above 75th percentile (whiskers). c Distribution of the number of known imprinted intervals in terms of distance to the closest CCS-generated DMR in HN0641_S1. Source data underlying b and c are provided as a Source Data file. Fig. 16 Screenshot of Integrative Genomics Viewer 1 (chr20:60,664,900-60,673,999) on a DMR of HN0641_S1 near the maternally imprinted gene GNAS, showing the variants information of the HN0641 family trio, and the phased methylation information of HN0641_S1. Red and blue dots in the "Methylation" area represent CpGs with high and low methylation probabilities, respectively. Fig. 17 Screenshot of Integrative Genomics Viewer (chr7:95,890,500-95,894,600) on a DMR of HN0641_S1 near the maternally imprinted gene PEG10, showing the variants information of the HN0641 family trio, and the phased methylation information of HN0641_S1. Red and blue dots in the "Methylation" area represent CpGs with high and low methylation probabilities, respectively. Fig. 18 Screenshot of Integrative Genomics Viewer (chr14:95,056,500-95,066,00) on a DMR of HN0641_S1 near the paternally imprinted gene MEG3, showing the variants information of the HN0641 family trio, and the phased methylation information of HN0641_S1. Red and blue dots in the "Methylation" area represent CpGs with high and low methylation probabilities, respectively. Fig. 19 Comparison of the number of total (a) and phased (b) CpGs detected by the HG002 BS-seq (117.5×), ONT (65.8×), and CCS (71.0×) reads in non-RepeatMasker regions.

Supplementary Fig. 20
Comparing ccsmeth and primrose with BS-seq for 5mCpG detection of a Zebrafish sample. Fig. 21 ccsmeth for strand-specific methylation detection. a The model framework of ccsmeth for strand-specific methylation detection. b Comparison of the strand-specific-methylation model and the symmetric-methylation model of ccsmeth at read-level 5mCpG detection using long CCS reads. c Accuracy of the strand-specific-methylation model under different subread depths. Fig. 22 Main steps of HK model, ccsmeth, and primrose for methylation calling. Fig. 23 Runtime of 8 main processes in ccsmethphase. 10 SMRT cells of CCS reads (2 SMRT cells for each of the 5 "samples": HG002 (15Kb), HG002 (20Kb), HG002 (24Kb), CHM13 (20Kb), and SD0651_P1 (15Kb)) were used in this test. Source data are provided as a Source Data file.

Supplementary Tables
Supplementary Table 1

Supplementary Note 1 Evaluation of ccsmeth for 5mCpG detection at read level in different genomic contexts and regions
Different genomic regions may vary in sequence contexts and methylation levels 8 . To explore whether the performance of ccsmeth is correlated with any genomic features, we further examine ccsmeth for read-level 5mCpG detection in different genomic contexts and regions using the datasets of HG002 and SD0651_P1. We consider the following genomic contexts and regions: (1)  We use the high-confidence methylated and unmethylated CpGs of HG002 and SD0651_P1 for the read-level evaluation ( Supplementary Fig. 3a). As shown in Supplementary Fig. 3b, compared to the genome-wide performance, ccsmeth has much higher accuracies in non-singletons and CpG islands but has lower accuracies in singletons, indicating that ccsmeth tends to have higher performances in regions with high CpG densities. ccsmeth has relative lower accuracies in intergenic regions, CpG shores, and CpG shelves. In simple repeats and "Others" repetitive regions, ccsmeth has lower sensitivities and specificities, respectively. On all four datasets, the results of primrose show consistent patterns with ccsmeth across all tested regions. The results indicate that biologically relevant genomic contexts and regions do impact the performance of 5mCpG detection. Further studies are needed to focus on improving the performance of 5mCpG detection in specific genomic regions.

Supplementary Note 2 Comparison of the count mode and model mode of ccsmeth
We use the HG002 CCS datasets (71.0×) to compare the methylation frequencies calculated by the count mode and model mode of ccsmeth: Suppose Rb, Rc, Rm are the methylation frequencies of a CpG calculated by BS-seq, count mode of ccsmeth, and model mode of ccsmeth, respectively. We use | Rb -Rc | -| Rb -Rm | to measure whether Rc or Rm is closer to Rb. If | Rb -Rc | -| Rb -Rm | > 0.1, meaning the model mode has a more accurate prediction than count mode, we classify the CpG into the group Gm. If | Rb -Rc | -| Rb -Rm | < -0.1, we classify the CpG into the group Gc. We find that among the total tested 29,174,320 CpGs, 3,975,014 CpGs are classified to Gm, while 644,370 CpGs are classified to Gc (Supplementary Fig. 7a). The methylation frequencies of CpGs in the two groups show significant differences: CpGs in Gm tend to have either very low (<0.2) or high (>0.8) methylation frequencies, while CpGs in Gc tend to have intermediate methylation frequencies (Supplementary Fig. 7b). The comparison of genome-wide per-site methylation frequency between the count mode and model mode of ccsmeth is shown in Supplementary Fig. 7c.

Supplementary Note 3 Pipeline for haplotype-aware methylation calling using Illumina whole-genome sequencing (WGS) trio data and BS-seq data
To evaluate the methylation phasing pipeline on CCS data, we performed haplotype-aware methylation calling using WGS and BS-seq reads of HG002 as the benchmark (Supplementary Fig. 8). In this pipeline, we used SNPsplit (version 0.5.0) 9 to assign BS-seq reads to the haplotypes of HG002. SNPsplit requires information of heterozygous SNVs of HG002 and the origin of these SNVs for accurate reads alignment and splitting. Thus, we downloaded the Illumina WGS reads of AshkenazimTrio: HG003 is the father, HG004 is the mother, HG002 is the son. We used BWA-MEM (version 0.7.17-r1194-dirty) 10 to align the WGS reads, and then used DeepTrio 11 (version 1.3.0) to call SNVs for HG002, HG003, and HG004. Then, we used the following rules as in NanoMethPhase 12  (1) where S represents an SNV. After phasing, we generated a chromosome-level SNV phasing result (i.e., all heterozygous SNVs of HG002 inherited from HG004 (mother) were assigned to Haplotype 1, and all heterozygous SNVs inherited from HG003 (father) were assigned to Haplotype 2).
We used Bismark 13 to align BS-seq reads to genome reference. Then, we used SNPsplit to assign the aligned BS-seq reads to haplotypes. We got the methylation profile of each haplotype using Bismark. At last, we got differentially methylated regions (DMRs) of the two haplotypes using DSS (version 2.44.0) 14 ( Supplementary  Fig. 8).

Supplementary Note 5 The model architecture of ccsmeth
(1) bidirectional GRU A bidirectional GRU 19 layer includes a forward GRU and a backward GRU to catch both the forward and reverse flow of features. Suppose x1, x2,…, xt are a sequence of features, each time step contains four features: the nucleotide base, the mean IPD value, the mean PW value, and the number of subreads. A GRU cell will recursively calculate the hidden layer h as follows: where W and b are weight matrices and biases. xt is the input feature; rt is a reset gate; zt is an update gate; ht is the hidden state; and ℎ � represents information that needs to be updated in the current cell. The outputs of forward and backward GRU are combined as: ℎ , = ℎ , ⨁ℎ , (6) (2) Bahdanau attention Bahdanau attention 20 receives all the hidden states of RNN cells and outputs context vector as follows: score(ℎ , ℎ ) = tanh( 1 ℎ + 2 ℎ ) (7) = softmax(score(ℎ , ℎ )) (8) = ℎ (9) where ht represents the hidden state in the output vector of BiGRU; hs contains the final hidden state for an element in the sequence from GRU; W1 and W2 are weight matrices.
(3) Softmax activation function to output methylated/unmethylated probabilities A softmax activation layer is used in ccsmeth to predict the methylated and unmethylated probabilities of one sample as follows: where x0 and x1 are two outputs from the former fully connected layer, for calculating unmethylated and methylated probabilities, respectively. (4) The cross-entropy loss function used for training the read-level model is as follows: where z is the true label vector and y is the predicted methylated probability vector from the softmax function. (5) The mean squared error (MSE) loss function for training the site-level model is as follows: where xi is the predicted methylation frequency, yi is the true methylation frequency, n is the number of samples.

Supplementary Note 6 Testing ccsmethphase using CCS data of the HN0641 family trio
We sequenced three human samples of a Chinese family trio using CCS, and got 2 SMRT cells of CCS reads for each of the three samples: HN0641_FA (father), HN0641_MO (mother), HN0641_S1 (son). We tested ccsmethphase using the CCS reads of the family trio. As shown in Supplementary Fig. 15a, the results indicate that the three samples have similar methylation levels. 11.9%, 12.0%, 10.4% CpGs of HN0641_FA, HN0641_MO, HN0641_S1, respectively, have low (≤0.3) methylation frequencies, while 75.4%, 73.4%, 80.6% CpGs of the three samples, respectively, have high (≥0.7) methylation frequencies.
We then examined the haplotype-aware methylation status of known imprinted regions in HN0641_S1. As shown in Supplementary Fig. 15b, the well-characterized imprinted intervals show large methylation differences between the two haplotypes (median=0.51). 14.0% of other known imprinted intervals also show large (>0.5) methylation differences. The result of HN0641_S1 is consistent with the results in HG002 and SD0651_P1.
Using ccsmethphase, we generated 2,813 DMRs from the CCS data of HN0641_S1. The DMRs cover 108 (52.9%) of the known imprinted intervals (i.e., 108 known imprinted intervals are overlapped with the CCSgenerated DMRs) (Supplementary Fig. 15c). Moreover, the haplotype phasing results of the family trio show that ccsmethphase not only can detect imprinted intervals, but also reveals the pattern of parental imprinting correctly. In the results of ccsmethphase, the maternally imprinted intervals (e.g., GNAS_Ex1A and PEG10) in HN0641_S1 show high methylation levels in the haplotype inherited from mother, while the paternally imprinted intervals (e.g., MEG3) show high methylation levels in the haplotype inherited from father ( Supplementary Figs. 16-18).

Supplementary Note 7 Computational efficiency of ccsmeth and the ccsmethphase pipeline
We compared the runtime (wall clock time) and peak memory of ccsmeth with HK model and primrose. The comparison was performed at an HPC cluster containing two kinds of servers: (1) Server-CPU with 48 CPU cores (Intel(R) Xeon(R) Gold 6248R CPU @ 3.0GHz) and 192 GB RAM; (2) Server-GPU with 40 CPU cores (Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz), 384 GB RAM, and 2 Nvidia Tesla V100 GPU cards. As shown in Supplementary Fig. 22, ccsmeth and primrose (+pb_CpG_tools) contain the same steps for methylation calling, while HK model has a different pipeline. The differences are mainly in the following aspects: (1) HK model extracts features from subreads, while ccsmeth and primrose take CCS reads as input.
(2) HK model needs aligned reads for feature extraction, while the "Call per-read methylation" step of ccsmeth and primrose can be performed before or after the "Align" step. (3) There is no module or script in HK model for calculating site-level methylation frequency.
We first subsampled 100K ZMW reads from each of the three HG002 CCS datasets (15Kb, 20Kb, 24Kb) to compare all three methods. As shown in Supplementary Table 18, HK model is very time-consuming, especially in the "Extract features" step. This is mainly because HK model directly extracts features from subreads, and the script of HK model for "Extract features" is not optimized for parallel processing. We further used 1 SMRT cell CCS reads from each of the three HG002 datasets to compare primrose and ccsmeth (Supplementary Table  19). As shown in Supplementary Tables 18-19, primrose is extremely fast in calling per-read methylation. primrose takes ~6-10 minutes to call per-read methylation from 1 SMRT cell of CCS data, while ccsmeth needs ~3-6 hours. However, when CCS reads have been called from the raw subreads, the whole pipeline of ccsmeth takes at most 8 hours to call methylation from 1 SMRT cell CCS data. Compared to primrose which takes at most ~2.4 hours, ccsmeth can also be used in practice. In the future, we will continue to optimize ccsmeth in terms of computational efficiency.
We also evaluated the runtime of 8 main processes in the ccsmethphase pipeline: SAMTOOLS_index_bam for indexing the CCS bam files, CCSMETH_call_mods for calling methylation in CCS reads, PBMM2 for aligning CCS reads to the reference genome, SAMTOOLS_merge_bam for merging alignment bam files of the same "sample", CLAIR3 for calling SNVs, WHATSHAP_phase_haplotag for phasing SNVs and reads, CCSMETH_call_freq for calling methylation frequencies of CpGs, and DSS for calling DMRs. The data used for evaluation include 10 SMRT cells of CCS reads used for testing in this study, in which there are 2 SMRT cells of CCS reads for each of the 5 "samples": HG002 (15Kb), HG002 (20Kb), HG002 (24Kb), CHM13 (20Kb), and SD0651_P1 (15Kb) (Supplementary Table 2). Details of the applied computing resources for the processes are shown in Supplementary Table 20. The runtime of the processes is shown in Supplementary Fig. 23. Note that for the first 3 processes, the runtime for each SMRT cell is shown. For the last 5 processes, the runtime for each "sample" (2 SMRT cells) is shown. The evaluation indicates that for a human sample with 2 SMRT cells of CCS reads, methylation phasing and ASM detection can be performed in less than 14 hours using ccsmethphase even on a single server (Supplementary Fig. 23).