Advances in methylation analysis of liquid biopsy in early cancer detection of colorectal and lung cancer

Methylation patterns in cell-free DNA (cfDNA) have emerged as a promising genomic feature for detecting the presence of cancer and determining its origin. The purpose of this study was to evaluate the diagnostic performance of methylation-sensitive restriction enzyme digestion followed by sequencing (MRE-Seq) using cfDNA, and to investigate the cancer signal origin (CSO) using deep neural network (DNN) analysis for liquid biopsy of colorectal and lung cancer. We developed a selective MRE-Seq method with a DNN-based prediction model using demethylated-sequence-depth patterns from 63,266 CpG sites using SacII enzyme digestion. A total of 191 patients with stage I–IV cancers (95 lung cancers and 96 colorectal cancers) and 126 noncancer participants were enrolled in this study. Our study showed an area under the receiver operating characteristic curve (AUC) of 0.978 with a sensitivity of 78.1% for colorectal cancer, and an AUC of 0.956 with a sensitivity of 66.3% for lung cancer, both at a specificity of 99.2%. For colorectal cancer, sensitivities for stages I–IV ranged from 76.2 to 83.3%, while for lung cancer, sensitivities for stages I–IV ranged from 44.4 to 78.9%, both again at a specificity of 99.2%. The CSO model's true-positive rates were 94.4% and 89.9% for colorectal and lung cancers, respectively. MRE-Seq was found to be a useful method for detecting global hypomethylation patterns in liquid biopsy samples and accurately diagnosing colorectal and lung cancers, as well as for determining the CSO using DNN analysis. Trial registration: This trial was registered at ClinicalTrials.gov (registration number: NCT04253509) for lung cancer on 5 February 2020, https://clinicaltrials.gov/ct2/show/NCT04253509. Colorectal cancer samples were retrospectively registered at CRIS (Clinical Research Information Service, registration number: KCT0008037) on 23 December 2022, https://cris.nih.go.kr, https://who.init/ictrp.
Healthy control samples were retrospectively registered.

by selectively cutting and sequencing unmethylated restriction sites in cancer DNA 29,30. This process enriches ctDNA molecules and enhances sensitivity, which can improve the accuracy of early cancer detection (Fig. 1). Second, MRE-Seq causes less DNA degradation during sample preparation than bisulfite conversion, allowing robust analysis with a relatively small amount of cfDNA 31,32.
Next-generation sequencing (NGS) was performed using MRE-Seq, which uses the methylation-sensitive restriction enzyme SacII to capture and sequence unmethylated restriction sites in cfDNA. As cancer develops, global DNA hypomethylation occurs, leading to accelerated demethylation in both regulatory and intragenic regions. In the cancer genome, this demethylation occurs in distinct regions, producing a specific methylation pattern that can be detected using MRE-Seq. By analyzing this pattern in cfDNA obtained by liquid biopsy, it is possible to detect and diagnose cancer.
We investigated the utility of the proposed MRE-Seq method in diagnosing cancer by analyzing liquid biopsy samples from patients with lung and colorectal cancers. Lung and colorectal cancer are the two leading causes of global cancer-related deaths 33. Although the 5-year survival rates among patients with late-stage lung cancer and colorectal cancer remain below 20% and 14%, respectively, they can be increased to 70% and 90%, respectively, if the cancer is detected at an early stage 34–37. Low-dose computed tomography (LDCT) is commonly used for the early detection of lung cancer, but it has a high rate of false-positive results and exposes patients to radiation 38. Fecal occult blood tests and colonoscopies are currently recommended for the early detection of colorectal cancer, but both methods have limitations. The fecal occult blood test is a relatively simple procedure but is less accurate than colonoscopy, and colonoscopy can be inconvenient and invasive for patients 39. To overcome the limitations of current cancer screening methods, liquid biopsy analysis of cfDNA has been explored as a safer, more convenient, and potentially more effective alternative for early cancer detection.
In this prospective study, the diagnostic performance of the new MRE-Seq method was evaluated for the detection of cancer and for the classification of the cancer signal origin (CSO) using deep neural network (DNN) analysis. The aim of the study was to determine the accuracy of MRE-Seq for detecting cancer-specific DNA methylation patterns in cfDNA from liquid biopsy samples, and to investigate the potential of the DNN method for detecting the presence of cancer and identifying the type of cancer tissue. By exploring the diagnostic potential of MRE-Seq and DNN analysis, this study may contribute to the development of more accurate and effective methods for early cancer screening and diagnosis.

Methods
Study subjects. Treatment-naïve and histologically confirmed patients with lung cancer and colorectal cancer were enrolled in this study at Samsung Medical Center and Bucheon St. Mary's Hospital, respectively. Patients with a history (within five years) of other malignancy were not included. For healthy controls, participants with no history of cancer diagnosis were enrolled at Gangnam Major Hospital.
cfDNA library construction for MRE-seq. Eight-mL tubes of whole blood were collected (NICE® cfDNA tube) (EDGC, South Korea) and centrifuged at 1900×g for 10 min and 13,000×g for 5 min for plasma separation. Samples with a hemoglobin level of ≤ 100 mg/dL were used in further analysis. The separated plasma was stored at −70 °C until use.
cfDNA was extracted from 3.5-4 mL of plasma with the chemagic cfNA 5k Kit special H24 (Perkin Elmer) on a chemagic™ 360 instrument according to the product manual. Extracted cfDNA was purified using HiAccuBead (Accugene) with 2X. cfDNA concentration was measured with a Qubit 2.0 fluorometer (Thermo Fisher Scientific). The extracted cfDNA was stored at −20 °C until use. cfDNA (10-25 ng) was used for end-repair and A-tailing. Then, a p7 adapter with a 10-bp unique molecular index (UMI) was ligated to the cfDNA at 3 µM with T4 DNA ligase (NEB, USA) at 25 °C for 2 h. After that, the p7-ligated cfDNA was treated with SacII and ligated to a p5 adapter carrying a cohesive end compatible with SacII-digested ends. PCR amplification was performed for 11 cycles using the p7 universal primer (5′-CAA GCA GAA GAC GGC ATA CGA-3′) and the p5 universal primer (5′-AAT GAT ACG GCG ACC ACC GA-3′) with Taq DNA polymerase (Supplementary Table 1). Finally, PCR-amplified libraries between 200 and 550 bp were size-selected using PippinHT (Sage Science, USA). High-throughput NGS was performed on an Illumina NovaSeq 6000 sequencer in 100-bp paired-end (PE) mode (Supplementary Table 2).
Data processing. NGS data were obtained in binary base call (BCL) sequence file format and converted to Fastq format using bcl2fastq v2.20. Read quality was examined with FastQC 40 after removing reads shorter than 20 bp and single-end reads. The UMI sequence located at the beginning of R2 reads was used for deduplication. Reads containing even a single N or Q0 base in the UMI sequence were dropped during the quality-trimming step. BWA-MEM 0.7.15 41 was used to align the processed Fastq sequences to the hg19 human reference genome and convert them into binary alignment map (BAM) file format. In-house software removed PCR duplicates and corrected sequencing errors using UMI sequence tags.
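The UMI quality filter described above can be illustrated with a minimal sketch. It assumes Phred+33 quality encoding (in which the character '!' denotes Q0) and a 10-bp UMI at the start of R2; the function name is hypothetical and this is not the in-house software used in the study.

```python
UMI_LENGTH = 10  # 10-bp UMI on the p7 adapter, read at the start of R2


def umi_passes_filter(r2_sequence, r2_quality, umi_length=UMI_LENGTH):
    """Return True when the UMI at the start of R2 has no N and no Q0 base.

    Assumption: qualities are Phred+33 encoded, so '!' marks a Q0 base.
    """
    umi = r2_sequence[:umi_length]
    umi_quality = r2_quality[:umi_length]
    if len(umi) < umi_length or len(umi_quality) < umi_length:
        return False  # read too short to carry a complete UMI
    return "N" not in umi.upper() and "!" not in umi_quality


# Toy reads: a clean UMI, a UMI with an N, and a UMI with a Q0 base.
keep = umi_passes_filter("ACGTACGTACGGGTT", "IIIIIIIIIIIIIII")
drop_n = umi_passes_filter("ACGTNCGTACGGGTT", "IIIIIIIIIIIIIII")
drop_q0 = umi_passes_filter("ACGTACGTACGGGTT", "III!IIIIIIIIIII")
```

Only the first 10 bases are inspected, so an N or Q0 base later in the read does not cause the read to be dropped by this particular check.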
There are 67,472 SacII sites in the hg19 human genome, and the 63,266 SacII sites on autosomal chromosomes (excluding sex chromosomes) were used as markers for analysis. For downstream analysis, the deduplicated read depth of each SacII site was normalized by a trimmed mean, obtained by averaging the depth over all SacII sites after excluding 10% outliers (Fig. S1).
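The trimmed-mean normalization can be sketched as follows. Two details are assumptions, because the text does not specify them: the 10% of outliers is split evenly between the lowest and highest depths, and normalization means dividing each site depth by the trimmed mean. Function names are hypothetical.

```python
def trimmed_mean(depths, trim_fraction=0.10):
    """Mean depth after dropping trim_fraction of the values.

    Assumption: the trimmed values are split evenly between both tails.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    ordered = sorted(depths)
    k = int(len(ordered) * trim_fraction / 2)  # values dropped per tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


def normalize_depths(depths, trim_fraction=0.10):
    """Divide each per-site depth by the sample's trimmed mean depth."""
    scale = trimmed_mean(depths, trim_fraction)
    if scale == 0:
        raise ValueError("trimmed mean depth is zero; cannot normalize")
    return [d / scale for d in depths]


# Toy sample: 20 hypothetical SacII-site depths with one extreme value.
toy_depths = [10] * 19 + [1000]
normalized = normalize_depths(toy_depths)
```

In the toy sample the extreme depth of 1000 is excluded from the scaling factor, so the typical sites normalize to 1.0 instead of being compressed by a single outlier.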

Deep learning modeling.
We implemented a multi-layer feed-forward neural network consisting of two hidden layers between the input and output layers. The normalized depth values corresponding to the 63,266 SacII sites entered the input layer and went through two hidden layers consisting of 64 nodes with a Rectified www.nature.com/scientificreports/ decreased value (Fig. S2). To build an accurate and robust prediction model, the dataset was split into training, testing, and validation sets. The training set comprised the samples used to fit the model, whereas the validation set was used to tune the hyperparameters, i.e., the number of layers and nodes and the batch and epoch sizes. The model was trained with the best parameters and then evaluated on the test dataset (Supplementary Methods, Fig. S3).
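The network layout described at the start of this section (input layer, two hidden layers of 64 nodes, one output) can be sketched as a plain forward pass. This is an illustration only: the ReLU hidden activation, the sigmoid output, the weight initialization, and the toy input size of 100 sites are assumptions, and no training step is shown.

```python
import math
import random


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer; weights holds one row per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random weights (assumed scaling) and zero biases for one layer."""
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def cancer_score(features, layers):
    """Forward pass: ReLU on hidden layers, sigmoid on the single output."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    logit = dense(activations, weights, biases)[0]
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
n_sites = 100  # toy stand-in for the 63,266 SacII sites
layers = [init_layer(n_sites, 64, rng),
          init_layer(64, 64, rng),
          init_layer(64, 1, rng)]
toy_sample = [rng.random() for _ in range(n_sites)]
score = cancer_score(toy_sample, layers)
```

With untrained random weights the score carries no diagnostic meaning; the sketch only shows how normalized depths flow through the layers to a value between 0 and 1.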
Ethics approval and consent to participate. Approvals were obtained from the institutional review boards (IRBs) at the Samsung Medical Center (IRB: SMC 2019-11-080), Bucheon St. Mary's Hospital (IRB: HC17TOSI0032), and Gangnam Major Hospital (IRB: DR_CPLX_001). Written informed consent was obtained from each study participant before enrollment. This study was conducted in accordance with the Declaration of Helsinki.

Results
Study participants. Whole blood samples were collected from 327 participants: 102 with colorectal cancer, 99 with lung cancer, and 126 healthy controls. After excluding six patients who withdrew consent to participate and two patients with QC-failed samples, the final analysis included 96 patients with colorectal cancer, 95 with lung cancer, and 126 healthy controls for model training and performance evaluation. The colorectal cancer cohort was composed of 74 colon cancer samples and 22 rectal cancer samples, and the lung cancer cohort was composed of 86 non-small-cell lung cancer (NSCLC) samples and 9 small-cell lung cancer (SCLC) samples (Table 1).
MRE-seq of cfDNA. SacII, a methylation-sensitive restriction enzyme, was used for MRE-seq-based liquid biopsy in this study. Approximately 90% of reads produced by MRE-seq were mapped to the hg19 reference genome. After deduplication based on the UMI, 42-52% of the originally mapped reads remained. Between 96 and 99% of the 63,266 target SacII sites were covered at a depth of at least one read. Among the deduplicated reads, those whose 5′ end matched the "GCGG" sequence of the SacII cut site were defined as on-target reads, and the ratio of on-target reads to deduplicated reads was defined as the on-target ratio. The on-target ratio of samples ranged from 50 to 57%, with no significant difference among colorectal cancer, lung cancer, and healthy control samples (Fig. S4).
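The on-target ratio defined above reduces to a prefix check on each deduplicated read. A minimal sketch, assuming reads are available as plain sequence strings oriented 5′ to 3′ (the function name and toy reads are hypothetical):

```python
SACII_CUT_PREFIX = "GCGG"  # 5' end left on a read after SacII digestion


def on_target_ratio(read_sequences, prefix=SACII_CUT_PREFIX):
    """Fraction of deduplicated reads whose 5' end matches the cut site."""
    if not read_sequences:
        raise ValueError("no reads supplied")
    on_target = sum(1 for seq in read_sequences
                    if seq.upper().startswith(prefix))
    return on_target / len(read_sequences)


# Toy input: two reads start with GCGG, two do not.
toy_reads = ["GCGGATTACA", "gcggTTTT", "ACGTACGT", "TTGCGGAA"]
ratio = on_target_ratio(toy_reads)
```

A read that merely contains GCGG internally, such as the last toy read, is not counted, because only the 5′ end reflects a SacII cut.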
The heatmap plot of the top 1,000 markers from each cancer type showed distinguishable patterns with high statistical power (Student's t-test P < 1 × 10⁻⁷) for differentiating cancer from healthy controls (Fig. S5, Supplementary Table 3A,B).
Among the 63,266 target sites, most SacII sites were located in introns (31.0%; 19,649), promoters (25.7%; 16,285), and intergenic regions (24.8%; 15,699), followed by exons (9.3%; 5,871) and 5′ UTRs (3.5%; 2,240); this broad distribution is suitable for global hypomethylation analysis. SHapley Additive exPlanations (SHAP) 42 assigns each feature an importance value after model training (Supplementary Table 4A,B). The top 1,000 markers with high feature importance were obtained with SHAP from our DNN model, and these markers were also evenly distributed in regulatory and intergenic regions (Fig. S6, Supplementary Table 5).
Evaluation of prediction performance of DNN model. We defined the probability value of the output layer of the DNN model as a cancer score. We performed 20 independent repetitions of nested fivefold cross-validation, which yielded 100 different cancer scores for each sample, and used the average of these cancer scores to assess the performance of our DNN model. In each cross-validation cycle, a classification model was trained, and the test samples excluded from the training set were evaluated (Fig. S3). The interquartile range (IQR) was also calculated to measure how stable the scores of each test sample were across the various models.
The average IQR of the cancer score was 0.09 for cancer samples and 0.06 for healthy control samples in the colorectal cancer classification model. In the lung cancer classification model, the average IQRs of the cancer samples and the healthy control samples were 0.13 and 0.10, respectively (Fig. S7). The cancer scores therefore appeared consistent across cross-validation cycles. Additionally, to check whether the number of samples was sufficient to evaluate model performance, the area under the receiver operating characteristic curve (AUC) and the average IQR were measured on randomly selected subsets at different sample size ratios. In the colorectal cancer model, reducing the number of samples by 50% decreased the AUC by only 0.02 and increased the average IQR by 0.015 (Fig. S8a,b). In the lung cancer model, the AUC was almost saturated from a sample size ratio of 60%, and the average IQR differed by only 0.03 at a sample size ratio of 50% (Fig. S8c,d).
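The per-sample summary used here (average cancer score plus IQR across repeated cross-validation models) can be sketched with the standard library. The quartile convention is an assumption, since the text does not state which one was used; the toy scores and names are hypothetical, and five scores per sample stand in for the 100 described above.

```python
import statistics


def summarize_scores(scores_per_sample):
    """Return the mean score and IQR for each sample's list of scores."""
    summary = {}
    for sample_id, scores in scores_per_sample.items():
        # Assumption: "inclusive" quartiles; other conventions differ slightly.
        q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        summary[sample_id] = {
            "mean_score": statistics.fmean(scores),
            "iqr": q3 - q1,
        }
    return summary


toy_scores = {
    "cancer_01": [0.92, 0.95, 0.90, 0.97, 0.94],
    "control_01": [0.1, 0.2, 0.3, 0.4, 0.5],
}
summary = summarize_scores(toy_scores)
```

A small IQR means that models trained on different folds assign a sample nearly the same score, which is the sense in which the scores are called consistent.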

Lung cancer classification.
Table 1. Clinical characteristics and demographics of patients with cancer and healthy controls. Data are presented as numbers (%) or the median (interquartile range). BMI, body mass index; MSI, microsatellite instability; MSS, microsatellite stable; MSI-H, high microsatellite instability; N/A, not applicable; NSCLC, non-small-cell lung cancer; RLL, right lower lobe; RML, right middle lobe; RUL, right upper lobe; LLL, left lower lobe; LUL, left upper lobe; SCLC, small-cell lung cancer.
1 NSCLC (N = 86) and SCLC (N = 9) were staged according to the 8th edition of the American Joint Committee on Cancer.
2 One patient in the colorectal cancer group had neuroendocrine carcinoma. Other NSCLCs (N = 5) included large-cell neuroendocrine carcinoma (N = 3), adenosquamous cell carcinoma (N = 1), pleomorphic carcinoma (N = 1), and NSCLC not otherwise specified (N = 1).

[…] determines whether cancer is present, and the Cancer Type Classifier, which classifies the type of cancer. Prediction performance was measured through fivefold cross-validation, which samples 80% of the data for training and 20% for testing. In each fold, the Cancer Classifier was trained using samples of the two cancer types as the case group and the healthy controls as the control group. Afterwards, the true positives were tested in the Cancer Type Classifier, which was built using the two cancer types with different labels. The cancer type with the highest probability value was defined as a true positive. The accuracy of these two classifiers is displayed in the confusion matrix (Fig. 4). In the Cancer Classifier, 179 of 191 cancer samples were predicted as positive, a sensitivity of 93.7%, and they were classified into the two cancer types by the Cancer Type Classifier with high accuracy (94.4% for colorectal cancer and 89.9% for lung cancer).
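The two-stage logic described above (detect cancer first, then assign a type only to detected samples) can be sketched as follows. The threshold value, the probability dictionary, and the function name are hypothetical; the last line only reproduces the reported sensitivity from the counts given in the text.

```python
def predict_cso(cancer_score, type_probabilities, threshold):
    """Two-stage call: detect cancer first, then pick the likeliest type."""
    if cancer_score < threshold:
        return None  # called non-cancer; the type classifier is not consulted
    # The type with the highest probability is taken as the predicted origin.
    return max(type_probabilities, key=type_probabilities.get)


# Toy calls with a hypothetical threshold of 0.5.
call_a = predict_cso(0.93, {"colorectal": 0.81, "lung": 0.19}, threshold=0.5)
call_b = predict_cso(0.12, {"colorectal": 0.55, "lung": 0.45}, threshold=0.5)

# Sensitivity of the first stage from the counts reported in the text.
detected, total_cancers = 179, 191
sensitivity_percent = round(100 * detected / total_cancers, 1)
```

Because the second stage only sees samples that passed the first, the reported type accuracies are conditional on detection rather than on all 191 cancer samples.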

Confounder analysis.
Even after data normalization, principal component analysis (PCA) is commonly used to determine whether sequencing batch effects exist 43,44. PCA confirmed that there was no bias between the 22 batches or between the sample groups (Fig. S10). Seven samples with a principal component 1 (PC1) value exceeding 250 were identified as outliers, and they were cancer samples over stage 3 with high cancer scores.
Considering that methylation changes are affected by age 45, age may become a confounding factor if the age distribution differs between the sample groups. In this study, there was a statistically significant difference in age distribution (Student's t-test P = 0.003 for colorectal cancer vs. healthy controls and P = 0.022 for lung cancer vs. healthy controls). However, age was not correlated with the cancer score. The Pearson's correlation coefficient (PCC) between cancer score and age was 0.005 for colorectal cancer samples and 0.096 for lung cancer samples. For the normal group, the PCC values in the colorectal cancer classification model and the lung cancer classification model were 0.071 and 0.061, respectively. The distribution of cancer scores was not significantly different among the age groups (Fig. S11a).
Because the dataset contained males and females, cancer-related markers on the sex chromosomes may lead to incorrect analytical results. To avoid this, all analyses were carried out using only markers on autosomal chromosomes. Still, if there is a large difference in the cancer scores between males and females, sex may act as a confounding factor. In the colorectal cancer model, both sexes showed similar distributions, but in the lung cancer model, male patients showed a significantly higher cancer score (Fig. S11b). To address this sex difference, we compared the characteristics of patients with lung cancer by sex. As shown in Supplementary Table 7, 87.1% (27/31) of the female patients were never-smokers, whereas 91.2% (59/64) of the male patients were current or former smokers. Moreover, the female patients in the lung cancer group were significantly younger and had a higher prevalence of lung adenocarcinoma (LUAD) and early-stage disease than the male patients. Because these factors (age, smoking, histology, and cancer stage) might have confounded the results, we performed multivariable analysis (Supplementary Table 8) and found that smoking was an independent factor associated with lung cancer score (Fig. S12). All tests were two-sided, and significance was set at P < 0.05. We used Stata software (v.14.0; Stata Corporation, College Station, TX, USA) for statistical analysis.
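The age check above rests on the Pearson correlation coefficient between age and cancer score. A minimal sketch of that calculation is given below; the ages and scores are invented toy values, not study data, and the function name is hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy values only: five hypothetical patients.
toy_ages = [45, 52, 60, 67, 73]
toy_scores = [0.91, 0.88, 0.95, 0.90, 0.93]
r = pearson_correlation(toy_ages, toy_scores)
```

A coefficient near 0, as reported for all four groups, indicates no linear relationship between age and score; it does not by itself rule out nonlinear dependence.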

Discussion
This study presents a novel liquid biopsy method for cancer detection using the proposed MRE-Seq method and DNN-based artificial intelligence (AI) analysis. The method was found to be highly sensitive and accurate in detecting cancer-specific DNA methylation patterns in cfDNA and has the potential to be a valuable tool for early cancer diagnosis and detection.
In recent years, analysis of the methylation pattern in cfDNA has emerged as a promising cancer screening and monitoring approach for the development of multicancer liquid biopsy methods 46–48. Bisulfite sequencing has been the most extensively studied method for analyzing DNA methylation in cancer. A recent study applied it to 27 different types of cancer and found that the methylation analysis was highly accurate, showing outstanding results in 16 cancer types with a specificity of 99.4%, a sensitivity of 60% to 94%, and 92% correct classification of the CSO, an important index for early diagnosis in clinical practice 18,19. However, bisulfite sequencing is difficult to adopt in routine clinical settings because it requires a large amount of input blood: 84-96% of the DNA is degraded during the bisulfite conversion step 31,32,49–51.
Methylated DNA immunoprecipitation coupled with high-throughput sequencing (MeDIP-seq) has been employed as an affinity-purification-based method, with AUCs of 0.978, 0.918, and 0.971 for acute myeloid leukemia, pancreatic cancer, and lung cancer, respectively 52. However, the overall CSO prediction accuracy was less than 70%, which is insufficient for a practical tool for early multicancer screening.
The performance of our proposed MRE-seq method is comparable to that of the previous bisulfite sequencing method in the accuracy of cancer detection and CSO classification, while using a relatively small amount of blood from a single tube collection and requiring lower sequencing depth than whole-genome bisulfite sequencing. These features enhance its practicability for routine clinical adoption by lowering the required patient blood volume and reducing the cost of testing 28,31,51.
The overall accuracy of cancer detection by MRE-seq was high because MRE-seq measures global hypomethylation, a characteristic feature of most cancer genomes. In particular, the sensitivity for stage 1 samples of colorectal and lung cancers was 76.5% and 50.0%, respectively. The high detection rate for early-stage cancer may be due to the prevention of DNA damage by avoiding bisulfite treatment and to the enhancement of the cancer signal by enriching cancer-specific demethylated reads (hypomethylation). Therefore, this method is well suited to diagnosing early cancer by liquid biopsy using a small amount of cfDNA in regular clinical testing.
The overall accuracy of the liquid biopsy method was lower for lung cancer than for colorectal cancer. This is likely due to the greater diversity of histological subtypes and the larger differences in DNA methylation patterns in lung cancer compared with colorectal cancer. However, the accuracy of lung cancer detection is expected to improve with the use of a sufficient number of LUAD samples in the training set.
Although only two cancer types were used in testing the feasibility of CSO analysis using MRE-Seq and deep learning, the results showed that the method had high accuracy in predicting the tissue of origin for most of the samples. However, there were more falsely predicted samples in certain subtypes, such as left colon cancer in colorectal cancer and LUAD in lung cancer. By analyzing a sufficient number of samples with a similar number of each subtype, the accuracy of the CSO analysis may be improved in future work (Fig. S10).
In the cfDNA of cancer patients, cancer-specific methylation patterns of ctDNA and tissue-specific patterns can coexist. As a result, some methylation signals detected in liquid biopsy samples may be tissue-specific rather than cancer-specific. To distinguish between cancer-specific and tissue-specific methylation patterns, it is necessary to perform comparative analyses with samples from patients with benign diseases related to the cancer type. By analyzing the methylation patterns of both cancer and benign disease samples, it may be possible to develop more accurate and specific liquid biopsy methods for cancer diagnosis 53. In this study, samples with benign disease were not excluded, which may reflect real-world situations more accurately. By including samples with both cancer and benign disease, the study may provide a more realistic assessment of the accuracy and reliability of liquid biopsy testing.
The study has several limitations that should be considered when interpreting the results. First, each case study was conducted at a single center, which may introduce selection bias and limit the generalizability of the findings. Additional multicenter studies are needed to validate the results with independent test samples and confirm the utility of the method in different populations. Second, no follow-up was conducted for the healthy controls, which may have led to misclassification bias if certain individuals developed cancer after the study. Further research is needed to track the health outcomes of the controls and to assess the long-term predictive power of the method. Third, the study included only two types of cancer samples, which may limit the accuracy of CSO classification, especially for cancers with similar methylation patterns. To improve the performance of CSO prediction, additional studies using larger and more diverse sets of cancer types are required. Despite these limitations, the study provides valuable insights into the potential of MRE-seq and DNN analysis of liquid biopsy samples for early cancer detection and diagnosis and may facilitate more effective treatment of the disease.

Conclusions
The study aimed to develop a screening method for the early detection of multiple cancers using liquid biopsy-based testing. By combining the proposed MRE-Seq method with a machine learning algorithm, the researchers were able to detect and classify colorectal and lung cancers with high accuracy. MRE-Seq allows the analysis of global hypomethylation in cancer genomes with high sensitivity, low cost, and a small blood sample requirement, which makes it a promising approach for early screening of multiple cancer types in routine clinical settings. However, additional research is needed to adapt and validate the method for other cancer types, and to determine its clinical feasibility for multicancer early detection. The study highlights the potential of liquid biopsy methods for improving cancer diagnosis and detection, suggesting that further development and validation of these methods could have important implications for improving cancer survival and quality of life.

Figure 1. Diagram of methylation-sensitive restriction enzyme digestion followed by sequencing (MRE-seq) with SacII. A library was constructed based on MRE-seq using a methylation-sensitive restriction enzyme, SacII. As cancer grows, global DNA hypomethylation accelerates demethylation in both regulatory and intragenic regions. In the cancer genome, demethylation occurs in different regions, producing a distinct pattern.

Figure 2. Test performance of colorectal and lung cancer classification. (a,b) The overall AUC values were 0.978 for colorectal cancer and 0.956 for lung cancer. (c,d) Sensitivity at 99.2% specificity with 95% confidence interval (CI) according to cancer stage.

Table 2. Sensitivity of the DNN model for predicting colorectal cancer and lung cancer at a specificity of 99.2%. *Sensitivity for each cancer type was calculated based on the specificity of 99.2%, which allowed 1 false positive sample out of 126 healthy control samples.
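The footnote's operating point can be made concrete: allowing 1 false positive among 126 controls gives a specificity of 125/126, or about 99.2%. The sketch below picks a score cutoff that permits a fixed number of control false positives and then reads off sensitivity. The rule "positive if score is strictly above the cutoff", the function names, and the toy scores are assumptions for illustration.

```python
def threshold_for_allowed_false_positives(control_scores, allowed_false_positives=1):
    """Cutoff such that at most the allowed number of controls score above it."""
    if not 0 <= allowed_false_positives < len(control_scores):
        raise ValueError("allowed_false_positives must be below the control count")
    ranked = sorted(control_scores, reverse=True)
    return ranked[allowed_false_positives]


def fraction_above(scores, threshold):
    """Fraction of scores strictly greater than the threshold."""
    return sum(1 for s in scores if s > threshold) / len(scores)


# Toy scores only: four controls and three cancers.
toy_controls = [0.10, 0.20, 0.30, 0.90]
toy_cancers = [0.25, 0.50, 0.95]
cutoff = threshold_for_allowed_false_positives(toy_controls, 1)
specificity = 1.0 - fraction_above(toy_controls, cutoff)
sensitivity = fraction_above(toy_cancers, cutoff)

# Specificity implied by the footnote: 1 false positive in 126 controls.
reported_specificity_percent = round(100 * 125 / 126, 1)
```

Fixing the cutoff from the controls first and only then measuring sensitivity mirrors how a screening test is usually tuned, with the false-positive rate set before detection rates are compared across stages.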