Epigenetic study of early breast cancer (EBC) based on DNA methylation and gene integration analysis

Breast cancer (BC) is one of the leading causes of cancer-related deaths in women. The purpose of this study is to identify key molecular markers related to the diagnosis and prognosis of early breast cancer (EBC). The data of mRNA, lncRNA and DNA methylation were downloaded from The Cancer Genome Atlas (TCGA) dataset for identification of differentially expressed mRNAs (DEmRNAs), differentially expressed lncRNAs (DElncRNAs) and DNA methylation analysis. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyzes were used to identify the biological functions of DEmRNAs. The correlation analysis between DNA methylation and DEmRNAs was carried out. Then, diagnostic analysis and prognostic analysis of identified DEmRNAs and DElncRNAs were also performed in the TCGA database. Subsequently, methylation state verification for identified DEmRNAs was performed in the GSE32393 dataset. In addition, real-time polymerase chain reaction (RT-PCR) in vitro verification of genes was performed. Finally, AC093110.1 was overexpressed in human BC cell line MCF-7 to verify cell proliferation and migration. In this study, a total of 1633 DEmRNAs, 750 DElncRNAs and 8042 differentially methylated sites were obtained, respectively. In the Venn analysis, 11 keys DEmRNAs (ALDH1L1, SPTBN1, MRGPRF, CAV2, HSPB6, PITX1, WDR86, PENK, CACNA1H, ALDH1A2 and MME) were we found. ALDH1A2, ALDH1L1, HSPB6, MME, MRGPRF, PENK, PITX1, SPTBN1, WDR86 and CAV2 may be considered as potential diagnostic gene biomarkers in EBC. Strikingly, CAV2, MME, AC093110.1 and AC120498.6 were significantly actively correlated with survival. Methylation state of identified DEmRNAs in GSE32393 dataset was consistent with the result in TCGA. AC093110.1 can affect the proliferation and migration of MCF-7. ALDH1A2, ALDH1L1, HSPB6, MME, MRGPRF, PENK, PITX1, SPTBN1, WDR86 and CAV2 may be potential diagnostic gene biomarkers of EBC. Strikingly, CAV2, MME, AC093110.1 and AC120498.6 were significantly actively correlated with survival. The identification of these genes can help in the early diagnosis and treatment of EBC. In addition, AC093110.1 can regulate SPTBN1 expression and play an important role in cell proliferation and migration, which provides clues to clarify the regulatory mechanism of EBC.


Enrichment analysis of DEmRNAs.
In order to explore the biological function of DEmRNAs, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment were performed using clusterProfiler (version 3.10.1). FDR < 0.05 was considered statistically significant. In terms of biological process (BP), DEmRNAs were involved in extracellular matrix organization, mitotic nuclear division and chromosome segregation (Fig. 1A). In terms of molecular function (MF), DEmRNAs were involved in channel activity, DNA-binding transcription activator activity, RNA polymerase II-specific and transcription factor activity, RNA polymerase II proximal promoter sequence-specific DNA binding (Fig. 1B). In terms of cell composition (CC), DEmRNAs were involved in extracellular matrix, receptor complex and transmembrane transporter complex (Fig. 1C). According to KEGG enrichment analysis, neuroactive ligand-receptor interaction, cAMP signaling pathway and cell cycle were significantly enriched signaling pathways (Fig. 1D). From the functional enrichment, we found that most DEmRNA were involved in cell activity, cell cycle, transcription and so on. These biological processes may be related to the occurrence and development of BC.

DNA methylation analysis.
A total of 485,577 methylation sites were detected in this study. In order to ensure the reliability of the difference results, these 485,577 methylation sites were preprocessed to exclude unqualified methylation sites. After preprocess, 385,717 methylation sites were obtained and principal component analysis (PCA) was performed ( Fig. 2A). Then, β > 0.7 or β < 0.3 were selected for methylation sites. (Fig. 2B). According to screening criteria of Δβ > 0.2 and FDR < 0.05, 8042 differentially methylation sites (DMSs) (7458 hypermethylation sites and 584 hypomethylation sites) were obtained. The heat map of top 200 DMSs was shown in Fig. 2C. The Manhattan figure of these DMSs was shown in Fig. 2D. The distribution of DMSs on CpG island and gene group was shown in Fig. 3 (Table S1). Among them, 109 positive correlation pairs and 349 negative correlation pairs. Subsequently, 311 relationship pairs were selected according to hypermethylation site-DEmRNAs with low expression or hypomethylation site-DEmRNAs with high expression. Among which, 127 DEmRNAs (marked as cor-relation) were included. In addition, the relationship between DEmRNAs and differential methylated genes on CpG islands was integrated. www.nature.com/scientificreports/ A total of 42 DMRs-DEmRNAs were identified (Table S2), which includes 42 pairs of hypermethylation with low expression DEmRNAs (marked as island) and 0 pairs of hypomethylation with highly expression DEmRNAs.
In vitro verification. Clinical information of 6 EBC patients was shown in Table 4. The EBC tissue samples (disease group) and adjacent normal tissues samples (control group) were obtained for RT-PCR. Some top-ranked or reported genes, such as CACNA1H, CAV2, MME, FHL1, CAV1, LINC01537, TRHDE-AS1, LINC01614, FOXD3-AS1 and AC120498.6 (ENSG00000261294) were selected for RT-PCR verification. Primers were shown in Table 5. Compared with adjacent normal tissues, CAV2, MME, FHL1, CAV1, LINC01537, TRHDE-AS1 were down-regulated and LINC01614, FOXD3-AS1 were up-regulated in EBC tissues ( Fig. 9), which was consistent with bioinformatics analysis. Moreover, except CAV1 and TRHDE-AS1, the expression of www.nature.com/scientificreports/ them the rest genes were significant (P < 0.05). However, CACNA1H and ENSG00000261294 were opposite with bioinformatics analysis. The inconsistency may be caused by the small sample size, which needs further study.

AC093110.1 overexpression promoted proliferation and migration of MCF-7.
Bioinformatics analysis showed that AC093110.1 cis regulation of SPTBN1. Moreover, SPTBN1 and AC093110.1 have potential diagnostic and prognostic value, respectively. So we investigated the potential biological function of AC093110.1 at the cell level. AC093110.1 was overexpressed in human BC cell line MCF-7 (Fig. 10A). MTT detection showed that the proliferation rate in AC093110.1 overexpressed cells was significantly lower than that of normal MCF-7 cells (Fig. 10B). Conspicuously, overexpression of AC093110.1 inhibited cell migration (Fig. 11). Moreover, AC093110.1 cis regulated SPTBN1 in the bioinformatics analysis. Western blotting assay was used to detect the expression of SPTBN1 after AC093110.1 overexpression. The results showed that the expression level of SPTBN1 was up-regulated after AC093110.1 overexpression (Fig. 12). These results suggest that AC093110.1 may play an important role in BC cell proliferation and migration by regulating SPTBN1.

Discussion
Previous studies based on TCGA data have revealed the diagnostic and prognostic value of lncRNAs, miRNAs and mRNAs for EBC 16 . In addition, the identification of abnormal methylation of genes helps the detection of BC 17 . However, we have not found integrated studies on DNA methylation with transcriptome data. Key genes screened out through the integration of DNA methylation with DEmRNA and DElncRNA have more research significance for the diagnosis and treatment of EBC. In this study, ALDH1A2, ALDH1L1, HSPB6, MME, MRG-PRF, PENK, PITX1, SPTBN1, WDR86 and CAV2 were considered as potential diagnostic gene biomarkers in EBC. Furthermore, CAV2, MME, AC093110.1 and AC120498.6 were considered as potential prognosis gene www.nature.com/scientificreports/ biomarkers in EBC. In addition, AC093110.1 could regulate SPTBN1 expression and play an important role in cell proliferation and migration, which provides clues to clarify the regulatory mechanism of EBC. Caveolin 2 (CAV2) is a member of the caveolin protein family, the down-regulation of CAV2 promote cell proliferation of HeLa epithelial cervical cancer and A549 lung adenocarcinoma cells 18 . Compared with normal tissue, CAV2 showed a downward trend in tumor tissue 19 . In the early detection of BC, CAV is considered as a new methylation marker 20 . Membrane metalloendopeptidase (MME) is a transmembrane glycoprotein that  www.nature.com/scientificreports/  Overexpression of MME can inhibit substance P to promote the growth of cholangiocarcinoma 22 . In addition, MME hypermethylation has been found in BC, which may be the cause of reduced gene expression 23 . In this study, down-regulation of CAV2 and MME in EBC tumor tissues was associated with poor prognosis. This showed that CAV2 and MME may have high prognostic value in EBC. Spectrin beta, non-erythrocytic 1 (SPTBN1) is also known as ELF. Inhibition of SPTBN1 can up-regulate the activity of transcriptional activator 3, thereby promoting the development of in hepatocellular carcinoma (HCC) 24 . Meanwhile, it is also a potential and reliable biomarker for predicting the prognosis of HCC patients 25 . Previous studies have found that reduced SPTBN1 expression is associated with worse prognosis of pancreatic cancer 26 . The knockdown of SPTBN1 enhances the migration and invasion potential of BC cells 27 . In this study, the differentially expressed genes, SPTBN1 was modified by DNA methylation. Moreover, AUC value was greater than 0.9 in the ROC curve. Therefore, we speculated that SPTBN1 might be potential diagnostic genes in EBC. In addition, bioinformatics analysis showed that AC093110.1 cis regulation of SPTBN1. AC093110.1 also has potential prognostic value. So we investigated the potential biological function of AC093110.1 at the cell level. The results clarify that AC093110.1 may play an important role in cell proliferation and migration by regulating the expression of SPTBN1, which provide clues for elucidating the regulation mechanism of EBC.
Aldehyde dehydrogenase 1 family member L1 (ALDH1L1) is a folate metabolizing enzyme with tumor suppressive properties 28 . In colon cancer, ALDH1L1 expression is decreased and mRNA levels are negatively correlated with promoter hypermethylation 29 . In HCC, compared with low expression of ALDH1L1, patients with high expression of ALDH1L1 have a significantly lower risk of recurrence and death 30 . In addition, ALDH1L1 is considered as a potential biomarker for poor prognosis of gastric cancer 31 . ALDH1L1 expression is inhibited in BC patients, and the mean hypermethylation level in the promoter region was positively correlated with the down-regulation of ALDH1L1 32 . In addition, high expression of ALDH1L1 is found to be associated with better overall survival in BC patients 33 . In this study, the differentially expressed genes, ALDH1L1 was modified by DNA methylation. Moreover, AUC value was greater than 0.9 in the ROC curve. Therefore, we speculated that ALDH1L1 might be potential diagnostic genes in EBC.
Aldehyde dehydrogenase 1 family member A2 (ALDH1A2) has been reported to be down-regulated in the early stages of human prostate cancer, and can be used as a candidate tumor suppressor gene for prostate cancer 34,35 . However, previous studies have found that high expression of ALDH1A2 mRNA is significantly www.nature.com/scientificreports/ associated with poorer survival in patients with non-small cell lung cancer (NSCLC) 36 . High expression of ALDH1A2 is found to be related to better overall survival in BC patients 33 . In this study, the expression of ALDH1A2 decreased in patients with EBC, suggesting that ALDH1A2 may play different regulatory roles in different cancers. The hypermethylation and down-regulation of proenkephalin (PENK) in BC can lead to cell migration and adhesion defects 37 . The PENK may be a molecular marker of tumorigenesis of hormone-receptor positive (HR( +))/human epidermal growth factor receptor 2 negative (HER2(-)) in adolescents and young adults 38 . In this study, KEGG analysis showed that PENK participated in Neuroactive ligand-receptor interaction signal pathway. In addition, AUC value was 0.907 in ROC curve. This further showed that PENK could play an important role in EBC.
Heat shock protein family B (small) member 6 (HSPB6), also known as Hsp20, and its expression level is negatively correlated with the degree of malignancy in ovarian cancer 39 . Decreased expression of HSP20 is associated with TNM stage, lymph node metastasis, and tumor recurrence, and may be valuable as a prognostic tumor marker 40 . Compared with normal lung tissue, the expression of paired like homeodomain 1 (PITX1) is up-regulated in NSCLC and is associated with a poorer prognosis 41 . PITX1 has been reported to be significantly increased in different histological classifications of BC, and is positively correlated with metastatic relapse-free survival and distant metastasis-free survival 43,44 . In this study, HSPB6 and PITX1 were down-regulated and upregulated in EBC, respectively. This is consistent with the expression in other cancers. Therefore, it is speculated that HSPB6 and PITX1 could play an important regulatory role in the physiological and pathological process of EBC.
There are few studies on WD repeat domain 86 (WDR86) and MAS related GPR family member F (MRGPRF) in cancer. However, they are found to be significantly related to diagnosis of EBC in this study, which implies that they play an important regulatory role in the development of EBC.
In this study, analysis of EBC in TCGA revealed methylated modification status of ALDH1A2, ALDH1L1, HSPB6, MME, MRGPRF, PENK, PITX1, SPTBN1, WDR86 and CAV2, and clarified that they may be potential diagnostic gene biomarkers. In order to further prove abnormal methylation modification in EBC, we verified the methylation modification status of ALDH1A2, ALDH1L1, HSPB6, MME, MRGPRF, PENK, PITX1, SPTBN1, Figure 7. Methylation modification state verification of ALDH1A2, ALDH1L1, HSPB6, MME, MRGPRF and PENK in the GSE32393 dataset. * represents P < 0.05, ** represents P < 0.01, *** represents P < 0.001, P < 0.05 was considered statistically significant. www.nature.com/scientificreports/ WDR86 and CAV2 in the GSE32393 dataset. The GSE32393 dataset contains methylation modification data from EBC tissue samples, which is consistent with TCGA 45 . The results showed that the methylation modification state was consistent with the result in TCGA. Genes methylation patterns are related to gene expression regulation 46,47 . These results suggest that abnormal methylation of genes plays an important role in the progression of EBC. In addition to the above 10 genes, we also found that many genes had important regulatory roles in EBC, such as four and a half LIM domains 1 (FHL1), caveolin 1 (CAV1), long intergenic non-protein coding RNA Figure 8. Methylation modification state verification of PITX1, SPTBN1, WDR86 and CAV2 in the GSE32393 dataset. * represents P < 0.05, ** represents P < 0.01, *** represents P < 0.001, P < 0.05 was considered statistically significant.  48 . CAV1 is down-regulated in BC cells and tissues, and it is revealed that CAV1 plays an essential regulatory role in BC by regulating lysosomal function and autophagy 49 . LINC01614 is highly expressed in BC and has strong prognostic value, which can be used as a potential biomarker to predict the prognosis of BC 50 . Compared with normal tissues, FOXD3-AS1 has a significantly higher expression in BC tissues. Moreover, patients with low FOXD3-AS1 expression have a higher survival rate, smaller tumor size and fewer distant metastases 51 . In this study, we found that FHL1, CAV1 were down-regulated and LINC01614, FOXD3-AS1 were up-regulated in EBC. The expression trend was consistent with previous studies, which further suggests that they may play an important regulatory role in EBC.

GAPDH-F(internal reference) 5-CTG GGC TAC ACT GAG CAC C-3
GAPDH-R(internal reference) 5-AAG TGG TCG TTG AGG GCA ATG-3   ACTB-F(internal reference)  5-GAT CAA GAT CAT TGC TCC TCCT-3   ACTB-R(internal reference)  5-TAC TCC TGC TTG CTG ATC CA-3   CACNA1H-F  5-ATG CTG GTA ATC ATG CTC AACTG-3   CACNA1H-R  5-AAA AGG CGA AAA TGA AGG CGT-  www.nature.com/scientificreports/ Certain limitations exist in this study. First of all, the RT-PCR verification sample size is too small, which leads to errors in the verification results and bioinformatics analysis. The sample needs to be expanded for further verification. Secondly, the molecular mechanisms of the identified key genes were still unclear and require further study in EBC. Thirdly, the role of the identified genes in each subtype of BC needs to be further studied.

Conclusion
In this study, these results indicate that identified genes may be used as potential clinical biomarkers of EBC. Among them, ALDH1A2, ALDH1L1, HSPB6, MME, MRGPRF, PENK, PITX1, SPTBN1, WDR86 and CAV2 may be considered as the potential diagnostic gene biomarkers for EBC. Strikingly, CAV2, MME, AC093110.1 and AC120498.6 were significantly actively correlated with survival. In addition, AC093110.1 could regulate SPTBN1 expression and play an important role in cell proliferation and migration, which provides clues to clarify the regulatory mechanism of EBC. In short, the identification of these genes can help in the early diagnosis and treatment of EBC.

Materials and methods
Datasets. The data of mRNA (read counts data), lncRNA (read counts data) and DNA methylation (DNA methylation chip data) were downloaded from TCGA (https:// tcga-data. nci. nih. gov/ tcga/) on June 30, 2020. A total of 1098 patients with BC were included in the dataset, including 1098 patients with clinical data, 1092 patients with RNA-seq data and 1095 patients with methylation array data. According to the clinical information stage, 804 patients with EBC (stage I-II) were selected. Finally, selected patients all had mRNA data, lncRNA data and DNA methylation data. Detailed data was shown in Table 6. We used the R software (version 3.5.1; Bell Laboratories, formerly AT&T, now Lucent Technologies, Murray Hill, NJ, USA) to analyze mRNA, lncRNA and DNA methylation data.  www.nature.com/scientificreports/ Differential analysis of mRNAs and lncRNAs. The DEmRNAs and DElncRNAs were evaluated in the R-bioconductor package DESeq 52 . Firstly, the RNAs read counts = 0 distributed greater than 20% in the case sample or the RNAs read counts = 0 distributed greater than 20% in the normal sample were deleted. Then, false discovery rate (FDR) < 0.05 and absolute value of log2FoldChange > 2 were used to identify DEmRNAs and DElncRNAs. Log2FoldChange > 2 and log2FoldChange < -2 represented up-regulation and down-regulation, respectively.
Biological function enrichment analysis of DEmRNAs. In order to explore the biological function of DEmRNAs, GO (including biological process, molecular function and cellular component) and KEGG functional enrichment analysis of DEmRNAs were performed by using clusterProfiler package in R 53 . The detailed technical parameters of enrichment were organism = has, pvalueCutoff = 0.05, qvalueCutoff = 0.1. FDR < 0.05 was considered statistically significant.
DNA methylation analysis. The COHCAP package in R was used to screen DMSs 54 . The Δβ was the difference of methylation site expression values between case and normal. In order to ensure the reliability of difference analysis results, 485,577 methylation sites detected were preprocessed. Filter out the methylated sites whose expression value β is not available (NA) and the distribution is greater than 20% in the sample. The COHCAP. annotate function was used to annotate the methylation sites and remove the methylation sites on the sex chromosomes. After preprocess, we perform PCA of the remaining methylation sites. Subsequently, β > 0.7 or β < 0.3 were selected for methylation sites. Finally, Δβ > 0.2 and FDR < 0.05 were used to screen DMSs and differentially methylation regions (DMRs).

Correlation analysis of DNA methylation with DEmRNAs. Chromosomal location information of
DEmRNAs and DElncRNAs was downloaded from ensemble database. Then, mRNA-lncRNA distance information was extracted with bedtools intersect tool 55 . DEmRNAs were searched in the upstream and downstream of DElncRNAs within the range of 100 kilobase (kb). In order to obtain DEmRNAs that were jointly regulated by lncRNA and related with DNA methylation, we took the intersection of lncRNA cis regulated mRNA and mRNAs corresponding to the DMSs. P < 0.05 and absolute value of cor > 0.2 were used to perform Pearson correlation analysis on DMSs and DEmRNAs. In addition, the relationship between DEmRNAs and differential methylated on CpG island was integrated. Finally, Venn analysis (http:// www. bioin forma tics. com. cn/) was used to show the DEmRNAs obtained from the above three steps.

Methylation verification, diagnostic and prognostic analysis of identified DEmRNAs and DEl-ncRNAs.
In order to evaluate the potential diagnostic utility and prognostic value of the identified genes in EBC, diagnostic analysis and prognostic analysis were performed in the TCGA database. The pROC package in R was used for diagnostic analysis. The sensitivity and specificity at the cut-offs were determined referring to previous report 56 . The diagnostic ability was evaluated by the AUC values in the ROC curve. Survival and Survminer software packages in R were used to for survival analysis and survival curve drawing. The prognostic ability was evaluated by survival curve. Subsequently, the GSE32393 dataset was downloaded from the Gene Expression Omnibus (GEO, https:// www. ncbi. nlm. nih. gov/ geo/) database. Methylation modification state verification of the identified genes was performed in the GSE32393 dataset (Normal:Case = 23:100).
RT-PCR validation of identified genes. 6 patients with clinical stage I-II EBC were enrolled. The EBC tissue samples (disease group) and adjacent normal tissues samples (control group) were obtained for RT-PCR. The TRIzol kit was used for extracted total RNA. Then, FastKing cDNA first chain synthesis kit (KR116, TIAN-GEN) was used for mRNA reverse transcription. RT-PCR was performed using SuperReal PreMix Plus (SYBR Green) (FP205, TIANGEN). Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) and actin beta (ACTB) were used as internal reference. The relative gene expression levels were calculated using the 2 -ΔΔCt method 57 . This study was approved by the ethics committee the Shijiazhuang people's Hospital (202060). The written consent was obtained from the all patients.
Validation at the cell level. Bioinformatics analysis showed that AC093110.1 cis regulation of SPTBN1.
Moreover, SPTBN1 and AC093110.1 have potential diagnostic and prognostic value, respectively. So we investigated the potential biological function of AC093110.1 at the cell level. AC093110.1 was overexpressed in human BC cell line MCF-7. The MCF-7 cells were cleaned with phosphate buffered saline (PBS) before transfection, and then 3 × 10 5 cells were inoculated into each well of the 6-well plate with 1.5 ml low serum minimum essential medium (MEM). Subsequently, the cells were incubated in 37 ˚C incubator. AC093110.1 was added to 250 μl MEM, and then mixed with 250ul MEM containing 5ul lipofectamine 2000 transfection reagent. The mixture was incubated at room temperature for 20 min. Add the mixture to the cultured 6-well plate. The cell transfection efficiency was detected after 48 h. The expression level of AC093110.1 was detected by RT-PCR. The MTT assay was used to analyze the affect of AC093110.1 on cell proliferation. Cell wound scratch assay was used to analyze the affect of AC093110.1 on cell migration. In addition, after overexpression of AC093110.1, western blot was performed to detect the expression of SPTBN1 adjacent to AC093110.1. Usually the blots were cut prior to hybridisation with antibodies.
Statistical analysis. GraphPad Prism was used to perform all the data statistics. For the RT-qPCR experiments, one-way ANOVA and Duncan's multiple range test was used to assess differences between case and nor- www.nature.com/scientificreports/ mal groups. Results were presented as the mean ± SD. P < 0.05 was considered as statistical significance. Experiments were repeated independently at least 3 times.
Ethics approval and consent to participate. This study was approved by the ethics committee the Shijiazhuang people's Hospital (202060). The written informed consent was obtained from the all patients. All participants were informed as to the purpose of this study, and that this study complied with the Declaration of Helsinki.

Data availability
All data generated or analyzed during this study are included in this published article.