In silico identification of single nucleotide variations at CpG sites regulating CpG island existence and size

Genetic and epigenetic modifications of genes involved in the key regulatory pathways play a significant role in the pathophysiology and progression of multifactorial diseases. The present study is an attempt to identify single nucleotide variations (SNVs) at CpG sites of promoters of ACAT1, APOB, APOE, CYBA, FAS, FLT1, KSR2, LDLR, MMP9, PCSK9, PHOX2A, REST, SH2B3, SORT1 and TIMP1 genes influencing CpG island (CGI) existence and size associated with the pathophysiology of Diabetes mellitus, Coronary artery disease and Cancers. Promoter sequences located between −2000 to + 2000 bp were retrieved from the EPDnew database and predicted the CpG island using MethPrimer. Further, SNVs at CpG sites were accessed from NCBI, Ensembl while transcription factor (TF) binding sites were accessed using AliBaba2.1. CGI existence and size were determined for each SNV at CpG site with respect to wild type and variant allele by MethPrimer. A total of 200 SNVs at CpG sites were analyzed from the promoters of ACAT1, APOB, APOE, CYBA, FAS, FLT1, KSR2, LDLR, MMP9, PCSK9, PHOX2A, REST, SH2B3, SORT1 and TIMP1 genes. Of these, only 17 (8.5%) SNVs were found to influence the loss of CGI while 70 (35%) SNVs were found to reduce the size of CGI. It has also been found that 59% (10) of CGI abolishing SNVs are showing differences in binding of TFs. The findings of the study suggest that the candidate SNVs at CpG sites regulating CGI existence and size might influence the DNA methylation status and expression of genes involved in molecular pathways associated with several diseases. The insights of the present study may pave the way for new experimental studies to undertake challenges in DNA methylation, gene expression and protein assays.

www.nature.com/scientificreports/ promoter 17 . Studies suggested that the methylation of cytosines in a promoter DNA suppresses the rate of transcription, reduces the mRNA copy number and ultimately affects the protein synthesis [18][19][20] .
There are few reports published to show the tangible impact of SNVs at CpG sites on CpG island existence or size in genes influencing the pathophysiology of various diseases [64][65][66] . A genome-wide CpG SNP identification study revealed that CpG SNPs are significantly associated with the Cancers 64 . Furthermore, GWAS datasets on DM and CAD have identified novel functional SNPs at CpG sites which affect the expression and function of genes via epigenetic regulations 65 . Experimental studies on O6-methylguanine-DNA methyltransferase (MGMT) gene rs16906252 and RAD50 gene DNase I hypersensitive sites (RHS) 7 region rs2240032 polymorphisms suggested that SNPs at CpG sites can influence the DNA methylation at promoter regions, transcription factors binding at enhancer or silencer region and miRNA binding at 3'UTR region [67][68][69][70] . The SNVs at CpG sites might modulate the existence and size of CpG islands at the promoter region; altering the methylation patterns and binding of transcription factors which ultimately affect the gene activation or silencing or expression 64,65 . Therefore, studies are warranted to identify SNVs at CpG sites regulating CpG island existence & size and their consequent effects on DNA methylation and gene expression.

Materials and methods
Study design. The detailed study design is presented in Fig. 1.
Literature search and databases. We  Prediction of CpG Islands. CpG islands (CGIs) in promoter sequence of genes under the study were predicted using MethPrimer v1.1 beta. CGI existence and size were determined for each single nucleotide variation at CpG site with respect to wild type and variant allele. MethPrimer predicts potential CGIs in the input promoter DNA sequence and designs sequence specific primers for Methylation-Specific PCR and Bisulfite-Sequencing PCR. The output results are presented in graphical view for predicted CpG island and in text format for PCR primers 72 . The criteria used for gain and loss of CGI prediction is Island size > 100bp, GC percent > 50.0, ratio of Obs/Exp no of CpG dinucleotides > 0.60 73 . Selection of SNVs at CpG sites. CpG sites were identified from the results of MethPrimer and the SNVs at CpG sites were accessed from National Center for Biotechnology Information (NCBI) and Ensembl. NCBI and Ensembl are widely used genome browsers in global scientific community. The browsers were developed with the data of genomic regions, genes, gene sequence, genetic variations, phenotypes, etc. The tools visualize DNA sequence and their respective annotated genetic variations to identify the SNVs at CpG sites in CpG islands 74,75 . Transcription factor binding site prediction. AliBaba2.1 tool was used for the prediction of transcription factor binding sites in wild type and variant alleles of SNVs at CpG sites. It is an online tool to identify transcription factors and their respective binding sites for the input DNA sequence by constructing matrices on

Results
Promoter sequence of ACAT1, APOB, APOE, CYBA, FAS, FLT1, KSR2, LDLR, MMP9, PCSK9, PHOX2A, REST, SH2B3, SORT1 and TIMP1 genes were analysed for the prediction of CpG islands and have observed CpG islands for all the genes ( Fig. 2A, B). Further, the existence and sizes of CGI for wild type and variant alleles of all the CpG SNVs were analyzed. In addition, transcription factors binding to both the wild type and variant alleles of CpG SNVs abolishing CGI were predicted.
To the 4 SNVs of LDLR gene that abolished CGI, TFs binding site prediction has shown that rs1026272027 wild type allele has a binding site for C/EBPapl and variant allele has a binding site for C/EBPbet. For rs887608252,C/T, rs1006494933,G/A and rs1024897634,C/T SNVs, there were no TF binding sites for their wild type alleles, but their variant alleles have binding sites for C/EBPapl, GATA-1 & Oct-1 and Oct-1 TFs respectively.
Likewise, 2 SNVs abolishing CGIs in MMP9 gene have shown the difference in binding of TFs, rs370018925 wild type allele has no binding site for any TF whereas variant allele is bound by Sp1 transcription factor. Though the rs1014494202 has Sp1 binding site for wild type allele, variant allele has an additional binding site for BRF-1 transcription factor.
For rs922413124 in SH2B3 gene, there was a binding site for Sp1 in wild type allele, but it is abolished in variant allele. Similarly, APOE rs769448 has binding site for Sp1 transcription factor but its variant allele is lacking a site for binding of any transcription factor.
Furthermore, 2 SNVs that abolished CGIs in TIMP1 gene has shown that the wild type alleles of rs779329701 and rs376386551 has binding sites for Egr-1 and Sp1 transcription factors while variant alleles have binding sites for NF-1 and N-Myc transcription factors respectively. Gene ontology enrichment analysis. The gene ontology enrichment analysis of the genes set is shown in Fig. 6. The top 10 GO terms of biological process (BP), cellular component (CC), molecular function (MF) and disease class analyses in genes were sorted by p-value or gene count. According to the BP analysis, the GO term pathways were mainly associated with the cholesterol biosynthesis, metabolism and homeostasis, regulation of apoptosis, receptor mediated endocytosis, etc (Fig. 6A). For the CC analysis, the GO terms of these genes were mainly located and enriched in the plasma membrane, extracellular exosomes and space, golgi apparatus, etc (Fig. 6B). In the MF analysis, 15 genes were mainly enriched and associated with binding activity and transporter activity particularly protein binding, metal ion binding, identical protein binding, low-density lipoprotein particle receptor binding, cholesterol transporter activity, etc (Fig. 6C).

Co-expression analysis.
The GO terms disease class analysis of these genes revealed that the genes are associated with metabolic diseases, neurological diseases, cardiovascular diseases, cancers, etc (Fig. 6D). Later, functional annotation clustering of these genes was performed and functional chart of cluster with highest gene enrichment score (3.17) is shown in Fig. 6E. Out of the 15 genes APOB, APOE, LDLR, PCSK9, SORT1 genes are associated with golgi complex, early endosome, cholesterol metabolism, etc (Supplementary data 1).

Discussion
The multifactorial diseases like diabetes mellitus, coronary artery disease and cancers are leading cause for morbidity and mortality worldwide. Genetic and epigenetic modifications are also recognized as significant risk factors for the pathophysiology of these diseases. Studies reported that epigenetic modifications play a crucial role in cell differentiation at embryonic development 79 . Besides, environmental factors and age affect the DNA methylation and demethylation patterns in mammalians 80   Another study on effect of RAD50 gene DNase I hypersensitive site7 (RHS7) region rs2240032 polymorphism on DNA methylation has shown that, it is significantly affecting the 5q31 locus IL13 gene promoter DNA methylation  64 . In addition, Wang, Z. et al., identified novel functional CpG-SNPs by conditional false discovery rate (cFDR) analysis from statistical data of two large GWAS of type 2 DM and CAD. Among them, 13 CpG-SNPs of DM, 15 CpG-SNPs of CAD have a significant methylation quantitative trait locus effect and increased susceptibility to disease 65 .
In view of the above, the present study has been designed to analyze the impact of single nucleotide variations at CpG sites in promoter CpG islands of ACAT1, APOB, APOE, CYBA, FAS, FLT1, KSR2, LDLR, MMP9, PCSK9, PHOX2A, REST, SH2B3, SORT1 & TIMP1 genes on their respective existence and size.
It has been shown that, APOE is involved in lipid metabolism, tissue repair, inflammation and plays a significant role in age related diseases. APOE modulates its effect on angiogenesis, tumor cell growth and metastasis  www.nature.com/scientificreports/   90 . A study reported that methylation of APOE is significantly lower in men with coronary heart disease than healthy control men and is inversely proportional to APOE plasma levels. Thus, it is considered that the DNA methylation is a potential factor for regulation of APOE gene expression 19 . In the present study, we have observed that APOE rs769448 has abolishing the CGI existence that might influence the methylation pattern and further may regulate the gene expression. The GO enrichment analysis has shown that the APOE gene is a key regulator in the cholesterol metabolism and transportation contributing to the initiation and progression of multiple diseases. Similarly, Low density lipoprotein receptor (LDLR) gene encodes a cell surface LDL receptor protein mediating endocytosis of LDL particles regulate cholesterol levels. Evidences suggest that elevated circulating cholesterol levels are involved in the coronary artery disease, cancer growth promotion and progression [91][92][93] . Ghose, S. et al. reported that LDLR gene undergoes hypomethylation and induces an increased expression which subsequently decreases the LDL levels and reduces the risk of CAD 94 . In the present study, we have observed that 31% of CpG SNVs abolished the CGI existence and ~ 44% decreased the size of CGI. The abolishment and reduced CGI size, decreases the possibility of methylation and inversely increases the gene expression. The increased gene expression associates with decreased LDL-cholesterol levels and lead to reduced risk of diseases.
Furthermore, Src homology 2-B adaptor protein 3 (SH2B3) plays a critical role in haematopoiesis and acts as a negative regulator of several tyrosine kinases and cytokine signaling. SH2B3 was associated with diseases like atherosclerosis and thrombosis, cancers, diabetes, etc [95][96][97] . A recent study on Celiac disease (CeD) revealed that the expression of SH2B3 is influenced by the methylation and it is reported that hypomethylation is associated with higher expression of the genes in CeD patients than controls. The methylated DNA sequence is showing differences in binding of regulatory elements to control the expression of gene at mRNA level 61 . The present study investigations have shown SH2B3 gene promoter has 7% CGI abolishing SNVs besides 17% size reducing SNVs. The differences in CGI existence, binding of transcription factors and CGI size influences the methylation patterns to regulate the expression. According to gene ontology disease class term SH2B3 is playing a significant role in metabolic, cardio vascular and immune diseases.
In recent years, there is a growing interest on matrix metalloproteinase (MMP) family to understand their significant association with various disease pathophysiologies such as cancers, CAD and DM 87 . MMP9 and Tissue inhibitors of metalloproteinases 1 (TIMP1) were known to be associated with the risk of cardiovascular disease and several cancers [98][99][100][101] . A study on MMP9 promoter methylation suggested that serum circulating levels were inversely associated with methylation level in Diabetic nephropathy patients. MMP9 demethylation increases its serum circulating levels that might be accompanying with the incidence and prognosis of diabetic nephropathy 102 . Tissue inhibitors of metalloproteinases (TIMPs) are inhibitors of the MMPs involved in extracellular matrix degradation. In chronic periodontitis, TIMP1 promoter methylation positively correlated with severity of the disease 63 . In another study, DNA methylation in TIMP3 gene contributed to its lower expression and eventually lead to metastasis of oral cancer 103 . In the present analysis, ~ 18% of MMP9 and ~ 67% of TIMP1 CpG SNVs have shown for the loss of CGIs, further 57% of MMP9 and 33% of TIMP1 CpG SNVs reduced the size of CGI. GO enrichment analysis of MMP9, TIMP1 revealed that these two genes are playing a significant role in metabolic, neurological, cardiovascular diseases and cancers. Altogether, abolishment and reduction of CGI size, differential binding of TFs could influence their gene expression in ECM remodelling and degradation which can further mediate the pathological conditions of various diseases.
Further, 50% of ACAT1, ~ 67% of APOB, 57% of CYBA, ~ 92% of FAS, 50% of FLT1, ~ 13% of KSR2, ~ 44% of LDLR, ~ 36% of MMP9, 50% of PCSK9, 36% of PHOX2A, 40% of REST, ~ 14% of SH2B3, ~ 13% of SORT1 and 33% of TIMP1 SNVs are altering the size of CGIs. Among all the 200 SNVs in the genes under study, we have observed that approximately 9% of SNVs at CpG site are abolishing the existence of CpG island; whereas 35% are decreasing the size of CGIs. Consequently, loss of CGI & decreased CGI size leads to the intermittent and asymmetrical DNA methylation pattern of gene which can regulate the expression of genes by affecting binding of transcription factors to the promoter.
The findings of the study suggest that the SNVs at CpG sites in the promoter region regulating CGI existence and size might influence the DNA methylation status and expression of genes that take part in molecular pathways associated with multifactorial diseases like diabetes mellitus, cardiovascular diseases, cancers, etc. The insights of the present study may pave the way for new experimental studies to undertake challenges in DNA methylation, gene expression and protein assays.

Limitations
A primary limitation of the study is that this is an in silico study, designed to know the impact of single nucleotide variations at CpG sites on CpG island existence, size and their respective DNA methylation pattern and gene expression. Another limitation of the study is that the genes are randomly selected from the various pathways to test the hypothesis. Therefore, the predicted results should be essentially validated using experimental analyses such as genotyping, DNA methylation and their subsequent gene expression assays for further correlation with disease phenotypes.

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.