Introduction

Cardiovascular disease remains the leading cause of death worldwide1. In 2019, a prospective urban and rural epidemiological study published in The Lancet showed that 40% of all deaths were caused by cardiovascular diseases2. Although the etiologies of cardiovascular diseases differ, heart failure is commonly their final stage. Notably, pathological cardiac hypertrophy can readily progress to heart failure and has therefore become an increasingly important cause of cardiovascular disease3,4. Hypertrophic cardiomyopathy (HCM), which is characterized by asymmetric myocardial hypertrophy, often occurs in the ventricular septum, resulting in outflow tract obstruction, limited left ventricular filling and reduced compliance5. In some cases, HCM leads to heart failure, myocardial ischemia and sudden death. Thus, early detection of HCM is highly important. HCM is a genetically heterogeneous disorder associated with mutations in certain genes6. The development of gene sequencing technology has raised the significance of genetic testing in the diagnosis of HCM patients with a family history and in asymptomatic patients to avoid sudden death. The European and American guidelines also encourage the use of genetic testing for potential HCM patients7,8.

Machine learning refers to the process of training a system on data so that it can accurately predict future events. It is a multidisciplinary specialty and a type of artificial intelligence applied to data processing in bioinformatics9,10. Machine learning can identify latent patterns in massive amounts of data, outperforming most traditional statistical methods11,12. In the medical field, machine learning can be used to build predictive models that guide precision medicine13. In protein function research, machine learning can improve prediction accuracy and enable more comprehensive analysis14. In metabolic engineering, the integration of machine learning enriches data analysis techniques and enhances the precision of metabolic outcome prediction15. Recently, machine learning methods have been applied to the development of prognostic models for various malignant tumors16,17,18. In addition, multiple machine learning algorithms have been combined to mine genes for predicting bronchopulmonary dysplasia, and the results confirmed that genes discovered by bioinformatics can serve as potential targets for identifying the disease and contribute to its treatment10. However, studies applying machine learning to HCM remain limited.

This study screened differentially expressed genes (DEGs) for HCM from the GSE36961 dataset. Using machine learning methods, key genes were mined, and the area under the curve (AUC) showed that these genes had strong predictive performance. Moreover, a zebrafish model was constructed to verify the effectiveness of the models. Overall, machine learning was used to screen valuable biomarkers for HCM and to analyze their clinical diagnostic value and potential treatment directions.

Materials and methods

Data acquisition and screening of DEGs

The microarray dataset GSE36961 of Bos et al.19, which included mRNA expression data from 106 HCM tissues and 39 normal samples, was downloaded from the Gene Expression Omnibus (GEO) database for analysis. The raw data of GSE36961 were preprocessed using the "limma" package in R software (version 3.6.2). Missing values were imputed using the k-nearest neighbor algorithm20. Raw data were normalized using the robust multiarray average algorithm21. Batch effects were removed with the ComBat method in the "sva" package in R. Principal component analysis (PCA) was used to identify anchor points, and the top two principal components were assessed using the t-distributed stochastic neighbor embedding (t-SNE) technique to reveal notable groupings. The screening thresholds for DEGs were |log2FC| > 1.0 and adjusted p value < 0.05. Subsequently, heatmaps and volcano plots were drawn with the "pheatmap" and "EnhancedVolcano" packages in R, respectively. In addition, the expression levels of the screened key genes in HCM and control samples were validated using the GSE141910 dataset.
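The DEG screening rule can be illustrated with a small sketch. This is not the authors' R pipeline: it is written in Python for illustration, the gene names and numbers are invented, and the Benjamini-Hochberg procedure is an assumption, since the text does not state which p value adjustment was used.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def screen_degs(records, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Split (gene, log2FC, raw p) records into up- and down-regulated DEGs."""
    padj = benjamini_hochberg([p for _, _, p in records])
    up, down = [], []
    for (gene, lfc, _), q in zip(records, padj):
        # Thresholds from the text: |log2FC| > 1.0 and adjusted p < 0.05.
        if abs(lfc) > lfc_cutoff and q < padj_cutoff:
            (up if lfc > 0 else down).append(gene)
    return up, down


# Hypothetical input: (gene, log2 fold change, raw p value).
records = [
    ("gene_a", 1.8, 0.0004),
    ("gene_b", -1.3, 0.003),
    ("gene_c", 0.4, 0.001),   # significant but fold change too small
    ("gene_d", -2.1, 0.2),    # large fold change but not significant
]
up, down = screen_degs(records)
print("up:", up, "down:", down)
```

Both conditions must hold at once, which is why a gene with a small p value but a modest fold change is not counted as a DEG.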

Enrichment analysis of the genes

Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed using the "clusterProfiler" package, a commonly used analytical tool in bioinformatics for identifying statistically significantly enriched biological terms22. The list of DEGs was used as the input for GO analysis with the "enrichGO" function and for KEGG pathway analysis with the "enrichKEGG" function. Subsequently, the Molecular Signatures Database (MSigDB, https://www.gsea-msigdb.org/gsea/index.jsp) was used to obtain the Hallmark and C7 immunologic signature gene sets. With the expression level of DEGs as the phenotype annotation, samples were divided into the HCM group and the normal group, and enrichment was assessed under NOM p < 0.05 and FDR q < 0.25.

Weighted correlation network analysis (WGCNA)

A scale-free weighted gene co-expression network was developed using the "WGCNA" package in R to identify co-expressed genes and modules associated with clinical features23. The data were clustered, outliers were detected, and an appropriate soft threshold was set to obtain a network as close to scale-free as possible. Topological overlap matrix-based hierarchical clustering was used to delineate gene modules. To determine the correlation of each module with clinical features, Pearson correlation coefficients were calculated, the module with the greatest clinical correlation was selected, and the mRNAs in that module were obtained. The "glmnet" package was used to construct a LASSO regression model from the differential genes, and the "e1071", "caret" and "kernlab" packages were loaded to develop the SVM-RFE model. Genes shared by the two models were defined as the signature genes, and a Venn diagram was drawn for visualization.
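As a rough illustration of soft thresholding, an unsigned WGCNA-style adjacency raises the absolute Pearson correlation between two genes to a power β, which suppresses weak correlations and keeps strong ones. The Python sketch below is not the authors' R code: the expression vectors are invented, the unsigned network type is an assumption, and β = 10 simply mirrors the soft threshold reported in the Results.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(expr, beta=10):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta for every gene pair."""
    genes = list(expr)
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for g in genes
        for h in genes
        if g != h
    }


# Hypothetical expression profiles across four samples.
expr = {
    "gene_a": [1, 2, 3, 4],
    "gene_b": [2, 4, 6, 8],   # perfectly correlated with gene_a
    "gene_c": [1, 3, 2, 4],   # correlation 0.8 with gene_a
}
adjacency = soft_threshold_adjacency(expr)
for pair, weight in sorted(adjacency.items()):
    print(pair, round(weight, 4))
```

With β = 10, a correlation of 0.8 is reduced to roughly 0.107 while a correlation of 1 stays at 1, which is how the power pushes the network toward a scale-free topology.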

Protein–protein interaction (PPI) network construction

PPI analysis of turquoise module genes screened by WGCNA was performed using the STRING database (https://string-db.org/). The PPI network was visualized using Cytoscape software24. In the network, nodes represent proteins and edges represent interactions between proteins. The core genes in the network were identified by calculating the degree of connectivity (degree) of each node.
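The degree-based ranking of core genes amounts to counting how many interaction partners each node has. The following Python sketch assumes an undirected edge list in which every interaction appears once; the protein names are placeholders rather than STRING output.

```python
from collections import Counter


def node_degrees(edges):
    """Count the number of interaction partners (degree) of every node."""
    degree = Counter()
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        degree[a] += 1
        degree[b] += 1
    return degree


# Hypothetical undirected PPI edges, each listed once.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
degree = node_degrees(edges)
top_node, top_degree = degree.most_common(1)[0]
print(dict(degree), "most connected:", top_node)
```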

Support vector machine based recursive feature elimination (SVM-RFE)

Support vector machine (SVM), a robust machine learning algorithm, has been extensively used to predict the functions of biological molecules25. Our research employed SVM modeling with the "e1071" R package, with the radial basis function as the kernel function of choice. The SVM learning model was first constructed using all original features to calculate the absolute coefficient |w| for each input attribute. The features were then sorted by the square of |w|, those at the lower end of the ranking were removed, and the remaining attributes were subjected to the same model construction and ranking steps. This cycle was repeated until all features were eliminated26. Features removed later in the process were considered more significant than those eliminated earlier. To determine the optimal number of mRNAs for developing a signature for HCM, the dataset was subjected to fivefold cross-validation. By varying the selected sets, SVM models were trained using different numbers of top mRNAs to calculate the overall prediction error. Finally, receiver operating characteristic (ROC) curves were used to calculate the area under the curve (AUC) for each chosen mRNA feature using the "pROC" package27 in R, so as to assess the effectiveness of these features in the diagnosis of HCM.
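The elimination loop described above can be sketched as follows. This is an illustrative Python sketch, not the e1071 implementation: it assumes a linear decision function, because explicit feature weights w are only directly available for a linear kernel, and it trains that function with a simple subgradient descent on the hinge loss. The data and gene names are invented, and one feature is removed per round.

```python
def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """Fit weights by subgradient descent on the L2-regularised hinge loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for j in range(n_features):
                # The hinge term contributes only when the margin is violated.
                grad = lam * w[j] - (yi * xi[j] if margin < 1 else 0.0)
                w[j] -= lr * grad
            if margin < 1:
                b += lr * yi
    return w


def svm_rfe(X, y, feature_names):
    """Rank features from most to least important by recursive elimination."""
    remaining = list(range(len(feature_names)))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)
        scores = [wj * wj for wj in w]          # ranking criterion: w squared
        worst = min(range(len(remaining)), key=lambda k: scores[k])
        eliminated.append(remaining.pop(worst))  # drop the weakest feature
    # Features eliminated last are the most important.
    return [feature_names[j] for j in reversed(eliminated)]


# Hypothetical data: gene_a separates the classes strongly, gene_b weakly,
# gene_c is constant and therefore uninformative.
X = [
    [2.0, 0.3, 0.0], [1.8, 0.1, 0.0], [2.2, 0.2, 0.0],
    [-2.0, -0.3, 0.0], [-1.8, -0.1, 0.0], [-2.2, -0.2, 0.0],
]
y = [1, 1, 1, -1, -1, -1]
ranking = svm_rfe(X, y, ["gene_a", "gene_b", "gene_c"])
print(ranking)
```

In practice the number of top-ranked features to keep would be chosen by cross-validated prediction error, as described in the text, rather than by inspecting the ranking alone.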

Zebrafish husbandry

Zebrafish were housed at the Zebrafish Research Center of Nantong University. The animal study protocol was approved by the Nantong University Institutional Animal Care and Use Committee (item number: IACUC20221008-1001). Tg(cmlc2:GFP) zebrafish embryos were obtained by natural mating and maintained at 28.5 °C. Embryos at 24 h post-fertilization (hpf) were treated with 0.2 mM 1-phenyl-2-thiourea (PTU).

RNA extraction, reverse transcription, and qRT-PCR

In brief, tissues were homogenized in 1 mL Trizol (Life Technologies). RNA (1 μg) was reverse-transcribed into cDNA using a Reverse Transcription Kit (Vazyme, China).

qRT-PCR was carried out on an ABI StepOne instrument in a total volume of 20 μL. The reference gene was GAPDH. The MYH6 and RASD1 primers for qRT-PCR were: MYH6 F: 5′-AGAATAAGGATGGAGGGA-3′; R: 5′-CTTTAGATTGAACAGCACC-3′; RASD1 F: 5′-CCTCGGGTCCACCAAAGT-3′; R: 5′-GTTCCCTGAAGTATCCAAAA-3′. The qRT-PCR reactions were run in three technical replicates, and the data were collected from at least three independent experiments.

Microinjection

The sgRNA transcription template was prepared by PCR, with the T7 promoter sequence added. The reverse primer was the 25-bp sgRNA-R (Table 1). The MAXIscript T7 Kit (Invitrogen, USA) was used to obtain sgRNAs by in vitro transcription. The mMessage mMachine T7 Kit (Invitrogen, USA) was used to prepare capped dCas9 mRNA by in vitro transcription. The injection concentrations of dCas9 mRNA and sgRNA (Table 1) were adjusted to 300 and 200 ng/μL, respectively.

Table 1 The sgRNAs in this study.

Imaging and statistics

Heart development in Tg(cmlc2:GFP) zebrafish was observed by anesthetizing embryos with egg water (1% PTU and 0.6% low agarose). Images were taken using a fluorescence microscope (IX71, Olympus, Japan). Measurements were made using ImageJ. Differences between groups were analyzed using Student's t-test, and the results were plotted in GraphPad Prism 8.

Ethical approval and consent to participate

All methods were carried out in accordance with relevant guidelines and regulations. The animal protocols used in this investigation were approved by the Nantong University Institutional Animal Care and Use Committee (Item number is IACUC20221008-1001), complying with the ARRIVE guidelines. All authors have participated in the work and have reviewed and agree with the contents of the article for publication.

Results

Identification of DEGs between HCM and control groups

PCA clearly separated HCM samples from normal samples, indicating that the GSE36961 dataset used in this study is of good quality (Supplementary Fig. S1). Subsequently, to screen abnormally expressed genes in HCM, a total of 157 DEGs, comprising 47 up-regulated genes and 110 down-regulated genes, were identified between the HCM and normal groups (Fig. 1A,B).

Figure 1

Screening of differentially expressed genes (DEGs) between HCM and normal controls based on the GSE36961 dataset. Heatmap (A) and volcano plot (B) showing the DEGs.

Enrichment analysis

GO analysis showed that the DEGs were enriched in the positive regulation of inflammatory response and external stimuli (Fig. 2A). According to the KEGG analysis, the genes were enriched in the complement and coagulation cascades (Fig. 2B). Combining the GO and KEGG analyses, we found that the DEGs were associated with immune pathways. Immunologic gene sets from the C7 collection enriched in the HCM group are shown in Fig. 2C, and those enriched in the normal group are shown in Fig. 2D. These results confirmed that the DEGs were correlated with immune-related signals.

Figure 2

Enrichment analysis. (A) GO analysis. (B) KEGG analysis. (C) Immunologic gene sets in the C7 collection enriched in the HCM group. (D) Immunologic gene sets in the C7 collection enriched in the normal group.

Identification of hub genes by WGCNA and machine learning methods

To screen hub genes, we constructed a WGCNA co-expression network from the 157 DEGs. As shown in Fig. 3A,B, the scale-free network was generated under a soft threshold of 10 (R² = 0.88), as supported by the adjacency matrix and topological overlap matrix. The modules were then delineated by average hierarchical clustering and dynamic tree cutting (Fig. 4A). The turquoise module was significantly correlated with HCM patients and was therefore selected for further analysis (Fig. 4B). As shown in Supplementary Fig. S2, we performed PPI analysis on the genes of this module to observe their interactions. The result suggests that these genes may be involved in the occurrence and development of HCM through multiple interactions.

Figure 3

The soft-threshold power in the WGCNA. (A) Calculating the mean connectivity and the fit of the scale-free topology model to determine the optimal soft-thresholding power β. (B) Checking the scale-free topology.

Figure 4

Identification of modules associated with HCM patients. (A) Heatmap of the correlation between the module and HCM patients. (B) The MEturquoise module associated with the HCM patients.

In addition, we further identified signature genes with diagnostic significance for HCM by LASSO regression analysis. As shown in Fig. 5A,B, the coefficients of the model gradually approached 0 as the penalty parameter λ increased, and 14 signature genes were finally retained for subsequent studies. Meanwhile, we used the SVM-RFE method to recursively remove features and obtained 28 key marker genes (Fig. 5C). By integrating the LASSO regression model genes, SVM model genes and WGCNA module genes (Fig. 5D), FCN3, MYH6 and RASD1 were identified as hub genes for HCM.

Figure 5

Screening the hub genes by machine learning. (A,B) LASSO algorithm to screen genes. (C) SVM-RFE algorithm to screen genes. (D) Venn diagram showing the 3 signature genes for diagnosing HCM identified jointly by WGCNA, LASSO regression analysis, and SVM-RFE.

Validation of the hub genes

We found that the expression of the key signature genes (FCN3, MYH6, and RASD1) was significantly downregulated in HCM patients relative to normal controls (p value < 0.05) (Fig. 6A–C). Similarly, validation based on the GSE141910 dataset showed that the expression of these three genes was significantly down-regulated in HCM samples relative to normal control samples (Supplementary Fig. S3). In addition, the ROC curves of FCN3 (AUC = 0.968, CI = 0.917–0.998), MYH6 (AUC = 0.954, CI = 0.899–0.995), and RASD1 (AUC = 0.978, CI = 0.952–0.997) showed high discriminative performance for the diagnosis of HCM patients (Fig. 6D–F).
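For readers less familiar with the metric, the AUC of a single gene equals the probability that a randomly chosen HCM sample scores higher than a randomly chosen control sample, with ties counted as one half. The Python sketch below illustrates this with invented expression values, not the study data; because the genes are downregulated in HCM, the expression value is negated so that a higher score points toward HCM.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical expression values of one downregulated gene.
hcm_expr = [5.1, 5.4, 5.0, 6.2]
control_expr = [6.0, 6.5, 6.8]

# Lower expression indicates HCM, so negate to obtain a risk score.
auc = roc_auc([-v for v in hcm_expr], [-v for v in control_expr])
print(round(auc, 3))
```

Here 11 of the 12 sample pairs are ordered correctly, giving an AUC of about 0.917.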

Figure 6

Validation of the expression and diagnostic value of the 3 key signature genes based on the GSE36961 dataset. (A–C) Differential expression analysis of FCN3 (A), MYH6 (B), and RASD1 (C) in the normal control and HCM groups. (D–F) ROC curves validating the diagnostic value of FCN3 (D), MYH6 (E), and RASD1 (F) in patients with HCM.

Importantly, our in vivo zebrafish model also validated the effects of MYH6 and RASD1 on the heart. Here, we used CRISPRi technology in the Tg(cmlc2:GFP) background to construct zebrafish models with MYH6 and RASD1 knockdown. Knockdown of MYH6 and RASD1 resulted in malformation of the zebrafish hearts (Figs. 7A, 8A). The bar graphs showed significantly reduced expression of MYH6 and RASD1 in the respective knockdown groups (Figs. 7B, 8B). In addition, after knockdown of MYH6 and RASD1, we observed a reduced heart rate, a prolonged bulbus arteriosus-sinus venosus (BA-SV) distance, and a decrease in fractional shortening percentage (FS%), which suggested significant effects of these two genes on cardiac rhythm and cardiac pumping capacity (Figs. 7C–E, 8C–E). Significant differences in the ventricular diastole (VD) and systole (VS) phases were also observed after knocking down these two genes. Analysis of the ventricular volume of zebrafish with MYH6 (48 hpf) and RASD1 (72 hpf) knockdown showed that the atria and ventricles of the embryos could still form after knockdown, but the ventricles were smaller than those in the wild-type group (Figs. 7F–H and 8F–H). Knockdown of the MYH6 and RASD1 genes slowed the heartbeat and reduced ventricular function in zebrafish, resembling the symptoms of patients with HCM. These results suggested that MYH6 and RASD1 are potential driver genes in HCM development.

Figure 7

Cardiac malformation phenotype of MYH6-knockdown zebrafish. (A) The lateral body and enlarged heart of MYH6 zebrafish, control, n = 7, MYH6 KN, n = 7, white line, venous sinus-arterial distance. (B) Down-regulation efficiency of the MYH6 gene. (C) Heart rate (times/10 s), control, n = 10, MYH6 KN, n = 7. (D) BA-SV distance, control, n = 7, MYH6 KN, n = 7. (E) Ventricular shortening fraction analysis, control, n = 7, MYH6 KN, n = 7. (F–H) The ventricle volume of zebrafish in systole and diastole, control, n = 7, MYH6 KN, n = 7, a atrium, v ventricle, s systole, d diastole; white line, the major and minor axes of the ventricles; bar = 200 μm.

Figure 8

Cardiac malformation phenotype of RASD1-knockdown zebrafish. (A) The lateral body and enlarged heart of RASD1 zebrafish, control, n = 7, RASD1 KN, n = 7, white line, venous sinus-arterial distance. (B) Down-regulation efficiency of the RASD1 gene. (C) Heart rate (times/10 s), control, n = 7, RASD1 KN, n = 7. (D) BA-SV distance, control, n = 7, RASD1 KN, n = 7. (E) Ventricular shortening fraction analysis, control, n = 7, RASD1 KN, n = 7. (F–H) The ventricle volume of zebrafish in systole and diastole, control, n = 7, RASD1 KN, n = 7, a atrium, v ventricle, s systole, d diastole; white line, the major and minor axes of the ventricles; bar = 200 μm.

Discussion

Myocardial hypertrophy is an important cause of a variety of major cardiovascular diseases. The process of cardiomyocyte enlargement is accompanied by an increase in myocardial interstitium, which will lead to myocardial hypoxia and cardiac remodeling due to excessive capillary supply. Without timely intervention, the heart will progress to an irreversible decompensation period, accompanied by cardiac dysfunction and ultimately leading to heart failure and death. HCM is caused by myocardial hypertrophy and manifests as fatigue, dyspnea, chest pain and sudden death. It has been demonstrated that HCM is genetically heterogeneous28. This provides a basis for exploring the biological information contained in HCM gene expression and for efficiently mining HCM-related target genes from large amounts of data, which may help reduce adverse cardiovascular events. In this study, we obtained data on 106 HCM patients and 39 normal controls from the GEO database, from which a total of 157 DEGs were screened. Subsequent enrichment analysis showed that these DEGs were mainly enriched in immune and inflammatory response pathways. Previous studies have found mild systemic and local inflammation in individuals with HCM. Patients with HCM have mild chronic inflammatory cell infiltration in their myocardium29,30,31. Specifically, levels of circulating inflammatory markers, including tumor necrosis factor-α (TNF-α) and interleukin-6 (IL-6), were high in HCM32,33,34. In addition, the immune system maintains the normal physiological function of the heart in HCM, and damage to the immune system leads to the development of abnormal inflammatory responses and myocardial remodeling35,36. These findings support a crucial role of immune and inflammatory pathway mechanisms in HCM.

The use of high-throughput sequencing technologies and bioinformatics offers the possibility to further identify and detect HCM predictors. Yu et al. combined multiple machine learning algorithms to screen three key signature genes related to hypoxia and immunity from HCM-related datasets37. Li et al. used SVM-RFE and random forest algorithms to select m6A regulators associated with HCM and used LASSO to establish a gene signature that can distinguish HCM patients from normal controls38. In our study, three key hub genes (MYH6, RASD1 and FCN3) were screened by LASSO, SVM-RFE and WGCNA. A previous study confirmed that MYH6 mutations present phenotypes such as sudden cardiac death and heart failure associated with HCM in Japanese families39. Previous studies have also found that MYH6-/- mutation causes dysfunction of the sinus node and slows the heartbeat40,41. We found the same phenotype by knocking down MYH6 in zebrafish. Sinus node dysfunction is a clinical disease that presents with bradycardia. In our study, zebrafish with MYH6 knockdown also had bradycardia, further supporting the relationship between MYH6 and sinus node dysfunction. In addition, after MYH6 knockdown, the FS and the systolic and diastolic ventricular volumes of zebrafish were significantly reduced. Low FS and ventricle volume represent heart malformation and cardiac dysfunction. Low FS in zebrafish with GTP3 mutation has also been previously described42. Other studies have reported low ventricle volume and malformation in zebrafish carrying gene mutations, which is consistent with our findings43,44.

Ras-related protein 1 (RASD1) is a dexamethasone-induced monomeric Ras-like G protein that oscillates in the suprachiasmatic nucleus (SCN). As a novel signaling protein involved in a variety of cellular processes, RASD1 plays an important role in regulating uterine remodeling dynamics during the estrous cycle in utero45. In the cardiovascular system, knockdown of RASD1 significantly increases the secretion of atrial natriuretic factor (ANF) and negatively affects cardiovascular homeostasis46. In this study, we knocked down RASD1 in zebrafish and found that the decrease in RASD1 affected the heartbeat and ventricle volume, causing bradycardia and small ventricle volume. This may be explained by the interaction between RASD1 and renin, as RASD1 is involved in renin transcriptional regulation47. It has been demonstrated that the renin-angiotensin system is an important candidate for susceptibility to left ventricular hypertrophy (LVH). Gene polymorphisms of the renin-angiotensin system play a key role in HCM48,49. The renin-angiotensin system regulates human blood pressure and the expression of cardiac hypertrophy, affecting myocardial remodeling. The knockdown of RASD1 influences the renin-angiotensin system and ANF regulation, leading to the failure of myocardial remodeling and heart malformation.

FCN3, also known as thermolabile β-2 macroglycoprotein, is a lectin-pathway protein that is mainly expressed in the lung and liver50. In the cardiovascular system, studies have found that FCN3 is involved in heart failure51,52 and hypertension53. Moreover, bioinformatics analysis showed that FCN3 may be a potential key dysfunctional gene in HCM54, which is similar to our findings. However, we could not use a zebrafish model with FCN3 knockdown because we did not find a homologous gene expressed in zebrafish. Thus, further experiments are needed to verify the relationship between FCN3 and HCM.

However, owing to the complex etiology and pathogenesis of myocardial hypertrophy, this study still had some limitations. For instance, the current sample size was relatively small and we did not perform specific experiments for verification. In the future, we will expand the sample size to verify the present findings. In addition, as we did not find a homologous gene for FCN3 in zebrafish, we were unable to construct a knockdown model for FCN3. Therefore, we will use other model organisms to investigate the role of FCN3 in future studies.

Conclusion

In conclusion, we found that the expression of FCN3, MYH6 and RASD1 was downregulated in patients with HCM. The effects of MYH6 and RASD1 on cardiac function and cardiac mass index were demonstrated using a zebrafish model. The combined use of these genes may be useful for HCM diagnosis.