Introduction

IPF is a chronic progressive interstitial lung disease characterized by developing lung scarring and deterioration of lung function1. The abnormal extracellular matrix accumulates with the lung tissue and vascular system are repeatedly micro-damaged, and the alveolar structure is destroyed, resulting in a decrease in the lung tissue's ability to exchange gas with the outside2,3,4. The clinical manifestations are cough and dyspnea, severe cases can lead to respiratory failure. IPF mainly occurs in people at 50 years of age and older, and the incidence increases with age5,6,7. IPF is widely believed to result from the interaction of genetics, environmental risk, and ageing8. Most IPF patients also have multiple complications, such as heart failure, gastroesophageal reflux disease, obstructive apnea, etc.9,10

The challenge to clinicians is distinguishing IPF from other idiopathic interstitial pneumonias, high-resolution CT of the chest and lung biopsy are advised now, but the risk of surgical lung biopsy is greater for elder pattens11. With the development of high-throughput sequencing technology, the molecular alteration can be observed and the changes at RNA level can be more accurately determined in disease12,13,14.

In addition, with the development of artificial intelligence technology, machine learning and deep learning models have recently made significant contributions to the development of predictive medicine and modern pathological practice15,16,17. These models enable people to better interpret high-throughput data, reduce the dependence of disease diagnosis on subjective consciousness, and provide more precise criteria for disease diagnosis18,19,20. In this study, Gene Expression Ontology (GEO) databases were used to screen the key genes and construct an artificial neural network model for diagnosis of idiopathic pulmonary fibrosis.

Materials and methods

Statement

Our study is based on open-source databases(GEO), there are no ethical issues and other conflicts of interest. which belongs to public databases. The patients involved in the database have been obtained ethical approval. Users can download relevant data for free for research and publish relevant articles.

Data download and processing

We used keywords “idiopathic pulmonary fibrosis”, “pulmonary fibrosis”, and “lung fibrosis” to search for relevant datasets in the GEO database. Specifically, we included datasets in our analysis if they met the following criteria: (1) Adequate sample size. (2) Included both normal and disease groups. (3) RNA was extracted from lung tissue. (4) Adequate evidence for a diagnosis of IPF, such as HRCT. (5) Clinical information was collected in a standardized manner. In GSE47460, these are 582 total subjects, 254 have interstitial lung disease, 220 have COPD, and 108 are controls21. GSE110147 lung samples were obtained from the recipients’ organs of 22 patients with IPF, 10 with NSIP (non-specific interstitial pneumonia) and 5 with mixed IPF-NSIP undergoing lung transplantation22. In GSE53845 RNA was extracted directly from lung tissue samples from 40 IPF patients or 8 healthy controls23. In GSE70866, BAL cells were harvested from a discovery cohort of 62 patients from Freiburg, Germany, and two independent validation cohorts, Siena, Italy (50 patients) and Leuven, Belgium (64 patients)24. The GEOquery package was used to obtain the expression profiles and clinical phenotype data of the microarray datasets GSE47460, GSE53845, GSE110147, GSE32537, and GSE70866. Only IPF samples and normal samples for subsequent analysis were retained, excluding other lung samples. The annotation information of the chip probes of the corresponding platforms was obtained from the GEO database, respectively. During the conversion of the microarray probe ID and gene symbols, multiple probes corresponding to one gene symbol were found. Considering the reliability of the data, 'many-to-one' probe expression levels were used for gene expression levels. GSE47460 database had the largest sample size among the available databases and therefore provided the most possibility to identify differentially expressed genes.

Differential gene expression and enrichment analysis

Principal Component Analysis (PCA) was performed to identify spatial sample separation in the patient cohort using an R package factoextra25. The R software package limma26 was used to achieved differential analysis on 91 control and 122 IPF samples of GSE47460, genes differentially expressed with P values < 0.05 and fold changes > 1.5 or < 2/3 were visualized using an R package heatmap. The metascape tool27 was used to carry out enrichment analysis with DEGs on multiple databases including GO database, KEGG database, Reactome database, Wikipathway database28,29.

Random forest screens the top signatures

126 DEGs from GSE47460 (93 up-regulated and 33 down-regulated) were used to construct the random forest model (Table 1). The construction of random forest model and the chosen of top signatures were used the methods of Tian1. The R package pheatmap was used to perform k-means unsupervised clustering of the GSE47460 dataset and visualize.

Table 1. 126 DEGs from GSE47460.

Establishment of IPF classification model with artificial neural network

We used the top six signatures expression in another dataset of GSE32537 to construct an artificial neural network model using the R software package neuralnet. Taking the four hidden layers as the model parameters, the IPF disease classification model is constructed through the obtained gene weight information. Five-fold cross-validation were performed by the Caret package, pROC package was used to estimate the value of AUC30.

Additional data verification

On three independent datasets (GSE47460, GSE53845 and GSE110147), the validity of the constructed classification scoring model of IPF disease and normal samples was verified. The clinical data of GSE70866 were used to evaluate this potential of this model to indicate the patient prognosis and survival.

Clinical parameters

Clinical straits of IPF patients and healthy control such as the age, gender, pulmonary function tests (PFT) et al. were obtained in GES47460 dataset. The commonly PFT including forced vital capacity (FVC) (% pred.), FVC (post.), forced expiratory volume in 1 s (FEV1) (%pred.), FEV1(post.), and diffusing capacity of the lung for carbon monoxide (DLCO) (%pred.)31 were combined into a single "meta" lung function indicator by R package factoextra. FVC(post) and FEV1(post) refer to the post-bronchodilator forced vital capacity (FVC) and forced expiratory volume in one second (FEV1). The design of this study including the main four-step process was represented by a flow chat (Fig. 1).

Figure 1
figure 1

Flow chat.

Statistical analysis

The R package limma is used for differential expression analysis. To fit the expression data to a linear model and perform empirical Bayes moderation to shrink the standard errors and increase the accuracy of the results. Visualize the results using limma’s built-in plotting functions or by exporting the data and using ggplot2 for visualization. To remove batch effects using the Combat package, the combat() function to adjust for batch effects by specifying the variables that contain the batch information and the variables to be adjusted. OS time and cause of death were obtained and matched to respective patients from the supplemental clinical data available from the GSE70866. Survival time was measured in days starting at diagnosis and ending on the patient’s death or end of the follow-up period. Kaplan–Meier method was used to estimate overall patient survival by genes expression. The high—and low-risk groups were differentiated according to the expression value, with each group containing at least one third of the total sample. The log–rank and Wilcoxon tests were used to compare survival distributions. Correlations were calculated using Spearman’s rank correlation (presented as Spearman rho). The resulting coefficient will range from − 1 to 1, where  − 1 indicates a perfect inverse relationship, 0 indicates no relationship, and 1 indicates a perfect direct relationship.

Results

Significantly alteration of the genes and enriched signal pathway in IPF

The design of this study including the main four-step process was represented by a flow chat (Fig. 1). Totally 213 subjects in the GSE47460 dataset, including 91 healthy control and 122 IPF patients, the Bayesian test in the limma package were used to identify DEGs between normal and IPF samples. One hundred and twenty-six significantly differentially expressed proteins (DEPs) (adj.P.Val < 0.05 and a differential expression ratio [IPF/N] > 2 or < 0.5), including 93 up-regulated genes and 33 down-regulated genes were identified and heatmap in Fig. 2A, B. Then the metascape tool was used for pathway enrichment analysis of 126 important DEGs, the thresholds set at a -Log10 (P value) of > 2.5. Of the DEGs upregulated in patients with IPF were enriched in pathways associated with collagen degradation, NABA CORE MATRISOME and lung fibrosis (Fig. 2C), whereas DEGs those involved in cellular response to lipid, regulation of cytokine, positive regulation of cell death (Fig. 2D). These pathways were combined with GO database, KEGG database, Reactome database and Wikipathway database.

Figure 2
figure 2

Differential gene expression analysis in IPF. (A) Volcano plot of differential expression analysis results. The abscissa is log2Fold Change and the ordinate is –log10 (adj.P value). The upper right part has a adj.P value less than 0.05 and a fold change greater than 2, indicating significant DEGs with higher expression levels. The upper left part has a adj.P value less than 0.05 and a fold change less than − 2, indicating significant DEGs with reduced expression. The gray dots represent the remaining stable genes. (B) Heatmap of DEGs. The colors in the graph from red to blue indicate high to low expression. On the upper part of the heatmap, the red band indicates the disease samples and the blue band indicates the normal samples. C-D. Matescape toll function enrichment results bar graph. The x-axis represents −log10(adj P) values and the y-axis represents enriched pathways. Pathways with Log10(P value) of > 2.5 are marked and shown in the figure. (C) shows a bar graph of the enriched pathways that were significantly up-regulated in IPF patients compared to healthy controls. (D) shows a bar graph of the enrichment pathway results that were significantly downregulated in IPF patients compared to healthy controls.

Random forest classifies the DEGs between IPF and healthy control

The 126 DEGs were further classified by random forest classifier. In order to further obtain a model with stable error, appropriate parameters are selected by changing the number of decision trees, and finally 500 trees are set as the optimal parameters of the model (Fig. 3A). Nine was determined as the parameter of variable number, the importance of features by calculating the purity of nodes through Gini coefficient method were computed, the top 20 potential indicators were showed in Fig. 3B. Next, DEGs with importance greater than 4 or equal to 4 were screened for further analysis. which are CDH3, ADAMTS14, IL13RA2, HS6ST2, DIO2 and IGFL2 sequentially (Fig. 3B). The association between the top six genes’ expression with the age, gender, smoking history and disease stage, and status were heatmap in Fig. 3C by k-means unsupervised cluster, which indicated that the genes panel can be used to distinguish IPF patients from the control samples. Although older age and male increased susceptibility to IPF32, there was no significant correlation between the expression of the six genes and age or sex (Fig. S1). This suggests that the six genes are not affected by these factors in distinguishing between normal and IPF.

Figure 3
figure 3

Random Forest screening for DEGs. (A) The effect of the number of decision trees on the error rate. The x-axis represents the number of decision trees, and the y-axis represents the error rate. When the number of decision trees is about 500, the error rate is relatively stable. (B) Results of the Gini coefficient method in the random forest classifier. The x-axis represents the importance index, and the y-axis represents the genetic variables. Rank and display the top 20 genes of importance coefficient. (C) The unsupervised clustering heatmap shows the hierarchical clustering results generated from six significant genes generated by a random forest in GSE47460. On the upper part of the heatmap, the red band in the status module represents normal samples, and the blue band represents disease samples; the color in the age module gradually changes from white to green, representing the increasing age of the sample; the light green band in the gender module represents male samples, the purple strip represents female samples; the green strip in the gold stage module means AT Risk, the green strip means Moderate COPD; the purple strip means Severe COPD; the rose-red strip means unknown; the yellow strip in the smoking history module means the current still Smoking; green strips have ever smoked; blue strips have never smoked; orange strips are unknown.

Construction of the artificial neural network model

The convolutional neural network model was constructed using another dataset GSE32537 by the neuralnet package. Before training the neural network model, we need to set important parameters, especially the number of hidden layers and the number of neurons. There were no fixed rules for the setting of these two values, which relied more on experience and constant attempt. After many tests on the number of hidden layers, we found that when the number of hidden layers was 5, the training effect of the model was the best. Six neurons were finally set as model parameters based on the size of the input layer, commonly two-thirds of the input size was recommended. In order to further strengthen the stability of the neural network model, the GSE32537 dataset was randomly divided into the training set and the validation set, the fivefold cross-validation method was used for 5 iterations of optimization. The more important DEGs and their corresponding weight coefficients were learned from the training set. The classification effect of the scoring model was proved by the validation on other datasets, and the classification accuracy of the neural network model on the verification set was recorded each time. The receiver operating characteristic (ROC) curve is used to evaluate the classification performance of the model. The fivefold cross-validation results show that the AUC value of the area under the ROC curve is close to 1 (average AUC ≈ 0.99) (Fig. 4A), which indicating that the classification accuracy of this convolutional neural network model was high.

Figure 4
figure 4

Construction of the artificial neural network model. (A) Verification of the ROC curve results by the five-time cross-validation model in GSE32537. The points marked on the ROC curve are the optimal threshold points, and the values in parentheses represent sensitivity and specificity. The AUC value is the area under the ROC curve. (B) Results of neural network visualization.

The training of the whole neural network model was performed in 28,730 steps, In the connection weights between neurons of the network, the positive weights were connected by black lines, the negative weights were connected by gray lines, and the thickness of the lines reflected the value of the weight. The termination condition of neural network training was the absolute partial derivative of the error function was less than 0.01(almost reached threshold), and the output result of the entire training process shown in Fig. 4B.

Model accuracy verification

The trained neural network model was put into three independent datasets of GSE47460, GSE110147 and GSE53845 for verification. The data in the three datasets were standardized before verification. In GSE47460 dataset, the sensitivity was 90%, the specificity was 85%, and the AUC was 0.856 (Fig. 5A). In GSE110147 dataset, the sensitivity was 100%, the specificity was 100%, and the AUC was 1 (Fig. 5C). In GSE53845, the sensitivity was 75%, the specificity was 90%, and the AUC was 0.880 (Fig. 5E). The confusion matrix results of GSE47460, GSE110147 and GSE53845 are shown in Fig. 5B,D and F respectively. These data demonstrated that the accuracy of this model is reliable.

Figure 5
figure 5

Model accuracy verification. (A) Verification of the ROC curve results in GSE47460. The points marked on the ROC curve are the optimal threshold points, and the values in parentheses represent sensitivity and specificity. The AUC value is the area under the ROC curve. (B) GSE47460 confusion matrix result. The x-axis represents the predicted results, and the y-axis represents the actual results. (C) Verification of the ROC curve results in GSE110147. The points marked on the ROC curve are the optimal threshold points, and the values in parentheses represent sensitivity and specificity. The AUC value is the area under the ROC curve. (D) GSE110147 confusion matrix result. The x-axis represents the predicted results, and the y-axis represents the actual results. (E) Verification of the ROC curve results in GSE53845. The points marked on the ROC curve are the optimal threshold points, and the values in parentheses represent sensitivity and specificity. The AUC value is the area under the ROC curve. (F) GSE53845 confusion matrix result. The x-axis represents the predicted results, and the y-axis represents the actual results.

Survival analysis

To further estimate the prognostic effect of the identified candidate genes in IPF, the complete dataset (GSE70866) of RNA-seq samples (bronchoalveolar lavage fluid) with follow-up comprised 194 specimens from IPF patients (n = 176) and normal controls (n = 18). Cox proportional hazards regression model and Kaplan–Meier method (product-limit method) were used to calculate the correlation between gene expression and survival status. Univariate COX results showed that CDH3 was a potential prognostic marker (HR = 1.3, pvalue = 0.0013, Fig. S2). Of the six signatures, CDH3, ADAMTS14 and DIO2 showed a different significant association with overall survival in IPF (Fig. 6A–F). Patients with high expression of CDH3 and ADAMTS14 had a poor prognosis (Fig. 6A, B), while those with high expression of DIO2 had a good prognosis (Fig. 6E).

Figure 6
figure 6

Survival predictive analysis. (A) CDH3 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. (B) ADAMTS14 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. (C) IL13RA2 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. (D) HS6ST2 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. (E) DIO2 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. (F) IGFL2 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. The x-axis represents time and the y-axis represents survival probability. The yellow line represents the high gene expression group, and the blue line represents the gene low expression group. Each point on the curve represents the patient's survival rate at that time point.

Six signatures correlation analysis with clinical features

The dataset (GSE47460) including the RNA-seq counts and clinical data from IPF patients was used to yielded the global correlation network heatmap shown in Fig. 7A. Quantification of multiple combinations of clinical lung function parameters into a single "meta" lung function measure by principal component analysis. Next, we performed linear multivariate regression analysis the mRNA expression levels associated with the meta lung function variable. CDH3, ADAMTS14, IL13RA2, HS6ST2, DIO2 and IGFL2 are positive correlated to lung function with R value at about 0.6 (Fig. 7B).

Figure 7
figure 7

Six signature genes were significantly associated with clinical features. (A) The heatmap illustrates the computationally derived meta lung function variable combing multiple lung function parameters. In the upper part of the heatmap, the color in the meta lung function module gradually changes from white to green, representing an increase in the sample meta lung function; the blue bars in the gender module represent male samples, and the red bars represent female samples; the age module The color gradually changes from white to purple, representing the increasing age of the sample. On the right side of the heat map, there are clinical indicators DLCO, FVC (pred), FVC (post), FEV1(pred), and FEV1(post). Pred, predict; Post, post-bronchodilator. (B) The scatter plots show the positive correlation of the indicated genes with meta lung function. The x-axis represents gene expression, and the y-axis represents meta lung function.

Discussion

In this study, we calculated differentially expressed genes (DEGs) related to idiopathic pulmonary fibrosis (IPF), and obtained six important candidate DEGs using a random forest classifier. we used a neural network model to determine the predicted weights of related genes and construct a classification model score for IPF. We then evaluated the classification efficiency of the model score in three independent sample datasets17. We found that the AUC efficiency of our model, called neura IPF, was excellent. However, the lack of gene data for IPF in the GEO database meant that the genetic characteristics of IPF were not included in the construction of the diagnostic model, potentially compromising its diagnostic effectiveness for IPF.

Of these six genes, a major function of DIO2 is to convert T4 to active T3 so that TH (Thyroid hormone) can be activated. IPF, diabetic nephropathy, and myocardial infarction have all been associated with a poor prognosis with hypothyroidism21,33,34,35,36. The expression and activity of DIO2 are increased in the lungs of patients with IPF and are correlated with disease severity. DIO2 mainly localizes to AECs, which are thought to play a central role in the cycle of injury and repair that is characteristic of IPF21. DIO2 is significantly upregulated in the fibrotic state, but this upregulation is thought to be protective. This was further confirmed in our prognostic analysis21.

Insulin-like growth factors (IGFs) and their binding proteins (IGFBPs) play a critical role in pulmonary fibrosis development and progression37. It has previously been shown that IGFBP2 and IGF-like family member 2 (IGFL2) are upregulated in SSc-PF and IPF38. IGFL2 is secreted form in the ECM, its expression is also increased in IPF. IGFL2 expression levels were significantly reduced in human skin fibroblasts aged with mitochondrial function, suggesting that mitochondrial physiological processes are associated with IGFL238. IGFL2 play critical roles in cellular energy metabolism and in growth and development, especially prenatal growth. However, there has been no relevant research exploring its role in the pathogenesis of IPF.

Classical cadherins are the principle adhesive proteins at cohesive intercellular junctions and are essential proteins for morphogenesis and tissue homeostasis39. P-cadherin is a calcium dependent cell–cell adhesion glycoprotein, which has a crucial role in the conservation of the structural integrity of epithelial tissues. Like other members of the cadherin family, P-cadherin (CDH3) regulates several cellular homeostatic processes that participate in embryonic development and maintain adult tissue architecture, being important for cell differentiation, cell shape, cell polarity, growth, and migration40. It is worth noting that DIO2 and IGFL2 also play an important role in growth and development. These three genes play a major role in development, regeneration, morphogenesis and so on. This highlights the prominent role of tissue formation and development in the pathogenesis of fibrosis.

Interleukin (IL)-13 has been shown to play a role in several inflammatory and fibrotic diseases41. IL-13 modulates its effector functions via a complex receptor system that includes the IL-4 receptor (R) α, IL-13Rα1, and the IL-13Rα2. IL-13Rα1 binds IL-13 with low affinity, yet, when it forms a complex with IL-4α, it binds with much higher affinity, inducing the effector functions of IL-13. IL-13Rα2 binds IL-13 with high affinity but has a short cytoplasmic tail and has been shown to act as a nonsignaling decoy receptor. Transfection of fibroblasts and epithelial cells with IL-13Rα2 inhibited the IL-13 induction of soluble collagen, TGF-β, and CCL17. Adenoviral overexpression of IL-13Rα2 in the lung reduced bleomycin-induced fibrosis41.

Heparan sulfate (HS) proteoglycan is a glycosaminoglycan widely distributed on the surface of animal cells and extracellular matrix, and regulates cell growth, differentiation, adhesion, and migration by interacting with various ligands complement. Compared with normal lung, IPF lung showed significantly increased HS6-O-sulfuration and HS6-O-sulftransferase 1 and 2 (HS6ST1/2) mRNA overexpression. Immunohistochemistry showed that HS6ST2 was specifically expressed in bronchial epithelial cells, including IPF lung honeycomb cyst lining cells42. Both IL13RA2 and HS6ST2 highlight the importance of ligand receptor interactions, highlighting that cell–cell interactions in the microenvironment may be a major cause of fibrosis progression.

ADAMTS14 gene encodes a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motif) protein family. As reported by previously studies, the ADAMTS14 gene was discovered to play critical roles in the progress of inflammation and the immune system, through a crosstalk of the TGF-β pathways and mesenchymal cells43. ADAMTS14 gene polymorphism was associated with knee osteoarthritis44 or the osteoarthritis of the temporomandibular joint in Chinese Han women45. But so far, no studies have been able to clarify ADAMTS14's role in pulmonary fibrosis. Like HS6ST2, ADAMTS14 is also mainly present in extracellular matrix, which may suggest that the composition or structure of extracellular matrix is also an important pathological factor that should not be ignored in pulmonary fibrosis.

This model has made significant progress compared to other models in previous studies37. This progress is primarily reflected in the use of fewer feature variables, the validation of the model using a large-scale dataset, and its strong predictive performance. There still are some limitations in this study, the sample sizes of the cohorts are still relatively big enoughwhich may not be sufficient to represent the overall population precisely and could affect the generalizability of diagnostic model. Additionally, this diagnostic model is based on preliminary findings and short of the sound experimental verification to support its reliability. As such, given these limitations, this diagnostic model requires further investigation to determine whether it can be used in clinical decision-making.

In conclusion, we constructed an artificial neural network model that demonstrated robust performance across multiple cohorts. We assessed the relationship between each gene of the model and demographic variables. The majority of the genes showed no association with age or gender, but all presented close correlation with clinical features. CDH3, ADAMTS14 and DIO2 were found to be related to prognosis. These results are useful to prioritize targeting these indicators for diagnosis and drug development in future.