Pulmonary arterial hypertension (PAH) is a chronic cardiopulmonary syndrome characterized by salient pulmonary vascular remodeling and accelerating increase in pulmonary vascular load, bringing about right ventricular (RV) enlargement and pulmonary blood vessels remodeling1 and ultimately leading to heart failure or even death if untreated2. Plentiful widespread cardiopulmonary diseases complicated with PAH, which badly increases morbidity and mortality1. PAH is a rare disease with reported prevalence of 15–50 cases in a million people and incidence of 5–10 cases in a million people every year3. Debate on the best strategy for PHA management continues despite great progress in treating PAH in recent decades. Extensive researched have shown current therapeutic approaches in PAH included prostacyclin receptor agonists, phosphodiesterase-5 (PDE-5) inhibitors, endothelin receptor antagonists (ERAs) and a soluble guanylate cyclase activator targeting several crucial signaling pathways that predominantly regulate pulmonary vasculature4. However, PAH may also can be caused by many unknown causes, such as pulmonary vascular remodeling and inflammation, which cannot be well solved by current drug treatment and PAH is still a complicated incurable cardiopulmonary disease5. Thus, it is necessary for us to utilize bioinformatics technology to explore the pathogenesis or potential treatments of PAH.

At present, bioinformatics methods have been proverbially performed to analyze gene sequencing data of various diseases to ascertain differentially expressed genes (DEGs) and implement various analyses. And more and more robust databases and powerful online tools have been established to help our repositioning of known intricate mechanism of diseases6. Increasing researches applied microarray technology for the purpose of searching DEGs and their molecular functions (MFs), biological processes (BPs), or cellular components (CCs) as well as relatedregulatory pathways in specific disease state7. For example, Habib Rahman and his colleagues used bioinformatics and machine learning methods to determine new factors that improve the identification and characterization of glioblastoma tumors and their progression8. Based on a neighborhood-based benchmarking and multilayer network topology techniques, they also identified novel putative biomarkers which manifest how type 2 diabetes (T2D) interact9. In our study, as the flow chart shown in Fig. 1, the expression datasets GSE22356, GSE131793 and GSE168905 were analyzed to identify key genes employing comprehensive bioinformatics analysis technologies, consisting of GO and KEGG analysis, protein–protein interaction (PPI) network construction, immune infiltration analysis and hub gene identification. Relying on the GEO database and R language, the distinguishing infiltration of 22 immune cells in peripheral blood of PAH patients were compared with healthy cases. The study probably revealed the pathogenic mechanism and potential therapeutic target of PAH.

Figure 1
figure 1

Flow chart of the analysis program in this study.


Identification of DEGs in PAH

There are 580 DEGs screened from GSE22356 between PAH group with healthy group which consists of 162 up-regulated and 418 down-regulated genes (Fig. 2A,D). Simultaneously, heatmaps and volcano plots vividly showed that 156 up-regulated and 13 down-regulated DEGs were identified from GSE131793, 102 up-regulated and 456 down-regulated DEGs were identified from GSE168905 (Fig. 2B,C,E,F) (p value < 0.05 and |log2FC|≥ 0.5). Then, we merged up- regulated and down-regulated DEGs in the three datasets, which revealed a total of 403 up-regulated DEGs and 874 down-regulated DEGs in the microarrays.

Figure 2
figure 2

The identification of EDGs. (AC) The heat-maps of differential expression genes of GSE22356, GSE131793 and GSE168905 (P < 0.05, |log FC|> 1). Up-regulated genes were in red and down-regulated genes were in blue. (DF) Volcano plot of genes identified in PAH of GSE22356, GSE131793 and GSE168905 (P < 0.05, |log FC|> 0.5). Red dots represent up-regulated and green dots represent down-regulated.

Functional enrichment analysis of DEGs

For demonstrating PAH‐related functional annotation and pathway enrichment, these DEGs with an absolute value of log2 fold change greater than 1 were selected for GO and KEGG analysis and detailed results are presented in Tables 1 and 2. In GO analysis, the richest BP participated hydrogen peroxide catabolic process, positive regulation of leukocyte migration and homeostasis of number of cells (Fig. 3A). In the CC category, platelet alpha granule membrane, haptoglobin-hemoglobin complex and platelet alpha granule were mostly interrelated with PAH (Fig. 3B). Organic acid binding, chemokine binding and haptoglobin binding were most obviously enriched for DEGs in the MF category (Fig. 3C). As we all know, KEGG mapping is a predictive method via reestablishing molecular network systems from molecular constructing blocks grounded on functional orthologs10,11,12. In KEGG analysis, the result indicated that genes were mainly associated with hematopoietic cell lineage, TGF-beta signaling pathway and ECM-receptor interaction (Fig. 3D).

Table 1 The top 10 BPs, CCs, MFs of GO functional annotation analyses of DEGs.
Table 2 The top 12 KEGG enrichment pathway analysis of DEGs.
Figure 3
figure 3

GO and KEGG analysis of DEGs between PAH and healthy groups. (AC) The top 10 GO annotation results of DEGs. (D) KEGG pathway analyses of DEGs (top 10 according to enrichment score).

PPI network construction and hub genes identification

Uploading 47 DEGs into the STRING database, a PPI network with the desired interaction score > 0.48 was constructed and the network containing 33 nodes and 57 edges was visualized via Cytoscape software (Fig. 4A). Then the most remarkable module was recognized by cytoHubba plug-in according to the maximal clique centrality topology analysis methods (Fig. 4B). The top 10 significant nodes with the highest connectivity among the network were appraised as hub genes: SLC4A1, AHSP, ALAS2, CA1, HBD, SNCA, HBM, SELENBP1, SERPINE1, ITGA2B (Table 3). Additionally, the most significantly enriched BPs showed that hub genes were related to oxygen transport, erythrocyte development, hydrogen peroxide catabolic process bicarbonate transport. The changes of CCs containing hemoglobin complex, blood microparticle, haptoglobin-hemoglobin complex and platelet alpha granule membrane. MFs were mainly enriched in hemoglobin binding, haptoglobin binding, oxygen transporter activity and organic acid binding (Fig. 4C).

Figure 4
figure 4

(A) The PPI network of DEGs was established using string and Cytoscape, containing 33 nodes and 57 edges. The upregulated genes are red and downregulated are green. (B) The top 10 hub genes were detected from the PPI network. (C) GO terms of the top 10 key genes.

Table 3 Summary of the function of 10 key genes.

Analysis of TF-target regulating networks

To gain further insight into the regulatory TFs and selected DEGs, we used the iRegulon plugin to predict the TFs of DEGs. There were 48 nodes were predicted consisting of 42 DEGs and 6 TFs with an NES > 4 in this TF-DEGs network (TEAD4, TGIF2LY, GATA5, GATA1, GATA2, FOS) (Fig. 5). Specifically, the results show that 36 DEGs were forecasted as targets of TGIF2LY, 15 DEGs were forecasted as targets of TEAD4, GATA2 is predicted to have 17 DEG targets, GATA5 has 25 DEG targets, GATA1 has19 DEGs, and FOS has 23 DEGs13. Three predicted TFs belong to the family of GATA zinc finger TFs that bind to various motifs in the promoters of DEGs. It can be speculated that the TFs regulates the development and progression of PAH by activating or repressing the transcription of DEGs (details of transcription factors are shown in Table 4).

Figure 5
figure 5

The TF-DEGs network including 42 DEGs and 6 TFs with an NES > 4.

Table 4 Summary of the function of 6 TFs.

Immune cells infiltration analysis

Using CIBERSORT, the relative proportions of 22 infiltration immunocytes in PAH and control samples were determined (Fig. 6A) and details were presented in Table 3. Furthermore, the heatmap showed that the 22 subpopulations of immunocytes were significantly different between PAH and controls and it was found that monocytes occupied the largest proportion of immune cells (Fig. 6B). Correlation analysis discovered that macrophages M had the most obvious positive relationship with both T cells follicular helper and T cells regulatory (Tregs) (r = 0.9, r = 0.9), dendritic cells resting and mast cells activated were also highly positively correlated with r = 0.83, while the strongest negative correlation was between dendritic cell activation and NK cell quiescence with r = − 0.6 (Fig. 6C). And the trend analysis results of infiltration immune cells illustrated that PAH patients had higher infiltration of NK cell activation, monocyte, T cell CD4 memory activation, and mast cell than healthy controls and lower infiltration of T cell CD4 naive (Fig. 6D; p < 0.05).

Figure 6
figure 6

Summary of immune cell subpopulations between normal controls and PAH. (A) Composition of 22 kinds of immune cells subsets from GSE22356, GSE131793 and GSE168905 datasets. (B) Heatmap shows differences between the 22 subpopulations cell types and samples. (C) Correlation matrix of the 22 immunocyte proportions in healthy control and PAH samples. Red colors represent positive and blue colors represent negative correlations. (D) The difference of immune cells infiltration between 9 normal controls and 12 PAH (blue color represent normal controls group and red colors represent PAH group. P values < 0.05).


PAH is a rare chronic refractory syndrome accompanied with pathological changes of distal pulmonary arteries such as vasculature remodeling, lumen stenosis, laminar intimal proliferation and fibrosis, plexiform lesions with excessive poorly formed capillaries14. The current clinical definition of PAH is that the average pulmonary arterial pressure (m PAP) ≥ 25 mm Hg measured during right heart catheterization (1 mmHg = 0.133 kPa)15. PAH is divided into many subgroups due to different etiologies worldwide. According to the latest Chinese PAH registry, coronary heart disease-related PAH was reported as the most prevalent subtype, accounting for about 43% of all cases3. In addition to anticoagulant, calcium channel blockers16, diuretics, exercise rehabilitation17, oxygen therapy18 and other basic measures and conventional therapies to reduce right ventricular preload and improve left ventricular filling, the treatment of PAH also includes drugs targeting the prostacyclin pathway, endothelin pathway, nitric oxide-cGMP pathway like epoprostenol, iloprost, treprostinil, bosentan, ambrisentan, cediline, nafil, sildenafil, riociguat etc.5,19. Previous study has reported that tyrosine kinase inhibition (TKI) has an inhibitory effect on cell proliferation against pulmonary vascular remodeling20. Despite substantial efforts were made in PAH field in recent years, the pathogenesis of PAH remains to be further elucidated. Nearly decades, rapid advances in gene sequencing technologies and bioinformatics related science have made it possible to utilize abundant sequencing data for further analyzing. In this research, we identified 403 up-regulated and 874 down-regulated DEGs from peripheral blood samples belong to 29 controls and 32 PAH patients whose expression profiles downloaded via GEO database. Besides, PPI network construction and functional enrichment analyses were performed to filter key genes and important pathways. GO annotation results manifested that DEGs were enriched in extracellular exosome, integral component of plasma membrane, platelet degranulation, blood microparticle, extracellular space, extracellular region, immune response, cell surface and protein binding. KEGG pathway results declared that DEGs were mainly mapped to hematopoietic cell lineage, cytokine-cytokine receptor interaction, transcription growth factor beta (TGF-β) signaling pathway, extracellular mechanism (ECM) receptor interaction and focal adhesion.

Establishing PPI network is friendly for researchers to investigate the underlying molecular mechanism of PAH for the reason that the DEGs would be grouped and ordered in the network judging by their interactions21. Of which, 10 hub genes (SLC4A1, AHSP, ALAS2, CA1, HBD, SNCA, HBM, SELENBP1, SERPINE1, ITGA2B) related to PAH were detected owing to Cytoscape. It is speculated that these significant DEGs theoretically lead to the occurrence and progression of PAH. SLC4A1, solute carrier family 4 member 1, a member of the anion exchanger (AE) family, expressed mainly in the erythrocyte plasma membrane, functions as a chloride/bicarbonate exchanger in the erythrocyte plasma membrane associated with the transport of carbon dioxide from tissues to the lungs22. Recently a genome-wide association study revealed that candidate genes including SLC4A1 known to regulate structure of erythrocyte, metabolic process and ion channels23. Kaneda et al. used a lung cancer (LC) model mice to examine the proteome of cancer cell-secreted extracellular vesicles (cEVs) and found that SLC4A1 was co-expressed in CHL1-expressing EVs24. AHSP, Alpha-hemoglobin-stabilizing protein, is reported as a molecular chaperone, which peculiarly attaches to free α-globin and participates in hemoglobin assembly. This protein links to monomeric alpha-globin while it is transferred to beta-globin to compose a heterodimer, which then conversely unites to another heterodimer to shape a sturdy tetrameric hemoglobin25. During normal erythroid development, this protein functions as a chaperone to guard against deleterious aggregation of alpha-hemoglobin, particularly, to defend free alpha-hemoglobin from precipitation. AHSP regulates pathological conditions with excess alpha-hemoglobin like beta-thalassemia26. Protein 5′-Aminolevulinate Synthase 2 of ALAS2 is an erythrocyte-specific mitochondrial localization enzyme and that step one of the heme biosynthetic pathway is catalyzed by this product27. Blemishes on ALAS2 will lead to the development of X-linked pyridoxine-responsive sideroblastic anemia28.

The evidence presented thus far supports the idea that carbonic anhydrases (CAs) are zinc metalloenzymes mainly involved in catalyzing the reversible hydration of carbon dioxide29. CAs are involved in diverse biological processes, consisting of respiration, calcification, acid–base balance, bone resorption and so on. Studies have confirmed that CA1 gene is tightly related to the CA2 and CA3 genes on chromosome 8 and encrypts the highest levels of cytosolic proteins found in red blood cells30. Previous studies said the HBD gene was normally expressed in adults and encodes a delta chain that two delta chains add two alpha chains make up HbA-2, then together with HbF makes up adult hemoglobin. The research to date has tended to display that mutation in the HBD gene give rise to hemoglobin leprosy-beta-thalassemia syndrome and fetal hemoglobin quantitative trait loci31. The HBM gene has an open reading frame (ORF) that enciphers a 141 amino acid polypeptide resemble the delta globin unearthed in reptiles and birds. HBD-related pathways include factors involved in megakaryocyte development and thrombopoiesis and responses to elevated platelet cytoplasmic Ca2+. Alpha-synuclein, encoded by SNCA, belongs to the synuclein family including beta- and gamma-synuclein as well. Synuclein is plentifully expressed in the cerebrum, α- and β-synuclein selectively inhibits phospholipase D2. Protein SNCA can be used to analyze and integrate presynaptic semaphore and membrane transportation. It has been reported that SNCA peptides are the governing elements of amyloid patches in the brains of victims suffering from Alzheimer32. SERPINE1 gene a serine protease that is classified to the serine protease inhibitor (serpin) superfamily and a major inhibitor for tissue plasminogen activator (tPA) and urokinase (uPA). In addition, it is capable of congenial antiviral immunity. Studies have shown that SERPINE1 deficiency is responsible for plasminogen activator inhibitor 1 deficiency (PAI-1), and the high consistence proteins linked to thrombophilia33. Selenium Binding Protein 1, encoded by SELENBP1, pertain to the selenium-binding protein family. As we all known selenium is an indispensable trace element that possesses formidable anticarcinogenic characters. Researchers have proved that lack of selenium probably results in some certain neurological diseases. The role of selenium in the precaution of carcinoma and neurological diseases perchance is regulated through selenium-united proteins, and alleviated expression of SELENBP1 presumably is relevant to a few kinds of cancer34. This protein might selenium-dependently function in degrading ubiquitination/deubiquitination-regulated protein. SELENBP1 relational disorders include Extraoral Halitosis due to methanethiol oxidase deficiency and methionine adenosyltransferase deficiency35. Integrin Subunit Alpha 2b (ITGA2B) is geared to integrin alpha chain proteins family. The preproprotein undergoes proteolytic processing to produce light and heavy chains that binding via disulfide bonds to constitute subunits of the alpha-IIb/beta-3 integrin cell adhesion receptor. This receptor plays a vital character in coagulation system via attracting platelet gathering. Platelet-type bleeding disorders such as Glanzmann thrombasthenia are caused by mutations in ITGA2B manifesting the inability of platelets to aggregate36.

Furthermore, our current analysis also identified some TFs (TEAD4, TGIF2LY, GATA5, GATA1, GATA2, FOS) related to PAH, insinuating that these DEGs serve an important function in PAH. Next, based on current relevant studies we take a closer look at the association between PAHs and identified TFs. TEAD4, TEA domain transcription factor 4, is important component in Hippo signaling pathway which is involved in controlling organ growth and inhibiting tumor through limiting cell proliferation and facilitating apoptosis. TEAD4 mediates cell proliferation, migration, and epithelial-mesenchymal transition (EMT) by mediating the gene expression of YAP1 and WWTR1/TAZ. Previous study showed that Luteolin ameliorates reconstitution of pulmonary blood vessels and RV hypertrophy on rats PAH model and inhibited smooth muscle cells proliferation and migration at a dose-dependent manner through inhibition mediated by HIPPO-YAP/PI3K/AKT pathway37. TGIF2LY, belongs to the TALE/TGIF homeobox family of transcription factors, is located in the specific region of chromosome Y, in a large stretch of sequence which is considered as a X-to-Y transposition. It is possible to exert transcriptional regulation in testis and become a competitor or regulator of gene TGIF2LX. There are no studies to date that TGIF2LY is associated with PAH yet. The GATA family of transcription factors members GATA1, GATA2and GATA5 are associated with the extent of amino acid sequence identity in DNA zinc finger binding domains and they all have the capability to link the GATA sequence. Considering reprogramming function of GATA1/2/5, the three GATA family members are inducers of pluripotency reprogramming. Specifically, as a transcriptional activator or repressor, GATA1 possibly play a general switch role during erythroid development. GATA1 combines to the DNA sites with the common sequence 5′-GATA-3′ inside modulatory districts of globin genes and others genes of red blood cells and motivates genes transcription concerned with erythroid differentiation of K562 erythroleukemia cells, containing HBB, HBG1/2, ALAS2 and HMBS. GATA2 is a transcription factor participating in stem cell sustain, hematopoiesis and mediating endothelin-1 gene expression of endothelial cells. GATA5 is an indispensable TF during cardiovascular development38 and transcriptional program, which is the basis of diversity of smooth muscle cell. It’s reported that GATA5 united in the CEF-1 nucleoprotein binding domain of cardiac-specific slow/cardiac troponin C transcriptional enhance39. FOS acts as a nuclear phosphoprotein that composes a close but non-covalently connected complex with JUN/AP-1 transcription factor. Both FOS and the JUN/AP-1 basic region appear to interact with symmetrical DNA half-sites in heterodimers. Multimeric SMAD3/SMAD4/JUN/FOS complexes are formed at the AP1/SMAD binding site during TGF-β activation to mediate TGF-β signaling. It has been documented that FOS critically functions in modifying the formation and maintenance of skeletal cell40 and it is believed FOS plays an significant role in signal switching, cellular proliferation and differentiation.

For further investigating functions of immune cells in PAH, CIBERSORT was performed for immune infiltration assays. Results displayed that NK cell activation, monocyte, T cell CD4 memory activation and mast cell infiltration were increased and T cell CD4 naive infiltration was decreased in peripheral blood of PAH patients contrasted with the control group, which may be related to the development and exacerbation of PAH. It is noteworthy that that it was monocyte that was the highest proportion in PAH peripheral blood. Previous researches showed that for coping with hypoxia, pulmonary monocytes could perceive hypoxia, permeate pulmonary arterioles, and accelerate vascular reforming, which suggested that the monocyte lineage functions directly in the morbidity of PAH41. NK cells activation is able to identify and eliminate cells infected with viruses or tumor lesions and degranulate perforin-containing granules and granzymes with the participation of FasL and TRAIL molecules42. Ormiston et al. demonstrated that NK cells in PAH patients generated a greater amount of matrix metalloproteinase 9, leading to deleterious effects on pulmonary vascular remodeling and functional impairment43. At most concomitant states, PAH is bound up with CD4 T cells defects and related to HIV or HHV-8 infection and some experimental studies have displayed that the application of NK cells and T cells suppressed the development of PAH44. In a clinical study, mast cells were abundantly found in PAH by quantitatively measuring the number and activity of mast cells in blood and urine among 44 PAH patients and 29 healthy individuals45. Besides, mast cells were discerned as connective tissue types expressing tryptase and chymotrypsin46. The above findings indicated that mast cells were conducive to the vascular pathophysiology of PAH, which is consistent with the findings of our study.


In summary, results in present study manifested that the development of pulmonary hypertension probably is the result of imbalance between pulmonary vascular remodeling and immune micro environment. We discovered that genes SLC4A1, AHSP, ALAS2, CA1, HBD, SNCA, HBM, SELENBP1, SERPINE1, ITGA2B are the most notable markers of PAH. In addition, (TEAD4, TGIF2LY, GATA5, GATA1, GATA2, FOS) are predicted TFs regulating DEGs and may be regarded as potential targets for the prevention of pulmonary vascular restructure. Regulatory NK cell activation, monocyte, T cell CD4 memory activation and mast cell were detected as the most obvious immune cells infiltrating the peripheral blood of PAH. Since a large number of immune cells are clearly changed in PAH, although it is not clear how the immune system affects vascular remodeling, it can be conjectured that immune system goes a long way in this pathological process. The inadequacies of this study are that the insufficient amount of data may lead to results bias and yet we did not carry out trials to verify the results obtained. Further experiments on immune cells can identify targets to perfect immuno-modulatory therapy for PAH patients. To a certain extent, it could be interpreted like this, key genes and TFs are tightly related to the occurrence of PAH and regarded as underlying targets for therapeutic makers as well, however, it is necessary to explore the specific mechanism using some animal and cell experiments.

Materials and methods

Raw data acquisition and preprocessing

In the study, three gene expression profiles GSE22356, GSE131793 and GSE168905 were obtained from GEO (Home–GEO–NCBI ( database. GSE22356 was based on the GPL570 platform including 10 normal samples and 10 PAH samples. GSE131793 was based on the GPL6244 platform involving 10 healthy samples and 10 PAH samples. GSE168905 was based on the GPL16791 platform containing 9 control samples and 12 PAH samples. Raw data of these profiles were annotated according to their respective platform files following by probes switched into gene symbols, respectively47. The software Perl and R (version 4.0.2) were employed for data preprocessing.

Identification of DEGs

For the purpose of identifying the significant DEGs between the PAH and the healthy group, Limma (linear models for microarray data) package was adopted processing three datasets with |log2 fold change (FC)|≥ 1 and adjusted p value < 0.05 as the thresholds21. We obtained the DEGs of the three datasets respectively. Additionally, volcano plots and heatmaps were generated to evaluate these selected DEGs.

Functional and pathway enrichment analysis

To assess the biological functions of the total DEGs, we combined the DEGs for gene ontology (GO) annotation and Kyoto encyclopedia of genes and genomes (KEGG) pathway analysis applying online website DAVID (, which classifies gene functions into BP, CC and MF48. Then use R to generate bubble and bar charts to visualize the obtained results. P value < 0.05 was defined statistically significant.

Protein–protein interaction (PPI) network and hub gene identification

PPI network among all DEGs was established based on an online tool STRING ( and then software Cytoscape (version 3.7.2) was employed to adjust and visualize PPI networks49. Subsequently, we utilized a plug-in Cytohubba of Cytoscape to determine top 10 hub genes according to the MCC algorithms. All key genes’ scores were over 15 degrees.

Construction of TF-DEG regulation network

For searching potential interactions between DEGs and transcription factors (TFs), another plugin iRegulon in Cytoscape used for the identification of DEG-targeted TFs. The enriched motifs in iRegulon were ranked depending on the direct targets by means of position weight matrix50. The TF-DEG crosstalk pairs were gained from databases such as TRANSFAC, TRED and so on. Then the TF-DEG regulatory network was visualized by Cytoscape.

Immune cell infiltration analysis

Numerous studies have shown that PAH has a certain relationship with the immune micro environment, so we performed this analysis. The corrected gene matrix was uploaded to CIBERSORT website (CIBERSORT (, a powerful online immune assay calculation tool, via a deconvolution algorithm to calculate the percentages of 22 kinds of infiltrating immune cells with P < 0.05 as standard51. Hereafter, the results of deduced immunocytes infiltration analysis were visualized by R language in different manners (Spplemetary Table 1).