Biological heterogeneity in idiopathic pulmonary arterial hypertension identified through unsupervised transcriptomic profiling of whole blood

Idiopathic pulmonary arterial hypertension (IPAH) is a rare but fatal disease diagnosed by right heart catheterisation and the exclusion of other forms of pulmonary arterial hypertension, producing a heterogeneous population with varied treatment response. Here we show unsupervised machine learning identification of three major patient subgroups that account for 92% of the cohort, each with unique whole blood transcriptomic and clinical feature signatures. These subgroups are associated with poor, moderate, and good prognosis. The poor prognosis subgroup is associated with upregulation of the ALAS2 and downregulation of several immunoglobulin genes, while the good prognosis subgroup is defined by upregulation of the bone morphogenetic protein signalling regulator NOG, and the C/C variant of HLA-DPA1/DPB1 (independently associated with survival). These findings independently validated provide evidence for the existence of 3 major subgroups (endophenotypes) within the IPAH classification, could improve risk stratification and provide molecular insights into the pathogenesis of IPAH.

the lowest percentage missingness at 19.10%. For cohort visit 1, Glasgow again showed the highest rate of missingness at 71.96%, while Papworth had the lowest at 39.10%. Imperial and Hammersmith accounted for 33.43% of patients in the study, while the other centres contributed between 5.57% and 18.10% of participants, thus inflating missingness statistics. Centre specific differences also exist, with Sheffield using the shuttle test to assess exercise performance in the place of 6MWD. Specific values for missingness calculated for classifier variables only can be seen in Supplementary Figure 12, with NT-proBNP exhibiting the highest missingness and age at the time of diagnosis with the lowest missingness, for both diagnosis and cohort visit 1 datasets.

Sample and gene selection preprocessing
The initial gene expression dataset consisted of 508 samples, both patients and controls. A number of samples had to be filtered out to ensure a high quality, unskewed input sample set for all subsequent clustering runs. Initially, the first occurrences of three samples were removed from the RNA-seq matrix as they were repeated samples from the same visit of the corresponding patient/control with identical gene expression values. Moreover, 11 patients were excluded as their diagnosis of PAH was not of the idiopathic form and an additional eleven as they were diagnosed with a different form of PH (Pulmonary veno-occlusive disease). Finally, 10 relatives of IPAH patients were excluded due to potentially sharing underlying genetic characteristics with the corresponding IPAH samples.
As part of the preprocessing for the clustering two gene filtering steps were implemented. Firstly, only genes that have more than two reads (in a transcript level) in at least 95% of control and patient samples were considered. This step reduces the number of genes from 60,144 (that occurred after the transcript transformation) to 25,966. Additionally, 11 male genes were removed as they were driving all following clusterings forming subgroups composed entirely by male or female samples.
After selecting for genes and samples a patient gene expression dataset of 385 samples and 25955 genes was generated. The dataset was used to determine the most fitted clustering algorithm in terms of robustness and partition consistency, estimate the optimal number of subgroups and as input for the gene filtering step.
A large number of clinical variables, 121 from the original clinical file and an additional 760 from Open-Clinica study database, were uniformly measured at the different medical and research centers that provided samples as described in the Cohort / Study design section. The date related clinical entries were updated according to the census date of 30.01.2019. Specifically, the date of sampling (defined as the first visit to the corresponding medical center when the blood sample was taken) was composed of additional data on patient samples and healthy volunteers from Imperial and Sheffield local databases. They were mapped to the sample ids to be used either by the clustering algorithms or the downstream biological analyses. Samples with multiple trials were flagged and technical repeats were excluded from the sample pool. Samples with multiple visits (where blood was collected on different dates) were dated to retain the chronological relation between them as useful longitudinal information for the biological analyses.

Feature selection of genes
Next generation sequencing methods measure the expression of thousands of genes providing a huge number of dimensions (>20,000) per sample. However, diseases are usually regulated by smaller groups of genes rather than thousands, as reviewed for PAH in (Ma and Chung 2017). Therefore, the majority of genes are expected not to contribute to PAH development. Additionally, gene filtering helps in reducing the computational burden of most clustering algorithms also affecting their performance by removing misdirecting noise, as shown in (Rodriguez et al. 2019).
Investigating IPAH subgroups concerns the structure underneath the idiopathic form of the disease and is characterised by the complete lack of sample labels. Standard feature selection methods, used for RNAsequencing data, can not be used in this case as they require labels, e.g. in (Rodriguez et al. 2019; Wenric and Shemirani 2018) where they use disease and control labelling. Therefore, to investigate disease subgroups we ranked all genes based on the variability of their expression across patient samples, as expression variance indicates interesting gene behaviour in our disease context. Each gene was scored according to its variability across the 385 patients using the var() function from the Stats R package. Subsequently, all 25,955 genes were ranked based on that score. To generate the candidate gene sets for the determination of the most stable one, multiple subsets of the top ranked genes were extracted, each time increasing the size by 100.
Highest stability gene set In our case of feature selection we needed to select one of the gene sets to base the clustering on. Since there are no known IPAH-related genes in the literature we moved forward with the gene sets (one per pipeline) that generated the subgroups of highest stability. We used the established clusterboot function from the fpc R package. The function assesses the clusterwise stability by resampling the data in a bootstrap approach. Then it computes the Jaccard similarities of the original subgroups to the most similar subgroups in the resampled data. Spectral clustering (kernlab R package), 50 resampling runs, k between 2 and 6 and a seed of 28588 for reproducibility purposes was used. We generated multiple gene sets starting from the top 100 ranked genes and increasing the size of the gene set by 100 until we reach the total number of 25,955 genes.

Clustering algorithm selection
There is a wealth of methods in the unsupervised learning field that are appropriate for certain data types. Most studies employ widely-used methods (e.g. hierarchical clustering) without utilising any kind of selection method that would point towards a certain effective methodology. In this study we aimed to examine a group of diverse algorithms that cover different clustering approaches. Since we lacked labels, and thus a performance measure, we compared the partitioning consistency of the different approaches on the expression data. As good consistency we defined high agreement and low standard deviation calculated between different variations of a clustering algorithm. When two different clustering runs agree on the partitioning of the samples they show robustness since they do not randomly assign samples to subgroups but rather are driven by the underlying structure of the data. (Supplementary Table 6) presents the various algorithms used, along with various distance measures and clustering categories they belong to.
The preprocessed RNA-seq dataset and k2,10 were used for the determination of the algorithm agreements. Within each pair of clustering runs the agreement was calculated using the adjusted Rand Index, the correctedfor-chance version of the original Rand index (Rand 1971), which is based on the number of times any pair of points is partitioned in the same subgroup throughout different clusterings. To calculate the intra-agreement of each clustering algorithm (spectral, k-means, hierarchical) we considered only pairs in which both runs were based on the same algorithm. For those pairs the agreement was averaged across clustering runs and ks. In a similar way, the standard deviation per algorithm was calculated.
Optimal number of subgroups k Estimating the actual number of subgroups is a key decision for every clustering algorithm and is usually based on field knowledge, specific study decisions or statistical indexes. Due to the lack of prior knowledge about the structure of IPAH we are unable to set the number of subgroups based on literature or biology. Since this is an exploratory study (which does not include a predetermined number of subgroups) we are utilizing various indexes depending on the questions we are addressing.
For the case of the IPAH subgroups, we are using ensemble learning, the process where multiple indexes (or experts) participate in the selection of the optimal value of a machine learning decision 2 . Specifically, since we can not use any class related labels, we estimate the optimal number of subgroups using voting among 15 internal indexes that evaluate the compactness and/or the distance between different subgroups (see Supplementary methods: Internal Index Voting).
Determining the optimal number of clusters (k) is an inherently difficult task in unsupervised machine learning as it is always an educated estimation, since we do know the actual number of categories within our data. Indeed some of the 14 used indexes are bound to not work on our data type (RNA-seq) and that is why we used an ensemble/voting method to estimate k (supplementary section Internal index voting) since we cannot base our estimation on any one index. The voting result (Supplementary Data 1) showed the clear majority of indexes to favour up to 5 clusters, with a preference to 2 clusters. The most important aspect in selecting the number of clusters in a data set is retaining as much information as possible, therefore selecting the highest supported k minimizes information lost. Following that notion, we retained the highest voted k = 5, where we discovered 3 distinct adequate sized clusters and 2 small clusters that, despite their interesting gene expression profiles (Figure 2A), were unable to show any statistical significance in follow-up work due to their small size. In Supplementary Figure 16, we demonstrate the flow of patients between clusterings along with the cluster sizes and the proportion/count of transferring patients across k. The colored nodes represent our subgroups I, II,III, IV and V. According to the clustering tree, the 3 main subgroups I, II and V remain clustered together when k = 2 (in a 341 sized cluster) and k = 3 (in a 295 sized cluster). This indicates that for k < 5 we are missing the information that separates these 3 distinct subgroups. The two smaller subgroups (III and IV) mostly originate from a group of patients (circled in green) that dissociates early on from main subgroups I, II and V implying that these samples show some differences even when less subgroups are requested. The remaining samples that end up in subgroup III have a common parent with subgroup II.

Gene signatures of subgroups
A number of biological analyses are used to explore IPAH sub-structure. For the patients of each subgroup survival (Kaplan-Meier curves), response to vasodilators (IPAH treatment), gender and functional class (Fisher's exact test) are calculated and compared while measuring significance. Additionally, the difference in the age of diagnosis (one-way anova test) and a number of known PAH-related genes are examined across subgroups. We perform a driver gene discovery analysis (LASSO regression model) which can indicate the most influential genes whose literature-generated annotations are investigated. The results of the patient clustering are used to draw genetic differences between the two groups. IPAH subgroups are subjected to differential expression analysis and subsequently to pathway analysis in order to interpret the genes' involvement in terms of functionality. The p values of each gene when considering the fold change between subgroups I and II were calculated using a Welch Two Sample t-test on the raw values and presented in the Supplementary Table 4. The absolute lower cut off values of fold change (log2 scaled) is 0.28.

Internal index voting
To implement ensemble learning, we used the majority voting rule among 15 internal validation indexes, selecting as the optimal number of subgroups (k) the one voted by the most indexes. The various indexes 6-17 , (McClain and Rao 1975) and (Ray and Turi 1999) results can be found in Supplementary Data 1. All of the above are based on variations of the same idea, to score a partitioning on how compact each subgroup is and how well the subgroups separate. No index can select the real number of subgroups with perfect accuracy, therefore the "voting of experts" method can provide a safer alternative to using any one index. The preprocessed RNAseq dataset after the appropriate gene filtering step was used for this work.

Clinical variable associations / classification
Survival analysis using a Kaplan-Meier estimate 18 was undertaken to compare the time until death between patient subgroups, a measure able to overcome issues such as subjects withdrawing from the study or not experiencing death during the course of the study's observations. In this cohort withdrawal etiologies include withdrawal by clinician, transplant, leaving the country, and loss to follow up. As the clinical data for the patients did not include a cause of death, it was decided to limit the duration of the analysis to minimise the inclusion of other causes of death. A duration of 10 years from diagnosis was selected as it is a period during which almost 80% of IPAH deaths occur if conventional therapy is used (Kang et al. 2014), while also being sufficient time to allow for useful statistical analysis. Patients who did not die during this period were considered to be alive for the analysis and had their survival time set to 10 years. The Kaplan-Meier model was created using the survival R package to compare survival time between the subgroups, and subsequently plotted using survminer R package (ggsurvplot function). Cox Regression was undertaken to show any statistically significant survival differences between the subgroups. A Cox model was selected as it has been shown to be a more flexible alternative to parametric methods, and does not require the distribution of survival times to be stated (Bradburn et al. 2003). It was noted that patient survival could be affected by factors other than their subgroup membership. Therefore, survival analysis was repeated using a multivariate Cox regression which included the patients' age at diagnosis, sex, and New York Heart Association (NYHA) functional class. This method allows for adjustment to the impact of these other factors, and shows an estimate of their respective strength of effect.
A frequency table was created for vasoresponders, gender and each functional class within each patient subgroup. Pairwise comparisons were made between the subgroups using Fisher's exact test. This test was performed using the rcompanion R package and a Bonferroni correction was used as multiple comparisons were made.
A comparison of the age at IPAH diagnosis was made between the patient subgroups using a one-way anova test. This test assumes that the observations within each subgroup are normally distributed, and that the data are homoscedastic with equal standard deviation between the subgroups. The normality of the data within each subgroup was confirmed visually by producing a histogram for the age of diagnosis for each subgroup, as well as by plotting a histogram of the residuals for the anova model to ensure that these followed an approximately normal distribution, and creating a Q-Q plot of the residual values. The homoscedasticity was assessed by plotting the residual values against the fitted value, and by using the car R package to perform a Levene's test.
A regression model was used for feature selection of genes whose expression most significantly drove subgroup membership. The RNA-seq counts were split into a training and testing set with a ratio of 70:30. The glmnet R package uses penalised maximum likelihood in order to fit a regression. The model family was set to "multinomial" as the model was to be used to predict a nominal dependent variable with multiple categories, subgroup membership, given gene expression data. As an additional parameter, the type.multinomial was set to "grouped", meaning that multinomial coefficients for a variable were included or excluded together. This also demonstrated an increase in model prediction accuracy during initial testing of models. Ridge, elastic-net, and lasso models were created using the training set, their parameters optimised by cross-validation (cv.glmnet function), and then used to predict the subgroup membership of samples in the test data-set. The models run on the entire data-set to produce coefficients for each gene relating to each subgroup. The elastic-net regression model was preferred based on its ability to select strongly correlated variables in or out together, a useful feature when dealing with genes which may be correlated due to sharing a biological pathway (Zou and Hastie 2005). From the regression results heat maps were produced showing the coefficients for genes in each subgroup. Genes in the top 5% for largest coefficients were selected for further investigation to identify common pathways or functions. It was decided to investigate genes with both positive and negative coefficients, as genes with decreased gene expression which drove subgroup membership are still of biological interest.
The pathfinder R package was used to demonstrate enriched pathways between subgroups. Genes with an absolute LFC in the highest 10% for their subgroup, and with an adjusted p value ≤ 0.05, were inputted to the package.

Identifying clinical signatures of RNA subgroups
The following clinical variables were used during the clinical signature pipeline: age_diagnosis, sex, diagnosis, drug_exposure, bmi, functional_class, 6 minute walking distance, oxygen saturation (pre), oxygen saturation (post), mRAP, mPAP, mPAWP, cardiac output, SvO2, vasoresponse (lenient), vasoresponse (stringent), FEV1, FVC, TLC, KCO Ensemble feature selection 19 based on recursive feature elimination (RFE) 20,21 and a linear SVM 22 as the estimator, was used to ensure robust identification of the smallest set of clinical features (signature), from the clinical features, that best describe each subgroup. RFE feature ranking was based on absolute weights of features from the SVM, which quantifies the contribution or importance of each feature towards the multivariate construction of a hyperplane separating the subgroups. The regularisation parameter of the SVMs was set to C=1. The discovery dataset used for feature selection was resampled without replacement into 500 subsamples (90% of samples), for each subgroup over signature sizes (s) ranging between s=1 to 20. Each resampled dataset was further divided into bootstrap samples (k), with k=50. Feature values of each bootstrap sample were normalised to improve feature selection performance. RFE-SVM 20 was used to rank features for each bootstrap sample, and an aggregate rank was calculated for each feature using all k bootstrap rankings. This process was repeated over all resampled datasets, resulting in 500 candidate signatures for each signature size, s. Each candidate signature was then used to develop a classification model, which was then trained on the discovery dataset to discriminate between a given subgroup from all other subgroups. Classification models were built using support vector machines (SVM) 23 , random forest (RF) 24 , logistic regression (LR) 25 , and knearest neighbour (KNN) 26 .
LR was implemented using sklearn.linear_model.LogisticRegression using l2 penalty and default values used for all other parameters. SVM was implemented using sklearn.svm.LinearSVC with regularisation parameter C set to 1, and default values used for all other parameters. RF was implemented using sklearn.ensemble.RandomForestClassifier with default values used for all other parameters. kNN was implemented using sklearn.neighbors.KNeighborsClassifier with a number of neighbours (n_neighbors) set to 5 and default values used for all other parameters.
For feature selection tasks, we used ensemble feature selection based on recursive feature elimination (RFE) technique. RFE is a backward feature elimination technique that iteratively prunes the least informative feature(s) from a training dataset. A RFE based on a linear SVM starts by using all features to train an SVM model and ranks all features according to importance. The least ranked feature is removed from the training dataset and the SVM model refitted. This is iteratively done until only the required number of features remain. All features are also ranked according to importance.
Ensemble feature selection aggregates several feature rankings into a single consensus feature ranking to ensure robustness of the feature selection process and of selected features. Feature importance measures used for feature ranking are based on the hyperplane weight vector of a linear support vector machine (SVM). The weight vector quantifies the contribution of each feature to the construction of the hyperplane, and is used for ranking features according to importance.
Ensemble feature selection56 based on recursive feature elimination (RFE)57,58 and a linear SVM59 as the estimator, was used to ensure robust identification of the smallest set of clinical features (signature) that best describe each subgroup. RFE feature ranking was based on absolute weights of features from the SVM, which quantifies the contribution or importance of each feature towards the multivariate construction of a hyperplane separating the subgroups. The regularisation parameter of the SVMs was set to C=1. The discovery dataset used for feature selection was resampled without replacement into 500 subsamples (90% of samples), for each subgroup over signature sizes (s) ranging between s=1 to 20. Each resampled dataset was further divided into bootstrap samples (k), with k=50. Feature values of each bootstrap sample were normalised to improve feature selection performance. RFE-SVM57 was used to rank features for each bootstrap sample, and an aggregate rank was calculated for each feature using all k bootstrap rankings. The set of feature rankings R, aggregated over all bootstrap samples, is calculated as, where k is the number of bootstrap samples, N is the total number of features in the dataset, and is the rank of feature n in bootstrap sample i. This process was repeated over all resampled datasets, resulting in 500 candidate signatures for each signature size, s.
The candidate signature that obtained the best performance was selected. This process was repeated for all signature sizes, s=1 to s=20, for subgroups I, II and v. A final signature for each subgroup was selected based on a compromise between the fewest number of features (s=1 to s=20) and classification performance. Final selected signatures for each of the subgroups were pooled to create a composite signature, which was then used to develop a multi-class classification model. The model was trained on the discovery dataset to discriminate between subgroups I, II and V, used to predict subgroup membership of an unseen validation dataset. The predicted subgroup membership was then used to calculate survival of predicted subgroups. Survival of the predicted subgroups was compared to known survival of subgroups in the discovery dataset for validation purposes.

Differential expression analysis
Differential expression analysis was performed between patients' subgroups. The raw un-normalised counts which were the output of the Salmon quantification were used. The DESeqDataSetFromTximport function was used to create an input data-set for DESeq2 which included the raw count data, and the subgroup membership for each sample. Rows with fewer than a total of 10 counts were excluded in order to decrease computing time.
Utilizing the apeglm R package the log fold change (LFC) shrinkage was performed on the results in order to reduce noise from genes with low counts, while retaining genes with large fold changes. This is an alternative to introducing filtering thresholds or pseudocounts which have disadvantages such as resulting in the loss of genes with true expression differences. This method was used to create pairwise comparisons of gene LFC between the control subgroup and each patient subgroup. Using the LFC data, genes with a log2fold change of greater than +/-1.5 were selected, and ranked by their p-value. The LFC data used ensembl92 gene IDs, so the biomaRt R package was used to add HUGO Gene Nomenclature Committee (HGNC) gene names and a brief description to allow for easier identification of genes of interest.

Secondary clustering analysis with healthy volunteers
All RNA-seq samples (508) are utilised along with the genes that provide the most information when attempting to predict whether each sample belongs to a patient or a healthy volunteer. We used the patient and control labels as ground truth. Utilizing the labels, we ranked the genes according to the amount of information they contribute in distinguishing the two classes. To determine the amount of information each gene contributed towards separating IPAH and healthy samples we used the Information Gain Criterion from the Biocomb R package. The 25,955 genes were scored and ranked. To generate the candidate gene sets for the determination of the most stable one, multiple subsets of the top ranked genes were extracted, each time increasing the size by 50. While investigating the differences between disease and healthy samples we can utilise the only ground truth we have established, the partitioning of the samples into patient and control groups. This knowledge enables the selection of the number of subgroups (k) to be based on the average subgroup purity of each k and compare them to select the k with the highest average purity.

Additional clustering pipeline of IPAH patients
To estimate the impact the 33 HPAH patients had on our main clustering pipeline, we examined their distribution across subgroups and ran an additional clustering pipeline including exclusively the 313 IPAH samples. As in the main pipeline, we utilised spectral clustering with the rbfdot kernel, the 300 most variable genes and identical preprocessing (section Sample and gene selection preprocessing).

Supplementary Tables
Supplementary Table 1 Table 5 | Correlations between discovery, validation and external validation cohorts. In the discovery set, Spearman correlations were calculated between RNA-req TPM values, the validation set between negative delta Cts values with GAPDH used as the endogenous control gene. Green cells denote agreement in the directionality of gene-clinical variable correlations between the validation sets and the discovery set. Red cells denote disagreement/opposite correlation between the two data sets. Notation of (**) denotes a p-value of less than 0.01, (***) denotes a p-value of less than 0.001 (using asymptotic approximated p-value by using the t distribution , while no stars denote non-significant p-values.  Supplementary Figure 5: Hazard Ratio of discovery cohort clustering adjusted for gender and age category of patients. Notation of (**) denotes a two-sided log rank test p-value of less than 0.01, (***) denotes a pvalue of less than 0.001, while no stars denote non-significant p-values. Data are presented as median values and error bars as 95% confidence intervals. Gender did not reveal any relationship with survival while an age over 52 was significantly associated with poor survival (HR=2.29). The most significant association with poor survival was found for patients classified in subgroup I.