Development and validation of a 1-year survival prognosis estimation model for Amyotrophic Lateral Sclerosis using manifold learning algorithm UMAP

Amyotrophic Lateral Sclerosis (ALS) is an inexorably progressive neurodegenerative condition with no effective disease modifying therapies. The development and validation of reliable prognostic models is a recognised research priority. We present a prognostic model for survival in ALS where result uncertainty is taken into account. Patient data were reduced and projected onto a 2D space using Uniform Manifold Approximation and Projection (UMAP), a novel non-linear dimension reduction technique. Information from 5,220 patients was included as development data originating from past clinical trials, and real-world population data as validation data. Predictors included age, gender, region of onset, symptom duration, weight at baseline, functional impairment, and estimated rate of functional loss. UMAP projection of patients shows an informative 2D data distribution. As limited data availability precluded complex model designs, the projection was divided into three zones with relevant survival rates. These rates were defined using confidence bounds: high, intermediate, and low 1-year survival rates at respectively \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$90\%$$\end{document}90% (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pm 4\%$$\end{document}±4%), \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$80\%$$\end{document}80% (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pm 4\%$$\end{document}±4%) and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$58\%$$\end{document}58% (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pm 4\%$$\end{document}±4%). Predicted 1-year survival was estimated using zone membership. This approach requires a limited set of features, is easily updated, improves with additional patient data, and accounts for results uncertainty.

one observational study 3 . Included clinical trials were conducted between 1990 and 2010 and lasted on average 12 months. Core patient data included clinical and biological lab results data. In December 2015, data from 5 clinical trials were added to PRO-ACT without additional information, totalling 22 different clinical trial sources and 10,723 patients. Patient age ranged from 18 to 88, 82% were on Riluzole therapy (with variable dosage), but no El Escorial categorisation or individual vital capacity values were made available.
The fourth dataset is population-based and contains RW patient data. Data were obtained from the database of the Paris tertiary referral centre for ALS (Pitié Salpêtrière Hospital -Assistance Publique des Hôpitaux de Paris, France) between September 1999 and April 2008. The initial patient sample size was 1,377 and included data on baseline ALSFRS sub-scores, age, time interval since symptom onset, El Escorial criteria and muscle strength values.
As patient treatment is not relevant for our current work; please refer to corresponding clinical trial articles for additional information.

Additional information on UMAP
Supervised learning models usually require large amounts of data to avoid overfitting and lack of generalisation, which aren't available in ALS research. Unsupervised learning methods have the advantage of capturing distribution patterns without data implications. Standard linear methods such as Principal Component Analysis (PCA) 4 have been used in ALS for gene expression analysis 5 . Unfortunately, these conventional linear-based methods are not capable of describing non-linear relationships and have underperformed in this study context. Non-linear methods provide new modelling possibilities given their comprehensive ability to describe data correlations and have successfully been tried out for ALS phenotype identification on clinical trial data 6 with the current state-of-the-art manifold learning model, t-Student Stochastic Neighbour Embedding (t-SNE) 7 . UMAP outperforms t-SNE on the following aspects: it scales with regards to complexity (calculation-wise), allows dimension reduction for other purposes than visualisation (i.e. dimensions can be larger than 3), has a convex cost function (where t-SNE has a non-convex cost function that leads to initialisation based results) and preserves neighbourhood, distances and density (and not only neighbourhood like t-SNE) 8 . All these assets make UMAP more suitable for clustering at a later stage than t-SNE.
UMAP is neighbourhood-based and works in two steps. First, a compressed embedding of the input space is built through topological analysis of the data structure using simplexes 1 . The compressed embedding is a simplicial simplex which can be seen as a neighbourhood graph of the input data that is built from the open cover 2 of the simplexes. Second, a low-dimensional (in our case two-dimensional) data embedding is found through a cross-entropy 3 optimisation process. UMAP preserves data neighbourhoods, distances and density. The initial modelling step depends on whether the algorithm should focus on preserving the local or global input data structure. Data structure is estimated according to the size of the neighbourhood investigated. The second compression step is mainly defined using two parameters which are the output dimension size and the minimum distance permitted between two points in the output space, i.e. how compact the output projection can be. Direct model specification is not possible as the UMAP projection function is a black-box approach. That being said, the UMAP projection function can be stored and used afterwards on new data. As it is a black-box approach, projection features cannot be analysed to provide any interpretability. Output dimension analysis, as commonly performed for PCA, cannot be carried out. Analysis of input feature distribution in the UMAP projection space is an alternative as it gives a broad overview of variable importance with regards to the projection. In Figure 1, age and baseline ALSFRS score appear to be the variables which matter most distance-wise in the output space. Both variables have a global incidence on the overall UMAP projection distribution. Other variables, which show a weaker or limited impact on the overall UMAP projection distribution, may have a more local impact in the output space distance-wise.
Performance is directly related to the observed sample size (and its ability to represent the overall ALS patient population) as, the more data collected, the finer the split in the input space with a controlled confidence bound. UMAP was performed on a 2D plane as the 1D projection led to uninterpretable results and the 3D projection led to results similar to those observed in a 2D setting (with a potato-shaped form). Given the additional dimension, data density was lower and projection analysis and partitioning was more complex so the 2D projection was preferred. Given the current collected dataset size, we could not explore finer splits without a significant increase of the confidence bound. As more data gets collected, a finer division of the projection space can be expected and a precision medicine approach can be implemented to provide clinicians with more distinct patient divisions. Model updating is straightforward as novel data points can be projected onto the 2D space using the learnt projection function. UMAP projection is fitted using the clinical trial data. UMAP projection for additional data is not computed using the UMAP package as proper mixed data type management (when dealing with both categorical and numerical features) has yet to be implemented in the official UMAP package. As a proxy, UMAP projection is learnt using a random forest model (with 10 trees). The random forest model serves as a proxy for now to estimate the 2D mapping of additional samples. Our random forest model reached a 99.3% coefficient of determination (R2) score. The random forest was trained using 80% of the clinical trial data and was tested using the remaining 20%.

Additional information on distribution differences
Differences in distribution between the development and validation datasets have been analysed with regards to each outcome in scope, for different outcome values. These were survival or death for 1-year survival, ALSFRS range for functional loss and survival time range for overall survival. Kullback-Leibler (KL) divergence for development and validation distributions were calculated for the different outcomes and different zone memberships as detailed in Table 7. KL divergence between the validation and development distributions for patients within the low survival rate zone with a 1-year ALSFRS higher than 30 was not calculated due to sample size for the development distribution. Visual differences for these subsets are presented in Figure 5 and Figure 6. Distribution patterns with regards to 1-year survival for patients that died appeared similar, as presented in Figure 5b. For patients that survived, in Figure 5c, patients seemed to concentrate on one side of the projection pane. Distribution differences, given the discrepancies in sampling size and biases, are difficult to explain. Distribution patterns with regards to overall survival loss seemed similar with regards to the 4 different survival intervals as shown in Figure 5d to 5g. Distribution differences appeared stronger for patients with weaker functional loss, as shown in Figure 5h and 5k as our validation data had weaker patients. The left shift observed for 1-year survival was also observed. A more refined analysis of distribution patterns for 1-year survival is presented in Figure 6 with the spatial division proposed for our model. Overall, spatial differences corresponded to the left shift previously identified. The limited sampling size for deceased patients within the high survival rate zone can partially explain how the differences in distributions according to the x-axis seemed so strong.

Additional information on statistical testing
Significance testing was performed (using an F-test/ANOVA) and identified significant differences in means for all features except gender and onset, as shown in Table 8. Significance testing was also tested (using an F-test/ANOVA) on clinical trial data only and identified significant differences in means for all features but gender, onset and age, as presented in Table 9. ANOVA testing assumptions regarding homogeneity of variance and normality were, however, not met, as shown in Table 10. Patient samples are however independent. Differences in distributions for all four datasets are not relevant with regards to model design. They do not impact non-linear dimension reduction methods, and in our case UMAP, as they do not require specific distribution assumptions for data analysis. Showing that all datasets are equivalent has a limited added value with regards to model generalisability as collected data is biased 10 . Differences in means, for clinical trial data, for all features except gender, onset, and age can be partially explained by discrepancies in trial inclusion criteria. Dissimilarities between clinical trial and RW data were expected due to patient selection bias in clinical trials. Inclusion of Trophos and Exonhit trials are meant to broaden patient scope (i.e. case typologies). As in clinical trials, patient selection is carried out to maximise patient survival till clinical trial end. This overestimates patient survival. Adding additional sources of data improves overall model relevance.
Statistical testing is carried out using the analysis of variance test (ANOVA), a generalisation of t-test testing to more than 2 distributions. The null hypothesis H0 is that each tested distribution has the same distribution mean. Rejecting the null hypothesis implies that there exists a significant difference between one or more distribution means.
Cohen's rule of thumb for effect size (η 2 ) analysis: 0.01 for small, 0.06 for medium and 0.14 and more for large.
Three assumptions need to be tested before carrying out ANOVA testing: • Normality (tested on residuals of the ordinary least square model which fits the outcome (predictor) depending on the source using the Shapiro test); • Homogeneity of variance (no differences in distribution variances) using Levene's test; • Independent observations.  (c). For overall survival, patient population is divided based on survival range: less than 3 months (d), between 3 and 6 months (e), between 6 and 12 months (f) and above 12 months (g). For functional loss, patient population is divided based on ALSFRS range: less than 10 (h), between 10 and 20 (i), between 20 and 30 (j) and above 30 (k). Axes are dimensionless and come from UMAP dimension reduction.

Additional information on projection space division
The division proposed in Figure 3c can be furthermore refined. The initial projection space in Figure 7a is divided into multiples small square cells as shown in Figure 7b. The less populated a cell, the wider the confidence interval associated with the survival probability within that cell and the less reliable the analysis of cell membership. The grid is used to identify survival patterns. These patterns help find a boundary for the two zones which maximises survival rate for the upper panel zone and minimises survival rate for the lower panel zone. Optimisation of the survival rate zone boundaries was performed by adjusting the previously straight-line boundaries to exclude patients with an uninformative outcome from high and low survival rate zones. Similarly to what was earlier presented, the UMAP projection space can be divided into three new zones with different survival rates: high, intermediate and low survival rate zones with respectively 90% (±4%), 77% (±4%) and 50% (±6%) as shown in Figure 7c. The low survival rate is slightly modified to −8% for the low survival rates zones. As for the simple division, the intermediate survival rate zone is non-informative as the zone's survival rate is similar to the overall patient survival rate of 79%.   . The overall space is divided in three zones based on the distribution pattern observed in square grid (c); the survival rate for each zone is calculated using patients belonging to each zone. Axes are dimensionless and come from UMAP dimension reduction.