Identifying patterns in amyotrophic lateral sclerosis progression from sparse longitudinal data

The clinical presentation of amyotrophic lateral sclerosis (ALS), a fatal neurodegenerative disease, varies widely across patients, making it challenging to determine if potential therapeutics slow progression. We sought to determine whether there were common patterns of disease progression that could aid in the design and analysis of clinical trials. We developed an approach based on a mixture of Gaussian processes to identify clusters of patients sharing similar disease progression patterns, modeling their average trajectories and the variability in each cluster. We show that ALS progression is frequently nonlinear, with periods of stable disease preceded or followed by rapid decline. We also show that our approach can be extended to Alzheimer’s and Parkinson’s diseases. Our results advance the characterization of disease progression of ALS and provide a flexible modeling approach that can be applied to other progressive diseases.


Supplementary Notes Modeling Approach
Gaussian Process Regression Gaussian processes take the form: where m(x) describes the model's mean function and k(x,x') describes the model's covariance function.To specify the covariance function, the Gaussian processes in this implementation of MoGP uses the SE kernel, with the form: where  / is the signal variance and  is the length-scale.The signal variance ( / ) determines the average distance of the function from the mean.The length-scale () specifies the smoothness of the function, with increasing length-scales resulting in smoother functions.For the length-scale, a gamma prior with a mean of 4 and variance of 9 was used.The length-scale prior is approximately half of the maximum trajectory duration included in our model, selected to encourage minimal mean function crossings.
In contrast to the SE kernel, the LKM kernel is a linear kernel added to a bias kernel, with the form: where  5 / and  7 / are the signal and bias variance, respectively.The bias allows for a non-zero intercept.
For both Gaussian Process kernels, a fixed signal variance of 1 was used to train the models.A gamma prior with mean 0.75 and variance of 0.25 2 was used for the likelihood noise variance, selected to account for noise present in the data; this parameter was optimized through model training.

Dirichlet Process Clustering
This implementation of Dirichlet Process (DP) clustering uses a collapsed Gibbs sampler, in which we iteratively assign probabilities of each sample joining either existing clusters or generating a new cluster to calculate the likelihood of cluster membership.By repeating this process for each sample until convergence, we are able to identify the number of clusters the data captures as well as sample-specific cluster assignments.
The DP clustering model in MoGP takes the form: where f indicates a cluster-specific GP regression function, k indicates cluster membership, and  : indicates cluster-specific probability, where: and  : ~  (• |1, ∝) ∝ indicates the scaling parameter and modifying this can influence the degree of cluster discretization and therefore the number of identified clusters.For these experiments, to encourage large clusters with at least 50 individuals per cluster, alpha was set to the number of patients in a given dataset divided by 50.
We also show a parameter sensitivity analysis of the prediction experiment to varying alpha values (Supplementary Figure 2).Four different alpha values were tested: 0.1, 0.5, 2.0, and 5.0 times the original mixing parameter.Both the overall error as well as the number of clusters is reasonably stable across the experiments.In the absence of structure in the data, we would expect the mixing parameter to have a direct effect on the cluster sizes; however, the stability of the results across these parameters points to data structure that is learned in the training process.
The results instead indicate that overfitting likely drives the differences in the model performance.The experiments with fewer training data points are more susceptible to overfitting in the case of a complex model like the flexible gaussian process; however, as more data is provided, the MoGP better captures the data structure.

Monotonic Inductive Bias
To encourage monotonically declining functions, we use two modifications to MoGP: 1) a negative linear mean function in our GPs, and 2) a thresholding function to determine cluster membership.In our sampling procedure for our DP model, the probability of each individual joining each cluster is calculated.Our thresholding function constrains the number of clusters an individual can join.If the score for initial visit for a given sample is not close (where close is defined by a user-set 'threshold' parameter) to a cluster's mean function, then the algorithm sets the probability of joining that cluster to 0. This prevents the probability that a participant with a starting ALSFRS-R score vastly divergent from a given cluster will be added to the cluster.For these experiments, this threshold is set as 0.5 for z-scored data, which roughly approximates 5 ALSFRS-R points, because it would be clinically unlikely that one sees this large of a range in patient function for a trajectory pattern.For our negative linear mean function, we used a gamma prior with a mean of 0.66 and variance of 0.2.Together, these values were chosen to minimize major deviation from a monotonic trajectory, although these are weak priors that allow for a degree of non-monotonic behavior depending on the data.
One limitation for our monotonic inductive bias is that our priors are relatively weak.This means that occasionally, outliers or variance in data may cause non-monotonic behavior in some clusters, particularly those that are small; an example of this can be seen in one of the clusters in Extended Data Figure 2. Future work can involve strengthening this constraint by modifying the optimization of the GP kernel.

Baseline Model
Since there are many settings in which patient-specific parametric models are very useful, we provide additional characterization of per-patient parametric models using our framework: a personalized-GP, the D50 sigmoidal model 20,21 , a quadratic model, and a linear mixed effect model.
The personalized-GP model is initialized with the same priors as our MoGP model, and optimized using GPy (see Supplementary Algorithm 1 for GP model priors).
Our D50 model is implemented in the following form: where D50 = time point when the ALSFRS-R drops to 24; dx = slope of ALSFRS-R decrease.The model parameters are fit using scipy curve_fit (dogbox method, with bounds ((0.1, 0.1), (75, 5)), with a D50 initial value of 5, and a dx initial value of 0.5.
Our quadratic model is implemented in the following form:  =  / +  +  where coefficients a, b, and c are fit using scipy curve fit (dogbox method, initial values a=1, b=1, c=1, no bounds).
Our linear mixed model is implemented using statsmodels.formula.api.mixedlm, with the design "Y~x", and groups indicating individual patients.The model is fit using the lbgfs method.In some cases of sparse data, the linear mixed model did not reach convergence; in these cases, the best performance was reported.from the National Institute of Neurological Disorders and Stroke (NINDS).For the original CEFT study, institutional review board approval was obtained at each center, as well as the MGH coordination center IRB, and participants provided written informed consent before screening.We received approval for PRO-ACT from the Pooled Resource Open-Access ALS Clinical Trials Consortium.PRO-ACT is an anonymized database that includes merged datasets from multiple ALS clinical trials.It requires an application to request access, in which the user must agree to protect the security of the data.Dr. Jonathan Glass provided approval and access for using the EMORY dataset.Data used in the preparation of this article were obtained from the ALS/MND Natural History Consortium.As such, the following organizations and investigators within the ALS/MND Natural History Consortium contributed to the provided data, but did not participate in the analysis of the data or the writing of this manuscript: • Henry Ford Health Systems/ Ximena Arcila londono For the original EMORY dataset, the Emory institutional review board approved the study.For NATHIST, each individual site had local IRB approval.Data used in the preparation of this article were obtained from the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) Database.In 2011, Prize4Life, in collaboration with the Northeast ALS Consortium, and with funding from the ALS Therapy Alliance, formed the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) Consortium.The data available in the PRO-ACT Database has been volunteered by PRO-ACT Consortium members.Data used in the preparation of this manuscript were captured by the ALS/MND Natural History Consortium and obtained from NeuroBANK® patient-centric platform hosted by Neurological Clinical Research Institute at Mass General Brigham.

Table 1 .
Study PopulationsAbbreviations: PRO-ACT = Pooled Resource Open-Access ALS Clinical Trials; NATHIST= ALS/MND Natural History Consortium database; CEFT = Clinical Trial Ceftriaxone in Subjects With ALS; AALS = Answer ALS; EMORY = Emory ALS Clinic.IQR indicates interquartile range.Slope (points per month) is calculated as an anchored linear regression across all data points, with the anchor indicating an imputed value of 48 at symptom onset.

Supplementary Table 2. Study Populations -Extended Summary Statistics IQR
indicates interquartile range.N/A indicates not reported.

Table 3 . Percentage of patients that have a reduction in error when MoGP is used as compared to LKM
ALSFRS-R point thresholds range from 0 ALSFRS-R points (indicates percent of all participants for whom a MoGP provides predictions with a lower error than LKM) to 5 ALSFRS-R points (percentage of patients that have a reduced error of more than 5 ALSFRS-R points using a MoGP).

Supplementary Table 6. Number of clusters in each of the models, across study populations Because
a slope model was fit to each patient, the number of slope models is equivalent to number of patients included in the training data.

Inclusion Criteria Total No. Participants Included Median (IQR) No. Visits Median (IQR) Months Followed Median (IQR) ALSFRS
Supplementary Table 7. Study population summary statistics for participants included in prediction (≥ 4 visits) and interpolation (≥ 10 visits) experimentsIQR indicates interquartile range.Slope (points per month) is calculated as an anchored linear regression across all data points, with the anchor indicating an imputed value of 48 at symptom onset.

Table 8 . Summary statistics for dominant ALS progression patterns
Patterns shown in Extended Data Figure 7. P-Values calculated with hypergeometric test.
Supplementary Table 10.Percentage of individuals in dominant PD patterns with a stable PIGD or TD diagnosisDominant patterns shown in Supplementary Figure8.N/A indicates none of the individuals had a stable postural instability/gait difficulty (PIGD) or tremor dominant (TD) label within that cluster.
• Neuromuscular Omnicenter, Milan, Italy/Christian Lunetta • Northwestern University/Senda Ajroud-Driss • Providence ALS Center/Nicholas Olney • Saint Louis University School of Medicine/Ghazala Hayat • Temple University/Terry Heiman-Patterson • University of Florida Gainesville/James Wymer • University of Minnesota/David Walk • Virginia Commonwealth University Health/Kelly Graham Gwathmey • Neurological Clinical Research Institute, MGH/Alexander Sherman • Neurological Clinical Research Institute, MGH/Kenneth Faulconer • Neurological Clinical Research Institute, MGH/Ervin Sanani • Neurological Clinical Research Institute, MGH/Alex Berger • Neurological Clinical Research Institute, MGH/Julia Mirochnick This research includes the National Institute of Neurologic Disease and Stroke's Archived Clinical Research data (Clinical Trial of Ceftriaxone in ALS, Merit Cudkowicz, Massachusetts General Hospital) received from NINDS Archived Clinical Research Datasets webpage.