Identifying Parkinson’s disease subtypes with motor and non-motor symptoms via model-based multi-partition clustering

Identification of Parkinson’s disease subtypes may help understand underlying disease mechanisms and provide personalized management. Although clustering methods have been previously used for subtyping, they have reported generic subtypes of limited relevance in real-life practice because patients do not always fit into a single category. The aim of this study was to identify new subtypes assuming that patients could be grouped differently according to certain sets of related symptoms. For this purpose, a novel model-based multi-partition clustering method was applied to data from an international, multi-center, cross-sectional study of 402 Parkinson’s disease patients. Both motor and non-motor symptoms were considered. As a result, eight sets of related symptoms were identified. Each of them provided a different way to group patients: impulse control issues, overall non-motor symptoms, presence of dyskinesias and psychosis, fatigue, axial symptoms and motor fluctuations, autonomic dysfunction, depression, and excessive sweating. Each of these groups could be seen as a subtype of the disease. Significant differences between subtypes (P < 0.01) were found in sex, age, age of onset, disease duration, Hoehn & Yahr stage, and treatment. Independent confirmation of these results could have implications for the clinical management of Parkinson’s disease patients.


Motor variables specification
In this section, we provide the specification of the motor variables that were considered in the study.
• PIGD (MDS-UPDRS items 2.12, 2.13, 3.10, 3.11, 3.12)

Methodological preliminaries
In this section, we provide the methodological background for multi-partition clustering with Bayesian networks, as well as a brief introduction to probabilistic inference with Bayesian networks.

Notation
We use capital letters, such as X, Y, Z, to denote variable names, and lower-case letters, such as x, y, z, to denote specific values taken by these variables. Sets of variables are indicated by bold capital letters, such as X, Y, Z, and assignments of values to these variables are indicated by bold lower-case letters, such as x, y, z.

Multi-partition clustering with Bayesian networks
Let D be a dataset with N independent, identically distributed (i.i.d.) data instances over a set of categorical and continuous observed variables X = {X_1, ..., X_m}. Multi-partition clustering algorithms [1]-[4] assume that the data have been generated by a probability distribution P(X) that can be expressed as a finite mixture model [5] with s categorical latent variables H = {H_1, ..., H_s}. Each latent variable in the model represents a unique partition (i.e., clustering solution) with k_i probabilistic clusters:

P(\mathbf{X}) = \sum_{H_1} \cdots \sum_{H_s} P(\mathbf{H})\, P(\mathbf{X} \mid \mathbf{H})    (1)

However, working with this model becomes cumbersome because the number of model parameters increases exponentially with the number of categorical variables. Furthermore, its interpretation also becomes difficult because each partition is defined by all the observed variables, regardless of their relevance to it. It is therefore necessary to exploit the conditional independences that are present in the data. Conditional independence is a central concept in Bayesian networks (BNs) [6,7]. When conditional independences are present, BNs produce a factorization of the joint probability distribution that substantially reduces the number of model parameters and makes it possible to graphically represent the relevant relationships between variables.
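To make Equation (1) concrete, the following minimal sketch (illustrative only; the probabilities, dimensions and variable names are not taken from the study) evaluates the density of a two-partition mixture in which P(X | H_1, H_2) needs a separate Gaussian component for every joint configuration of the latent variables, which is what makes the unfactorized model grow exponentially with s.

import numpy as np
from scipy.stats import multivariate_normal

# Two categorical latent variables (partitions) with 2 and 3 clusters, respectively.
k1, k2 = 2, 3
p_h = np.array([[0.10, 0.25, 0.05],
                [0.20, 0.25, 0.15]])        # joint P(H1, H2); sums to 1

# Without conditional independences, P(X | H1, H2) requires one Gaussian
# component per joint configuration of (H1, H2): k1 * k2 components in total.
rng = np.random.default_rng(0)
means = rng.normal(size=(k1, k2, 2))        # two observed continuous variables
cov = np.eye(2)

def mixture_density(x):
    # P(x) = sum_{h1} sum_{h2} P(h1, h2) P(x | h1, h2), i.e., Equation (1) with s = 2.
    return sum(p_h[h1, h2] * multivariate_normal.pdf(x, means[h1, h2], cov)
               for h1 in range(k1) for h2 in range(k2))

print(mixture_density(np.zeros(2)))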
Given a set of variables Y = {X, H}, a BN is defined by: (i) a directed acyclic graph G that comprises the structure of the network and represents the conditional independences among the variables, and (ii) a set of parameters θ that represents the conditional probability distribution (CPD) of each variable Y_i ∈ Y given its parents Pa_i^G in the graph. B = {G, θ} is a BN with respect to G if and only if it satisfies the local Markov property, i.e., each variable is conditionally independent of its non-descendants given its parents in the graph. Hence, the joint probability distribution factorizes as

P(\mathbf{Y}) = \prod_{i=1}^{m+s} P(Y_i \mid \mathbf{Pa}_i^{G})    (2)

Depending on the nature of the variables that are present in a BN, we can distinguish between categorical, continuous and mixed BNs. In this study, we are interested in clustering continuous data with multiple categorical latent variables. Therefore, we work with mixed BNs, which are composed of categorical and continuous variables. More specifically, we consider conditional linear Gaussian (CLG) BNs [8]. In a CLG BN, every categorical variable may only have categorical parents and its CPD is categorical. In addition, every continuous variable may have both categorical and continuous parents, and its CPD is a CLG. Let Y be a continuous variable with continuous parents C and categorical parents D. We say that Y follows a CLG distribution if for every assignment d ∈ Ω_D (where Ω_D represents all possible joint assignments to D) there are Gaussian parameters β_{d,i} and σ²_d such that

p(Y \mid \mathbf{C} = \mathbf{c}, \mathbf{D} = \mathbf{d}) = \mathcal{N}\!\left(\beta_{d,0} + \sum_{i} \beta_{d,i}\, c_i,\; \sigma^2_{d}\right)    (3)

Learning a BN from data is typically performed by a search method that approaches the learning process from a model selection perspective. Search methods define a hypothesis space of potential models, a set of operators to navigate this space, and a scoring function that measures how well the model fits the observed data [9]. When all the variables in the network are observed, the decomposability of BN scores such as the Akaike information criterion (AIC) [10] and the Bayesian information criterion (BIC) [11] allows for efficient learning. However, in the presence of latent variables, it is not feasible to efficiently learn a BN because these scoring functions do not decompose.
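As a concrete illustration of the CLG CPD in Equation (3), the sketch below (with illustrative, made-up parameters) evaluates the density of a continuous variable Y with one categorical parent D (two states) and two continuous parents C.

import numpy as np
from scipy.stats import norm

# One set of CLG parameters per state d of the categorical parent(s).
beta0 = {0: 1.0, 1: -0.5}                                    # intercepts beta_{d,0}
beta = {0: np.array([0.8, -0.3]), 1: np.array([0.1, 0.6])}   # coefficients beta_{d,i}
sigma2 = {0: 0.5, 1: 1.2}                                    # variances sigma^2_d

def clg_density(y, c, d):
    # p(y | C = c, D = d) = N(y; beta_{d,0} + sum_i beta_{d,i} c_i, sigma^2_d)
    mean = beta0[d] + beta[d] @ c
    return norm.pdf(y, loc=mean, scale=np.sqrt(sigma2[d]))

print(clg_density(0.7, np.array([1.0, 2.0]), d=1))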
Friedman's structural expectation-maximization (SEM) [12] is the most widely used algorithm for learning a BN in the presence of latent variables. It generalizes the expectation-maximization (EM) algorithm [13] to the problem of BN learning. In the expectation step, it uses the current model to estimate the values of latent variables and generates a completed dataset D * . Then, in the maximization step, it estimates the parameters and structure of the new model using the completed data. Any search method can be used in the maximization step. However, the scoring function to be maximized must be a penalized version of the log-likelihood (LL), such as AIC or BIC.
It is important to note the benefits of SEM compared to a brute-force approach: rather than re-estimating the model parameters after each structure change, the output of a single expectation step is used to perform many structure changes. At each iteration t, SEM selects the model B_{t+1} with the highest expected score. The expected score is computed over the completed data and is referred to as score(B_{t+1} : D^*_t). Its use is motivated by the following inequality:

\mathrm{score}(B_{t+1} : D) - \mathrm{score}(B_t : D) \geq \mathrm{score}(B_{t+1} : D^*_t) - \mathrm{score}(B_t : D^*_t)    (4)

which states that a score improvement with respect to the completed data guarantees an improvement with respect to the observed data. Hence, Equation (4) ensures that the SEM algorithm converges to a local optimum without the need to apply probabilistic inference after each structure change. When a latent variable is known to exist, we can introduce it into the BN model and apply SEM. However, when performing multi-partition clustering, we know neither the appropriate number of latent variables nor their respective cardinalities. The current literature [1,2] has approached this problem by constructing the BN model one local move at a time (i.e., introducing a latent variable, removing an arc, increasing the cardinality of a latent variable, etc.). Surprisingly, not much attention has been paid to incorporating the SEM algorithm into the learning process. As a result, existing approaches restrict their search to tree-like structures in order to reduce their computational complexity.

Probabilistic inference with Bayesian networks
BNs offer an additional advantage for multi-partition clustering. Once the model has been learned, it can be used to make predictions, diagnoses and explanations. To do this, the CPD of a variable (or a set of variables) of interest is computed given the values of other variables. This process is known as probabilistic inference.
Given a BN B over a set of random variables X, probabilistic inference allows us to answer general queries of the form P(I = i | E = e), where e are the values of the evidence variables E ⊂ X and i are the values of the variables of interest I ⊆ X \ E, which we do not know. For example, in the medical domain, we can query the probability of a certain disease given the observed symptoms. The goal of inference can be formulated as follows:

P(\mathbf{I} = \mathbf{i} \mid \mathbf{E} = \mathbf{e}) = \frac{P(\mathbf{i}, \mathbf{e})}{P(\mathbf{e})}

The general problem of probabilistic inference in BNs was first tackled by Kim and Pearl [14]. At a very high level, inference algorithms can be divided into two main classes: exact inference methods and approximate inference methods. In this paper, we use approximate inference methods because exact inference in conditional linear Gaussian BNs is usually infeasible [15]. For additional information on this topic, Salmeron et al. [16] provide a recent review of the literature.
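As a simple illustration of such a query (the paper itself relies on variational inference, introduced later), the following sketch estimates P(H = h | X1 = x1, X2 = x2) by likelihood weighting in a toy CLG BN with made-up parameters.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy CLG BN: categorical H -> continuous X1 and X2 (illustrative parameters).
p_h = np.array([0.3, 0.7])
mean = {0: (0.0, 1.0), 1: (2.0, -1.0)}
sd = {0: (1.0, 0.5), 1: (1.5, 1.0)}

def posterior_h(x1, x2, n_samples=20_000):
    # Likelihood weighting: sample H from its prior and weight each sample
    # by the density of the observed evidence under that sample.
    weights = np.zeros(2)
    for _ in range(n_samples):
        h = rng.choice(2, p=p_h)
        w = norm.pdf(x1, mean[h][0], sd[h][0]) * norm.pdf(x2, mean[h][1], sd[h][1])
        weights[h] += w
    return weights / weights.sum()

print(posterior_h(1.5, -0.2))   # approximate P(H | X1 = 1.5, X2 = -0.2)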

Greedy latent structure learner
In this section, we develop a multi-partition clustering method, the greedy latent structure learner (GLSL), that incorporates a variational Bayesian (VB) [17] version of SEM to learn CLG BNs with categorical latent variables from continuous data. Our method iteratively explores the space of CLG BNs with categorical latent variables using five latent operators and the VB-SEM algorithm. Latent operators are tasked with introducing latent variables, removing latent variables, and changing the cardinality of latent variables. Each application of the latent operators produces a candidate model whose structure is subsequently refined using a local run of VB-SEM, which introduces, removes, or reverses BN arcs. Both aspects of the structure search (latent variables and BN arcs) are handled separately to reduce the computational cost of the algorithm.
We start with a description of the latent operators in Section 3.1. Then, in Section 3.2, we present the VB-SEM algorithm and its local variant (see Section 3.2.1). Finally, in Section 3.3, we discuss the search procedure.

Latent operators
• Latent introduction (LI) generates a new model by introducing a new categorical latent variable as the parent of a pair of variables that currently have no parents.
• Conditional latent introduction (CLI) generates a new model by introducing a new categorical latent variable as the child of an existing latent variable with three or more children, and as the parent of a pair of its children. Supplementary Figure S1 provides an example application of this operator.
• Latent variable elimination (LE) generates a new model by removing a latent variable and its associated arcs.
• Cardinality increase (CI) creates a new model by increasing the cardinality of a categorical latent variable H by one. This operator is not applicable if the variable has a cardinality equal to k max , which is established by the user.
• Cardinality decrease (CD) produces a new model by decreasing the cardinality of a categorical latent variable H by one. This operator is not applicable if H already has a cardinality of two.

SUPPLEMENTARY FIGURE S1: Example application of the CLI operator. The base model contains two latent variables (i.e., H1 and H2). However, only H1 has more than two children. As a result, two candidate models are generated by introducing a new latent variable H3 as the child of H1. Categorical variables are colored purple, while continuous variables are colored yellow. The number in parentheses accompanying each latent variable represents its cardinality.

VB-SEM
In this section, we introduce a variant of SEM that is framed within the VB framework to ensure its applicability to CLG BNs. We refer to this method as VB-SEM. The introduction of the VB framework arises from the need to apply probabilistic inference in the expectation step of SEM. Since exact inference is NP-hard for CLG BNs (even when the BN structure is restricted to a polytree) [15], approximate inference is required. Two of the most popular approximation schemes are Markov chain Monte Carlo (MCMC) [18] and variational inference (VI) [19]. We chose VI (and, more specifically, the VB framework) due to its speed advantage over MCMC and its easy integration with BNs through the variational message passing algorithm [20]. Algorithm 1 describes VB-SEM, which performs as follows. It receives a dataset D, an initial model B_0, and a set of arc restrictions R_A that may limit the structures produced by the algorithm (e.g., we may not be interested in adding/removing certain graph arcs). Line 1 is the main loop of the algorithm. Line 2 describes the expectation step, in which the current model B_t estimates the values of its latent variables and generates a completed dataset D^*_t. Lines 3-7 describe the maximization step. In line 3, the completed dataset is used to learn the structure G_{t+1} of the new model B_{t+1}. For simplicity, we use a greedy search (GS) method with arc addition, arc removal, and arc reversal operators [21]. Once the structure has been learned, the parameters θ_{t+1} of the new model are estimated using the observed data D (line 4). To this purpose, the VB-EM algorithm [22] is employed. Finally, the resulting model B_{t+1} is compared with the current best model B_t; if its score with respect to D does not improve on that of B_t, the loop stops and B_t is returned (lines 6-7); otherwise, B_{t+1} becomes the current model and the process is repeated.
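The following Python-style skeleton mirrors the structure of Algorithm 1 as described above; it is a sketch rather than the authors' implementation, and every helper function (expectation_step, greedy_search, vb_em, score) is a hypothetical placeholder.

def vb_sem(data, initial_model, arc_restrictions, max_iter=100):
    """Skeleton of the VB-SEM loop (all helper functions are hypothetical placeholders)."""
    current = initial_model
    for _ in range(max_iter):
        # Expectation step: complete the data using the current model's latent estimates.
        completed = expectation_step(current, data)
        # Maximization step, part 1: greedy structure search on the completed data.
        structure = greedy_search(completed, current.structure, arc_restrictions)
        # Maximization step, part 2: parameter estimation on the observed data via VB-EM.
        candidate = vb_em(structure, data)
        # Keep the candidate only if its score with respect to the observed data improves.
        if score(candidate, data) <= score(current, data):
            break
        current = candidate
    return current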
VB-SEM estimates the parameters and score of the new model with respect to the observed data (i.e., the observed score) rather than with the completed data (i.e., the expected score) to improve its model fitting with respect to the observed data. Despite the desirable properties of SEM, Equation (4) offers no guarantee that the model selected at the end of the search is near the optimum of the observed score [7]. For example, if a search method is used inside SEM, only the first structure change (e.g., an arc addition) guarantees an improvement with respect to the observed score. Benjumeda et al. [23] address this problem by estimating the observed score after each structure change at the expense of the score decomposition. Their approach applies to categorical data, where exact inference can be made tractable by bounding the treewidth of the BN [24]. We instead consider the approximation of estimating the expected score when learning the BN structure, and the observed score when learning the parameters.
The scoring function that is traditionally used within the VB framework is the evidence lower bound (ELBO). The ELBO evaluates how closely the variational distribution Q(H, θ) approximates the posterior distribution of latent variables and parameters P(H, θ | D, G) using the Kullback-Leibler (KL) divergence:

\mathrm{ELBO}(Q) = \log P(D \mid G) - \mathrm{KL}\left(Q(\mathbf{H}, \boldsymbol{\theta}) \,\|\, P(\mathbf{H}, \boldsymbol{\theta} \mid D, G)\right)    (5)

where log P(D | G) is the logarithm of the evidence (i.e., the marginal likelihood). However, when applied to models with categorical latent variables, an adjustment must be made to the ELBO in order to account for the lack of identifiability of the model parameters. A common solution involves introducing a small penalty in the ELBO that accounts for the cardinality k_i of each latent variable H_i in the model [25]. This results in a penalized version of the ELBO (p-ELBO), which we use as our scoring function:

\text{p-ELBO}(Q) = \mathrm{ELBO}(Q) - \sum_{i=1}^{s} \log(k_i!)    (6)

Finally, a key aspect for the performance of VB-SEM is prior specification. When expert information is available, prior parameters can be selected to best reflect the expert knowledge. When this information is unavailable, we propose the following strategy: (i) observed variables assume empirical Bayes [26] priors (with maximum likelihood estimates as the values of the prior parameters), and (ii) latent variables assume a symmetric Dirichlet prior with a total concentration of 1.00.
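A minimal sketch of the two ingredients just described, assuming the identifiability penalty takes the label-switching form sum_i log(k_i!) shown in Equation (6) and that the symmetric Dirichlet prior simply spreads the total concentration of 1.00 across the states of each latent variable; the numeric values are placeholders.

import math
import numpy as np

def p_elbo(elbo, cardinalities):
    # Penalized ELBO: subtract log(k_i!) for every latent variable H_i (assumed form).
    return elbo - sum(math.log(math.factorial(k)) for k in cardinalities)

def latent_prior(k, total_concentration=1.0):
    # Symmetric Dirichlet prior for a latent variable with k states,
    # spreading the total concentration evenly across its states.
    return np.full(k, total_concentration / k)

print(p_elbo(elbo=-1234.5, cardinalities=[2, 3]))   # e.g., two latent variables
print(latent_prior(3))                              # [0.333..., 0.333..., 0.333...]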

Local VB-SEM
Repeatedly evaluating candidate models produced by the latent operators can become prohibitive when each evaluation involves a full execution of VB-SEM. Therefore, we propose replacing VB-SEM with a local version of this method, which only considers the section of the model that is affected by the latent operator while the rest of the model is kept unchanged. We refer to this method as local VB-SEM.

SUPPLEMENTARY FIGURE S2: Example execution of the local VB-SEM algorithm with two iterations. The initial model is a candidate of the latent introduction operator, which has introduced H2. It introduces two arcs in its first iteration (H2 → X4 and X5 → X4) and one arc in its second iteration (X3 → X6). Variables highlighted in green have their parameters estimated by the local VB-EM algorithm. Variable colors and parentheses have the same meaning as in Supplementary Figure S1.
In the local VB-SEM algorithm, we allow adding, removing, or reversing arcs that involve any of the variables considered by the latent operator. Once the structure has been learned, the local VB-EM algorithm [27] estimates the parameters of those variables belonging to the Markov blankets (MBs) of: (i) variables initially selected by the latent operator, and (ii) variables affected by a structure change (e.g., a new incoming arc). For any variable in a BN, its MB consists of the set of all its parents, children, and spouses (parents of children) in the network. Therefore, the pseudocode of local VB-SEM is identical to the pseudocode of VB-SEM (Algorithm 1), except that the structure learning process is always constrained by a specific set of arc restrictions, and the parameter learning process is performed with the local VB-EM algorithm.
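For reference, computing a Markov blanket only requires the graph structure; the sketch below does so for a small hypothetical DAG encoded as a parent dictionary (the graph is illustrative, not Figure S2).

def markov_blanket(node, parents):
    # Markov blanket = parents + children + spouses (co-parents of the children).
    children = {v for v, pa in parents.items() if node in pa}
    spouses = set().union(*(parents[c] for c in children)) - {node} if children else set()
    return parents[node] | children | spouses

# Illustrative DAG: variable -> set of its parents.
parents = {
    "H1": set(), "H2": set(),
    "X1": {"H1"}, "X2": {"H1"}, "X3": {"H1"},
    "X4": {"H2", "X5"}, "X5": {"H2"}, "X6": {"H2"},
}
print(markov_blanket("H2", parents))   # {'X4', 'X5', 'X6'}
print(markov_blanket("X5", parents))   # {'H2', 'X4'}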
One iteration of local VB-SEM is computationally much cheaper than one iteration of VB-SEM because it considers fewer structure changes and it updates fewer model parameters. This also implies that a run of local VB-SEM usually requires fewer steps to converge than a run of VB-SEM. The computational complexity of local VB-SEM is upper-bounded by the computational complexity of VB-SEM.
Supplementary Figure S2 provides an example execution of the local VB-SEM algorithm. The initial model is a candidate of the LI operator, which has introduced a new latent variable H2 as the parent of the observed variables X5 and X6. In its first iteration, local VB-SEM introduces two new arcs: H2 → X4 and X5 → X4. After the structure learning process, the local VB-EM algorithm estimates the parameters of H2, X4, X5, and X6 (highlighted in green in the figure), all of which belong to the MBs of the variables initially selected by the latent operator. In its second and last iteration, local VB-SEM introduces a new arc from X3 to X6 and estimates the parameters of H2, X2, X3, X4, X5, and X6. While X2 does not belong to the MB of the variables initially considered by the latent operator, it does belong to the MB of a variable affected by a structure change (i.e., X3). This is not the case for H1 and X1, whose parameters are kept unchanged.

Search procedure
The search starts with an initial model B_0 established by the user. Given this model, let W denote the set of variables that are considered by the LI operator (i.e., the set of variables that currently have no parents) and let H_C denote the set of latent variables that are considered by the CLI operator (i.e., the set of latent variables that currently have three or more children). Once the initial model has been established, the search method traverses the space of models by iteratively applying the latent operators. Each application of a latent operator produces a set of candidate models that are evaluated using the local VB-SEM algorithm with the p-ELBO as the scoring function (see Equation (6)). Once all candidate models have been evaluated, the highest-scoring candidate model B' is selected. If its score is higher than that of the current model B, the candidate takes its place and the process is repeated; otherwise, the search stops. Once the search stops, the resulting model is refined using a full run of VB-SEM. This search procedure is formally defined in Algorithm 2.
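A Python-style skeleton of this search (a sketch of Algorithm 2, not the authors' implementation; the latent-operator objects and the local_vb_sem, vb_sem and p_elbo_score helpers are hypothetical placeholders):

def glsl(data, initial_model, operators, k_max):
    """Skeleton of the GLSL search procedure (all helpers are hypothetical placeholders)."""
    best = initial_model
    while True:
        # Apply each latent operator (LI, CLI, LE, CI, CD) to the current model and
        # refine every candidate with a local run of VB-SEM.
        candidates = [local_vb_sem(candidate, data)
                      for op in operators
                      for candidate in op.candidates(best, k_max)]
        if not candidates:
            break
        challenger = max(candidates, key=lambda m: p_elbo_score(m, data))
        if p_elbo_score(challenger, data) <= p_elbo_score(best, data):
            break                      # no candidate improves the p-ELBO: stop the search
        best = challenger
    # Final refinement with a full (unrestricted) run of VB-SEM.
    return vb_sem(data, best, arc_restrictions=set())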
Supplementary Figure S3 provides an example execution of the GLSL algorithm. It starts with an empty graph with three categorical variables {X1, X2, X5} and three continuous variables {X3, X4, X6}. In its first iteration, the LI operator is selected and a new latent variable H1 is introduced as the parent of X3 and X4. In addition, the local run of VB-SEM introduces three other arcs: H1 → X1, H1 → X2, and X5 → X4. In its second iteration, the CI operator is selected and the cardinality of H1 is increased. No structure changes are introduced by the local VB-SEM. In its third iteration, the CLI operator is selected. A new latent variable H2 is introduced as the child of H1 and as the parent of X3 and X4. In the fourth and last iteration, the CD operator is selected and the cardinality of H2 is decreased. In addition, the local run of VB-SEM introduces a new arc from X1 to X3. Finally, the refinement run of VB-SEM introduces a new arc from X5 to X6 and the model is returned.

SUPPLEMENTARY FIGURE S3: Example execution of the GLSL algorithm with four iterations. It introduces a latent variable H1 in the first iteration, increases the cardinality of H1 in the second iteration, introduces a new conditional latent variable H2 in the third iteration, and decreases the cardinality of H2 in the fourth iteration. Each iteration is followed by a local run of the VB-SEM algorithm, which is tasked with introducing, removing, or reversing BN arcs. Once the iteration process of GLSL finishes, a full run of VB-SEM refines the structure, introducing a new arc from X5 to X6. Variable colors and parentheses have the same meaning as in Supplementary Figure S1.

Complexity analysis
The following auxiliary variables are considered for the complexity analysis of GLSL:
• W → current set of variables with no parents.
• H → current set of latent variables.
• H_C → current set of latent variables with three or more children.
• Ch_{C_i} → current set of children of the latent variable H_i ∈ H_C.
At each iteration of the GLSL algorithm, the CI, CD, and LE operators each evaluate O(|H|) candidate models. In turn, the LI operator evaluates O((|W|² − |W|)/2) candidate models. Estimating the number of candidate models that are evaluated by the CLI operator is more complex than for LI, because CLI depends on the current number of latent variables and their respective numbers of children. The CLI operator evaluates O(Σ_i (|Ch_{C_i}|² − |Ch_{C_i}|)/2) candidate models. Therefore, the total number of candidate models evaluated at each iteration of GLSL is O(3|H| + (|W|² − |W|)/2 + Σ_i (|Ch_{C_i}|² − |Ch_{C_i}|)/2). Note that each evaluation requires running the local VB-SEM algorithm.
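The following helper (illustrative; it ignores applicability restrictions such as k_max, so it gives the upper-bound counts used above) computes the number of candidate models generated in a single GLSL iteration.

def candidates_per_iteration(n_parentless, n_latent, children_counts):
    """Upper-bound count of candidate models in one GLSL iteration.

    n_parentless    = |W|  (variables with no parents; used by LI)
    n_latent        = |H|  (latent variables; used by LE, CI and CD)
    children_counts = [|Ch_{C_i}| for each latent variable with >= 3 children] (used by CLI)
    """
    li = n_parentless * (n_parentless - 1) // 2            # pairs of parentless variables
    cli = sum(c * (c - 1) // 2 for c in children_counts)   # pairs of children per eligible latent
    le_ci_cd = 3 * n_latent                                # one candidate per latent variable and operator
    return li + cli + le_ci_cd

print(candidates_per_iteration(n_parentless=6, n_latent=2, children_counts=[3]))   # 15 + 3 + 6 = 24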

Model selection
In this section, we present the results of comparing the GLSL multi-partition clustering model with others from the state of the art. Due to the absence of prior information, we considered empirical Bayes priors for GLSL. We compared GLSL with several model-based clustering methods from the literature that allow continuous data. We considered three single-partition clustering methods (i.e., traditional model-based clustering methods that learn a mixture model with a single latent variable) and two multi-partition clustering methods:
• Latent class model (LCM) [28], which learns a mixture model where all the observed variables are conditionally independent given a categorical latent variable. The cardinality of the latent variable is iteratively estimated by maximizing the BIC score. The implementation is provided in our public repository.
• Unsupervised k-dependence Bayesian classifier (uk-DB) [29], which learns a BN model with a single categorical latent variable that is the parent of all the observed variables. Observed variables may also be the children of k − 1 other observed variables. The cardinality of the latent variable is iteratively estimated using a scoring function, and the arcs between observed variables are usually estimated using SEM.
Given the presence of mixed data, we used the p-ELBO and the VB-SEM algorithm for the structure learning process. We also considered k = 3. The implementation is provided in our public repository.
• Gaussian mixture model (GMM) [5], which learns a mixture model where all the observed variables follow a joint Gaussian distribution whose parameters depend on the value of the categorical latent variable. The cardinality of the latent variable is iteratively estimated by maximizing the BIC score (a minimal sketch of this BIC-based selection is given after this list). The implementation is provided in our public repository.
• Gaussian expansion adjustment simplification until termination (GEAST) [1], which learns a pouch latent tree model, where each internal node represents a latent variable and each leaf node represents a set of observed variables. Each set of observed variables is conditionally independent of the rest of the variables in the model given a categorical latent variable. Similar to a GMM, each set of observed variables follows a joint Gaussian distribution whose parameters depend on the value of the categorical latent variable. We used the original Java implementation (https://github.com/kmpoon/pltm-east). For reproducibility purposes, the implementation is also provided in our public repository.
• Multi-partition mixture model (MPMM) [2], which learns a mixture model with several categorical latent variables. Each latent variable is the parent of a unique subset of observed variables. Each observed variable is conditionally independent of the rest of variables given its latent parent. The implementation is provided in our public repository.
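As an illustration of the BIC-based cardinality selection mentioned for LCM and GMM, the following sketch fits Gaussian mixtures of increasing size with scikit-learn and keeps the best one. The random data only stand in for the study dataset, and note that scikit-learn's bic() follows the convention in which lower values are better (the text above uses the opposite, higher-is-better convention).

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(402, 5))       # placeholder data with the study's sample size

best_k, best_bic = None, np.inf
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    bic = gmm.bic(X)                # lower is better in scikit-learn's convention
    if bic < best_bic:
        best_k, best_bic = k, bic
print(best_k, best_bic)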
SUPPLEMENTARY TABLE S1: Comparison of the clustering models (values not reproduced here).¹
¹ The higher the score, the better.
Abbreviations: LL, log-likelihood; BIC, Bayesian information criterion; s, seconds; #Partitions, number of partitions; #Subtypes, total number of subtypes across all partitions; #Parameters, number of model parameters.
Of the considered methods, only GLSL was able to inherently work with missing values. For this reason, the remaining methods worked with an imputed version of the dataset. Missing data imputation was carried out with the multiple imputation by chained equations (MICE) algorithm [30], which is well known for its good performance with continuous data. We used the implementation provided in the Python library scikit-learn, version 0.24.2.
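A minimal sketch of such an imputation with scikit-learn's IterativeImputer (scikit-learn's MICE-style imputer); the toy data, parameter choices and the use of a single completed dataset are illustrative and not necessarily those of the study.

import numpy as np
# IterativeImputer is still flagged as experimental, so this enabling import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[0.1, np.nan, 0.3],
              [0.2, 0.4, np.nan],
              [np.nan, 0.5, 0.6],
              [0.3, 0.2, 0.1]])

# A single completed dataset; MICE-style multiple imputations can be obtained by
# repeating the fit with different random_state values.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
print(imputer.fit_transform(X))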
In order to evaluate the quality of these models, we estimated their LL and BIC scores. As shown in Supplementary Table S1, the scores of the single-partition clustering models (LCM, GMM and uk-DB) were similar to one another, but far below those of most multi-partition clustering models (GEAST and GLSL). MPMM was the exception, returning the worst model. In terms of score, GLSL returned the best model, showing the highest BIC score. Although GEAST obtained a better LL, the BIC difference indicated that GEAST probably suffered from overfitting. This intuition was corroborated by the fact that the GEAST model had 18 partitions and 301 parameters, while the GLSL model had 9 partitions and 131 parameters.
In terms of clustering quality, we observed remarkable differences between single-partition clustering models, and even greater differences between single-partition and multi-partition clustering models. First, we observed that both uk-DB (see Supplementary Figure S4) and GMM (see Supplementary Figure S5) were only able to find two subtypes. These two subtypes could be easily associated with the general severity of all the disease symptoms. One subtype was formed by patients with low (mean of 0.06) symptom severity and the other subtype was formed by patients with a slightly higher (mean of 0.18) symptom severity.
The other single-partition clustering model, the LCM (see Supplementary Figure S6), was able to find five subtypes: (i) patients with almost no severity in both motor and non-motor symptoms (mean of 0.05); (ii) patients with slight severity in both motor and non-motor symptoms (mean of 0.11); (iii) patients who also had slight severity in both motor and non-motor symptoms (mean of 0.12), but with some differences, such as the presence of sweating and the absence of weight-loss problems; (iv) patients with slightly higher symptom severity in both motor and non-motor symptoms (mean of 0.17), especially in mental problems such as depression, apathy and anxiety; and (v) patients with mild severity in all symptoms (mean of 0.25).

SUPPLEMENTARY FIGURE S4: Bayesian network structure of the single-partition clustering model returned by the uk-DB algorithm. Colors and parentheses have the same meaning as in Figure 1.

(Caption fragment from an adjacent supplementary figure:) ... have the same meaning as in Figure 1. We used the style of Poon et al. [1] to represent joint nodes (i.e., nodes with a set of observed variables).
While LCM was able to find more subtypes than both GMM and uk-DB, its subtypes also lacked specificity. The absence of feature selection or multiple partitions led to every subtype being dependent on all the motor and non-motor symptoms. This resulted in insubstantial variables (i.e., tremor and rigidity, whose severity did not change much between subtypes) and general patterns (i.e., there were no remarkable differences between subtypes apart from subtypes 1 and 5, which were practically opposite). The main problem of single-partition clustering models lies in the dimensionality of each subtype. Given that each subtype depends on all of the data attributes, increasing the number of subtypes to model a specific pattern results in a large number of unnecessary parameters and, consequently, overfitting. For this reason, single-partition clustering models usually end up identifying only a few subtypes.
Multi-partition clustering models returned by GEAST (see Supplementary Figure S8) and GLSL (see Figure 1) were able to find a higher number of subtypes that were only related to certain data attributes. These subtypes were not only far more specific, but they also appeared to be far more faithful to the data (much higher scores). Of the two algorithms, GLSL found the higher-scoring model. While GEAST was able to find a good-fitting model, its intrinsic restriction to learning tree structures resulted in a model with 18 partitions (55 subtypes) that was difficult to interpret. In contrast, the model found by GLSL had 9 partitions, with a total of 19 subtypes. In addition, it was able to exclude the tremor and rigidity symptoms from the clustering, which, as previously noted, appeared to be independent of the rest of the symptoms. Compared to the GEAST result, GLSL returned a model that was much easier to interpret.

Statistical tests
In this section, we provide the results (i.e., p-values) of the statistical tests that were performed in the partition analysis. See Supplementary Tables S2 and S3.

Probabilistic inference
A total of 29 probabilistic queries were performed to analyze the connections between the 8 partitions. See Supplementary Table S4.

SUPPLEMENTARY TABLE S4: Probabilistic queries that were performed to analyze the associations between the partitions identified by GLSL.