Introduction

Once patients are diagnosed with type 2 diabetes mellitus (T2DM), a constellation of chronic comorbidities might develop over time1. Common comorbidities are cardiovascular disease, diabetic retinopathy, peripheral neuropathy, and at later stages, chronic kidney disease (CKD) and musculoskeletal complications2,3. This implies that multimorbid T2DM patients have a high disease burden and are likely to experience a high degree of polypharmacy4. Understanding the development of comorbidities and identifying trajectory patterns may aid in developing more personalized management strategies. However, the evolution of chronic comorbidities in patients with T2DM is poorly understood.

With the growing availability of large electronic healthcare records and advances in machine learning, different statistical models have been used to find clusters of T2DM patients with similar diseases or comorbidity progression patterns. For instance, Aguado and colleagues used network analysis5 to identify comorbidity development following T2DM diagnosis, while Khan et al. utilized network analysis to predict the progression of diabetes6. A study by Ahlqvist et al.7, later replicated using clinical data by Dennis et al.8, identified five different subgroups of T2DM glycaemic progression using k-means hierarchical clustering based on six variables. Importantly, all these studies found that the clusters were associated with diabetic complications such as kidney disease or retinopathy. However, no study to date has examined changes in comorbidity clusters over time following the start of T2DM.

Modelling comorbidity progression can help clinicians understand and prevent poor health trajectories and potentially harmful polypharmacy. Previous studies have used latent class analysis (LCA) in healthcare data to broadly model multimorbidity trajectories of chronic diseases9,10,11. However, LCA models pose some limitations, such as they assume that the number of features is known and that the features follow a Gaussian distribution. Hence, we propose that adopting a Bayesian nonparametric model might help overcome these limitations as they allow data to be modelled in an unspecified number of latent features12,13. Only a few epidemiological studies have used this approach, for instance, in understanding comorbidities in patients with psychiatric disorders14 or suicide attempts15. However, Bayesian nonparametric models have never been used in electronic health records to understand T2DM disease progression.

Therefore, this study aimed to identify and describe the progression of common chronic comorbidities after T2DM onset using a Bayesian nonparametric model in a primary care electronic health records database.

Results

Patient cohort and characteristics

Following exclusions, a total of 175,383 eligible T2DM patients were identified, Fig. 1. Table 1 provides the demographic characteristics of the patients at the index date, stratified by sex. There were 97,148 males and 78,235 females, with an average age of 60.6 years. The five most prevalent comorbidities at the index date were high blood pressure (38.1%), cancer (25.5%), osteoarthritis (19.8%), and anxiety and depression (17.2%).

Figure 1
figure 1

Flowchart of included patients. T2DM type 2 diabetes mellitus, PCOS polycystic ovarian syndrome.

Table 1 Demographic characteristics of 175,383 T2DM patients at index date (first NIAD prescription).

Comorbidity cluster identification and characteristics

From the initial list of 58 chronic conditions, we selected a total of 23 conditions that had a prevalence higher than 1.0% to avoid numerical and convergence problems of the Bayesian nonparametric model. The selected chronic comorbidities are shown in Supplementary Table S1.

We found 14 different latent features, of which the first one, the bias term, was active for every patient. The 14 latent features resulted in 385 clusters, each corresponding to a unique combination of the latent features. Table 2 provides an overview of the 20 most common clusters and the top three most prevalent conditions associated with each. Except for cluster 1, which includes the bias term (i.e., latent feature 1), most of the clusters were represented by one highly prevalent chronic disease with other additional diseases having elevated O/E ratios. For example, the second cluster, which had latent feature 2 active, was strongly associated with congestive heart failure (CHF), Table 2. Overall, 98% of the patients in the second cluster had CHF, and the O/E ratio was 43.4. Additionally, once a patient developed CHF, the probability of concomitantly having atrial fibrillation and senile cataract was increased 7.0- and 2.3-fold, respectively, as seen in the corresponding O/E ratios. The third cluster was mainly composed of patients with hypothyroidism, while the fourth and fifth were characterized by patients with osteoporosis and obesity, respectively.

Table 2 Description of the three most prevalent conditions for the first 20 clusters.

From cluster 15 on, we observed that the clusters resulted from combining two or more latent features, Table 2. Hence, patients had two distinct primary diseases along with other secondary comorbidities. For instance, cluster 15 was composed of patients who all (100%) had chronic kidney disease (CKD) and CHF, while one-third (33.3%) had atrial fibrillation. All three top conditions also had elevated O/E ratios of 93.8, 44.3, and 8.4, respectively. In cluster 16, all (100%) patients had chronic bronchitis, 99.2% also had CHF, and 38.9% had atrial fibrillation. Again, elevated O/E ratios were identified for the three top conditions. The complete list of comorbidities identified per cluster is provided in Supplementary Table S2.

Sex differences between clusters were also identified, as shown in Supplementary Figure S1. For example, clusters 2 and 6 (cardiovascular disease clusters) were more heavily dominated by males, as evidenced by the lower proportion of females within the clusters (34.8% and 33.1% female, respectively). Similarly, the O/E ratios for the gender distribution were below 1.0 in both clusters (e.g., 0.70 and 0.78 for clusters 2 and 6, respectively). Conversely, other clusters were female-dominated. For example, clusters 3 and 4 (hypothyroidism and osteoporosis) consisted of 57.8% and 71.3% females, respectively, and the sex O/E ratios were 1.3 and 1.6, respectively), Supplementary Figure S1.

In Table 3, we present the probability of presenting at least one of the latent features active, either in combination with other latent features or as a single feature. We found that 84.3% of the individuals had only the bias term, latent feature 1, active. Moreover, certain comorbidities were more likely to appear than others. For instance, latent feature 2, corresponding to CHF, was the feature with the highest probability of being active, either in combination with other features (2.3%) or as a single feature (1.6%). The least likely features to be active, either in combination with others or as a single feature, were latent features 13 and 14, associated with chronic kidney disease and angina pectoris, respectively, as shown in Table 3. Hence, having CHF and subsequent comorbidities was more likely than having CKD with other comorbidities.

Table 3 Probabilities (%) of possessing at least one latent feature or a single feature.

The probability of having at least two latent features active is presented in Table 4. We found that the empirical probability of two latent features was around twice as large as the product probabilities, indicating that an active latent feature was associated with an increased probability of having another latent feature active. For instance, the empirical probability of having latent feature 2 active, which is dominated by a high prevalence of CHF, and latent feature 4, associated with osteoporosis, was 0.11%, which was 2.5 times higher than the product probability of 0.04%.

Table 4 Probabilities of possessing at least two latent features.

Additionally, we also saw that some diseases increased the probability of having concomitantly other diseases. Having a given feature active led to an increased probability of having another one active, Table 5. For example, latent feature 2 appeared frequently with features 4, 6, 8, 13, and 14. Therefore, this would indicate that osteoporosis, intermittent claudication, arthropathy, angina pectoris, and CKD were commonly associated with CHF in our T2DM cohort.

Table 5 Empirical probabilities of possessing at least latent features \({k}_{1}\) and \({k}_{2}\) given that \({k}_{1}\) is active.

Complementarily, we compared the three main clusters associated with cardiovascular disease, Supplementary Table S3. We present the proportion of patients with each comorbidity overall and within the three clusters. Additionally, the O/E ratios by cluster are provided. Additionally, we compared the proportions for each comorbidity across clusters, using cluster 2 as the comparator. Cluster 2 was characterized by latent feature 2, CHF, and was associated with a higher prevalence of atrial fibrillation and senile cataract. However, when latent feature 13 was also active, cluster 15, a slight shift was observed. Here, 100% of the individuals had CHF and CKD. Moreover, most of the O/E ratios increased in this cluster compared to cluster 2; except for deafness, irritable bowel syndrome, anxiety, and chronic liver disease. Similarly, when latent features 2 and 8 were active in cluster 16, 100% of the patients had chronic bronchitis, and again most of the O/E ratios were increased.

Additionally, comparing baseline characteristics stratified by cluster, similar relationships were found, Supplementary Table S4. For instance, in cluster 2, atrial fibrillation and angina pectoris were highly prevalent at the index date, 19.2% and 16.8%, respectively. Nonetheless, CHF had a low prevalence at baseline, < 1%, suggesting that CHF might develop after atrial fibrillation.

Evolution of clusters over time

In Fig. 2A, we visualize the progression of the top 20 clusters by estimating the proportion of patients belonging to the individual clusters over time, while the probability of the 14 individual latent features being active over time is provided in Fig. 2B. We found that the proportion of people belonging to each cluster increased over time, Fig. 2A.

Figure 2
figure 2

(A) Probability of belonging to each cluster over time. Note that some clusters increased at a higher rate compared to others. More information on the cluster characteristics can be found in Table 2. (B) Evolution of active latent features over time. Latent feature 1 is not depicted as it is always active.

Similarly, looking at the 14 individual latent features in Fig. 2B, we found that the probability of having a given latent feature active increased over time, except for the first latent feature (not shown), which was always active with a constant probability of 1. We observe that the proportion of patients with latent feature 2, which was associated with a high prevalence of CHF, increased at a higher rate compared to the other features. We also see an increase in the prevalence of latent feature 4, osteoporosis, over time Fig. 2B.

The network analysis that depicts the transition between the top 20 clusters over time is provided in Supplementary Figure S2. Overall, patients tended to remain in the same cluster over time. However, transitions from the first (latent feature 1 active) cluster to the other 14 (characterized by a single latent feature) were the most frequent. We further note that patients in clusters 5, 7, 8, 10–12 did not transition to other more complex clusters over time. However, a transition into cluster 2, was associated with further transitions into clusters 15–20, which were characterized by the presence of 2 active latent features.

Discussion

This study confirmed the potential of using a large electronic healthcare database to identify clusters of chronic disease comorbidities among patients with newly treated T2DM. This is the first analysis that applied a Bayesian nonparametric model to real-world electronic medical records to identify distinct comorbidity clusters and disease progression patterns based on hidden latent features. In our case example of patients with T2DM, we could identify 14 different latent features that were strongly associated with a primary disease. Importantly, we identified comorbidity patterns consistent with the literature, pointing to the applicability of this approach in medical data. Thus, we found that Bayesian nonparametric models are a powerful tool to use in electronic health records to identify unique comorbidity clusters and health trajectories.

Understanding disease progression in T2DM patients is paramount to preventing new disease onset, optimizing treatment strategies, reducing polypharmacy, and increasing the safety and effectiveness of therapeutic options. However, due to the complexity of comorbidity patterns, there is a lack of understanding of patterns or trajectories. Previous studies have modelled T2DM progression in electronic health records using different approaches, including network modelling6, naïve Bayes, support vector machines, random forests, and gradient boosted trees16,17, or by using typical and atypical disease trajectory analysis18. Although these approaches can shed some light on the disease progression and comorbidities development, they might not be able to capture relationships between hidden or unknown risk factors. While using latent feature models can overcome important shortcomings of the aforementioned approaches, most models require pre-specifying the number of latent features to be retrieved. Consequently, they might not perform very well in the presence of binary matrices and might lack interpretability because latent features might extend over the real line14,15.

The results of our study identify that a Bayesian nonparametric model is a novel approach for studying chronic comorbidity progression. Bayesian nonparametric models overcome the limitations of traditional latent feature models as they can automatically infer the number of binary latent features from the data19. Using this approach, we found that the development of certain comorbidities can lead to a dramatic increase in the probability of developing other conditions over time. For example, in our analysis, once a patient with T2DM develops CHF, their probability of being diagnosed with atrial fibrillation increases, as seen in cluster 2. We also found that patients with hypothyroidism had an elevated likelihood of being diagnosed with irritable bowel syndrome, anxiety, and neurotic disorders increase20,21. While previous literature has found individual associations between hypothyroidism, irritable bowel syndrome, and anxiety22,23, the link between these as a common cluster, particularly among patients with T2DM, has not been previously identified. Therefore, our models could identify hidden (or previously unknown) connections between the diseases that form each cluster.

While our models identified unique comorbidity clusters, the predicted posterior probabilities were consistent with the known progression of T2DM. For instance, we found that all latent features, especially latent feature 2, which was associated with cardiovascular events, steadily increased over time. Conversely, the posterior probability for the baseline cluster, only latent feature 1 active, decreased over time. These results are in line with previous literature. For instance, Khan et al. found that cardiovascular conditions such as cardiac arrhythmias or hypertension were the most prevalent diseases appearing after T2DM onset6. Similarly, Oh and colleagues identified hyperlipidaemia and hypertension as frequent comorbidities after T2DM diagnosis18. Hence, after T2DM onset, the probability of developing certain comorbidities increases over the course of the disease.

Although our study was population-based and applied Bayesian nonparametric models, which overcome many of the limitations found in previous work, there are remaining limitations that must be considered when interpreting the results of this study. Firstly, we only looked at a specific subset of 23 different chronic comorbidities. Thus, we might have missed some patterns in the data. Moreover, we did not include acute outcomes in our list of comorbidities. We acknowledge that chronic diseases can increase the risk of experiencing an acute event, and acute events can also trigger or accelerate the onset of new chronic conditions. Therefore, future research could assess if similar trajectories are found when incorporating acute events or the impact of the chronic disease clusters on the onset of new acute outcomes.

In addition, since comorbidities were coded as binary variables and remained active after the first diagnosis, we might have missed different severity levels that a chronic disease might have had. Moreover, we did not include pharmacological treatments, which can impact the onset/delay of new comorbidities or alter the current disease status.

Cancer is a very complex and heterogeneous disease that requires thorough medical attention. In our analysis, we grouped all cancer diagnoses as a single disease for interpretability. Nonetheless, we might have missed links between different cancer types and comorbidity clusters, particularly those more commonly associated with T2DM (e.g., pancreatic or gastric cancer). Therefore, future studies may consider using Bayesian nonparametric models to investigate comorbidity clusters associated with specific cancer diagnoses to generate new hypotheses in diabetes patients.

In this population-based study of patients with T2DM, we could confirm the potential of using a Bayesian nonparametric model to identify distinct patient clusters. Our models found results consistent with the literature (e.g., growing prevalence of cardiovascular disease), thereby providing confidence in the utility. In contrast to previous studies based on latent feature analysis, we uncovered previously unknown, or hidden, factors. Based on these results, Bayesian nonparametric models may be useful for developing our understanding of complex comorbidity patterns and disease progression in chronic diseases. A deeper understanding of T2DM progression and multimorbidity can foster new hypotheses for further epidemiological studies and be used in clinical guidance of the patients.

Methods

Data source

The IQVIA Medical Research Database UK (IMRD-UK) incorporates data supplied from The Health Improvement Network (THIN), which is a Cegedim database of anonymized electronic health records generated from the daily record of General Practitioners (GPs). It includes data from more than 18 million patients from over 800 GP practices in the UK and about 6% of the UK population. The database contains detailed information about patient characteristics (i.e., year of birth, sex, practice registration date, practice de-registration date, ethnicity), medical conditions (i.e., diagnoses with dates, referrals to hospitals, symptoms), medications (i.e., drug name, formulation, date, strengths, quantity, dosing instructions), in practice immunizations, laboratory tests, and results, and other patient-level data (i.e., smoking status, height, weight, alcohol use, pregnancy, birth, death dates). For medical conditions, all diagnoses are coded according to the Read clinical code system, a comprehensive coding language with over 100,000 codes and are comparable to the international classification of diseases (ICD) system.

The IMRD contains routinely collected patient data from participating GP practices. Informed consent from all patients to have their data included in the IMRD is obtained by the GP and patients have the option to opt out of the data collection at any time. Ethical approval for the use of the IMRD for medical and public health research was approved by the London—South East Research Ethics Committee (Ref 18/LO/0441). Ethical approval for the protocol of this project was obtained by the IMRD Scientific Research Council (SRC reference number: 20SR062). All methods in this study were carried out in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline and was performed according to the Declaration of Helsinki.

Study population

To identify patients with T2DM, we included all adult patients (age 18 +) with a first-ever prescription of a non-insulin antidiabetic drug (NIAD) between January 1st 2006 and December 31st 2019. The date of the first NIAD prescription defined the index date (start of follow-up). In order to identify new users, patients were required to have a minimum of one year of valid data collection prior to the first-ever prescription of a NIAD. Patients with a history of polycystic ovarian syndrome (PCOS), gestational diabetes, or insulin prescription prior to the index date were excluded since these conditions are treated with NIAD, although not necessarily T2DM patients.

Chronic disease conditions

Chronic diseases were identified as conditions that last longer than one year and require medical attention24. We selected 58 distinct chronic comorbidities using Read Codes Supplementary Table S1, the clinical terminology used in General Practice in the UK in which each Read Code represents a term or short phrase which describes a health-related concept25. Read Codes were simplified to the third level, i.e., the first three letters of the Read Code, to encompass all the possible and small deviations from the primary diagnosis. For example, a “conductive hearing loss”, with Read Code F590500, can be collapsed to “conductive deafness”, F590.11, or further summarised to “hearing loss”, F59.00. The selected chronic conditions were based on conditions from the Quality Outcome Framework (QOF) and previous studies on comorbidities commonly associated with T2DM11,26,27. Given the considerable heterogeneity in the pathogenesis and pathophysiology of cancer, we grouped all diagnoses of neoplasms under one category (Read Codes starting with B). We identified existing comorbidities if a patient had ever had a recorded diagnosis on or before the index date. Finally, to avoid convergence problems of the models, we selected those chronic comorbidities with a prevalence higher than 1.0% for males and females.

We created a longitudinal patient-disease binary matrix in discrete diabetes years (i.e., years elapsed between chronic disease onset and index date). Therefore, every row corresponded to a specific patient in a given year, and the columns corresponded to the comorbidities that the patient had developed in that time point. For model fitting, we selected the last observed period for each patient. Thus, we ended up with a single row per patient which encoded the chronic comorbidities that the patient had developed.

Statistical methods

Prior to model development, we summarized main patient characteristics at index date, stratified by sex. Latent feature models assume that there is an unknown low-dimensional representation of patients-disease28. Traditional methods are matrix factorization or latent Dirichlet allocation (LDA)29. However, these approaches require that the number of latent features to be retrieved be specified and assumed to follow a specific distribution, e.g., Gaussian distribution. An elegant solution to these issues is achieved by using Bayesian nonparametric models, such as a General Latent Feature Model (GLFM), by posing an Indian Buffet Process (IBP) as nonparametric prior over binary observation matrices30. This generated a binary matrix where columns represent a potentially unlimited number of features, while rows, representing patients, are finite. Therefore, GLFMs conduct latent feature analysis without pre-specifying the number of latent features. Each data point \({x}_{n}^{d}\) can be explained by a K-length binary vector \({\mathbf{z}}_{n}=\left[{\mathrm{z}}_{n1},\dots ,{\mathrm{z}}_{nK}\right]\) whose elements indicate whether a latent feature is active or not for the \({n}^{th}\) object, and a real-valued weighting vector \({\mathbf{B}}^{d}=\left[{b}_{1}^{d},\dots ,{b}_{K}^{d}\right]\) whose elements \({b}_{k}^{d}\) weight the influence of each latent feature in the \({d}^{th}\) attribute of \(\mathbf{X}\). Therefore, the likelihood can be described as:

$$p\left(\mathbf{X}|\mathbf{Z},{\{\mathbf{B}}^{{\varvec{d}}}{\}}_{{\varvec{d}}=1}^{{\varvec{D}}}\right)=\prod_{{\varvec{d}}=1}^{{\varvec{D}}}\prod_{{\varvec{n}}=1}^{{\varvec{N}}}{\varvec{p}}\left({x}_{n}^{d}|{\mathbf{z}}_{{\varvec{n}}},{\mathbf{B}}^{{\varvec{d}}}\right).$$

The binary latent feature vectors \({\mathbf{z}}_{n}\) are gathered in a \(N\times K\) matrix \(\mathbf{Z}\) which follows an IBP prior with \(\alpha\) as a concentration parameter, i.e., \(\mathbf{Z}\sim IBP\left(\alpha \right),\) where \(\alpha\) controls the a priori activation probability of new features. Therefore, larger values will result in a higher number of expected latent features as well as a larger number of active features per row. For further details see Valera et al.19. Moreover, we forced the first latent feature to be always active, acting as a bias term (i.e., all patients who do not have comorbidities or just one random comorbidity would only have the first latent feature), making this group to act as a baseline cluster.

On the \({\mathbf{B}}^{d}\) matrix we place a Gaussian prior, \({\mathbf{B}}^{d}\sim N\left(0,{\sigma }_{B}^{2}{\mathbf{I}}_{K}\right)\). In order to overcome the problems of not having a Gaussian-distributed observation matrix, we transform each data point \({x}_{n}^{d}\) into an auxiliary Gaussian variable \({y}_{n}^{d}\), also called pseudo-observation, by applying a transformation function \({f}_{d}\left(\cdot \right)\). The pseudo-observation is defined as

$$p\left({y}_{n}^{d}|{\mathbf{z}}_{n},{\mathbf{B}}^{d}\right)=N\left({y}_{n}^{d}|{\mathbf{z}}_{n}{\mathbf{B}}^{d},{\sigma }_{y}^{2}\right).$$

In the case of a binary observation matrix \(\mathbf{X}\) each observation \({x}_{n}^{d}\) can only take two values \({x}_{n}^{d}\in \left\{0, 1\right\}\). Hence, we can map the real values to the positive real numbers by applying the following transformation

$$x_{n}^{d} = f_{d} \left( {y_{n}^{d} } \right) = \left\lfloor {f_{{{\text{R}}_{ + } }} \left( {y_{n}^{d} } \right)} \right\rfloor = \left\lfloor {\frac{{ {\text{log}} \left( {{\text{exp}}\left( {y_{n}^{d} } \right) + 1} \right)}}{\omega } + \mu } \right\rfloor ,$$

where \(\omega\) and \(\mu\) are scale and location hyper-parameters. Hence, the likelihood is defined as

$$p\left({x}_{n}^{d}|{\mathbf{z}}_{n},{\mathbf{B}}^{d}\right)=\Phi \left(\frac{{f}^{-1}\left({x}_{n}^{d}+1\right)-{\mathbf{z}}_{n}{\mathbf{B}}^{d}}{{\sigma }_{y}}\right)-\Phi \left(\frac{{f}^{-1}\left({x}_{n}^{d}\right)-{\mathbf{z}}_{n}{\mathbf{B}}^{d}}{{\sigma }_{y}}\right),$$

where \({f}_{{\mathfrak{R}}_{+}}^{-1}: {\mathfrak{R}}_{+}\to \mathfrak{R}\) is the inverse function of the transformation \({f}_{{\mathfrak{R}}_{+}}\left(\cdot \right).\)

Inference

Given that the posterior distribution of \({\mathbf{B}}^{d}\) is intractable, we rely on a Markov Chain Monte Carlo (MCMC) approach, i.e., Gibbs sampling19, to obtain posterior samples from Z and B. In order to speed up the sampling process, those patients who did not have any comorbidity were not sampled, and were assigned only the bias term. The sampling procedure can be summarized as follows:

Firstly, we sample \(\mathbf{Z}\)

$$p\left({Z}_{nk}=1|{\mathbf{Z}}_{-nk},\mathbf{X}\right)\propto \frac{{\mathbf{m}}_{k}-{Z}_{nk}}{\mathrm{N}}p\left(\mathbf{X}|\mathbf{Z}\right),$$

then we sample \({\mathbf{B}}^{{\varvec{d}}}\)

$$p\left({\mathbf{b}}^{d}|{\mathbf{y}}_{n}^{d},\mathbf{Z}\right)=N\left({\mathbf{b}}^{d}|{\mathbf{P}}^{-1}{{\varvec{\uplambda}}}^{d},{\mathbf{P}}^{-1}\right),$$

where \(\mathbf{P}={\mathbf{Z}}^{\mathrm{\top }}\mathbf{Z}+1/{\upsigma }_{B}^{2}{\mathbf{I}}_{k}\) and \({{\varvec{\uplambda}}}^{d}={\mathbf{Z}}^{\mathrm{\top }}{\mathbf{y}}^{d}\). Finally, we sample \({Y}^{d}\) given \(\mathbf{X},\mathbf{Z},{\mathbf{B}}^{\mathrm{d}}\),

$$p\left({y}_{n1}^{d}|{x}_{n}^{d},{\mathrm{z}}_{n},{\mathrm{B}}^{d}\right)=N\left({y}_{n1}^{d} |{\mathbf{z}}_{n}{\mathbf{b}}_{1}^{d},{\sigma }_{y}^{2}\right){\mathbb{I}}\left({f}_{{\mathfrak{R}}_{+}}^{-1}\left({x}_{n}^{d}\right)\le {y}_{n1}^{d}<{f}^{-1}\left({x}_{n}^{d}+1\right)\right),$$

where we sample \({y}_{n1}^{d}\) from a Gaussian left-truncated by \({f}_{{\mathfrak{R}}_{+}}^{-1}\left({x}_{n}^{d}\right)\) and right-truncated by \({f}_{{\mathfrak{R}}_{+}}^{-1}\left({x}_{n}^{d}+1\right)\). This inference procedure is repeated as many times as iterations set.

We set the Gibbs sampler to run for 1000 iterations, the variance of the Gaussian prior to the weighing vectors \({\mathbf{B}}^{\mathbf{d}}\) to \({\upsigma }_{B}^{2}=1\), and the concentration parameter for the IBP to \(\mathrm{\alpha }=1\). In order to speed up the computations, we did not sample those rows of Z corresponding to patients with no disease.

Predictions

In order to analyze the evolution of comorbidities over time, we estimated the active latent features in each period per patient. To do so, we retrieve all the unique combinations of latent features \({\mathbf{z}}_{i}\) from \(\mathbf{Z}\) and compute the likelihood of each \({\mathbf{z}}_{i}\) to each observation \({x}_{n}\), as previously shown,

$$p\left({\mathbf{x}}_{{\varvec{n}}}|{\mathbf{z}}_{{\varvec{i}}},{\{\mathbf{B}}^{{\varvec{d}}}{\}}_{{\varvec{d}}=1}^{{\varvec{D}}}\right)=\prod_{{\varvec{d}}=1}^{{\varvec{D}}}p\left({x}_{n}^{d}|{\mathbf{z}}_{n},{\mathbf{B}}^{d}\right).$$

Description of clusters

We described each cluster \({\mathbf{z}}_{i}\) and tabulated the count and proportion of patients with a specific disease within that cluster, the proportion of people with that specific disease in the overall population, and the Observed-Expected (O/E) ratio. The O/E ratio is the ratio between the proportion of patients with a given disease in a cluster divided by the proportion of patients with that disease overall, and it gives a magnitude of how a specific comorbidity is over- or underrepresented in a given cluster. Moreover, we reported the proportion of females within that cluster and in the population overall and computed the corresponding O/E ratio, the proportion of females in a cluster divided by the proportion of females overall, to detect if there were female-dominated clusters.

We reported the empirical probabilities of possessing at least one latent feature or a single feature. Additionally, we computed the empirical and the product probability of possessing at least two latent features to identify if two given latent features were independent. For instance, once a latent feature is active, the probability of having another given latent feature is higher. Finally, we also computed the probability of possessing at least latent features \({k}_{1}\) and \({k}_{2}\) given that \({k}_{1}\) is active, i.e.,

$$p\left({k}_{1}=1,{k}_{2}=1|{k}_{1}=1\right)\frac{{\sum }_{n=1}^{N}{z}_{n{k}_{1}}{z}_{n{k}_{2}}}{{\sum }_{n=1}^{N}{z}_{n{k}_{1}}}.$$

The Bayesian nonparametric model was implemented in C ++ , and all statistical analyses and summary statistics were done in R version 3.5.1 (R Project for Statistical Computing). Network visualization was done in Gephi31.

Network visualization

To visualize the progression between clusters \({\mathbf{z}}_{i}\) over time we performed a network visualization. The history of latent membership for patient \(n\) in time \(t\) is represented by \({\mathbf{Z}}_{nt}\). Nodes represent the different clusters \({\mathbf{z}}_{i}\), and the directed edges the direction of the transition between clusters \({\mathbf{z}}_{i}\) in different times \(t\). The size of the nodes, the weight, is proportional to the number of times patients were in that specific node. In order to improve the visualization of the nodes, we took the log of the node weight and rescaled the weights between 0 and 1 as follows:

$${w}_{i}=\frac{log\left({w}_{i}\right)-min\left(log\left(w\right)\right)}{max\left(log\left(w\right)\right)-min\left(log\left(w\right)\right)}.$$

Ethical approval

Ethical approval for the protocol of this project was obtained by the THIN scientific research council (reference number: 20SR062).