Estimation of the prevalence and incidence of motor neuron diseases in two Spanish regions: Catalonia and Valencia

According to the degree of upper and lower motor neuron degeneration, motor neuron diseases (MND) can be categorized into amyotrophic lateral sclerosis (ALS), primary lateral sclerosis (PLS) or progressive muscular atrophy (PMA). Although several studies have addressed the prevalence and incidence of ALS, there is a high heterogeneity in their results. Besides this, neither concept has been previously studied in PLS or PMA. Thus, the objective of this study was to estimate the prevalence and incidence of MND, (distinguishing ALS, PLS and PMA), in the Spanish regions of Catalonia and Valencia in the period 2011–2019. Two population-based Spanish cohorts were used, one from Catalonia and the other from Valencia. Given that the samples that comprised both cohorts were not random, i.e., leading to a selection bias, we used a two-part model in which both the individual and contextual observed and unobserved confounding variables are controlled for, along with the spatial and temporal dependence. The prevalence of MND was estimated to be between 3.990 and 6.334 per 100,000 inhabitants (ALS between 3.248 and 5.120; PMA between 0.065 and 0.634; and PLS between 0.046 and 1.896), and the incidence between 1.682 and 2.165 per 100,000 person-years for MND (ALS between 1.351 and 1.754; PMA between 0.225 and 0.628; and PLS between 0.409–0.544). Results were similar in the two regions and did not differ from those previously reported for ALS, suggesting that the proposed method is robust and that neither region presents differential risk or protective factors.


Materials and methods
Design. Two Spanish Mediterranean regions were analysed, Catalonia and Valencia, which together comprise 26.72% of the total Spanish population and 53.67% of the Spanish population living in the Mediterranean basin (Catalonia 7,566,000 inhabitants and Valencia 4,975,000 inhabitants) 25 (Fig. 1). These two neighbouring regions share climatic and sociodemographic conditions and have a common history (both belonged to the Kingdom of Aragon, one of the two kingdoms comprising Spain from the Middle Ages onwards), and much of the Catalan population originates from Valencia and vice-versa. Thus, it is likely that these populations have shared genetic and environmental risk factors of MND.
The study population was made up of the residents in the regions of Catalonia and Valencia between 1 January 2011 and 31 December 2019 (Catalonia) and between 1 January 2013 and 31 December 2018 (Valencia).
Two population-based cohorts were used, (one from Catalonia and the other from Valencia), encompassing the entire population of both regions. In both cohorts, clinical data were prospectively obtained, while some demographic data were retrospectively collected. The cohorts included all the patients who had been assessed at two Motor Neuron Disease Functional Units (MNDFU) in Spain: the Bellvitge University Hospital in Hospitalet de Llobregat in Catalonia, and the La Fe University and Polytechnic Hospital in Valencia, which act as referral units in the Catalan and Valencian regions, respectively. All the patients with MND seen in the Bellvitge Hospital since 2011 and the La Fe Hospital since 2013 were recorded prospectively in a database created for this purpose, and which included their demographic and clinical characteristics. All the patients studied were adults (specifically, the minimum age was 21 years in the Valencia cohort and 25 in the Catalonia cohort).
This study included the patients who had been evaluated in either of the two referral units during the time periods stated above and who met the following criteria: (1) a diagnosis of ALS, defined as the presence of UMN and LMN dysfunction in at least one body region at any time of the disease course, or LMN dysfunction in at least two body regions with a disease course typical of ALS (continuous non-invasive ventilation or death within four years from disease onset) 26  Variables. Response variable. Different models were estimated depending on the response variable. First, all MND cases were considered as the response variable, regardless of diagnostic category. Second, the following diagnostic categories were considered as additional response variables: ALS, PLS and. PMA. This was done separately for each region (Catalonia and Valencia).
Individual control variables. Age at diagnosis and sex were included as individual explanatory variables. Age was further categorised into quintiles, taking the first quintile as the reference category. For the sex variable, male was considered as the reference category.
Patients were further categorized, according to the symptom of onset, into spinal, bulbar, respiratory and cognitive. The C9ORF72 expansion was studied in all MND patients, whereas SOD1 was sequenced in those familial MND patients not carrying the C9ORF72 expansion. In PMA patients, expansions in the androgen receptor gene were appropriately excluded.
Contextual control variables. The Spanish Epidemiology Association's Deprivation Index 2011 (IP2011) 27 corresponding to each of the census tracts of the municipalities of each of the two regions, was included as a contextual explanatory variable of interest. The IP2011 combines the information on six socioeconomic indicators, calculated for each census tract and based on the data collected in the Spanish Population and Housing Census 2011 28 : percentage of the manual worker population, percentage of the temporary working population, percentage of the unemployed population, percentage of the population without complete compulsory schooling, and percentage of primary residences without access to the Internet The IP2011 was considered as the first component of a sequential principal component analysis (PCA). First, a PCA with ten selected indicators was carried out (in addition to the six already mentioned, the following were included: the percentage of the population born in low-income countries and arrived in Spain after 2006, the percentage of the population born in low-income countries or born in Spain whose father or mother were born in low-income countries, the percentage of the population aged 65 or over, and the percentage of single-parent households with a woman in charge). Second, the ten indicators were correlated (using Spearman's correlation coefficient) with the first component obtained through the PCA. Third, those indicators with correlations less than 0.5 were excluded. Finally, a PCA was carried out using the six indicators finally considered (further information can be found in 27 ). The deprivation index was categorised into quintiles, taking the first quintile (i.e. the one corresponding to the less economically deprived census tracts) as the reference category.
The populations in each of the two regions, stratified by sex and age groups (women under 16 years, women aged 16 to 64 years, women ≥ 65 years, men under 16 years, men aged 16 to 64 years, men ≥ 65 years), were also included as contextual explanatory variables. The reference categories were the youngest categories for each sex. The total population at risk (population aged over 16 years) was used as the offset in the model for each of the two regions.
Population data by census tract, age, and sex was obtained from the Spanish Population and Housing Census 2011 28 .
The graphic representation was produced based on the cartography of Catalonia (SRC ED50/UTM zone 31 N) and Valencia (SRC ED50/UTM zone 30 N) at the census tract level for 2011 29 . www.nature.com/scientificreports/ Statistical analysis. Selection bias is common in observational designs. That is, the analysed cohorts do not constitute a random sample of the population. Some subjects have a higher probability of being observed than others and, therefore, of being present in the sample. To this effect, the subjects that come into contact with one of the referral MNDFUs in the regions studied will be overrepresented. If the selection was exogenous (also termed ignorable non-response, missing at random, or selection on observables), or in other words if the probability of a subject to be observed was identical for all subjects, weighting the sample to give less weight to the subjects actually observed would be sufficient. The weighting used would be the same as the inverse of the percentage of subjects among the population seen at each hospital, stratified by sex and age group. In epidemiology, this method of weighting is called standardisation by age and sex 30 , considering the population covered by each hospital as the standard population.
However, these subjects are very likely to go to one of the referral MNDFUs, not only because they have symptoms of the onset of the disease, but also because there are unobserved factors that influence their use, which would be correlated with the non-observable factors that affect the response variable 31 . In this sense, the population served by the MNDFU has a centre bias. In other words, they do not serve all the patients in the respective populations, but rather a part that may be biased (e.g. some centres send more patients than others; the patients they send are younger than the real population, etc.). In all cases, the probability of these subjects being observed is not the same for all the subjects. Therefore, weighting (standardisation) by age and sex would not correct the selection bias 31 .
In this case (known as endogenous selection, also termed informative selection, missing not at random, or selection on unobservables), other more complex statistical methods should be used to obtain unbiased estimates of prevalence and incidence, something that standardization by age and sex does not permit. Therefore, we chose a two-part model 32 . In the first part, the probability of a subject being observed is estimated. These probabilities are then used as weightings in the second part of the model, where the prevalence and incidence are estimated, with the aim of correcting the non-randomness (in other words, the selection bias). The two parts are typically estimated using a two-part model called Hurdle 32,33 . Thus, we specified a Hurdle model in which the two components were estimated jointly 33 .
In the first part, the probability that an individual is being observed is modelled using a generalised linear mixed model (GLMM) with a binomial link. Included as variables associated with this probability are those that could explain that a subject was observed (in other words, that he/she had had contact with the UFMN of the corresponding hospital). More specifically, we considered individual and contextual variables.
Conditional on the true risk at the location x i , the probability of a case (of the corresponding response variable) occurring in this location, P(x i ) , i = 1, . . . , n , is distributed as a binomial.
n.trials i is the population at risk of being a case in the location x i . The link function is as follows: where the subindex i indicates the census tract of the municipalities in each of the two regions and β are the coefficients of the explanatory variables ( e β is the relative risk associated with each of them). η and S are random effects. η indicates the spatially unstructured individual heterogeneity. In other words, it indicates the nonobservable confounders associated with every individual which do not vary over time. S is a spatially structured random effect normally distributed with zero mean and a Mátérn covariance function: where K ν is the modified Bessel function of the second type and order ν > 0 . ν is a smoother parameter, σ 2 is the variance and κ > 0 is related to the range ( ρ = √ 8ν/κ ), the distance to which the spatial correlation is close to 0.1.
Because of the large number of census tracts with no cases, in the second part a GLMM with Zero Inflated Poisson distribution is used to model the number of patients diagnosed for each of the response variables and by census tract. The same explanatory variables as in the first part of the model are included.
Conditional to the true risk in the location x i , the mathematic expectation of cases of the response variable occurring in each of the census tracts, θ(x i ) , i = 1, . . . , n , is distributed as a Poisson: In this case, the link function is as follows: The definition of the parameters, the variables, and the random effects, is the same as described previously.
In both parts, random effects that include non-observed confounders are also included, capturing individual heterogeneity (specific for each of the two parts) and spatial dependence. That is to say, areas that are close in space show more similar prevalence and incidence than areas that are not close (common to the two parts). In the model used to estimate the incidence, an additional random effect was included to control for the existence of temporal trends (also common to the two parts). The temporal dependence was controlled using a random effect structured as a random walk of order 1 34 (further details can be found elsewhere 33 ).
Both parts were estimated simultaneously following a Bayesian perspective, using the Integrated Nested Laplace Approximation (INLA) approach 35 . Priors that penalize the complexity (PC priors) 36 and have been found to be very robust were used.
Consequently, point and interval estimates were obtained for the prevalence and the incidence, both at the level of the region to which the hospital belongs, and at the level of census tract of each of their municipalities.
Last, the prevalence and incidence of MND per 100,000 inhabitants is represented on the map of the census tracts of Catalonia and Valencia using the mapping by census tracts of the Population and Housing Census 2011 28 , and the "leaflet" library 37 (Figs. 3 and 4).
All analyses were carried out using the free software R (version 3.6.2) 38 , through the INLA package 35,39 .
Ethics approval. The Table 5 shows the main demographic and clinical characteristics of patients diagnosed with MND, stratified by region. Mean age at diagnosis was 62.43 (median = 63) years in Valencia and 63.51 (median = 65) years in Catalonia. In both regions, there were more men than women (57.9% in Valencia and 55.5% in Catalonia). When the diagnostic category is taken into account, 85.5% of all MND cases in Valencia were ALS, 10.5% were PMA, and 3.9% PLS. The same pattern was observed for Catalonia, with ALS being the most frequent diagnosis (96.7%), followed by PMA (2.6%) and PLS (0.7%). Among ALS patients in Valencia, 66.67% had spinal ALS, 28.31% bulbar ALS, 3.20% cognitive onset, and 1.83% respiratory ALS. Similar patterns were seen in Catalonia: 64.12% spinal ALS, 29.58% bulbar ALS, 3.82% respiratory ALS, and 2.48% cognitive onset. Finally, in Valencia 6.95% of the patients carried a C9ORF72 expansion and 3.27% the SOD1 mutation, while in Catalonia it was 7.67% and 2.22%, respectively. Table 6 shows the estimated prevalence and incidence of MND (point estimation and credibility interval at 95%), for the two regions, as well as by diagnostic category, per 100,000 inhabitants. Although the point estimator    , there were no statistically significant differences between the two regions, given that the credibility intervals overlapped (shown graphically in the forest plot in Fig. 2a). Likewise, although the prevalence of ALS, PMA and PLS observed in Catalonia were always above those of Valencia (ALS 4.72 vs 3.25, PMA 0.61 vs 0.48, PLS 0.66 vs 0.38, respectively), the differences between the two regions were not statistically significant. Regarding the estimation of the incidence of MND and their variants by region, Table 6 and Fig. 2b show a greater incidence in Catalonia than in Valencia: MND (2.346 vs 1.390 per 100,000 person-year, respectively), ALS (4.721 vs 3.249 per 100,000 person-year, respectively), PMA (0.415 vs 0.222 per 100,000 person-year, respectively) and PLS (0.476 vs 0.202 per 100,000 person-year, respectively). However, differences were not statistically significant, given that the credibility intervals overlapped in the two cases.

Cases PLS by tract [n (%)]
Finally, Fig. 3a,b show the estimated prevalence by census tract in Catalonia and Valencia, and Fig. 4a,b are a graphic representation of the estimated incidence by census tract. The census tracts whose at risk populations are very small must be interpreted with caution.
In summary, and using the intersection regions of the credibility intervals of the two regions, it was estimated with a 95% probability for the period 2011-2019 (Catalonia 2011-2019, Valencia 2013-2018) that the prevalence of MND would be between 3.990 and 6.334 per 100,000 inhabitants; the prevalence of ALS between 3.248 and 5.120 per 100,000 inhabitants; the prevalence of PMA between 0.065 and 0.634 per 100,000 inhabitants; and the prevalence of PLS between 0.046 and 1.896 per 100,000 inhabitants.

Discussion
This work makes a number of important contributions. First, it provides prevalence and incidence estimators not only of MND itself, but also of ALS, PLS and PMA. Second, an original prevalence and incidence estimation method is proposed, that controls the selection bias caused by not using random samples from the population. Third, to meet the objective of this study, both individual and contextual databases from various and different sources are needed, which is where Real-World Data takes on importance. This approach is used in this work to combine the individual-level databases with a contextual-level database containing population data and data on the socioeconomic level of the area where the subjects live. Fourth, the existence of non-observed confounders and the spatial and temporal dependence inherent to the data used are also controlled for and last but not least, the study compares the prevalence and incidence in two different Spanish regions, allowing for any differential risk or protective factors in them to be assessed.  (Table 1) are almost identical to those of previous population-based studies in the Mediterranean area 40 , suggesting that our cohorts were representative of the ALS populations in Valencia and Catalonia, respectively. Moreover, the frequency of the C9ORF72 expansion in both ALS cohorts (8.2-8.8%) was very similar to that previously found in a Spanish study (7.2%) 41 . The frequency of the SOD1 mutation (1.9-2.4%) was somewhat lower than that previously reported (4.25%) 42 , probably because, unlike C9ORF72, SOD1 is not routinely studied in sporadic ALS patients. As expected 2 , PLS represented approximately 3% of MND patients and their characteristics were similar to previously described cohorts of PLS patients. Specifically, they were younger than the ALS patients (Tables 2  and 3). PMA data must be compared with previous reports with caution, since there has previously been no uniform definition of PMA. For this study, the recently proposed definition of 'restricted LMN signs for at least four years' 1 was used to clearly differentiate these patients from LMN-predominant ALS patients. Despite possible differences in the definition, the characteristics of the PMA patients were similar to previous reports 43, 44 : a frequency of 2.6-10.5% of MND patients, a striking male and spinal-onset predominance, and a similar age to that of ALS patients. Importantly, in the PMA patients' expansions in the androgen receptor gene had been appropriately excluded, but three SOD1 mutations were found. This is not surprising, since a genetic overlap between ALS and PMA has been previously found 45 . www.nature.com/scientificreports/ Prevalence and incidence. To the best of our knowledge, this is the first study in which incidence and prevalence of MND, including ALS, PMA and PLS, have been assessed.
In line with the literature, ALS was the most frequent MND, comprising 85.5-96.7% of the MND patients. The prevalence and incidence estimates of both MND and ALS is within the ranges described in the literature (a prevalence of between 4.1 and 8.4 per 100.000 person-year and an incidence of between 0.6 and 3.8 per 100.000 person-year 6 ).
More specifically, the prevalence estimates in this study were very similar to those reported by Chiò et al. 9 (median prevalence of 4.48; interquartile range, IQR, 3.03-6.70) and fairly similar to those found by Orsini et al. 10 (a prevalence of between 4 and 6 per 100,000 inhabitants); Tesauro et al. 8 (5 per 100,000 inhabitants); Pradas et al. 46 (prevalence for Catalonia of 5.4 per 100,000 inhabitants); and Belbasis et al. 7 (5.4 per 100,000 inhabitants).
The incidence estimates found in this study were almost identical to those reported by Marin et al. 18 (standardised incidence of 1.68, 95% CI 1.50-1.85 all per 100.000 person-year) and in the range found by Chiò et al. 9 (median incidence in Europe of 2.08 per 100,000 person-year, IQR: 1.47-2.43); Zarei et al. 18 (incidence in Europe of between 1.5 and 2.7 per 100,000 person-year) and Santurtun et al. 47 (incidence in Spain of between 1 and 3 cases per 100,000 person-year).
However, the incidence estimates found here differ from those described by Pradas et al. 46 , who estimated an incidence for Catalonia below the credibility interval found in this study (standardised incidence of 1.4 per 100,000 person-year vs. 1.682 per 100,000 person-year, respectively).
Notably, these comparisons must be made with caution. The estimators for prevalence and incidence present a huge heterogeneity. This could be due to the heterogeneity of the disease itself, to the structure of age and sex of the population studied, to geographical variables, to environmental factors, to the period under study (2011-2019 in this case, 1995-2011 in Chiò et al. 9 , the decade of the 1990s in Zarei et al. 19 and 2013 in Santurtun et al. 47 and Pradas et al. 46 ), to the different diagnostic criteria, to different clinical practices, to the genetic predisposition of the subjects, to the type and sources of data used, to the type of design (prospective-retrospective, cohorttransversal study), and/or to the method used to calculate the prevalence and incidence (crude, standardised and multivariate models). An additional significant problem is that many of the authors do not explain the method they used to estimate the incidence and the prevalence.
Regarding PMA and PLS, there have been no previous epidemiological studies on these diagnostic categories, although their incidence has been estimated at < 0.1 per 100,000 2,43 . The estimated incidence found in this study ranges from 0.2 to 0.6 in both categories, suggesting a higher incidence than previously thought. However, these point estimates must be considered with caution given the wide confidence intervals.
In the present study, the point estimates of both prevalence and incidence for Valencia were, in all cases, lower than those for Catalonia. Given that the structure of age and sex was similar for both regions, and that economic deprivation was controlled for (possible risk factor), we believe that such differences are not due to the presence of additional risk factors because these are two Mediterranean regions with similar characteristics. These differences could be attributed to the different coverage provided by the referral hospitals in the two regions. While health care is public and universal in both regions, they each have their own organisation in terms of care, which could alter patient referral criteria and percentages. Furthermore, the population in Valencia is more dispersed geographically (39.6% of its inhabitants live in the metropolitan area of Valencia vs 71.5% living in the metropolitan area of Barcelona in Catalonia), thus (possibly) limiting access to the La Fe Hospital's UFMN. Despite this, differences in prevalence and incidence between areas were not statistically significant, since there was a considerable overlap in the confidence intervals. This suggests that the methodological approach is robust enough to account for local differences.
This study has some limitations. First, despite clinical data being prospectively obtained, some demographic data were collected retrospectively. This retrospective feature may have implied that the point estimators that we have obtained have been lower than those obtained in prospective cohorts of other European studies. Second, some patients that did not meet the criteria for ALS, PLS or PMA (according to the definition used here) when the study was carried out. These patients, while having an MND, were not included in the present study since they did not yet have a confirmed diagnosis which would have enabled them to be classified into one of the categories. Therefore, both incidence and prevalence may be greater than those estimated. However, this limitation applies for all previous epidemiological studies regardless of the definition used. Moreover, for this study we used the most recent consensus definitions. Third, the studied period is different in the two hospitals and the coverage of cases may also be different. Nonetheless, the method is sufficiently robust to give similar results, at least regarding the interval estimate. Last, when the population at risk is very small, the point estimators can be relatively unstable. Therefore, it is better to produce aggregated results and to interpret Figs. 3 and 4 with caution. Furthermore, point estimates must never be interpreted in isolation but rather together with their interval estimators (as was done throughout this study) which, with a certain probability, contain the true value of the parameter (i.e. prevalence and incidence). This paper has several strengths. First, this is the first study to estimate the prevalence and incidence of PLS and PMA. Second, contrary to the studies based on information systems, here the diagnosis of MND and its variants was supervised by specialists in MND in leading hospitals, thus minimizing the possibility of diagnostic errors. Third, in Spain there are only two studies estimating the prevalence and incidence of ALS using information from prior to 2011, neither of which provide estimates of MND according to the different diagnostic categories cited previously or are differentiated by region. Fourth, we propose an unbiased estimate of prevalence and incidence when, as in our case, there is selection bias due to the lack of random samples and, in addition, this selection bias is endogenous. Moreover, the sample used in this study comprised a large number of patients from two Spanish referral hospitals. Last, despite the different coverage in the two hospitals analysed, the fact www.nature.com/scientificreports/ that both the prevalence and the incidence estimators (by interval) overlapped can be taken as in indication that the method proposed enables the selection bias to be controlled for.

Conclusion
In summary, within the context of RWD, we have used an original estimation method which has allowed us to control for the selection bias resulting from not using random samples of the population. With this method, confounders, both observed and unobserved, individually and contextually, were appropriately controlled. Using this method, estimators were obtained of both the prevalence and the incidence of MND and its different categories: prevalence of MND, with a 95% probability, between 3.990 and 6.334 per 100,000 inhabitants; between 3.248 and 5.120 per 100,000 inhabitants for ALS; between 0.065 and 0.634 per 100,000 inhabitants for PMA; and between 0.046 and 1.896 per 100,000 inhabitants for PLS; incidence of MND, between 1.682 and 2.165 per 100,000 person/years; between 1.351 and 1.754 per 100,000 person-years for ALS; between 0.225 and 0.628 per 100,000 person-years for PMA; and between 0.409 and 0.544 per 100,000 person-year for PLS.
Using an original estimation method based on RWD, the prevalence and incidence of ALS and, for the first time, of PMA and PLS, in two Spanish Mediterranean regions was obtained. The data produced are similar to those of other regions and do not differ greatly from data reported previously for ALS, suggesting that the proposed method is robust and that neither region presents differential risk or protective factors. License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.