Introduction

Alzheimer’s disease (AD) is the most common form of dementia currently affecting over 44 million people worldwide1. It is estimated that every 3 seconds, someone in the world develops dementia.

AD is characterized by a progressive deterioration of intellectual abilities. During the development of disease, the patient progressively loses intellectual faculties, initially regaining short-term memory, and then also including long-term memory and motor skills2.

The first symptoms, however, are generally revealed from the age of 50, with an increase in the incidence of the disease of about twice every 5 years, starting from 1 % from 64 years, up to 40 % for patients over 85 years3.

At the morphological level, the peculiarity of AD is an accumulation of proteins and specific neuropathological findings in the cortex area, in particular senile (neuritic) plaques and neurofibrillary clusters. The study of AD and the consequent creation of a care model appears very complex for a number of reasons: (i) the first symptoms of the disease can occur even many years after the onset; (ii) a definite diagnosis can only be made after death; (iii) the underlying causes of the disease could be many and not easy to identify. Therefore, given the multifactorial etiology of AD, a multidisciplinary “One Health” approach based on heterogeneous data linked to AD such as environmental, social, clinical factors could clarify the causes and the pathogenesis of dementia.

Although the etiology and pathogenesis of AD is not fully understood, oxidative stress is a key component4,5,6,7. The increase in oxidative stress could be translated into a greater risk of AD. Oxidative stress is a state of imbalance between the production and elimination of Reactive Oxygen Species (ROS) by antioxidant defense systems8. Reactive oxygen species are reactive molecules that present oxygen, therefore capable of oxidizing cells and important for maintaining oxygen homeostasis in tissues and destroying microbacteria9. Examples of ROS are the free radicals of oxygen and nitrogen (ROS and RNS) and their derivatives. The oxidation of lipids, proteins and DNA due to excessive amounts of ROS leads to deterioration of brain functions, including motor skills and it is responsible for the brain aging10,11.

In recent years, air pollution has been considered a major environmental risk factor for dementia12,13,14. The presence of ROS in the atmosphere seems to be connected to a common feature among AD patients: the olfactory dysfunction15. This dysfunction is due to the formation of neurofibrillary tangles in the olfactory bulb and olfactory centers before deposition in the brain. Air pollutants can directly act as prooxidants, favoring the production of free radicals, inducing inflammatory responses and oxidative stress16 and thus contributing to the onset and development of AD.

Reference17 highlights the hypothesis that Alzheimer’s and Parkinson’s diseases may be catalyzed by agents entering the brain via the olfactory mucosa.

Exposures to low-ozone doses are associated with neurodegenerative diseases because they causes a chronic oxidative stress state18. Ozone is the most widespread air pollutant and a powerful ROS agent produced by photochemical reactions between NOX and Volatile Organic Compounds (VOC) in the troposphere19. It can cause serious health problems in urban areas, especially during heavily polluted and strong sunlight days20.

Other studies reported that air pollution exposures, in particular \(PM_{10}\) and \(NO_2\)21,22, may induce AD-like cortical atrophy and lead to poorer cognitive function23,24,25,26.

Not only air pollution but also soil pollution seems to be connected to the onset of Alzheimer. In fact pesticide use could increase the risk of developing AD27,28,29,30.

As evidence of the multifactorial nature of AD, also social factor could be considered in the prediction and prevention of AD. In particular some studies reported that low education is closely linked with increased cognitive decline or dementia31,32,33.34 reported that early life education can prevent the development of cognitive decline. Other social factors like life experiences and demographic influences showed an important role in incidence of dementia in adults35.

In our study, we aimed to investigate the connections between AD mortality, socio-economic factors, some clinical comorbidities and sources of environmental pollution in Italy at a provincial scale through an artificial intelligence approach. For this purpose we used publicly available indicators over the period 2015-2019 collected from national and regional agencies and computed the AD deaths incidence expressed as Standardized Mortality Ratio (SMR). We predicted the SMR by means of a Random Forest algorithm and investigated the role of each considered features in determining dementia mortality in each Italian provinces at different years through two different feature importance approaches: (i) a global one based on Random Forest; (ii) a local one by means of eXplainable Artificial Intelligence (XAI). The application of XAI techniques has allowed us to increase the transparency and interpretation of our machine learning model and to identify the features that gave a main contribute to dementia mortality.

Materials and methods

The main goal of our work was the investigation of possible significant correlations between different type of environmental pollution, socio-economic factors, clinical comorbidities and AD mortality in the Italian provinces from 2015 to 2019 through a procedure based on machine learning techniques. Our pipeline is summarized in Fig. 1.

To select the most performing algorithms between Linear Model (LM), and Random Forest (RF) we applied a common five-fold cross validation procedure. Then we used the feature set selected by wrapper method Boruta to feed this learning model for the prediction of SMR. Also to verify the robustness of our results we employed a complex networks tool used in previous works to analyze co-expression networks. Finally, we applied a feature importance procedure to outline the role of each feature at wider global and finer local scales with Random Forest internal functionalities and Shapley (SHAP) values algorithm respectively.

Figure 1
figure 1

Flowchart of the implemented analysis. After a data collection and a pre-processing phase, we selected the best model (among LM and RF) to forecast the SMR for the 107 Italian provinces between 2015 and 2019. Then, we developed a feature importance procedure to improve the performance of the selected algorithm and to measure the importance of each variable in the model.

Data collection and preprocessing

In this work we considered 31 indicators that we grouped into 5 categories: Air Pollution, Soil Pollution, Urban Environment, Socio-economic Data and Other Pathologies. The full list of such input variables, along with descriptions and related data sources, is reported in the Supplementary Information. We collected data for the years 2015-2019 of each Italian province from public repositories of Italian National Institute of Statistics (ISTAT) and Regional Environmental Protection Agencies (ARPAs). Only the feature “Life Quality Index” is provided by “Il Sole 24 Ore”. This index is calculated considering 90 different indicators. As for the clinical comorbidities we chose the mortality rate of some diseases connected to AD according to several studies, namely diabetes, ischemia, pathologies related to the circulatory, digestive and brain systems. Among Air Pollution data we also considered Air Quality Index (AQI) provided by http://moniqa.dii.unipi.it/. This index is obtained by dividing the measurement of the pollutant, by its reference limit, established by the Italian Legislative Decree 155/201036.

All features used in this work are available at https://github.com/OneHealthBari/Italian-provinces-data.git

Standardized AD mortality

ISTAT does not provide the Standardized Mortality Ratio (SMR) of AD for the 107 Italian provinces. Therefore, we computed it through the following definition37,38:

$$\begin{aligned} SMR = \frac{O_m}{E_m}, \end{aligned}$$
(1)

where \(O_m\) and \(E_m\) are the observed and the expected number of death by cause respectively. \(E_m\) is defined as the weighted sum of age-specific death rates of the reference population \(R_i^M\) per the population of a given locality and given age \(n_i\):

$$\begin{aligned} E_m=\sum _{i=1}^{I} R_i^M \times n_i ,\end{aligned}$$
(2)

\(R_i^M\) is obtained dividing the number of deaths by age and cause of the reference population \(M_i\) with the age-specific reference population size \(N_i\):

$$\begin{aligned} R_i^M=\frac{M_i}{N_i} ,\end{aligned}$$
(3)

For a given province a SMR value higher than 1, means that the mortality incidence exceeds the reference one (we considered the Italian value as reference). Figure 2 shows the distribution of the value of the SMR (Panel A) and the concentrations of \(O_3\) (Panel B) and \(NO_2\) (Panel B) by province, averaged for the considered time window (2015–2019). Our computation of SMR is available on39 . More information about SMR computation is reported on40.

Figure 2
figure 2

Panel (A) shows the average standardized mortality rate distribution for Alzheimer’s disease within Italian provinces; Panel (B) and Panel (C) show the average average distribution of \(O_3\) and \(NO_2\) concentrations (\(\mu g/m^3\)) at Italian provincial level. This image has been created with the software package “sf” of R 4.2.241.

The study of AD is a very complex issue to deal with in terms of time, in fact we do not know exactly when the disease started, the onset could even go back decades before the patient’s death. Therefore, due to the peculiarity of the analyzed pathology and the impossibility of predicting the exact moment in which it began to develop, in our analysis we focused exclusively on the spatial characterization of the data. The idea is therefore to eliminate the temporal component from the analysis, and to study the spatial correlations between SMR and pollution, socio-economic and health data. We assumed that there have been no substantial changes in the spatial distributions of input data over the considered years.

Feature selection procedure

We applied a feature selection procedure relied on the wrapper method Boruta42 with the aim to improve the performance of our selected learning model. This represents a common and extensively applied strategy in machine learning analysis. It involves an initial step of selecting features that optimize model performance using a wrapper algorithm. Subsequently, only the most informative features are fed into the algorithm as input to reduce noise. This approach helps to mitigate the potential pitfalls associated with standard machine learning applications, such as overfitting and underfitting. Boruta is a robust method, based on Random Forest, that reduces noise and correlated features through the randomization of the training set. For each original feature, the algorithm produces a synthetic one, called shadow, built by randomly mixing the values of each original indicator. With the new dataset (original features and shadows) Boruta trains a RF algorithm in order to compute, after a number of independent shuffles, the importance of both original and artificial variables. Finally only the features that are statistically more important than their respective shadow counterparts are selected. In particular, a permutation procedure is used to validate the role that the Random Forest (RF) algorithm assigns to the features and thus increase the robustness of the method. Shadow attributes are used as reference values to determine the importance of the features. When synthetic features have an importance that closely matches the corresponding best shadow features, it is challenging for Boruta to make decisions with the desired confidence. Boruta is intentionally designed to select all features that are relevant to the prediction of the outcome variable and that minimise prediction error. The specific steps that Boruta performs are as follows43:

  • Permute each feature \(X_j\) to create a shadow feature \(X_j^{(s)}\).

  • Build a Random Forest model using both the original and shadow features.

  • Calculate the importance of each feature \(X_j\) and \(X_j^{(s)}\) by Mean Decrease Accuracy. Z-score is then calculated from the ratio between the mean accuracy loss and the standard deviation of the same distribution.

  • Identify of the maximum Z-score among the shadow attributes (MZSA).

  • Declare \(X_j\) as important for a single run if its Z score exceeds the Z score of MZSA.

  • Perform a two-sided statistical test for all attributes assuming the null hypothesis that the importance of the variables is equal to the maximum importance of MZSA. For each characteristic \(X_j\), the algorithm records how often in M runs the importance of \(X_j\) exceeds the MZSA (a hit is registered for the variable). The expected number of hits, following a binomial distribution with \(p =\) \(q =\) 0.5, is \(E(M) = 0.5M\) with a standard deviation \(S = \sqrt{0.25M}\). Subsequently, \(X_j\) is categorized as important when the number of hits significantly exceeds E(M) and as unimportant when the number of hits significantly falls below E(M).

  • Repeat the preceding steps for a predetermined number of iterations, or continue the process until all attributes are appropriately tagged.

The entire feature selection procedure was carried out within a 5-fold cross-validation framework as described below.

Learning framework

We implemented a learning framework to forecast the SMR at provincial level. We fed our models with the features selected by Boruta. We started with a linear hypothesis and then we used a machine learning approach based on Random Forest (RF)44 to improve the model performances. Multiple linear regression is one of the most basic and used statistical models. This model investigates, under a linear hypothesis, the relationship between a dependent variable (y), some independent variables (\(x_i\)) and their interactions:

$$\begin{aligned} y = \beta _0 + \beta _1x_1 + \beta _2x_2 +... \beta _nx_x + \eta, \end{aligned}$$
(4)

where \(\beta _0\) is the intercept value, \(\beta _i\) are the regression coefficients to estimate, \(\eta\) is the model error and n i the number of features selected by Boruta.

RF is an algorithm composed by an ensemble of binary classification trees (CART). Ensemble learning refers to the methodology of utilizing multiple models that are trained on the same dataset. The final output is determined by averaging the results produced by each individual model. This approach aims to achieve a more potent and robust predictive or classification result by leveraging the diverse perspectives and strengths of multiple models44. RF is a supervised machine learning model widely used because is suitable for modeling multimodal data and easy of tuning with on only two parameters to set: the number of randomly selected features at each node F , and the number of trees of the forest D. Furthermore RF is very robust against overfitting issue thanks to a training phase based on a bootstrap process and a feature randomization procedure during which the forest is developed. Thanks to the use of decision trees, the Random Forest (RF) algorithm is able to capture non-linear relationships present in the input features unlike the linear model. Another important functionality of RF is the ability to assess the importance of each variable used in the model through an internal feature importance procedure. In our work we trained RF model with a dataset composed by 107 Italian provinces and 31 socio-economic, health and pollution indicators. Furthermore, we used the mean decrease impurity as feature importance method, and a RF configurations with M = 600 trees and F = S/3, where S is the number of input features.

We applied a 5-fold cross validation (CV) framework, repeated 100 times, to further increase the robustness of our procedure. In the same way, RF overall feature importance is computed by averaging over 100 CVs.

We evaluated the performance of models using both the linear correlation coefficient between predicted and actual values and the root mean absolute error (MAE), defined as:

$$\begin{aligned} MAE =\frac{1}{n} \sum _{i=1}^{N} \left| A_i - P_i\right|, \end{aligned}$$
(5)

where \(A_i\) is the actual value and \(P_i\) is the predicted value.

Complex networks tool

To verify the robustness of our findings we implemented a complex network approach. A complex network is a geometric model consisting of points (nodes) and lines (links) that symbolise the relationships between the elements within a complex system. The complex network approach is widely used in the study of complex systems because it provides information about the behaviour of the system through the abstraction of the network structure.

Our aim was to evaluate whether adding further confounding features to the ML model described in Sect. "Learning framework", our results remain constant. To do this we have created the network of Italian provinces. Starting from the dataset described in Sect. "Data collection and preprocessing" (107 provinces and 31 independent indicators) we build a network in which the nodes are represented by the Italian provinces. From this network, firstly we extracted some network features (4 new variables), able to capture the connections between different provinces and to provide additional spatial information to the model. Then we added the new features to the 31 indicators previously used and repeated the Machine Learning procedure described in the previous section.

In order to create the adjacency matrix we computed Spearaman’s correlation between each province, which is defined by an array of the 31 features described in Sect. "Data collection and preprocessing". We used the Spearman’s correlation for two main reasons: (i) outliers were present in analyzed dataset45; the sample size was quite small46. Given provinces i and j, we computed \(d_{i,j}\), an element of the Spearman correlation matrix D, as the absolute value of the correlation coefficient \(r_{i,j}\) between the indicator values of province i (\(x_i\)) and province j (\(x_j\)), for the \(N = 31\) indicators:

$$\begin{aligned} d_{i,j}=|r_{i,j}|=\left| \frac{\sum _{a=1}^{N} (R(x_{i,a}) - R(\bar{x}_i))\sum _{b=1}^{N} (R(x_{j,b}) - R(\bar{x}_j))}{\sqrt{\sum _{a=1}^{N} (R(x_{i,a}) - R(\bar{x}_i))^2\sum _{b=1}^{N} (R(x_{j,b}) - R(\bar{x}_j))^2}}\right|, \end{aligned}$$
(6)

where \(\bar{x}_i\) and \(\bar{x}_j\) are the mean province i and province j indicator values across all the used indicators and \(R(x_{i,a})\) the rank variable.

Adjacency matrix and information entropy

Starting from the correlation matrix D with the elements \(d_{i,j}\), we computed the adjacency matrix C of elements \(c_{i,j}\) by means of a hard thresholding procedure:

$$\begin{aligned} c_{i,j}={\left\{ \begin{array}{ll} 1, &{} \text {if} \,d_{i,j}\ge th \,\text {and} \,i\ne \,j, \\ 0, &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(7)

where th indicates the optimal threshold that we selected to maximize the Shannon entropy based on Freeman’s betweenness centrality, a topological property of the network. So, we choose the network configuration that maximizes the entropy of betweenness distribution. The betweenness of node (province) i in a network with G elements (provinces) is defined as:

$$\begin{aligned} b_i=\sum _{i\ne j\ne k}^{G}\frac{n_{jk}(i)}{n_{jk}}. \end{aligned}$$
(8)

where \(n_{jk}(i)\) is the number of geodesics (the shortest path connecting two nodes) between node i and node j that pass trough node i and \(n_{jk}\) is the number of geodesics between node j and node k.

For each value of th, we got a different adjacency matrices \(C^{(th)}\) and compared the Shannon entropy:

$$\begin{aligned} H_{C^{(th)}}=-\sum _{i=1}^G b_i^{(th)} \log _2(b_i^{(th)}) \end{aligned}$$
(9)

where \(b_i^{(th)}\) is the betweenness of node (province) i in the network defined by adjacency matrix \(C^{(th)}\).

The hard threshold analysis implemented in this work is an approach proposed in previous articles for the study of gene co-expression networks47,48,49.

Network centrality features

We calculated some quantities related to the intensity of the node connections and the weight distribution in order to be able to study the same nodes as their interaction changes. For each province in the network we computed the betweenness and the degree defined as the amount of connections of the node i:

$$\begin{aligned} k_{i}=\sum _{j=1}^{G}c_{ij}, \end{aligned}$$
(10)

Along with these two measurements we also evaluated the eigenvector centrality that quantifies a node’s importance while giving consideration to the importance of its neighbors and the closeness centrality of province i in a network with N nodes:

$$\begin{aligned} cl_{i}=\frac{N-1}{\sum _{j=1}^{N-1}s_{ij}}, \end{aligned}$$
(11)

where \(s_{ij}\) is the shortest-path distance between i and j.

Explainable artificial intelligence and Shapley values

The main purpose of Explainable Artificial Intelligence (XAI) is to increase transparency and interpretability of Machine and Deep Learning methods50,51,52. XAI refers to a set of techniques that combines a number of properties of AI models such as informativeness, uncertainty estimation, generalization and transparency53,54. In our analysis, we implemented the SHAP local explanation method to evaluate the role of each feature in the Random Forest model. Unlike the feature importance evaluated entirely by Random Forest which provides global information of the machine learning algorithm on the whole training set, SHAP gives the contribution of each feature in the prediction of the single observation. The SHAP algorithm is based on cooperative game theory55,56 and the concept of the Shapley (SHAP) values. Given all possible feature subsets F of the total feature set S (\(F \subseteq S\)) for a feature j the SHAP value is evaluated as the difference between two model outputs, the first obtained including that specific specific feature, the second without. The SHAP value of the j-th feature for the observation x is measured through the addition of the j-th feature to all possible subsets,

$$\begin{aligned} SHAP_j(x) = \sum _{F\subseteq S - \{j\}} \frac{|F|!(|S|-|F|-1)!}{|S|!} [f_{x}(F \cup {j})-f_{x}(F)], \end{aligned}$$
(12)

where |F|! is the number of feature permutations which precede the j-th feature; \((|S|-|F|-1)!\) is the number of feature permutations that follow the j-th feature value; |S|! represents the total number of permutations of features; \(f_{x}(F)\) indicates the model prediction f for the sample x, considered a subset F without the j-th feature; \(f_{x}(F \cup {j})\) is the output of the same model including the j-th feature55. In our analysis we computed the mean SHAP values after a 5-fold CV, repeated 100 times for each considered year.

Data processing and statistical analyses were performed in R 4.2.241 and Python 3.9.

Results

Firstly, we applied a preprocessing procedure, in which we filled the missing values in the considered dataset with the mean values of the corresponding features. Often a certain characteristic had no values for some provinces. Therefore, these missing values were replaced by the average value of the feature for the reference year. As suggested by Zhongheng Zhang57, the use of an estimate such as the mean for imputing missing values is appropriate in analyses with a limited number of samples, such as our study. Then we applied a linear model on the whole feature sample inside a repeated (100 times) 5-fold cross-validation framework (described in Sect. "Learning framework") for each considered year.

We used linear model results as benchmarks of our analysis. To improve the findings of linear model we implemented a RF classifier, compared the performances of the two models and selected the best performing one (RF). Then we applied Boruta algorithm to choose the most informative set of feature with which to fed RF inside a 5-fold cross-validation framework.

The performances obtained through the linear model, RF and RF fed by the Boruta features are reported in Table 1. All correlation coefficients of RF with and without Boruta are statistically significant at \(1\%\). Figure 3 shows the distributions of MAE for each analyzed year: RF with Boruta resulted the best performing method. The results of the feature importance procedure assessed by means of RF algorithm are summarized in the panel A of Fig. 4, where the features importance for each year is shown in a color scale from red (high) to yellow (low). The features not selected by Boruta are indicated with white boxes. The analysis shows that \(O_3\) and \(NO_2\) have a consistently high importance.

Network analysis was then conducted. In particular we built the adjacency matrix through the hard thresholding procedure based on the Shannon entropy of betweenness described in Sect. Complex networks tool. Next, we computed the 4 network centrality metrics, added them to the original database, and repeated the machine learning and feature selection procedures for each year. The new RF performance shows no relevant changes, with a small difference of 0.1%. The role of the 4 network features is not relevant within the model as shown in panel B of the Fig. 4. The importance of the other features instead remains almost constant. Therefore, for the sake of simplicity, we performed SHAP analysis on the dataset without network metrics. We conclude that our results are robust even adding new features.

Table 1 Mean absolute error and p-values of the three developed models, for each year.
Figure 3
figure 3

MAE for the three implemented models: linear model (LM), random forest (RF) and random forest with boruta (RF+B). Each distribution was computed through a 5-fold cross validation procedure repeated 100 times.

Figure 4
figure 4

Average feature importance obtained through RF model after a 5-fold CV procedure with 100 ripetitions, for the considered time span (2015–2019). The white boxes show that the corresponding feature was rejected in the feature selection procedure relied on the Boruta algorithm. Panels (A) and (B) are referred to RF models trained with and without centrality network features respectively.

For each feature selected by Boruta and for each year, we computed the SHAP values associated to the SMR prediction. The means of these SHAP values have been reported in the Supplementary Information. Figure 5 show the distribution of SHAP values over the period 2015–2019 for each Italian province. These plots, where the features are ordered in terms of importance in the model, display that \(O_3\), \(NO_2\) always appear among the most important factors.

Figure 5
figure 5

SHAP distribution values of the most influential features for the considered time window (2015–2019). Each point in the same row corresponds to a different province.

Finally, to underline the differences in terms of AD mortality and pollution in the different provinces, we performed a min-max normalisation of both the SMR distribution and the \(O_3\) and \(NO_2\) concentrations shown in Fig. 2. From these three maps, we computed an average map, which is displayed in the Fig. 6.

Figure 6
figure 6

Average map between the SMR distribution and the \(O_3\) and \(NO_2\) concentrations shown in Fig. 2. Before averaging the 3 distributions, minimum maximum normalisation of the values was performed. This image has been created with the software package “sf” of R 4.2.241.

Discussion

In the present work, we focused on the relation between AD mortality and pollution, socio-economic and health indicators by means of a learning methodology based on Artificial Intelligence algorithms. We used 31 publicly available indicators to forecast the spatial distribution of AD deaths incidence expressed as Normalized Mortality Ratio (SMR) calculated for each Italian province over the period 2015-2019. First we chose the best performing method in terms of MAE and linear correlation between LM and RF. Then we selected the most important factors through the Boruta wrapper method and fed a Random Forest algorithm. Through RF with Boruta we obtained a good average performance of 0.21 of MAE over all considered years. To verify the robustness of our results we implemented a complex network approach, never applied on this type of data. Through a procedure based on Spearman’s correlation and Shannon’s entropy we built the network of the Italian provinces. From this network we extracted 4 features that we added to the 31 indicators previously used and we fed the machine learning model repeating the procedure described in Sect. "Learning framework". These measures of centrality reported in Sect. "Network centrality features" condense spatial information at various levels. The new features do not improve the performance of our procedure (we only noted a difference of 0.1%) but this procedure shown a good stability of our findings. In fact, the role of the most important features in the model remains unchanged even by adding factors that could be confounding. The result of the implemented global feature importance procedure based on Boruta and Random Forest, shown in Fig. 4, highlights the burden of environmental pollution on AD mortality. Overall, we found that \(O_3\), \(NO_2\) and the mortality rate due to diseases of circulatory system are among the most important factors associated with high SMR values. A clear spatial association between AD mortality and \(O_3\) and \(NO_2\) at a provincial scale emerges through a comparison among the average distribution (over 5 years) of SMR, shown in panel A of Fig. 2 and the average concentrations of \(O_3\) and \(NO_2\) shown in the panel B and C of the same Figure, confirming the existence of some geographic patterns. To emphasise these patterns, we calculated the average map of the maps shown in Fig. 2. The geographical areas with the most intense colour scale have the highest values for both pollution and SMR. It is evident that most of the provinces in Puglia, Abruzzo, Tuscany and Lombardy are significantly affected by both high AD mortality risk and relevant levels of \(O_3\) and \(NO_2\) pollutants, although they differ in geographical, meteorological and socio-economic terms. This confirms that AD mortality and air pollution are more associated to each other than any other variable that could explanation different patterns.

The pivotal role of \(O_3\) and \(NO_2\) in the model is confirmed also by XAI analysis displayed in Fig. 5. An important information provided by SHAP results is how the considered features influence the sign of the SHAP values. The behaviors of \(O_3\) and \(NO_2\) in the Shapley plots indicate that high concentrations of these air pollutants correspond to an excess of AD deaths. So both Boruta and SHAP highlighted the importance of \(O_3\) and \(NO_2\) pollutants, giving a global and local explanation respectively. As previous mentioned, these results are already investigated in several works. Studies on rats have shown that long-term exposure to ozone, produces short- and long-term memory loss, along with motor disabilities18,58.59 underlined the effect of chronic ozone exposure on brain tissue and the close relationship between ozone pollution and neurodegenerative diseases.14 reported an association among increase in ozone concentration in Rome and increase in risk of hospitalisation with dementia.60 found a high risk of AD in areas of Taiwan strongly polluted by ozone .61 reported evidence of a positive correlation among residential levels of air pollution across London (in particular \(NO_2\)) and people diagnosed with AD and vascular dementia. These findings are in agreement with many other works which conclude that increased \(NO_2\) concentrations are linked to an elevated risk of AD (and other forms of dementia) incidence62,63,64.

As we have already said, another important finding underlined by our feature importance procedure is the relevant rule of mortality related to cardiovascular diseases in the prediction of SMR. SHAP analysis underlined that low mortality rate for the circulatory system problem correspond to high AD mortality values. This result is confirmed by a negative correlation coefficient (\(-0.32\)) between SMR and the feature “circulatory system”, but appears to be in contrast to the findings of other works in which AD and cardiovascular diseases are reported strongly associated65,66. An explanation of our results may be due to the type of data used since they are mortality rates and not incidence rates. The negative correlation between SMR for AD and the mortality rate for cardiovascular diseases is likely due to the fact that a large number of patients with dementia also suffer from heart problems, but death from cardiac causes precedes that from AD and vice versa. This strong association between AD and cardiovascular disease translates into an inverse association of the relative mortality rates. Many other work reported this link between AD and cardiovascular diseases.

In a study carried by Scherbakov and Doehner, it was showed that brain is subject to cerebral perfusion, which commonly happens in hearth failure. The reason is the high vascularization of the brain67.68 found that cerebral hypoperfusion increases the production of neurofibrillary tangles and Amyloid-\(\beta\) (A\(\beta\)) plaques, characteristic of AD. Also, the hypoperfusion causes the breakdown of the blood-brain barrier, which obstructs the removal of A\(\beta\)69. Another mechanism which play an important role in the formation of amyloid plaques, in cases of hypoperfusion, is the breakdown of the blood-brain barrier, which impairs the clearance of A\(\beta\)69. To our knowledge, our work represents the most comprehensive study of the close connection between environmental factors (especially pollution) and dementia affecting an entire Italian population so far. Our analysis, through the support of machine learning and XAI techniques and with a large set of variables, provides a view of the most important factors both at a global and local level.

We acknowledge that our study has some limitations. The first one is the incipient nature of dementia that increases the difficulty of accurate diagnosis and proves problematic in detecting associations with AD risk factors. This aspect leads to a possible underestimation of AD mortality value. In fact, as we have already said, it is possible to diagnose AD only after the patient’s death, therefore the need to perform an autopsy automatically generates an underestimation of the mortality rate. Another limitation to this analysis is represented by the estimation of the concentrations of the analyzed pollutants. We assessed the concentration of air pollutants in a province through the available monitoring stations, which does not cover the whole territory. So we made a spatial approximation. A possible evolution of this approach consists in the creation of a pollution map of Italian territory by means of satellite data (for example Sentinel-5 mission) which have a large point density. Our results suggest that constant monitoring over the years and the ever-increasing availability of data will also help predict the evolution of AD in the future, given that it occurs many decades after exposure to pollutants.

Conclusions

In this study we investigated possible risk factors of AD taking into account pollution socio-economic and health data. We built a model, based on Machine Learning and Explainable Artificial Intelligence techniques and complex networks framework to predict the AD mortality at the Italian provincial level over a period of 5 years (2015–2019). Our model presented a good precision with a mean absolute error of 0.22. Through a feature importance procedure relied on global and local approaches we found a link between pollution and AD with \(O_3\) and \(NO_2\) that assumed pivotal rules in the model. Although our analysis presents some limitations due to the physiology of AD, our findings are promising and deserve further investigation also by means of satellite data to estimate pollution concentrations.