Machine learning and XAI approaches highlight the strong connection between O3 and NO2 pollutants and Alzheimer's disease

Alzheimer's disease (AD) is the most common type of dementia, with millions of affected patients worldwide. There is currently no cure, and AD is often diagnosed long after onset because no clear diagnostic marker exists. It is therefore essential to study the physiology and pathogenesis of AD and to investigate the risk factors that could be strongly connected to disease onset. Although AD, like other complex diseases, results from the combination of several factors, there is emerging agreement that environmental pollution may play a pivotal role in its etiology. In this work, we implemented an Artificial Intelligence model to predict AD mortality, expressed as the Standardized Mortality Ratio, at the Italian provincial level over 5 years. We employed a set of publicly available variables concerning pollution, health, society and economy to feed a Random Forest algorithm. Using methods based on eXplainable Artificial Intelligence (XAI), we found that air pollution (mainly O3 and NO2) contributes the most to the prediction of AD mortality. These results could help shed light on the etiology of Alzheimer's disease and confirm the urgent need to further investigate the relationship between the environment and the disease.


Materials and methods
The main goal of our work was to investigate possible significant correlations between different types of environmental pollution, socio-economic factors, clinical comorbidities and AD mortality in the Italian provinces from 2015 to 2019 through a procedure based on machine learning techniques. Our pipeline is summarized in Fig. 1.
To select the best-performing algorithm between the Linear Model (LM) and Random Forest (RF), we applied a common five-fold cross-validation procedure. Then we used the feature set selected by the wrapper method Boruta to feed this learning model for the prediction of the SMR. To verify the robustness of our results, we also employed a complex networks tool used in previous works to analyze co-expression networks. Finally, we applied a feature importance procedure to outline the role of each feature at the wider global and finer local scales, using the Random Forest internal functionalities and the Shapley (SHAP) values algorithm respectively.

Data collection and preprocessing
In this work we considered 31 indicators that we grouped into 5 categories: Air Pollution, Soil Pollution, Urban Environment, Socio-economic Data and Other Pathologies. The full list of these input variables, along with descriptions and related data sources, is reported in the Supplementary Information. We collected data for the years 2015-2019 for each Italian province from the public repositories of the Italian National Institute of Statistics (ISTAT) and the Regional Environmental Protection Agencies (ARPAs). Only the feature "Life Quality Index" is provided by "Il Sole 24 Ore"; this index is calculated from 90 different indicators. As for the clinical comorbidities, we chose the mortality rates of diseases connected to AD according to several studies, namely diabetes, ischemia, and pathologies related to the circulatory, digestive and brain systems. Among the Air Pollution data we also considered the Air Quality Index (AQI) provided by http://moniqa.dii.unipi.it/. This index is obtained by dividing the measurement of each pollutant by its reference limit, established by Italian Legislative Decree 155/2010 36 .

Standardized AD mortality
ISTAT does not provide the Standardized Mortality Ratio (SMR) of AD for the 107 Italian provinces. Therefore, we computed it through the following definition 37,38 :

$$\mathrm{SMR} = \frac{O_m}{E_m}, \qquad (1)$$

where O_m and E_m are the observed and the expected number of deaths by cause, respectively. E_m is defined as the weighted sum of the age-specific death rates of the reference population, RM_i, times the population of a given locality in a given age class, n_i:

$$E_m = \sum_i RM_i \, n_i.$$

RM_i is obtained by dividing the number of deaths by age and cause in the reference population, M_i, by the age-specific reference population size N_i:

$$RM_i = \frac{M_i}{N_i}.$$

For a given province, an SMR value higher than 1 means that the mortality incidence exceeds the reference one (we considered the Italian value as reference). Figure 2 shows the distribution of the SMR values (Panel A) and the concentrations of O3 (Panel B) and NO2 (Panel C) by province, averaged over the considered time window (2015-2019). Our computation of the SMR is available in 39 . More information about the SMR computation is reported in 40 .
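As an illustration, a minimal Python sketch of this computation follows; the age bands, variable names and counts are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def smr(observed_deaths, ref_deaths_by_age, ref_pop_by_age, local_pop_by_age):
    """Standardized Mortality Ratio: observed deaths divided by the deaths
    expected if the locality experienced the reference age-specific rates."""
    # Age-specific death rates of the reference population: RM_i = M_i / N_i
    rm = ref_deaths_by_age / ref_pop_by_age
    # Expected deaths: weighted sum of reference rates over the local age structure
    expected = np.sum(rm * local_pop_by_age)
    return observed_deaths / expected

# Illustrative values for one province (three age bands)
print(smr(observed_deaths=120,
          ref_deaths_by_age=np.array([1000., 5000., 20000.]),
          ref_pop_by_age=np.array([5e6, 4e6, 2e6]),
          local_pop_by_age=np.array([3e5, 2.5e5, 1.5e5])))
```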
The study of AD is very complex to deal with in terms of time: we do not know exactly when the disease started, and the onset could even date back decades before the patient's death. Therefore, due to the peculiarity of the analyzed pathology and the impossibility of determining the exact moment in which it began to develop, in our analysis we focused exclusively on the spatial characterization of the data. The idea is therefore to eliminate the temporal component from the analysis and to study the spatial correlations between the SMR and pollution, socio-economic and health data. We assumed that there have been no substantial changes in the spatial distributions of the input data over the considered years.

Feature selection procedure
We applied a feature selection procedure based on the wrapper method Boruta 42 with the aim of improving the performance of our selected learning model. This represents a common and extensively applied strategy in machine learning analysis. It involves an initial step of selecting the features that optimize model performance using a wrapper algorithm. Subsequently, only the most informative features are fed into the algorithm as input, to reduce noise. This approach helps to mitigate the potential pitfalls associated with standard machine learning applications, such as overfitting and underfitting. Boruta is a robust method, based on Random Forest, that reduces noise and correlated features through randomization of the training set. For each original feature, the algorithm produces a synthetic one, called a shadow feature, built by randomly shuffling the values of the original indicator. With the new dataset (original features and shadows), Boruta trains an RF algorithm in order to compute, after a number of independent shuffles, the importance of both the original and the artificial variables. Finally, only the features that are statistically more important than their respective shadow counterparts are selected. In particular, a permutation procedure is used to validate the role that the Random Forest (RF) algorithm assigns to the features and thus increase the robustness of the method. Shadow attributes are used as reference values to determine the importance of the features. When original features have an importance that closely matches that of the corresponding best shadow features, it is challenging for Boruta to make decisions with the desired confidence. Boruta is intentionally designed to select all features that are relevant to the prediction of the outcome variable and that minimise the prediction error. The specific steps that Boruta performs are as follows 43 :
• Permute each feature X_j to create a shadow feature X_j^(s).
• Build a Random Forest model using both the original and shadow features.
• Calculate the importance of each feature X_j and X_j^(s) via the Mean Decrease Accuracy. The Z-score is then calculated as the ratio between the mean accuracy loss and the standard deviation of its distribution.
• Identify the maximum Z-score among the shadow attributes (MZSA).
• Declare X_j as important for a single run if its Z-score exceeds that of the MZSA.
• Perform a two-sided statistical test for all attributes under the null hypothesis that the importance of a variable equals the MZSA. For each feature X_j, the algorithm records how often over M runs the importance of X_j exceeds the MZSA (a hit is registered for the variable). The expected number of hits, following a binomial distribution with p = q = 0.5, is E(M) = 0.5M, with standard deviation S = √(0.25M). Subsequently, X_j is categorized as important when the number of hits significantly exceeds E(M) and as unimportant when it significantly falls below E(M).
• Repeat the preceding steps for a predetermined number of iterations, or continue the process until all attributes are appropriately tagged.
The entire feature selection procedure was carried out within a 5-fold cross-validation framework as described below.
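As a hedged illustration, the sketch below shows how such a Boruta-based selection could be run using the BorutaPy package, one common Python implementation of the algorithm; the paper does not state which implementation was used, and the data here are synthetic placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from boruta import BorutaPy

rng = np.random.RandomState(0)
X = rng.normal(size=(107, 31))       # 107 provinces x 31 indicators (synthetic)
y = rng.normal(loc=1.0, size=107)    # provincial SMR values (synthetic)

# Boruta wraps a Random Forest and compares each feature's importance
# against its permuted "shadow" copy over repeated runs.
rf = RandomForestRegressor(n_jobs=-1, max_depth=5)
boruta = BorutaPy(rf, n_estimators='auto', alpha=0.05, max_iter=100,
                  random_state=0)
boruta.fit(X, y)

selected = np.where(boruta.support_)[0]   # indices of confirmed features
X_sel = X[:, selected]                    # reduced input for the learning model
```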

Learning framework
We implemented a learning framework to forecast the SMR at the provincial level. We fed our models with the features selected by Boruta. We started from a linear hypothesis and then used a machine learning approach based on Random Forest (RF) 44 to improve the model performance. Multiple linear regression is one of the most basic and widely used statistical models. It investigates, under a linear hypothesis, the relationship between a dependent variable (y) and a set of independent variables (x_i):

$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \eta,$$

where β_0 is the intercept, β_i are the regression coefficients to estimate, η is the model error and n is the number of features selected by Boruta.
RF is an algorithm composed of an ensemble of binary classification and regression trees (CART). Ensemble learning refers to the methodology of utilizing multiple models trained on the same dataset; the final output is determined by averaging the results produced by each individual model. This approach aims to achieve a more powerful and robust prediction by leveraging the diverse perspectives and strengths of multiple models 44 . RF is a widely used supervised machine learning model because it is suitable for modeling multimodal data and easy to tune, with only two parameters to set: the number of randomly selected features at each node, F, and the number of trees of the forest, D. Furthermore, RF is very robust against overfitting thanks to a training phase based on a bootstrap process and a feature randomization procedure during which the forest is grown. Thanks to the use of decision trees, the RF algorithm is able to capture non-linear relationships in the input features, unlike the linear model. Another important functionality of RF is the ability to assess the importance of each variable used in the model through an internal feature importance procedure. In our work we trained the RF model with a dataset composed of 107 Italian provinces and 31 socio-economic, health and pollution indicators. Furthermore, we used the mean decrease in impurity as the feature importance method, and an RF configuration with D = 600 trees and F = S/3, where S is the number of input features.
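A minimal sketch of this configuration, assuming scikit-learn's RandomForestRegressor (whose feature_importances_ attribute implements the mean decrease in impurity) and synthetic placeholder data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_sel = rng.normal(size=(107, 12))   # e.g. 12 Boruta-selected indicators (synthetic)
y = rng.normal(loc=1.0, size=107)    # provincial SMR values (synthetic)

# RF configuration reported in the text: D = 600 trees and F = S/3
# randomly selected candidate features at each node.
rf = RandomForestRegressor(
    n_estimators=600,     # D, the number of trees
    max_features=1/3,     # F = S/3, expressed as a fraction of the inputs
    random_state=0,
    n_jobs=-1,
)
rf.fit(X_sel, y)
importances = rf.feature_importances_   # mean decrease in impurity (MDI)
```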
We applied a 5-fold cross-validation (CV) framework, repeated 100 times, to further increase the robustness of our procedure. In the same way, the overall RF feature importance is computed by averaging over the 100 CVs.
We evaluated the performance of the models using both the linear correlation coefficient between predicted and actual values and the mean absolute error (MAE), defined as:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| A_i - P_i \right|,$$

where A_i is the actual value and P_i is the predicted value.
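The repeated cross-validation evaluation described above could be sketched as follows; the helper function and the synthetic data are illustrative assumptions, not the authors' code:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(107, 12))      # selected indicators (synthetic)
y = rng.normal(loc=1.0, size=107)   # provincial SMR values (synthetic)

def repeated_cv(make_model, X, y, n_splits=5, n_repeats=100, seed=0):
    """Average MAE and Pearson r over repeated k-fold cross-validation."""
    maes, rs = [], []
    for rep in range(n_repeats):
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed + rep)
        pred = np.empty_like(y)
        for train, test in kf.split(X):
            model = make_model()
            model.fit(X[train], y[train])
            pred[test] = model.predict(X[test])
        maes.append(np.mean(np.abs(y - pred)))   # MAE = (1/N) sum |A_i - P_i|
        rs.append(pearsonr(y, pred)[0])          # linear correlation coefficient
    return float(np.mean(maes)), float(np.mean(rs))

print(repeated_cv(LinearRegression, X, y))                    # LM benchmark
print(repeated_cv(lambda: RandomForestRegressor(600), X, y))  # RF comparison
```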

Complex networks tool
To verify the robustness of our findings we implemented a complex network approach. A complex network is a geometric model consisting of points (nodes) and lines (links) that symbolise the relationships between the elements within a complex system. The complex network approach is widely used in the study of complex systems because it provides information about the behaviour of the system through the abstraction of the network structure.
Our aim was to evaluate whether our results remain stable when further confounding features are added to the ML model described in Sect. "Learning framework". To do this, we created the network of the Italian provinces. Starting from the dataset described in Sect. "Data collection and preprocessing" (107 provinces and 31 independent indicators), we built a network in which the nodes are the Italian provinces. From this network, we first extracted some network features (4 new variables), able to capture the connections between different provinces and to provide additional spatial information to the model. Then we added the new features to the 31 indicators previously used and repeated the machine learning procedure described in the previous section.
In order to create the adjacency matrix, we computed Spearman's correlation between each pair of provinces, each described by an array of the 31 features presented in Sect. "Data collection and preprocessing". We used Spearman's correlation for two main reasons: (i) outliers were present in the analyzed dataset 45 ; (ii) the sample size was quite small 46 . Given provinces i and j, we computed d_{i,j}, an element of the Spearman correlation matrix D, as the absolute value of the correlation coefficient r_{i,j} between the indicator values of province i (x_i) and province j (x_j), over the N = 31 indicators:

$$d_{i,j} = \left| r_{i,j} \right|, \qquad r_{i,j} = \frac{\sum_{a=1}^{N}\bigl(R(x_{i,a}) - \overline{R(x_i)}\bigr)\bigl(R(x_{j,a}) - \overline{R(x_j)}\bigr)}{\sqrt{\sum_{a=1}^{N}\bigl(R(x_{i,a}) - \overline{R(x_i)}\bigr)^2 \, \sum_{a=1}^{N}\bigl(R(x_{j,a}) - \overline{R(x_j)}\bigr)^2}},$$

where R(x_{i,a}) is the rank of the a-th indicator value of province i, and the overlined terms are the mean ranks of the province i and province j indicator values across all the used indicators.
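A minimal sketch of how the province-by-province Spearman matrix D could be built with scipy; the data are synthetic placeholders:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.normal(size=(107, 31))   # 107 provinces x 31 indicators (synthetic)

# scipy treats columns as variables, so transpose to correlate provinces:
# rho[i, j] is the Spearman correlation between province i and province j.
rho, _ = spearmanr(X.T)
D = np.abs(rho)                  # d_ij = |r_ij|
np.fill_diagonal(D, 0.0)         # ignore trivial self-correlations
```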

Adjacency matrix and information entropy
Starting from the correlation matrix D with elements d_{i,j}, we computed the adjacency matrix C with elements c_{i,j} by means of a hard thresholding procedure:

$$c_{i,j} = \begin{cases} 1 & \text{if } d_{i,j} \ge th, \\ 0 & \text{otherwise,} \end{cases}$$

where th indicates the optimal threshold, selected to maximize the Shannon entropy of the distribution of Freeman's betweenness centrality, a topological property of the network. We thus chose the network configuration that maximizes the entropy of the betweenness distribution. The betweenness of node (province) i in a network with G elements (provinces) is defined as:

$$b_i = \sum_{j<k}^{G} \frac{n_{jk}(i)}{n_{jk}},$$

where n_{jk}(i) is the number of geodesics (the shortest paths connecting two nodes) between node j and node k that pass through node i, and n_{jk} is the total number of geodesics between node j and node k.
For each value of th, we obtained a different adjacency matrix C(th) and compared the Shannon entropy:

$$H\bigl(C(th)\bigr) = -\sum_{i=1}^{G} p(b_i) \log p(b_i),$$

where b_i is the betweenness of node (province) i in the network defined by the adjacency matrix C(th) and p(b_i) is the probability of observing the value b_i in the betweenness distribution. The hard threshold analysis implemented in this work is an approach proposed in previous articles for the study of gene co-expression networks 47-49 .
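A hedged sketch of the threshold scan, assuming networkx for the betweenness computation and a simple histogram estimate of the betweenness distribution (the binning choice is an assumption, as the paper does not specify how the distribution was estimated):

```python
import numpy as np
import networkx as nx

def betweenness_entropy(D, th, bins=20):
    """Shannon entropy of the betweenness distribution of the thresholded net."""
    C = (D >= th).astype(int)           # hard threshold: c_ij = 1 iff d_ij >= th
    np.fill_diagonal(C, 0)              # no self-loops
    G = nx.from_numpy_array(C)
    b = np.array(list(nx.betweenness_centrality(G).values()))
    counts, _ = np.histogram(b, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

rng = np.random.default_rng(0)
D = np.abs(np.corrcoef(rng.normal(size=(107, 31))))   # stand-in for the Spearman D

# Scan candidate thresholds and keep the one maximizing the entropy
ths = np.linspace(0.05, 0.95, 37)
best_th = max(ths, key=lambda th: betweenness_entropy(D, th))
C = (D >= best_th).astype(int)
```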

Network centrality features
We calculated some quantities related to the intensity of the node connections and the weight distribution in order to study the same nodes as their interactions change. For each province in the network we computed the betweenness and the degree, defined as the number of connections of node i:

$$k_i = \sum_{j=1}^{N} c_{i,j}.$$

Along with these two measurements, we also evaluated the eigenvector centrality, which quantifies a node's importance while giving consideration to the importance of its neighbors, and the closeness centrality of province i in a network with N nodes:

$$C_i = \frac{N-1}{\sum_{j \ne i} s_{ij}},$$

where s_{ij} is the shortest-path distance between i and j.
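A minimal sketch of how the four centrality features could be computed and appended to the indicator set with networkx; the adjacency matrix and indicator data here are synthetic placeholders:

```python
import numpy as np
import networkx as nx

# Stand-in for the binary adjacency matrix C from the thresholding step
rng = np.random.default_rng(0)
C = (rng.random((107, 107)) > 0.8).astype(int)
C = np.triu(C, 1); C = C + C.T                  # symmetric, no self-loops

G = nx.from_numpy_array(C)
degree      = np.array([d for _, d in G.degree()])                  # k_i
betweenness = np.array(list(nx.betweenness_centrality(G).values()))
eigenvector = np.array(list(nx.eigenvector_centrality_numpy(G).values()))
closeness   = np.array(list(nx.closeness_centrality(G).values()))

# Append the 4 centrality features to the 31 original indicators
X = rng.normal(size=(107, 31))
X_aug = np.column_stack([X, degree, betweenness, eigenvector, closeness])
```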

Explainable artificial intelligence and Shapley values
The main purpose of eXplainable Artificial Intelligence (XAI) is to increase the transparency and interpretability of Machine and Deep Learning methods 50-52 . XAI refers to a set of techniques that combine a number of properties of AI models such as informativeness, uncertainty estimation, generalization and transparency 53,54 . In our analysis, we implemented the SHAP local explanation method to evaluate the role of each feature in the Random Forest model. Unlike the feature importance evaluated internally by the Random Forest, which provides global information on the machine learning algorithm over the whole training set, SHAP gives the contribution of each feature to the prediction of a single observation. The SHAP algorithm is based on cooperative game theory 55,56 and on the concept of Shapley (SHAP) values. Given all possible feature subsets F of the total feature set S (F ⊆ S), for a feature j the SHAP value is evaluated as the difference between two model outputs, the first obtained including that specific feature and the second without it. The SHAP value of the j-th feature for the observation x is measured through the addition of the j-th feature to all possible subsets:

$$\phi_j(x) = \sum_{F \subseteq S \setminus \{j\}} \frac{|F|!\,(|S|-|F|-1)!}{|S|!} \bigl[ f_x(F \cup \{j\}) - f_x(F) \bigr], \qquad (6)$$

where |F|! is the number of feature permutations that precede the j-th feature; (|S| − |F| − 1)! is the number of feature permutations that follow the j-th feature; |S|! represents the total number of permutations of the features; f_x(F) indicates the prediction of the model f for the sample x over a subset F without the j-th feature; and f_x(F ∪ {j}) is the output of the same model including the j-th feature 55 . In our analysis we computed the mean SHAP values after a 5-fold CV, repeated 100 times for each considered year. Data processing and statistical analyses were performed in R 4.2.2 41 and Python 3.9.
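A minimal sketch of this SHAP analysis, assuming the shap package's TreeExplainer on a fitted Random Forest; the data are synthetic placeholders:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(107, 12))      # selected indicators (synthetic)
y = rng.normal(loc=1.0, size=107)   # provincial SMR values (synthetic)

rf = RandomForestRegressor(n_estimators=600, max_features=1/3,
                           random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles;
# shap_values has one row per province and one column per feature.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)

# Beeswarm-style summary: features ranked by mean |SHAP|, one dot per province
shap.summary_plot(shap_values, X)
```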

Results
Firstly, we applied a preprocessing procedure in which we filled the missing values in the considered dataset with the mean values of the corresponding features. Often a given feature had no values for some provinces; these missing values were therefore replaced by the average value of the feature for the reference year. As suggested by Zhongheng Zhang 57 , the use of an estimate such as the mean for imputing missing values is appropriate in analyses with a limited number of samples, such as our study. Then we applied the linear model on the whole feature sample inside a repeated (100 times) 5-fold cross-validation framework (described in Sect. "Learning framework") for each considered year. We used the linear model results as the benchmarks of our analysis. To improve on the linear model findings we implemented an RF model, compared the performances of the two models and selected the best-performing one (RF). Then we applied the Boruta algorithm to choose the most informative set of features with which to feed the RF inside the 5-fold cross-validation framework.
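A hedged sketch of this per-year mean imputation in pandas; the column names and toy values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy frame: province-year rows with a missing indicator value
df = pd.DataFrame({
    "year":      [2015, 2015, 2016, 2016],
    "indicator": [1.0, np.nan, 2.0, 4.0],
})

def impute_by_year(df, year_col="year"):
    """Replace each missing value with the feature mean for the same year."""
    filled = df.groupby(year_col).transform(lambda s: s.fillna(s.mean()))
    filled[year_col] = df[year_col]      # transform drops the grouping column
    return filled

print(impute_by_year(df))   # the 2015 NaN becomes 1.0 (the 2015 mean)
```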
The performances obtained with the linear model, the RF and the RF fed by the Boruta features are reported in Table 1. All correlation coefficients of the RF, with and without Boruta, are statistically significant at the 1% level. Figure 3 shows the distributions of the MAE for each analyzed year: RF with Boruta proved to be the best-performing method. The results of the feature importance procedure assessed by means of the RF algorithm are summarized in Panel A of Fig. 4, where the importance of the features for each year is shown on a color scale from red (high) to yellow (low). The features not selected by Boruta are indicated with white boxes. The analysis shows that O3 and NO2 have a consistently high importance.
Network analysis was then conducted. In particular, we built the adjacency matrix through the hard thresholding procedure based on the Shannon entropy of the betweenness described in Sect. "Complex networks tool". Next, we computed the 4 network centrality metrics, added them to the original database, and repeated the machine learning and feature selection procedures for each year. The new RF performance shows no relevant changes, with a small difference of 0.1%. The role of the 4 network features within the model is not relevant, as shown in Panel B of Fig. 4, while the importance of the other features remains almost constant. Therefore, for the sake of simplicity, we performed the SHAP analysis on the dataset without the network metrics. We conclude that our results are robust even when adding new features. For each feature selected by Boruta and for each year, we computed the SHAP values associated with the SMR prediction. The means of these SHAP values are reported in the Supplementary Information. Figure 5 shows the distribution of the SHAP values over the period 2015-2019 for each Italian province. These plots, where the features are ordered in terms of importance in the model, show that O3 and NO2 always appear among the most important factors.
Finally, to underline the differences in terms of AD mortality and pollution among the provinces, we performed a min-max normalisation of both the SMR distribution and the O3 and NO2 concentrations shown in Fig. 2. From these three maps, we computed an average map, which is displayed in Fig. 6.

Discussion
In the present work, we focused on the relation between AD mortality and pollution, socio-economic and health indicators by means of a learning methodology based on Artificial Intelligence algorithms. We used 31 publicly available indicators to forecast the spatial distribution of the incidence of AD deaths, expressed as the Standardized Mortality Ratio (SMR) calculated for each Italian province over the period 2015-2019. First, we chose the best-performing method, in terms of MAE and linear correlation, between the LM and the RF. Then we selected the most important factors through the Boruta wrapper method and fed them to a Random Forest algorithm. With RF and Boruta we obtained a good average performance of 0.21 MAE over all the considered years. To verify the robustness of our results we implemented a complex network approach, never before applied to this type of data. Through a procedure based on Spearman's correlation and Shannon's entropy we built the network of the Italian provinces. From this network we extracted 4 features that we added to the 31 indicators previously used, and we fed the machine learning model repeating the procedure described in Sect. "Learning framework". These measures of centrality, reported in Sect. "Network centrality features", condense spatial information at various levels. To emphasise these patterns, we calculated the average map of the maps shown in Fig. 2. The geographical areas with the most intense colour scale have the highest values for both pollution and SMR. It is evident that most of the provinces in Puglia, Abruzzo, Tuscany and Lombardy are significantly affected by both a high AD mortality risk and relevant levels of the O3 and NO2 pollutants, although they differ in geographical, meteorological and socio-economic terms. This confirms that AD mortality and air pollution are more strongly associated with each other than with any other variable that could explain the different patterns.
The pivotal role of O3 and NO2 in the model is also confirmed by the XAI analysis displayed in Fig. 5. An important piece of information provided by the SHAP results is how the considered features influence the sign of the SHAP values. The behaviors of O3 and NO2 in the Shapley plots indicate that high concentrations of these air pollutants correspond to an excess of AD deaths. So both Boruta and SHAP highlighted the importance of the O3 and NO2 pollutants, giving a global and a local explanation respectively. As previously mentioned, these relationships have already been investigated in several works. Studies on rats have shown that long-term exposure to ozone produces short- and long-term memory loss, along with motor disabilities 18,58 . The authors of 59 underlined the effect of chronic ozone exposure on brain tissue and the close relationship between ozone pollution and neurodegenerative diseases. The authors of 14 reported an association between an increase in ozone concentration in Rome and an increased risk of hospitalisation with dementia. The authors of 60 found a high risk of AD in areas of Taiwan strongly polluted by ozone. The authors of 61 reported evidence of a positive correlation between residential levels of air pollution across London (in particular NO2) and people diagnosed with AD and vascular dementia. These findings are in agreement with many other works which conclude that increased NO2 concentrations are linked to an elevated risk of AD (and other forms of dementia) incidence 62-64 .
As already noted, another important finding underlined by our feature importance procedure is the relevant role of mortality related to cardiovascular diseases in the prediction of the SMR. The SHAP analysis showed that low mortality rates for circulatory system diseases correspond to high AD mortality values. This result is confirmed by a negative correlation coefficient (−0.32) between the SMR and the feature "circulatory system", but it appears to contrast with the findings of other works in which AD and cardiovascular diseases are reported to be strongly associated 65,66 . An explanation of our results may lie in the type of data used, since they are mortality rates and not incidence rates. The negative correlation between the SMR for AD and the mortality rate for cardiovascular diseases is likely due to the fact that a large number of patients with dementia also suffer from heart problems, but death from cardiac causes precedes that from AD and vice versa. This strong association between AD and cardiovascular disease translates into an inverse association of the relative mortality rates. Many other works have reported this link between AD and cardiovascular diseases. A study by Scherbakov and Doehner showed that the brain is susceptible to cerebral hypoperfusion, which commonly occurs in heart failure, owing to the high vascularization of the brain 67 . The authors of 68 found that cerebral hypoperfusion increases the production of the neurofibrillary tangles and Amyloid-β (Aβ) plaques characteristic of AD. Hypoperfusion also plays an important role in the formation of amyloid plaques by causing the breakdown of the blood-brain barrier, which impairs the clearance of Aβ 69 . To our knowledge, our work represents the most comprehensive study so far of the close connection between environmental factors (especially pollution) and dementia covering an entire national population in Italy. Our analysis, through the support of machine learning and XAI techniques and with a large set of variables, provides a view of the most important factors at both a global and a local level.
We acknowledge that our study has some limitations. The first is the insidious nature of dementia, which increases the difficulty of an accurate diagnosis and proves problematic in detecting associations with AD risk factors. This aspect leads to a possible underestimation of the AD mortality values: as already noted, AD can only be diagnosed definitively after the patient's death, and the need to perform an autopsy automatically generates an underestimation of the mortality rate. Another limitation of this analysis is the estimation of the concentrations of the analyzed pollutants. We assessed the concentration of air pollutants in a province through the available monitoring stations, which do not cover the whole territory, so we made a spatial approximation. A possible evolution of this approach consists in the creation of a pollution map of the Italian territory by means of satellite data (for example from the Sentinel-5 mission), which have a much higher point density. Our results suggest that constant monitoring over the years and the ever-increasing availability of data will also help predict the evolution of AD in the future, given that the disease can manifest many decades after exposure to pollutants.

Conclusions
In this study we investigated possible risk factors of AD, taking into account pollution, socio-economic and health data. We built a model, based on Machine Learning and eXplainable Artificial Intelligence techniques and on a complex networks framework, to predict AD mortality at the Italian provincial level over a period of 5 years (2015-2019). Our model showed a good precision, with a mean absolute error of 0.22. Through a feature importance procedure relying on global and local approaches, we found a link between pollution and AD, with O3 and NO2 assuming pivotal roles in the model. Although our analysis presents some limitations due to the physiology of AD, our findings are promising and deserve further investigation, also by means of satellite data to estimate the pollution concentrations.

Figure 1 .
Figure 1. Flowchart of the implemented analysis. After a data collection and pre-processing phase, we selected the best model (between LM and RF) to forecast the SMR for the 107 Italian provinces between 2015 and 2019. Then, we applied a feature selection procedure to improve the performance of the selected algorithm and a feature importance procedure to measure the importance of each variable in the model.

Figure 2 .
Figure 2. Panel (A) shows the average standardized mortality ratio distribution for Alzheimer's disease within the Italian provinces; Panels (B) and (C) show the average distributions of the O3 and NO2 concentrations (µg/m³) at the Italian provincial level. This image was created with the software package "sf" of R 4.2.2 41 .

Figure 3 .
Figure 3. MAE for the three implemented models: linear model (LM), random forest (RF) and random forest with Boruta (RF+B). Each distribution was computed through a 5-fold cross-validation procedure repeated 100 times.

Figure 4 .
Figure 4. Average feature importance obtained through the RF model after a 5-fold CV procedure with 100 repetitions, for the considered time span (2015-2019). The white boxes indicate that the corresponding feature was rejected by the feature selection procedure based on the Boruta algorithm. Panels (A) and (B) refer to RF models trained without and with the centrality network features, respectively.

Figure 5 .
Figure 5. SHAP value distributions of the most influential features for the considered time window (2015-2019). Each point in the same row corresponds to a different province.

Figure 6 .
Figure 6. Average map of the SMR distribution and the O3 and NO2 concentrations shown in Fig. 2. Before averaging the 3 distributions, a min-max normalisation of the values was performed. This image was created with the software package "sf" of R 4.2.2 41 .

Table 1 .
Mean absolute error and p-values of the three developed models, for each year. All metrics are evaluated using a 5-fold cross-validation repeated 100 times.