Introduction

Long-term exposure to ambient fine particulate matter (PM) is associated with adverse human health conditions. PM2.5 particles, with an aerodynamic diameter <2.5 μm, can be inhaled into the nasal passages and can carry toxic substances that are harmful to human health1,2. Studies have shown that long-term exposure to high PM2.5 concentrations can have serious impacts on human organs such as the liver, lungs and be responsible for the development of cardiovascular diseases3,4,5,6. Therefore real time monitoring of PM2.5 concentrations is extremely important for preventing pollution-related health issues as well as for the formulation of effective environmental protection measures.

Since 2013, many PM2.5 monitoring stations across China have been established to measure air quality data. However, due to their sparse and uneven distribution, with most being located near cities, limitations still exist for effective and representative in situ monitoring of PM2.5 concentrations at the regional scale7,8,9. Monitoring capability can be increased by fusing satellite remote sensing data such as the aerosol optical depth (AOD) with PM2.5 measurements and help in the construction of space-time models of PM2.5 concentrations within or across regions. In addition, auxiliary datasets such as meteorological data, source emissions data, land use data, topographic data and socio-economic data, can also been used to reinforce the relationship between PM2.5 concentrations and various observed variables7,9,10.

Existing models for predicting PM2.5 concentrations can be classified into two categories: deterministic models and statistical models9,11. Deterministic models, including large-scale air quality simulations12,13,14, model physical processes such as emission, dispersion, transformation, and diffusion, as well as the chemical reactions occurring in polluted air11,15,16. However, since most models require sophisticated prior knowledge of pollutant diffusion states and chemical reaction pathways, deterministic PM2.5 concentration estimation is complex and computationally expensive17. Statistical models, while not necessarily simpler in design, are however able to achieve an almost equivalent level of PM2.5 concentration prediction accuracy18 and due to their greater speed, have been extensively developed and deployed for monitoring purposes. Linear statistical models including simple linear regression models19, multiple linear regression (MLR) models20, empirical models21,22, and geo-weighted regression models8,10, have been able to obtain satisfactory results. However, the functional relationship between the PM2.5 concentration and explanatory variables is nonlinear. As a result, many nonlinear statistical models have been used to estimate PM2.5 concentrations including support vector regression, generalized additive models8, artificial neural network (ANN) models9,11,18,23,24 and more recently, deep learning methods9,11,18,23,24. With improvements in computational capacity, models have also gradually incorporated more exogenous variables such as meteorological factors (e.g., relative humidity, temperature, and wind speed), land use factors, topographic data, source emissions data and socio-economic data9,11,18,23,24.

Despite all these significant advances, most models have ignored the influence of the geographical distance between PM2.5 monitoring stations as well as Tobler’s First Law of Geography25 - that everything is related to everything else but that nearby things are more related than distant things. Furthermore, various studies have demonstrated that the distribution of PM2.5 concentrations shows significant spatial auto-correlation. In this study, we aim to exploit this additional information by developing a spatial back propagation neural network (S-BPNN) that can improve the accuracy of PM2.5 concentration estimation by explicitly including a spatial lag variable (SLV). The performance of the S-BPNN model is compared with that of a conventional back propagation neural network (BPNN) that does not include the SLV. We then use the S-BPNN to map the yearly and seasonal mean distribution of PM2.5 concentrations across China for the study period, and assess exceedances.

Data and Methods

Data fusion

In order to construct a S-BPNN multivariate model, ground-level PM2.5 concentration measurements were fused with satellite aerosol optical depth data, meteorological synoptic conditions data and source emissions data at 1280 monitoring sites in China, to form a large and spatially-diverse sample dataset of seasonal and yearly mean values. Table S1 presents the sources, units and spatial scales of each variable. Arcpy in ESRI’s ArcGIS was used to perform spatial interpolation for meteorological data as part of the preparation of the sample data.

Ground-level PM2.5 measurements

Hourly PM2.5 concentrations at 1280 stationary sites in 190 cities from 2015-01-01 to 2015-12-31 were collected from the official database of the China National Environmental Monitoring Centre (CNEMC: http://www.cnemc.cn/en/). PM2.5 concentrations were measured via the tapered element oscillating microbalance method (TEOM) and then averaged at each site to produce time series of daily mean PM2.5. Seasonal (spring: MAM, summer: JJA, autumn: SON, winter: DJF) and yearly mean PM2.5 concentrations were then also calculated from the daily mean PM2.5 for all monitoring stations. Figure 1 shows the spatial distribution of these PM2.5 monitoring sites in China and the yearly mean value for 2015.

Figure 1
figure 1

ArcGIS map of the distribution of ground-level monitoring sites in China, 2015.

MODIS AOD satellite data

The AOD at 550 nm from the moderate resolution imaging spectraradiometer (MODIS) has been found to weakly correlate with PM2.5 concentrations11,26,27. Nonetheless, this is an important explanatory variable used to drive satellite remote sensing models of PM. We obtained AOD data from the MODIS Terra and Aqua Collection 6.1 via the NASA Level-1 and Atmospheric Archive and Distribution System (https://ladsweb.modaps.eosdis.nasa.gov). The AOD data has a maximum spatial resolution of 10 km and covers the study period from 2015-01-01 to 2015-12-31. The 10 km AOD products were retrieved using the Dark Target (DT) algorithm and the MODIS Conversion Toolkit (MCTK). For each grid cell, where Terra MODIS AOD (MOD04) data was available, we estimated missing Aqua MODIS AOD (MYD04) data by linear interpolation to extract values at the centre of the pixel. The same estimation procedure was used to for MOD04 when only MYD04 was available. MOD04 and MYD04 data was averaged where both products were available. Daily AOD products were then averaged to produce seasonal and yearly AOD values.

Synoptic conditions data

Meteorological data was then obtained from the Meteorological Data Sharing Service System in China (http://data.cma.cn/en) and includes wind speed (WS, m/s), relative humidity (RH, %), surface pressure (PRS, Pa), temperature (TEM, °C), precipitation (PRE, mm), and sunshine duration (SSD, h). There are 839 meteorological monitoring stations providing a total of 306,235 records during the study period. To obtain seasonal means and yearly means, we averaged the seasonal and yearly mean WS, RH, PRE, TEM, PRE, and SSD values calculated from the daily meteorological data across the monitoring network. Yearly mean distribution maps of each meteorological variable in the study area were then spatially interpolated with Arcpy in ESRI’s ArcGIS software by calculating the seasonal and yearly mean meteorological conditions at the monitoring stations with the inverse distance weighted (IDW) using a grid size of 10 km × 10 km. Values are extracted from the centre of 10 km grid.

PM2.5 emissions data

Land use is a major contributor to the source apportionment of PM2.5 pollution7. Land use data for 2015 having 30 metre resolution was obtained from the Geographical Information Monitoring Cloud Platform (http://www.dsac.cn/) and categorised into built-up areas, arable land, forest, water bodies, and bare land. We also downloaded NDVI and population density data having spatial resolution 1 km × 1 km from the Resource and Environment Data Cloud Platform (http://www.resdc.cn/) and LandScan (https://web.ornl.gov/sci/landscan/) respectively. In both cases, we extracted the pixel value closest to each PM2.5 monitoring station. To account for the contribution of traffic emission pollution sources to the PM2.5 concentration7, main roads within 10 km of the PM2.5 monitoring sites were included using road network data downloaded from OpenStreetMap (www.openstreetmap.org/). To account for industrial sources of pollution, we considered the number and distribution of state monitoring enterprises for exhaust gas emissions as being indicative of the number and distribution of industrial pollution sources. The basic information (including addresses) of the exhaust gas monitoring enterprises in 2015 (totaling 3206) was obtained from the Ministry of Ecology and Environment of the People’s Republic of China (http://www.mee.gov.cn/). The Baidu Geocoding API (http://lbsyun.baidu.com/index.php?title=webapi/guide/webservice-geocoding) was then used to obtain the geographical location (latitude and longitude) of each enterprise. The number of emission enterprises within a radius of 10 km of the PM2.5 monitoring stations was used as a measure of industrial emissions. Finally, digital elevation model (DEM) data was derived from the Geospatial Data Cloud (http://www.gscloud.cn/) and the pixel value containing or nearest to each PM2.5 station was extracted.

Figure S1 presents the histograms and associated median statistics for the set of variables in the fused sample data set spanning the period 2015-01-01 to 2015-12-31 calculated from the 1280 PM2.5 ground-based monitoring sites in China. The median value of the PM2.5 concentration is 52.0 μg/m3. The distributions of AOD at 550 nm, WS, temperature, SSD, NDVI and construction land area exhibit some similarity to the distribution of PM2.5 concentrations. In contrast, other variables including RH, pressure, precipitation, population density, road length, DEM base height, and the number of industrial pollution companies have distributions that are either multimodal or power law in nature. While these variables aren’t expected to strongly co-vary with the PM2.5 concentration, recent studies have shown that they can nevertheless still have an influence on the spatial distribution of PM2.5 concentrations7,9,18,26. Consequently, we performed principal components analysis (PCA) on the full set of parameters, retaining those components that account for >98% of the total variance in line with recent approaches24,28. As a result, the dimensionality of the was reduced from 15 variables (AOD, latitude, longitude, RH, WS, temperature, pressure, precipitation, NDVI, DEM, population density, number of pollution companies, road length, construction land area plus the SLV calculated from localised PM2.5 concentrations described in the next section) to 11 principal components.

Spatial autoregression

In order to develop a continuous model for the distribution of PM2.5 concentrations over China as from local measurements made at air quality network monitoring stations together with regionally-distributed independent variables, we construct a spatial autoregressive (SAR) model. SAR models are a class of statistical models that apply to observations over a continuous spatial domain typically made at local nodes of a network or vertices of a uniform or non-uniform grid. Importantly, they allow the effect of spatial correlation to be included explicitly. SAR extends conventional multiple linear regression by allowing outcomes in one area to be affected by outcomes in nearby areas (i.e. spatial lags on the dependent variable), by covariates from nearby areas (i.e. spatial lags on independent variables), and by spatially autoregressive errors from nearby areas. The general form for a first-order SAR model is given by29,30:

$$\begin{array}{c}{\boldsymbol{y}}={\bf{X}}\beta +\rho {{\bf{W}}}_{1}{\boldsymbol{y}}+{\boldsymbol{u}}\\ {\boldsymbol{u}}=\lambda {{\bf{W}}}_{2}{\boldsymbol{u}}+{\boldsymbol{\varepsilon }}\\ {\boldsymbol{\varepsilon }}\sim N(0,{\sigma }^{2}{{\boldsymbol{I}}}_{n})\end{array}$$
(1)

where y is a [n × 1] scalar vector of observations of the dependent variable (PM2.5 concentrations in our case), X is a [n × k] matrix of exogenous variables with [k × 1] regression coefficients β, W1y, is the [n × 1] spatial lag variable (SLV) calculated from the weighted average of nearby PM2.5 monitoring sites (see next section) with spatial autoregression parameter ρ, u is the error term expressed in terms of [n × 1] spatially-lagged errors W1u, with spatial autoregression coefficient λ and ε which is a [n × 1] scalar vector of normally distributed (iid) random errors. W1 and W2 are the [n × 1] spatial weights matrices and in geophysical applications, are usually equivalent with the general property that all of their diagonal elements are zero and their rows sum to one. Equation (1) is sufficiently general that it embraces two major classes of spatial statistics models.

For ρ = 0, Eq. (1) reduces to the spatial error model (SEM) whereby the spatial dependence of y is correlated with spatial autoregression in the errors and measures the influence of errors in the exogenous variables in the local neighbourhood of observations of the dependent variable:

$${\boldsymbol{y}}={\bf{X}}\beta +{({\bf{I}}-\lambda {{\bf{W}}}_{2})}^{-1}{\boldsymbol{\varepsilon }}$$
(2)

For λ = 0, Eq. (1) reduces to the spatial lag model (SLM) whereby y is spatially-autocorrelated and models the diffusion of y over a region:

$${\boldsymbol{y}}={({\bf{I}}-\rho \lambda {{\bf{W}}}_{1})}^{-1}{\bf{X}}\beta +{({\bf{I}}-\rho \lambda {{\bf{W}}}_{1})}^{-1}{\boldsymbol{\varepsilon }}$$
(3)

The SLM therefore incorporates a spatial multiplier (I − ρW1)−1 called the “Leontief inverse” which connects the dependent variable y to all exogenous variables xi in the system at location I and not just the error at location i. This observed behaviour has been reported in several studies that have shown that the distribution of PM2.5 concentrations show significant spatial autocorrelation11,26. For a specific grid, the SLV is given by:

$$SLV=\frac{\mathop{\sum }\limits_{i=1}^{n}w{s}_{i}P{M}_{2.5,i}}{\mathop{\sum }\limits_{i=1}^{n}w{s}_{i}}$$
(4)

where, n is the number of nearby PM2.5 concentration measurements, wsi = 1/dsi is the spatial weight for the ith nearby PM2.5 concentration and dsi is its spatial distance. SAR modeling with Arcpy in ESRI’s ArcGIS suggests that n = 3 is optimal for our sample data set. In order to construct an integrated model that not only reflects the local autocorrelation of the PM2.5 concentrations but also expresses the nonlinear relationship between PM2.5 concentration and independent variables, we explicitly incorporate the SLV as a virtual variable and construct a S-BPNN.

Spatial back-propagation neural network (S-BPNN)

Previous studies confirm that ANN models perform more efficiently than MLR models for PM air pollution monitoring and forecasting9,18,31 due to their increased capacity for modeling the nonlinear relation between exogenous variables and PM2.5 concentrations. A conventional BPNN with three layers (an input layer, a hidden layer and an output layer) was constructed in our study as a control model as it has the desired property that it can act as a universal function approximator (UFA)32,33,34 and provide reliable baseline results. We then investigate the effect of including spatial autoregression by incorporating the SLV as an additional virtual variable in the set of exogenous inputs. We refer to the resulting model as a S-BPNN model of PM2.5 concentration. As described in the section on Data Fusion, PCA was applied to the list of 15 exogenous parameters that include the SLV and the resulting 11 principal components, accounting for >98% of the total variance, were used as explanatory variables in the input layer. In accordance with the requirements of a UFA35 the hidden layer comprises neurons having a nonlinear activation function. The output layer is a single linear neuron providing the PM2.5 concentration. A schematic diagram of the S-BPNN is illustrated in Fig. 2.

Figure 2
figure 2

Schematic of the S-BPNN used to estimate PM2.5 concentration in China.

Optimization of the S-BPNN model structure

When optimising the design of feed-forward back-propagation neural networks with supervised learning, training accuracy is expected to increase with the addition more neurons added to a hidden layer. However, this does not then necessarily translate into a corresponding improvement in overall model accuracy when tested on ‘unseen’ validation data. Apart from the obvious increase in training time needed, the problem is over-fitting of the neural network to the training data; resulting in a loss of ability to generalise on new data; significantly and negatively impacting overall model validity and performance24,32. Whether a top-down approach to selecting the number of neurons with network weight pruning or a bottom-up convergence approach is adopted, the goal for setting the number of nodes in the hidden layers of feed-forward neural networks is to use as few nodes as necessary to meet the accuracy requirements of the model. Studies have shown that approximate bounds on the numbers of nodes in the hidden layer for most applications range from \(2\sqrt{{\rm{n}}}+{\rm{\mu }}\) to2n + 19,36, where n and μ are the numbers of nodes in the input layer and output layer, respectively. With a single output and 11 inputs from PCA, this suggests that the optimal number of nodes in the hidden layer of the S-BPNN required to maximise performance ought to be in the range 7 to 23. Following previous studies9,24,33, we applied a sensitivity analysis approach to the neural network architecture by running a grid of models varying the number of nodes in the hidden layer from 1 to 33 and monitoring the model accuracy. Each neuron in the hidden layer had a nonlinear logistic activation function \(f(x)=\frac{1}{1+{e}^{-x}}\) and for weight and bias optimisation, MATLAB’s “traingdx” back-propagation algorithm was adopted which performs gradient descent with momentum (0.9) and an adaptive learning rate (0.05) using the mean squared error (MSE) as a cost function.

Figure S2 shows the training and validation goodness of fit statistics for the grid of runs used to optimise the architecture of the S-BPNN. In the evaluation of each run, 10-fold cross-validation was applied to the sample data divided 70%: 30% into a training (model fitting) dataset containing 896 records and a test (validation) dataset containing 384 records. The statistical indicators used to measure the goodness of fit are the coefficient of determination (R2), the root-mean-square error (RMSE, μg/m3), the mean prediction error (MPE, μg/m3), and relative prediction error (RPE, %) which are defined in the Supplementary Information accompanying this paper. As expected, we observed that, as the number of neurons in the hidden layer increases, S-BPNN model performance slightly improves for the training (fitting) data but gradually degrades for the test (validation) data. Over-fitting is observed when the number of neurons > 7 (the point beyond which the R2 value continues to degrade from its first local maximum). As such, 7 hidden neurons were adopted as the optimal case. We also investigated the effect of changing the back-propagation training algorithm to the Levenberg-Marquardt algorithm and using the hyperbolic tangent (tanh) nonlinear activation function, but neither led to any improvement in the performance.

Results and Discussion

Performance of the models

After determining the optimal neural network architecture, we then partitioned the sample dataset comprising N = 1280 records (one for each monitoring station) including yearly and seasonal mean data, and trained and tested the S-BPNN and BPNN. A common framework was used for both models, with the exception that the S-BPNN includes the SLV in the principal components fed as inputs to the network. MATLAB’s Neural Network Toolbox version 6.0 was then used to build and train the S-BPNN and the BPNN. The accuracy of the trained models was calculated using 10-fold cross-validation. During this procedure, the sample data set was randomly divided into 10 parts; 9 of which were used for fitting and 1 for validation each time. The mean accuracy of 10 cross-validation trials for the networks trained on yearly mean data is shown in Table 1.

Table 1 Accuracy of the trained S-BPNN and BPNN models calculated with 10-fold cross-validation applied to yearly mean data.

Inclusion of the SLV in the S-BPNN leads to an increase in performance with the mean R2 of the fitting and validation data increasing from 0.80 and 0.75 respectively for the BPNN to 0.89 for the S-BPNN. A corresponding reduction in all mean error measures is observed. The RMSE of the S-BPNN never exceed 7.45 μg/m3. To further evaluate the performance of S-BPNN and BPNN models, scatter plots of the fitting and validation results are shown in Fig. 3.

Figure 3
figure 3

Scatter plots of BPNN and S-BPNN fitting and validation results for yearly mean data. The solid line is the trend line and the dashed line is the 1:1 line as a reference. (a) and (c) are the BPNN model fitting and 10-fold cross-validation results, respectively. (b) and (d) are the S-BPNN model fitting and 10-fold cross-validation results, respectively.

To assess the performance on the seasonal timescale, the same approach was adopted and applied to seasonal mean datasets to train and evaluate S-BPNN and BPNN models. The mean accuracy of 10 cross-validation trials for the networks trained on the seasonal mean data are presented in Table 2. Some variation in model performance with season is apparent with lowest errors in the summer and largest errors in the winter seasons. This is to be expected for two reasons. Firstly, cloud cover is seasonal and impacts the quality of satellite AOD retrievals. While uncertainty on the model inputs was not explicitly included in the model design in this study, the propagation of uncertainty through spatial nonlinear models warrants future attention. Secondly, the sample data is constructed from daily averaging which leads to variation in the uncertainty of the sample data used to train the models at this timescale. This variation is expected to impact the quality assessment of models as their prediction timescale (yearly → seasonal) approaches the daily timescale. Nevertheless, it is clear that the S-BPNN outperforms the BPNN in all cases and that the incorporation of spatial information is of clear benefit even at shorter modelling timescales.

Table 2 Accuracy of the trained S-BPNN and BPNN models calculated with 10-fold cross-validation applied to seasonal mean data.

Spatial distribution of PM2.5 concentration

In Fig. 4, the S-BPNN model was used to map the seasonal and yearly mean distribution of PM2.5 concentrations in China at a spatial resolution of 10 km. The distributions of seasonal and annual PM2.5 concentrations have considerable spatial heterogeneity and spatial aggregation. This has two major impacts. Firstly, the sparse monitoring of large expanses of south-western China leads to significant gaps in model input data in this region. Secondly, strong clustering of monitoring stations in regions of high urbanisation means that spatial data in these regions is likely to be more representative. The S-BPNN modeled yearly mean PM2.5 concentration for China in 2015 was found to be 41.76 μg/m3 and reflects well the value of 52.0 μg/m3 calculated at the nodes of the air quality monitoring network. Importantly, the interpolative power of the S-BPNN model at interstitial locations allows for a moderate resolution assessment of national exceedances nationwide. We find that more than 70% of Chinese territory exceeds Level 2 of the Ambient Air Quality Standards (CAAQS) having a yearly mean concentration > 35 μg/m3.

Figure 4
figure 4

Spatial distributions of seasonal and annual estimated PM2.5 concentrations (μg/m3) in China, 2015: (a) spring, (b) summer, (c) autumn, (d) winter and (e) annual (Jan. 2015 to Dec. 2015). The white regions indicate missing data. Maps were made using ArcGIS software.

Overall, the levels of PM2.5 concentrations are higher in the northern regions than in the southern regions. Heavily polluted regions are located in the North China Plain, especially Beijing-Tianjin-Hebei (BTH), and south-western Xinjiang, where the highest annual PM2.5 concentration reached 108 μg/m3. However, the causes for such high levels of PM2.5 are different in these locations. Pollution in the North China Plain’s is caused mainly by industrial emissions and is exacerbated by stagnant weather, with a weak wind and a relatively low boundary layer height reducing the dispersion, transformation and diffusion of atmospheric gases and chemical reactions26. South-western Xinjiang’s pollution is due to desert dust particles which make a significant contribution to the accumulation of PM2.5 concentrations7. While lower level regions of PM2.5 concentrations are found in the south provinces (e.g., Hainan, Guangdong, Fujian and Yunnan), these regions benefit from low levels of industrial source emissions and favourable meteorological conditions for gas dispersion and chemical reaction in the atmosphere. While it is immediately apparent that winter is the most polluted season with high levels of PM2.5 concentrations and summer is the cleanest with the lowest levels, some regions also exhibit high levels of PM2.5 concentrations in the spring especially in the North China Plain and over Northwest China. PM2.5 pollution is mitigated to some extent in the autumn months.

Conclusions

Faced with the complexity involved in modelling PM2.5 concentrations deterministically, neural network-driven statistical models like BPNNs have been developed with demonstrable advantages for the estimation and mapping of PM2.5 concentrations and other particulate matter components of air pollution9,24,37,38,39,40. By incorporating spatial correlation information using a SLV in an S-BPNN model design, we were able to improve the accuracy and performance of a BPNN trained to retrieve PM2.5 concentrations at 10 km resolution from satellite AOD, meteorological data, land use data, source emission data and related geographical data. The exogenous variables used encompass not only static features that impact PM2.5 concentrations, but also the dynamic processes at work at the local and regional scale.

Cross-validation results suggest that the S-BPNN model outperforms conventional BPNN models and provides accurate estimates of yearly mean PM2.5 concentrations and exceedances for China to a precision of RMSE < 7.45 μg/m3. Similarly reliable estimates were also obtained for seasonal means over China, recovering understood patterns in the geographical distribution of pollution sources across the country both in terms of geography and also in terms of synoptic conditions. Sensitivity analysis applied to neural network design using a grid of runs in combination with 10-fold cross-validation enabled model performance to be optimised in a systematic way, and ensured that the models produced are robust and reproducible.

Despite the satisfactory performance of the S-BPNN model, the SLV only accounts for the spatial distance between PM2.5 monitoring stations. It is possible that a further increase in performance can be achieved by extending the SAR model to include also the spatial lag of covariates as well as spatially autoregressive errors both within and between spatial domains. In particular, it is expected that the inclusion of and propagation of uncertainty information on the variates will to help models capture higher frequency variability PM concentrations. It is also not yet clear what the balance is between increased SAR model complexity and nonlinearity in terms of the efficiency and performance of statistical models of PM2.5 concentrations, i.e. whether or not a linear framework could be adequate. In a future study we will adapt the SLM to MLR models of PM and consider other approaches such as deep learning to help identify spatial features and model trends in the data. For this, we will exploit the availability of high-resolution satellite AOD products to map at even higher resolution greater detail, to estimate the distribution of PM2.5 concentrations.