A novel hybrid model for six main pollutant concentrations forecasting based on improved LSTM neural networks

In recent years, air pollution has become a factor that cannot be ignored, affecting human lives and health. The distribution of high-density populations and high-intensity development and construction have accentuated the problem of air pollution in China. To accelerate air pollution control and effectively improve environmental air quality, we targeted cities with serious air pollution problems and established a model for air pollution prediction. We used the daily air pollution monitoring data from January 2016 to December 2020 for the respective cities. We used the long short-term memory (LSTM) network to solve the problem of gradient explosion in recurrent neural networks, then used the particle swarm optimization algorithm to determine the parameters of the CNN-LSTM model, and finally introduced the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) to decompose the air pollution series and improve the accuracy of model prediction. The experimental results show that, compared with a single LSTM model, the CEEMDAN-CNN-LSTM model has higher accuracy and lower prediction errors. The CEEMDAN-CNN-LSTM model enables a more precise prediction of air pollution, and may thus be useful for sustainable management and the control of air pollution.

Complete ensemble empirical mode decomposition with adaptive noise. The empirical mode decomposition (EMD) algorithm is a signal analysis method originally proposed by Huang et al. 29. It is an adaptive data processing or mining method that is suitable for nonlinear and non-stationary time series, and it is essentially a smoothing process applied to data series or signals 30. In EMD, any given complex signal can be empirically decomposed into a collection of basic oscillatory components, called intrinsic mode functions (IMFs). Each IMF represents an oscillation mode of the original signal 31. The original signal $x(t)$ can be reconstructed by the following formula:

$$x(t) = \sum_{i=1}^{n} c_i(t) + r_n(t)$$

where $c_i(t)$ is the i-th IMF (i.e. local oscillation) and $r_n(t)$ is the residue remaining after the $n$ IMFs have been extracted (i.e. local trend).

The EMD method can ideally be applied to the decomposition of any type of time series (signal) because of its obvious advantages over previous smoothing methods in dealing with non-stationary and non-linear data 32,33. To overcome the problem of mode mixing in EMD and solve the problem of IMF component alignment during ensemble averaging, Torres et al. modified the decomposition process of CEEMD, changed how white noise is added, and proposed the complete ensemble empirical mode decomposition with adaptive noise 34. In the new CEEMD, white noise is added in pairs to the original data (i.e. one positive and one negative) to generate two sets of ensemble IMFs.
$$M_1 = S + N, \qquad M_2 = S - N$$

where $S$ is the original data; $N$ is the added white noise; $M_1$ is the sum of the original data and the positive noise; and $M_2$ is the sum of the original data and the negative noise. There is less residual noise in the intrinsic mode components, which effectively reduces the reconstruction error, and a global stopping criterion exists at each stage of the decomposition. The decomposition efficiency of this method was the highest 35.
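The paired-noise construction described above can be illustrated with a minimal sketch. It only shows how the two noisy copies M1 and M2 are formed and why averaging them cancels the added noise; it is not a CEEMDAN implementation. The function name `paired_noise_signals`, the noise level, and the sample values are assumptions made for illustration.

```python
import random


def paired_noise_signals(signal, noise_std=0.2, seed=0):
    """Return (M1, M2, N) with M1 = S + N and M2 = S - N.

    noise_std and seed are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, noise_std) for _ in signal]
    m1 = [s + n for s, n in zip(signal, noise)]  # original data plus positive noise
    m2 = [s - n for s, n in zip(signal, noise)]  # original data plus negative noise
    return m1, m2, noise


# Hypothetical daily concentrations, used only to exercise the function.
signal = [10.0, 12.5, 9.0, 14.0, 11.0]
m1, m2, noise = paired_noise_signals(signal)

# Averaging the pair removes the added noise and recovers the original data.
recovered = [(a + b) / 2 for a, b in zip(m1, m2)]
```

Because the noise enters with opposite signs, it cancels in the average, which is the intuition behind the lower residual noise of the paired scheme.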
This study uses the CEEMDAN algorithm to decompose non-stationary air pollution series data to form a series of IMF subsequences and residual terms (RES) with different frequency characteristics.

Convolutional neural network.
A CNN is a neural network used to process data with a known grid-like topology 36. It is a feed-forward neural network whose basic structure consists of input, convolutional, pooling, fully connected, and output layers 37. The convolutional layer is the core of the CNN, where the convolutional kernel C j is used to extract the internal features.
In this layer, A i represents the input, ⊗ represents the convolution operator, σ represents the activation function (ReLU is selected here), ω i is the weight of the kernel linked to the i-th feature map, and b i represents the bias matrix; the layer output is obtained by convolving A i with ω i, adding b i, and applying σ.
The pooling layer is mainly used to pool the data after the convolution operation. Its main function is to compress the data, remove unnecessary information, effectively improve the generalization ability of the network, and increase the calculation speed 38,39. Each node of the fully connected layer is connected to all nodes of the preceding layer; this layer integrates the features extracted by the earlier layers and aids the prediction of the subsequent LSTM layer 40. The structure of a one-dimensional convolutional neural network is shown in Fig. 1.
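As a rough illustration of the convolution, ReLU activation, and pooling steps described above, the following sketch applies a single one-dimensional kernel to a short series and then max-pools the result. The kernel values, the pool size, and the function names are assumptions chosen for readability; the sliding product is computed without flipping the kernel, as is common in deep learning libraries.

```python
def relu(x):
    """ReLU activation: negative values are set to zero."""
    return max(0.0, x)


def conv1d(inputs, kernel, bias=0.0):
    """Slide the kernel over the inputs (no padding), add the bias, apply ReLU."""
    k = len(kernel)
    return [
        relu(sum(inputs[i + j] * kernel[j] for j in range(k)) + bias)
        for i in range(len(inputs) - k + 1)
    ]


def max_pool1d(features, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(features[i:i + size])
        for i in range(0, len(features) - size + 1, size)
    ]


# Hypothetical input series and kernel, for illustration only.
series = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
features = conv1d(series, kernel=[1.0, -1.0])  # responds to day-to-day decreases
pooled = max_pool1d(features, size=2)          # compresses the feature map
```

Pooling halves the length of the feature map here, which is the data compression effect the text attributes to the pooling layer.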
Long short-term memory model. Long short-term memory (LSTM) 41 neural networks are special recurrent neural networks that can learn long-term dependencies and effectively avoid the vanishing gradient phenomenon 42. It is a machine-learning architecture that allows the model to "learn" over many time steps. Additionally, it places a memory cell in the neural nodes of the hidden layer of the recurrent neural network to record historical information; by adding three gate structures (input, forget, and output), the historical information can be selectively retained, updated, and output 43.
As shown in Fig. 2, when the input sequence is $x = (x_1, x_2, \ldots, x_t)$ and the state of the hidden layer is $(h_1, h_2, \ldots, h_t)$, the state update and output of the memory unit can be summarized, in the standard LSTM formulation, as

$$\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

where $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, $c_t$ is the memory cell state, $W$, $U$, and $b$ are weights and biases, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.

As air pollution data from ground monitoring sites are usually in a time series format, air pollution can be modeled by considering the time-dependent patterns 44. Feed-forward neural networks (FNNs) have been commonly used in previous studies to predict air pollution. However, these models cannot consider the time dependency of the parameters. Sequence modeling facilitates the excavation of temporal dynamic features in historical data and ensures better predictions 45. Compared with the FNN, recurrent neural networks (RNNs) are designed to deal with time-series data; however, they suffer from vanishing or exploding gradient problems 46, which LSTM can be used to overcome. In this study, we chose LSTM for air pollution prediction as it extracts representative features from historical air pollution data and obtains further representations of the merged features to generate predictions.
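The three-gate update of the memory unit can be made concrete with a scalar sketch of one LSTM time step in the standard formulation. This is an illustration only: the parameter names in the dictionary `p` and the all-zero parameter values are assumptions, and a real model would use weight matrices learned from data.

```python
import math


def sigmoid(z):
    """Logistic function, used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step: returns the new hidden state and cell state."""
    f = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wc"] * x_t + p["uc"] * h_prev + p["bc"])  # candidate memory
    c = f * c_prev + i * g   # keep part of the old memory, add new information
    h = o * math.tanh(c)     # expose part of the memory as the hidden state
    return h, c


# Hypothetical parameters: with all weights zero every gate equals 0.5.
zero_params = {k: 0.0 for k in (
    "wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wc", "uc", "bc")}
h1, c1 = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
```

With these zero parameters the forget gate halves the previous cell state, which shows how the cell state carries information forward additively rather than through repeated squashing.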
Particle swarm optimization. Particle swarm optimization (PSO) is an evolutionary computation technique developed by Kennedy and Eberhart in 1995 47. It is a swarm intelligence optimization algorithm that simulates the foraging behavior of bird flocks: each particle adjusts its own velocity and position until the convergence termination condition is met 48,49. All particles in the swarm stay in the set search space, as shown in Eqs. (6) and (7):

$$V_{i,j}^{t+1} = \omega V_{i,j}^{t} + c_1 r_{1,i,j}^{t}\left(y_{i,j}^{t} - x_{i,j}^{t}\right) + c_2 r_{2,i,j}^{t}\left(\hat{y}_{j}^{t} - x_{i,j}^{t}\right) \quad (6)$$

$$x_{i,j}^{t+1} = x_{i,j}^{t} + V_{i,j}^{t+1} \quad (7)$$

where $V_{i,j}^{t}$ is the velocity of particle $i$ at generation $t$, and $j$ is the dimension; $x_{i,j}^{t}$ is the position of particle $i$; $c_1$ and $c_2$ are the cognitive and social coefficients; $y_{i,j}^{t}$ is the best value found by particle $i$ up to generation $t$; $\hat{y}_{j}^{t}$ is the best of all of these best values across the swarm; and $r_{1,i,j}^{t}$ and $r_{2,i,j}^{t}$ are uniformly distributed random numbers in the interval [0,1]. Furthermore, the inertia weight $\omega$ is introduced to better balance the exploration and exploitation of the search by the particles.
PSO has several advantages over other metaheuristic techniques in terms of its simplicity, convergence speed, and robustness. It converges to global or near-global optima, irrespective of the shape or discontinuities of the cost function. As PSO can help prevent the network from converging to a local optimum, it can be selected to optimize the LSTM input layer weights 50. Most previous studies on PSO systems have provided empirical results and conducted informal analyses 47,51,52. Many studies have shown that the PSO algorithm can improve the prediction accuracy by optimizing the LSTM model 53-55. Thus, this study proposes an enhanced PSO-based LSTM model, which is used to forecast air pollution.
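The velocity and position updates described above can be sketched as a small, self-contained PSO loop. This is a generic textbook-style version, not the enhanced variant used in the study: the inertia weight, the coefficient values, the swarm size, the clamping of positions to the search bounds, and the quadratic test function are all assumptions made for illustration.

```python
import random


def pso(cost, bounds, n_particles=20, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise cost over the box given by bounds; returns (best_position, best_cost)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position of each particle
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]  # best position of the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + cognitive pull + social pull.
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - pos[i][j])
                             + c2 * r2 * (gbest[j] - pos[i][j]))
                lo, hi = bounds[j]
                # Keep the particle inside the search space.
                pos[i][j] = min(hi, max(lo, pos[i][j] + vel[i][j]))
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost


def shifted_sphere(p):
    """Hypothetical cost function with its minimum at (3, -1)."""
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2


best, best_cost = pso(shifted_sphere, bounds=[(-10.0, 10.0), (-10.0, 10.0)])
```

In a forecasting setting the cost function would instead train and validate a network for the hyperparameters encoded in the particle position, which makes each evaluation far more expensive than in this toy case.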

Proposed air pollution forecasting model
The CEEMDAN-CNN-LSTM model. Air pollution data are a time series characterized by complex instability, nonlinearity, and periodic uncertainty, and are affected by many factors. As mentioned above, LSTM has a strong modeling and analysis ability for processing time-series data, and its performance in time-series analysis is excellent 47. However, the LSTM model extracts only the temporal features of the input, whereas the input data also carry local (spatial) structure. A CNN is a deep learning network wherein the local and overall features of the input data can be progressively extracted using nonlinear mapping 56,57, so it can extract these spatial features. Therefore, we proposed a combination of CNN and LSTM models. CNN-LSTM has been extensively employed for time-series forecasting. However, determining its structure is difficult, and training often falls into a local minimum 58. The CEEMDAN method can divide the singular values into separated IMFs and determine the general trend of the real time series; thus, it can help determine the characteristics of complex non-linear or non-stationary time-series data 59. This can effectively reduce unnecessary interactions among singular values and improve the performance when a single kernel function is used in forecasting 60. This section proposes a model that combines the CEEMDAN and CNN-LSTM models for air pollution prediction. As shown in Fig. 3, the CEEMDAN algorithm was used to decompose the air pollution series measured by the air quality monitoring station into a limited number of IMFs. Subsequently, we used the CNN-LSTM model to learn and predict the short-term time series of each IMF component, and summed the predicted values of all IMF components to obtain the final prediction result.
Finally, PSO is used to optimize the hyperparameters of the LSTM because of its simplicity and ease of implementation 61. The core idea of the PSO algorithm is to first initialize a set of random solutions and then iteratively find the optimal solution 62. The PSO algorithm can enable the LSTM model to accurately and quickly determine the optimal parameters according to the characteristics of the air pollution data, and realize an effective combination of the network structure of the LSTM model and the features of the air pollution data 63.
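The decompose, predict per component, then sum workflow described in this section can be sketched with deliberately simple stand-ins. Neither stand-in is the method used in the study: `decompose` is a moving-average split used here in place of CEEMDAN, and `predict_next` is a persistence forecast used in place of the per-component CNN-LSTM. Only the overall control flow is the point of the example.

```python
def decompose(series, window=3):
    """Stand-in for CEEMDAN: split the series into a detail part and a smooth
    trend whose element-wise sum equals the original series."""
    trend = []
    for t in range(len(series)):
        lo = max(0, t - window + 1)
        trend.append(sum(series[lo:t + 1]) / (t + 1 - lo))
    detail = [s - m for s, m in zip(series, trend)]
    return [detail, trend]


def predict_next(component):
    """Stand-in for the per-component CNN-LSTM: repeat the last value."""
    return component[-1]


def hybrid_forecast(series):
    """Forecast each component separately and add the component forecasts."""
    return sum(predict_next(c) for c in decompose(series))


# Hypothetical daily PM2.5 values, for illustration only.
series = [30.0, 42.0, 35.0, 50.0, 47.0]
components = decompose(series)
forecast = hybrid_forecast(series)
```

Because the components add up to the original series, summing the component forecasts yields a forecast on the original scale, which is why the final step in the workflow is a plain addition.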

Model fitting and validation.
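To evaluate the predictive ability of the models, three indices, namely, the root mean square error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R²), were calculated in this study. In general, the smaller the RMSE and MAE and the larger the R², the more accurate the model. RMSE, MAE, and R² are defined in Eqs. (11)-(13), respectively.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f_i\right)^2} \quad (11)$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - f_i\right| \quad (12)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - f_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \quad (13)$$

where $n$ is the number of data points, $y_i$ is the measured air pollutant concentration, $\bar{y}$ is the average of the measured values, and $f_i$ is the air pollutant concentration simulated by the model.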
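The three evaluation indices can be computed directly from their definitions. The sketch below uses only the standard library; the measured and predicted values are hypothetical and serve only to exercise the functions.

```python
import math


def rmse(y, f):
    """Root mean square error between measured y and predicted f."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, f)) / len(y))


def mae(y, f):
    """Mean absolute error between measured y and predicted f."""
    return sum(abs(a - b) for a, b in zip(y, f)) / len(y)


def r2(y, f):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, f))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted concentrations.
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 9.0]
scores = (rmse(measured, predicted), mae(measured, predicted), r2(measured, predicted))
```

Lower RMSE and MAE indicate smaller errors, whereas a higher R² indicates a better fit, so the three indices should not be ranked in the same direction.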

Case study
Study area and data set. According to historical research, air pollution is highly correlated with six air pollutants (PM2.5, PM10, NO2, CO, O3, and SO2) 15,64-66. Therefore, this study investigated the 20 cities with the worst air quality in China and selected the 6 most representative cities according to their primary pollutants, economic conditions, and geographical factors to prove the validity and robustness of the hybrid model. The final choices were Xinxiang (main air pollutant: PM2.5), Taiyuan (PM10), Zibo (SO2), Handan (NO2), Binzhou (O3), and Jinan (CO). Data were obtained from the national urban air quality real-time release platform of the China Environmental Monitoring Station. Daily data were obtained for the period from January 1, 2016, to December 31, 2021, with a total of 2192 observations. For each city, data from 2016 to 2020 were used as the training set, and data from 2021 were used as the test set.

Descriptive statistics. To better illustrate the data used, the pollutant concentrations of the six cities were plotted as a line graph, and the results are shown in Fig. 4. Overall, the six major air pollutants showed obvious periodicity; PM2.5, SO2, CO, and PM10 showed a yearly decreasing trend. Among them, the concentrations of PM2.5, SO2, NO2, CO, and PM10 reached their highest values in January and their lowest in September each year, showing a "U" shape; the change trend of O3 is the opposite. The highest and lowest concentrations of O3 occur in September and January every year, respectively, and the distribution is in the shape of "Λ". This research suggests that this anomaly is not a coincidence, and that a deeper connection exists between the six pollutants. In summer, strong solar radiation causes the surface temperature to rise sharply and heats the air near the surface. This leads to increased convection and precipitation, which accelerate the diffusion and deposition of atmospheric pollutants 67. Frequent sandstorms cause air pollution 68. Stable weather and biomass combustion are common 69. In winter, the low surface temperature causes surface inversion, and the meteorological conditions are not conducive to vertical convection 70; therefore, the near-surface air pollution is high. As these six pollutants have similar influencing factors and trends, the remaining five pollutants need to be combined to predict the concentration of a single pollutant.
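The size of the daily data set and the chronological split used in the experiments (2016-2020 for training, 2021 for testing) can be checked with a short calendar calculation. The helper name `daily_dates` is an assumption; only the date range and the split come from the text.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """All calendar days from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [start + timedelta(days=k) for k in range(n_days)]


dates = daily_dates(date(2016, 1, 1), date(2021, 12, 31))

# Chronological split: no shuffling, so the test year is never seen in training.
train_dates = [d for d in dates if d.year <= 2020]
test_dates = [d for d in dates if d.year == 2021]
```

The count of 2192 days includes the two leap years in the period (2016 and 2020), which matches the number of observations reported for each city.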

Results and discussion
From past forecasting studies 71-75, we know that prediction work based on LSTM follows a particular framework: PM2.5 (or another time series) is decomposed into several IMFs and a residual by EMD; subsequently, the LSTM model is applied to each IMF and the residual; and finally, the individual predictions are simply added to obtain the predicted value. However, this framework has some limitations:
1. It cannot prevent the transfer of white noise from high frequency to low frequency during EMD decomposition.
2. The high-frequency IMF has to be chosen.
3. It cannot choose the optimal parameter combination for the LSTM model.
4. It makes decomposition predictions only for a single sequence, without considering whether other factors influence the prediction results.
To address these problems, this section presents and discusses the results of the combined model.
CEEMDAN decomposition results of PM2.5. The CEEMDAN algorithm was selected to solve the problem that the white noise in the EMD algorithm transfers from high frequency to low frequency. According to Table 2, the RMSE and MAE under CEEMDAN decomposition are 55.20% and 44.54% lower, respectively, than those under EMD.

PSO parameter optimization results. In this study, the parameter space consisted of the time step (n) of the time series and the number of neurons (cells) in the LSTM neural network model. The range of n was 1-20, and the range of cells was 1-100. We used the daily data of 2016-2020 and 2021 as the training and test sets, respectively, and the parameter results of the PM2.5 prediction model training for the six cities are presented in Table 3. As presented in Table 3, the optimal time step for the six cities is either two or four, which shows that the CEEMDAN-PSO-CNNLSTM model constructed in this study depends only on data from the past few days. Additionally, the minimum and maximum numbers of LSTM neurons were 42 and 92, respectively, indicating that the model was more sensitive to changes in the number of neurons.
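The two searched parameters can be illustrated as follows. The first helper shows one plausible way to map a continuous PSO particle position onto the integer ranges given above (n in 1-20, cells in 1-100); the second shows how a time step n turns a series into input windows and next-day targets. Both function names and the rounding-and-clamping rule are assumptions, since the text does not state how positions are decoded.

```python
def decode_particle(position):
    """Map a continuous 2-D position to integer (n, cells) within the search ranges."""
    n = min(20, max(1, round(position[0])))        # time step, 1-20
    cells = min(100, max(1, round(position[1])))   # LSTM neurons, 1-100
    return n, cells


def make_windows(series, n_steps):
    """Build (inputs, targets): each input holds n_steps past values and the
    target is the value that immediately follows them."""
    inputs = [series[i:i + n_steps] for i in range(len(series) - n_steps)]
    targets = [series[i + n_steps] for i in range(len(series) - n_steps)]
    return inputs, targets


# Hypothetical particle position and series, for illustration only.
n, cells = decode_particle([3.6, 120.2])
inputs, targets = make_windows([1, 2, 3, 4, 5], n_steps=2)
```

A small optimal time step, as reported in Table 3, means each training sample contains only a few past days, so the number of usable samples stays close to the length of the series.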

Final model predictions. After CEEMDAN decomposition and PSO algorithm optimization, the variables and parameters were input into the CNN-LSTM model to predict the final air pollutant value. In other studies, prediction methods combined with EMD treated the first high-frequency IMF sequence as a noise term and discarded it, so that it did not contribute to the prediction result 76,77. This method is simple and crude and may lose some useful information and retain some noise signals. Therefore, a CNN was selected to screen the IMFs. As shown in Table 4, the proposed CEEMDAN-PSO-CNNLSTM model has the best overall prediction accuracy; it has the smallest RMSE and MAE and the highest R² among the predictions for the six cities. Fig. 6 also shows that the PM2.5 prediction curve of the proposed model has a high degree of fit with the actual curve, and the prediction accuracy is high. Therefore, the proposed model is considered to be effective and robust in predicting results under different polluted environments and outperforms the other models.
Specifically, the prediction accuracy of the models after CEEMDAN decomposition was significantly higher than that of the models without decomposition (CEEMDAN-PSO-CNNLSTM vs. PSO-CNN-LSTM, CEEMDAN-PSO-LSTM vs. PSO-LSTM, and CEEMDAN-SVM vs. SVM). Considering Binzhou as an example, applying CEEMDAN decomposition to the SVM, PSO-LSTM, and PSO-CNNLSTM models improved the R² by 0.08, 0.31, and 0.39, respectively, while the prediction errors were reduced by 12.24, 34.34, and 51.82% (RMSE) and 13.85, 32.05, and 48.61% (MAE), respectively. This shows that the signal decomposition technique can effectively reduce the non-stationarity of the PM2.5 series, thereby improving the performance. Additionally, the results listed in Table 2 also confirm the necessity of CEEMDAN decomposition. Finally, in response to the second question raised at the beginning of this section, the model prediction results are compared and analyzed. We found that when CEEMDAN decomposition is not applied or the model has few input variables (only five variables are input in the PSO-LSTM and PSO-CNN-LSTM models), the improvement in prediction accuracy from using a CNN for feature screening is not obvious. In the six selected cities, the R² values increased by approximately 0.01-0.08, and the RMSE and MAE decreased by approximately 1-8% and 1-9%, respectively. For the models after CEEMDAN decomposition (there are more than 10 variables in the CEEMDAN-PSO-LSTM and CEEMDAN-PSO-CNNLSTM models), the use of a CNN for feature screening greatly improved the model prediction accuracy; in the six selected cities, the R² increased by 0.10-0.18, and the RMSE and MAE decreased by approximately 27-46% and 22-42%, respectively. We concluded that when there are more input variables, using a CNN for feature screening can increase the accuracy and reduce the error of the model. Therefore, it is feasible and effective to use a CNN to screen each IMF component after CEEMDAN decomposition.
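The comparisons above mix two kinds of change: relative reductions for the error indices (in percent) and absolute gains for R². The sketch below makes the distinction explicit. The baseline RMSE of 20.0 and the R² values are hypothetical numbers chosen for illustration; only the 51.82% reduction figure is taken from the text.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of an error index, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline


def absolute_gain(baseline, improved):
    """Absolute change of an index such as R^2."""
    return improved - baseline


# Hypothetical baseline RMSE; the improved value is chosen so that the
# reduction equals the 51.82% reported for Binzhou.
baseline_rmse = 20.0
improved_rmse = baseline_rmse * (1.0 - 0.5182)
rmse_drop = percent_reduction(baseline_rmse, improved_rmse)

# Hypothetical R^2 values illustrating an absolute gain of 0.39.
r2_gain = absolute_gain(0.55, 0.94)
```

Reporting R² as an absolute difference avoids the inflated percentages that arise when the baseline R² is small.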
The PM2.5 prediction scatter plot for each model is shown in Fig. 7. The proposed model had the highest R-value (0.94). The graph shows that CEEMDAN-PSO-CNNLSTM also has a fitting advantage over the other models: the scattered points are evenly distributed on both sides of the diagonal, and the fitted straight line is the closest to the diagonal. Additionally, although the R-value of the SVM model was high, the model predictions show that the predicted value of the SVM was usually high, and it was not sensitive to changes in …

Comparison of prediction results without introducing air pollutants. The PM2.5 prediction results for the six cities are shown in Fig. 8. From the graph, the curve of the joint prediction with the five pollutants has a high degree of fit with the actual curve and better reflects the change trend. Notably, compared with the single PM2.5 time-series prediction, the RMSE and MAE obtained by combining the other five pollutants were smaller, and the R² was larger. This indicated that the input of the five pollutant data improved the prediction accuracy of the model. The positive effect of combining the five air pollutants on model performance was more significant, especially for the cities not polluted by PM2.5; however, for the cities mainly polluted by PM2.5, introducing the remaining five air pollutants into the model had a more significant impact. Various pollutants also have a certain effect on improving the prediction accuracy. Therefore, it is necessary to combine the predictions of the remaining five pollutants with that of the sixth air pollutant.

Health effect assessment. Air pollution affects the economy and causes serious damage to human health.
The number of deaths due to excessive pollutant concentrations is presented in Table 8. This section is based on the predicted concentrations in 2021, combined with historical research, and uses the WHO guideline values revised in 2021 and the first-level and second-level limits of China's "Environmental Quality Standard" as the reference concentrations to evaluate the health impact on the population. Tables 6 and 7 list the aggregated air pollutant concentration reference values and the percentage increases in population mortality, respectively. Notably, the air pollutant that increases the mortality rate of the population the most is NO2, which increases the mortality rate by 1.4% per 10 μg·m⁻³, followed by SO2, which increases the mortality rate by 0.9% per 10 μg·m⁻³.
As presented in Table 8, under the latest WHO standard, the number of deaths due to excessive NO2 concentration is the largest among the six cities, followed by those due to PM10 and PM2.5, respectively. According to the national standard, the number of deaths due to excessive PM10 is the largest, followed by PM2.5. In general, although the concentrations of SO2, CO, and O3 in some cities are significantly higher than those in others, PM2.5 and PM10 are the main air pollutants affecting human health. PM2.5, PM10, and NO2 should be considered the focus of air pollution prevention and control, and this focus must be combined with each city's own SO2, CO, and O3 concentration characteristics to gradually tighten the standard limits.
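The per-10 μg·m⁻³ mortality increments quoted above can be turned into a simple worked calculation. The sketch assumes that the increase scales linearly with the concentration in excess of the reference value and is zero below it; the text does not state the exact exposure-response form used, so this linear rule, the function name, and the example concentrations are all assumptions. Only the 1.4% (NO2) and 0.9% (SO2) increments come from the text.

```python
def excess_mortality_percent(concentration, reference, percent_per_10):
    """Percentage increase in mortality for a concentration above a reference,
    assuming a linear response per 10 ug/m3 of excess (an assumption)."""
    excess = max(0.0, concentration - reference)  # no effect below the reference
    return excess / 10.0 * percent_per_10


# Hypothetical concentrations and reference values, in ug/m3.
no2_increase = excess_mortality_percent(40.0, 10.0, percent_per_10=1.4)
so2_increase = excess_mortality_percent(60.0, 40.0, percent_per_10=0.9)
below_reference = excess_mortality_percent(5.0, 10.0, percent_per_10=1.4)
```

Under this rule the choice of reference concentration directly changes the estimated burden, which is consistent with the different rankings obtained under the WHO and national standards.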

Conclusions
Cities are usually affected by air pollution in several ways. To promote the sustainable development of urban public health and of society, a stable and high-precision air pollutant prediction model is required. This paper studied a series of existing LSTM prediction frameworks and found that some problems remain, including the selection of high-frequency feature signals, the selection of LSTM model parameters, and predictions that do not consider other closely related drivers of air pollution. Therefore, this study develops a hybrid model named CEEMDAN-PSO-CNNLSTM to solve these problems. First, CEEMDAN is used to decompose the air pollutant signal, and the decomposed data are then sent to the CNN-LSTM neural network for PSO optimization. Finally, the model with the optimized parameters was trained using the original data to obtain the final prediction result. According to the evaluation criteria, the proposed model had the highest accuracy among the six compared models. Additionally, predictions were made for six cities affected by different pollutants, and we found that the prediction accuracy of the proposed model was the highest in each comparison, indicating the robustness of the model. The advantages of the proposed hybrid model are as follows: 1. Considering the influence of other air pollutants improved the prediction accuracy for a single air pollutant. 2. Combining the CEEMDAN decomposition with the PSO algorithm and using a CNN to screen the IMFs not only solves the problem of parameter selection in the LSTM model, but also solves that of white noise and high-frequency signals interfering with the prediction results; thus, it improves on the traditional prediction framework. At the end of the article, we predict the degree of harm that air pollutants may bring to the health of the population, and offer some suggestions.
However, the model proposed in this study still has room for optimization. For example, the spatial location information of each forecast station could be considered, and the prediction accuracy could be improved through the joint prediction of different sites. Additionally, as forecasting work in other areas shows, many variables that have not been added to the process of predicting pollutants, such as the wind speed, air pressure, humidity, and temperature, are also important factors affecting the air quality 82,83. In future studies, the prediction of air pollution will be further refined, and other variables that may affect air pollution will be added to further optimize the hybrid model and improve its effectiveness.

Data availability
All data generated or analyzed during this study are included in this article and its supplementary information files.