## Abstract

Air quality is closely linked to human health and social development, making accurate air quality prediction highly significant. The Air Quality Index (AQI) is inherently a time series. However, most previous studies have overlooked its temporal features and have not thoroughly explored the relationship between pollutant emissions and air quality. To address this issue, this study establishes a historical correlation model for air quality based on a time series model—the Gaussian Hidden Markov Model (GHMM)—using industrial exhaust emissions and historical air quality data. Firstly, a traversal method is used to select the optimal number of hidden states for the GHMM. To optimize the traditional GHMM and reduce error accumulation in the prediction process, the Multi-day Weighted Matching method and the Fixed Training Set Length method are utilized. Both direct and indirect prediction modes are then used to predict the AQI in the Zhangdian District. Experimental results indicate that the improved GHMM with the indirect mode provides higher accuracy and more stable state estimation results (MAE = 13.59, RMSE = 17.59, mean forecasted value = 117.94). Finally, the air quality historical correlation model is integrated with the air quality meteorological correlation model from a previous study, further improving prediction accuracy (MAE = 11.59, RMSE = 14.87, mean forecasted value = 120.88). This study demonstrates that the GHMM’s strong ability to analyze temporal features significantly enhances the accuracy and stability of air quality predictions. The integration of the air quality historical correlation model with the air quality meteorological correlation model from a previous study leverages the strengths of each sub-model in handling different feature groups, leading to even more accurate predictions.

### Similar content being viewed by others

## Introduction

Air quality is closely related to people’s lives. Severe air pollution significantly impacts the economy, society, and human health^{1,2,3}. When the human body is exposed to high concentrations of air pollutants for extended periods, it can lead to a range of health issues, including lung diseases, cardiovascular diseases, and respiratory diseases^{4,5,6,7,8,9}. The Air Quality Index (AQI), serving as a quantitative descriptor of air quality, can effectively depict the present level of air pollution or cleanliness and its health implications, as per the Chinese standard 'Technical Regulation on Ambient Air Quality Index (on trial) (HJ633-2012)'. Hence, precise prediction of AQI enables early warning and forecasting of air pollution, aiding governmental bodies in formulating proactive air pollution prevention measures.

In recent years, various models have been employed for air quality prediction, categorized into mechanism models and non-mechanism models based on their establishment methods. Mechanism models are numerical prediction models that account for numerous physical and chemical processes, encompassing the movement and transformation of atmospheric pollutants and photochemical reactions. These models often involve intricate equations and are optimized through meteorological factors, emission inventories, and other data^{10}Common mechanism models include the AMS/EPA Regulatory Model (AERMOD), the Community Multiscale Air Quality Modelling System (CMAQ), the Comprehensive Air Quality Model Extensions (CAMx) and the Weather Research Forecast-Chem (WRF-Chem)^{11}. Yang et al. utilized the WRF-CMAQ model to forecast the Air Quality Index (AQI) and its corresponding levels in Changzhou. However, the predicted AQI values tended to be lower than the actual values. The accuracy of AQI level prediction varied across different seasons, reaching its highest level at 54.90%^{12}. Sengupta et al. developed a high-resolution air quality early warning system based on the WRF-Chem model to forecast particulate matter concentration (PM_{10} and PM_{2.5}) and Air Quality Index (AQI) in New Delhi, India. This system integrated meteorological parameters including air temperature, relative humidity, wind speed, wind direction, and short-wave radiation. However, the prediction performance for PM_{10} and PM_{2.5} concentrations, as well as AQI, exhibited instability, with root mean squared errors (RMSE) generally exceeding 60.00. Additionally, the PM_{10} concentration was consistently overestimated^{13}. The numerical prediction model based on the mechanism of pollution processes heavily relies on both emission inventories of pollution sources and meteorological field data. The emission list of pollution sources is derived from the Technical Guidelines for the Preparation of Air Pollutant Emission Lists of Various Industries. Annual pollutant emissions are calculated using the emission coefficient method, typically based on data from the preceding 1–2 years^{14,15}. In reality, daily emissions of industrial waste gas exhibit significant fluctuations, rendering the emission inventory of pollution sources unable to reflect real-time changes in pollutant emissions. Moreover, the determination process of physical and chemical parameters is complex, and the absence of established scientific index thresholds further compounds the issue^{14}. These factors contribute to significant uncertainty in the predictive efficacy of the model. Consequently, the applicability of the numerical prediction model is limited.

Non-mechanism models mainly include statistical models and machine learning models^{16,17}, such as Random Forests (RF), Extreme Learning Machine (ELM), Support Vector Regression (SVR), Artificial Neural Network (ANN) and so on. These models can mine and analyze the characteristics of various factors affecting AQI. Therefore, non-mechanism models can predict the trend of pollutants over a period. Compared with numerical prediction models, non-mechanism models do not require extensive data, such as meteorological data and pollutant discharge inventories. The data dimensions needed for modelling and forecasting are relatively small, typically including historical air quality data and meteorological data. Liu and Zhang used ELM to predict the AQI in Beijing, Tianjin, and Shijiazhuang, and the results showed that the RMSE for each model was above 37.00^{18}. Zhu et al. predicted the AQI in Xingtai, China using a hybrid model that combined the LS-SVR and seasonal ARIMA algorithm. The RMSE was 24.46^{19}. Xu et al. combined an intelligent optimization search algorithm with SVR for AQI prediction in Taiyuan, achieving a MAPE (Mean Absolute Percentage Error) of 37.28%^{20}. Qin et al. utilized the improved Grasshopper Optimization algorithm (IGOA) to optimize the parameters of the BP neural network, establishing an air quality prediction model for Taiyuan City^{21}. The experimental results showed that the MAPE of the optimized BP neural network model could achieve 31.91%. The methods discussed in the literature above confirm that machine learning models can indeed be applied to AQI prediction. However, it’s worth noting that all these methods overlook the time-series nature of AQI. Different algorithms are suitable for different data characteristics, mainly due to the principles they use to process data. Nontime series algorithms, such as RF and SVR, usually assume that data points are independently and identically distributed, lacking consideration of temporal dependencies in the data. Therefore, these algorithms are not suitable for handling time series data with significant temporal correlation^{22}.

The AQI time series exhibits instability and nonlinear complexity^{19,20,21,23,24}. The time series characteristics of AQI indicate that past air quality impacts future air quality, showing a certain trend and periodicity. Therefore, future AQI can be predicted through statistical analysis of historical AQI data. Additionally, AQI is influenced by pollution-related meteorological conditions and pollution sources, leading to sudden changes. Thus, it is necessary to combine meteorological characteristics and the dynamic changes of pollution sources to predict air quality. Currently, commonly used air quality time series prediction models include ARIMA (Autoregressive Integrated Moving Average Model), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory Network), and CNN (Convolutional Neural Network), among others. Zhang et al. used a novel spatiotemporal algorithm, the bidirectional gated recurrent unit integrated with an attention mechanism (BiGRU), to predict AQI for ten cities in the Huaihai Economic Zone. The results showed that the prediction accuracy of the proposed model outperformed traditional machine learning methods, with an RMSE of 31.10^{25}. Chhikara et al. established an air quality prediction model of Delhi, India, based on CNN-LSTM, and the results showed that the RMSE was 221.68^{26}. Sethi and Mittal used the ARIMA model to predict air quality in Gurugram, India. Considering the concentration of O_{3}, CO, SO_{2}, PM_{2.5}, and NO_{2}pollutants along with meteorological parameters, their prediction accuracy had an RMSE of 66.80^{27}. The above studies have considered the historical feature of AQI when predicting air quality. However, these methods have certain limitations. For example, the ARIMA model is a univariate time series prediction model that requires the time series to be stable and linear. ARIMA typically cannot capture the nonlinear relationship within the data and does not align with the nonlinear and irregular characteristics of air pollutants. Therefore, the prediction accuracy of the ARIMA model is significantly limited by its linear mapping capability^{28}. In the application of RNN in long time series, there are issues of gradient vanishing and explosion. Although LSTM, which is an improvement over RNN, mitigates these problems, it still cannot completely solve them in long time series^{29,30}.

In the above studies, Zhu et al., Xu et al. and Qin et al. considered the influence of air pollutant concentrations^{19,20,21}. Meanwhile, Zhang et al^{25}. and Shishegaran et al^{28}. considered both meteorological factors and pollutant concentrations. Additionally, Liu and Zhang^{18}, Li et al^{23}. and Chhikara et al^{26}. incorporated the influence of historical AQI series. Hu et al. developed a hybrid prediction model for air quality at sparse monitoring stations by leveraging spatio-temporal features extracted from both the target station and its surrounding stations’ air quality and meteorological data^{30}. Sarkar et al. established a hybrid model for air quality prediction, aiming to enhance prediction accuracy through diverse feature selection and classification techniques^{31}. Regional air quality is comprehensively affected by meteorological factors, pollutant emissions, past air quality and the transmission of pollutants from surrounding areas. Achieving high accuracy and effectively combating air pollution solely by considering a single factor in AQI prediction poses a challenge. A multi-factor forecasting method is expected to yield better prediction results. However, meteorological data and historical air quality data exhibit distinct characteristics. For instance, meteorological factors entail strong uncertainty, while historical air quality data is characterized by its time-series nature. Consequently, combining meteorological factors and historical air quality data as input variables in the same model may inevitably affect prediction accuracy. This is because different machine learning models typically perform more effectively when handling different types of data features. Mixing data from different features in a single model may hinder the simultaneous capture of each feature’s influence^{32,33}. Hence, meteorological factors and historical air quality are modelled separately to thoroughly explore their respective impacts, and then integrated in subsequent steps. This approach accommodates the distinct data features of each factor and leverages the strengths of each sub-model in handling different feature groups^{30}.

In the early study of our research group^{34}, an air quality meteorological correlation model was developed using the RF algorithm. The model utilized daily industrial exhaust emissions and meteorological factors as input variables. By leveraging meteorological conditions, the model achieved the objective of dynamically adjusting the key operations of polluting enterprises and mitigating the conflict between environmental preservation and economic interests^{34,35}. Therefore, building on the meteorological correlation model, this study aims to separately investigate the impact of past air quality. The GHMM model, a classical machine learning sequential prediction model, is employed to establish the air quality historical correlation model. GHMM has the advantages of simple structure, few parameters and strong generalization ability. A further explanation regarding the transferability and generalizability of GHMM is provided in the Supplementary. The number of samples required by GHMM is much smaller than that of the neural network model, and GHMM can capture the time series characteristics of AQI series in the modelling process. The air quality historical correlation model is then integrated with the air quality meteorological correlation model to improve prediction accuracy (the structure of this study is displayed in Fig. 1). Additionally, the transmission of pollutants from surrounding areas exhibits spatial correlation and dynamic changes. This characteristic introduces significant challenges and complexities in modelling due to the variability in pollutant dispersion influenced by various factors such as weather conditions in surrounding regions and geographic features. These aspects will be further explored in subsequent research.

## Data and research methodologies

### Study area and data source

Zhangdian District (located in Zibo City, Shandong Province, between 36°04 ′30 "-36°54′ 00" N and 117°55 ′40 "-118°12′ 20" E) is one of the most important industrial bases in Shandong Province. In 2019, there were 133 industrial enterprises above designated size in 2019 in the district, including 108 heavy industries. In recent years, the air quality of Zibo City has consistently ranked last in Shandong province, with the number of days with heavy pollution consistently higher than the provincial average^{36}.

The datasets of this study include daily AQI, daily concentrations of six major pollutants (PM_{10}, PM_{2.5}, SO_{2}, NO_{2}, CO, O_{3}), and daily industrial exhaust emissions (including NO_{x} emissions, SO_{2} emissions, TSP emissions, and total waste gas emissions) recorded in Zhangdian District from 1/1/2017 to 31/12/2019. The daily emissions of industrial exhaust are obtained from Zhangdian District Bureau of Ecology and Environment. The daily AQI and concentrations of the six major air pollutants are collected from three national air monitoring stations in Zhangdian District: People’s New District, Dongfeng Chemical Plant, and New District. Data from these stations are sourced from https://www.aqistudy.cn/. The location of each monitoring station is shown in Fig. 2. The meteorological data are sourced from https://wheata.cn.

### AQI calculation equation

According to the “Ambient Air Quality Index (AQI) Technical Regulations (Trial) (HJ 633–2012)”, the Individual Air Quality Index (IAQI) is calculated using Eq. (1), and the AQI is calculated using Eq. (2).

In the above equation: IAQI_{P} is the individual air quality index of the pollutant P; C is the air quality concentration of the pollutant P; BP_{Hi} and BP_{Lo} are the high and low values of the pollutant concentration limits bracheting C_{p}; IAQI_{Hi} and IAQI_{Lo} are the IAQI values corresponding to BP_{Hi} and BP_{Hi}.

In the above equation: n represents the nth pollutant, where n = 6, corresponding to PM_{10}, PM_{2.5}, SO_{2}, NO_{2}, CO, and O_{3}.

### Improved GHMM algorithm

Hidden Markov Model (HMM) is widely used in speech recognition, fault diagnosis, biological information and financial markets^{37}. HMM is a probabilistic model consisting of dual stochastic processes. The state sequence of the model is unobservable, referred to as the hidden states sequence^{38,39}. Transitions between two hidden states in the sequence are random. Each hidden state randomly generates observation data according to the observation probability distribution function \({\text{b}}_{{\text{S}}_{\text{i}}}({\text{O}}_{\text{t}})\). The observation sequences consist of a series of observation data. However, the observation probability function of HMM is a discrete probability distribution function, while AQI series are continuous variables. Therefore, GHMM with a continuous observation probability function is suitable for constructing the air quality history correlation model. In this study, the observation sequences serve as the input variables of the model, which can be categorized into three types: AQI, IAQI and exhaust emissions. The structure of air quality historical correlation model based on GHMM is shown in Fig. 3, where \({a}_{ij}\) represents the transition probability between state \({S}_{i}\) and state \({S}_{j}\), and \({b}_{{Si}_{(\text{Ot})}}\) is the probability of generating observation \({O}_{t}\) from state \({S}_{i}\) at time t.

**Select the optimal number of hidden states.**In the Hidden Markov model, the complexity and running time of the model are influenced by the number of hidden states. The complexity of the GHMM increases with the number of hidden states^{40}. However, if the number of hidden states is too small, the model’s accuracy will decrease, and the desired prediction effect cannot be achieved. Therefore, selecting an appropriate number of hidden states improves the model’s accuracy and reduces the running time^{41}. The learning algorithm of the traditional GHMM model is typically based on known number of hidden states^{42}, which might not be suitable for substantive research. For example, Lolea and Stamule directly specified the maximum number of hidden states as 4 based on economic justification^{43}. However, determining the actual meaning of hidden states in the context of air quality prediction can be challenging. Various methods, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and Odd–Even-Half-Sampling (OEHS), are used to select the number of hidden states^{38}. Nevertheless, the AIC tends to underestimate the complexity of a model, the BIC tends to overestimate the complexity of the models, and the OEHS criterion divides the original series into odd-position sequences and even-position sequences, which may not be entirely suitable for GHMM based on continuous-time.

A traversal method is proposed to determine the number of hidden states in this paper. Firstly, the interval within which the number of hidden states will be traversed is specified. Then, within this interval, each number of hidden states is used to establish the corresponding GHMM under the same input data. Finally, the prediction results of GHMMs with different hidden states are compared, and the number of hidden states that corresponds to the highest prediction accuracy is selected as the optimal number of hidden states. Hidden states can be assigned a range of values based on their actual meaning in specific applications, or they can be independent of their physical meaning. They can be viewed as an abstract representation within the model, whose specific interpretation may not directly correspond to a physical concept in the real question, thus avoiding the limitations of subjective knowledge. This abstract nature of hidden states can aid the model in better learning patterns and regularities within the data, free from the constraints of prior knowledge^{42,43}. The study determines the range of hidden state numbers (2–7) based on specific physical meanings. For example, two states might represent increasing and decreasing trends, while seven states could denote significant increase, moderate increase, slight increase, no change, slight decrease, moderate decrease, and significant decrease. This approach ensures that the model captures sufficient information without becoming overly complex and prone to overfitting.

**The Multi-day weighted matching method.** The traditional GHMM parameters are trained using the Baum-Welch algorithm and the forward-backward algorithm. Simultaneously, the optimal hidden state sequence is determined through the Viterbi algorithm. Utilizing the optimal hidden state sequence and the model parameters—state transition probability matrix, the most probable state for the next day is identified. Subsequently, the AQI for the following day is computed based on the observation probability density function. However, this forecasting method may yield a fixed trend (monotonically increasing, monotonically decreasing, or no change) in the forecasted values for the next few days, resulting in a notable disparity between the predicted and actual values.

Considering the limitations of traditional GHMM in forecasting, the Multi-day Weighted Matching method is proposed as an alternative approach (the algorithm process is shown in Fig. 4).

Step 1: Choose the training data, train the GHMM model, and derive the optimal hidden state sequence (referred to as the optimal path) along with the state transition probability matrix.

Step 2: Determine the state (assumed to be S_{2}) and its corresponding probability \({P}_{T(S2)}\) for the last day (Day T) of the training data.

Step 3: In the state transition probability matrix, find the probability values most similar to \({P}_{T(S2)}\) and their corresponding dates. These dates are the “matching days”.

Step 4: Calculate the differential value between \({P}_{T(S2)}\) and \({P}_{j(S2)} \left(1\le j<T\right), \Delta P=\left|{P}_{j(S2)}-{P}_{T(S2)}\right|\), sort the ∆P from small to large, and select the top ∆P, which corresponding \({P}_{j(S2)}\) are \({P}_{1}\), \({P}_{2}\), \({P}_{3},\dots ,{P}_{d}\).

Step 5: Calculate the weight and the predicted AQI (Seen in Eq. (3) and Eq. (4)).

In the above equation: \({\omega }_{\text{i}}\) is the weight of \({P}_{i}\), d is the days we match, T is the number of training set, the y is the predicted AQI value, the \({y}_{T}\) is the AQI value of the last day. \({y}_{i}\) is the AQI value corresponding \({P}_{i}\), and the \({y}_{i+1}\) is the AQI value corresponding next day of \({P}_{i}\).

**Fixed the length of training set.** To enhance AQI prediction performance, the Fixed Training Set Length method is proposed to mitigate error accumulation. Initially, the test data is integrated into the training set instead of utilizing predicted data. Subsequently, the earliest data point in the original training set is removed to maintain a fixed training set length. Compared to the traditional GHMM approach, this method employs a consistent training set length, thereby reducing error accumulation and uncertainty stemming from increasing training set lengths. Algorithm 1 (refer to the Supplementary) outlines the pseudo-code for the Fixed Training Set Length method.

## The air quality historical correlation model based on improved GHMM algorithm

### Data analysis

The research datasets consist of daily AQI data, daily air pollutant concentration and industrial exhaust emissions data from Zhangdian District, Zibo City, spanning from January 1, 2017 to December 31, 2019. Statistical analysis revealed the presence of 5 groups of zero values in the datasets, which were subsequently removed, resulting in 1089 remaining groups of datasets. The GHMM algorithm places greater emphasis on the conditional probability among input variables and the distribution characteristics of the data. Therefore, utilizing the original data directly aligns with the principles of GHMM and benefits AQI prediction. The statistical features such as mean values, standard deviation, maximum, and minimum values of the datasets are presented in Table 1. The frequency histograms of the above datasets are displayed in Fig. 5. The trend diagram of AQI and six air pollutants is shown in Fig. 6.

As depicted in Fig. 5. and Fig. 6, the overall AQI in Zhangdian District during 2017–2019 exhibited significant fluctuations, with peaks typically occurring in winter. The AQI values were primarily clustered around the mean value. Furthermore, the distribution skewed to the left, suggesting that most AQI values fell below the mean of 99.77. Table 1 illustrates a standard deviation of 49.63, indicating relatively large fluctuations. The minimum AQI value recorded was 13, corresponding to excellent air quality, while the maximum value of 313 indicated severely polluted air. The considerable difference between these extremities underscores the extensive variation range of the AQI series in Zhangdian District, characterized by instability and nonlinearity. Predominantly, air quality levels ranged from good to lightly polluted. The AQI trend paralleled that of PM_{10}, PM_{2.5}, and NO_{2}, all peaking in winter with significant fluctuations. In contrast, the CO trend exhibited a relatively gentle pattern, while O_{3} concentration was more pronounced in summer and less so in winter.

Figure 7 depicts the Pearson correlation coefficient matrix diagram illustrating the relationship between each IAQI and AQI. The correlation scale ranges from 0.8 to 1, indicating highly strong correlation, from 0.6 to 0.8 for strong correlation, from 0.4 to 0.6 for moderate correlation, and less than 0.3 for weak correlation. In the Pearson analysis, significance levels were set as follows: **** indicates p ≤ 0.0001, *** indicates p ≤ 0.001, ** indicates p ≤ 0.01, and * indicates p ≤ 0.05. The results showed that except for the correlation between AQI and IAQI_{O3}, which was marked with *, all other correlations were marked with ****, indicating a high level of statistical significance. According to the Pearson correlation analysis results, the AQI in Zhangdian District is positively correlated with all IAQI variables. Among them, the AQI has the strongest correlation with IAQI_{PM2.5}, with a value of 0.78, indicating that PM_{2.5} concentration has the most significant impact on the AQI. The correlation between IAQI_{O3} and AQI is the weakest, with a value of only 0.07, suggesting that ozone concentration has a minimal effect on the AQI in this area. Therefore, ozone may not be a primary factor influencing AQI in Zhangdian District. Additionally, there is a strong positive correlation between IAQI_{PM2.5} and IAQI_{PM10}, with a correlation coefficient of 0.89, indicating a high similarity in their sources and variation trends. Except for IAQI_{O3}, all other IAQI variables are positively correlated with each other, while IAQI_{O3} is negatively correlated with the other IAQI variables. This could be due to the fact that the formation and depletion of ozone are closely related to the concentration changes of other pollutants such as NO_{x}. Typically, in photochemical reactions, NO_{x} reacts with volatile organic compounds (VOCs) to form ozone. Therefore, when the concentration of other pollutants is high, ozone may be consumed or its formation inhibited. In Zhangdian District, PM_{2.5} is the main pollutant affecting AQI and should be the primary focus for control. Moreover, controlling particulate matter pollution requires comprehensive consideration of the sources and mitigation measures for both PM_{2.5} and PM_{10}. Due to the negative correlation between ozone and other pollutants, managing ozone pollution necessitates a systematic approach to the coordinated control of NO_{x} and VOCs.

### Experiments and results

Four sets of experiments with different input variables are designed, each corresponding to Experiments 1, 2, 3, and 4. The datasets cover the period from 1/1/2017 to 30/11/2019 for training the model, while the data from 1/12/2019 to 31/12/2019 are reserved for testing to evaluate the model’s performance. For details on the modelling process and parameters, please refer to the Supplementary.

**Experiment 1.** Experiment 1 uses the direct mode for AQI prediction, with the input variable being only the historical AQI series. The model proposed in this study is compared with the traditional GHMM, ARIMA, and LSTM models. The experimental results are shown in Table 2. The summer month forecast result is provided in the Supplementary. It can be seen that the proposed model greatly improves AQI forecasting accuracy compared to the traditional GHMM, ARIMA, and LSTM models. The proposed model significantly enhances prediction accuracy and demonstrates a notable advantage in predicting AQI time series with non-linear and unstable characteristics. Therefore, the improved GHMM model proposed in this study is used for further analysis in Experiments 2, 3, and 4.

**Experiment 2.** To obtain more accurate results, Experiment 2 predicted the AQI using the indirect mode, since the AQI is calculated from the Individual Air Quality Index (IAQI). According to the Pearson correlation matrix plot (Fig. 7), the IAQIs affect each other. Therefore, different combinations of input variables are designed to predict each IAQI (shown in Fig. 8). After predicting each daily IAQI, the corresponding daily AQI is calculated using Eq. (2). The optimal combination of input variables for each IAQI, the corresponding prediction results for each IAQI, and the accuracy of the calculated AQI are shown in Table 3. As seen in Table 3, compared with Experiment 1, the accuracy of AQI prediction in the indirect mode is higher than in the direct mode, with MAPE improved by 14.64%, RMSE by 11.46%, and MAE by 24.08%

**Experiment 3.** The AQI is not only closely related to the IAQI but also influenced by pollutant emissions. To comprehensively predict the AQI, it is necessary to include industrial exhaust emissions. In Experiment 3, exhaust emissions are combined with the AQI as input variables for the proposed model. The combinations and corresponding prediction accuracy of the AQI are shown in Table 4. It can be seen that the prediction accuracy of AQI is highest when the input variables are AQI and total waste gas emissions compared to other input combinations.

**Experiment 4.** In Experiment 4, the model’s input data includes both IAQI and industrial exhaust emissions. The combinations of each IAQI and industrial exhaust emissions are shown in Fig. 9 based on Experiment 2. Table 5. presents the best combinations of IAQI and industrial exhaust emissions as input variables and their corresponding prediction accuracy. The AQI values and evaluation indices are calculated based on the predicted IAQI, which are also shown in Table 5.. It is found that the prediction accuracy of the IAQI, except for IAQI_{CO} and IAQI_{O3}, is improved when industrial exhaust emissions are considered as input variables. Furthermore, the prediction accuracy of AQI has been improved in Experiment 4.

**Comparison and analysis of each experiment.** The results from Experiment 1 to Experiment 4 are summarized in Table 6 and Fig. 10. The AQI prediction results of Experiment 3 are based on the combination of Total waste gas emissions and AQI. According to the evaluation metrics MAPE, RMSE, and MAE, the following conclusions can be drawn:

(i) Compared with the traditional GHMM, the precision of the improved GHMM model is significantly higher when predicting AQI in Zhangdian District.

(ii) The indirect mode has higher precision in predicting AQI than the direct mode (Experiment 1 compared with Experiment 2, Experiment 3 compared with Experiment 4).

(iii) Compared with Experiment 1, considering exhaust emissions in Experiment 3 does not significantly improve AQI prediction. This may be due to the varying primary pollutants on different days (as seen in Fig. 11). Consequently, the prediction results of Experiment 3 (d) (AQI and Total waste gas emissions) are higher than the other combinations in Experiment 3.

(iv) The prediction precision in Experiment 4 is higher than in all other experiments. This demonstrates that the indirect mode combined with industrial exhaust emissions yields relatively higher AQI prediction precision when fully considering historical AQI. Compared to Experiment 1, the AQI prediction results of Experiment 4 improved MAPE, RMSE, and MAE by 24.87%, 26.95%, and 30.38%, respectively. The prediction precision of IAQI_{CO} and IAQI_{O3} in Experiment 4 is lower than in Experiment 2 due to the lack of CO emissions and VOCs emissions data.

## Model fusion

In the previous project, the RF (Random Forest) algorithm was used to develop a meteorological correlation model for predicting AQI. This algorithm, based on multiple decision trees, is capable of handling high-dimensional meteorological features. In this meteorological correlation model, the input variables include various meteorological factors and pollutant emission data from January 1, 2017, to December 31, 2019. The specific meteorological factors are: precipitation, air temperature, relative humidity, wind speed, air pressure, total sunshine intensity, and precipitation.

The air quality historical correlation model based on the improved GHMM proposed in this study is combined with the air quality meteorological correlation model based on the RF algorithm established earlier in the project. The two models are fused using the weighted average method (as shown in Eq. (5), (6), and (7)) to establish an ensemble model for air quality prediction. In this fusion, \({w}_{G}\) and \({G}_{i}\) represent the weight and predicted value of the proposed GHMM model, respectively, while \({w}_{R}\) and \({R}_{i}\) represent the weight and predicted value of the RF model, respectively, and \({y}_{i}\) is the actual value of the AQI. The AQI prediction values of the improved GHMM model are adopted from the prediction results of Experiment 4. The calculated weights are \({w}_{G}\) = 0.5050 and \({w}_{R}\) = 0.4950. The fusion results are shown in Table 7. and Fig. 12. The results demonstrate that the AQI prediction precision is further improved by comprehensively considering the effects of meteorology, exhaust emissions, and historical air quality.

However, from the fusion results, it is evident that the ensemble model failed to predict the peak of AQI on the 9th to 10th days. This may be due to the fact that regional air quality is influenced not only by meteorological factors, pollutant emissions, and historical air quality, but also by the transmission of pollutants from surrounding areas. This study did not discuss the impact of pollutant transmission from neighbouring regions because the characteristics of various influencing factors differ. Mixing these inputs into a single model would inevitably affect prediction accuracy. Different machine learning models perform differently when handling various types of data features. The transmission of pollutants from surrounding areas has spatial correlations and dynamic variability, influenced by meteorological conditions, geographical features, and other factors affecting pollutant dispersion. GHMM or RF models struggle to capture the spatial diffusion and transmission characteristics of pollutants. Future research can explore establishing inter-city spatial correlation models for regional transmission to further improve the accuracy of AQI predictions.

## Conclusions

This study introduces an indirect mode for predicting the Air Quality Index (AQI), contrasting with the commonly employed direct mode. Initially, Individual Air Quality Index (IAQI) values are predicted, followed by the calculation of AQI based on these predictions, aligning more closely with the AQI calculation principle. Additionally, a novel air quality historical prediction model is established using GHMM, incorporating industrial exhaust emissions and historical AQI data as input factors. To address issues of randomness and subjectivity in selecting the number of hidden states in the model, a traversal method is employed. Furthermore, methods such as the Multi-day weighted matching and Fixed training set length are introduced to mitigate error accumulation in long time series prediction, thereby enhancing the model’s accuracy. Lastly, a novel air quality fusion method is proposed, integrating meteorological factors, historical air quality data, and industrial exhaust emissions, providing a more comprehensive approach to air quality forecasting.

Air quality is correlated with meteorological factors, air pollutant discharge, and historical air quality. Considering the characteristics of the AQI time series, an improved GHMM was developed to establish a novel air quality historical prediction model. Additionally, a new prediction mode called the indirect mode was introduced for AQI prediction, aligning more closely with the AQI calculation principle. To evaluate the proposed model, numerous comparative experiments were conducted using AQI time series data from Zhangdian District, covering the period from January 1, 2017, to December 31, 2019. Both indirect and direct modes were proposed for AQI prediction. The experimental results demonstrate that the model, based on the air quality historical prediction framework proposed in this study, achieves the best prediction performance using the indirect mode. Furthermore, by integrating the air quality historical correlation model, based on the improved GHMM, with the air quality meteorological correlation model, based on the RF algorithm established earlier, the prediction performance of the Ensemble model surpasses that of individual models.

The study did not directly consider the emissions from residential and traffic sources because their emission levels are relatively stable. Since the prediction of air pollutant concentrations is based on daily-scale data, the changes in residential and traffic sources occur at a low frequency and at a slower rate, lacking significant daily-scale fluctuations and making it difficult to capture daily rapid changes. Including these low-frequency, slow-varying factors in the model could introduce noise, thereby interfering with the model’s prediction accuracy. In the long run, residential and traffic sources will inevitably impact air quality, but this impact accumulates over time. This study proposes the fixed training set length method, which involves replacing the predicted data with test data in each new round of prediction, while simultaneously removing the first data point in the original training set to achieve a fixed training data length. This improved method indirectly reflects the impact of residential and traffic sources by treating the long-term stable changes in these sources as systematic errors. Additionally, it reduces the errors generated by outdated socio-economic information. In addition, air quality is also influenced by factors in surrounding regions. In future research, an intercity correlation model for air quality will be developed to further improve AQI prediction accuracy.

## Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

## References

Chen, Z., Wang, F., Liu, B. & Zhang, B. Short-Term and Long-Term Impacts of Air Pollution Control on China’s Economy.

*Environ. Manage.***70**, 536–547. https://doi.org/10.1007/s00267-022-01664-1 (2022).Piersanti, A.

*et al.*The Italian National Air Pollution Control Programme: Air Quality.*Health Impact and Cost Assessment. Atmosphere.***12**, 196 (2021).Xiao, Q.

*et al.*Environmental regulation, economic growth and air pollution: Panel threshold analysis for OECD countries.*Sci. Total Environ.***657**, 234–241 (2019).Han, Y.

*et al.*Effects of air pollution on cardiopuLmonary disEaSe in urban and peri-urban reSidents in Beijing: protocol for the AIRLESS study.*Atmos. Chem. Phys.***20**(24), 15775–15792 (2020).Balasooriya, N. N., Bandara, J. S. & Rohde, N. Air pollution and health outcomes: Evidence from Black Saturday Bushfires in Australia.

*Soc. Sci. Med.***306**, 115165 (2022).Coker, E. & Kizito, S. A. Narrative Review on the Human Health Effects of Ambient Air Pollution in Sub-Saharan Africa: An Urgent Need for Health Effects Studies.

*Int. J. Environ. Res. Public Health.***15**(3), 427 (2018).Li, F.

*et al.*Long-term exposure to air pollution and risk of incident inflammatory bowel disease among middle and old aged adults.*Ecotox. Environ. Safe.***242**, 113835 (2022).Shen, W., Yu, X., Zhong, S. & Ge, H. Population Health Effects of Air Pollution: Fresh Evidence From China Health and Retirement Longitudinal Survey.

*Front. Public Health.***9**, 779552 (2021).Darçın, M. Association between air quality and quality of life.

*Environ Sci Pollut Res.***21**, 1954–1959 (2014).Nopmongcol, U.

*et al.*Modeling Europe with CAMx for the Air Quality Model Evaluation International Initiative (AQMEII).*Atmos. Environ.***53**, 177–185 (2012).Ma, S.

*et al*. Multimodel simulations of a springtime dust storm over northeastern China: implications of an evaluation of four commonly used air quality models (CMAQ v5.2.1, CAMx v6.50, CHIMERE v2017r4, and WRF-Chem v3.9.1).*Geosci. Model Dev*.**12**(11), 4603–4625 (2019).Yang, W.

*et al*. The test of WRCMAQ model on the air quality forecast effect of Changzhou in 2016–2018. Proceedings of 2019 CSES Annual Conference on Environmental Science and Technology. Xi’an, Shanxi, China, 23–25 August, 2019.Sengupta, A.

*et al*. Probing into the wintertime meteorology and particulate matter (PM2.5 and PM10) forecast over Delhi.*Atmos. Pollut. Res*.**13**(6), 101426 (2022).Xu, J., Zhai, Y., Chang, L., Zhou, G. & Ma, J. Practice on Forecast and Evaluation Technique of Meteorology for Air Pollution in Shanghai.

*Adv. Meteorol. Sci. Technol*.**7**(06), 150–156 (in Chinese) (2017).Shang, Z., Kang, Y., Du, H. & Wang, S. Study on the relationship between air pollution and meteorological conditions in Beijing and their forecasting. J.

*Lanzhou Univ. (Nat. Sci.)*.**56**(03), 380–387 (in Chinese) (2020).Zaini, N., Ean, L. W., Ahmed, A. N. & Malek, M. A. A systematic literature review of deep learning neural network for time series air quality forecasting.

*Environ Sci Pollut Res.***29**, 4958–4990 (2022).Wang, D., Wei, S., Luo, H., Yue, C. & Grunder, O. A novel hybrid model for air quality index forecasting based on two-phase decomposition technique and modified extreme learning machine.

*Sci. Total Environ.***580**, 719–733 (2017).Liu, H. & Zhang, X. AQI time series prediction based on a hybrid data decomposition and echo state networks.

*Environ. Sci. Pollut. Res.***28**(37), 51160–51182 (2021).Zhu, S.

*et al.*Daily air quality index forecasting with hybrid models: A case in China.*Environ. Pollut.***231**, 1232–1244 (2017).Xu, T., Yan, H. & Bai, Y. Air Pollutant Analysis and AQI Prediction Based on GRA and Improved SOA-SVR by Considering COVID-19.

*Atmosphere.***12**, 336 (2021).Qin, P., Hu, H. & Yang, Z. The improved grasshopper optimization algorithm and its applications.

*Sci Rep.***11**, 23733 (2021).Zhou, Z.-H.

*Machine Learning*(Tsinghua University Press, Beijing, 2016).Li, H., Wang, J., Li, R. & Lu, H. Novel analysis–forecast system based on multi-objective optimization for air quality index.

*J. Clean Prod.***208**, 1365–1383 (2019).Jiang, F., He, J. & Tian, T. A clustering-based ensemble approach with improved pigeon-inspired optimization and extreme learning machine for air quality prediction.

*Appl. Soft. Comput.***85**, 10582 (2019).Zhang, K., The, J., Xie, G. & Yu, H. Multi-step ahead forecasting of regional air quality using spatial-temporal deep neural networks: A case study of Huaihai Economic Zone.

*J. Clean Prod.***227**, 123231 (2020).Chhikara, P., Tekchandani, R., Kumar, N., Guizani, M. & Hassan, M. M. Federated Learning and Autonomous UAVs for Hazardous Zone Detection and AQI Prediction in IoT Environment.

*IEEE Internet Things J.***8**(20), 15456–15467 (2021).Sethi, J. K. & Mittal, M. Analysis of Air Quality using Univariate and Multivariate Time Series Models. 2020 10th International Conference on Cloud Computing, Data Science & Engineering, Noida, India, 29–31 January 2020.

Shishegaran, A., Saeedi, M., Kumar, A. & Ghiasinejad, H. Prediction of air quality in Tehran by developing the nonlinear ensemble model.

*J. Clean Prod.***259**, 120825 (2020).Wang, J., Li, X., Li, J., Sun, Q. & Wang, H. NGCU: A New RNN Model for Time-Series Data Prediction.

*Big Data Res.***27**, 100296 (2021).Hu, Y., Xiaoxia, C. & Hanzhong, X. A hybrid prediction model of air quality for sparse station based on spatio-temporal feature extraction.

*Atmospheric Pollution Research.***14**, 101765 (2023).Sarkar, N., Gupta, R., Keserwani, P. K. & Govil, M. C. Air Quality Index prediction using an effective hybrid deep learning model.

*Environ. Pollut.***315**, 120404 (2022).Ali, R., Lee, S. & Chung, T. C. Accurate multi-criteria decision making methodology for recommending machine learning algorithm.

*Expert. Syst. Appl.***71**, 257–278 (2017).Ali, R.

*et al.*Algorithm selection using edge ML and case-based reasoning.*J. Cloud. Comp.***12**, 162 (2023).Liu, Y., Wang, P., Li, Y., Wen, L. & Deng, X. Air quality prediction models based on meteorological factors and real-time data of industrial waste gas.

*Sci Rep.***12**, 9253 (2022).Wang, P., Liu, Y., Fu, Y., Lin, Z. & Cheng, Z. A multi-dimensional conceptual association model based on PSO-LSSVM for regional AQI.

*Environ. Sci. Technol. (China)*.**43**(06), 108–114 (in Chinese) (2020).Statistical Bureau of Zibo, China, 2020. Statistical Yearbook of Zibo City 2020. Statistical Yearbook of Zibo City 2020. http://tj.zibo.gov.cn/gongkai/channel_c_5f9fa491ab327f36e4c13077_n_1605682681.5678/doc_600138f13a3e1f0c453d3e30.html. Accessed 27 June 2022.

Bhavya, M., Sunita, G. & Ajay, K. A systematic review of Hidden Markov models and their applications.

*Arch. Comput. Method Eng.***28**, 1429–1448 (2020).Celeux, G. & Durand, J. B. Selecting hidden Markov model state number with cross-validated likelihood.

*Comput. Stat.***23**, 541–564 (2008).Zhou, J., Song, X. & Sun, L. Continuous time hidden Markov model for longitudinal data.

*J. Multivar. Anal.***179**, 104646. https://doi.org/10.1016/j.jmva.2020.104649 (2020).Stoltz, M., Stoltz, G., Obara, K., Wang, T. & Bryant, D. Acceleration of hidden Markov model fitting using graphical processing units, with application to low-frequency tremor classification.

*Comput. Geosci.***156**, 104902 (2021).Li, D., Sun, Y., Sun, J., Wang, X. & Zhang, X. An advanced approach for the precise prediction of water quality using a discrete hidden markov model.

*J Hydrol.***609**, 127659 (2022).Li, Y. & Song, X. Order selection for regression-based hidden Markov model.

*J. Multivar. Anal.***192**, 105061 (2022).Lolea, L. C. & Stamula, S. Trading using Hidden Markov Models during COVID-19 turbulences.

*Manag. Mark.***16**(4), 334–351 (2021).

## Acknowledgements

This work was supported by the Key Research and Development Program of Sichuan Province (Grant No. 23ZDYF2652).

## Funding

Key Research and Development Program of Sichuan Province,23ZDYF2652.

## Author information

### Authors and Affiliations

### Contributions

L.W. and L.Y. conducted the conceptual design of the work and wrote the main manuscript text. L.W. and C.X. conducted the experiments. Z.L. conducted the simulations and data analysis. L.Y., Y.C., and Y.L. revised the manuscript. All authors reviewed the manuscript.

### Corresponding author

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Additional information

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary information

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

## About this article

### Cite this article

Liu, Y., Wen, L., Lin, Z. *et al.* Air quality historical correlation model based on time series.
*Sci Rep* **14**, 22791 (2024). https://doi.org/10.1038/s41598-024-74246-2

Received:

Accepted:

Published:

DOI: https://doi.org/10.1038/s41598-024-74246-2