Introduction

Air is one of the most valuable natural resources, essential to the survival of all living beings. Unfortunately, in recent years, human activities have increasingly contaminated it. Air pollution is now among the most substantial environmental concerns in almost all parts of the world. With extensive economic development around the globe, air quality has become a major issue, as deteriorating air quality has persistent and serious effects not only on human health but also on the ecosystem. The WHO has stated that around 90% of the world's population inhales polluted air (www.who.in). Moreover, the State of the Global Air report (2019) ranks air pollution as the fifth leading risk factor for mortality across the globe. The main pollutants affecting most nations comprise particulate matter (PM), nitrogen dioxide (NO2), lead (Pb), carbon monoxide (CO), ozone (O3), and sulphur dioxide (SO2). Rising levels of these pollutants, caused by industrial activities, vehicles, construction sites, power plants, and natural processes such as volcanic eruptions and forest fires, have a considerable impact on human wellbeing. Deteriorated air quality results in a range of ailments, including cardiovascular diseases and respiratory illness1,2,3. Owing to these numerous adverse effects on human well-being, this environmental issue must be taken seriously. Moreover, many nations are now collaborating to address the problem of increasing air pollution4.

The principal pollutant affecting human health is PM. PM is composed primarily of smoke, dust, soot, or liquid droplets discharged into the environment from industries, vehicles, construction sites, etc. Particles with an aerodynamic diameter below 2.5 μm are termed fine particles (PM2.5), and those with a diameter below 10 μm are known as coarse particles (PM10). PM penetrates deep into the respiratory system and the bloodstream, leading to many health hazards5,6. Hence, it is imperative to develop effective tools for monitoring PM levels, disseminating information regarding hazardous concentrations, and providing recommendations for preventive measures to mitigate such levels. Numerous investigations have been carried out, encompassing not only the quantification of PM levels but also the evaluation of potential health hazards associated with heightened PM exposure for the population. These studies greatly enhance our understanding of the complex public health challenges caused by the widespread effects of air pollution7,8,9,10,11,12,13,14,15,16.

Further, investigating all of the parameters that contribute to air pollution is an arduous task. To deal with this, air pollution models are needed to issue early warnings, guide control actions, and evaluate future emission scenarios17,18. The increasing role of machine learning in air quality prediction represents a significant leap forward in our ability to monitor and manage environmental health. Machine learning techniques have ushered in a new era of air quality forecasting, allowing us to harness vast amounts of data, including historical air quality information, meteorological data, and even satellite imagery. These algorithms can identify complex patterns and correlations within this data, enabling more accurate predictions of air quality parameters such as PM concentrations, ozone levels, and other pollutant concentrations. By providing real-time, high-resolution forecasts, machine learning models empower policymakers, environmental agencies, and the public to make informed decisions, take preventive measures, and mitigate the adverse effects of air pollution on public health and the environment. The growing integration of machine learning into air quality prediction signifies a promising avenue for advancing our understanding of air pollution dynamics and enhancing the quality of life for communities around the world. Among various statistical procedures, Artificial Neural Networks (ANNs) have been demonstrated to be particularly effective at capturing complex relationships and improving forecast accuracy19,20,21,22,23,24.

ANNs are computational models inspired by the structure and function of the human brain. They consist of interconnected nodes, or artificial neurons, organized into layers. These networks are used for various machine learning tasks, including pattern recognition, classification, regression, and even more complex tasks like natural language processing and image recognition. In ANNs, information flows through the network, with each neuron processing and transmitting data to the next layer. ANNs have gained widespread popularity due to their ability to handle complex and high-dimensional data, making them a crucial component of modern artificial intelligence and deep learning applications. They have been instrumental in advancing fields such as computer vision, speech recognition, and autonomous systems, among many others. Broadly, the different kinds of ANN include the back-propagation neural network25,26, multilayer perceptron27,28, radial basis function29,30, and adaptive neuro-fuzzy inference systems31,32.

The primary objective of this study is to assess the performance of ANNs trained with different algorithms for predicting PM2.5 concentration. Additionally, we have conducted a comparative analysis with the traditional multiple linear regression (MLR) model.

Background

Numerous researchers have evaluated ANN-based air quality prediction models, with particular emphasis on the accuracy of PM concentration predictions across a wide range of scenarios. ANNs have been used to predict hourly and daily average PM concentrations from air pollutant and atmospheric data33,34. For the city of Santiago, Chile, Perez et al.35 predicted hourly average PM2.5 concentrations several hours in advance, based on values measured at a fixed site. The ANN results showed prediction errors in the range of 30–60%. They also stressed that reducing noise in the dataset is essential for improving predictions. McKendry36 compared the ANN technique with classical regression techniques for PM10 and PM2.5 estimation and established that meteorological variables, persistence, and co-pollutant values effectively estimated PM levels. In another investigation, Chelani et al.37 developed an ANN procedure to predict PM10 and toxic metal concentrations in the city of Jaipur, India, and were able to estimate these concentrations reasonably well. Tecer38 proposed ANNs to estimate PM levels in Zonguldak Province, Turkey; the outcomes revealed that the technique can be used effectively to estimate air quality. Pires et al.39 evaluated the performance of five linear models for estimating daily average PM10 levels and confirmed that the size of the dataset is an important factor in model estimation. Paschalidou et al.40 employed a multilayer perceptron (MLP) to predict hourly PM10 levels in Cyprus, and the MLP models displayed the best estimation performance. Roy et al.41 used both multiple regression and ANN techniques to analyze PM levels in different seasons at a large opencast coal mine in India; their findings indicated that ANN-based forecasting outperformed the multiple regression models. Kurt & Oktay42 proposed an online ANN technique for predicting air pollutants that uses parameters obtained through geographic modeling for the Besiktas district of Istanbul. This system takes meteorological parameters, air pollutant levels, and certain area-specific attributes as input parameters. The ANN technique was applied in this study to develop a PM2.5 concentration prediction model. In Spain, another ANN model was proposed to estimate 24 h average PM10 levels, using deterministic variables that describe the overall transport of aerosols from arid areas43. An innovative approach was employed to forecast PM2.5 and PM10 levels in major Chinese cities. This approach integrated a feedforward ANN model with a rolling mechanism to capture input data patterns and the accumulated generating operation of the gray model to reduce the randomness of the data samples44. The prediction relied mainly on daily PM2.5 and PM10 values and on a few atmospheric parameters. Another study was conducted in Yasuj city to analyze the health impact of exposure to PM10 and to estimate PM10 levels using ANN45; it used daily average PM10 values together with climatic data. In general, among machine learning approaches, ANN has been the one most favored by researchers. The present study examines the ANN technique with different training functions to establish the most effective model for PM2.5 estimation.

Methodology

Study area and air quality data

In India, the Central Pollution Control Board (CPCB) is the apex institution that investigates and monitors air quality. It supervises air pollution through numerous stations located in nearly every city. Air quality across the country is systematically monitored through a combination of manual and continuous ambient air quality monitoring stations. At present, this network comprises a total of 1257 monitoring stations. Manual monitoring is undertaken at 883 stations, covering 378 cities and towns across 28 States and 7 Union Territories, while continuous monitoring is carried out at 374 stations situated in 190 cities and towns across 27 States and 4 Union Territories. The CPCB measures pollutants as well as atmospheric parameters. Responsibility for monitoring is shared with the State Pollution Control Boards (SPCB), Pollution Control Committees (PCC), and other reputable institutions; the CPCB works with these organizations to ensure uniform, consistent air quality data while offering technical and financial support46. The integrated data from manual and continuous monitoring for the year 2021 were used in this study, comprising the annual average values of SO2, NO2, PM10 and PM2.5 (in μg/m3), as shown in Fig. 1.

Figure 1

Values of air pollutant variables for the year 2021.

Modeling and selection of suitable input variables

The observed levels of the air pollutants PM10, PM2.5, NO2, and SO2 were investigated with the objective of building an air pollution estimation model. The dataset was sourced from the CPCB for the year 2021. Figure 2 visually represents the relationships among SO2, NO2, PM10, and PM2.5 levels. We observed a positive correlation among all these variables, signifying their relevance to the study. Notably, the correlation of PM2.5 was 0.31 with SO2, 0.61 with NO2, and highest, at 0.83, with PM10. Consequently, SO2, NO2, and PM10 were selected as the input variables for the PM2.5 air pollution estimation model.

Figure 2

Correlation matrix of air pollutants in India for the year 2021.
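As an illustration of the screening step described above, the following minimal sketch computes the Pearson correlation between each candidate input and PM2.5. The station values are hypothetical placeholders, not the CPCB data, and the function name `pearson_r` is chosen here only for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical annual averages (ug/m3) for five stations; NOT the CPCB dataset.
stations = {
    "SO2": [8.0, 12.0, 10.0, 15.0, 9.0],
    "NO2": [20.0, 35.0, 28.0, 40.0, 22.0],
    "PM10": [90.0, 150.0, 120.0, 200.0, 100.0],
    "PM2.5": [40.0, 70.0, 55.0, 95.0, 45.0],
}

# Correlation of every candidate input with the target variable PM2.5.
correlations = {
    name: pearson_r(values, stations["PM2.5"])
    for name, values in stations.items()
    if name != "PM2.5"
}

for name, r in correlations.items():
    print(f"corr({name}, PM2.5) = {r:.2f}")
```

Candidate inputs with a clearly positive correlation to the target would then be retained, which is the reasoning applied to SO2, NO2, and PM10 in this study.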

Estimating PM2.5

Multiple linear regression (MLR) model

Multiple linear regression (MLR) is a statistical technique used to assess the relationships between a single dependent variable and two or more independent variables. The method works by fitting a linear equation to the data, with coefficients representing the contribution of each independent variable to the dependent variable. The model aims to find the best-fitting line through the data points, which minimizes the sum of the squared differences between the observed and predicted values. MLR is a valuable tool for uncovering complex associations and understanding the underlying factors that influence a particular phenomenon. The formula for expressing the output dependent variable y in terms of independent variables x1, x2, …, xn is as follows:

$$y={\alpha }_{0}+{\alpha }_{1}{x}_{1}+{\alpha }_{2}{x}_{2}+\dots +{\alpha }_{n}{x}_{n}+\varepsilon ,\tag{1}$$

where n = number of independent variables, \({\alpha }_{0}\) = the y-intercept, \({\alpha }_{i}\) = coefficient of the independent variable \({x}_{i}\) (i = 1, …, n), and \(\varepsilon\) = model error.

In this investigation, PM2.5 is taken as the dependent variable, while SO2, NO2 and PM10 are considered as independent variables (n = 3). The MLR model computes the coefficients \({\alpha }_{0},\) \({\alpha }_{1}\),…,\({\alpha }_{n}\) using the least squares method.
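The least squares fit can be sketched as follows. This is a minimal illustration that solves the normal equations with plain Gaussian elimination; it is not the software used in the study. The data rows and the "true" coefficients used to generate the targets are hypothetical, and `fit_mlr` is a name introduced here for illustration.

```python
def fit_mlr(rows, targets):
    """Return least-squares coefficients [a0, a1, ..., an] for Eq. (1).

    Solves the normal equations (X^T X) a = X^T y by Gaussian elimination
    with partial pivoting. Assumes X^T X is non-singular.
    """
    design = [[1.0] + list(r) for r in rows]  # prepend intercept column
    k = len(design[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * t for d, t in zip(design, targets))]
        for i in range(k)
    ]
    # Forward elimination.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * k
    for i in range(k - 1, -1, -1):
        known = sum(aug[i][j] * coeffs[j] for j in range(i + 1, k))
        coeffs[i] = (aug[i][k] - known) / aug[i][i]
    return coeffs

# Hypothetical (SO2, NO2, PM10) rows in ug/m3; NOT the CPCB dataset.
rows = [
    (5.0, 20.0, 100.0),
    (10.0, 20.0, 100.0),
    (5.0, 40.0, 100.0),
    (5.0, 20.0, 200.0),
    (10.0, 40.0, 150.0),
]
# Targets generated from assumed coefficients so the fit can be checked.
targets = [2.0 + 0.5 * so2 + 0.1 * no2 + 0.4 * pm10 for so2, no2, pm10 in rows]

coefficients = fit_mlr(rows, targets)
print([round(c, 4) for c in coefficients])
```

Because the hypothetical targets are noise-free, the recovered coefficients should match the assumed ones up to rounding; with real measurements the residual term ε would be non-zero.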

Proposed ANN models

In the present analysis, the nftool of MATLAB (Version R2014b) was used and executed on a system equipped with an Intel HD Graphics card, a 17-inch display, 4 GB of memory, an Intel 11th generation i5 430 M processor, and a 512 GB SSD47,48,49. The ANNs were trained using NO2, SO2, and PM10 as input variables and PM2.5 as the target variable. The neural network architecture consisted of 20 neurons in the hidden layer, as depicted in Fig. 3. For the initial training phase, the dataset was partitioned into training (70%), validation (10%), and testing (20%) subsets. ANNs employ various training algorithms to adjust the network's parameters (weights and biases) in order to minimize errors and improve performance. In this study, the network was trained with three distinct algorithms in turn: the Levenberg–Marquardt, Bayesian Regularization, and Scaled Conjugate Gradient training algorithms.

Figure 3

The structure of the ANN layers.
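To make the setup concrete, the sketch below reproduces the 70/10/20 partition and the shape of a network with 3 inputs, 20 hidden neurons, and 1 output. It is an illustration in plain Python, not the MATLAB nftool workflow used in the study. The tanh hidden activation, linear output, weight initialization range, and random seed are assumptions made here, and no training algorithm is implemented.

```python
import math
import random

def split_indices(n_samples, train=0.7, val=0.1, seed=0):
    """Randomly partition sample indices into train/validation/test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train * n_samples))
    n_val = int(round(val * n_samples))
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],  # remainder (about 20%) is the test set
    )

def init_network(n_inputs=3, n_hidden=20, seed=0):
    """Create untrained weights for a single-hidden-layer network."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": rng.uniform(-0.5, 0.5),
    }

def forward(net, inputs):
    """Forward pass: tanh hidden layer (assumed) followed by a linear output."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(net["w_hidden"], net["b_hidden"])
    ]
    return sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"]

train_idx, val_idx, test_idx = split_indices(100)
net = init_network()
n_params = (sum(len(w) for w in net["w_hidden"]) + len(net["b_hidden"])
            + len(net["w_out"]) + 1)

# Inputs are hypothetical, pre-scaled (NO2, SO2, PM10) values.
prediction = forward(net, [0.2, 0.1, 0.6])
print(len(train_idx), len(val_idx), len(test_idx), n_params)
```

A training algorithm such as Levenberg–Marquardt would then adjust these weights and biases to reduce the error on the training subset, with the validation subset used to decide when to stop.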

Performance metrics

The assessment and differentiation of the MLR and the three proposed Artificial Neural Network (ANN) models are carried out by examining the Root Mean Square Error (RMSE) and Coefficient of Determination (R2). These metrics are defined as follows:

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}{\left(o\left(t\right)-p\left(t\right)\right)}^{2}},$$
$${\mathrm{R}}^{2}=1-\frac{\sum_{t=1}^{n}{\left(o\left(t\right)-p\left(t\right)\right)}^{2}}{\sum_{t=1}^{n}{\left(o\left(t\right)-\bar{o}\right)}^{2}},\quad \bar{o}=\frac{1}{n}\sum_{t=1}^{n}o\left(t\right),$$

where n = number of observations, o(t) = observed (actual) value of the variable, and p(t) = predicted value of the variable.
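Both metrics follow directly from their definitions, as in the minimal sketch below. The observed and predicted values are hypothetical and serve only to exercise the functions.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical PM2.5 values (ug/m3); NOT results from this study.
observed = [40.0, 70.0, 55.0, 95.0, 45.0]
predicted = [42.0, 68.0, 55.0, 91.0, 49.0]

print(f"RMSE = {rmse(observed, predicted):.4f}")
print(f"R2   = {r_squared(observed, predicted):.4f}")
```

A lower RMSE and an R2 closer to 1 both indicate closer agreement between predicted and observed concentrations.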

Simulation results and discussion

This study focuses on the estimation of PM2.5 concentrations using annual average data of SO2, NO2, and PM10 for the year 2021 as input parameters. The performance evaluation of the models was based on the Root Mean Square Error (RMSE) and the coefficient of determination (R2). These metrics provided insights into the effectiveness of the MLR model and the three ANN models: ANN trained using the Levenberg–Marquardt algorithm (LM-ANN), the Bayesian Regularization algorithm (BR-ANN), and the Scaled Conjugate Gradient algorithm (SCG-ANN).

The experiments entail a comparison between the MLR model and three ANN models. The results of this comparison, specifically the evaluation metrics RMSE and R2, are presented in Table 1 for reference.

Table 1 Statistical error indices.

The results revealed that the LM-ANN model outperformed the others, yielding the lowest RMSE of 9.5223 compared to 9.6555, 11.0165, and 11.7585 for BR-ANN, SCG-ANN, and MLR, respectively. Furthermore, the PM2.5 concentration estimated by the LM-ANN model demonstrates a strong correlation with observed values, with an R2 of 0.8164. In contrast, BR-ANN exhibits an R2 value of 0.8118, while SCG-ANN yields 0.7551, and MLR results in 0.7201.
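To put these differences in perspective, the relative reduction in RMSE with respect to the MLR baseline can be computed from the reported values. It works out to roughly 19% for LM-ANN, 18% for BR-ANN, and 6% for SCG-ANN. The short sketch below shows the arithmetic; the variable names are illustrative.

```python
# RMSE values as reported in the text for each model.
rmse_by_model = {
    "LM-ANN": 9.5223,
    "BR-ANN": 9.6555,
    "SCG-ANN": 11.0165,
    "MLR": 11.7585,
}

baseline = rmse_by_model["MLR"]

# Percentage by which each ANN model lowers the RMSE relative to MLR.
reduction_pct = {
    model: 100.0 * (baseline - value) / baseline
    for model, value in rmse_by_model.items()
    if model != "MLR"
}

for model, pct in reduction_pct.items():
    print(f"{model}: {pct:.1f}% lower RMSE than MLR")
```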

The correlation between the observed PM2.5 values and those estimated by the LM-ANN, BR-ANN, SCG-ANN, and MLR models is illustrated in Fig. 4. Additionally, Figs. 5, 6, and 7 present the regression plots for the LM-ANN, BR-ANN, and SCG-ANN models, respectively. Moreover, Fig. 8 provides a time series illustration of the observed and estimated PM2.5 values for the suggested models.

Figure 4

Correlation between observed and estimated PM2.5.

Figure 5

Regression plot of LM-ANN.

Figure 6

Regression plot of BR-ANN.

Figure 7

Regression plot of SCG-ANN.

Figure 8

Comparison of the observed and estimated values of PM2.5.

Notably, the results highlighted the superior performance of the LM-ANN model in comparison to the other models, signifying its enhanced capability in estimating PM2.5 concentrations.

The current investigation explores the effectiveness of various ANN techniques when applied to air quality modeling. This research not only sheds light on the adequacy of different ANN methodologies but also considers their relative strengths and weaknesses in the context of air quality modeling. The limitations of the proposed ANN models include their dependence on the consistency and size of the data used, along with the need to maintain identical controlling factors for the training and testing data. By examining these diverse ANN approaches, we gain a deeper understanding of how they perform and contribute to the field of air quality modeling.

Conclusion and future scope

Particulate matter (PM) is a major air pollutant known to have detrimental impacts on human health. This study involved the predictive analysis of PM2.5 levels by utilizing Artificial Neural Network (ANN) models based on data concerning SO2, NO2, and PM10 concentrations. The three distinct ANN models, namely LM-ANN, BR-ANN, and SCG-ANN were applied to India's air quality dataset for the year 2021 sourced from the CPCB. The error metrics, specifically Root Mean Square Error (RMSE) and R-squared (R2) were employed to assess the performance of these models. The findings demonstrated that the LM-ANN model exhibited superior performance compared to the other two ANN models and the Multiple Linear Regression (MLR) model. Moreover, these models have the potential to alert the public when PM concentration surpasses its prescribed level. Furthermore, the suggested models can be deployed to forecast real-time air quality trends using historical data, making them valuable tools for proactive planning and management of air pollution concerns. In summary, this ANN modeling approach offers a practical solution for governmental agencies to address air pollution issues and formulate effective strategies for mitigating their impact.

Regarding future research directions, we aim to extend our investigations into air pollution by integrating daily and hourly data, thereby enabling a more exhaustive analysis of pollution levels across diverse urban areas. Additionally, due to the prominent performance exhibited by the LM-ANN model, there is potential for further enhancements to fine-tune its capabilities in air quality prediction.