A comparative analysis of linear regression, neural networks and random forest regression for predicting air ozone employing soft sensor models

Zhou, Zheng; Qiu, Cheng; Zhang, Yufan

doi:10.1038/s41598-023-49899-0

Published: 16 December 2023


Scientific Reports volume 13, Article number: 22420 (2023)


Abstract

This study presents a comparative analysis of soft sensor modeling techniques for air ozone prediction. We compare three modeling techniques: linear regression (LR), neural networks (NN) and random forest regression (RFR), and we evaluate the impact of different input variable sets on prediction performance. Our findings indicate that neural network models, particularly recurrent neural networks (RNN), outperform the other modeling techniques in prediction accuracy. Among the variable sets, set E achieves the highest average prediction accuracy across the soft sensor models. Comparing set E with sets A, B, C and D shows that including the additional input feature PM₁₀ in the latter sets does not improve overall performance, potentially because of multicollinearity between PM₁₀ and PM₂.₅. Among the 72 sensors evaluated, sensor NN_R[Y]C outperforms all others, with an R² of 0.8902, an RMSE of 24.91 and an MAE of 19.16. With a prediction accuracy of 81.44%, sensor NN_R[Y]C is reliable and suitable for various technological applications. The results provide insights into soft sensor modeling for air ozone prediction.


Introduction

Background and importance of air ozone prediction

Air pollution, including compounds such as ozone, has become a global concern due to its detrimental effects on human health and the environment^1,2. Ozone is a reactive gas formed through complex photochemical reactions involving precursor pollutants such as nitrogen oxides (NOₓ) and volatile organic compounds (VOCs)^3,4,5. Elevated ozone levels in the atmosphere can contribute to respiratory issues, cardiovascular diseases, and lung inflammation in humans. They can also harm plants, reduce crop yields, and disrupt ecosystems. Accurately predicting ozone concentrations in the air is crucial for effective air quality management and the development of appropriate mitigation strategies. By forecasting ozone levels, policymakers, environmental agencies, and health professionals can take timely measures to reduce exposure and mitigate the potential health and ecological risks associated with high ozone concentrations. This can include implementing emission controls, adjusting industrial activities, and raising awareness among vulnerable populations.

Soft sensor modeling for air ozone prediction and its significance

Soft sensor modeling, also known as virtual sensing or data-driven modeling, enables the estimation of specific physical or chemical parameters using available data and mathematical models^6,7,8. In the context of air ozone prediction, soft sensor modeling involves constructing models using relevant environmental variables such as meteorological data, pollutant concentrations and historical ozone measurements to predict ozone levels in real-time or for future periods. This approach allows for the development of virtual sensors that provide continuous estimates of ozone concentrations, even in cases where physical sensors are not present or practical to deploy^9,10. The significance of soft sensor modeling lies in its ability to overcome limitations associated with physical sensors, such as cost, maintenance, and limited coverage. Soft sensors offer a cost-effective and flexible alternative for ozone prediction, enabling widespread monitoring and forecasting of ozone concentrations. Furthermore, soft sensor models can be continuously updated and optimized using new data, providing accurate and up-to-date information for decision-makers in air quality management and public health.

Objectives of the study

The main objective of this study is to compare and evaluate the performance of different soft sensor modeling techniques for air ozone prediction. Specifically, we compare the effectiveness of linear regression, neural networks and random forest regression in predicting ozone concentrations. These techniques were chosen due to their widespread usage and demonstrated capabilities in modeling complex relationships in environmental systems. Through this comparative analysis, we aim to identify the most suitable modeling technique for air ozone prediction based on criteria such as predictive accuracy, efficiency and interpretability. Additionally, we seek to explore the strengths and limitations of each modeling approach and provide insights into their practical applications in air quality management and decision-making.

Literature review

Overview of linear regression, neural networks and random forest regression

Air ozone prediction has been an important area of research due to the detrimental effects of ozone pollution on human health and the environment¹¹. In recent years, several studies have been conducted to develop and evaluate different methods for air ozone prediction. Here, we provide an overview of some key research findings and methodologies.

Linear regression

LR (linear regression) is a widely used modeling technique in statistics and machine learning. It aims to establish a linear relationship between the input variables and the target variable: the model predicts the continuous output variable as a linear combination of the input features. The coefficients of the input variables are estimated with an optimization method such as least squares. LR is simple to implement and interpret, making it a good choice for scenarios with linear relationships between variables. MLR (multiple linear regression) is the form of LR used when there are several input variables. MLR links a number of input variables (x_n) to a target variable (y) through Eq. (1)¹².

$$ {\text{y}} = {\text{w}}_{0} + {\text{w}}_{{1}} {\text{x}}_{{1}} + \cdots + {\text{w}}_{{\text{n}}} {\text{x}}_{{\text{n}}} $$

(1)

where w₀ is the intercept, w_i is the coefficient of input variable x_i (i = 1, …, n) and n is the number of input variables. Out-of-sample accuracy can be improved by regularization methods, which add a penalty on the size of the coefficients to the fitting objective and thereby restrict how freely the coefficients can vary during learning.
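As a minimal sketch of Eq. (1), the snippet below fits the weights by least squares and optionally adds a ridge (L2) penalty as one possible regularization method. The function names `fit_mlr` and `predict_mlr`, the `ridge` parameter and the synthetic data are illustrative assumptions; they are not taken from the study, which does not state which regularizer, if any, was used.

```python
import numpy as np

def fit_mlr(X, y, ridge=0.0):
    """Fit y = w0 + w1*x1 + ... + wn*xn by least squares (hypothetical helper).

    ridge > 0 adds an L2 penalty on w1..wn; the intercept w0 is not penalised.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that w[0] plays the role of the intercept w0.
    A = np.column_stack([np.ones(len(X)), X])
    penalty = ridge * np.eye(A.shape[1])
    penalty[0, 0] = 0.0
    # Solve the (regularised) normal equations (A^T A + penalty) w = A^T y.
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)

def predict_mlr(w, X):
    """Evaluate Eq. (1) for every row of X."""
    X = np.asarray(X, dtype=float)
    return w[0] + X @ w[1:]

# Synthetic, noise-free data for illustration only (not the paper's data set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([2.0, 0.5, -1.0, 3.0])  # w0, w1, w2, w3
y_demo = true_w[0] + X_demo @ true_w[1:]

w_ols = fit_mlr(X_demo, y_demo)                # plain least squares
w_ridge = fit_mlr(X_demo, y_demo, ridge=50.0)  # penalised fit, shrunken slopes
```

On noise-free data the unpenalised fit recovers the generating weights, while the penalised fit returns slope coefficients with a smaller norm, which is the shrinking effect described above.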

Nonlinear extension refers to the use of nonlinear feature functions to transform independent variables in linear regression, in order to capture nonlinear relationships in the data.

In LR, we assume that there is a linear relationship between the independent variables and the dependent variable. However, in real-world data, there may exist nonlinear relationships, where the relationship between the independent variables and the dependent variable cannot be accurately described by a simple linear model.

To address this issue, we can use nonlinear extension. This means applying some nonlinear functions to the independent variables to introduce nonlinear features in the model, in order to better fit the nonlinear relationships in the data.

For example, if there is a quadratic relationship between the independent variable x and the dependent variable y, we can square x to obtain x² as a new independent variable and then use both x and x² as input variables to build a linear regression model. This way, the model can capture the quadratic relationship between x and y. MLR with nonlinear extension (MLR-NE) links a number of input variables (x_n) to a target variable (y); Eq. (2) gives the form in which each input variable is squared.

$$ {\text{y}} = {\text{w}}_{0} + {\text{w}}_{{1}} {\text{x}}_{{1}}^{{2}} + \cdots + {\text{w}}_{{\text{n}}} {\text{x}}_{{\text{n}}}^{{2}} $$

(2)

In addition to using the square function, other nonlinear functions such as logarithmic, exponential and trigonometric functions can also be applied to transform the independent variables. This allows the model to adapt to more complex nonlinear relationships.

It is important to note that nonlinear extension can improve the fitting capability of the model and make it more suitable for nonlinear data. However, the resulting extended model may be more complex, less interpretable and have a risk of overfitting. Therefore, when performing nonlinear extension, a trade-off between the accuracy of model fitting and interpretability needs to be considered.
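The effect of a nonlinear extension can be illustrated with a small sketch that compares a plain linear fit with a fit on the squared input, following the form of Eq. (2). The helper names `least_squares` and `rmse` and the synthetic quadratic data are assumptions made for this illustration only.

```python
import numpy as np

def least_squares(features, y):
    """Ordinary least squares with an intercept; returns [w0, w1, ...]."""
    A = np.column_stack([np.ones(len(features)), features])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic example in which y depends on x quadratically (illustrative only).
x = np.linspace(-3.0, 3.0, 61).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] ** 2

# Plain linear model on x: cannot follow the curvature.
w_lin = least_squares(x, y)
pred_lin = w_lin[0] + x @ w_lin[1:]

# Nonlinear extension in the form of Eq. (2): the model is still linear
# in the weights, but the input variable is replaced by its square.
x_sq = x ** 2
w_ne = least_squares(x_sq, y)
pred_ne = w_ne[0] + x_sq @ w_ne[1:]
```

Because the transformed model remains linear in its weights, the same least-squares machinery applies; only the feature matrix changes.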

Data-driven models, such as regression-based approaches, have been widely used for air ozone prediction. Linear regression (LR) is a statistical modeling technique used to establish a linear relationship between a dependent variable and one or more independent variables. In air ozone prediction, LR models can be employed to identify correlations between ozone levels and relevant factors, such as temperature, humidity, wind speed and pollutant concentrations. Researchers have utilized various variables, including meteorological parameters, pollutant concentrations and emission data, to develop accurate prediction models. For example, Wei Zhao employed multiple linear regression to predict ozone levels based on boundary layer height, humidity, wind direction, surface solar radiation, total cloud cover and sea level pressure in Hong Kong¹³.

Neural networks

BPNN (Backpropagation Neural Networks) and RNN (Recurrent Neural Networks) are two commonly used artificial neural networks, respectively suitable for regression tasks and sequential data processing.

BPNN utilizes the backpropagation algorithm to train the network by iteratively adjusting the weights and biases of the neurons to minimize the difference between the predicted and actual output, as shown in Fig. 1. This iterative process helps the model capture complex non-linear relationships between input and output variables, making it suitable for various regression problems.
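The training loop described above can be sketched for a network with a single hidden layer. This is an illustrative toy rather than the architecture used in the study: the layer size, tanh activation, learning rate, epoch count, the names `train_bpnn` and `predict_bpnn`, and the synthetic sine data are all assumptions.

```python
import numpy as np

def train_bpnn(X, y, hidden=8, lr=0.02, epochs=3000, seed=0):
    """Train a one-hidden-layer regression network by backpropagation."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    n = len(y)
    losses = []
    for _ in range(epochs):
        # Forward pass: hidden activations, then a linear output unit.
        h = np.tanh(X @ W1 + b1)
        y_hat = (h @ W2 + b2)[:, 0]
        err = y_hat - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: propagate the mean-squared-error gradient
        # from the output layer back to the hidden layer.
        d_out = (2.0 / n) * err[:, None]
        grad_W2 = h.T @ d_out
        grad_b2 = d_out.sum(axis=0)
        d_hidden = (d_out @ W2.T) * (1.0 - h ** 2)  # derivative of tanh
        grad_W1 = X.T @ d_hidden
        grad_b1 = d_hidden.sum(axis=0)
        # Gradient-descent update of weights and biases.
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
    return (W1, b1, W2, b2), losses

def predict_bpnn(params, X):
    """Forward pass with trained parameters."""
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2)[:, 0]

# Synthetic nonlinear target for illustration only.
X_demo = np.linspace(-2.0, 2.0, 50).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0])
params, losses = train_bpnn(X_demo, y_demo)
```

The recorded `losses` list makes it possible to check that the error between predicted and actual output falls as the weights are adjusted.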

RNN is a type of neural network designed to process sequential data, such as time series or text data. Unlike BPNN, RNN has a feedback mechanism that allows information to be carried forward through time loops, as shown in Fig. 2. This recurrent structure enables RNN to capture temporal dependencies and contextual information within the data. In regression tasks, RNN can model the sequence of input variables and predict the corresponding continuous output. It is particularly useful for problems where past inputs have a significant impact on current predictions.
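The feedback mechanism can be made concrete with a forward pass through a simple (Elman-style) recurrent cell. The weights below are random and untrained, and the function name `rnn_predict`, the layer sizes and the two toy sequences are assumptions for illustration; the sketch only shows how the hidden state carries past inputs into the current prediction.

```python
import numpy as np

def rnn_predict(seq, Wx, Wh, b, w_out, b_out):
    """Run a simple recurrent cell over seq (T, n_features); one output per step."""
    h = np.zeros(Wh.shape[0])
    outputs = []
    for x_t in seq:
        # The new hidden state mixes the current input with the previous state.
        h = np.tanh(x_t @ Wx + h @ Wh + b)
        outputs.append(float(h @ w_out + b_out))
    return np.array(outputs)

# Random, untrained weights (illustrative only).
rng = np.random.default_rng(1)
n_features, n_hidden = 2, 4
Wx = rng.normal(scale=0.5, size=(n_features, n_hidden))
Wh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)
w_out = rng.normal(size=n_hidden)
b_out = 0.0

# Two sequences that end with the same input but have different histories.
seq_a = np.array([[0.1, 0.2], [0.3, 0.1], [0.5, 0.5]])
seq_b = np.array([[0.9, -0.4], [-0.7, 0.8], [0.5, 0.5]])
out_a = rnn_predict(seq_a, Wx, Wh, b, w_out, b_out)
out_b = rnn_predict(seq_b, Wx, Wh, b, w_out, b_out)

# With the recurrent weights set to zero the cell has no memory, so the
# final outputs depend only on the (identical) last input.
no_memory = np.zeros_like(Wh)
flat_a = rnn_predict(seq_a, Wx, no_memory, b, w_out, b_out)
flat_b = rnn_predict(seq_b, Wx, no_memory, b, w_out, b_out)
```

Comparing `out_a[-1]` with `out_b[-1]`, and `flat_a[-1]` with `flat_b[-1]`, shows the difference between a model with memory and a purely feed-forward mapping.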

Machine learning techniques have gained popularity in air ozone prediction due to their ability to capture complex relationships in data. Neural networks are computational models inspired by the structure and functioning of biological neural networks. These models consist of interconnected nodes (neurons) organized in layers and are trained using optimization algorithms to learn complex patterns in the data. For air ozone prediction, neural networks can capture nonlinear relationships between predictor variables and ozone concentrations. Both BPNN and RNN have been utilized for ozone prediction. Because the feedback connections of RNN allow information to flow between time steps, RNN is well suited to time series analysis and has shown promise in capturing temporal dependencies and patterns in ozone data^14,15. Wang Dongsheng et al. developed an RNN model to predict hourly ozone concentrations at air quality monitoring stations in the Yangtze River Delta, China¹⁶.

Random forest regression

RFR (random forest regression) is an ensemble learning technique that combines the power of decision trees and randomness. It constructs a multitude of decision trees using random subsets of the training data and randomly selected subsets of the input variables. Each decision tree makes independent predictions and the final prediction is obtained by averaging the predictions of all the trees, as shown in Fig. 3. RFR handles both linear and non-linear relationships, effectively captures complex interactions between input variables and is robust against overfitting. It is particularly suitable for high-dimensional data with categorical and numerical features and performs well even in the presence of outliers and missing values.
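The construction described above (bootstrap samples of the training data, a random subset of input variables at each split, and averaging over trees) can be sketched in a few lines. This is a simplified illustration under assumed settings, not the implementation used in the study: the function names, tree depth, leaf size, number of trees and the synthetic data are all hypothetical.

```python
import numpy as np

def build_tree(X, y, rng, max_features, depth=0, max_depth=4, min_leaf=5):
    """Grow a small regression tree; each split looks at a random feature subset."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return float(y.mean())
    best = None
    features = rng.choice(X.shape[1], size=max_features, replace=False)
    for j in features:
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            n_left = int(left.sum())
            if n_left < min_leaf or len(y) - n_left < min_leaf:
                continue
            # Sum of squared errors of the two child nodes after the split.
            sse = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, j, t, left)
    if best is None:
        return float(y.mean())
    _, j, t, left = best
    return (j, t,
            build_tree(X[left], y[left], rng, max_features, depth + 1, max_depth, min_leaf),
            build_tree(X[~left], y[~left], rng, max_features, depth + 1, max_depth, min_leaf))

def predict_tree(node, x):
    """Follow the splits until a leaf (a float) is reached."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

def fit_forest(X, y, n_trees=25, max_features=1, seed=0):
    """Train each tree on a bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))  # sample rows with replacement
        forest.append(build_tree(X[idx], y[idx], rng, max_features))
    return forest

def predict_forest(forest, X):
    """Average the independent predictions of all trees."""
    return np.array([np.mean([predict_tree(tree, x) for tree in forest]) for x in X])

# Synthetic data with a step in feature 0 and a linear trend in feature 1.
rng_demo = np.random.default_rng(42)
X_demo = rng_demo.uniform(0.0, 1.0, size=(200, 2))
y_demo = np.where(X_demo[:, 0] > 0.5, 10.0, 0.0) + 2.0 * X_demo[:, 1]

forest = fit_forest(X_demo, y_demo)
pred = predict_forest(forest, X_demo)
mse_forest = float(np.mean((pred - y_demo) ** 2))
mse_mean = float(np.mean((y_demo.mean() - y_demo) ** 2))  # constant baseline
```

Comparing `mse_forest` with `mse_mean` gives a quick check that the averaged trees capture the non-linear step that a constant prediction misses.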

Ensemble models, such as RFR and gradient boosting, have also been applied for air ozone prediction^17,18. In RFR, each decision tree is built using a random subset of features and the final prediction is determined by aggregating the predictions from individual trees. RFR is known for its robustness, ability to handle high-dimensional data and resistance to overfitting¹⁹. For instance, Massimo Stafoggia et al. used RFR to predict daily ozone concentrations in Sweden, considering various meteorological variables such as air temperature, cloud coverage, barometric pressure and snow albedo²⁰.

Applications of methods in environmental prediction

LR, NN and RFR have been widely employed in various environmental prediction tasks beyond air ozone prediction.

Water quality prediction

These methods have found applications in areas such as water quality prediction. LR, NN and RFR have been used to predict water quality parameters, including dissolved oxygen levels, pH and nutrient concentrations^21,22,23.

Air pollutant concentration modeling

NN and RFR have been applied to forecast concentrations of air pollutants, such as particulate matter (PM) and nitrogen dioxide (NO₂)^24,25.

Environmental impact assessment

LR and NN have been applied for environmental impact assessment, such as global warming, human health, metal depletion, freshwater ecotoxicity, particulate matter formation and terrestrial acidification^26,27,28.

These examples highlight the versatility and effectiveness of these modeling techniques in addressing a range of environmental prediction tasks.

Performance of prediction models in ozone prediction

LR, NN and RFR are prediction models based on different principles and algorithms. LR predicts by fitting a linear relationship between input features and output variables. NN utilizes multi-layered neuron networks to establish nonlinear mapping relationships. RFR combines multiple decision tree models through ensemble learning to enhance prediction performance.

To accurately predict ozone concentrations and trends, various prediction methods have been employed. Table 1 compares the performance of commonly used prediction models in ozone prediction.

Table 1 Methods used in ozone concentrations prediction.
