Random forest method for estimation of brake specific fuel consumption

The internal combustion engine is a widely used power source in various fields, and its energy utilization is measured by brake specific fuel consumption (BSFC). The BSFC map plays a crucial role in the analysis, optimization, and assessment of internal combustion engines. However, due to cost constraints, some values on the BSFC map must be estimated using techniques such as K-nearest neighbor (KNN), inverse distance weighted (IDW) interpolation, and the multi-layer perceptron (MLP), which offer limited accuracy, particularly for sparsely distributed sampled data. To address this, an improved random forest (RF) method is proposed for the estimation of BSFC. Polynomial features are employed to lift the input into a higher-dimensional feature space through a nonlinear transformation, and the critical parameters are optimized by a particle swarm optimization algorithm. The performance of different methods was compared on two datasets when estimating 20%, 30%, and 40% of the BSFC data, and the results show that the proposed method outperforms other common methods and is suitable for estimating the BSFC map.


IDW method
The IDW method is a conventional and efficient interpolation technique. Its fundamental concept is to assign higher weights to the training points that lie closer to the interpolation point. Let the coordinates of the n known points be (X_i, Y_i, Z_i), i = 1, 2, 3, …, n. The z value at a point (x, y) is then given as

z(x, y) = Σ_{i=1}^{n} d_i^{-2} Z_i / Σ_{i=1}^{n} d_i^{-2},   (1)

where d_i^{-2} is the inverse of the squared Euclidean distance from (x, y) to (X_i, Y_i). The weights in this method follow a normalization condition, and it is evident that the closer a point is to the interpolation point, the higher the weight assigned to it.
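As an illustration, the IDW estimate described above can be sketched in a few lines of numpy. The helper name `idw_interpolate` and its signature are illustrative, not code from this study:

```python
import numpy as np

def idw_interpolate(xy_known, z_known, xy_query, power=2):
    """Inverse distance weighted interpolation over 2D coordinates."""
    xy_known = np.asarray(xy_known, dtype=float)
    z_known = np.asarray(z_known, dtype=float)
    xy_query = np.atleast_2d(np.asarray(xy_query, dtype=float))
    z_out = np.empty(len(xy_query))
    for k, q in enumerate(xy_query):
        d = np.linalg.norm(xy_known - q, axis=1)   # Euclidean distances
        if np.any(d == 0):                         # query hits a known point
            z_out[k] = z_known[d == 0][0]
            continue
        w = d ** (-power)                          # inverse squared distances
        z_out[k] = np.sum(w * z_known) / np.sum(w) # normalized weighted mean
    return z_out
```

A point midway between two known values receives equal weights, so the estimate is their mean, matching the normalization condition of the weights.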

OK method
The OK method is based on the assumption that the data space has uniform expectation and variance. It uses optimal estimation to obtain the data for unknown points. This geostatistical technique is widely applied in fields such as geographical sciences, environmental sciences, and atmospheric sciences. The OK method has been utilized for deposit Cu concentration 14 and has been reported to provide high-fidelity uncertainty quantification in composite shell dynamics 18.

MLP method
The MLP method employs cascaded neurons that use a sigmoid nonlinear function to map the input to the output, enabling the approximation of any nonlinear function. Thus, the neural network can approximate any given multivariable continuous function, including drawing characteristic curves for power machines. This method is highly flexible and possesses a strong nonlinear mapping ability, making it a broadly applicable computational technique. It has found use in numerous applications, such as predicting macroclimate index runoff in atmospheric science 19 and assessing the sensitivity to flood temperature in geographical research 20.
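A minimal MLP regression sketch, assuming scikit-learn's `MLPRegressor` with sigmoid ("logistic") hidden units. The speed/power data here are synthetic stand-ins, not the engine measurements used in this study:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
speed = rng.uniform(1000.0, 5000.0, 60)        # r/min, synthetic
power = rng.uniform(5.0, 60.0, 60)             # kW, synthetic
X = np.column_stack([speed, power])
y = 320.0 - 0.01 * speed + 0.5 * power         # synthetic BSFC-like target

# sigmoid ("logistic") hidden units, as in the classical MLP described above
mlp = MLPRegressor(hidden_layer_sizes=(16, 16), activation="logistic",
                   solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)
pred = mlp.predict(X[:5])
```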
Improved RF method for the estimation of BSFC

RF method
RF is a regression method based on trees and has the benefits of strong prediction ability, low overfitting risk, and high interpretability 8,9. This method is computationally efficient and exhibits superior speed and accuracy 14,15. It has been widely applied in various fields, including environmental science, agriculture, and engineering. For instance, it has been utilized to classify medical images 21 and predict indoor radon concentration 22.
RF is one of the most widely used ensemble learning methods. It employs a large number of regression trees for ensemble learning, with random attribute selection during the training process. The regression tree serves as the fundamental learner for RF regression. As in other machine learning techniques, the features and labels are referred to as X and Y, respectively, N denotes the number of samples, and D denotes the training data set:

D = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}.

A regression tree corresponds to a partition of the feature space and the labels on the partitioned units. Dividing the feature space into M units R_1, R_2, …, R_M, each unit R_m with a fixed output value c_m, the regression tree model can be represented as

f(x) = Σ_{m=1}^{M} c_m I(x ∈ R_m),

where I(·) is the indicator function. The squared error

E = Σ_{x_i ∈ R_m} (y_i − f(x_i))^2

expresses the prediction error of the regression tree for the training data once the partition of the feature space has been determined, and it is used to determine the optimal output value on each unit. In the RF method, the following algorithm is used to generate a regression tree.
Step 1: Select the j-th variable x^(j) and a value s as the splitting variable and splitting point, respectively. The two regions are defined as

R_1(j, s) = {x | x^(j) ≤ s},  R_2(j, s) = {x | x^(j) > s}.

Step 2: Solve the following problem to obtain the optimal j and s, which divide the input space into the two regions R_1 and R_2:

min_{j, s} [ min_{c_1} Σ_{x_i ∈ R_1(j, s)} (y_i − c_1)^2 + min_{c_2} Σ_{x_i ∈ R_2(j, s)} (y_i − c_2)^2 ].
It is easy to see that the optimal value ĉ_m of c_m on a unit R_m is the mean of the outputs y_i corresponding to all input instances x_i in that unit:

ĉ_m = (1 / N_m) Σ_{x_i ∈ R_m} y_i,

where N_m is the number of samples in R_m.

Step 3: Repeat steps 1 and 2 for R_1 and R_2, respectively, until a termination condition is reached. The termination condition can be that each unit contains a single sample, that all samples have been used, or that the number of units has reached a specified limit.
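The minimization in Step 2 amounts to an exhaustive search over candidate variables j and split points s, with the mean as the optimal constant on each side. A minimal sketch (the function name `best_split` is illustrative):

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the (j, s) of Step 2 on a small data set."""
    best_j, best_s, best_err = None, None, np.inf
    for j in range(X.shape[1]):                 # candidate splitting variables
        for s in np.unique(X[:, j]):            # candidate splitting points
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # the optimal constant on each region is its mean (c_hat)
            err = np.sum((left - left.mean()) ** 2) \
                + np.sum((right - right.mean()) ** 2)
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s, best_err
```

On data with two well-separated clusters of outputs, the search finds the split between them with zero residual error.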
The RF method creates a training subset by randomly sampling D and evaluates the error on the remaining samples. Multiple regression trees are then generated with the algorithm above, except that instead of using all features, a specified number of features is randomly selected at each split. A total of NT regression trees are generated, denoted {f_1(x), f_2(x), …, f_NT(x)}. If the weights are set to W = {w_1, w_2, …, w_NT} with w_1 = w_2 = ⋯ = w_NT = 1/NT, the regression prediction for feature x is

f(x) = Σ_{i=1}^{NT} w_i f_i(x) = (1/NT) Σ_{i=1}^{NT} f_i(x).   (2)

It is evident that the diversity in RF integration arises not only from sample perturbations but also from attribute perturbations. This results in greater variation between individual trees, leading to strong adaptability and anti-interference ability toward the data.
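The equal-weight 1/NT averaging in Eq. (2) can be checked directly, assuming scikit-learn's `RandomForestRegressor` (the synthetic data are for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 2))
y = X[:, 0] + 2.0 * X[:, 1]                    # simple synthetic target

# n_estimators plays the role of NT; max_features=1 forces random
# attribute selection from the two available features at each split
rf = RandomForestRegressor(n_estimators=100, max_features=1, random_state=0)
rf.fit(X, y)

# the forest prediction equals the equal-weight (1/NT) mean over the trees
manual = np.mean([tree.predict(X[:3]) for tree in rf.estimators_], axis=0)
assert np.allclose(manual, rf.predict(X[:3]))
```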

Improved RF method
In the RF algorithm, decision trees are generated directly from the features. In machine learning, adding nonlinear functions of the input features can be an effective way to increase the capacity of the model. Therefore, this study introduces polynomial features, which generate higher-dimensional features and interaction terms. Polynomial features are a means of increasing dimensionality and performing a nonlinear transformation in machine learning. They combine and expand the original features, improving the model's expressive ability and fitting performance.
Let the feature vector be x = [x_1, x_2, …, x_m], and define the polynomial feature of degree 0 as φ_0(x) = 1. The d-th polynomial feature can be represented by the iterative formula

φ_d(x) = [φ_{d−1}(x), ψ_d(x)],

where ψ_d(x) is the row vector containing all monomials of degree d formed from the variables x_1, x_2, …, x_m.
When the RF method is used to estimate BSFC, the d-degree polynomial feature φ_d(x) of the feature x serves as the input to the RF regression model. This incorporates more combinations of the original features into the generation of the decision trees, enhancing their fitting and expressive abilities.
The polynomial feature φ_d(x) is used as the model input, and a proportion p of all features is considered when the nodes split.
For a given feature vector x and its polynomial feature φ_d(x), the predicted value ŷ = F(φ_d(x)) is obtained using a model F. The map from the feature vector x to ŷ is called a polynomial-feature RF model f_(d,p)(x) with hyperparameters (d, p).
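A minimal sketch of the polynomial-feature RF model f_(d,p)(x), assuming scikit-learn, where `PolynomialFeatures` computes φ_d(x) and `max_features` as a float plays the role of the split-time proportion p. The data are synthetic stand-ins for the engine measurements:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(80, 2))        # synthetic speed/power stand-in
y = 200.0 + 50.0 * X[:, 0] * X[:, 1]           # synthetic BSFC-like target

d, p = 3, 0.5                                  # hyperparameters (d, p)
model = make_pipeline(
    PolynomialFeatures(degree=d),              # phi_d(x), including the 1 term
    RandomForestRegressor(n_estimators=50, max_features=p, random_state=0),
)
model.fit(X, y)
pred = model.predict(X[:4])
```

With m = 2 and d = 3, the transformed input has C(5, 3) = 10 dimensions, so each split draws from 5 of the 10 candidate features.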

Parameter optimization based on particle swarm algorithm
When polynomial features are introduced, the feature dimension for the feature vector x = [x_1, x_2, …, x_m] increases from m to C(m + d, d) = (m + d)! / (d! m!) for the polynomial feature φ_d(x). However, too many polynomial features cause slow training because of the large number of feature dimensions and may lead to overfitting, while too few features can result in underfitting. Thus, the degree d of the polynomial feature needs to be selected carefully.
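The dimension formula can be verified directly with the standard library; with m = 2 original features (for example, speed and power) and degree d = 3, the dimension grows from 2 to 10:

```python
from math import comb, factorial

m, d = 2, 3   # two original features, cubic polynomial features
dim = comb(m + d, d)                            # C(m + d, d)
# the binomial coefficient agrees with the factorial form (m+d)!/(d! m!)
assert dim == factorial(m + d) // (factorial(d) * factorial(m))
```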
Similarly, in decision tree generation, the parameter p represents the proportion of features considered at each split relative to the total number of features. Too many features lead to a complex model that is sensitive to noise and randomness, while too few features may cause underfitting, making it difficult to capture the complex relationships in the data. Therefore, when polynomial features are introduced, p needs to be selected carefully.
Since both p and d are critical parameters, a particle swarm optimization algorithm is used to optimize their combination. The objective function is the estimation error of the polynomial-feature RF model f_(d,p)(x) on the validation data.
The optimization process begins with initialization: the total number of particles and the number of iterations are specified, and each particle is randomly assigned a position p_i = {p_i, d_i} and a velocity v_i = {v_pi, v_di}. The objective function of each particle is then calculated to obtain that particle's individual best solution, and the position of the particle with the smallest objective value is taken as the global best solution.
In each iteration, the following calculations are performed.
For the i-th particle, the objective function is calculated. If the result is less than the objective value at the particle's individual best position pbest_i = {pbest_pi, pbest_di}, the individual best is updated to the current position. If the result is also less than the objective value at the global best position gbest = {gbest_p, gbest_d}, the global best is updated to the current position. The velocity and position of the particles are updated as

v_i(t+1) = ω v_i(t) + c_1 r_1 (pbest_i − x_i(t)) + c_2 r_2 (gbest − x_i(t)),
x_i(t+1) = x_i(t) + v_i(t+1).

In the above equations, ω is the inertia weight, generally set to 0.9; c_1 and c_2 are the acceleration coefficients, generally set to 2.0; and r_1 and r_2 are randomly drawn from [0, 1] at each update.
When the maximum number of iterations is reached, gbest gives the optimal values of p and d.
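The PSO loop above can be sketched as follows. This assumes a validation-RMSE objective and scikit-learn components; the bounds on (p, d), particle counts, and the synthetic data are illustrative choices, not values from this study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(80, 2))        # synthetic speed/power stand-in
y = 200.0 + 50.0 * X[:, 0] * X[:, 1]           # synthetic BSFC-like target
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

def objective(p, d):
    """Validation RMSE of the polynomial-feature RF with parameters (p, d)."""
    Z_tr = PolynomialFeatures(degree=d).fit_transform(X_tr)
    Z_va = PolynomialFeatures(degree=d).fit_transform(X_va)
    rf = RandomForestRegressor(n_estimators=20, max_features=p, random_state=0)
    rf.fit(Z_tr, y_tr)
    return float(np.sqrt(np.mean((rf.predict(Z_va) - y_va) ** 2)))

omega, c1, c2 = 0.9, 2.0, 2.0                  # inertia weight, accelerations
lo, hi = np.array([0.1, 1.0]), np.array([1.0, 4.0])   # bounds on (p, d)
n_particles, n_iter = 5, 5
pos = rng.uniform(lo, hi, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_val = np.array([objective(p, int(round(d))) for p, d in pos])
gbest = pbest[np.argmin(pbest_val)].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((2, n_particles, 1))
    vel = omega * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    vals = np.array([objective(p, int(round(d))) for p, d in pos])
    better = vals < pbest_val
    pbest[better], pbest_val[better] = pos[better], vals[better]
    gbest = pbest[np.argmin(pbest_val)].copy()  # best (p, d) found so far
```

The degree d is continuous inside the swarm and rounded to an integer whenever the objective is evaluated, one common way of handling a discrete parameter in PSO.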

Experimental data
The data sets used in this paper were obtained from references 23,24. They are actual measurements from two gasoline internal combustion engines, including speed, power, and fuel consumption rate. The two engines produced a total of 52 and 80 measured data points, respectively. Tables 1 and 2 show the speed, power, and fuel consumption rate of the engines. Figures 1 and 2 show the distribution of the first engine's data in the speed-power plane and the distribution of speed-power-fuel consumption in three-dimensional space.
Figures 3 and 4 show the distribution of the second engine's data in the speed-power plane and the distribution of speed-power-fuel consumption in three-dimensional space.

Evaluation index
In this paper, the following indicators are used for evaluation: root mean square error (RMSE), normalized mean square error (NMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and R-squared (R^2). Each indicator is calculated as follows.
The RMSE is defined as

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 ),

where n is the total number of data points to be estimated, y_i is the real value, and ŷ_i is the estimated value.
To compare the accuracy and degree of variation across different datasets, the NMSE is used. It normalizes the RMSE by the mean of the true values so that the methods can be compared on different datasets:

NMSE = RMSE / ȳ.
Table 1. Speed, power and fuel consumption rate of the first engine 23.

The MAE is also used to compare estimation errors. The calculation of MAE is

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|.

[Table 1 column headings: speed (r/min) with repeated pairs of power P (kW) and BSFC be (g/kW·h); the table data were not recovered in extraction.]
To evaluate and compare the accuracy of different algorithms and data sets, the MAPE is also utilized. The MAPE is considered more robust than the MAE, as it normalizes the error of each data point. It is defined as

MAPE = (100% / n) Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i|.

R^2 is also used to evaluate the estimation methods; it represents the proportion of the original data's variation captured by the estimates. The calculation of R^2 is

R^2 = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)^2 / Σ_{i=1}^{n} (y_i − ȳ)^2,
where ȳ represents the average of all the data to be estimated. The value range of R^2 is (−∞, 1]. The closer R^2 is to 1, the more accurate the estimation method's results are; conversely, the farther R^2 is from 1, the greater the estimation error. When R^2 is less than 0, the estimation error of the method is significant, even greater than simply using the mean as the estimate.
In this paper, five indicators are used for evaluation: RMSE, NMSE, MAE, MAPE, and R^2. RMSE represents the standard deviation of the error between the estimated and true values, while NMSE expresses this error as a percentage. MAE represents the average absolute error between the estimated and true values, while MAPE expresses this error as a percentage. R^2 expresses the degree of fit between the data and the regression model. NMSE and MAPE serve as the primary performance indicators, while the other indicators serve as secondary indicators.
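The indicators above can be computed in a few lines of numpy. The helper name `metrics` is illustrative; NMSE is omitted here because its exact normalization was lost in extraction:

```python
import numpy as np

def metrics(y_true, y_pred):
    """RMSE, MAE, MAPE (%), and R^2 as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err / y_true)) * 100.0),
        "R2": float(1.0 - np.sum(err ** 2)
                    / np.sum((y_true - y_true.mean()) ** 2)),
    }
```

Predicting every value exactly gives RMSE = 0 and R^2 = 1, while predicting the mean everywhere gives R^2 = 0, consistent with the interpretation of R^2 above.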

Experimental results
To compare the different estimation methods, the known data were randomly divided into two groups at a ratio of 4:1, with 80% of the data treated as known and the remaining 20% used for estimation. The estimation methods compared in this study are KNN, IDW, OK, MLP, RF, and the proposed RF. The performance indicators compared are RMSE, NMSE, MAE, MAPE, and R^2. To reduce the impact of grouping randomness on the statistics, the experiment was repeated 10 times, with a new random grouping at the same ratio each time, so that each repetition uses a different split into known samples and test samples. The average of the performance metrics over the 10 experiments is used as the final indicator for performance comparison.
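The repeated 4:1 grouping protocol can be sketched as follows, assuming scikit-learn and using a plain RF and synthetic data purely as placeholders for the methods and measurements compared in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(80, 2))        # synthetic speed/power stand-in
y = 200.0 + 50.0 * X[:, 0] * X[:, 1]           # synthetic BSFC-like target

rmses = []
for seed in range(10):                         # ten independent 4:1 groupings
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    rmses.append(float(np.sqrt(np.mean((rf.predict(X_te) - y_te) ** 2))))

mean_rmse = float(np.mean(rmses))              # averaged final indicator
```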
Estimating 20% of BSFC data
Tables 3 and 4 present the performance metrics of the various estimation methods on Dataset 1 and Dataset 2 for estimating 20% of the BSFC data, respectively. The reported values in these tables are the averages of the ten experiments. Figures 5 and 6 display the estimated values of the different methods for the BSFC of Datasets 1 and 2, respectively; these figures show the estimated and real data of a single one of the ten experiments.
The results of the experiment on Dataset 1 indicate that the proposed RF method outperforms the standard RF method with an RMSE 0.46 lower, and it outperforms the other methods with an RMSE at least 5.05 lower. Additionally, the other error metrics are similarly favorable, and the R^2 value of the proposed RF is closest to 1. These indexes show that the proposed RF has the minimal error and the highest accuracy. Similar results were observed on Dataset 2, where the proposed method outperforms the other methods with an RMSE at least 9.71 lower.

Figure 1. 2D distribution of all collected data for the first BSFC.
Figure 2. 3D view of all collected data for the first BSFC.

Figure 3. 2D distribution of all collected data for the second BSFC.

Figure 4. 3D view of all collected data for the second BSFC.

Figure 5. Results of different methods on Dataset 1 for estimating 20% of BSFC data.

Figure 6. Results of different methods on Dataset 2 for estimating 20% of BSFC data.

Figure 7. Results of different methods on Dataset 1 for estimating 30% of BSFC data.

Figure 8. Results of different methods on Dataset 2 for estimating 30% of BSFC data.

Figure 9. Results of different methods on Dataset 1 for estimating 40% of BSFC data.

Figure 10. Results of different methods on Dataset 2 for estimating 40% of BSFC data.

Table 2. Speed, power and fuel consumption rate of the second engine 24.

Table 3. Performance comparison of different methods on Dataset 1 for estimating 20% of BSFC data.

Estimating 30% of BSFC data
Tables 5 and 6 present the average performance metrics of the various estimation methods on Dataset 1 and Dataset 2 for estimating 30% of the BSFC data after ten experiments, respectively. Figures 7 and 8 display the estimated values of the different methods for the BSFC of Datasets 1 and 2 in a single experiment, respectively. The results on Dataset 1 indicate that the proposed RF method outperforms the other methods with an RMSE 0.66 lower, and it outperforms the other methods with an RMSE 23.84 lower on Dataset 2. All the indexes show that the proposed RF method has the minimal error and the highest accuracy.
Estimating 40% of BSFC data
Tables 7 and 8 present the average performance metrics of the various estimation methods on Dataset 1 and Dataset 2 for estimating 40% of the BSFC data after ten experiments, respectively. Figures 9 and 10 display the estimated values of the different methods for the BSFC of Datasets 1 and 2 in a single experiment, respectively. The proposed RF method outperforms the other methods with an RMSE of 1.39 lower on Dataset 1.

Table 4. Performance comparison of different methods on Dataset 2 for estimating 20% of BSFC data.

Table 5. Performance comparison of different methods on Dataset 1 for estimating 30% of BSFC data.

Table 6. Performance comparison of different methods on Dataset 2 for estimating 30% of BSFC data.

Table 7. Performance comparison of different methods on Dataset 1 for estimating 40% of BSFC data.

Table 8. Performance comparison of different methods on Dataset 2 for estimating 40% of BSFC data.