Applying feature selection and machine learning techniques to estimate the biomass higher heating value

The biomass higher heating value (HHV) is an important thermal property that determines the amount of energy recoverable from agricultural byproducts. Precise laboratory measurement or accurate prediction of the HHV is essential for designing biomass conversion equipment. The current study combines feature selection scenarios and machine learning tools to establish a general model for estimating biomass HHV. Multiple linear regression and Pearson's correlation coefficients showed that the volatile matter, nitrogen, and oxygen content of biomass samples have only a slight effect on the HHV, and it is better to ignore them during HHV modeling. Then, the prediction performances of random forest, multilayer and cascade feedforward neural networks, the group method of data handling, and the least-squares support vector regressor are compared to determine the most accurate intelligent estimator for biomass HHV prediction. The ranking test shows that the multilayer perceptron neural network predicts the HHV of 532 biomass samples better than the other intelligent models. This model presents outstanding absolute average relative errors of 2.75% and 3.12% and regression coefficients of 0.9500 and 0.9418 in the learning and testing stages, respectively. The model also outperforms a recurrent neural network recently developed in the literature using the same databank.

- Previous works have randomly used either proximate or ultimate analysis, or their combination, to estimate biomass HHV. This study selects the most important explanatory variables among the proximate and ultimate analyses using well-known feature selection methods. Indeed, combining feature selection scenarios with machine learning methods is the most important novelty of the current research.
- Previous studies often proposed an empirical correlation or checked a small number of intelligent techniques to estimate biomass HHV. The present study applies several machine learning methods and selects the best one through ranking analysis.
- The accuracy of the approach constructed in the present study is better than that of a model recently suggested in the literature.

Collected data from the literature
An extensive experimental database is needed to develop a general data-driven model capable of predicting a desired target (here, the HHV). This database is also necessary for evaluating the model performance with diverse statistical criteria. On this ground, a literature databank including 532 HHV records as a function of proximate (fixed carbon, volatile matter, and ash) and ultimate (hydrogen, carbon, nitrogen, sulfur, and oxygen) compositional analyses was prepared. The supplementary material reports the numerical values of these variables and the source of each data sample.

Machine learning methods
This section describes the fundamental basis of the machine learning tools that are applied to compute biomass HHV.

Artificial neural network
Designing a reliable, accurate, and robust approach to extract the relation between input and output variables is a tough and time-consuming task that requires a detailed understanding of the process 41 . Artificial neural networks (ANNs), inspired by the biological nervous system of the human brain, are suggested for such systems for function extraction, fault detection, and data mining 42,43 . Accordingly, this technique has recently received remarkable interest in different areas, especially in branches where obtaining experimental data is arduous 44 . One of the main benefits of ANNs is that they construct a trustworthy model between independent and dependent factors without requiring any predefined relation. Hence, interconnected processing units are employed to build the ANN paradigm based on external information sources 44 . The multilayer perceptron neural network (MLPNN) is one of the most favorable approaches 45 . An MLPNN topology requires three main layers: input, hidden, and output. The input layer receives the raw information from an external source and, after some data treatment, transfers it to the hidden layer, where the major data analysis and mathematical processing take place. The operation defined by Eq. (1) is done in the neuron body 46 , where x is the entry signal and w is the weight vector; a bias (b) is added to specify the neuron's output. Further, a proper activation function (φ) must be chosen; linear (Eq. 2), radial basis (Eq. 3), logarithmic sigmoid (Eq. 4), and hyperbolic tangent sigmoid (Eq. 5) are among the most popular ones 47 .
Here, φ(Z) indicates the neuron's output, s indicates the spread factor, and "exp" is the exponential function. It is worth noting that, besides the MLPNN, the cascade feedforward neural network (CFFNN) is another well-known ANN type; it is a modified version of the MLPNN that adds a direct connection between the input and output layers in addition to the indirect connection through the hidden layer 9,18 .
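The neuron operation (Eq. 1) and the activation functions (Eqs. 2–5), whose bodies did not survive extraction, can be sketched as follows. This is a minimal Python illustration, not code from the paper; the exact radial-basis form in particular varies between references, and the one shown (with spread factor s) is an assumption:

```python
import math

def neuron_output(x, w, b, phi):
    """Eq. (1): weighted sum of the entry signals plus a bias,
    passed through the activation function phi."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return phi(z)

def linear(z):                       # Eq. (2)
    return z

def radial_basis(z, s=1.0):          # Eq. (3); s is the spread factor
    return math.exp(-(z ** 2) / (s ** 2))

def log_sigmoid(z):                  # Eq. (4)
    return 1.0 / (1.0 + math.exp(-z))

def tan_sigmoid(z):                  # Eq. (5)
    return math.tanh(z)
```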

Group method of data handling
The GMDH is a machine learning approach that recognizes data interrelations and effectively engineers the network configuration 48 . Accordingly, this topology has a robust potential to overcome modeling complexity in processes with multiple inputs and a single output. To develop a GMDH model, the defined neurons are related using a quadratic polynomial, and new neurons are generated in the next layer 49 . Routinely, the GMDH network connects the input and output layers through the Volterra functional series, described by the Kolmogorov-Gabor polynomial, i.e., Eq. (6) 50 , where M indicates the number of inputs, x denotes the input variables, and "a" represents the coefficients. Afterward, the GMDH approach is trained to minimize the square error (SE) between the real output (y) and the calculated output (y cal ) according to Eq. (7) 50 . The GMDH can ignore combinations of coupled signals that introduce a relatively high uncertainty into the prediction of the target variable.
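As an illustration, the two-input quadratic partial descriptor commonly used to relate GMDH neurons (the quadratic special case of the Kolmogorov-Gabor polynomial, Eq. 6) and the square error of Eq. (7) can be sketched as follows. The function names are illustrative, not from the paper:

```python
def quadratic_neuron(x1, x2, a):
    """Two-input quadratic partial descriptor relating GMDH neurons:
    y = a0 + a1*x1 + a2*x2 + a3*x1*x2 + a4*x1**2 + a5*x2**2."""
    return (a[0] + a[1] * x1 + a[2] * x2
            + a[3] * x1 * x2 + a[4] * x1 ** 2 + a[5] * x2 ** 2)

def square_error(y, y_cal):
    """Eq. (7): squared deviation between the real and calculated
    outputs, minimized during GMDH training."""
    return sum((a - c) ** 2 for a, c in zip(y, y_cal))
```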

Random forest
The RF is a classifier constructed from a group of decision trees, known as weak learners, that are trained in parallel and estimate the output through a majority-voting system 51 . In the RF, each decision tree strongly relies on a training dataset that is influenced by residual variation, noise, and data particularity as sources of uncertainty 52 . Accordingly, a minor variation in the training procedure has a significant effect on the developed decision tree. An ensemble is therefore employed to reduce the obstacles related to the single decision tree algorithm. This strategy improves the accuracy of the RF compared with a single decision tree and strongly enhances the generalization potential of the developed approach 53 . However, to construct a more robust RF network, heterogeneous decision trees with diversity and data particularity must be considered.
The required steps to design an RF paradigm are as follows 54 :
1. Step 1: The RF topology is developed with different sampling methods, using bootstrapping for sampling with replacement. In other words, n training sets are generated by drawing from the experienced samples n times with replacement.
2. Step 2: The element dataset is utilized to build n decision trees according to the n training sets obtained from Step 1.
3. Step 3: Each single decision tree describes the features, and the best split is chosen by considering the Gini index, the information divergence, and the divergence ratio.
4. Step 4: The random forest is then constructed from the trained decision trees, considering classification and regression analysis.
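The bootstrap-and-aggregate idea behind Steps 1–4 can be sketched as follows. This is a hedged illustration only: the paper gives no code, and the "trees" below are stand-in callables rather than trained decision trees; for a regression target such as the HHV, the majority vote becomes an average:

```python
import random
import statistics

def bootstrap_sample(data, rng):
    """Step 1: draw len(data) records with replacement (bootstrapping)."""
    return [rng.choice(data) for _ in range(len(data))]

def random_forest_predict(trees, x):
    """Step 4: aggregate the individual tree outputs; for regression the
    ensemble prediction is the average (for classification, a majority vote)."""
    return statistics.mean(tree(x) for tree in trees)

# Usage with stand-in "trees" (constant predictors instead of trained trees)
rng = random.Random(0)
sample = bootstrap_sample([1, 2, 3, 4, 5], rng)
prediction = random_forest_predict([lambda x: 1.0, lambda x: 3.0], None)
```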

Least-squares support vector regressor
The SVR is another well-known ML approach; its main distinction from common ANNs is that it minimizes the error using an upper-bound extension, whereas the other approaches consider the local error 55 . Generally, the SVR analyzes the data by solving a large-scale quadratic problem relying on a linear decision surface. To obviate this complexity, the least-squares SVR (LS-SVR) was developed; in this case the optimization is achieved by solving a set of linear equations instead of the quadratic problem 33 . The LS-SVR function is characterized by Eq. (8) 56 , where φ(x) indicates the kernel mapping and ω and B are the weight and bias of the model, respectively. On the other hand, an optimization process is required for the cost function (Eqs. 9 and 10) 57 . Further, the Lagrange function is employed to assess the developed optimization (Eq. 11) 56 .
To obtain the LS-SVR network, Eq. (12) must also be solved 57 . It is noteworthy that the established approach is based on the kernel function, calculated by Eq. (13) 56 . Several kernel functions, including quadratic, cubic, polynomial, linear, and Gaussian, can be incorporated in the LS-SVR body.
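The bodies of Eqs. (8)–(13) did not survive extraction. The following is a reconstruction of the standard LS-SVR formulation, which the surrounding text appears to follow; symbol choices beyond ω, B, and φ(x) (e.g., γ for the regularization weight) are assumptions:

```latex
% LS-SVR model function (Eq. 8)
y(x) = \omega^{T}\varphi(x) + B

% Cost function and equality constraints (Eqs. 9 and 10)
\min_{\omega,\, e}\; J(\omega, e) = \tfrac{1}{2}\,\omega^{T}\omega
  + \tfrac{\gamma}{2}\sum_{i=1}^{N} e_i^{2}
\quad \text{s.t.} \quad y_i = \omega^{T}\varphi(x_i) + B + e_i

% Lagrange function (Eq. 11)
L(\omega, B, e, \alpha) = J(\omega, e)
  - \sum_{i=1}^{N} \alpha_i \left[\omega^{T}\varphi(x_i) + B + e_i - y_i\right]

% Linear system obtained from the optimality conditions (Eq. 12)
\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} B \\ \alpha \end{bmatrix} =
\begin{bmatrix} 0 \\ y \end{bmatrix}

% Resulting model with the kernel function (Eq. 13)
y(x) = \sum_{i=1}^{N} \alpha_i\, K(x, x_i) + B,
\qquad K(x, x_i) = \varphi(x)^{T}\varphi(x_i)
```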

Results and discussion
Feature selection, machine learning construction/comparison, the best model selection, and performance evaluation are the main parts of the current section.

Feature selection
As mentioned earlier, the literature has tried to correlate biomass HHV with the proximate and ultimate compositional analyses of bio-samples. The present study applies two well-known feature selection methods, i.e., multiple linear regression and Pearson's correlation coefficient, to sort the fixed carbon, volatile matter, ash, carbon, nitrogen, oxygen, sulfur, and hydrogen content of biomass samples based on their effect on the observed HHV.

Multiple linear regression (MLR)
The MLR is likely the most well-known feature selection method which is often integrated with machine learning tools to efficiently handle an advanced regression task 58 .The MLR aims to extract a linear relationship between a target and its influential variables.The magnitude and sign of the coefficient of each independent variable in the MLR clarify the strength and direction of its influence on the target function.
For the sake of simplicity, some notations are assigned to the proximate and ultimate compositional analyses of biomass samples and their counterpart HHV. Table 1 introduces the symbols allocated to the involved target and influential variables in the current study.
It should also be noted that the HHV and its influential variables have different magnitudes. Hence, it is necessary to normalize them before establishing the MLR. This normalization stage helps deduce the strength of the HHV relationship with the independent variables solely from their MLR coefficients. This study uses Eq. (14) to scale all biomass compositional characteristics into the same range of zero to +1 (x). The biomass HHV is also normalized into the [0, 1] range by applying Eq. (15). The normalized HHV is abbreviated as y.
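The min–max scaling of Eqs. (14) and (15) can be sketched as a small helper (illustrative code, not from the paper):

```python
def min_max_scale(values):
    """Scale values into [0, 1]: (v - v_min) / (v_max - v_min),
    applied to each influential variable (Eq. 14) and to the HHV (Eq. 15)."""
    v_min, v_max = min(values), max(values)
    return [(v - v_min) / (v_max - v_min) for v in values]
```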
Equation ( 16) presents the mathematical expression of the MLR that linearly relates normalized HHV to its normalized influential variables.
Table 2 introduces the coefficients of the constructed MLR. The negative values of A3, A6, and A8 clarify that the HHV decreases with the ash, oxygen, and sulfur content of biomass samples. On the other hand, the fixed carbon, volatile matter, nitrogen, hydrogen, and carbon content of biomass samples increase the HHV.
The relative importance (RI) of the biomass compositional analysis can be easily computed using Eq. ( 17).
The relative importance of each biomass ingredient for the observed HHV is illustrated in Fig. 1. This figure states that the nitrogen (2%), oxygen (3%), and volatile matter (3%) content of biomass samples have so slight an influence on the HHV that they can be ignored. This observation is due to the small coefficients of these biomass ingredients in the MLR, i.e., A2 = 0.0667, A6 = −0.0616, and A7 = 0.0335. In contrast, the carbon (42%), ash (18%), fixed carbon (12%), hydrogen (10%), and sulfur (10%) content of biomass samples have a considerable effect on the HHV.
The MLR justified that it is better to model HHV solely based on the most important features, i.e., carbon, ash, fixed carbon, sulfur, and hydrogen content of biomass samples.
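Under the assumption that Eq. (17) normalizes the absolute MLR coefficients to percentages (which is consistent with the percentages reported above), the relative importance can be computed as follows (illustrative code, not from the paper):

```python
def relative_importance(coefficients):
    """Relative importance of each normalized-MLR coefficient as a
    percentage: RI_i = |A_i| / sum_j |A_j| * 100 (assumed form of Eq. 17)."""
    total = sum(abs(a) for a in coefficients)
    return [100.0 * abs(a) / total for a in coefficients]
```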

Pearson's correlation coefficient
The Pearson correlation coefficient is another method that helps sort influential variables based on the importance of their relationship with a target function. Equation (18) introduces a mathematical way to calculate the Pearson coefficient ( η ) for the correlation between the HHV and each influential variable.
Here, x ave and y ave denote the average values of the influential and target variables, respectively. Equations (19) and (20) can be used to compute the average value of the proximate/ultimate features and the biomass HHV, respectively.
As Table 3 shows, Pearson's coefficient for a correlation between a pair of variables ranges from − 1 to + 1.Similar to the MLR, the sign and magnitude of this coefficient clarify the direction and strength of the correlation, respectively.
The last row of this table reports the HHV relationship with the composition of biomass ingredients. It can be seen that the biomass HHV has the weakest correlation with the nitrogen (−0.16), oxygen (−0.17), and volatile matter (0.24) content of biomass samples. These are exactly the variables identified by the MLR method as negligible features. Furthermore, like the MLR method, Pearson's method also identifies the carbon, ash, fixed carbon, hydrogen, and sulfur content of biomass samples as the most important features. In summary, the feature selection accomplished by the MLR and Pearson's methods clarifies that it is better to predict the HHV solely as a function of the carbon, ash, fixed carbon, hydrogen, and sulfur content of biomass samples and to ignore all other ingredients of the bio-samples.
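The Pearson coefficient of Eq. (18) can be sketched directly from its textbook definition (illustrative code, not from the paper):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (Eq. 18) between two equal-length
    samples; ranges from -1 (perfect inverse) to +1 (perfect direct)."""
    n = len(x)
    x_ave = sum(x) / n                      # Eq. (19)
    y_ave = sum(y) / n                      # Eq. (20)
    num = sum((xi - x_ave) * (yi - y_ave) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_ave) ** 2 for xi in x)
                    * sum((yi - y_ave) ** 2 for yi in y))
    return num / den
```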

Designing the machine learning models
This section aims to design different machine learning tools (random forest, multilayer and cascade feedforward neural networks, group method of data handling, and least-squares support vector regressor) to predict biomass HHV based on the influential variables suggested by the feature selection methods. Then, the most accurate intelligent model is identified by comparing the performance of the machine learning tools in the learning and testing stages.
All these machine learning tools have some coefficients that are automatically adjusted by an optimization algorithm. In addition, they have some hyperparameters that must be determined by a trial-and-error procedure or other search techniques. Indeed, different machine learning models with different hyperparameters have been developed, and their performances are monitored using statistical analyses. By comparing the achieved accuracy of models with different hyperparameters, it is possible to determine the best hyperparameters. Interested readers can refer to Ref. 59 for techniques for hyperparameter tuning of machine learning models. Table 4 presents the most important hyperparameters of each machine learning tool and the best ones selected through trial-and-error investigations. This table indicates that the best MLPNN and CFFNN have two neuronic layers with the 5-13-1 and 5-14-1 configurations, respectively. The integer values in the MLPNN and CFFNN configurations show the number of influential variables, the number of hidden neurons, and the number of output neurons, correspondingly. These two ANNs include different activation functions in their neuronic layers and are trained by different optimization algorithms.
The kernel type is the only hyperparameter of the LS-SVR that must be determined by the trial-and-error process.Various kernel types, including linear, quadratic, cubic, polynomial, and Gaussian are checked, and the last candidate is identified as the best one.
The number of neuronic layers and the number of nodes in each layer are those GMDH hyperparameters that must be determined appropriately.The sensitivity analysis confirms that the GMDH with three neuronic layers and 5-7-9-1 configuration is superior to the other tested ones.
Finally, the trial-and-error analysis approves that 15 trees must be placed in the forest of the RF approach.
It should be mentioned that the following statistical criteria (Eqs. 21-24) 60 are used to monitor the deviation between the actual and predicted HHVs and to determine the best hyperparameters of each machine learning tool. AARE%, MSE, RMSE, and R abbreviate absolute average relative error, mean squared error, root mean squared error, and regression coefficient, respectively. Furthermore, the superscript cal designates the calculated HHV.
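Assuming the conventional definitions of these criteria (the exact bodies of Eqs. 21–24 are not reproduced in the text), they can be sketched as follows; the code is illustrative, not from the paper:

```python
import math

def aare_percent(y, y_cal):
    """Absolute average relative error in percent (assumed form of Eq. 21)."""
    return 100.0 / len(y) * sum(abs(a - c) / abs(a) for a, c in zip(y, y_cal))

def mse(y, y_cal):
    """Mean squared error (assumed form of Eq. 22)."""
    return sum((a - c) ** 2 for a, c in zip(y, y_cal)) / len(y)

def rmse(y, y_cal):
    """Root mean squared error (assumed form of Eq. 23)."""
    return math.sqrt(mse(y, y_cal))

# R (the regression coefficient, Eq. 24) measures the correlation between
# the actual and calculated HHVs and can be computed with the Pearson
# formula introduced in the feature-selection section.
```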
To distinguish the machine learning tool with the highest accuracy toward HHV prediction, it is necessary to compare the performance of the selected models in the learning and testing stages. The 532 available data samples are randomly split into learning and testing categories with a ratio of 85/15. Indeed, the learning step of all the machine learning tools is accomplished with 452 samples, and the remaining 80 unseen samples are used to test the generalization capability of the trained models.
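The 85/15 division of the 532 records (452 learning, 80 testing) can be reproduced with a simple random split; this is a sketch (the paper does not describe its splitting code, and the seed is arbitrary):

```python
import random

def train_test_split(data, test_fraction=0.15, seed=0):
    """Randomly hold out ~15% of the records for testing."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# With 532 records this yields 452 learning and 80 testing samples.
train, test = train_test_split(range(532))
```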
Table 5 summarizes the RF, LS-SVR, MLPNN, CFFNN, and GMDH performance for estimating the HHV records in the learning and testing steps. The AARE%, MSE, RMSE, and R criteria are used to monitor the models' performance. Owing to the availability of four statistical indexes and two different categories, it is not easy to identify the best model. Therefore, the next section uses the ranking test to sort the machine learning models based on their performance in the learning and testing phases.

Selecting the most accurate machine learning model
The ranking test assigns the first rank (i.e., 1) to the model with the best value of a statistical criterion (the minimum AARE%, MSE, and RMSE, and the maximum R). On the other hand, the model with the worst value receives the last rank (i.e., 5). The second, third, and fourth ranks are assigned to the other machine learning models in order. Then, the average rank of a machine learning model can be computed from its ranks for the involved statistical indexes. Finally, the machine learning models are sorted based on their average performance in the learning and testing stages. Figure 2 graphically presents the learning/testing ranks of the investigated machine learning tools. Although the CFFNN has the first rank in the learning stage (the best performance), it predicts the testing category so inaccurately that it falls to the fifth rank (the worst performance). Therefore, the CFFNN cannot be considered the best model. The MLPNN, with the second and first ranks in the learning and testing stages, presents the best performance for estimating the biomass HHV. The GMDH, with the fifth and fourth ranks in the learning and testing phases, is the worst intelligent tool for predicting the biomass HHV.
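The ranking procedure described above can be sketched as follows; this is an illustration (the paper does not specify tie handling, and the function names are invented):

```python
def ranks(scores, lower_is_better=True):
    """Assign rank 1 to the best score and rank len(scores) to the worst.
    Ties take the first matching position."""
    order = sorted(scores, reverse=not lower_is_better)
    return [order.index(s) + 1 for s in scores]

def average_rank(rank_lists):
    """Average each model's ranks over all statistical criteria."""
    return [sum(model_ranks) / len(rank_lists) for model_ranks in zip(*rank_lists)]
```

For example, ranking three models on AARE% values of 2.75, 3.5, and 3.0 yields ranks 1, 3, and 2, and averaging rank lists across several criteria gives each model's overall position.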
The performed ranking test confirmed that the MLPNN with a 5-13-1 configuration predicts the biomass HHV better than the other checked machine learning tools. The compatibility of the actual HHVs and the MLPNN predictions is confirmed by the excellent AARE = 2.75%, MSE = 0.59, RMSE = 0.77, and R = 0.9500 in the learning stage and AARE = 3.12%, MSE = 0.85, RMSE = 0.92, and R = 0.9418 in the testing step.
The subsequent sections comprehensively evaluate the MLPNN performance utilizing graphical and numerical analyses.In addition, the MLPNN accuracy will be compared with another model recently proposed in the literature 61 .

Performance analysis
The scatter plots of the biomass HHVs computed by the MLPNN versus their associated actual measurements for the learning and testing steps are displayed separately in Fig. 3. This analysis confirms excellent compatibility between the actual and computed target function. The regression coefficients of 0.9500 and 0.9418 observed in the learning and testing steps also indicate the outstanding performance of the MLPNN in simulating the HHV of biomass samples of diverse origins. The performance of the suggested model for predicting the learning and testing sets has been monitored using the error between the actual and computed biomass HHVs (Eq. 25), and the results are shown in Fig. 4.
where e is the error. This investigation justifies that the errors between the actual and predicted biomass HHVs mainly lie between −3 and 3 kJ/g. Furthermore, less than 1.2% of the actual HHV measurements have an absolute error higher than 3 kJ/g. Table 6 reports the main statistical characteristics (minimum, maximum, average, and standard deviation) of the error between the actual and calculated biomass HHVs. The MLPNN's error for the biomass HHV estimation ranges from −3.061 to 4.438 kJ/g. Moreover, the average and standard deviation (SD) of the errors are 0.021 and 0.820 kJ/g, respectively. Equations (26) and (27) define the SD and average (e ave ) of the errors provided by the MLPNN.
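The error statistics of Eqs. (25)–(27) can be sketched as follows; the population form of the standard deviation is an assumption, since the text does not state whether N or N−1 is used in Eq. (26):

```python
import math

def error_statistics(y_actual, y_cal):
    """Errors e_i = y_i - y_cal_i (Eq. 25) together with their minimum,
    maximum, average (Eq. 27), and standard deviation (Eq. 26)."""
    errors = [a - c for a, c in zip(y_actual, y_cal)]
    e_ave = sum(errors) / len(errors)
    sd = math.sqrt(sum((e - e_ave) ** 2 for e in errors) / len(errors))
    return min(errors), max(errors), e_ave, sd
```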

Validation by the literature model
The literature recently applied a recurrent neural network (RNN) to predict biomass HHV from all proximate and ultimate compositional analyses 61 . Therefore, it is a good idea to compare the prediction accuracy of this RNN with the MLPNN proposed in the current study. Table 7 compares the RNN and MLPNN performance in computing the learning/testing biomass HHVs utilizing the AARE%, MSE, RMSE, and R indexes. It is easy to conclude that the MLPNN is more accurate than the recently constructed RNN. A radar graph is then employed to visually compare the MLPNN and RNN performance in the learning and testing steps. Figure 5 shows that the accuracies obtained by the MLPNN in terms of the AARE%, MSE, RMSE, and R indices are better than those provided by the RNN. It should be highlighted that small values of the first three indices and an R index close to unity are desirable from the modeling perspective.
In addition, Fig. 6 displays that the MLPNN performance in terms of all four statistical indexes is superior to those obtained by the RNN during the testing stage.

Conclusions
The literature has used a random combination of proximate and ultimate analyses to estimate the biomass HHV.
Since the appropriate selection of the explanatory variables has a direct impact on the modeling accuracy, this work applied feature selection scenarios and machine learning methodologies to suggest a practical route to accurately predict the higher heating value of biomass samples. A relatively extensive experimental databank including 532 HHV records is used to validate the proposed method. The main findings of this research work can be summarized as follows:
- Multiple linear regression and Pearson's correlation coefficient were applied to identify the most important variables influencing the biomass HHV.
- Carbon and ash content are the main biomass ingredients determining the HHV.
https://doi.org/10.1038/s41598-023-43496-x www.nature.com/scientificreports/

Figure 1. The relative importance of biomass compositions on the HHV.

Figure 3. Correlation between actual and predicted HHVs of different biomass samples.

Figure 4. Performance checking of the MLPNN model in the learning and testing steps.
- The HHV sharply increases with the carbon content and dramatically decreases with the ash content of biomass samples.
- The volatile matter and nitrogen/oxygen content of the biomass have a negligible effect on the HHV.
- The multilayer perceptron neural network provided more accurate predictions of the biomass HHV than the other four checked machine learning models.
- The MLPNN predicted the 452 learning HHVs with AARE = 2.75%, MSE = 0.59, RMSE = 0.77, and R = 0.9500.
- The model accuracy for predicting the 80 unseen testing HHVs was also confirmed by AARE = 3.12%, MSE = 0.85, RMSE = 0.92, and R = 0.9418.
- The MLPNN provides more accurate HHV predictions than the RNN suggested in the literature.

Figure 6. The radar graph comparing the MLPNN and RNN performance in the testing stage.
Here, x i , i = 1, 2, ..., 8, indicate the fixed carbon, volatile matter, ash, carbon, hydrogen, oxygen, nitrogen, and sulfur content of biomaterials, respectively. In addition, N is the number of records. The superscripts min and max represent the minimum and maximum values of each variable.

Table 1. Assigned notations to define the independent and dependent variables.

Table 2. Adjusted coefficients of the MLR equation.

Table 3. Pearson's coefficients between each pair of involved variables in the present work.

Table 4. Summary of the checked/selected hyperparameters of the machine learning models.

Table 5. Performance of different machine learning models in predicting the learning/testing HHV data, and ranking test to sort the machine learning models based on their performance in the learning/testing stage.

Table 6. Summary of the MLPNN's errors in predicting the HHV records.

Table 7. Comparison of the MLPNN accuracy with the literature model.