Majority scoring with backward elimination in PLS for high dimensional spectrum data

Variable selection is a crucial issue in high dimensional data modeling, where the sample size is small compared to the number of variables. Recently, majority scoring of filter measures in PLS (MS-PLS) was introduced for variable selection in high dimensional data. Filter measures are not greedy for optimal performance; hence we propose majority scoring with backward elimination in PLS (MSBE-PLS). In MSBE-PLS we consider two filter measures, variable importance on projection (VIP) and selectivity ratio (SR). In each iteration of backward elimination in PLS, variables are retained as influential only if they are selected by both filter indicators. The proposed method is applied to the prediction of corn and diesel contents. The corn contents include protein, oil, starch and moisture, while the diesel contents include boiling point at 50% recovery, cetane number, density, freezing temperature of the fuel, total aromatics, and viscosity. The proposed method outperforms the reference methods in terms of RMSE. In addition to validating the spectral models, the data properties are also examined to explain the prediction behavior. Moreover, MSBE-PLS selects a moderate number of influential variables, and hence yields a parsimonious model for predicting contents from spectral data.

For modeling high dimensional data, partial least squares (PLS) 1 has proven itself a strong candidate in diverse areas 2 . PLS fits the model iteratively; in each iteration a PLS component describes the relation between the response y (here the measured contents) and the spectrum data, marked as the explanatory matrix X . Since basic PLS is not a variable selection method, several modifications of PLS have been proposed for variable selection 3 . The presence of noise variables in high-dimensional spectrum data is quite common and may degrade the prediction capability of the model. Although basic PLS was not designed for variable selection, several developments equip PLS with variable selection for improved prediction. Among these, the Hotelling T 2 based PLS, i.e. T 2 -PLS, and truncation of the PLS loading weights, i.e. Trunc-PLS, are considered strong candidates. The importance of variables in PLS is defined by the PLS loading weights. For instance, Liland et al. 4 in Trunc-PLS assume normality of the loading weights, where variables whose loading weights lie near the centre of the distribution are considered noise variables and are discarded from the final fitted model. T 2 -PLS 5 can be considered the multivariate extension of Trunc-PLS, where the PLS loading weight matrix, containing the loading weights from the first to the optimum component, is monitored with Hotelling's T 2 .
Recently, Freeh and Mehmood 6 introduced a majority scoring based algorithm for variable selection in PLS (MS-PLS). In MS-PLS several filter measures are considered at the same time, and variables are scored by each of the considered filter measures. Variables scoring above a threshold are marked as influential and the rest as non-influential. Mehmood et al. 3,7 compared filter and wrapper methods for variable selection in PLS, indicating that filter measures are faster, while wrapper algorithms are computationally expensive but greedier for model performance. The backward elimination procedure is a potential wrapper variable selection method. The current article proposes the implementation of majority scoring within backward elimination, using two variable selection measures: variable importance on projection (VIP) 8 and selectivity ratio (SR) 9 . In each iteration of backward elimination in PLS, variables are retained as influential only if they are selected by both filter indicators. As a case study, the proposed method is applied to modeling corn and diesel contents, where the samples are characterized by spectra. The performance of the proposed method, MSBE-PLS, is compared with the reference methods T 2 -PLS and Trunc-PLS. In addition to validating the spectral models, the data properties are also examined to explain the prediction of the contents.
In this paper, "Data set and spectrometers" presents the spectroscopic data. "Methods" presents the methodology, including the PLS based models, parameter estimation, calibration, validation, and statistical analysis. "Results and discussion" presents the results and discussion.

Data set and spectrometers
We have considered the following two data sets.
Corn data. The corn near-infrared (NIR) spectra were obtained from http://software.eigenvector.com/Data/Corn/index.html. The corn data include 80 samples measured on the NIR spectrometer called Mp5, the primary instrument, a FOSS NIRSystems 5000. The spectra cover the wavelength range 1100 to 2498 nm at 2 nm intervals, giving 700 channels per spectrum. These constitute the 700 columns of the explanatory matrix, resulting in X (80×700) . From each corn sample different contents, namely protein, oil, starch and moisture, were measured. These contents constitute the response variables y moisture(80×1) , y oil(80×1) , y protein(80×1) and y starch(80×1) .

Methods
Partial least squares (PLS). In PLS the centered spectrum explanatory matrix X 0 = X − 1x̄′ and centered response y 0 = y − 1ȳ are used 10 . PLS is an iterative procedure with K components. For each PLS component k = 1, 2, . . . , K the loading weights, score vector, loadings and deflated data are computed as follows:
1. Define the loading weights by w k = X′ k−1 y k−1 , which reflect the covariance of X k−1 with y k−1 , and normalize them: w k = w k /‖w k ‖.
2. Compute the score vector t k by t k = X k−1 w k .
3. Compute the X-loading p k by regressing X k−1 on the score vector, p k = X′ k−1 t k /(t′ k t k ), and similarly the Y-loading q k = y′ k−1 t k /(t′ k t k ).
4. Deflate X k−1 and y k−1 by subtracting the contribution of t k : X k = X k−1 − t k p′ k and y k = y k−1 − t k q k .
5. If k < K, go back to step 1.
From each component the computed loading weights, score vector, loadings and deflated data are stored in the respective matrices/vectors W , T , P and q . Although PLS is a suitable candidate for prediction, in the presence of noise variables the validation performance may decrease. The validation performance can be improved by removing the noise variables from PLS. For modeling the corn and diesel data we have considered majority scoring with backward elimination in PLS (MSBE-PLS), majority scoring in PLS (MS-PLS), Hotelling T 2 based variable selection in PLS ( T 2 -PLS) and truncation for variable selection in PLS (Trunc-PLS). The computational structure of these methods is presented in Figs. 1 and 2. The algorithms of these methods are described below.
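The component loop above can be sketched in code. The following minimal NumPy implementation is an illustrative sketch for a single response (not the authors' code); it assumes X and y are already mean-centered:

```python
import numpy as np

def pls1(X, y, K):
    """Minimal PLS1 sketch of the algorithm above.

    Returns loading weights W, scores T, X-loadings P and y-loadings q.
    Assumes X (n x p) and y (n,) are mean-centered.
    """
    n, p = X.shape
    Xk, yk = X.copy(), y.copy()
    W = np.zeros((p, K)); T = np.zeros((n, K))
    P = np.zeros((p, K)); q = np.zeros(K)
    for k in range(K):
        w = Xk.T @ yk                 # covariance of X_{k-1} with y_{k-1}
        w /= np.linalg.norm(w)        # normalize the loading weights
        t = Xk @ w                    # score vector
        tt = t @ t
        pk = Xk.T @ t / tt            # X-loadings from regressing X on t
        qk = yk @ t / tt              # y-loading
        Xk = Xk - np.outer(t, pk)     # deflate X
        yk = yk - t * qk              # deflate y
        W[:, k], T[:, k], P[:, k], q[k] = w, t, pk, qk
    return W, T, P, q
```

The regression coefficients for prediction on new centered spectra can then be recovered in the usual way as b = W (P′W)⁻¹ q.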

Truncation for variable selection in PLS (Trunc-PLS).
In standard PLS, the loading weights reflect the importance of the variables [11][12][13] . Variables with small absolute loading weights are considered noise and should be removed from the model. Considering the importance of the PLS loading weights, Liland et al. 4 assume the PLS loading weights w follow a normal distribution, where variables whose loading weights fall inside a confidence interval about the centre of the distribution (i.e. close to zero) are treated as noise and are truncated out of the model. The procedure follows by:
1. Sorting the PLS loading weights as w s .
2. Computing the confidence interval about the median of w s as f (w s , α Trunc ).
3. Truncating (setting to zero) the loading weights that fall inside the interval.
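A minimal sketch of this truncation step is shown below. It assumes a normal-theory interval about the median with a MAD-based spread estimate; these are our illustrative assumptions, and the reference implementation may estimate the interval differently:

```python
import numpy as np
from statistics import NormalDist

def truncate_weights(w, alpha=0.05):
    """Illustrative Trunc-PLS-style truncation (assumed form): loading
    weights inside a confidence band around the median are treated as
    noise and set to zero; weights in the tails are kept."""
    med = np.median(w)
    # robust spread estimate via the median absolute deviation (assumption)
    mad = np.median(np.abs(w - med)) * 1.4826
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lo, hi = med - z * mad, med + z * mad
    return np.where((w > lo) & (w < hi), 0.0, w)
```

Weights surviving truncation then define the variables that remain in the fitted model.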

Majority scoring in PLS (MS-PLS).
Considering more than one filter measure at a time may result in more consistent variable selection. In this context, Freeh and Mehmood 6 recently introduced the majority scoring based algorithm for variable selection in PLS (MS-PLS). Here we have considered variable importance on projection (VIP) 8 and selectivity ratio (SR) 9 . For the jth variable the VIP is defined as VIP j = sqrt( p Σ k q k ² (t′ k t k ) (w jk /‖w k ‖)² / Σ k q k ² (t′ k t k ) ), where p is the number of variables. For the selectivity ratio (SR), target projection (TP), also called target rotation, is used. Target projection is the post-projection of the explanatory spectrum data on the response, where the spectrum explanatory matrix is decomposed into a latent part and a residual part as X = t TP p′ TP + E TP . The selectivity ratio from TP is defined as SR j = V exp,j / V res,j , where V exp,j is the explained variance through TP and V res,j is the residual variance of spectrum variable j . The proposed procedure is presented in a flow chart in Fig. 2 and follows as:
1. Compute the considered filter measures (here VIP and SR) from the fitted PLS model.
2. Convert each filter measure into a binary selection of variables.
3. Construct the score matrix S whose columns present the variables and whose rows present the filter measures. The (ith row, jth column) entry of S presents the influence of the ith filter measure on the jth variable.
4. Compute the average score ψ ∈ [0, 1] for each jth variable; ψ → 1 indicates the respective variable is influential.
5. Convert ψ into the label vector l ψ by thresholding at the percentile pt. A higher level of pt is expected to result in a more selective set of influential variables; for optimal performance it requires tuning.
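Under the definitions above, the two filter measures can be sketched as follows. This is illustrative NumPy code, not the reference implementation; `W` is assumed to have unit-norm columns (as produced by the PLS normalization step) and `b` is the PLS regression vector used as the target for projection:

```python
import numpy as np

def vip(W, T, q):
    """Variable importance on projection (VIP) for a fitted PLS1 model:
    W (p x K) unit-norm loading weights, T (n x K) scores, q (K,) y-loadings."""
    p, K = W.shape
    ssy = (q ** 2) * np.sum(T ** 2, axis=0)   # y-variance explained per component
    return np.sqrt(p * (W ** 2 @ ssy) / ssy.sum())

def selectivity_ratio(X, b):
    """Selectivity ratio via target projection of X on the regression vector b."""
    w_tp = b / np.linalg.norm(b)
    t_tp = X @ w_tp
    p_tp = X.T @ t_tp / (t_tp @ t_tp)
    X_hat = np.outer(t_tp, p_tp)              # latent (explained) part
    E = X - X_hat                             # residual part
    return np.sum(X_hat ** 2, axis=0) / np.sum(E ** 2, axis=0)
```

A useful sanity check on VIP is that, with unit-norm weight columns, the squared VIP values sum to the number of variables p.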

Majority scoring with backward elimination in PLS (MSBE-PLS).
Majority scoring with backward elimination in PLS (MSBE-PLS) requires the same filter measures as taken in MS-PLS for variable importance. Let Z 0 = X ; then, in each iteration of backward elimination, a PLS model is fitted on the surviving variables, the filter measures are computed, and only the variables selected by both filter indicators are retained before the next iteration.
Model fitting. Model fitting requires parameter tuning. For all considered PLS based methods, the number of PLS components is the common parameter to tune. In addition, Trunc-PLS has α Trunc , T 2 -PLS has α T 2 and MSBE-PLS has u . These additional parameters define the variable selection in the respective PLS models. For optimal estimation, a range of possible values of these parameters is considered in the validation procedure described in the next subsection.
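The elimination loop can be sketched as below. This is a hedged, illustrative sketch only: `fit_pls`, `vip_fn` and `sr_fn` are hypothetical user-supplied hooks, and the quantile threshold `u` is an illustrative stand-in for the paper's tuning parameter:

```python
import numpy as np

def msbe_select(X, y, K, fit_pls, vip_fn, sr_fn, u=0.5, max_iter=20):
    """Sketch of majority-scoring backward elimination (not the reference
    implementation): refit PLS on the surviving variables and keep only
    those flagged by BOTH filter measures in each iteration."""
    keep = np.arange(X.shape[1])           # indices of surviving variables
    for _ in range(max_iter):
        model = fit_pls(X[:, keep], y, K)  # user-supplied PLS fit
        v, s = vip_fn(model), sr_fn(model)
        # a variable survives only if both filters score it above the
        # u-quantile threshold (majority scoring with two filters)
        sel = (v > np.quantile(v, u)) & (s > np.quantile(s, u))
        if sel.all() or not sel.any():     # no further elimination possible
            break
        keep = keep[sel]
    return keep
```

Because the surviving index set is refitted each round, the procedure is a wrapper method: variable importance is always re-evaluated on the reduced model rather than once on the full model.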
Validation and robustness of model performance. For evaluating the model prediction capability and reliable estimation of the parameters, a double cross-validation procedure is adopted. The spectrum data X and response y are divided into test (25%) and training (75%) sets. The training data are used for model fitting. The prediction capability is usually measured by the RMSE, defined as RMSE = sqrt( (1/n) Σ i=1..n (y i − ŷ i )² ), where n is the sample size of the respective split of the data (test/training), y i is the response, which can be any of the corn or diesel contents, and ŷ i is the respective predicted response from the model. The model with the least RMSE on the training and test data set is called well calibrated and well validated, respectively. Since model fitting requires parameter tuning, 10-fold cross-validation is used on the training data. The parameter value which gives the best RMSE in 10-fold cross-validation is considered the optimum.
The data are divided into training and test sets randomly; hence it is quite possible that for a given split the models may over- or under-perform. In order to obtain a robust estimate of model performance, a Monte Carlo simulation with 100 runs was used. In each Monte Carlo run, the above validation procedure is conducted 14 .
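The validation loop can be sketched as follows. This is a simplified sketch: `fit_and_predict` is a hypothetical wrapper that would internally carry out the 10-fold cross-validation parameter tuning on the training split:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between observed and predicted responses."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def monte_carlo_validation(X, y, fit_and_predict, runs=100, test_frac=0.25, seed=0):
    """Repeat the random 25/75 test/training split `runs` times and record
    the validated RMSE of each run (robustness procedure sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_frac * n))
    out = []
    for _ in range(runs):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        y_hat = fit_and_predict(X[train], y[train], X[test])
        out.append(rmse(y[test], y_hat))
    return np.array(out)
```

The spread of the returned RMSE values across runs is what the stability analysis below summarizes via standard deviations.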
Data properties. In addition, the data properties are examined to explain the prediction of the corn and diesel contents. For this purpose, the eigenvalue structure of the sample covariance matrix of the spectra and the covariance between the principal components and the contents are examined 4,15 . Contents whose relevant information lies outside the components with large eigenvalues are expected to have the worst prediction.
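These diagnostics can be sketched via the SVD of the centered spectra; the following is an illustrative sketch of the scaled-eigenvalue and component-response covariance computations (names are ours, not from the paper):

```python
import numpy as np

def relevance_summary(X, y):
    """Scaled eigenvalues of the sample covariance of the spectra, and the
    covariance of each principal component score with the content y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eig = s ** 2 / (len(y) - 1)            # eigenvalues of the covariance matrix
    scores = U * s                         # principal component scores
    cov_py = scores.T @ yc / (len(y) - 1)  # covariance of each PC with y
    return eig / eig[0], cov_py
```

A content with large covariance concentrated on the leading (large-eigenvalue) components is expected to be well predicted; one whose covariance sits on the trailing components is not.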

Results and discussion
From the corn samples four contents (protein, oil, starch and moisture) are measured; from the diesel samples the boiling point at 50% recovery, cetane number, density, freezing temperature of the fuel, total aromatics, and viscosity are measured. Hence each response is modeled separately with the respective spectra. The data properties related to the corn and diesel spectra and their contents are presented in Fig. 3. The upper panel shows that the corn spectra have strong between-variable dependencies: very few latent components explain most of the data variation. Together with the sharp drop of the eigenvalues, we notice distinct behavior of the covariances between the principal components and the corn contents. On average, moisture and oil show large covariances over the relevant components and small covariances over the irrelevant components; hence one should expect better prediction. Protein shows moderate covariances over the relevant components and small covariances over the irrelevant components; hence one should expect moderate prediction. Starch shows small covariances over both the relevant and irrelevant components; hence one should expect relatively poor prediction. Similar trends are observed for the diesel contents, as presented in the lower panel of Fig. 3.
We have considered four PLS based models: Trunc-PLS, T 2 -PLS, MS-PLS and MSBE-PLS. For evaluation and comparison, the Monte Carlo simulation is implemented with N = 100 . In each run, the spectrum data X and contents y are divided into test (25%) and training (75%) sets. The training data are used to fit the PLS based models, where 10-fold cross-validation is implemented for tuning the model parameters, i.e. the number of components, α Trunc , α T 2 and u . From each Monte Carlo run, the optimal tuning parameters, calibration RMSE, validated RMSE and number of selected wavenumbers are recorded for each fitted model.
For prediction models, both the validated and the calibrated RMSE should be small 16 . The comparison of the validated and calibrated RMSE for the corn and diesel contents is presented in Fig. 4. MSBE-PLS and Trunc-PLS show small validated and calibrated RMSE, while T 2 -PLS has moderate validated and calibrated RMSE.

Figure 3. Components versus scaled eigenvalues for the corn contents (upper panel) and the diesel contents (lower panel).

For validation, the stability of the model is also an important factor to consider. In Fig. 5 the standard deviation of accuracy for all fitted models is presented. MS-PLS, MSBE-PLS and Trunc-PLS have the best stability for the corn moisture and oil contents. MSBE-PLS and Trunc-PLS have better stability for the corn protein content. Similarly, the boiling point of diesel has the best stability with MSBE-PLS and Trunc-PLS. The cetane number has good stability with all PLS methods. The diesel density has the best stability with MSBE-PLS, as does the freezing temperature of the fuel. The total aromatics have the best stability with MSBE-PLS and Trunc-PLS, and the viscosity with MSBE-PLS and MS-PLS.
Following the comparison of the validated and calibrated RMSE and the stability analysis, an analysis of variance (ANOVA) is conducted to study the effect of the PLS based methods on the variation in validated RMSE. The ANOVA results for each corn characteristic (protein, oil, starch and moisture) are presented in Table 1. Among the PLS models, MSBE-PLS is taken as the reference model. It appears MSBE-PLS has significantly better prediction of the corn moisture (p-value = 0.018) and oil (p-value < 0.001) compared with Trunc-PLS; similarly, MSBE-PLS has significantly better prediction of all considered corn contents (p-value < 0.001) compared with T 2 -PLS. The ANOVA results for each diesel characteristic (boiling point at 50% recovery, cetane number, density, freezing temperature of the fuel, total aromatics, and viscosity) are presented in Table 2.

For a well calibrated and validated model, the number of selected variables is important to consider, since it reflects how much information is treated as noise and how much as influential. Moreover, the distribution of the number of selected variables affects the prediction RMSE. The distribution of the number of selected variables, together with standard error bars from the 100 Monte Carlo simulations, is presented for all PLS methods in Fig. 7. The upper panel presents the distribution of selected variables in modeling the corn contents, while the lower panel presents the results for the diesel contents. It appears Trunc-PLS uses the maximum number of variables (wavelengths) while T 2 -PLS utilizes the minimum number. Since the prediction capabilities from