Comparison of wavelength selection methods for in-vitro estimation of lactate: a new unconstrained, genetic algorithm-based wavelength selection

Biochemical and medical literature establish lactate as a fundamental biomarker that can shed light on the energy consumption dynamics of the body at cellular and physiological levels. It is therefore, not surprising that it has been linked to many critical conditions ranging from the morbidity and mortality of critically ill patients to the diagnosis and prognosis of acute ischemic stroke, septic shock, lung injuries, insulin resistance in diabetic patients, and cancer. Currently, the gold standard for the measurement of lactate requires blood sampling. The invasive and costly nature of this procedure severely limits its application outside intensive care units. Optical sensors can provide a non-invasive, inexpensive, easy-to-use, continuous alternative to blood sampling. Previous efforts to achieve this have shown significant potential, but have been inconclusive. A measure that has been previously overlooked in this context, is the use of variable selection methods to identify regions of the optical spectrum that are most sensitive to and representative of the concentration of lactate. In this study, several wavelength selection methods are investigated and a new genetic algorithm-based wavelength selection method is proposed. This study shows that the development of more accurate and parsimonious models for optical estimation of lactate is possible. Unlike many existing methods, the proposed method does not impose additional locality constraints on the spectral features and therefore helps provide a much more granular interpretation of wavelength importance.


Background
Cellular respiration, lactate, and adenosine triphosphate. Highlighting the relationship between lactate and Adenosine TriPhosphate (ATP), known as the universal energy currency of cells, helps demonstrate the importance of lactate as a remarkable biomarker. Transcending human biology, ATP is seen in all living organisms. It provides cells with easily accessible energy to fuel various processes such as biosynthesis, metabolism, DeoxyriboNucleic Acid (DNA) synthesis, muscle contraction, transport of ions and impulse transition in the nervous system 23 . The storage of easily accessible bioenergy in ATP is the outcome of cellular respiration. In this process, carbohydrates are broken down into simple sugars and these simple sugars are further processed to produce ATP. In animals, including humans, glucose is the most common type of such sugars that can be metabolised by cells.
Glycolysis is an important ATP producing system in living organisms. In glycolysis, cells break down a molecule of glucose and extract a net of two ATP and two molecules of pyruvate. This process does not require oxygen (anaerobic) and can produce ATP at a very fast rate. For instance, in an intense maximal workout, glycolysis may become an important energy production mechanism to fuel the body, but only for a few minutes. In the absence of a sufficient supply of oxygen, pyruvate is converted to lactate to help maintain glycolysis for longer periods. In the presence of oxygen, pyruvate, together with fatty acids and amino acids, are eventually reduced to hydrogen and carbon dioxide in a chain of reactions that produce around 28-30 additional ATP per molecule of glucose. The whole system is collectively referred to as aerobic respiration and includes 4 stages; glycolysis, the link reaction, the Krebs cycle, and oxidative phosphorylation 24 .
The completion of an aerobic respiration cycle, takes a longer time than glycolysis alone, up to 100 times. It is, however, much more "efficient" in terms of ATP production, as 30-32 ATPs per molecule of glucose are produced in the aerobic cycle compared to 2 ATPs per molecule of glucose in glycolysis alone 25 . But of course, among other things, aerobic respiration requires sufficient supplies of oxygen. Therefore, higher than normal levels of glycolysis, and consequently, higher levels of lactate may be observed in conditions where the oxygen supply to cells is restricted or when cells need faster than normal deposits of energy relative to what aerobic respiration can supply. www.nature.com/scientificreports/ This brief introduction highlights the importance of lactate as a fundamental biomarker that sheds light on the energy consumption patterns of the body. What follows provides a brief overview of the role of lactate as a prognostic and diagnostic measure in a wide range of diseases.

Lactate: a valuable indicator of disease. Lactate is a valuable biomarker in understanding diseases
both at the cellular and physiological levels. From a cellular perspective, as described in Sect. 2.1, lactate can be indicative of the energy consumption dynamics. From a physiological perspective, some organs are -mainlylactate consuming; for instance, it has been suggested that lactate may be an efficient fuel for the brain. Some organs are lactate clearing; for instance, the liver processes excess, non-metabolised lactate and converts it back to glucose in the Cori Cycle. Finally, some organs are lactate producing such as skeletal muscles and some are lactate clearing such as the kidney. The kidney helps clear excess lactate in hyperlactatemia. The lactate shuttle theory describes a dynamic, pH-dependant flow of lactate between organs 15 . Therefore, a major, unexpected disruption in these dynamics can be an early sign of disease.
While such dynamics are not fully understood, the relationship between lactate and many diseases is welldocumented. Insufficient delivery of oxygen to tissue (hypoxia) or insufficient perfusion of the tissue (ischemia) can have drastic consequences. The most significant tissues that are commonly affected by hypoxia or ischemia are cardiac and cerebral tissues. Severe cardiac or cerebral ischemia can cause irreversible and potentially fatal conditions, namely myocardial infarction and acute ischemic stroke. As mentioned in Sect. 1.1, insufficient oxygen delivery can also tilt the healthy balance between glycolysis and oxidative pathways towards the former, and result in high levels of lactate (hyperlactatemia). Therefore, monitoring lactate levels can contribute to improved diagnosis and prognosis in these conditions 3,5,26 . Similarly, different types of shock, caused by inadequate cardiac pump function (cardiogenic shock), severe infection (septic shock), obstruction of the vessels (obstructive shock), and decrease in blood volume (hypovolemic shock) are associated with hyperlactatemia 6,14 .
Increased lactate levels have been observed in patients with lung injury 8 , increased white blood cell activity 27 , and reduced lactate clearance capability by the liver and kidney 28 . Lactate has been shown to be an important diagnostic marker of generalised tonic-clonic seizures 29 . Increased lactate level has been linked to the progression of insulin resistance in diabetic patients 9 . Lactate has also been described as a key regulatory element in response to stress 30,31 .
Cancer is another life-threatening disease that causes elevated lactate levels. It is known that cancer cell metabolism is different from normal cells, and demonstrate an abnormal conversion of glucose to lactate, even in the presence of oxygen (aerobic glycolysis) [9][10][11][12] . A sudden change in the balance between glycolysis and oxidative metabolism can also take place as a result of important metabolic changes 32 .
Given this overview, it is not surprising that there is overwhelming evidence that underlines the relationship between increased lactate levels and increased morbidity and mortality in critically ill patients. This further highlights the importance of monitoring lactate in the early resuscitation of critically ill patients 1,2,26 . These subjects have been more extensively explored in the litrature 26,33-36 . Spectroscopic data and the small n , large p problem. In spectroscopic applications, commonly one is interested in analysing the optical properties of a molecule over a wide wavelength interval. Such intervals can include hundreds or thousands of wavelengths (variables, p ). In quantitative analysis, the target may be finding a mathematical model that maps such spectra to the concentration of an analyte. However, the time and cost associated with sample preparations and obtaining the spectra often limits the number of spectra (observations, n ) to only tens. This poses a challenge for model training and is commonly known as the small n, large p problem.
Partial Least Squares (PLS) and Principle Component Regression (PCR) are commonly used in the analysis of spectroscopic data due to two reasons. Firstly, often many of the wavelengths in the spectra contain collinear, redundant or uninformative information and both PLS and PCR can adequately eliminate the negative effects of such variables. They reduce the dimensionality of the p-dimensional data -the spectra-by projecting them onto a c-dimensional space, where c < p . Generally, the larger the number of collinear or redundant variables, the smaller can c become. The resulting variables are known as latent variables. Secondly, based on the Beer-Lambert law, there is a linear relationship between the optical absorbance of an absorbing species and its molar concentration, when the concentration is low. Both PLS and PCR are linear-or more accurately bilinear-regression methods.
The predictive ability of PLS and PCR are often shown to be very similar in practical applications. However, PLS usually achieves the same level of accuracy with less latent variables. This is due to the fact that PLS maximises the covariance between independent variables, X , and dependent variables, Y , while Principle Component Analysis (PCA) finds the axes of maximal variance in the independent variable space (eigenvectors), X 37-39 . While PLS is used throughout this study, the methods discussed here are generalisable to PCR. Moreover, apart from the first two investigated methods, i.e. the methods suggested by Frenich et al. 40 and Centner et al. 22 which rely on the coefficients of independent variables in the linear model, the rest of the methods can be generalised to nonlinear regression methods such as random forests, gradient boosted trees, support vector regression and neural networks.
(1) www.nature.com/scientificreports/ where X denotes the ( n × p ) matrix of independent variables e. g. the absorption profiles for n samples, each including p wavelengths, Y denotes the ( n × m ) matrix of dependent variables, e.g. the concentrations of m absorbing species for n spectra, T and U are ( n × c ) matrices of latent variables, also known as X -and Y-score matrices, or PLS components, P and Q are known as X and Y-loading matrices and are ( p × c ) and ( m × c ) respectively, E is the ( n × p ) matrix of residuals for X and F is the ( n × m ) matrix of residuals for Y . PLS maximises the covariance between the latent variables of X and Y as represented by the T and U matrices, cov(t, u) . Assuming a linear relationship between matrices T and U would lead to, where B is the ( p × m ) matrix of regression coefficients and G is the (n × m) matrix of regression residuals 41

Dataset
This section describes the spectra collection procedure and has previously been comprehensively described 42 . Na-lactate and isotonic PBS were acquired in dry form from Thermo Fisher Scientific (Massachusetts, USA). A stock solution of 600 mmol/L Na-lactate was prepared by dissolving 67.236 g of Na-lactate powder in 1L of deionized water. The lactate stock solution was then diluted with 1X aqueous PBS to make 47 solutions with molar concentrations of lactate ranging between 0 to 18 mmol/L. The lactate increments are 0.25 mmol/L within the range of 0-5 mmol/L (21 samples), and 0.5 mmol/L within the range of 5-18 mmol/L (26 samples). All solutions were maintained at room temperature 24 °C ( ±0.5 ), and their potential of Hydrogen (pH) values was tested to be 6.5 ( ±0.2 ). Measurement of pH was carried out using Orion Star A211 Advanced pH Benchtop Meter Kit (Thermo Fisher Scientific, Massachusetts, USA). The concentrations of lactate were verified using an LM5 lactate analyzer (Analox Instruments Limited, Stourbridge, England, UK).
FTIR spectra of samples, within 450-4000 cm −1 (2500-22,222 nm), were obtained using Spectrum Two FTIR Spectrometer (Perkin Elmer, Massachusetts, USA) with horizontal Attenuated Total Reflectance (ATR) accessory-ZnSe crystal (Pike Technologies, Madison, WI, USA) and the following settings, resolution of 4cm −1 , an interval of 1cm −1 , accumulation of 100 scans, and internal Jacquinot-stop of 8.94 mm . The lactate solutions were run at random to prevent any potential temporal bias.
All pre-processing operations and implementation of algorithms were carried out in MATLAB R2018a. Figure 1, depicts the dataset consisting of 47 ATR-FTIR spectra of lactate in PBS solutions. The concentrations of lactate vary between 0-18 mmol/L. Figure 1a shows the raw absorbance spectra, Fig. 1b shows the spectra with the PBS spectrum subtracted.

Wavelength selection
Although PLS and PCR address some of the issues that rise from the large p , small n nature of spectroscopic data, it has been shown that the inclusion of uninformative wavelengths in the training process negatively affects the accuracy of predictions as well as model interpretability [19][20][21][22]43 . Moreover, from a more practical point of view, the identification of the specific wavelengths, or regions of the optical spectrum, that contain information about chemical species, significantly reduces the time and cost associated with monitoring them and facilitates the development of portable, high-speed sensors.
The straightforward application of statistical variable selection techniques, such as forward selection and backward selection, to small n , large p problems can lead to extreme overfitting. This is due to the excessive model selection/rejection steps, which could exponentially increase with the number of variables for some methods. This excessive selection/rejection can undermine statistical measure of significance, cause type II errors, and lead to overfitting 44,45 . www.nature.com/scientificreports/ What follows provides a brief overview of some wavelength selection methods. In particular, wavelength selection methods may be categorised into three main groups (1) simple strategies for selection/elimination of variables based on the variables' weights (coefficients) in predictive models (2) interval selection methods and (3) heuristic global search methods. These methods are briefly described below. Subsequently, some of the wellknown methods from each group are applied to the lactate dataset and the results are compared.
Throughout this article, in order to ensure that validation sets are sufficiently representative of the dataset, they are chosen in the following way. To select n v data points for validation, the dataset is first sorted based on the concentration of lactate and is then split into n v blocks, such that all blocks contain an equal number of spectra or one less when the number of spectra is not divisible to n v . Subsequently, a data point is randomly selected from each block to produce the required n v validation data points.
The validity of models is assessed using cross-validation and an independent test set. Due to the need for manual intervention in some cases as well as the computationally demanding nature of some of the investigated wavelength selection methods, it is impractical to perform wavelength selection within each cross-validation loop. Therefore, cross-validation is performed on the models with selected wavelengths. This is expected to bias the cross-validation results and underestimate RMSE due to data leakage. To mitigate this issue, Monte Carlo cross-validation (MCCV) is used. In MCCV, a dataset containing n samples is randomly split into a validation set of size n v and a calibration set of size n − n v . The model is then calibrated on the n − n v observations and predication errors are calculated for the n v held-out observations. This process is repeated for a predetermined number of times ( N ) to estimate the prediction error. RMSECV MC is defined as, where y ij is the predicted value of the jth data point in the validation set and in the ith iteration of MCCV and y ij is the associated reference value. Compared to the two widely used leave-one-out and leave-n v -out ( n v > 1 ) cross-validation routines, MCCV (similar to leave-n v -out) places more emphasis on the validation process by including more than one observation in the validation set. In leave-n v -out, however, all of the possible ways in which the data can be split into calibration and validation sets need to be examined ( i.e.C n n v = n n v events), while in MCCV only a random subsample of such events is examined. Therefore, MCCV is more computationally tractable than leave-n v -out and less susceptible to overfitting than leave-one-out 46 . Finally, the cross-validation results are complemented with the prediction results, RMSEP and coefficient of determination %R 2 Test , pertaining to a representative independent test set of eight, held out from all pre-processing and wavelength selection steps, and only used for the validation of the models with selected features. RMSEP is given by, where n t denotes the number of observations in the independent test set. %R 2 Test is defined as, where − y is the mean of the dependent variable in the test set.

Coefficient-based variable importance. Garrido Frenich et al. proposed a simple method based on
PLS regression coefficients ( b) 40 . In this method, similar to the interpretation of parameters in standard linear regression, the PLS coefficients-B in Eq. (3)-are considered as a measure of variable (wavelength) significance, although in small n , large p problems there is little statistical justification to do so, and type I errors are to be expected. Moreover, a large coefficient can also represent a variable with small absolute value and large variance. In order to mitigate the error that could be caused by such variables, each variable is scaled with the inverse of its standard deviation. The variables with high regression coefficients, b ′ i , for the standardised matrix, X standardised , are considered to be important variables 40 , where b ′ i is the element of the coefficient vector B ′ that corresponds to the variable i.
A disadvantage of the method described above is that it does not consider the uncertainty of b ′ coefficients. In particular, these values can exhibit drastic changes in successive training with resampled data. To demonstrate this, the histograms of three coefficients obtained with MCCV for the lactate dataset are depicted in Fig. 2.
Centner et al. proposed a method, Elimination of Uninformative Variables (EUV) that similarly uses the regression coefficient to assess the importance of variables 22 . The EUV captures the uncertainty of b ′ i coefficients by considering a reliability measure, that is equivalent to the t-static for the null hypothesis of H 0 : www.nature.com/scientificreports/ where c i is the reliability of the variable i , − b i and σ (b i ) are the mean and standard deviation of the PLS regression coefficient for the i th variable (wavelength). These estimates are obtained using a resampling method, for instance, leave-one-out. The reliability values below a certain threshold may be deemed uninformative and, therefore, unselected. To find a good threshold, the spectra are augmented with random values and the cut-off reliability is selected as max(abs(c artif )) . The application of this method to the lactate dataset leads to the elimination of almost 70% of variables.
Interval selection techniques. The optical spectra have locality features. In other words, the information about the concentration of chemical species is focused on certain regions of the optical spectra. The incorporation of this assumption into the search strategies yields a different class of wavelength selection techniques. A simple way to achieve this is to split the spectra into equidistant intervals and treat each interval as a variable. As a result, the dimensionality of the space of variables can be significantly reduced, from thousands of variables to tens of intervals. This has the additional benefit that many classical techniques such as Forward Selection (FS) and Backward Elimination (BE) can be applied to these intervals. This is the key insight behind interval PLS (iPLS) proposed by Nørgaard et al. 47 , and its variations such as forward iPLS (combination of FS and iPLS), backward iPLS (BE and iPLS), synergy iPLS (combinatorial interval selection), and GA-iPLS (selection of interval combinations using a genetic algorithm). Nørgaard et al. also suggested micro-optimisations, such as onesided or double-sided expansion and truncation of the "best interval". The number of intervals is a key parameter that could have major implications on the outcome. In this study, two values for the number of intervals were used, 20 and 40.
While iPLS can provide a good representation of the informative regions of the optical spectrum, the two requirements of equidistant intervals and the predetermined number of intervals are restrictive. As a result, there is very little chance that the optimal interval can be found using this approach. Moving Window PLS (MWPLS) relaxes these requirements by selecting an interval with a predetermined length at one end of the spectrum and then moving the centre of the window one wavelength at a time 48 . As an alternative to RMSECV and RMSEV, Jiang et al. suggested a criterion based on the uncertainty of the local models. This uncertainty is reflected in the number of latent structures (the number of PLS components) and the Sum of Squared Residuals (SSR) of training. Multiple models with different complexities (PLS components) are developed for every interval. The goodness of an interval is then judged by how the addition of PLS components improves the SSR. If the SSR is relatively high or if it continues to significantly improve with the addition of PLS components, this suggests that these components are used to model inherent uncertainties, and therefore, the interval is not informative. In summary, according to this criterion, a good interval is an interval that obtains low SSRs with only a few components and the addition of extra components does not significantly improve the results.

Heuristic global optimisation search methods.
A different approach is the use of heuristic global optimisation methods to find a suboptimal combination of variables. Here the problem can be formulated as an optimisation problem where the objective is to minimise a cost function, such as RMSE of Calibration (RMSEC), or RMSECV, by choosing the right combination of variables. Genetic algorithms, a class of biologically-inspired, evolutionary algorithms have found many applications engineering, biomedicine, chemometrics, genomics, and spectroscopy due to their ability in solving complex, nonlinear optimisation problems. Many variations and different implementation of this method can be found in the literature [49][50][51][52][53][54][55] .
Genetic algorithms are inspired by Darwinian selection, i.e. the survival of "good" genes in offspring. The algorithm begins with a random population of chromosomes or candidate solutions. In a variable selection problem, each chromosome can be coded as a bit-string, where a one represents selection and a zero represents deselection. Each chromosome is then assigned a fitness value or an unfitness value in a minimisation problem. www.nature.com/scientificreports/ Here the unfitness value can be RMSEV, RMSEC, RMSECV. Candidate solutions (chromosomes) that produce better results will have a higher likelihood of passing on their genes to the next generation (selection), recombining their -likely good-genes with other fit genomes (crossover), and finally passing on their genes to the next generation of genomes with some random mutation. The last operation is necessary to ensure that the optimisation landscape will be explored and reduce the likelihood of getting stuck in local minima. This process will continue for a predetermined number of generations or until a convergence criterion is met.
The proposed GA method. In large p small n problems, genetic algorithms are likely to lead to over-fitting. A rule of thumb to avoid this is to keep the ratio of the number of variables, p , and the number samples, n , small. It has been suggested that depending on the noise levels the ratios ( p/n ) of greater than five are very likely to lead to overfitting 55 . In the lactate dataset, this ratio is greater than 80, i.e. 3151 wavelengths and 39 samples in the training set. However, there are some measures that can mitigate this issue, such as using intervals rather than variables (similar to iPLS) or using a dependant validation set to stop the operations when the RMSEV starts to increase (early stopping). As mentioned in Sect. 4.2, using intervals imposes extremely limiting constraints which leaves very little chance of obtaining a minimal and optimal selection of wavelengths. Therefore, in the proposed method we use the bit-string representation of wavelengths. The unfitness function is defined as MSEC, however, in order to reduce the likelihood of over-fitting three strategies are incorporated in the GA method; • Firstly, in each run of GA, the training set is randomly split into a validation set and a training set. The RMSEV for the randomly selected validation set is only used once to evaluate the goodness of the final solution for that GA run. If the final solution obtained in a GA run produces a RMSEV above a threshold, it will be discarded. Here the threshold is defined as 1.5 × RMSEV fullmodel . • Secondly, the evaluation of the fitness function for each chromosome is done using MCCV with 100 resampling iterations. In particular, the training set is randomly split into a training set and a validation set and the MSEC and MSEV are calculated. This resampling process is repeated 100 times and the resulting MSECs and MSEVs are averaged to produce a more robust estimate of MSEC and MSEV , denoted by MSEC and MSEV for a given chromosome. This layered design minimises the likelihood of overfitting since it ensures that the solution has produced good results across thousands of resampled datasets (a) in every fitness evaluation step and (b) along different generations. Moreover, it produces a number of important statistics that can be incorporated in the objective function, for example, the mean and standard deviations of the MSEC and/or MSEV. In the current study, only the average estimate of MSEC is used in the objective function, but a comprehensive analysis of different objective functions will be carried out in the future. • Finally, the genetic algorithm is run 100 times, and the importance of variables are calculated as the average number of times that a variable is selected. Figure 3 depicts the results for one run of the proposed GA-based wavelength selection on the lactate dataset. Figure 3c demonstrates a smooth convergence to a minimum, validating the choice of RMSEC as a measure of chromosome unfitness. Figure 3a,b show RMSEV and RMSECV MCCV respectively, both suggesting that the proposed method successfully avoids overfitting. Finally, Fig. 3d shows the selection of wavelengths (denoted with black pixels), in the fittest chromosome of the generation. It can be seen that the initial randomly selected variables in the first generation gradually converge to a solution.
In order to ensure that the solutions in different runs are stable, they are arranged in rows of a matrix, G n×p . The matrix H = G t × G , is a (p × p) matrix, where G t is the transpose of G . Each element of this matrix, h ij , shows the number of times that wavelength j was selected when wavelength i was selected. Some similarities between matrix H and 2D correlation synchronous plot are to be expected, however, matrix H highlights 1. wavelengths that most contribute to the prediction of lactate given the underlying model (here PLS) 2. Shows how consistent the pattern of selection deselection in different runs of the GA are. The matrix H and the 2D correlation synchronous plot are depicted in Fig. 4a,b. The probability of selection in different runs of GA for wavelength w i is denoted with P GA (w i ) and is defined as the number of times that w i was selected divided by the number of GA runs. This is depicted in Fig. 4c.
Interestingly, it can be seen that while unlike iPLS and MWPLS, the locality of features is not hardcoded in the method, local features have emerged, i.e. adjacent wavelengths have relatively similar P GA (w i ) values. Moreover, the results are much more granular and easier to interpret compared to both the aforementioned methods. Apart from the major peaks in the finger print region (i.e. 500 − 1500cm −1 ) the region between 3260 − 3650 cm −1 contains a few significant intervals, particularly the interval 3490 − 3520 cm −1 which has previously been linked to α-hydroxy-esters methyl lactate 56 .

Results
In this section, the eight reduced models, and the complete model are compared. The model based on the interval selected by iPLS with 20 intervals and 40 intervals are denoted by iPLS20 and iPLS40 respectively. The two FS-iPLS models are similarly denoted by FS-iPLS20 and FS-iPLS40. The method presented in Garrido Frenich et al., is denoted with β . Prior to the analysis the optimal number of components for all 12 models are selected as the elbow point in the Predicted Residuals Error Sum of Squares (PRESS) plot. The model based on MWPLS uses the interval of 1050-1430 cm −1 which is chosen based on the criterion set out Jiang et al. 48 . For the proposed GA model, all wavelengths with P GA (w i ) smaller than 0.6 have been eliminated to produce a reasonable reduction in the number of wavelengths. A more rigorous approach to select this threshold could be the use of RMSECV www.nature.com/scientificreports/ and RMSEV, but in the present study, this has not been considered for simplicity. Figure 5 demonstrates the variables selected by each method. The comparison of the nine models is summarised in Table 1. RMSECV is obtained using MCCV and on models with predetermined wavelengths. 10,000 resampling iterations are used and in each iteration 5 samples are held out for validation and the remaining 42 samples are used for training. Table 1 demonstrates the potential of variable selection methods for obtaining more accurate predictions while relying on a much smaller number of variables. The method proposed by Garrido Frenich et al., denoted by β , obtains a comparable performance with the complete model while significantly reducing the number of variables. The EUV helps eliminate two-third of uninformative variables while reducing the RMSECV compared to the complete model and delivering a comparable RMSEP Test .
The iPLS20 model exceeds the predictive ability of the complete model while only using the interval of 1090-1240 cm −1 . This interval is associated with the stretching modes of C-C, C-O, C-O-C, and C-OH  www.nature.com/scientificreports/ Figure 5. (a, b) The method proposed by Garrido Frenich et al. 40 and EUV respectively. The upper part of the plot depicts the b i ′ (in Fig. 5a) and c i (in Fig. 5b)  Interval selection methods such as IPLS and MWPLS are highly constrained by the locality assumption. These methods iteratively evaluate local models and completely ignore the potentially significant effects of combining non-neighbouring wavelengths (a) in the assessment of wavelength importance, and (b) in the determination of the optimal combination of wavelengths for quantitative analysis. Modifications such as FS-IPLS to some extent address this limitation, however, as it can be seen from Fig. 5c,d firstly they do not provide a comprehensive interpretation of all relevant wavelengths since they only search through a small subset of all possible combinations and ignore k-combinations, k > 1 . Secondly, the selection criterion is highly dependent on the choice of the number of intervals. The method proposed by Garrido Frenich et al., β , and EUV do not suffer from the restrictive assumptions of interval selection methods. They provide a good understanding of all relevant wavelengths -although with potentially many false positives-, but they fully ignore the combinatorial importance of variables. Also, the wavelengths selected based on these methods can be many and spread across the spectrum, therefore, they might be of little help for developing optical sensors with minimal wavelengths.
The proposed GA based method provides a detailed and unconstrained measure of combinatorial importance of variables (Fig. 5f). It can also be seen from Table 1 that a good trade-off between the number of variables and the accuracy of predictions can be achieved by selecting variables that were in the final solution in more than 60% of GA runs. However, the simple strategy of only selecting wavelengths that appear in most of the GA run solutions is outperformed by PLS20, iPLS20-FS, and iPLS40-FS. It is possible a combination of these variables and less frequently selected variables may be necessary to deliver predictions with higher accuracies.

Conclusion
A comprehensive investigation of the application of wavelength selection methods was carried out on a lactate and PBS dataset and it was shown that significant reductions in the number of required wavelengths is not only possible but could also be the key for achieving higher accuracies. Additionally, a new genetic-algorithm based variable selection method was proposed and it was shown that the method can provide an unconstrained and granular interpretation of wavelength importance. The proposed GA method successfully avoids overfitting, provides stable solutions, and provides a valuable interpretation of the combinatorial importance of wavelengths. However, the simple strategy used here for the extraction of variables from the GA method does not provide the highest predictive ability amongst the investigated methods. To address this, our future effort will focus on the incorporation of model parsimony in the objective function.
Moreover, in our future work, more complex matrices, namely serum, and blood will be investigated. In these solutions, more scatterers and absorbing species with similar molecular structures to lactate are present. The former can lead to a lower signal-to-noise ratio and the latter can result in overlapping peaks between these absorbers and lactate. Additionally, the interaction of lactate with other chemical entities in the blood can complicate the analysis. Nevertheless, as mentioned in the introduction section, previous studies that have used broad regions of the optical spectrum for estimation of lactate in blood, have reported promising results. The authors expect that the comparison of wavelengths pertinent to the prediction of lactate in PBS and in blood, will help conclude which of those are more suitable for practical applications. Finally, since the NIR region of the optical spectrum is more appropriate for transcutaneous applications, similar studies will be carried out with a focus on the NIR region. Table 1. The comparison of the parsimonious models derived using different wavelength selection methods in the estimation of lactate concentration. RMSECV is obtained for the models with selected features using a Monte Carlo cross-validation with 10,000 resampling iterations. RMSEP Test is obtained for an independent test set of 8 randomly selected spectra. R 2 Test % is the coefficient of determination for the test set.