Introduction

Spectroscopic data is widely employed to characterize sample composition and to relate function to the compositional information. In particular, optical spectroscopy methods, notably Raman spectroscopy1,2, laser induced fluorescence (LIF)3, infrared (IR) absorption4 and laser induced breakdown spectroscopy (LIBS)5, has received considerable attention for high-energy materials (HEM) (explosives) detection because of the lack of sample preparation requirements and ability to provide real-time assessment in stand-off mode. While the vibrational spectroscopic approaches provide a wealth of molecular information, LIBS furnishes complementary elemental information and offers higher signal sensitivity due to the intense emission lines. Researchers have exploited the single-shot and facile elemental analysis capability of LIBS in diverse fields ranging from planetary exploration6,7, archaeology8 to pharmaceutical formulation identification9, HEM screening10,11 as well as in other biomedical12,13,14 and polymer applications15. The incorporation of statistical models in the analysis of LIBS coupled with the seminal work by Miziolek and co-workers16 and stand-off operations at Los Alamos National Laboratory17 has further enhanced its attractiveness for evaluation of complex materials of stochastically varying compositions.

Specific to HEM detection, there are two principal research goals: first, to identify a specimen as a HEM and second, to classify the specimen as a specific HEM. Yet, HEMs typically are nitro-rich molecules and exhibit grossly similar spectral profiles with prominent emission lines corresponding to nitrogen, carbon, hydrogen and oxygen. Furthermore, the spectral interference emanating from trace quantities of grease, oils18 and biomaterials19 as well as the presence of oxygen and nitrogen in the ambient air and hydrogen and oxygen in the atmospheric moisture impedes the reliable detection of the materials in the field. Consequently, the basic spectroscopic approach of employing atomic (and in cases molecular) intensity ratios for these kinds of materials yield very limited results. Also, the substrate effects for HEM residue detection, which have been detailed by Gottfried et al.20, exacerbates the segmentation challenges due to the potential alterations to the light emission from the microplasma. Taken together, it is not surprising that non-analyte specific peaks appear in the spectral profile due to the entrainment of air and the (variable) substrate information21. Approaches that partially alleviate these issues include noble gas purging and double pulse method. However, purging is not a viable option for on-field application and while the double pulse method shows some improvement in the signal enhancement22, purely ambient air peaks (e.g. oxygen and nitrogen) cannot be avoided.

In this milieu, the application of multivariate chemometrics methods to LIBS datasets has enabled the extraction of information both amenable to and hidden from human examination and the resultant combination has been successfully utilized for materials’ identification23,24. The underlying hypothesis is that while assessing the spectral signatures of several specimen shows significant complexity, comparing hundreds of such spectra reveal subtle, yet reproducible, patterns that are valuable for objective sample identification. Various multivariate segmentation methods including principal component analysis (PCA)16,25, soft independent modeling of class analogy (SIMCA)21, partial least squares-discriminant analysis (PLS-DA)16,21,23, artificial neural network (ANN)26 and support vector machine (SVM)27 have been exploited to analyze the rich LIBS datasets. Concomitant advances in instrumentation, especially the use of echelle spectrographs, have further enabled measurements of high resolution thereby providing larger number of spectral features for classifier development.

However, the massive amount of data collected puts forth a different challenge as far as the development of a robust classifier is concerned. Given the high dimensionality of the LIBS dataset and the relative sample sparsity (stemming from the unavailability of large numbers of HEM samples), the quest for the ‘perfect’ classifier can often produce unwarranted conclusions and lead to apparently functional algorithms that cannot be used prospectively. Boosting the sample per feature ratio by appropriate selection of spectral regions can not only produce classifiers that are more insensitive to outliers and possess higher generalization power but also aid in interpretability by identification of the smallest possible subset of maximally discriminatory features. Recently, considerable effort has been directed towards developing and evaluating different procedures that objectively identify wavelength bands that contribute useful information and/or eliminate regions that contain mostly noise or spurious information1,21,28. In principle, judicious wavelength selection for spectroscopic datasets focusing on marker-specific features could even result in models having a greater predictive ability. Despite these advantages and the suitability of LIBS for feature selection due to its inherently narrow emission peaks, wavelength selection for building LIBS-based segmentation models has received scant attention in the literature.

Here we present two distinct approaches for wavelength selection in LIBS spectra and investigate their relative merits in influencing the classification outcomes for a set of HEM samples. Specifically, we establish segmentation models based on: (A) spectral regions selected on the basis of a priori knowledge of the chemical composition of the samples; and (B) spectral regions determined through implementation of genetic algorithm (GA), a powerful search heuristic that mimics the process of natural selection. First, we examine four spectral regions as well as the full spectrum in constructing PLS-DA derived classification models. PLS-DA is used as a representative multivariate segmentation framework, with the understanding that classification techniques for LIBS and vibrational spectra often provide similar levels of accuracy in prediction due to the richness of the spectral data. The spectral regions were based on the presence of the prominent peak of the organics in the following manner: (a) R1 region that combines the carbon and hydrogen peaks, (b) R2 region that consists of the spectral window spanning from the carbon to the CN lines, (c) R3 region covering the prominent organic window from carbon to hydrogen and (d) the R4 spectral region encompassing the atmospheric window of the oxygen and nitrogen peaks. Based on this categorization, we find that the correct classification rate using only the R3 region is nearly equal to that of the full spectrum, although it uses ~63% of the intensity values recorded at the detector. Evidently, this finding opens the door for using a detector with significantly smaller spectral coverage that also reduces the possibility of overfitting to variable, yet uninformative, regions of the spectra. Even more significantly, we observe that using the probabilistic, non-local GA search process, a spectral subset featuring an order of magnitude less wavelength information than the full spectrum is capable of surpassing the prediction performance of the latter. The approaches described herein are also sufficiently broad and general to address critical applications in disease diagnosis as well as pharmaceutical and forensic analysis. We also envision that rigorous validation of limited wavelength subsets may provide much-needed momentum to the development of discrete filter-based detector units that can fundamentally reduce the cost, complexity and spatial footprint of the echelle spectrograph – intensified CCD combination of present day LIBS platforms.

Materials and Methods

Experimental studies were undertaken to accomplish the following objectives: (1) discern the relevant discriminatory features (i.e. the differential spectral markers) for the HEM samples under consideration; and (2) use the selected features to establish decision algorithms for HEM sample recognition and determine their sensitivity. For the sensitivity analysis, the rate of correct classification (and corresponding misclassification and unclassification) was computed when all classes of the HEM samples were split into training and testsets.

Experimental section

A detailed description of our LIBS experimental setup was provided in our previous article29. Briefly, a frequency-doubled Nd:YAG laser of 7 ns pulse duration was used as the excitation source. The laser beam was focused by a 10 cm plano-convex lens onto the sample to create the plasma. The plasma emission was collected with a collimating lens assembly CC 52 (Ocean optics Inc.) that was then guided through an optical fiber (core diameter: 400 μm) to the spectrograph (ANDOR Mechelle 5000) coupled to an iSTAR DH734 ICCD detector. The samples were subjected to ca. 22 mJ of laser energy with the pulse energy output measured using a pyro-electric joule meter (Molectron, EPM 2000, Coherent Inc). The experiments were carried out in ambient air at atmospheric pressure. A gate delay of 1 μs and gate width 2 μs was used for signal acquisition.

A set of HEM samples namelyoctahydro-1,3,5,7-tetranitro-1,3,5,7-tetrazocine (also known as high-molecular-weight RDX (HMX)), nitrotriazalone (NTO), pentaerythritoltetranitrate (PETN), 1,3,5-trinitro-1,3,5-triazacyclohexane (RDX), 2,4,6-trinitrotoluene (TNT) were acquired from the High Energy Materials Research Laboratory (HEMRL), Pune, India. The fine powder material was pressed into 1 cm diameter pellets by a die-hydraulic pressing machine by applying 3–4 tons of pressure. The pelletization operation improves signal reproducibility, as the position of the focal spot is almost unchanged for all the laser pulses arriving at the pellet surface in sharp contrast to a powder specimen. The pellets were pressed to within 91–94% of the theoretical maximum density for each explosive. The samples were placed on a manual X-Y-Z translation stage such that spectra could be collected from different spatial locations in the pellet to estimate measurement variability.

For our present investigations, 472 spectra were acquired from the HEM samples. Specifically, 133 spectra for HMX, 75 for NTO, 60 for PETN, 61 for RDX and 143 for TNT were included for the ensuing classification analysis. Each spectral profile represents the signal averaged over 3 consecutive pulses. It is worth noting that the acquired spectra were screened for further analysis by subjecting them to a minimum signal-to-noise ratio threshold and performing spectral outlier detection using correlation distance. In order to prevent the deleterious impact of spectral artifacts, chemometric analysis was performed without the application of any additional preprocessing methods.

Data Analysis

Variable selection by genetic algorithm

Wavelength selection for spectroscopy is typically tackled through a priori knowledge about the composition of the target specimen. However, as alluded to earlier, complex sample matrix as well as the environment contribute substantive signals from interferents that overlap with the principal features of interest. In order to choose a globally optimal feature subset capable of offering maximal and robust discrimination between the considered HEMs, we employed GA. GAs have provided robust solutions across many applications, offer the possibility to converge on optimal solutions swifter than multidimensional raster grid searches and possess the ability to escape local maxima during the search for an optimal global solution30,31. GAs are based on ideas developed in evolutionary biology, where different potential solutions are treated as competing individuals in an evolving population. They have been gainfully utilized for spectral feature selection, where the central idea is the manipulation of binary strings (chromosomes) that contain genes which in turn encode experimental factors or variables32,33.

Here an initially random population of individuals, or “chromosomes,” was generated with certain characteristics. The optimized GA parameters were: population size, 64 chromosomes; window width, 10%; % initial terms, 1%; max number of generations, 100; percent at convergence, 50%; mutation rate 0.005; regression algorithm and # of LV, PLS and 10 LV’s; cross validation, random and three splits for multiple iterations. For our analysis, we selected spectral subsets in increments of 1% of the total population (25699 points at which intensity values were recorded) with a minimum of 1%. The optimal spectral subset for each increment was selected by maximizing the correct classification rate in a cross-validation routine in the training population.

PLS-DA multivariate analysis

Although it was originally developed as a multivariate regression method, PLS34 has been extensively used to solve classification problems by encoding the class membership of the measured samples in the target matrix35,36. An important feature of this supervised technique is that it is specifically suited to deal with problems in which the number of variables is large (compared to the number of observations) and collinear, two major challenges encountered when LIBS spectra are used. Furthermore, PLS-DA provides a powerful tool for discrimination even when the variability within a class is similar to the inter-class variability by virtue of establishing the maximal separation between each class via fitting one global model to the entire dataset. In PLS-DA, a PLS regression model is calculated that relates the independent variables (spectral data) to a binary response encoded as: {1, 0, 0, 0, 0} for sample belonging to class 1, {0, 1, 0, 0, 0} for sample belonging to class 2 and so on until class 5. Typically, a threshold value based on the Bayesian method37, is defined between 0 and 1 and calculated on the basis of the values predicted during the classification process, so an object is assigned to a particular class if its prediction is larger than the threshold value for this class. As in the PLS regression model, the optimal number of latent variables (LVs) retained are chosen before the modeling process and this is determined by using the fractional misclassification error rate in cross validation38.

Sensitivity Test

The sensitivity investigations were performed for the two approaches(region selection based on a priori compositional information and GA) by randomly splitting the 472 sample datasets into 282 for training, 95 for tuning and 95 for test, respectively. Wavelength selection was performed on the training set whereas the tuning set was used to determine the unclassification thresholds for the developed classification model. The test set was kept aside for final classification purposes. The size of the test set represents a compromise between opting for a test set size large enough to minimize the standard error in computing the misclassification rate, while maintaining the training and tuning set sizes large enough to construct and optimize a robust classifier. To obtain more representative rates of correct classification, misclassification and unclassification, 100 independent iterations were performed by re-splitting the entire data into training, tuning and test sets. In the following, we describe the procedure for application of each classifier for one iteration where the dataset has already been randomly split into training, tuning and test sets.

Additionally, to determine the number of the LVs we employed a leave-one-out cross-validation procedure in the training set. Also, for predicting the class of a spectrum, we used equally weighted scaled orthogonal and score distances. Here, using the tuning set, we defined an unclassification criterion to prevent the misclassification of potential samples that are distant from the center of the PLS-DA model. Based on the assumption of normal distribution of the distances of the tuning dataset to the center of the PLS-DA model, distance thresholds were established (mean ± 3.5*standard deviation) to unclassify correspondingly distant test samples. In other words, if the membership probability for every class is observed to be <0.3%, the test sample was categorized as “unclassified”.

Results and Discussion

Spectral observations

Figure 1 shows a representative spectrum of each of the five HEM classes measured on our LIBS system. As observed from Fig. 1, the acquired spectra for the 5 classes of HEM samples, namely HMX, NTO, PETN, RDX and TNT, show the presence of emission peaks associated with C (247.8 nm), Mg (279.5 & 280.3 nm), Ca (393.3, 396.8, 422.7 nm), H (656.3 nm), N (742.4, 744.3, 746.9, 818.4, 818.8, 821.6, 824.2 nm), O (777.2, 777.4, 794.8, 822.2, 822.7, 844.6, 868.1 nm) and Na (589, 589.6 nm). Evidently, the oxygen peak at 777.3 nm has the highest intensity followed by CN at 388.3 nm, which along with the C2 peaks are representative of organic molecules. In addition, the CN violet bands are observed corresponding to B2Σ+ → X2Σ+ transitions at three regions with Δυ = −1, 0 and +1 at 357–360 nm, 384–389 nm and 414–423 nm ranges, respectively. C2 swan bands corresponding to D3Πg → a3Πu transitions are also seen in two regions with Δυ = −1 and +1 at 460–475 nm and 550–565 nm ranges, respectively. While the HEM specimen show expectedly near-identical peaks, careful inspection reveals the subtle, but reproducible, peak intensity variations across the samples. The underlying basis of feature selection is to exploit these intensity variations not only at the notable peaks (which often do not provide the highest diagnostic power) but across the spectrum to obtain an accurate classifier and, ideally, identify the smallest spectral subset of discriminatory features.

Sample compositional knowledge based spectral region selection

The recorded spectra consist of intensity values at 25,699 pixels of the detector, which spans a wide wavelength region from the ultraviolet to near-IR. While the LIBS spectrum consists of several sharp peaks attributed to the elemental (and some molecular) emission lines, a substantial portion of the feature space is composed of minimal signal beyond the noise floor (Fig. 1). However, very rarely is there a concerted effort in choosing the features that is rooted in the context of the application (for example, chemical composition of the samples) – rather for reasons of expediency the full LIBS spectrum is routinely employed in classification. Though inclusion of noise39 and continuum40 may not significantly hamper the prediction performance in specific cases, feature selection in general is critical to eliminating spurious correlations especially in applications where the inter-class differences are subtle. Here based on the prominent spectral features of CN, C2, C, H, O and N, we selected four different regions R1, R2, R3 and R4 as detailed in Table 1. For R1, carbon and hydrogen-specific peaks are included since carbon and hydrogen are the dominant peaks of HEMs and are less influenced by the surrounding ambient atmosphere in comparison with oxygen and nitrogen features. Region R2 is based on the peaks representative of C and CN. The formation of CN depends on the content of carbon and nitrogen in the sample, besides the nitrogen contribution from the atmosphere. Region R3 provides the cumulative information of R1, R2 and C2 which is expected to indicate the presence of aromatic double bonds (C = C)41. Finally, region R4 consists of the higher wavelength spectral region following the hydrogen peaks and has contributions from both the HEM specimen and the atmosphere.

Using PLS-DA derived classifiers classification performance was determined for each of the spectral regions. Table 2 enumerates the results of correct classification, misclassification and unclassification rates. Here the maximum correct classification rate (92.61%) was obtained when the full spectral profile was considered as the input to the classifier. The misclassification of 4.65% was mainly due to the incorrect recognition of RDX samples that were misclassified as HMX. This misclassification can be attributed to the similar chemical formulation of HMX and RDX and, notably, to the significantly larger number of HMX samples in the dataset that likely ‘skew’ the decision algorithm. The unclassification rate was almost the same for all classes and, as noted from Table 2, for the four spectral regions and the full spectrum. On the other hand, the misclassification rates varied for the selected spectral regions from ca. 6.8% for R3 to 28.3% for R1. It is noteworthy that despite the presence of the prominent carbon and hydrogen peaks in R1, its diagnostic power is significantly lower due to the lack of significant variations in this region between the 5 HEM sample classes. Perhaps, more importantly, we observe that there is barely a 2% reduction in the classification performance between R3 and the full spectrum, even though R3 constitutes only ~63% of the full spectrum intensity values. This is not unexpected as the main emission lines that are not incorporated in the R3 come from oxygen and nitrogen – contributions that are marred by atmospheric interference. Evidently, decreasing the size of sampled points not only reduces the redundant and uninformative regions but also raises the possibility of using a spectrometer that is focused only on the lower wavelength region for HEM classification with equal efficacy.

Unsupervised feature selection using genetic algorithm

While a priori knowledge of the sample aids in selection of broad wavelength bands, a more rigorous wavelength selection strategy could potentially yield an even smaller subset of discriminatory features. To this end, we employed genetic algorithm for interrogating the large search space in which extremely diverse combinations of wavelengths must be considered. Figure 2 shows the results of classification for HMX, NTO, PETN, RDX and TNT obtained with the GA/PLS-DA model, where the lengths of the bars are proportional to the average error rate (root mean square error of cross validation, RMSECV) in the training set and the associated error bars represent the standard deviation over 100 iterations. Here RMSEV is computed by comparing the PLS-DA estimates with the assigned numeric values for the reference classes in a quadratic loss function. This provides a direct assessment of the performance of the classification models corresponding to the selection of wavelength subsets, ranging from 1% to 90% of the total population of spectral points (~250 to 23,129 spectral points, where the total population is 25,699). A decreasing monotonic trend is observed for the RMSEV values as the number of points included in the classification analysis is reduced from 90% (0.212) to 10% (0.182) (highlighted in the second panel of Fig. 2). Clearly, this affirms the potential to surpass the prediction performance of full spectral analysis due to the ability to decompose the spectral data matrix in a manner biased toward the isolation of analyte-dependent information – a result that exceeds the findings of the previous approach. At the resolution of our current analysis, 10% of the points sampled appear to provide the optimal balance of selected variables and discriminatory power of the classification model. As the number of wavelengths sampled is further reduced from 10%–1% (0.216) of the total population, we observe a non-monotonic increase in the RMSEV values. Nevertheless, it is notable that selection of only 1% of the wavelengths provides nearly equivalent levels of prediction accuracy (RMSEV = 0.216) as when 90% of the full spectrum is employed. In fact, the corresponding p-value (0.54) indicates the absence of statistically significant differences between the classification performance of the models employing 1% and 90% of the points.

Based on the RMSEV results, one can reasonably infer how many spectral points provide relevant information specific to the analyte of interest and the spectral interferents in the samples – and therefore the minimum size of spectral dataset required for optimal classification performance. Specific to GA implementation, an effective approach for reducing the number of final wavelengths is to reduce the number of wavelengths selected in the initial chromosome42. Moreover, with the use of lower spectral resolution, the GA optimization efficiency can be further improved and the number of wavelengths in the optimal chromosome can be further decreased.

Table 3 gives the results of sensitivity analysis for the GA/PLS-DA based classification with 90%, 10% and 1% of the selected wavelengths. The model is judged to be sensitive if the correct classification rate is high and the unclassification rate is low. For the 90% case, we observe a high rate of correct classification for all classes with an average rate of correct classification of ca.93%. In particular, no misclassifications are observed for TNT and PETN. However, relatively high rate of misclassification is obtained for the RDX sample that mirrors the findings of the previous region-specific PLS-DA models. Also, from the unclassification perspective, the PETN samples perform relatively poorly in relation to the other HEM classes, which imply the existence of spectral outliers for this class of samples. From the corresponding GA/PLS-DA-based classification with 10% (~2,569 spectral points) of the wavelengths, we find that the average rate of correct classification is even higher (ca. 94.2%), with the TNT, NTO and HMX samples being classified with very high accuracy (≥96%). The improvement in the average rate of correct classification from 90% to 10% is statistically significant (p < 10−4). Additionally, the average rate of unclassification for the 10% case (2.3%) is slightly smaller than in the case of 90% of the selected variables (2.6%) (p < 10−4). Finally, when only 1% of the wavelengths are included in the analysis framework, we see that the correct rate of classification is slightly lower (90.7%) in relation to the 90% case and, similarly, the average rate of unclassification is also slightly higher (3.2%). The import of this result cannot be understated since nearly same levels of performance are observed while reducing the number of wavelengths used by 2 orders of magnitude (from ~25 k to barely 250). Table S1 (in Supporting Information) also shows the robustness of these wavelength-selected PLS-DA models in correctly identifying and unallocating spectra from samples of an unknown class.

Given the significant reduction in the number of sampled points for the 1% case, a critical question that needs to be addressed is the relative consistency of the selected wavelength subset across multiple iterations. Ideally, the selected wavelength subsets should exhibit a high degree of consistency (reproducibility) since these ~250 points are determined based on their discriminatory power - particularly if the selection is not affected by spurious or chance correlations in the models. To quantify the reproducibility, we performed multiple iterations of the 1% case and computed the consistency as the ratio of overlapping spectral points (between selected wavelength subsets across 10 iterations) to the total number of selected spectral points (259). This was found to be equal to 0.72 ± 0.03 that implies over 70% of the wavelengths were common to all the subsets. Figure 3 provides a cumulative frequency plot illustrating the number of times a specific wavelength is retained in the selected subset. The consistency of wavelength selection is represented by the presence of greater structure, i.e. higher retention frequency of specific wavelengths and concomitant reduction of other uninformative regions. Figure S1 in Supporting Information shows an alternate representation of the frequency plot highlighting the sampling differences between the 1% and 10% wavelength selected data sets.

Collectively, we have demonstrated the efficacy and consistency of the wavelength selection approach in building classification models for HEM recognition. Despite the limited number of HEM classes used in this research, compared to what may be typically encountered in a defense facility, the variable selection approach by using GA is shown to provide very good compliance and high prediction accuracy in classification of the blind cases.

Conclusions

Given the large dimensionality of LIBS spectral data, it is critical to develop classifiers that endorse and rely on prior feature selection. Here we have exhibited the efficacy of a supervised as well as genetic algorithm based wavelength selection for discerning the maximally discriminatory features in the spectral data. While our previous work has shown the usefulness of employing multivariate chemometric approaches in segmentation problems using LIBS, this study underscores that the most prominent peaks in the acquired spectroscopic data are often not the parameters with the most diagnostic utility. In particular, our GA-based PLS-DA models demonstrate superior classification performance to corresponding algorithms that employ the full spectral data – highlighting the need of removing uninformative or spurious information that may occupy as much as 99% of the entire information collected. Importantly, we show the robustness (consistency) of the selected wavelength subset regardless of the exact identity of the HEM samples selected in the training set. Through our computational work on feature selection in the sparse LIBS dataset, we also envision architecting a compact, inexpensive diagnostic platform that operates without a conventional echelle spectrograph-ICCD combination while also optimizing the laser excitation wavelength to enable higher pulse energies and improved discrimination. A critical consideration in this regard is the possibility of acquiring lower resolution data and its influence on the quality of the classification model. In combination with our previous report of the better-than-expected performance of non-gated LIBS detection, this report greatly expands the feasibility of field applications while highlighting the untapped opportunities in exploiting computational approaches to make instrumentation innovations and advances. With minimal alterations, the approach described here can be equally exploited in quantitative compositional analysis of pharmaceutical samples, forensic specimen as well as in other recalcitrant process monitoring applications.

Additional Information

How to cite this article: Kumar Myakalwar, A. et al. Less is more: Avoiding the LIBS dimensionality curse through judicious feature selection for explosive detection. Sci. Rep. 5, 13169; doi: 10.1038/srep13169 (2015).