High-level Fusion Coupled with Mahalanobis Distance Weighted (MDW) Method for Multivariate Calibration

Near-infrared (NIR) spectroscopy is a widespread detection method with a high signal-to-noise ratio (SNR), but its models are hard to interpret because the spectral features overlap. Mid-infrared (MIR) spectroscopy, by contrast, shows more chemical features and gives a better explanation of the model, yet it suffers from a low SNR. To develop a model that offers both rich spectral characteristics and a higher SNR, NIR and MIR are combined through a high-level fusion strategy for quantitative analysis. A novel chemometric method, named Mahalanobis distance weighted (MDW), is proposed to integrate the NIR and MIR techniques. The Mahalanobis distance (MD), based on the principle of spectral similarity, is used to calculate the weight of each sample: the weight is inversely proportional to the corresponding MD. The proposed MDW method is applied to NIR and MIR spectra of the active ingredients in deltamethrin and emamectin benzoate formulations for quantitative analysis. The results show that the MDW method is promising, with a noticeable improvement in predictive performance over the individual methods when high-level fusion is used for quantitative analysis.

methods are mainly focused on classification issues [22][23][24][25][26][27] and have been less explored for quantitative analysis. In this study, we propose the Mahalanobis distance weighted (MDW) method to apply high-level fusion to quantitative analysis.
Mahalanobis distance (MD) was first proposed by P. C. Mahalanobis in 1938 [28]. The use of MD in a mathematical algorithm for chemical identity classification was described by Mark and Tunnell [29,30]. Based on the principle of spectral similarity, the MD expresses the distance of a sample from the initial calibration set. Theoretically, MD is regarded as a measure of standard deviation, and samples are expected to lie within three times the MD of their respective group means [29]. A sample with a large MD value is farther from the calibration set in multidimensional space; it is dissimilar to the initial calibration set, is likely to be predicted relatively poorly, and might decrease the robustness and predictability of the model. Conversely, a sample with a small MD value is close to the calibration set in multidimensional space; it is similar to the calibration set, is expected to be predicted well, and might increase the predictive ability when it is incorporated into the model.
The MDW method is built on the MD value. Three times the group mean of the MD is treated as the criterion. If, for a specific method, a sample lies within the criterion, its weight is proportional to the reciprocal of its MD value; if the MD exceeds the criterion, the weight is set to zero. Since one sample has one MD value per technique, several MD values, and therefore several weights, are obtained for each sample when the fusion strategy is executed. The weights of each sample are normalized so that they sum to 1.
The principle of high-level fusion is to integrate the individual results into a final output by assigning each sensor a rational weight. In view of the above, a weighting method based on MD is put forward to fuse the techniques by taking all the individual results into account. In this study, the MDW method is applied to individual NIR and MIR spectroscopic analyses for active ingredient determination in deltamethrin and emamectin benzoate formulations. The performance of the three methods (individual NIR, individual MIR, MDW fusion) on the deltamethrin and emamectin benzoate formulations is evaluated by the predictive ability of the models.

Theory and algorithms
High-level fusion. In high-level fusion, the NIR and MIR matrices are each used to develop an individual partial least squares (PLS) model, with five-fold cross-validation. The fusion results are then calculated by combining the outputs of the two individual models. The main advantage of high-level fusion is that it makes comprehensive use of all the individual methods rather than relying on only one. As a result, more accurate and credible results are obtained when reasonable weights are assigned to each sample for each method.
In high-level fusion, a separate model is calculated from each data source, and the results are combined to reach the final declaration [16]. The final result is obtained by combining the results of each method through their weights. The output for each sample can be represented by Eq. 1:

$$y_p = \sum_{i=1}^{L} w_i \, y_i \qquad (1)$$

where $y_p$ is the fusion result, $L$ is the number of sensors, and $w_i$ and $y_i$ are the weight and the predicted result for the $i$th sensor, respectively.
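As a minimal sketch of the weighted sum in Eq. 1, the following Python function fuses the per-sensor predictions for one sample. The function name `fuse_predictions`, the check that the weights sum to 1, and the example numbers are illustrative assumptions and not the authors' Matlab code.

```python
from math import isclose


def fuse_predictions(predictions, weights):
    """Return y_p = sum_i w_i * y_i for one sample.

    predictions: predicted values y_i, one per sensor (e.g. NIR, MIR).
    weights: weights w_i, one per sensor, expected to sum to 1.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    if not isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    return sum(w * y for w, y in zip(weights, predictions))


# Hypothetical sample: NIR predicts 2.40 % and MIR predicts 2.60 % (w/w).
y_p = fuse_predictions([2.40, 2.60], [0.6, 0.4])
```

With these made-up inputs the fused value lies between the two individual predictions, closer to the one that carries the larger weight.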

Mahalanobis distance (MD).
The MD value is calculated on the basis of the distribution of the original samples [29]. The sample with the largest MD value is the one that differs most from the initial set. The MD of an observation $x$ from a set of observations $X_i$ (the calibration set corresponding to sensor $i$) is defined, in squared units, by Eq. 2:

$$MD^2 = (x - \bar{X}_i)\, M_i^{-1}\, (x - \bar{X}_i)^{T} \qquad (2)$$

where $\bar{X}_i$ and $M_i$ are the mean and the covariance matrix of $X_i$, respectively.
To reduce the dimensionality and eliminate overlapping information, the score matrix (S) is used to calculate the MD, which avoids collinearity problems in the matrix M. Note that the score matrix $S_i$ used to compute the MD values is the one obtained from the calibration set of $X_i$, not from all the samples. Before the MD is calculated, the optimal dimension of the PLS score matrix is determined to obtain $S_i$.
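The MD of a score vector from the calibration scores can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: the scores are taken as already computed (no PLS is fitted here), the covariance uses the n - 1 denominator, the function returns the square root of the squared distance, and all function names and the toy score matrix are hypothetical.

```python
from math import sqrt


def column_means(S):
    """Mean of each column of S (rows are samples)."""
    n = len(S)
    return [sum(row[j] for row in S) / n for j in range(len(S[0]))]


def covariance(S):
    """Sample covariance matrix (n - 1 denominator) of the columns of S."""
    n, k = len(S), len(S[0])
    m = column_means(S)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in S) / (n - 1)
             for b in range(k)] for a in range(k)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, k):
            f = aug[r][c] / aug[c][c]
            for j in range(c, k + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, k))
        z[i] = (aug[i][k] - tail) / aug[i][i]
    return z


def mahalanobis(s, S_cal):
    """MD of score vector s from the calibration score matrix S_cal."""
    m = column_means(S_cal)
    d = [si - mi for si, mi in zip(s, m)]
    z = solve(covariance(S_cal), d)  # z = M^-1 (s - mean)
    return sqrt(sum(di * zi for di, zi in zip(d, z)))


# Toy calibration scores with two latent variables (made-up numbers).
S_cal = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
md_far = mahalanobis([3.0, 1.0], S_cal)
md_centre = mahalanobis([1.0, 1.0], S_cal)
```

Solving the linear system instead of inverting the covariance matrix explicitly is a design choice made here for numerical convenience; working in the low-dimensional score space keeps that matrix small and well conditioned.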

Mahalanobis distance weighted (MDW).
The weights are assigned to each sample according to the MDW method. For each individual method, the score matrix is used to calculate the MD, so a sample receives several weights, one per detection method. The flowchart of high-level fusion combined with the MDW method is illustrated in Fig. 1. First, the PLS model is fitted for each data set and the score matrices are obtained. Next, the distance is calculated for the samples in the calibration set. Finally, the weight and the threshold are computed. As shown in the schematic framework, each sample is assigned different weights by the MDW method. The procedure can be summarized in the following steps:
(1) A data matrix contains n samples in rows and p variables in columns.
(2) Carry out PLS; the data are mean-centered before the regression model is established.
In the following sections, the characteristics and behavior of the MDW method are discussed in detail.
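The weighting rule described for the MDW method (threshold at three times the mean MD, reciprocal weighting inside the threshold, zero weight outside it, weights summing to 1) can be sketched as follows. The function names, the `eps` guard against a zero MD, and the decision to raise an error when every method exceeds its threshold are assumptions made for this illustration; the source does not say how those cases are handled.

```python
def md_threshold(calibration_mds, factor=3.0):
    """Criterion: `factor` times the mean MD of the calibration samples."""
    return factor * sum(calibration_mds) / len(calibration_mds)


def mdw_weights(md_values, thresholds, eps=1e-12):
    """Normalized weights for one sample, one entry per method.

    md_values: MD of the sample under each method (e.g. [MD_NIR, MD_MIR]).
    thresholds: criterion of each method, in the same order.
    eps: assumed guard so that an MD of exactly zero does not divide by zero.
    """
    raw = [0.0 if md > limit else 1.0 / max(md, eps)
           for md, limit in zip(md_values, thresholds)]
    total = sum(raw)
    if total == 0.0:
        # Assumption: no fused prediction is defined in this case.
        raise ValueError("sample exceeds the criterion for every method")
    return [r / total for r in raw]


# Hypothetical calibration MDs and one validation sample.
limit = md_threshold([1.0, 2.0, 3.0])
w_inside = mdw_weights([2.0, 3.0], [limit, limit])
w_outside = mdw_weights([2.0, 7.0], [limit, limit])
```

In the second call the MD under the second method exceeds the threshold, so that method receives zero weight and the first method carries the whole prediction.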

Model evaluation.
The results of the two individual methods are optimized on the calibration set and then evaluated by the root mean square error of prediction (RMSEP). In the calibration set, the optimal number of latent variables (LVs) is determined by five-fold cross-validation. An optimal model has a low RMSEP and a low bias. Models are ranked on these two parameters as follows: small RMSEP with small bias > small RMSEP with large bias > large RMSEP with small bias > large RMSEP with large bias.
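The two evaluation indicators can be computed as in the short sketch below. The source does not give the formulas, so the usual definitions are assumed: RMSEP as the root of the mean squared prediction error, and bias as the mean of predicted minus reference (the sign convention is an assumption). The function names and the example values are hypothetical.

```python
from math import sqrt


def rmsep(predicted, reference):
    """Root mean square error of prediction over the validation samples."""
    n = len(reference)
    return sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)


def bias(predicted, reference):
    """Mean signed error; assumed convention is predicted minus reference."""
    n = len(reference)
    return sum(p - r for p, r in zip(predicted, reference)) / n


# Made-up validation values in % (w/w).
reference = [1.0, 2.0, 3.0]
predicted = [1.1, 2.0, 2.9]
error = rmsep(predicted, reference)
offset = bias(predicted, reference)
```

In this example the positive and negative errors cancel, so the bias is close to zero while the RMSEP is not, which is why the two indicators are reported together.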
Software. The algorithms involved in this study were implemented in MATLAB (version R2016a, The MathWorks, Inc.). The scripts used in this study are available upon request.

Data description
Deltamethrin samples. Seventy-eight deltamethrin samples were prepared from technical deltamethrin (98.1%, Jiangsu Huangma Agrochemicals, China), dimethylbenzene (99.0%, Beijing Chemical Works, China) and a commercial deltamethrin emulsion formulation (25 g/L, Bayer Crop Science, China). The exact concentration of deltamethrin in the commercial formulation was determined by high-performance liquid chromatography (HPLC). The concentrations of the samples ranged from 0.1% to 4.98% (w/w). Details of the samples were shown in Table 1.
Near-infrared (NIR) spectroscopy. The NIR spectra were acquired with an FT-NIR spectrometer (Spectrum One NTS, Perkin Elmer, USA) from 4000 cm⁻¹ to 12500 cm⁻¹ at a resolution of 4 cm⁻¹; each spectrum was the mean of 64 accumulations.
Mid-infrared (MIR) spectroscopy. The MIR spectra were collected with an FT-IR spectrometer (Cary 630, Agilent, USA) equipped with an ATR accessory. The spectral range was 650 cm⁻¹ to 4000 cm⁻¹ at a resolution of 4 cm⁻¹, with 64 accumulations co-added.

Results and discussion
Influence of Mahalanobis distance (MD) on high-level fusion. The physical meaning of MD is the normalized Euclidean distance (ED) in principal component space; MD denotes the distance between a sample and the calibration set in multidimensional space. Three times the group mean MD of the training distribution was taken as the threshold. A sample with a small MD value was regarded as consistent with the training distribution and was expected to give an acceptable result, whereas a sample with a large MD value was expected, in theory, to give a worse prediction. To determine the weight of each sample, two cases were considered: within or beyond the criterion. If the MD value exceeded the threshold, the weight was set to 0%, since an abnormal sample would dramatically reduce the overall predictability. If the MD value was within the threshold, the weight was set between 100% and 0% according to the reciprocal of the MD. In other words, within the criterion range the weight decreased as the MD increased, in proportion to the reciprocal of the MD.
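As a worked illustration of the reciprocal rule for two methods, assume that both MD values lie within their thresholds and that the weights are the normalized reciprocals of the MD values. The subscripted symbols and the numerical values below are illustrative and are not taken from Table 3.

```latex
\[
w_{\mathrm{NIR}}
  = \frac{1/MD_{\mathrm{NIR}}}{1/MD_{\mathrm{NIR}} + 1/MD_{\mathrm{MIR}}}
  = \frac{MD_{\mathrm{MIR}}}{MD_{\mathrm{NIR}} + MD_{\mathrm{MIR}}},
\qquad
w_{\mathrm{MIR}}
  = \frac{MD_{\mathrm{NIR}}}{MD_{\mathrm{NIR}} + MD_{\mathrm{MIR}}}
\]
% Hypothetical values: MD_NIR = 1.5, MD_MIR = 4.5
\[
w_{\mathrm{NIR}} = \frac{4.5}{1.5 + 4.5} = 0.75,
\qquad
w_{\mathrm{MIR}} = \frac{1.5}{1.5 + 4.5} = 0.25
\]
```

Under this reading, each method's weight equals the other method's share of the summed MD, so the method whose calibration set is closer to the sample dominates the fused prediction.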
Spectra of deltamethrin and emamectin benzoate formulations. The NIR and MIR spectra of the deltamethrin and emamectin benzoate formulations were presented in Fig. 2. According to the NIR spectra of the deltamethrin formulation (Fig. 2a), the spectral features of deltamethrin at 4300 cm⁻¹ and 4600 cm⁻¹ were combinations of the stretching and bending vibrations of C-H and N-H, respectively. The band at 5900 cm⁻¹ was ascribed to the first overtone of the C-H stretching vibration. In the MIR spectra of deltamethrin (Fig. 2b), the characteristic peak at 700 cm⁻¹ was associated with the bending vibration of C-H in the three-membered ring. The peak at 750 cm⁻¹ corresponded to the deformation vibrations of C-H in the benzene ring. The peak at 1123 cm⁻¹ was ascribed to the stretching vibration of C-O. The characteristic peak at 1500 cm⁻¹ was associated with the stretching vibration of C-H in the three-membered ring. The peak at 1610 cm⁻¹, in the 1600-1650 cm⁻¹ region, was assigned to the stretching vibration of C=C. The peaks at 1720 cm⁻¹ and 1740 cm⁻¹ were ascribed to the stretching vibration of C=O. The peak at 3000 cm⁻¹ corresponded to the stretching vibrations of C-H in the benzene ring.
In Fig. 2c (the NIR spectra of emamectin benzoate), the bands around 4400 cm⁻¹ and 5200 cm⁻¹ were combinations of the stretching and bending vibrations of C-H and O-H, respectively, and the band at 6800 cm⁻¹ was ascribed to the first overtone of the N-H stretching vibration. In the MIR spectra of emamectin benzoate (Fig. 2d), the peak around 1020 cm⁻¹ was ascribed to the stretching vibration of C-O, the peak at 1640 cm⁻¹ to the stretching vibration of C=C, and the wide peak near 3300 cm⁻¹ to the stretching vibration of O-H.
Deltamethrin formulation data. A Monte-Carlo (MC) outlier detection approach was run 1000 times, and no sample was removed as an outlier. The samples were then split into a calibration set (63 samples) and a validation set (15 samples) according to concentration, from high to low. The spectra were pre-treated by autoscaling so that each column had a mean of 0 and a variance of 1. Autoscaling divides the mean-centered spectra by the standard deviation spectrum. As a standard data treatment in factor analysis, it gives all variables a chance to be treated fairly regardless of their absolute magnitude. As a result, all the wavelength variables carried the same weight after autoscaling.
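Autoscaling as described above can be sketched in plain Python as follows. Two choices here are assumptions rather than details given in the source: the standard deviation uses the n - 1 denominator, and the means and standard deviations are estimated on the calibration spectra and then reused for other samples. The function names and the toy matrix are hypothetical.

```python
from math import sqrt


def autoscale_fit(X):
    """Column means and sample standard deviations (rows are spectra)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    stds = [sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
            for j in range(p)]
    return means, stds


def autoscale_apply(X, means, stds):
    """Mean-center each column and divide by its standard deviation.

    Constant columns (std == 0) are only mean-centered, an assumed guard
    against division by zero.
    """
    return [[(x - m) / s if s > 0 else x - m
             for x, m, s in zip(row, means, stds)] for row in X]


# Toy "spectra": three samples, two wavelength variables on different scales.
X_cal = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = autoscale_fit(X_cal)
Z_cal = autoscale_apply(X_cal, means, stds)
```

After scaling, the two columns of the toy matrix become identical even though the raw values differ by a factor of ten, which illustrates how every wavelength variable ends up with the same weight.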
Five-fold cross-validation was used to explore the predictive performance of each approach. The number of LVs is a critical parameter in the calibration stage; ten and seven LVs were chosen for the NIR and MIR data sets, respectively, when the PLS algorithm was run. The plots of estimated versus specified values in the validation set were shown in Fig. 3a,b, and the predicted results for each technique were summarized in Table 2. Since RMSEP and bias were used as the model evaluation indicators, a better predictive ability corresponded to a small RMSEP and a small bias. On the whole, the NIR method performed better than the MIR method, with a smaller RMSEP and a smaller bias.
The deltamethrin data matrix was used to investigate the effectiveness of the MDW method for high-level fusion in the following steps. First, PLS models were built for the NIR and MIR data. Then the score matrix from each method was used to calculate the MD and the threshold. Finally, the weight was obtained following the principles defined in Section 4.1. Based on this procedure, the MDW method was used to carry out high-level fusion for quantitative analysis. The thresholds for the two methods, which were the standard for assigning weights, were displayed in Table 3. As Table 3 showed, no sample was beyond its threshold. Since all the samples were within the threshold, the weights were inversely related to the corresponding MD values. Figure 4a displayed the MD values of the 15 deltamethrin validation samples for the NIR and MIR methods. The left y axis indicated the estimated value of the validation set, and the right y axis represented the color bar of MD. As shown in Fig. 4a and Table 3, the MD values were represented by colors; the color bar varied from dark blue to yellow, for MD values ranging from 1.25 to 3.76 (the minimum and maximum MD values from the NIR and MIR methods). For example, a sample shown in yellow had a large MD value and was therefore farther from the calibration set, so the corresponding method took up a smaller proportion. Figure 4b showed the weights of the deltamethrin samples, that is, the proportions taken by the NIR and MIR methods. For each sample, the weights of the different methods, normalized from the MD values, summed to 1. In Fig. 4b, the weights for the different methods were filled with different colors: the blue and yellow portions were the weights of NIR and MIR, respectively.
The fusion plot of estimated versus specified values in the validation set was displayed in Fig. 3c, with the weights shown in Fig. 4b. As seen in Fig. 4b, each sample was allotted two weights, one for the NIR method and one for the MIR method. In some cases, one method might predict well while the other might not. In such circumstances, the fusion strategy would improve the predictability.
It was concluded from Table 2 that the MDW method gave a lower RMSEP (0.1016%, with a bias of 0.0195%) than the individual NIR (RMSEP of 0.1164%, bias of −0.0120%) and MIR (RMSEP of 0.1294%, bias of 0.0403%) methods. The MDW method thus yielded more satisfactory results than the individual technologies, indicating that it was able to generate more accurate results for the deltamethrin pesticide. It was therefore worthwhile to carry out the MDW fusion strategy for quantitative analysis of the deltamethrin data set. As a case-by-case approach that evaluates the weight of each method for each sample, the MDW method was feasible for high-level fusion in quantitative analysis.
Table 3. The MD of the validation sets of the deltamethrin and emamectin benzoate data sets from the NIR and MIR methods. Threshold represents three times the group mean value.
Emamectin benzoate formulation data. The MC outlier approach was carried out on the emamectin benzoate data set, and no sample was identified as an outlier. Subsequently, the sixty samples were divided into a calibration set (45 samples) and a validation set (15 samples) according to concentration. The emamectin benzoate data set was pre-processed by autoscaling before the model was developed. After PLS with five-fold cross-validation, ten and four LVs were chosen for the individual NIR and MIR spectra, respectively. The plots of estimated versus specified values in the validation set were shown in Fig. 5a,b. The predicted results for the individual techniques were summarized in Table 2. As outlined in Table 2, MIR (RMSEP = 0.0765%, bias = 0.0114%) gave a better predictive performance than NIR (RMSEP = 0.1047%, bias = 0.0601%).
Based on the fusion procedure, the MDW method was used to apply high-level fusion to quantitative analysis of the emamectin benzoate data set. The criteria for each method were displayed in Table 3. Table 3 showed that all the samples were within the threshold, so the weights were inversely proportional to the corresponding MD values. Figure 4c showed the MD values of the 15 samples for the different methods. As could be seen in Fig. 4c and Table 3, the MD values (0.883 to 8.283) were represented by colors, with the color bar varying from dark blue to yellow. A dark blue sample, for instance, had a comparatively small MD, which meant that the associated method played an important role in the fusion. Figure 4d showed the sample weights, in which the blue and yellow portions were the weights of NIR and MIR, respectively. Each sample was allocated two weights according to its MD values.
The plot of estimated versus specified values in the validation set was shown in Fig. 5c. The MDW method (RMSEP of 0.0596%, bias of 0.0205%) achieved a lower RMSEP than the individual NIR (RMSEP of 0.1047%, bias of 0.0601%) and MIR (RMSEP of 0.0765%, bias of 0.0114%) methods. In summary, the MDW method improved the predictive ability for the emamectin benzoate data set, which showed that it was effective in combining the individual methods for fusion.

Conclusion
In this study, the proposed MDW method was successfully applied to NIR and MIR spectroscopic analysis for the rapid determination of pesticide active ingredients in deltamethrin and emamectin benzoate formulations. The MDW method performed better than the individual NIR and MIR methods, mainly because the fusion took comprehensive advantage of the merits of each method. Overall, the results indicate that the MDW method can improve the predictive ability of the model and can be used successfully for fusion.