Near-infrared spectroscopy and machine learning-based technique to predict quality-related parameters in instant tea

The traditional method for analyzing the content of instant tea has disadvantages such as complicated operation and being time-consuming. In this study, a method for the rapid determination of instant tea components by near-infrared (NIR) spectroscopy was established and optimized. The NIR spectra of 118 instant tea samples were used to evaluate the modeling and prediction performance of a combination of binary particle swarm optimization (BPSO) with support vector regression (SVR), BPSO with partial least squares (PLS), and SVR and PLS without BPSO. Under optimal conditions, Rp for moisture, caffeine, tea polyphenols, and tea polysaccharides were 0.9678, 0.9757, 0.7569, and 0.8185, respectively. The values of SEP were less than 0.9302, and absolute values of Bias were less than 0.3667. These findings indicate that machine learning can be used to optimize the detection model of instant tea components based on NIR methods to improve prediction accuracy.


Scientific Reports
| (2022) 12:3833 | https://doi.org/10.1038/s41598-022-07652-z www.nature.com/scientificreports/ operators 20 . Therefore, optical spectroscopic techniques are increasingly used for the rapid, nondestructive assessment of food products 21 . Near-infrared (NIR) spectroscopy is particularly attractive for this purpose. The NIR spectral region is mainly the frequency-doubled and combined-frequency absorption regions of the hydrogencontaining group X-H (X being an element such as O, N, S, or C). Because various organic substances contain different groups, and various groups have different absorption wavelengths for NIR light in different chemical environments, the NIR spectrum can be utilized to perform qualitative and quantitative analyses in food component analysis 22 . Since recent years, NIR spectroscopy is being applied in the prediction of tea composition 23 . It has been used to discriminate the roast green tea from different origins, estimate the fermentation degree of Pu' er tea in processing, quantitatively determine the contents of total polyphenols, caffeine, and catechins in tea leaves, and classify special-grade green tea. However, studies performing nondestructive quantitative analysis of biochemical components in instant tea are scant 24 .
In the multivariate data analysis step, the partial least squares (PLS) model is the most widely used model for the quantitative analysis of NIR spectroscopy. Support vector regression (SVR) is also a crucial quantitative analysis algorithm. In recent years, metaheuristic algorithms have been widely adopted as global optimizer methods 25 . The particle swarm optimization (PSO) algorithm is one of these methods; it selects the feature subset, optimizes the model parameters, represents less overhead in operation, and has easier implementation and faster convergence during optimization than other metaheuristic algorithms 26 . Kennedy proposed PSO in 1995 and binary particle swarm optimization (BPSO) of discrete space in 1997 27 . Combined with other classification algorithms, BPSO can obtain improved results.
In this study, we used BPSO respectively with SVR (BPSO-SVR) and PLS (BPSO-PLS) to enhance the randomness of the mutation after the reset mechanism and to keep the particle active in continuous optimization. In addition, a fast experiment for determining moisture, caffeine, tea polyphenols, and tea polysaccharides in instant tea was carried out using different models. This study provides a reference for NIR spectroscopy combined with multivariate statistical analysis to determine food components.

Materials and instruments.
A total of 118 varieties of instant tea were provided by Yunnan Tasly Deepure Tea Group Co., Ltd (Yunnan, China). All methods were performed in accordance with the relevant guidelines and regulations. These instant tea is a kind of fine powder solid tea product, which is processed by extracting and drying the tea as raw material. A caffeine standard was purchased from China Institute for Food and Drug Control; acetonitrile and ethanol were purchased from Merck Co., Ltd; phenol and concentrated sulfuric acid were purchased from Chinese Medicines Holdings Co., Ltd; and glucose was purchased from Sigma-Aldrich Chemical Co., Ltd. Unless otherwise specified, all chemicals used were of analytical grade.
A U-3010 UV-Vis Hitachi spectrophotometer (Tokyo) was used to determine absorbance. An Agilent 1260 Infinity HPLC system was used to determine caffeine content. NIR spectrometry was carried out using a Thermo Fisher Antaris II (USA).
Determination of main components. The moisture content of the instant tea was determined according to ISO 7513:1990. The caffeine content was determined according to ISO 10727:2002 and the tea polyphenol content was determined according to ISO 14502-1:2005. The tea polysaccharide content was determined using a modified phenol-sulfuric acid method 19 .
Spectral data acquisition. NIR spectra were collected in reflectance mode. Each spectrum consisted of an average of 78 scans, in the range of 10,000-4000 cm −1 . Before scanning, the instrument was fully preheated for more than half an hour. Three spectra were collected from each sample, and the average spectrum of the three spectra was taken as the original analytical spectrum of that sample. In this study, the spectral pretreatment method used was the standard normal variate transformation (SNV) method. This method removes physical spectral information resulting from particle size.
Correction set sample division. The acquired spectral data and the reference chemical data were separated into two sets: a calibration set and a prediction set. It has been reported that the tenfold cross-validation method, also called Rotation Estimation, is a practical method to statistically cut the data sample into smaller subsets. The advantage of this method is making full use of small sample data sets 28 . In this study, tenfold crossvalidation was used to randomly select the prediction set, and the remaining samples were selected for the calibration set. In each execution, the model was trained using 90% of the data points and tested using the remaining 10%. Therefore, every data point was taken nine times for training and once for testing the model.
Chemometrics method. SVR. The SVR model is mainly used to realize linear regression by mapping spectral data to high-dimensional space and constructing a linear decision function in high-dimensional space 29 . A linear SVR classifier was trained based on the fitcsvm function in the Statistics and Machine Learning Toolbox™. Usually, a model selection procedure is required to determine the adjusting parameter C to improve the classification accuracy. Because the purpose of this study is to evaluate the search algorithm for spectral data selection, rather than the parameter selection for SVR classifiers, we adopted the default parameter value, i.e., C = 1.
PLS. PLS is an extensively used class of statistical methods that includes regression, classification, and dimension reduction techniques 30 . It uses latent variables, which are also called score vectors, to model the relationship BPSO. The BPSO algorithm transforms the trajectories from a continuous space into discrete space and maintains a swarm of particles and a global best solution simultaneously. In BPSO, each bit only takes a value of "0" or "1, " and the velocities that affect particle positions are transformed into [0, 1] and a stochastic construction process is added to confirm the locations 31 .
Model evaluation. The performance of the final models was evaluated according to the root mean square error of calibration (RMSEC) and the root mean square error of the verification set (RMSEP). The optimal model method was chosen based on RMSEC and RMSEP as the index which were lower and close to each other. At the same time, the correlation coefficient of validation set (Rp), bias-corrected standard error of prediction (SEP) and Bias were used as auxiliary reference indexes for model evaluation 32,33 . Software. Data processing and modeling analysis was carried out using MATLAB 2014a.

Results and discussion
Spectra investigation. The original NIR spectra of the 118 instant tea samples are shown in Fig. 1. These spectra can reflect the intrinsic quality of the instant tea samples. The instant tea samples were similar in their type and place of production, and their NIR spectra were understandably similar as well. The first frequencydoubling peak of the N-H bond stretching vibration was at 6860 cm −1 . The second frequency-doubling peak of the C=O stretching vibration was at 5180 cm −1 , and the combined-frequency peak of the primary amine and tertiary amine stretching vibration was at 4660 cm −1 (Fig. 1). The spectral characteristics depend on the sample composition and provide a theoretical basis for the rapid prediction of moisture, caffeine, tea polyphenols, and tea polysaccharide contents.

Classification of sample sets and distribution of measured values.
A quantitative analysis of moisture, caffeine, tea polyphenols, and tea polysaccharides was carried out on the 118 samples. The results in Table 1 show that the ranges of moisture, caffeine, tea polyphenols, and tea polysaccharide content in the samples are 3.76-6.29%, 0.48-2.80%, 18.70-22.40%, and 18.00-24.50%, respectively. The tenfold cross-validation method was used to randomly select the prediction set, and the remaining samples were selected for the calibration set.
Modeling Results. It has been reported that BPSO algorithm based on the traditional machine learning algorithm have a positive impact on the results of the model prediction 34 .The BPSO method was used to opti-  www.nature.com/scientificreports/ mize the parameter combination, obtain the best tenfold cross-validation accuracy, and establish the model with the strongest prediction ability. In the BPSO process, the relevant parameters were set as follows: the swarm size was 20, the learning factors C 1 and C 2 were 2, the maximum evolutionary algebra was 100, and the weight parameter |V max | = 6. The results in Table 2 show that the SVR, BPSO-SVR, PLS, and BPSO-PLS models could predict the moisture, caffeine, tea polyphenols, and tea polysaccharides of instant tea. The results showed that RMSEC and RMSEP presented a lower value by BPSO algorithm than those by SVR and PLS alone, which indicate that the addition of BPSO algorithm can improve the accuracy of model prediction. In addition, based on RMSEC and RMSEP, most of algorithm values between the calibration set and the prediction set in BPSO-PLS model were lower than those in BPSO-SVR model, and the range of SEP and Bias values were reasonable, which showed that BPSO-PLS model was stable. Figure 2 shows the convergence curve of the BPSO algorithm with the best results during the 100 runs. The model shows a large fluctuation at the beginning of the iteration, after which it decays with a small trend.
The BPSO-PLS model showed the most stable comprehensive performance and the most accurate prediction results for moisture. The values obtained for R c , RMSEC, R p , and RMSEP were 0.9983, 0.4128, 0.9678, and 0.6293, respectively. Comparing the four models and convergence curves for caffeine, the BPSO-PLS model had the most stable comprehensive performance and the most accurate prediction results; the R c , RMSEC, R p , and RMSEP were 0.9981, 0.4145, 0.9757, and 0.5114, respectively. For tea polyphenols, using the BPSO feature selection algorithm, R showed a significant improvement.The R c , RMSEC, R p , and RMSEP were 0.9960, 0.8191, 0.7569, and 2.1082, respectively. For tea polysaccharides models, using the BPSO feature selection algorithm, R also showed a significant improvement. The BPSO-PLS model had the most stable comprehensive performance and the most accurate prediction results for tea polysaccharides, and the R c , RMSEC, R p , and RMSEP were 0.9954, 2.0187, 0.8185, and 4.3109, respectively.
A spectral range was set, such that if it was selected more than 50 times, then this range was the final selected result. The process of selecting the wavenumber of the four components, resulting from 100 iterations, is shown in Fig. 3. We divided the spectral range into 20 segments with the same width of 311, and the last segment with a width of 319. From the wavenumber, we found that the spectral ranges selected for moisture and caffeine were relatively concentrated, while those for tea polyphenols and tea polysaccharides were relatively scattered. Table 3 shows the results of selecting the wavenumber of the four components: moisture, caffeine, tea polyphenols, and tea polysaccharides. The characteristic bands of water in instant tea were mainly concentrated in the two wavebands of 6694-7293 and 7892-8193 cm −1 . The first-order frequency doubling of O-H stretching vibration in pure water is about 7143 cm −1 , and the combined frequency absorption was 8197 cm −1,35 . The characteristic band of the moisture in instant tea associated with compounds containing O-H group through hydrogen bond in various forms, and further make a shift of the absorption peak in the direction of long and short wavelengths. The characteristic bands of caffeine were concentrated in 4000-4894 and 6994-7293 cm −1 . Near 4610 cm −1 was the combined frequency peak of stretching vibration from primary amine and tertiary amine 36    The scatter plots of 4 components between actual and NIR predicted values were shown in Fig. 4. It is well known that scatter plots present the relationship between two variables in two-dimensional coordinates, which can be used to evaluate the predictive ability of the model. In this study, the scatter points of moisture and caffeine between actual and predicted NIR values were concentrated and close to the diagonal. few scatter points of tea polyphenols and tea polysaccharides are relatively departed from the diagonal due to the complex structure. Taken together, the results indicate that the model exhibits a high prediction accuracy.

Conclusions
In this study, a rapid NIR method to estimate the moisture, caffeine, tea polyphenols, and tea polysaccharide contents of instant tea was developed using different model calibrations. The tenfold cross-validation method was used to randomly select the prediction set, and BPSO was employed as the optimization algorithm for SVR and PLS. The results show that the R p is above 0.9 for moisture and caffeine, and the R p is approximately 0.8 for tea polyphenols and tea polysaccharides. Therefore, these models exhibited high precision and accuracy. This approach provides, for the first time, a fast, specific, and easily automatable method for the quantitative detection of moisture, caffeine, tea polyphenols, and tea polysaccharides in instant tea samples. This will enable the   Orange dots present caffeine, green dots present moisture, blue dots present tea polyphenols, and purple dots present tea polysaccharides.