Raman Spectroscopy Enables Non-Invasive Identification of Peanut Genotypes and Value-Added Traits

Identification of specific genotypes can be accomplished by visual recognition of their distinct phenotypical appearance, as well as by DNA analysis. Visual identification (ID) of species is subjective and usually requires substantial taxonomic expertise. Genotyping and sequencing are destructive, time-consuming and labor-intensive. In this study, we investigate the potential use of Raman spectroscopy (RS) as a label-free, non-invasive and non-destructive analytical technique for the fast and accurate identification of peanut genotypes. We show that chemometric analysis of peanut leaflet spectra provides accurate identification of different varieties. The same analysis can be used for prediction of nematode resistance and oleic-linoleic oil (O/L) ratio. Raman-based analysis of seeds provides accurate genotype identification in 95% of samples. Additionally, we present data on the identification of carbohydrates, proteins, fiber and other nutrients obtained from spectroscopic signatures of peanut seeds. These results demonstrate that RS allows for fast, accurate and non-invasive screening and selection of plants, which can be used for precision breeding.

We employed orthogonal partial least squares discriminant analysis (OPLS-DA) to determine whether RS can be used for the quantitative identification of peanut genotypes based on spectroscopic signatures collected from their leaflets. The loading plot and misclassification table were then generated using this final model, which contained 9 predictive components (PCs), one orthogonal component and 1021 original wavenumbers (690-1710 cm⁻¹) for standard normal variate (SNV) pre-processed first-derivative spectra. The nine PCs explained a total of 43% of the variation between the classes (Fig. S1).
The model identified the carotenoid band at 1525 cm⁻¹ (PC1), the phenylpropanoid bands at 1606 and 1632 cm⁻¹ (PC1), the cellulose band at 1155 cm⁻¹ (PC1), the xylan band at 1218 cm⁻¹ (PC2), the hydrocarbon bond vibration at 1443 cm⁻¹ (PC2), as well as the cellulose and phenylpropanoid band at 1185 cm⁻¹ (PC3) as the strongest predictors of species. The model also explained 47% of the variation (R2X) in the spectra. Our results indicated that RS can be used for, on average, 80% accurate identification of peanut plants based on their spectroscopic signatures from leaflets (Table 2). However, some varieties had markedly lower accuracy of classification than others. This may be associated with the growth stage of the plants.
Resistance that imparts almost total immunity to root-knot nematode (Meloidogyne arenaria) was previously identified in wild peanut species 9. Several highly resistant cultivars have been released from the resulting gene introgression [10][11][12]. Root-knot nematode resistance is associated with a failure of juveniles to establish a feeding site 13. We examined whether RS can be used to probe for resistance in peanut cultivars. Although the average spectroscopic signatures of nematode-resistant and susceptible plants are very similar (Fig. 3A), partial least squares discriminant analysis (PLS-DA) allowed for, on average, 75% accurate identification of these two groups (Table 3). From the loadings plot (Fig. S2), we observed that changes associated with the carotenoid (1526 cm⁻¹), phenylpropanoid (1606 cm⁻¹) and 1155 cm⁻¹ bands were the most important for this prediction. Further biochemical analysis is required to determine what the relationship between these compounds in the leaves and nematode resistance may be.
Next, we employed RS to distinguish between plants that were known to have a high or a normal oleic-linoleic (O/L) ratio. High-oleic peanuts are preferred by manufacturers because they have a longer shelf life, which leads to reduced rancidity, and have been shown to be associated with reduced serum cholesterol levels and reductions in cardiovascular disease 14 18. We collected more than 300 spectra from peanut varieties and breeding lines with either high (23 breeding lines and 1 released cultivar) or low (4 released cultivars) O/L ratios. RS revealed that high O/L plants have lower phenylpropanoid content (1606 and 1632 cm⁻¹), whereas all other structural components of these two groups of plants appeared to be nearly identical (Fig. 3B). PLS-DA of these spectra allowed for identification of high vs normal O/L plants with an average accuracy of 82% (Table 4). Inspecting the loadings plot (Fig. S3), we found that the important bands for O/L prediction were very similar to those for nematode resistance prediction, as the plot patterns closely resemble each other. However, the phenylpropanoid bands are associated with the first latent variable (LV) in the O/L model, whereas they appear in the third LV in the nematode resistance model. As each LV explains successively less variation in a dataset, this may suggest that phenylpropanoid content is more important for prediction of O/L content than for nematode resistance. These results demonstrate that RS can be used to assist plant breeders by allowing for fast screening of genetic traits.

Seed-based identification of peanuts and assessment of their nutrient values.
We also investigated whether RS can be used to positively identify peanut seeds. We collected 342 spectra from 5 wild (assigned K through O) and 5 cultivated (assigned P through T) genotypes of peanuts (Fig. 4). The Raman spectra of peanut seeds exhibited vibrational bands that could be assigned to pectin (849 cm⁻¹), carbohydrates (1080, 1119, 1301, 1339 cm⁻¹), and phenylpropanoids (1611 cm⁻¹) (Fig. 5). We also observed vibrational bands at 969 and 1446 cm⁻¹, which can be assigned to CH₂ and CH₂/CH₃ vibrations, respectively. The vibrational band at 1005 cm⁻¹ could potentially be assigned to both carotenoids and proteins. However, in addition to the vibrational band at ~1000 cm⁻¹, carotenoids have a distinct vibrational band around 1530 cm⁻¹, which was not observed in the Raman spectra of peanut seeds. Therefore, we can unambiguously assign the vibrational band at 1005 cm⁻¹ to proteins. The vibrational band at 1656 cm⁻¹ can also be assigned to proteins (amide I). However, this band may also originate from -C=C- vibrations of unsaturated fatty acids. The -C=C- group also exhibits a vibrational mode at 1265 cm⁻¹, which was observed in the Raman spectra of peanut seeds 19. Since the intensities of the 1265 and 1656 cm⁻¹ bands change synchronously from one genotype to another, it is highly likely that both should be assigned to the same chemical moiety. Therefore, the vibrational band at 1265 cm⁻¹ has been assigned to the alkene group of unsaturated fatty acids. Finally, the vibrational band at 1748 cm⁻¹ can be assigned to esters.
Next, we investigated whether these Raman spectra of peanut seeds can be used to probe nutrient composition. The intensities of vibrational bands in Raman spectra correlate directly with the concentration of the corresponding chemical in the sample. Therefore, we can use the intensities of the 1005, 1301, 1443, and 1607 cm⁻¹ vibrational bands to probe the relative content of proteins, carbohydrates, oils and fiber, respectively, as well as the vibrational bands at 1656 and 1748 cm⁻¹ to predict the amount of unsaturated fatty acids and esters in peanut seeds. Previously, we showed that overall spectral intensity depends directly on the color of the sample. For instance, in our study on the nutrient content of corn kernels, spectra collected from dark-colored corn kernels had much lower intensity compared to the Raman spectra collected from lighter-colored kernels 20. Therefore, to exclude the influence of seed color on the intensities of vibrational bands, spectra were normalized to the total spectral area, as this method is the least biased.
We used ANOVA to determine whether the previously analyzed differences in nutrient-associated bands were statistically significant (Fig. 6). In general, we found that the wild varieties tended to have wider confidence intervals for the true mean intensity of a given band of interest compared to the cultivated varieties. This is logical, as the cultivated varieties have been bred for specific traits of interest. At the protein band (1005 cm⁻¹; Fig. 6, top left), ANOVA revealed that varieties L, O and Q have significantly higher protein band intensity than R, S and T. Additionally, variety N has significantly higher intensity than variety T. Finally, variety Q is significantly more intense than all varieties except L, N and O. Variety Q may have been selectively bred for higher protein content compared to the other varieties, while within the wild varieties, none of the groups significantly differ from each other.
The carbohydrate band varied as well (Fig. 6, top right). Varieties M, N and P are significantly more intense than all other cultivated varieties. The remaining cultivated varieties (Q-T) do not have significantly different intensities from each other. Variety O is significantly less intense than varieties K, M, N and P. The general trends observed for protein remain true for carbohydrates: the cultivated varieties have narrower confidence intervals, which is probably due to their being selected for traits associated with carbohydrate buildup.
In terms of oil content (Fig. 6, middle left), varieties L, M, P and Q have significantly higher intensity than variety T but are not different from each other. Additionally, varieties P and Q have significantly higher intensity than S and T, which suggests that these varieties may have been bred for higher oil content. All other varieties are not significantly different from each other. In fiber content (Fig. 6, middle right), the centers of the wild variety confidence intervals were all lower than the centers of the cultivated varieties. However, only varieties P, R and T have significantly higher fiber-associated intensity than the wild varieties. Variety P is significantly more intense than all other varieties analyzed, a grouping which only appears once in all bands analyzed.
Additionally, we discovered that for unsaturated fatty acids (Fig. 6, bottom left), varieties P and Q are significantly less intense than all other varieties. Varieties K, N, S and T are significantly more intense than variety R, and variety K is significantly more intense than all other varieties except N and S. Finally, in esters (Fig. 6, bottom right), we found that while only M and N are significantly different in terms of within-category (wild and cultivated) differences, variety M is significantly more intense than all cultivated varieties. In fact, the centers of the …
We then used OPLS-DA to demonstrate that RS can be used for identification of peanut genotypes based on the spectra collected from their seeds. The loading plot (Fig. S4) and misclassification table (Table 5) were then generated using this final model, which contained 9 predictive components, one orthogonal component and 1100 original wavenumbers (701-1800 cm⁻¹) for SNV pre-processed first-derivative spectra. The nine predictive components (PCs) explained a total of 55% of the variation between the classes.
Our results demonstrated that RS can be used for highly accurate (95%) identification of peanut seeds. This accuracy is much higher than that achieved for leaves. This can be explained by the very low, if any, metabolism occurring in seeds as compared to actively growing plants.
Figure 6. Means (circles) and 95% confidence intervals for the intensities of the peanut seed spectra, normalized to total spectral area, at the indicated wavenumbers. Generated following the ANOVA tests. Blue and solid: wild variety; red and dashed: cultivated variety.

Conclusions
Our results demonstrate that RS can be used for accurate identification of peanuts based on spectroscopic signatures of their leaves and seeds. We showed that the accuracy of plant identification is higher for spectroscopic analysis of seeds compared to leaves. OPLS-DA results show that peanut seeds can be identified with, on average, 95% accuracy, whereas the accuracy of Raman-based identification of leaves is, on average, 80%. These results demonstrate that RS can be used in field and greenhouse settings for rapid phenotyping of plants. We also showed that RS allows for non-invasive and non-destructive assessment of the nutrient content of peanut seeds, providing information about their carbohydrate, protein, fiber, oil and unsaturated fatty acid content. As this method is fast (1 s), portable, non-invasive and non-destructive, the reported experimental evidence suggests that RS can be used directly on combines and elevators for on-line monitoring of seed quality.

Methods
Leaves. Approximately one hundred leaflets were collected from 10 different genotypes of peanut plants grown in the greenhouse at the Texas A&M AgriLife Research and Extension Center at Stephenville 125 days after planting (DAP). The IBG greenhouse is regulated by a Wadsworth Step-50 temperature control system, set so that the heaters cycle on if the temperature drops below 21 °C and the cooling system cycles on if the temperature exceeds 32 °C. To minimize the contribution from possible differences in plant vegetation stages, leaves were collected twice, with a gap of approximately five weeks between collections. Spectra from both sampling rounds were used together in the statistical analysis. Up to four spectra were acquired per leaflet, depending on leaflet size.
Peanut leaflets were collected at 109 DAP on 10/14/19 from field plots at the Texas A&M AgriLife Research and Extension Center in Stephenville. Susceptible materials included 3 released cultivars and 23 breeding lines from the Texas A&M AgriLife peanut breeding program that do not contain the gene introgression from wild peanut. Resistant material included 1 released cultivar and 6 breeding lines from the program. We collected more than 700 Raman spectra from leaves of both nematode resistant and susceptible peanut varieties and breeding lines.
Seeds. Ten different genotypes of peanut seeds were provided by the Texas A&M AgriLife Research peanut germplasm collection. Seeds were removed from −20 °C short-term cold storage and allowed to equilibrate to room temperature for approximately 24 h before scanning. Between 10 and 35 seeds per genotype were scanned twice each for use in variety differentiation by RS.
Raman spectroscopy. Acquisition. Spectra were collected using a portable, hand-held Agilent Resolve spectrometer equipped with an 830 nm laser. The following experimental parameters were used for all collected spectra: 1 s acquisition time, 495 mW power, and surface scanning mode. Leaves and seeds were each gently pressed against the nose cone during scanning to ensure that they were at the focus of the laser.
Processing. Spectra were automatically background subtracted and baseline corrected by the instrument's onboard software. These data were then exported from the portable instrument as comma separated value (CSV) files using software provided by the company. Raw spectra of leaves and seeds can be found in the SI as Figs S5 and S6, respectively.
These CSV files were then imported into MATLAB and SIMCA for preprocessing. Spectra were first normalized to unit variance using the standard normal variate (SNV) method in order to reduce the contribution of random changes in total spectral intensity. Then, all spectra were mean centered, which involves subtracting the mean spectrum from each individual spectrum. This process makes our analyses relative to the overall mean of the sample. Statistical analyses were then conducted in SIMCA 14, MATLAB, or PLS_Toolbox, an add-on for MATLAB.
Table 5. OPLS-DA confusion matrix of Raman spectra collected from seeds of 10 different genotypes (K-T) of peanut.
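The SNV and mean-centering steps described above can be illustrated with a minimal sketch. This is not the authors' MATLAB or SIMCA code: the function names `snv` and `mean_center` and the toy intensities are hypothetical, and the use of the sample standard deviation (n - 1) is an assumption, since the text does not specify which form was used.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: center one spectrum on its own mean and
    scale it by its own standard deviation."""
    mu = mean(spectrum)
    sigma = stdev(spectrum)  # assumption: sample standard deviation (n - 1)
    return [(value - mu) / sigma for value in spectrum]


def mean_center(spectra):
    """Subtract the mean spectrum (mean at each wavenumber) from every spectrum."""
    n = len(spectra)
    mean_spectrum = [sum(column) / n for column in zip(*spectra)]
    return [[v - m for v, m in zip(spectrum, mean_spectrum)] for spectrum in spectra]


# Hypothetical toy data: three spectra sampled at four wavenumbers.
# The second spectrum is the first one scaled tenfold, mimicking a random
# change in total spectral intensity.
raw = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [2.0, 2.0, 4.0, 8.0],
]
scaled = [snv(s) for s in raw]
centered = mean_center(scaled)
```

In this toy case the first two spectra become identical after SNV, which is the intended effect: an overall change in intensity no longer contributes to the differences between spectra.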
Scientific Reports | (2020) 10:7730 | https://doi.org/10.1038/s41598-020-64730-w
… to indicate discrete classes or categories of data, which the model then proceeds to predict 21. PLS-DA is a type of supervised learning model, meaning that the user must provide the categories for each datapoint during training. Once the model completes training, it is then cross-validated: part of the dataset is excluded while the rest is used to train the model. The model then attempts to predict the class membership of the excluded datapoints. This process is repeated until every datapoint has been excluded. In this study, we chose to report cross-validation results, which are suggestive of the model's ability to classify unseen data. Each PLS-DA model will contain predictive components (PCs), also known as latent variables (LVs), each of which explains a percentage of the variation in the dataset. In orthogonal PLS-DA, variation in the data is separated into a predictive portion, which accounts for separation between classes, and an orthogonal portion, which does not.
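The cross-validation procedure described above (exclude part of the data, train on the rest, predict the excluded points, repeat until every point has been excluded once) can be sketched as follows. To keep the example short and dependency-free, a simple nearest-centroid classifier stands in for PLS-DA; it is not the model used in this study. All function names, the interleaved fold assignment, the number of folds and the toy data are assumptions made for illustration.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (NOT PLS-DA): store one mean vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def cross_validate(X, y, n_folds):
    """Hold out each fold once, train on the remainder, predict the held-out
    points. Returns a confusion matrix as {(true, predicted): count}."""
    confusion = {}
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))  # interleaved folds
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [lab for i, lab in enumerate(y) if i not in test_idx]
        model = nearest_centroid_fit(train_X, train_y)
        for i in sorted(test_idx):
            key = (y[i], nearest_centroid_predict(model, X[i]))
            confusion[key] = confusion.get(key, 0) + 1
    return confusion


def accuracy(confusion):
    """Fraction of held-out points whose predicted class matches the true class."""
    correct = sum(n for (true, pred), n in confusion.items() if true == pred)
    return correct / sum(confusion.values())


# Hypothetical toy data: two well-separated classes, two features each.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y = ["A", "A", "A", "B", "B", "B"]
confusion = cross_validate(X, y, n_folds=3)
```

Calling `accuracy(confusion)` gives the kind of average cross-validated accuracy reported in the misclassification tables, while the off-diagonal entries of `confusion` show which classes are mistaken for one another.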
Differentiation of peanut varieties using leaf or seed spectra was conducted in SIMCA 14 using OPLS-DA and the preprocessing described in the main text. Differentiation by nematode resistance and O/L content was conducted in the MATLAB add-on PLS_Toolbox (Eigenvector Research Inc.) using regular PLS-DA.
ANOVA. We used analysis of variance (ANOVA) to screen our peanut samples for their nutrient content. ANOVA is a statistical procedure which tests whether any means in a set of samples are significantly different from each other. The null hypothesis of this test is that there are no significant differences amongst the categories being tested. A significant (α = 0.05) ANOVA indicates that at least one pair of groups has significantly different means. To determine which groups were significantly different from each other, we then conducted Tukey HSD tests. We then report 95% confidence intervals for the true value of each mean. Overlapping confidence intervals indicate that those groups are not significantly different from each other.
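The F statistic underlying a one-way ANOVA can be computed by hand, as in the minimal sketch below. It is an illustration only, not the MATLAB routine used in this study: the function name `one_way_anova` and the toy groups are hypothetical. Converting F to a p-value requires the F distribution, which the Python standard library does not provide; a library routine such as `scipy.stats.f_oneway` would normally be used for that step.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical band intensities for three varieties (arbitrary units).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova(groups)
```

A large F means that the spread between group means is large relative to the spread within groups, which is the condition under which the null hypothesis is rejected and post hoc comparisons such as Tukey HSD become informative.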
To conduct ANOVAs on our peanut nutrient dataset, spectra were first imported into MATLAB. They were then normalized to total spectral area, and intensities at individual wavenumbers (Fig. 6) were extracted. ANOVA was then conducted using the anova1 function on the intensities at each of the selected wavenumbers. The 95% confidence intervals were constructed using the multcompare function, which by default uses Tukey HSD to evaluate group-to-group differences.
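The normalization and band-extraction steps that precede the ANOVA can be sketched as below. This is a hedged illustration rather than the authors' MATLAB code: the helper names, the toy wavenumber axis and the intensities are hypothetical, and the total area is approximated by a plain sum, which assumes evenly spaced wavenumbers.

```python
def normalize_to_area(intensities):
    """Divide a spectrum by its total area.
    Assumption: evenly spaced wavenumbers, so the area is proportional to the sum."""
    total = sum(intensities)
    return [v / total for v in intensities]


def band_intensity(wavenumbers, intensities, target):
    """Return the intensity at the sampled wavenumber closest to `target`."""
    idx = min(range(len(wavenumbers)), key=lambda i: abs(wavenumbers[i] - target))
    return intensities[idx]


# Hypothetical toy spectra around the 1005 cm^-1 protein band.
# The two "Q" spectra have the same shape but different overall intensity,
# as would be expected for seeds of different color.
wavenumbers = [1003, 1004, 1005, 1006, 1007]
spectra_by_variety = {
    "Q": [[2.0, 4.0, 8.0, 4.0, 2.0], [1.0, 2.0, 4.0, 2.0, 1.0]],
    "T": [[3.0, 4.0, 6.0, 4.0, 3.0]],
}

# One list of normalized band intensities per variety: the input format
# expected by a one-way ANOVA on that band.
band_by_variety = {
    variety: [
        band_intensity(wavenumbers, normalize_to_area(s), 1005) for s in spectra
    ]
    for variety, spectra in spectra_by_variety.items()
}
```

After area normalization the two "Q" spectra give the same band intensity despite their twofold difference in raw signal, which is why this step is applied before comparing varieties.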