Introduction

Continuous growth of the global population requires a perpetual increase in food production. It is expected that by 2050 we will need to produce 70% more food1. Such expectations can be met only through major transformations in currently used agricultural approaches. For instance, utilization of sensor-based field irrigation in 50% of ornamental operations can save up to 223 billion liters of water per year in the U.S. alone, equivalent to the water use of approximately 400,000 U.S. households2. Satellite or unmanned aerial vehicle (UAV) guided imaging can be used to monitor agricultural crops to minimize harvest losses. This new agricultural paradigm, also known as digital farming, aims to automate agricultural processes via application of precision location (GPS) methods and artificial intelligence.

Although many processes in modern agriculture have reached a substantial degree of automation, plant breeding and taxonomic identification are far from that point. Currently, visual inspection and genotyping are the only options for positively identifying a plant. The first approach is subjective and typically requires substantial knowledge and practical experience. Years of training may be required for a plant breeder or a botanist to become an expert in one area of the plant kingdom. To exclude the human factor in plant identification, Baena et al. used RGB imaging from UAVs to identify plants3. However, the reported results demonstrated that such an approach may work only for large plants with major morphological differences. Genotyping, whether by sequencing or other methods, is broadly employed by scientists and breeders to identify the genetic material associated with traits of interest. However, these methods are destructive, time-consuming and labor-intensive.

Raman spectroscopy (RS) is a label-free, non-invasive and non-destructive analytical technique that can be used to probe the chemical composition of analyzed samples. Our group recently demonstrated that RS can be utilized for confirmatory diagnostics of biotic and abiotic stresses on plants. Specifically, we showed that using a hand-held Raman spectrometer, one can diagnose several devastating fungal diseases on corn, wheat and sorghum with high accuracy4,5. We also demonstrated that RS can be used for pre-symptomatic diagnostics of citrus greening disease of orange and grapefruit trees and for detection of pests inside cowpea seeds6,7. Additionally, we showed that RS could be used to diagnose rose rosette disease on roses8. These studies demonstrated that RS can detect chemical changes in plants and potentially revolutionize agriculture.

RS could potentially enable simultaneous digital phenotyping and nutrient assessment of plant materials. This would pave the way for autonomous seed selection, or automatic breeding. Additionally, should RS be able to detect genotypic, disease-associated, and nutrient-related changes in plants, it could be deployed to provide real-time health information for plants in greenhouses or out in the field.

In this proof-of-principle study, we demonstrate the potential of RS for the accurate identification of peanut genotypes based on chemometric analysis of their leaves and seeds. Additionally, we demonstrate that RS can be used to screen peanut leaves for nematode resistance as well as oleic-to-linoleic acid content ratios. Finally, we show that RS can probe the relative contents of carbohydrates, fiber, oils, proteins, fatty acids and esters in these same peanut seeds.

Results and Discussion

Leaf-based spectrotyping of peanuts

Raman spectra collected from the 10 different genotypes of peanuts (Fig. 1) exhibited similar profiles, with vibrational bands at 480 and 917 cm−1, which can be assigned to carbohydrates, 520, 1048 and 1115 cm−1 to cellulose, 747 and 853 cm−1 to pectin, 1000, 1155 and 1526 cm−1 to carotenoids, 1185, 1606 and 1632 cm−1 to phenylpropanoids (including lignin), 1660 cm−1 to proteins and 1682 cm−1 to carboxylic acids. We also observed vibrational bands at 964, 1286, 1327, 1387 and 1443 cm−1, which can be assigned to aliphatic groups (CH2/CH3 vibrations) (Fig. 2, Table 1).

Figure 1
Phenotypical appearance of the ten different peanut genotypes (A–J) used for the study. Identities are as follows: (A) Arachis archeri (PI475987), (B) Arachis sp. (36009), (C) Arachis cardenasii (36035Y), (D) Arachis helodes (6331-3), (E) Arachis kuhlmannii (7631-1), (F) Arachis matiensis (36007), (G) Arachis nitida (S-3942), (H) Arachis subcoriacea (13706-1), (I) Arachis hypogaea (TP 623-1-2), (J) Arachis pintoi (12787).

Figure 2
Averages of non-normalized (A) and area normalized (B) Raman spectra collected from leaves of 10 different peanut genotypes.

Table 1 Vibrational bands and their assignments for spectra collected from leaves and seeds of peanuts.

We employed orthogonal partial least squares discriminant analysis (OPLS-DA) to determine whether RS can be used for the quantitative identification of peanut genotypes based on spectroscopic signatures collected from their leaflets. The loading plot and misclassification table were generated using the final model, which contained nine predictive components (PCs), one orthogonal component and 1021 (690–1710 cm−1) original wavenumbers for standard normal variate (SNV) pre-processed first derivative spectra. The nine PCs explained a total of 43% of the variation between the classes (Fig. S1).

The model identified the carotenoid band at 1525 cm−1 (PC1), the phenylpropanoid bands at 1606 and 1632 cm−1 (PC1), the cellulose band at 1155 cm−1 (PC1), the xylan band at 1218 cm−1 (PC2), the hydrocarbon (CH2/CH3) vibration at 1443 cm−1 (PC2), as well as the cellulose and phenylpropanoid band at 1185 cm−1 (PC3) as the strongest predictors of species. The model also explained 47% of the variation (R2X) in the spectra. Our results indicated that RS can be used to identify peanut plants with, on average, 80% accuracy based on the spectroscopic signatures of their leaflets (Table 2). However, some varieties had markedly lower classification accuracy than others. This may be associated with the growth stage of the plants.

Table 2 OPLS-DA confusion matrix of Raman spectra collected from leaves of 10 different genotypes (A–J) of peanuts.

Resistance that imparted almost total immunity to root-knot nematode (Meloidogyne arenaria) was previously identified in a wild peanut species9. Several highly resistant cultivars have been released from the resulting gene introgression10,11,12. Root-knot nematode resistance is associated with a failure of juveniles to establish a feeding site13. We examined whether RS can be used to probe for resistance in peanut cultivars. Although the average spectroscopic signatures of nematode resistant and susceptible plants are very similar (Fig. 3A), partial least squares discriminant analysis (PLS-DA) allowed for identification of these two groups with an average accuracy of 75% (Table 3). From the loadings plot (Fig. S2), we observed that changes associated with the carotenoid (1526 cm−1), phenylpropanoid (1606 cm−1) and 1155 cm−1 bands were the most important for this prediction. Further biochemical analysis is required to determine what the relationship between these compounds in the leaves and nematode resistance may be.

Figure 3
Averaged Raman spectra collected from the leaves of: (A) nematode (N) resistant and susceptible plants or (B) plants with differing O/L ratios in their seeds.

Table 3 PLS-DA cross-validation confusion matrix of Raman spectra collected from leaves of nematode resistant and susceptible peanut varieties.

Next, we employed RS to distinguish between plants known to have high and normal oleic-to-linoleic acid (O/L) ratios. High-oleic peanuts are preferred by manufacturers because they have a longer shelf life owing to reduced rancidity, and have been shown to be associated with reduced serum cholesterol levels and reductions in cardiovascular disease14. The following high-oleic varieties have been officially released by Texas A&M AgriLife as plant variety protected cultivars: Tamrun OL0112, Tamrun OL0215, Tamrun OL0615, Tamrun OL0715, Webb11, Tamrun OL1116, Schubert17, and TamVal OL1418. We collected more than 300 spectra from peanut varieties and breeding lines with either high (23 breeding lines and 1 released cultivar) or low O/L ratios (4 released cultivars). RS revealed that high O/L plants have lower phenylpropanoid content (1606 and 1632 cm−1), whereas all other structural components of these two groups of plants appeared to be nearly identical (Fig. 3B). PLS-DA of these spectra allowed for identification of high vs normal O/L plants with an average accuracy of 82% (Table 4). Inspecting the loadings plot (Fig. S3), we found that the important bands for O/L prediction were similar to those for nematode resistance prediction, as the plot patterns are very similar. However, the phenylpropanoid bands are associated with the first latent variable (LV) for O/L, whereas they appear in the third LV for nematode resistance. As each LV explains successively less variation in a dataset, this may suggest that phenylpropanoid content is more important for prediction of O/L content than for nematode resistance. These results demonstrate that RS can be used to assist plant breeders by allowing for fast screening of genetic traits.

Table 4 PLS-DA cross-validation confusion matrix of Raman spectra collected from leaves of peanut varieties with high and low O/L ratios.

Seed-based identification of peanuts and assessment of their nutrient values

We also investigated whether RS can be used to positively identify peanut seeds. We collected 342 spectra from 5 wild (assigned K through O) and 5 cultivated (assigned P through T) genotypes of peanuts (Fig. 4). The Raman spectra of peanut seeds exhibited vibrational bands that could be assigned to pectin (849 cm−1), carbohydrates (1080, 1119, 1301 and 1339 cm−1), and phenylpropanoids (1611 cm−1) (Fig. 5). We also observed vibrational bands at 969 and 1446 cm−1, which can be assigned to CH2 and CH2/CH3 vibrations, respectively. The vibrational band at 1005 cm−1 could potentially be assigned to both carotenoids and proteins. In addition to the vibrational band at ~1000 cm−1, carotenoids also have a distinct vibrational band around 1530 cm−1, which was not observed in the Raman spectra of peanut seeds. Therefore, we can unambiguously assign the vibrational band at 1005 cm−1 to proteins. The vibrational band at 1656 cm−1 can also be assigned to proteins (amide I). However, this band may also originate from -C=C- vibrations of unsaturated fatty acids. The -C=C- group also exhibits a vibrational mode at 1265 cm−1, which was observed in the Raman spectra of peanut seeds19. Since the intensities of the 1265 and 1656 cm−1 bands change synchronously from one genotype to another, it is highly likely that both of them should be assigned to the same chemical moiety. Therefore, the vibrational band at 1265 cm−1 has been assigned to the alkene group of unsaturated fatty acids. Finally, the vibrational band at 1748 cm−1 can be assigned to esters.

Figure 4
Photographs of seeds of ten different peanut genotypes (K–T). Identities are as follows: (K) Arachis hypogaea (TxL090106-05), (L) Arachis valida (30147), (M) Arachis hypogaea (US# 1551 Tan), (N) Arachis praecox (6416), (O) Arachis cardenasii (10017), (P) Arachis hypogaea (US# 1519 Red), (Q) Arachis hypogaea (TxAG-8), (R) Arachis hypogaea (Tx144932), (S) Arachis glandulifera (30098), (T) Arachis paraguariensis (10585).

Figure 5
Averages of non-normalized (A) and area normalized (B) Raman spectra collected from seeds of 10 different peanut genotypes.

Next, we investigated whether these Raman spectra of peanut seeds can be used to probe nutrient composition. The intensity of a vibrational band in a Raman spectrum directly correlates with the concentration of the corresponding chemical in the sample. Therefore, we can use the intensities of the 1005, 1301, 1443, and 1607 cm−1 vibrational bands to probe the relative content of proteins, carbohydrates, oils and fiber, and the vibrational bands at 1656 and 1748 cm−1 to predict the amount of unsaturated fatty acids and esters in peanut seeds. Previously, we showed that overall spectral intensity directly depends on the color of the sample. For instance, in our study on the nutrient content of corn kernels, spectra collected from dark colored corn kernels had much lower intensity compared to the Raman spectra collected from lighter colored kernels20. Therefore, to exclude the influence of seed color on the intensities of vibrational bands, spectra were normalized to the total spectral area, as this method is the least biased.
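The area normalization described above can be illustrated with a minimal Python sketch. It is not the code used in this study (the analysis was run in MATLAB); the function names, the synthetic band shape and the 701–1800 cm−1 grid with 1 cm−1 spacing are assumptions made for illustration, and the sketch assumes baseline-corrected spectra with a positive total area.

```python
import numpy as np

def area_normalize(spectra, wavenumbers):
    """Divide each spectrum (one per row) by its total area under the curve."""
    spectra = np.asarray(spectra, dtype=float)
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    # Trapezoidal integration along the wavenumber axis, one area per spectrum.
    widths = np.diff(wavenumbers)
    areas = np.sum((spectra[:, 1:] + spectra[:, :-1]) * widths / 2.0, axis=1)
    return spectra / areas[:, None]

def band_intensity(spectra, wavenumbers, band):
    """Intensity of every spectrum at the grid point nearest to `band` (cm^-1)."""
    idx = int(np.argmin(np.abs(np.asarray(wavenumbers, dtype=float) - band)))
    return np.asarray(spectra, dtype=float)[:, idx]

# Hypothetical data: one spectral shape recorded at two overall intensities,
# mimicking a light-colored and a dark-colored seed of identical composition.
wn = np.arange(701.0, 1801.0)                      # assumed 1 cm^-1 grid
shape = np.exp(-((wn - 1005.0) / 10.0) ** 2) + 0.5
raw = np.vstack([3.0 * shape, 1.0 * shape])

normalized = area_normalize(raw, wn)
protein_band = band_intensity(normalized, wn, 1005.0)
```

In this toy case the two raw spectra differ threefold in overall intensity, yet their normalized intensities at 1005 cm−1 coincide, which is the behavior the normalization is meant to provide when seed color changes the overall signal level.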

We used ANOVA to determine whether the previously analyzed differences in nutrient-associated bands were statistically significant (Fig. 6). In general, we found that the wild varieties tended to have wider confidence intervals for the true mean intensity of a given band of interest compared to the cultivated varieties. This is logical as the cultivated varieties have been bred for specific traits of interest. At the protein band (1005 cm−1; Fig. 6, top left), ANOVA revealed that varieties L, O and Q have significantly higher protein band intensity than R, S and T. Additionally, variety N has significantly higher intensity than variety T. Finally, variety Q is significantly more intense than all varieties except L, N and O. Variety Q may have been selectively bred for higher protein content compared to the other varieties, while within the wild varieties, none of the groups significantly differ from each other.

Figure 6
Means (circles) and 95% confidence intervals for the intensities of the peanut seed spectra, normalized to total spectral area, at the indicated wavenumbers. Intervals were generated following the ANOVA tests. Blue and solid: wild variety; red and dashed: cultivated variety.

The carbohydrate band varied as well (Fig. 6, top right). Varieties M, N and P are significantly more intense than all other cultivated varieties. The remaining cultivated varieties (Q–T) do not have significantly different intensities from each other. Variety O is significantly less intense than varieties K, M, N and P. The general trends observed for protein remain true for carbohydrates: the cultivated varieties have narrower confidence intervals, which is probably due to their being selected for traits associated with carbohydrate buildup.

In terms of oil content (Fig. 6, middle left), varieties L, M, P and Q have significantly higher intensity than variety T but are not different from each other. Additionally, varieties P and Q have significantly higher intensity than S and T, which suggests that these varieties may have been bred for higher oil content. All other varieties are not significantly different from each other. In fiber content (Fig. 6, middle right), the centers of the wild variety confidence intervals were all lower than the centers of the cultivated varieties. However, only varieties P, R and T have significantly higher fiber-associated intensity than the wild varieties. Variety P is significantly more intense than all other varieties analyzed, a grouping which only appears once in all bands analyzed.

Additionally, we discovered that for unsaturated fatty acids (Fig. 6, bottom left), varieties P and Q are significantly less intense than all other varieties. Varieties K, N, S and T are significantly more intense than variety R, and variety K is significantly more intense than all other varieties but N and S. Finally, in esters (Fig. 6, bottom right), we found that while only M and N are significantly different in terms of within category (wild and cultivated) differences, variety M is significantly more intense than all cultivated varieties. In fact, the centers of the confidence intervals for the cultivated varieties are all lower than those of the wild varieties. Nevertheless, only K, L, M, and O are significantly more intense than the least intense cultivated variety T.

We then used OPLS-DA to demonstrate that RS can be used for identification of peanut genotypes based on the spectra collected from their seeds. The loading plot (Fig. S4) and misclassification table (Table 5) were generated using the final model, which contained nine predictive components (PCs), one orthogonal component and 1100 (701–1800 cm−1) original wavenumbers for SNV pre-processed first derivative spectra. The nine PCs explained a total of 55% of the variation between the classes.

Table 5 OPLS-DA confusion matrix of Raman spectra collected from seeds of 10 different genotypes (K-T) of peanut.

The model identified the ester band at 1748 cm−1 (PC1), a phenylpropanoid band at 1611 cm−1 (PC1), a pectin band at 849 cm−1 (PC1), carbohydrate bands at 1302 and 1339 cm−1 (PC2), the hydrocarbon (CH2/CH3) vibration at 1446 cm−1 (PC2), as well as unsaturated fatty acid bands at 1265 and 1656 cm−1 (PC3) as the strongest predictors of genotype. The model also explained 33% of the variation (R2X) in the spectra.

Our results demonstrated that RS can be used for highly accurate (95%) identification of peanut seeds. This accuracy is much higher than that achieved for leaves. This can be explained by the very low, if any, metabolism occurring in the seeds as compared to actively growing plants.

Conclusions

Our results demonstrate that RS can be used for accurate identification of peanuts based on spectroscopic signatures of their leaves and seeds. We showed that the accuracy of plant identification is higher upon spectroscopic analysis of seeds compared to leaves. OPLS-DA results show that peanut seeds can be identified with, on average, 95% accuracy, whereas the accuracy of Raman-based identification of leaves is, on average, 80%. These results demonstrate that RS can be used in field and greenhouse settings for rapid phenotyping of plants. We also showed that RS allows for non-invasive and non-destructive assessment of the nutrient content of peanut seeds, providing information about their carbohydrates, proteins, fiber, oils and unsaturated fatty acids. As this method is fast (1 s), portable, non-invasive and non-destructive, the reported experimental evidence suggests that RS can be used directly on combines and elevators for on-line monitoring of seed quality.

Methods

Leaves

Approximately one hundred leaflets were collected from 10 different genotypes of peanut plants grown in the greenhouse at the Texas A&M AgriLife Research and Extension Center at Stephenville 125 days after planting (DAP). The IBG greenhouse was controlled by a Wadsworth Step-50 temperature control system, in which the heaters cycle on if the temperature drops below 21 °C and the cooling system cycles on if the temperature exceeds 32 °C. To minimize the contribution from possible differences in plant vegetation stages, leaves were collected twice, with a gap of approximately five weeks between collections. Spectra from both sampling rounds were used together in the statistical analysis. Up to four spectra were acquired per leaflet, depending on leaflet size.

Peanut leaflets were collected at 109 DAP on 10/14/19 from field plots at the Texas A&M AgriLife Research and Extension Center in Stephenville. Susceptible materials included 3 released cultivars and 23 breeding lines from the Texas A&M AgriLife peanut breeding program that do not contain the gene introgression from wild peanut. Resistant material included 1 released cultivar and 6 breeding lines from the program. We collected more than 700 Raman spectra from leaves of both nematode resistant and susceptible peanut varieties and breeding lines.

Seeds

Seeds of 10 different peanut genotypes were provided by the Texas A&M AgriLife Research peanut germplasm collection. Seeds were removed from −20 °C short-term cold storage and allowed to equilibrate to room temperature for approximately 24 h before scanning. For each genotype, 10–35 seeds were scanned two times each for use in variety differentiation by RS.

Raman spectroscopy

Acquisition

Spectra were collected using a portable, hand-held Agilent Resolve spectrometer equipped with an 830 nm laser. The following experimental parameters were used for all collected spectra: 1 s acquisition time, 495 mW power, and surface scanning mode. Leaves and seeds were each gently pressed against the nose cone during scanning to ensure that they were at the focus of the laser.

Processing

Spectra were automatically background subtracted and baseline corrected by the instrument’s onboard software. These data were then exported from the portable instrument as comma separated value (CSV) files using software provided by the company. Raw spectra of leaves and seeds can be found in the SI as Figs S5 and S6, respectively.

These CSV files were then imported into MATLAB and SIMCA for preprocessing. Spectra were first normalized to unit variance using the standard normal variate (SNV) method in order to reduce the contribution of random changes in total spectral intensity. Then, all spectra were mean centered, which involves subtracting the mean spectrum from each individual spectrum. This step makes our analyses relative to the overall mean of the dataset. Statistical analyses were then conducted in SIMCA 14, MATLAB, or PLS_Toolbox, an add-on for MATLAB.
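The two preprocessing steps can be sketched in Python with NumPy as follows. This is an illustrative analogue rather than the SIMCA or MATLAB routines used here; the function names and the random test spectra are hypothetical, and the use of the sample standard deviation (ddof=1) in SNV is an assumption.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: scale each spectrum (row) to zero mean, unit variance."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)   # assumed: sample std
    return (spectra - mean) / std

def mean_center(spectra):
    """Subtract the dataset's mean spectrum from every individual spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    mean_spectrum = spectra.mean(axis=0, keepdims=True)
    return spectra - mean_spectrum, mean_spectrum

# Hypothetical spectra whose overall intensity varies randomly between scans.
rng = np.random.default_rng(0)
base = rng.random((6, 200))
scale = rng.uniform(0.5, 3.0, size=(6, 1))
raw = base * scale + 10.0

snv_spectra = snv(raw)
centered, mean_spectrum = mean_center(snv_spectra)
```

SNV acts on each spectrum separately and removes scan-to-scan differences in overall intensity, whereas mean centering acts on each wavenumber across the whole dataset. The mean spectrum is returned so that the same offset could be applied to spectra that were not part of the training set.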

Statistical analysis

PLS-DA

Spectra were imported into either SIMCA 14 (Umetrics, Umeå, Sweden) or MATLAB for multivariate statistical analysis. To build classification models, partial least squares discriminant analysis (PLS-DA) was selected. PLS-DA is an extension of ordinary PLS which uses dummy Y-variables to indicate the discrete classes or categories of data that the model then proceeds to predict21. PLS-DA is a type of supervised learning model, meaning that the user must provide the category of each datapoint during training. Once the model completes training, it is cross-validated: part of the dataset is excluded while the rest is used to train the model. The model then attempts to predict the class membership of the excluded datapoints. This process is repeated until every datapoint has been excluded. In this study, we chose to report cross-validation results, which are suggestive of the model’s ability to classify unseen data. Each PLS-DA model contains predictive components (PCs), also known as latent variables (LVs), each of which explains a percentage of the variation in the dataset. In orthogonal PLS-DA, variation in the data is separated into a predictive portion, which accounts for separation between classes, and an orthogonal portion, which does not.
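The dummy-variable and cross-validation logic can be sketched for a two-class problem in Python. This is a bare-bones NIPALS PLS1 written for illustration and is not the SIMCA or PLS_Toolbox implementation; the function names, the 0.5 decision threshold, the leave-one-out scheme and the synthetic data are all assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                    # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                       # scores on this latent variable
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = yk @ t / tt                  # y loading
        Xk = Xk - np.outer(t, p)         # deflate X and y before the next LV
        yk = yk - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector in X space
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean

def loo_plsda(X, labels, n_components=2):
    """Leave-one-out cross-validated predictions for a two-class PLS-DA."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(labels, dtype=float)      # dummy Y-variable: 0 or 1
    predicted = np.empty(len(y), dtype=int)
    for i in range(len(y)):
        train = np.arange(len(y)) != i       # exclude one datapoint
        model = pls1_fit(X[train], y[train], n_components)
        predicted[i] = int(pls1_predict(model, X[i:i + 1])[0] > 0.5)
    return predicted

# Hypothetical spectra: class 1 has a stronger band at variables 10-14.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
X = rng.normal(0.0, 0.1, size=(40, 30))
X[labels == 1, 10:15] += 1.0

predicted = loo_plsda(X, labels)
confusion = np.zeros((2, 2), dtype=int)      # rows: true class, columns: predicted
for true_class, pred_class in zip(labels, predicted):
    confusion[true_class, pred_class] += 1
accuracy = np.trace(confusion) / confusion.sum()
```

The confusion matrix built at the end has the same layout as a cross-validation misclassification table: each row holds the true class, each column the predicted class, and the accuracy is the fraction of spectra on the diagonal.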

Differentiation of peanut varieties using leaf or seed spectra was conducted in SIMCA 14 using OPLS-DA and the preprocessing described in the main text. Differentiation of nematode resistance and O/L content was conducted in the MATLAB add-on PLS_Toolbox (Eigenvector Research Inc.) using regular PLS-DA.

ANOVA

We used analysis of variance (ANOVA) to screen our peanut samples for their nutrient content. ANOVA is a statistical procedure which tests whether any means in a set of samples are significantly different from each other. The null hypothesis of this test is that there are no significant differences amongst the categories being tested. A significant (α = 0.05) ANOVA indicates that at least one pair of groups has significantly different means. To determine which groups were significantly different from each other, we then conducted Tukey HSD tests. We then report 95% confidence intervals for the true value of each mean. Overlapping confidence intervals indicate that those groups are not significantly different from each other.

To conduct ANOVAs on our peanut nutrient dataset, spectra were first imported into MATLAB. They were then normalized to total spectral area and intensities at individual wavenumbers (Fig. 6) were extracted. ANOVA was then conducted using the anova1 function on the intensities at each of the selected wavenumbers. 95% confidence intervals were constructed using the multcompare function, which by default uses Tukey HSD to evaluate group-to-group differences.
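For readers working outside MATLAB, a rough Python analogue of this workflow is sketched below using SciPy (scipy.stats.tukey_hsd requires SciPy 1.8 or later). It is not the code used in this study, the three groups and their values are invented for illustration, and SciPy reports confidence intervals for pairwise differences between group means rather than the per-group intervals plotted in Fig. 6.

```python
import numpy as np
from scipy import stats

# Hypothetical area-normalized intensities at one band for three varieties.
groups = {
    "K": np.array([1.00, 1.10, 0.90, 1.05, 0.95]),
    "L": np.array([1.02, 1.08, 0.92, 1.00, 0.98]),
    "M": np.array([2.00, 2.10, 1.90, 2.05, 1.95]),
}
names = list(groups)
samples = [groups[name] for name in names]

# One-way ANOVA: is at least one group mean different from the others?
anova = stats.f_oneway(*samples)

# Tukey HSD: which pairs of groups differ, with 95% intervals on the differences.
tukey = stats.tukey_hsd(*samples)
ci = tukey.confidence_interval(confidence_level=0.95)

significant_pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if tukey.pvalue[i, j] < 0.05
]
```

In this invented example varieties K and L share the same mean and M is clearly higher, so the ANOVA is significant and only the pairs involving M are flagged by the Tukey test. An interval in `ci` that excludes zero corresponds to a significant pairwise difference.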