Introduction

Mass spectrometry in combination with hyphenated techniques (e.g., chromatography) is the gold standard for chemical identification where legal enforcement is required1. Recently ambient ionization (AI) MS has been used to provide rapid assessment of authenticity in a variety of scenarios such as the adulteration of Whisky2, pesticide residue analysis on strawberry surfaces3, and the quantification of bisphenol A and its analogs contained within food packaging4, to mention only a few. AI approaches seek to reduce or entirely negate sample preparation, allowing analysis to be performed on samples in their native environment5. Popular AI sources include the likes of paper spray6,7,8,9,10, direct analysis in real time11,12,13 and desorption electrospray14,15. AI sources that utilize corona discharge (CD) have been extensively reported, such as ASAP16,17 and DAPCI18,19. These approaches typically rely on a heated nebulizing/desolvating gas flow in one form or another to vaporize a liquid/solid sample prior to ionization and analysis. CD without any accompanying gas flow is routinely used with ion mobility instruments20,21 for trace vapor detection and has been incorporated with various AI sources to improve ionization efficiency. For instance, Song et al. applied CD in combination with an inexpensive ultrasonic nebulizer to detect antibiotic drugs in milk22, Mullen et al. combined secondary electrospray with CD to improve detection of explosive compounds23, and Sekimoto et al. increased DART ionization efficiency by addition of a CD pin between the source and inlet24. A number of studies have utilized direct ambient CD ionization for MS studies25,26,27,28.

A key aim of this study was to develop a methodology for direct peppercorn seed analysis that is sufficiently robust and requires neither sample preparation nor consumables such as solvents or gas. We surmised that CD ionization can be applied to solid-phase samples that naturally contain volatile compounds, allowing direct analysis without any prior vaporization. This reduces the setup to a CD needle placed in proximity to the sample. As shown in this study, with appropriate sample types, the information obtained from this analytical setup can have great impact on the respective fields. In this regard, aromatic spices are well suited to this exploration because of the possible impact on the food and agricultural industry. For example, black pepper (Piper nigrum), colloquially called the “King of Spices”, has become one of the most widely used spices in the world. Originating in India, black pepper is a mainstay in both eastern and western cuisines, with its global export value reaching 1.6 billion USD in 201929. Thus, various aspects of quality control of this product are vital to the sustainability of its market30,31,32,33. Prominent examples are the development of analytical methods34,35,36 to detect adulteration by cheaper materials, e.g., pepper husk or papaya seed, and to uncover geographical origins of peppers37. The latter study is directly linked to geographical indication (GI), which has become an increasingly important leveraging tool in trade negotiations38,39 and sales valuation of products. GI-protected products are worth, on average, double the value of non-certified products40. In the EU alone, the value of GI products is estimated at over €75 billion40. Therefore, misrepresenting product origin is very appealing to illegitimate producers. The drive to enforce GI necessitates the development of chemical analysis methods that are rapid, non-biased and versatile41,42,43.

Herein, we demonstrate, for the first time, the application of a CD for the direct ionization of peppercorn seeds to obtain a set of mass spectrometry (MS) data that are indicative of their geographical origins. Combined with robust pre-processing and chemometric analyses (principal component analysis (PCA) and linear discriminant analysis (LDA)), this analytical method was able to distinguish the geographical origin of different sources of black peppers, along with some simpler studies including the type of peppers (black vs white), and the origin of white peppers. An iterative reformulation feature selection algorithm was used to ascertain the key m/z’s necessary to construct accurate classifier models. Our findings indicate that this rapid, versatile and non-targeted MS protocol has the potential to be used for further GI studies relating to other herbs, spices and foodstuffs in general.

Material and methods

Sample preparation

Black pepper seeds originating from Thailand (6 sources), China (1 source), India (1 source), and the United Kingdom (3 sources) were used in this study; all samples were purchased locally from supermarkets or ordered online from reputable suppliers. White peppers from Thailand (6 sources) and China (1 source) were also purchased in a similar manner. All samples were removed from their respective packaging for analysis. Otherwise, no sample pretreatment or preparation was carried out. All experiments were performed in accordance with relevant guidelines and regulations.

Acquisition of MS data from peppers

A Waters Xevo TQ MS (Waters Corporation, Milford, MA, USA) was used in this study with the following instrument parameters: cone voltage at 20 V, source temperature at 100 °C, and acquisition mass range of 50–500 m/z. The experimental method is depicted in Fig. 1. Briefly, a peppercorn seed was affixed with a distance of approximately 3 mm from the MS inlet. A stainless steel needle was placed 2–3 mm above the sample with an applied potential of 3 kV, as determined by prior optimization experiments to yield a stable signal. A corona discharge (CD) develops at the apex of the needle due to the large potential and small radius of curvature which subsequently ionizes vapors from the pepper sample. Data was acquired for a duration of 2 min with 1 s per scan. At least 9 seeds from each source were tested, with a blank (no seed) being run in between different sample sources. The complete set of data can be found in the supplementary information.

Figure 1. Experimental setup for direct ionization MS on peppercorn seeds.

Data pre-processing for subsequent chemometric analyses

A series of automated data pre-processing steps was performed in MATLAB (MathWorks). (1) Any m/z values over 300 were discarded owing to the lack of any significant ion peaks. (2) Mass spectra were smoothed with a weighted linear regression function over the peak width (10 data points) to suppress intensity fluctuations within the same m/z channel. Baseline correction was then performed, and any point with a negative value was set to zero. (3) The intensity of each m/z data point was scaled logarithmically to give the data points comparable significance. (4) The data were aligned using 4 reference peaks (m/z = 109.1, 137.15, 177.15 and 235.3), which are consistently present in all samples and background scans, to correct for any instrument drift during analysis. (5) The mass spectra were downsampled to a 0.1 amu grid, reducing the number of m/z data points from 3172 to 2000. (6) Local maxima within 0.8-amu windows were identified and retained, reducing the number of variables from 2000 to 267. (7) The intensity data of the blank, i.e., runs without any peppercorn seed but with the same setup, were subtracted from the peak intensity data of the real samples. (8) Finally, the white and black pepper datasets were combined into a single data table of 267 variables (m/z values) by 156 samples. Further filtering of m/z values closer than 0.5 amu reduced the dataset to a final 199 × 156 data table. An overview scheme of the data preparation, with an example mass spectrum after pre-processing, is illustrated in Fig. S1.
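Some of the steps above can be sketched in Python with NumPy. This is only an illustrative sketch, not the MATLAB code used in the study: the function name `preprocess`, the `eps` offset inside the logarithm, and the use of linear interpolation for resampling are assumptions, and smoothing, baseline correction, alignment and the final 0.5 amu filtering are omitted.

```python
import numpy as np

def preprocess(mz, intensity, blank, mz_max=300.0, step=0.1, window=0.8, eps=1.0):
    """Hypothetical sketch of steps (1), (3), (5), (6) and (7)."""
    # (1) discard m/z values above mz_max
    keep = mz <= mz_max
    mz, intensity, blank = mz[keep], intensity[keep], blank[keep]

    # (3) logarithmic scaling; eps (assumed) avoids log(0)
    log_s = np.log10(intensity + eps)
    log_b = np.log10(blank + eps)

    # (5) resample onto a regular grid (linear interpolation is an assumption)
    grid = np.arange(mz.min(), mz.max() + step / 2, step)
    s = np.interp(grid, mz, log_s)
    b = np.interp(grid, mz, log_b)

    # (6) keep points that are the maximum of their surrounding window
    half = int(round(window / step / 2))
    peaks = [i for i in range(len(grid))
             if s[i] > 0 and s[i] == s[max(0, i - half): i + half + 1].max()]
    peaks = np.array(peaks, dtype=int)

    # (7) subtract the blank at the retained m/z positions
    return grid[peaks], s[peaks] - b[peaks]
```

Applied to every scan-averaged spectrum, the returned intensity vectors would form the rows of the data table used in the chemometric analysis, provided the same peak positions are used for all samples.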

Visualization and classification methods

The data sets contain a large number of variables (267). To visualize the data and determine the number of significant factors, further data reduction was performed after centering the data over all samples. Principal component analysis (PCA) was used to reduce the dimensionality of the multivariate data set to those characteristics that contribute the most to the overall variance: the first principal component (PC1) represents the greatest variance of the data, the second (PC2) the second greatest, and so on. PCA scores can be used to visualize clusters of samples that share common influences, and thus the underlying relationships in the data. More detailed information on PCA is discussed elsewhere44,45. The clusters of samples were visualized using either the PCs with the maximum variances (PC1-PC3) or the best discriminating PCs.

To assess classification performance, linear discriminant analysis (LDA) was used for class prediction. LDA creates a model boundary (classifier) between classes using a linear discriminant function that defines the directions in which the classes are best separated. However, the LDA algorithm requires the inverse of the variance–covariance matrix. If the number of variables is larger than the number of samples, the variance–covariance matrix is singular and cannot be inverted. Hence, in this study, PCA scores were used as inputs to the LDA classification model (PCA-LDA): PCA first reduced the dimensionality of the data matrix to 3 PCs, and LDA was then applied to the scores to create the classifier for class prediction. The predictive ability of the PCA-LDA model was evaluated by “leave-one-out” cross validation (LOOCV). Mean centering was performed on the training set samples, while the test sample was centered using the parameters obtained from the training set. PCA-LDA and LOOCV were combined in the following steps:

  1. For the data matrix (\(X\)), a mass spectrum of a sample (\(x_{test}\)) was removed to be used as a test set, while the remaining mass spectra formed the training set (\(X_{train}\)).

  2. PCA was performed on the training set to obtain the scores (\(T_{train}\)) and loadings (\(P_{train}\)). In this study, only the first 3 PCs were used to represent the overall variance of the data.

  3. The score (\(t_{test}\)) of the test sample (\(x_{test}\)) was calculated by using the pseudo-inverse of the loading matrix from the training set (\(P_{train}\)) as \(t_{test} = x_{test} P_{train}^{T} \left( P_{train} P_{train}^{T} \right)^{-1}\), where \(T\) is the transpose operator and \(-1\) is the inverse operator.

  4. The LDA classifier boundary of each class was built from \(T_{train}\) to predict the class of the test sample using \(t_{test}\).

The procedure was repeated until every sample had been used as the test sample. A contingency table can then be constructed to express the performance and stability of the developed classifier.
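The four steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions rather than the implementation used in the study: all function names are hypothetical, PCA is computed by singular value decomposition, and the LDA step assigns a sample to the class with the nearest mean in Mahalanobis distance using a pooled covariance matrix, which corresponds to a linear discriminant with equal class priors.

```python
import numpy as np

def pca_fit(X_train, n_pc):
    """Mean-center the training set and return mean, loadings P and scores T."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    P = Vt[:n_pc]                    # loadings, shape (n_pc, n_variables)
    T = (X_train - mean) @ P.T       # scores, shape (n_samples, n_pc)
    return mean, P, T

def lda_fit(T, y):
    """Class means and inverse pooled covariance of the PCA scores."""
    classes = np.unique(y)
    means = np.array([T[y == c].mean(axis=0) for c in classes])
    resid = np.vstack([T[y == c] - m for c, m in zip(classes, means)])
    cov = resid.T @ resid / (len(y) - len(classes))
    return classes, means, np.linalg.pinv(cov)

def lda_predict(t, classes, means, cov_inv):
    """Assign t to the class with the smallest Mahalanobis distance (equal priors assumed)."""
    d = np.array([(t - m) @ cov_inv @ (t - m) for m in means])
    return classes[np.argmin(d)]

def loocv_pca_lda(X, y, n_pc=3):
    """Leave-one-out cross validation; returns the contingency table (true x predicted)."""
    classes = np.unique(y)
    table = np.zeros((len(classes), len(classes)), dtype=int)
    for i in range(len(y)):
        train = np.arange(len(y)) != i                      # step 1: hold out sample i
        mean, P, T = pca_fit(X[train], n_pc)                # step 2: PCA on training set
        t_test = (X[i] - mean) @ P.T @ np.linalg.inv(P @ P.T)  # step 3: project test sample
        cl, means, cov_inv = lda_fit(T, y[train])           # step 4: LDA on training scores
        pred = lda_predict(t_test, cl, means, cov_inv)
        table[np.searchsorted(classes, y[i]), np.searchsorted(classes, pred)] += 1
    return table
```

The diagonal of the returned table counts correctly classified samples, so the classification rate is the trace divided by the number of samples. Increasing `n_pc` reproduces the kind of scan over the number of PCs discussed later in the text.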

Methods for identifying sets of significant variables

Fisher weight

Fisher weight46 is defined, when there are G classes, by \(f_{j} = \frac{\sum\nolimits_{g = 1}^{G} I_{g} \left( \overline{x}_{jg} - \overline{x}_{j} \right)^{2}}{S_{pool,j}^{2} \sum\nolimits_{g = 1}^{G} \left( I_{g} - 1 \right)}\), where \(\overline{x}_{jg}\) is the mean of variable (m/z) j in class g, \(\overline{x}_{j}\) is the mean of variable j over all classes, \(I_{g}\) is the number of samples in class g, and \(S_{pool,j}^{2}\) is the pooled variance of variable j. The calculation is based on the ratio of between-class variance to within-class variance. Its main advantage over the t-statistic is that it can also be used for data with > 2 classes. When it is simplified for a two-class model (G = 2), the rank of the variables is the same as with the t-statistic, although the sign is lost. The m/z values with the highest Fisher weights (rank number 1 being the highest) are considered potential candidates for geographical markers.
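As a rough illustration, the Fisher weight of every variable can be computed in a few lines of Python with NumPy. The function name and the layout of `X` (samples in rows, m/z variables in columns) are assumptions made for this sketch, and the pooled variance is taken as the within-class sum of squares divided by the pooled degrees of freedom.

```python
import numpy as np

def fisher_weight(X, y):
    """Fisher weight per variable; X is (samples x variables), y holds class labels."""
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])     # sum_g I_g (mean_jg - mean_j)^2
    within_ss = np.zeros(X.shape[1])   # within-class sum of squares
    dof = 0                            # sum_g (I_g - 1)
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        within_ss += ((Xc - class_mean) ** 2).sum(axis=0)
        dof += Xc.shape[0] - 1
    s2_pool = within_ss / dof          # pooled variance of each variable
    return between / (s2_pool * dof)

# Toy example: variable 0 separates the two classes, variable 1 does not.
X = np.array([[1.0, 5.0], [2.0, 6.0], [10.0, 5.0], [11.0, 6.0]])
y = np.array([0, 0, 1, 1])
print(fisher_weight(X, y))   # variable 0 scores 81, variable 1 scores 0
```

A variable with zero within-class variance would cause a division by zero, so real data may need a small guard term.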

Iterative reformulation of training sets

This technique47 was performed by randomly selecting 70% of the entire dataset as a training set, repeated 100 times. Each training set may thus result in different significant m/z values being selected. If the same m/z value is chosen as a geographical marker in many of the splits, it is likely a robust and authentic marker. In each iteration, the optimized number of top-ranked data points by Fisher weight was recorded as the most significant variables, giving 100 lists of significant m/z data points. Any m/z values present in all iterations were identified as impactful geographical markers.
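The resampling scheme can be sketched as follows in Python with NumPy. The function names are hypothetical, the compact `fisher_weight` helper uses the ratio of between-class to within-class sums of squares, and keeping the top 10% of variables in each split follows the value quoted in the Results section; the way the number of retained variables was optimized in the study is not reproduced here.

```python
import numpy as np

def fisher_weight(X, y):
    """Ratio of between-class to within-class sum of squares per variable."""
    grand = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - grand) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / within

def selection_counts(X, y, n_iter=100, train_frac=0.7, top_frac=0.10, seed=0):
    """Count how often each variable ranks in the top fraction over random splits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_train = int(round(train_frac * n))
    n_top = max(1, int(round(top_frac * p)))
    counts = np.zeros(p, dtype=int)
    for _ in range(n_iter):
        idx = rng.choice(n, size=n_train, replace=False)   # random 70% training set
        f = fisher_weight(X[idx], y[idx])
        counts[np.argsort(f)[-n_top:]] += 1                # variables with highest weights
    return counts
```

Variables whose count equals `n_iter` correspond to the m/z values that appear in every iteration.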

Results and discussion

Features of MS data obtained from direct ionization of black peppers

An example spectrum (Fig. 2) highlights some interesting features of the MS data obtained. First, it is surprising, at first glance, that no peak corresponding to piperine (m/z 286 for [M + H]+) or its derivatives can be seen. Piperine is responsible for the pungency of black pepper and makes up to 7% of its dried weight48. Its absence is likely due to its higher melting and boiling points, as well as its higher polarity, compared with volatile terpenes, another group of compounds commonly found in black pepper. Moreover, in a study examining black and white pepper with a similar ionization mechanism, proton transfer reaction (PTR)-MS, piperine was not detected either49. By contrast, an aqueous extract of black pepper provided a clear piperine signal via paper spray MS (PS-MS) (Fig. S2). Since the main goal of this work is to discover a key set of data that can differentiate the origins of black peppers, the absence of a single, albeit major, compound is deemed inconsequential. Next, volatile terpenes were clearly detected with this experimental setup. For example, a peak corresponding to monoterpenes (C10H16, m/z 137 for [M + H]+) can be seen; examples include myrcene, sabinene, and terpinene. Apart from piperine, terpenes are a group of compounds that directly contribute to the flavor profiles of black peppers50,51,52,53. Another peak, at m/z 205, is attributed to sesquiterpenes (C15H24). To improve confidence in some of the suggested peak assignments, we performed exemplary MS/MS experiments focusing on the collision-induced dissociation of m/z 151 and m/z 205. Fig. S3 shows no significant difference between the CD-MS/MS spectrum from a black peppercorn seed and the spectrum from a direct-infusion MS/MS experiment of (R)-carvone for the parent ion m/z 151, suggesting that the majority of the signal at m/z 151 is likely to be (R)-carvone. Likewise, the CD-MS/MS spectrum of a black pepper seed agreed with a library spectrum54 for m/z 205 (Fig. S4), which we attribute to δ-elemene. Furthermore, these suggested compounds have previously been identified as major components of pepper seed extract50.
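The nominal m/z values quoted above can be checked with simple arithmetic. The short Python sketch below computes the monoisotopic [M + H]+ values from standard monoisotopic atomic masses and the proton mass; the helper name `mh_plus` and the dictionary layout are assumptions made for illustration, and the molecular formulas of carvone (C10H14O) and piperine (C17H19NO3) are taken from general chemical knowledge rather than from the text.

```python
# Standard monoisotopic masses (u) and the proton mass
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727646

def mh_plus(formula):
    """Monoisotopic m/z of the singly protonated ion [M + H]+ for an element-count dict."""
    return sum(MONO[element] * count for element, count in formula.items()) + PROTON

compounds = {
    "monoterpene C10H16": {"C": 10, "H": 16},
    "carvone C10H14O": {"C": 10, "H": 14, "O": 1},
    "sesquiterpene C15H24": {"C": 15, "H": 24},
    "piperine C17H19NO3": {"C": 17, "H": 19, "N": 1, "O": 3},
}

for name, formula in compounds.items():
    print(f"{name}: [M + H]+ = {mh_plus(formula):.2f}")
```

The results (approximately 137.13, 151.11, 205.20 and 286.14) agree with the nominal m/z values of 137, 151, 205 and 286 used in the discussion.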

Figure 2. A representative MS spectrum from the sample BP(Thailand) – 06, along with some examples of putative chemical species that were deduced from the literature50,51,52,53.

Interestingly, the amounts of these compounds in each source of black pepper, as reflected in the ion intensities, can vary. For instance, while the representative sample in Fig. 2 had m/z 151, likely carvone, as the most intense peak, some other samples had other peaks as the major component (Figs. S5 and S6 for the complete set of MS spectra, which were plotted from the complete set of numerical data in the supplementary information). These differences, when treated with systematic statistical analysis, may be sufficient for the differentiation of black peppers from different origins.

Data analysis of obtained MS data

In this study, multivariate data analysis was performed on the mass spectra in order to discriminate peppers based on three categorizations: (I) the types of peppers (black and white peppers including all origins), (II) the origins of black peppers (Thailand, China, India and UK), and (III) the origins of white peppers (Thailand and China). First, one-sample t-tests were used to assess the variability of the raw data. The m/z data points from seeds of the same source did not show any statistically significant difference at the 5% significance level, i.e., the null hypothesis was not rejected, indicating that the variability of the analysis was generally low. Then, all m/z data points were used for PCA on the types of peppers, the origins of black peppers, and the origins of white peppers using the first three PCs (Fig. 3), which captured > 70% of the total variance in all cases. Overall, the PCA score projections exhibit poor separation between classes in most studies. In addition, the contingency table of the classification using PCA-LDA is shown in Table S1. Classification rates of 76.92%, 74.46% and 69.35% were observed in the discrimination of the types of pepper, the origins of black peppers and the origins of white peppers, respectively. This result indicates that PCA based on the largest principal components may not always be the most effective approach. In some cases, especially in biological systems47,55, the PCs with high variance may instead correlate with the background noise of biological samples, and the best discriminator might be found in a later PC with small variance. Hence, to explore the possibility of obtaining useful information from small-intensity data points, other PCs were also analyzed.

Figure 3. PC score plots of the first 3 principal components of the pre-processed data using all m/z data points to visualize the cluster relationship of (A) types of peppers, (B) origins of black peppers and (C) origins of white peppers.

As shown in Fig. 4 (A,C,E), the classification accuracy using the first three PCs is poor for all three studies. With the inclusion of later PCs, 18, 14, and 11 PCs were found to be optimal for the classification of the types of peppers, the origins of black peppers, and the origins of white peppers, respectively. In addition, the prediction strength of each PC was calculated to visualize the discriminating power of the PCs in each study (Fig. S7); the higher the prediction strength, the better the discrimination power. Some later PCs showed higher prediction strength: PC 3-4-10, PC 3-4-7, and PC 1-7-8 are the sets of PCs with the highest prediction strength for the studies of the types of peppers, the origins of black peppers, and the origins of white peppers, respectively. The score plots using the PCs with the highest prediction strength are shown in Fig. 4 (B,D,F).

Figure 4. The percentage of correctly classified samples as a function of the number of PCs used to build a model (PC1-20), along with the respective PC score plots using the best discriminating PCs, for the study of (A, B) the types of peppers, (C, D) the origins of black peppers, and (E, F) the origins of white peppers. Panels A, C and E show the classification percentages; panels B, D and F show the corresponding PC score plots.

To provide numerical data for comparison, the classification rates for each PCA-LDA model built from the first three PCs and from the optimal number of PCs are illustrated in Fig. 5 (see Table S1 for contingency tables). The overall classification accuracies using the first three PCs are only 76.92%, 44.68% and 72.58% for the studies of the type of peppers, the origins of black peppers, and the origins of white peppers, respectively. In contrast, the models built from the optimal number of PCs achieved markedly higher classification rates of 98.72%, 98.94% and 100%. Although the optimal numbers of PCs for the LDA classifications were > 10 in all cases, the classification rates increased dramatically only in the range of PC1-5. These results suggest that the first five PCs are the most relevant, but that later PCs are still meaningful in differentiating between the classes. This is in good agreement with the PC score plots in Fig. 4, which use later PCs with high prediction strength. In addition, we evaluated the differentiation power of PS-MS data from a sample set of black peppers using their aqueous extracts. The result (Fig. S8) showed a similar trend: multiple PCs were required to achieve 100% prediction accuracy, and the best PCs were not only among the first three. This similarity suggests that the nature of these samples, i.e., peppers, is amenable to discrimination by various MS-derived data. Thus, CD-MS, which requires no solvent and no extra sample preparation step, is an attractive and convenient method for rapid analysis to uncover the geographical indication of peppers.

Figure 5. Classification accuracies of (A) the discrimination between black and white peppers, (B) the geographical discrimination of black peppers, and (C) the geographical discrimination of white peppers.

Significant m/z data points as markers

A conventional approach for variable selection is to apply a selection method (Fisher weight in this case) to the entire dataset and to judge the significance of an m/z value as a marker from the magnitude of its test value. However, selecting a set of markers from an entire dataset may lead to overfitting. An alternative approach is to identify potential m/z data points from several split training sets. This is called “iterative reformulation of training set models” and was first introduced elsewhere47. In this study, we employed this approach to confirm the relevance of the obtained m/z data points. That is, a randomized partial set (70% of the entire dataset) was used to discover relevant m/z data points, whose appearances were then counted. After 100 iterations, each m/z value was evaluated for the number of times it appeared in the lists of relevant m/z data points from the split training sets. As different m/z values may be selected in each iteration, the data points that are selected most frequently are likely the most significant markers.

Figure 6 illustrates the number of times (out of 100 iterations) that each m/z value was selected in the subset of the top 10% of m/z data points with the highest Fisher weight. Any m/z values selected in all 100 iterations were identified as impactful markers. The study on the types of peppers provided a set of 3 m/z data points (m/z = 205.2, 206.2 and 295.2) that appeared in all 100 iterations (Fig. 6A), while the study on the origins of white peppers gave 6 m/z data points (m/z = 163.1, 205.2, 206.2, 222.2, 224.2, and 238.3) that appeared 100 times (Fig. 6C). In the study on the origins of black peppers, there were 7 m/z values (121.1, 205.2, 206.2, 245.2, 273.3, 291.2, 292.1, 292.2) that appeared in all 100 iterations (Fig. 6B), which indicated a more challenging case due to higher numbers of variables. In all cases, m/z 205.2 appeared as a significant marker; it is attributed to sesquiterpenes, a major class of compounds found in peppercorn seeds. It should be noted, however, that focusing on individual data points may be misleading because of the possibility of strategic adulteration or modification. This is reflected in the results, where different studies required different numbers of variables for effective cluster separation. Therefore, the non-targeted approach56, as practiced in this study, is deemed more flexible and versatile for broader applications, because it is not subject to strategic adulteration targeting a small set of known markers.

Figure 6. Bar charts showing the number of times each variable (m/z data points) was found in 100 iterations in the studies about (A) the type of peppers, (B) the origins of black peppers, and (C) the origins of white peppers.

Conclusion

In this study, we demonstrated a simple and rapid MS-based method that reveals chemical profiles capable of distinguishing geographical origins. In particular, corona discharge was used to ionize volatile compounds from peppercorns for subsequent MS analysis. With its relatively simple setup, the method allowed rapid data collection, and the data can then be further processed by chemometrics. Such an approach has the potential to be used as part of a portable setup for remote and onsite analysis. After pre-processing, linear discriminant analysis indicated that this experimental setup had high discrimination efficiency (> 98% accuracy) in all studies, involving the type of peppers, the origins of black peppers, and the origins of white peppers. This positive outcome suggests that the method can be applied to many more cases where whole agricultural products are analyzed directly for rapid discrimination of origins without any sample pretreatment or workup.

Data availability

A set of supplementary information is available online, including an overview scheme of the data preparation process, preliminary and comparative PS-MS data, MS/MS data, representative MS spectra for all samples, prediction strength values for all studies, classification rates, and raw numerical data in a spreadsheet.