Rapid geographical indication of peppercorn seeds using corona discharge mass spectrometry

With increasing demands for more rapid and practical analyses, various techniques of ambient ionization mass spectrometry have gained significant interest due to the speed of analysis and abundance of information provided. Herein, an ambient ionization technique that utilizes corona discharge was applied, for the first time, to analyze and categorize whole seeds of black and white peppers from different origins. This setup requires no solvent application nor gas flow, thus resulting in a very simple and rapid analysis that can be applied directly to the sample without any prior workup or preparation. Combined with robust data pre-processing and subsequent chemometric analyses, this analytical method was capable of indicating the geographical origin of each pepper source with up to 98% accuracies in all sub-studies. The simplicity and speed of this approach open up the exciting opportunity for onsite analysis without the need for a highly trained operator. Furthermore, this methodology can be applied to a variety of spices and herbs, whose geographical indication or similar intellectual properties are economically important, hence it is capable of creating tremendous impact in the food and agricultural industries.

Mass spectrometry in combination with hyphenated techniques (e.g., chromatography) is the gold standard for chemical identification where legal enforcement is required 1 . Recently ambient ionization (AI) MS has been used to provide rapid assessment of authenticity in a variety of scenarios such as the adulteration of Whisky 2 , pesticide residue analysis on strawberry surfaces 3 , and the quantification of bisphenol A and its analogs contained within food packaging 4 , to mention only a few. AI approaches seek to reduce or entirely negate sample preparation, allowing analysis to be performed on samples in their native environment 5 . Popular AI sources include the likes of paper spray [6][7][8][9][10] , direct analysis in real time [11][12][13] and desorption electrospray 14,15 . AI sources that utilize corona discharge (CD) have been extensively reported, such as ASAP 16,17 and DAPCI 18,19 . These approaches typically rely on a heated nebulizing/desolvating gas flow in one form or another to vaporize a liquid/solid sample prior to ionization and analysis. CD without any accompanying gas flow is routinely used with ion mobility instruments 20,21 for trace vapor detection and has been incorporated with various AI sources to improve ionization efficiency. For instance, Song et al. applied CD in combination with an inexpensive ultrasonic nebulizer to detect antibiotic drugs in milk 22 , Mullen et al. combined secondary electrospray with CD to improve detection of explosive compounds 23 , and Sekimoto et al. increased DART ionization efficiency by addition of a CD pin between the source and inlet 24 . A number of studies have utilized direct ambient CD ionization for MS studies [25][26][27][28] .
A key aim of this study was to develop a methodology for direct peppercorn seed analysis that is sufficiently robust and does not require any sample preparation nor requires the addition of any consumables, such as solvents or gas. We surmised that CD ionization can be applied to solid phase samples that naturally contain volatile compounds for direct analysis without any prior vaporization. This significantly reduces the complexity of the setup to be simply the application of a CD needle in proximity to the sample. As is shown in this study, with appropriate sample types, the information obtained from this analytical setup can have great impact to their respective fields. In this regard, aromatic spices are a perfect sample for this exploration because of the  29 . Thus, various aspects of quality control of this product are vital to the sustainability of its market [30][31][32][33] . Prominent examples are the development of analytical methods [34][35][36] to detect adulteration by cheaper materials, e.g., pepper husk or papaya seed, and to uncover geographical origins of peppers 37 . The latter study is directly linked to geographical indication (GI), which has become an increasingly important leveraging tool in trade negotiations 38,39 and sales valuation of products. GI protection is worth, on average, double the value compared to non-certified products 40 . In the EU alone, the value of GI products is estimated to be worth over €75 billion 40 . Therefore, the incentive for mis-characterization of product origin is very appealing for illegitimate producers. The drive to enforce GI necessitates the development of chemical analysis methods that are rapid, non-biased and versatile [41][42][43] . Herein, we demonstrate, for the first time, the application of a CD for the direct ionization of peppercorn seeds to obtain a set of mass spectrometry (MS) data that are indicative of their geographical origins. Combined with robust pre-processing and chemometric analyses (principal component analysis (PCA) and linear discriminant analysis (LDA)), this analytical method was able to distinguish the geographical origin of different sources of black peppers, along with some simpler studies including the type of peppers (black vs white), and the origin of white peppers. An iterative reformulation feature selection algorithm was used to ascertain the key m/z's necessary to construct accurate classifier models. Our findings indicate that this rapid, versatile and non-targeted MS protocol has the potential to be used for further GI studies relating to other herbs, spices and foodstuffs in general.

Material and methods
Sample preparation. Black pepper seeds originating from Thailand (6 sources), China (1 source), India (1 source), and the United Kingdom (3 sources) were used in this study; all samples were purchased locally from supermarkets or ordered online from reputable suppliers. White peppers from Thailand (6 sources) and China (1 source) were also purchased in a similar manner. All samples were removed from their respective packaging for analysis. Otherwise, no sample pretreatment or preparation was carried out. All experiments were performed in accordance with relevant guidelines and regulations.

Acquisition of MS data from peppers. A Waters Xevo TQ MS (Waters Corporation, Milford, MA, USA)
was used in this study with the following instrument parameters: cone voltage at 20 V, source temperature at 100 °C, and acquisition mass range of 50-500 m/z. The experimental method is depicted in Fig. 1. Briefly, a peppercorn seed was affixed with a distance of approximately 3 mm from the MS inlet. A stainless steel needle was placed 2-3 mm above the sample with an applied potential of 3 kV, as determined by prior optimization experiments to yield a stable signal. A corona discharge (CD) develops at the apex of the needle due to the large potential and small radius of curvature which subsequently ionizes vapors from the pepper sample. Data was acquired for a duration of 2 min with 1 s per scan. At least 9 seeds from each source were tested, with a blank (no seed) being run in between different sample sources. The complete set of data can be found in the supplementary information.
Data pre-processing for subsequent chemometric analyses. A series of automated data pre-processing steps were performed in MATLAB (Mathworks). (1) Firstly, any m/z values over 300 were discarded due to the lack of any significant ion peaks. (2) Mass spectra were smoothed with the weighted linear regression function within the peak width (10 data points) in order to prevent peak fluctuations in the same m/z channel. Baseline correction, with zeroing of any point with negative values, was performed. (3) The intensity of each m/z  datapoints. An overview scheme of the data preparation with an example mass spectrum after preprocessing is illustrated in Fig. S1.
Visualizing method and classification method. The data sets showed a very high number of variables (267 variables). To visualize and determine the number of significant factors from such high numbers, further data reduction was performed. The data was centered over all samples prior to the data reduction process. In this study, Principal Component Analysis (PCA) was used to reduce the dimensionality in a multivariate data set to those data characteristics that contribute the most to the overall variance, with the first component representing the greatest variance of the data (principal component 1-PC1), the second greatest variance being projected to the second coordinate (PC2), and so on. PCA scores can be used to visualize the clusters of samples, which share common influences. More detailed information on PCA is discussed elsewhere 44,45 . In this study, PCA was employed to visualize the underlying relationships of the data. The clusters of samples were either visualized using PCs with the maximum variances (PC1-PC3), or using the best discriminating PCs.
To obtain the classification performance, Linear Discriminant Analysis (LDA) was used for class prediction. This was done by creating a model boundary (classifier) between classes using linear discriminant function in order to define the directions in which the classes are best separated. However, in the LDA algorithm, it is necessary to calculate the inverse of the variance-covariance matrix. Therefore, if the number of variables is larger than the number of samples, the variance-covariance matrix will be a singular matrix that cannot be inverted. Hence, in this study, the discriminant features of the PCA-LDA were used as inputs to the classification model. The analysis was done by merging the modeling runs of two algorithms based on PCA to reduce the dimensionality of data matrix into 3PCs and sequentially LDA to create the classifier for class predictions of samples. The predictive ability of the developed PCA-LDA model is evaluated by "leave-one-out" cross validation (LOOCV). The mean centering is performed on the corresponding training set samples as appropriate, while the test sets are centered according to parameters obtained from the training set. The combination of validation procedures of PCA-LDA and the LOOCV approach were performed by the following steps: 1. For the data matrix (X), a mass spectrum of a sample (x test ) was removed to be used as a test set, while the remaining mass spectra are formed as a training set (X train ). 2. PCA was performed on the training set to obtain score (T train ) and loading (P train ). In this study, only the first 3 PCs were used to represent the overall variance of the data. 3. The score (t test ) of a test sample (x test ) was calculated by using the pseudo-inverse of loading matrix from the training set (P train ) as t test = x test P T train (P train P T train ) -1 , where T is a transpose operator and -1 is an inverse operator 4. The LDA classifier boundary of each class was built from T train to predict the class of a test sample using t test .
The procedure is repeated until all samples have been assigned as a test sample. The contingency table can be then constructed to express the performance and stability of the developed classifier.
Methods for identifying sets of significant variables. Fisher weight. Fisher weight 46 is defined by pool,j G g=1 I g − 1 when there are G classes, where x j is the mean of variable (m/z) j of all classes, I g is the number of samples in class g, S 2 pool j the pooled standard deviation. The calculation is based on the ratio of within-class variance to between-class variance. It shows a main advantage over t-statistic as it can be further use for the data with > 2 classes. However, when it is simplified for a two-class model (G = 2), the rank of the variables is the same as the t-statistic although there is no sign. The m/z values with the highest values of fisher weight (rank number 1) are considered as potential candidates to be used as geographical markers.
Iterative reformulation of training sets. This technique 47 was performed by randomly selecting 70% of the entire dataset as a training set for several times (100 in this case). Each training may thus result in different significant m/z values being selected. If the same m/z value is chosen many times as a geographical marker for each split in the data, it indicates that such a data point is likely a robust and authentic marker. In each iteration, the optimized number of ranked data points on Fisher weight scores are recorded as the most significant variables. The procedure was repeated 100 times, thus obtaining 100 lists of significant m/z data points. Any m/z values which presented in all iterations were identified as impactful geographical markers.

Results and discussion
Features of MS data obtained from direct ionization of black peppers. An example spectrum ( Fig. 2)  www.nature.com/scientificreports/ is responsible for the pungency property of black pepper, and makes up to 7% of its dried weight 48 . This disappearance is likely due to the higher melting point and boiling point, as well as the higher polarity, compared to volatile terpenes, which is another group of compounds commonly found in black pepper. Moreover, in a study examining black pepper and white pepper utilizing a similar ionization mechanism, proton transfer reaction (PTR)-MS, piperine was not detected either 49 . In any case, an aqueous extract of black pepper was shown to provide a clear signal of piperine via PS-MS (Fig. S2). Since the main goal of this work is to discover a key set of data that can differentiate the origins of black peppers, the absence of a single, albeit major, compound is deemed to be inconsequential. Next, volatile terpenes were indeed clearly found in this experimental setup. For example, a peak corresponding to monoterpene (C 10 H 16 , m/z 137 for [M + H] + ) can be seen. Apart from piperine, these are a group of terpenes that directly contribute to the flavor profiles of black peppers [50][51][52][53] . Examples include myrcene, sabinene, and terpinene. Also, another peak at m/z 205 is visible, which is attributed to sesquiterpene (C 15 H 24 ). To improve the confidence of some of the suggested peak assignments, we performed some exemplary MS/MS experiments focusing on the collision-induced dissociation of m/z 151 and m/z 205. In Fig. S3, it can be seen that there is no significant difference between the CD-MS/MS spectrum from black peppercorn seed and the spectrum from a direct-infusion MS/MS experiment of (R)-carvone, for the parent ion m/z 151. Hence suggesting that the majority of the signal at m/z 151 is likely to be (R)-carvone. Likewise, agreement was found from a comparison between a CD-MS/MS spectrum of a black pepper seed and a library spectrum 54 for m/z 205 (Fig. S4), which we attribute to δ-elemene. Furthermore, these suggested assignments have previously been identified as major components of pepper seed extract 50 . Interestingly, the amounts of these compounds in each source of black pepper, as reflected in the ion intensities, can vary. For instance, while the representative sample in Fig. 2 had m/z 151, likely carvone, as the most intense peak, some other samples had other peaks as the major component (Figs. S5 and S6 for the complete set of MS spectra, which were plotted from the complete set of numerical data in the supplementary information). These differences, when treated with systematic statistical analysis, may be sufficient for the differentiation of black peppers from different origins.

Data analysis of obtained MS data.
In this study, multivariate data analysis was performed on the mass spectra in order to discriminate peppers based on three categorizations: I) the types of peppers (black and white peppers including all origins), II) the origins of black peppers (Thailand, China, India and UK), and III) the origins of white peppers (Thailand and China). First, one-sample t-tests were used to analyze the raw data for their variabilities in the analysis. It was found that all m/z data points from seeds of the same source do not show any statistical difference at the 5% significance level, i.e., accepting the null hypothesis. Thus, this indicates that variability of analysis was generally low. Then, all m/z data points were used for PCA analyses on the types of peppers, the origins of black peppers, and the origins of white peppers using the first three PCs (Fig. 3) with the total variance > 70% in all cases. Overall, the PCA score projections in this manner exhibit poor separation between classes in most studies. In addition, the contingency table of the classification using PCA-LDA is shown in Table S1. Classification rates of 76.92%, 74.46% and 69.35% were observed in the discrimination of the types of pepper, the origins of black peppers and the origins of white peppers, respectively. This result indicates that PCA based on the largest principal components may not always be the most effective method. In some cases, especially in biological systems 47,55 , the PCs with high variance may instead correlate with the background noise of biological samples and the best discriminator might present at the latter PC with small variance. Hence, to explore the possibility of obtaining useful information in small-intensity data points, some analysis on other PCs was conducted.
As shown in Fig. 4 (A,C,E), it can be seen that the classification accuracy using the first three PCs is poor for all three studies. With the inclusion of latter PCs, it was revealed that 18, 14, and 11 PCs were optimal for the www.nature.com/scientificreports/ classification of the types of peppers, the origins of black peppers, and the origins of white peppers, respectively. In addition, the prediction strength of each PC was calculated to visualize the discrimination of the PCs in each study (shown in Fig. S7). The higher the prediction strength, the better the discrimination power. It can be seen that some latter PCs showed higher prediction strength. That is, PC 3-4-10, PC 3-4-7, and PC 1-7-8, are the sets of PCs with the highest prediction strength for the studies of the types of peppers, the origins of black peppers, and the origins of white peppers, respectively. The score plots with the PCs with the highest prediction strength are shown in Fig. 4 (B,D,F). To provide some numerical data for comparison, the classification rates for each PCA-LDA model built from both the first three PCs and the optimal number of PCs are illustrated in Fig. 5 (Table S1 for contingency tables). The overall classification accuracies using the first three PCs are merely 76.92%, 44.68% and 72.58% for the studies of the type of peppers, the origins of black peppers, and the origins of white peppers, respectively. On the other hand, the classification rates for the cases that were built from the optimal number of PCs appear to be significantly higher, with the scores of 98.72%, 98.94% and 100%. Although the most optimal numbers of PCs for the LDA classifications were > 10 PCs in all cases, the classification rates increased dramatically only in the range of PC1-5. These results suggested that the first five PCs are the most relevant, but the latter PCs are still meaningful in differentiating between the classes. This is in good agreement with the PC score plots in Fig. 4 using the latter PCs with high prediction strength. In addition, we also evaluated the differentiation power of PS-MS data from a sample set of black peppers using their aqueous extracts. The result (Fig. S8) showed a similar trend in that multiple PCs were required to achieve 100% prediction accuracy, and that the best PCs were not from only the first three PCs. This similarity suggested that the nature of these samples, i.e., peppers, is amenable to discrimination by various MS-derived data. Thus, CD-MS, which requires no solvent and no extra sample preparation step, is an attractive and convenient method for rapid analysis to uncover geographical indication of peppers. www.nature.com/scientificreports/ Significant m/z data points as markers. A conventional approach for variable selection is to perform a selection method (Fisher weight in this case) on the entire dataset, and determine the significance of an m/z value as a marker from its magnitude of test value. However, selecting a set of markers from an entire dataset may encounter an overfitting issue. An alternative approach is to identify potential m/z data points from several spitted training sets. This is called as "iterative reformulation of training set models", which was first introduced elsewhere 47 . In this study, we also employed this approach to confirm the relevance of the obtained m/z data points. That is, a randomized partial set (70% of the entire dataset) was used for the discovery of relevant m/z data points, whose appearances were then counted. After 100 iterations, all m/z values were then evaluated for the number of times they appeared in the list of relevant m/z data points of each split training set. As different m/z values may be differentially selected in each iteration, those data points that are most frequently selected are likely the most significant markers. Figure 6 illustrates the number of times (out of 100 iterations) that each m/z value was selected in the subset of the top 10% m/z data points with the highest magnitude of Fisher weight value. Any m/z values that are present for 100 times were identified as impactful markers. The study on the types of peppers provides a set of 3 m/z data points (m/z = 205.2, 206.2 and 295.2) that appeared in all 100 iterations (Fig. 6A) (Fig. 6B), which indicated a more challenging case due to higher numbers of variables. It can be seen that in all cases, m/z 205.2 appeared as a significant marker, which is attributed to sesquiterpenes, a major class of compounds found in peppercorn seeds. It should be noted, however, that focusing on individual data points may be misleading due to the possibility for strategic adulteration/modification. This is actually reflected in the results, where different studies required different numbers of variables for effective cluster separations. Therefore, the non-targeted approach 56 , as practiced in this study, is deemed to be more flexible and versatile for broader applications. This is because it is not subject to strategic adulteration from a selected small set of known markers.

Conclusion
In this study, a simple and rapid MS-based method to reveal chemical profiles was demonstrated that is capable of distinguishing geographical origins. In particular, corona discharge was used to ionize volatile compounds from peppercorns for subsequent MS analysis. With its relatively simple setup, the method allowed for rapid data collection, which can then be further processed by chemometrics. Such an approach has potential to be used as part of a portable setup for remote and onsite analysis. After pre-processing, linear discriminant analysis indicated that this experimental setup had high discrimination efficiency (> 98% accuracies) in all studies involving the type of peppers, the origins of black peppers, and the origins of white peppers. This positive outcome suggests that this method can be applied to many more cases where whole agricultural products are directly analyzed for rapid discrimination of origins and without any sample pretreatment or workup.

Data availability
A set of supplementary information is available online including an overview scheme of data preparation process, Preliminary and comparative PS-MS data, MS/MS data, Representative MS spectra for all samples, Prediction strength values for all studies, Classification rates, and Raw numerical data in a spreadsheet.