Chemometric approach to characterization of the selected grape seed oils based on their fatty acids composition and FTIR spectroscopy

Addressing the issues arising from the production and trade of low-quality foods necessitates developing new quality control methods. Cooking oils, especially those produced from the grape seeds, are an example of food products that often suffer from questionable quality due to various adulterations and low-quality fruits used for their production. Among many methods allowing for fast and efficient food quality control, the combination of experimental and advanced mathematical approaches seems most reliable. In this work a method for grape seed oils compositional characterization based on the infrared (FTIR) spectroscopy and fatty acids profile is reported. Also, the relevant parameters of oils are characterized using a combination of standard techniques such as the Principal Component Analysis, k-Means, and Gaussian Mixture Model (GMM) fitting parameters. Two different approaches to perform unsupervised clustering using GMM were investigated. The first approach relies on the profile of fatty acids, while the second is FT-IR spectroscopy-based. The GMM fitting parameters in both approaches were compared. The results obtained from both approaches are consistent and complementary and provide the tools to address the characterization and clustering issues in grape seed oils.


Results and discussion
Fatty acid compounds of the selected grape seed oils. The fatty acid composition of the oils extracted from eight grape cultivars and 2 years of harvesting is shown in Table 1. Linoleic (70.10-71.55%), oleic (15.33-17.28%), and palmitic (6.84-8.18%) acids were the predominant fatty acids in the oils, consistent with previously reported data 8,20 . The differences between the selected acids across varieties and vintages are given in percentage units.
Chemometric analysis of fatty acid compounds and physical parameters of selected grape seed oils. Correlation analysis. Analysis of correlations between unsaturation and the physical parameters was the first step in characterizing the selected grape seed oils. To this end, the fatty acids were grouped into saturated (SFA), monounsaturated (MUFA), and polyunsaturated (PUFA) fatty acids. The relationship between the degree of unsaturation and the physical parameters was obtained by analyzing correlations between the concentrations of SFA, MUFA, and PUFA and the values of the physical parameters: mass density and apparent viscosity. Pearson's correlation coefficients are presented in Table 2. A high modulus of the coefficient between two considered variables indicates a strong relation between them.
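As an illustration of the correlation step, the following minimal sketch computes Pearson's coefficient for two variables. The numbers are hypothetical placeholders, not values from Table 1 or Table 2, and the function name `pearson_r` is introduced here only for illustration. The sign of R gives the direction of the relation, while |R| gives its strength.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical concentrations (% of total fatty acids), NOT data from this study.
mufa = [15.4, 16.1, 16.8, 17.2, 15.9]
pufa = [71.5, 71.0, 70.4, 70.1, 71.2]

r = pearson_r(mufa, pufa)
print(f"R = {r:.2f}, |R| = {abs(r):.2f}")
```

In this toy example PUFA falls as MUFA rises, so R is negative while |R| is close to 1.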
The SFA concentration in the analyzed oils has significant correlations with MUFA (|R| = 0.77) and PC2 (|R| = 0.55), but small correlation coefficients with the other variables (|R| < 0.5). High correlations were observed between MUFA and PUFA (|R| = 0.71), MUFA and µ (|R| = 0.5), and MUFA and PC2 (|R| = 0.61). The difference between the concentrations of SFA and MUFA in the analyzed oils is lower than the difference between SFA and PUFA. These relations are reflected in the correlation analysis. PC1 and PC3 present lower correlation coefficients than PC2.
Table 1. The relative concentration of fatty acids in grape oils. The concentrations of saturated, monounsaturated, and polyunsaturated fatty acids are presented along with the physical properties (measured at 20 °C), which are the apparent viscosity (µ) and the mass density (ρ).
PCA. In the first approach to the characterization of grape seed oils, PCA was applied to analyze seven common fatty acids, three groups of fatty acids, and two physical parameters to obtain a linear estimate of dimensionality. Based on the Kaiser criterion, three components with eigenvalues higher than 1 were determined. The first three principal components explained 88.66% of the total variance, and the first two components explained above 70% of it. Therefore, to simplify the description, we consider only the first two components in the following.
k-Means for two PCs. The second step involved the selection of initial values for the means in the mixture model. This was done by applying the k-Means method to the normalized principal components, i.e., to the reduced data set. The centers (means) of the clusters were taken as the initial values. The sum of squared errors (SSE) suggests that five clusters are an optimal choice. The clustering result is presented in Fig. 1.
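The PCA and k-Means steps described above can be sketched with Scikit-Learn as follows. This is only an illustrative outline under stated assumptions: the data matrix `X` is random placeholder data with the same shape as described in the text (10 samples, 12 variables), and the variable names are hypothetical. It is not the authors' actual script.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder data (assumption): 10 oil samples x 12 variables
# (7 fatty acids, 3 fatty acid groups, 2 physical parameters).
X = rng.normal(size=(10, 12))

# Standardize the variables so that PCA acts on correlation-like structure.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# Kaiser criterion: count components whose eigenvalue exceeds 1.
# (Scikit-Learn uses an n-1 denominator, so eigenvalues are scaled by n/(n-1).)
n_kaiser = int(np.sum(pca.explained_variance_ > 1.0))
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("components with eigenvalue > 1:", n_kaiser)
print("cumulative explained variance:", np.round(cum_var[:3], 3))

# Keep the first two PCs and normalize them before clustering.
pcs = StandardScaler().fit_transform(pca.transform(X_std)[:, :2])

# Sum of squared errors (inertia) for a range of cluster counts (elbow rule).
sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pcs)
    sse[k] = km.inertia_
print({k: round(v, 3) for k, v in sse.items()})
```

The elbow is read off the `sse` values by eye or by a curvature heuristic; with random placeholder data no particular number of clusters is expected.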
GMM for clustering. Next, the values of the GMM parameters were calculated based on the number of clusters obtained with k-Means clustering. With the number of Gaussian components fixed at five, the Bayesian Information Criterion (BIC) was used to estimate the optimal Gaussian model (Fig. 2), resulting in the 'diag' model (diagonal covariance matrix) as optimal for five components. The values of the parameters from the fit are presented in Table 3, and the split into clusters is presented in Fig. 3.
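A minimal sketch of this model selection step is given below, assuming Scikit-Learn's `GaussianMixture` and synthetic two-dimensional scores in place of the real normalized PCs. The number of points, the cluster centers, and the variable names are assumptions made only so that the example runs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic stand-in (assumption) for normalized 2-PC scores:
# five groups of eight points each.
centers = np.array([[-2, -2], [-2, 2], [0, 0], [2, -2], [2, 2]], dtype=float)
pcs = np.vstack([c + 0.3 * rng.normal(size=(8, 2)) for c in centers])

# Number of components fixed beforehand by the k-Means step.
n_components = 5
km = KMeans(n_clusters=n_components, n_init=10, random_state=0).fit(pcs)

# Fit one GMM per covariance model, initialized at the k-Means centers,
# and record the BIC of each fit.
bic = {}
models = {}
for cov_type in ("full", "tied", "diag", "spherical"):
    gmm = GaussianMixture(
        n_components=n_components,
        covariance_type=cov_type,
        means_init=km.cluster_centers_,
        random_state=0,
    ).fit(pcs)
    bic[cov_type] = gmm.bic(pcs)
    models[cov_type] = gmm

# The model with the lowest BIC is selected.
best = min(bic, key=bic.get)
labels = models[best].predict(pcs)
print("BIC per model:", {k: round(v, 1) for k, v in bic.items()})
print("selected covariance model:", best)
```

Which covariance model wins depends on the data; the synthetic points here say nothing about the 'diag' result reported for the oils.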
FT-IR spectroscopy analysis. The ATR-FTIR spectra of the oil samples obtained from grape seeds of the selected cultivars harvested in the respective experimental years are presented in Fig. 4. Table 4 presents all the characteristic bands present in the oil samples selected for the study (from the first and second measurement years) from the aforementioned cultivars, together with the assignment of functional group vibrations to the corresponding bands (with a detailed literature review). It is worth noting that all the infrared spectra (ATR-FTIR) of the selected oil samples, in both the first and the second year of the experiment, revealed highly intensive and distinct bands that could be correlated with vibrations of specific functional groups originating from ingredients typically present in food products. The vast majority of edible plant fats, as potential oily materials, are composed primarily of various fractions of triglycerides, differentiated mainly by the degree of unsaturation and the length of their respective hydrocarbon chains 21,22 . In many publications, the authors were able to match the particular bands present in the spectra of both animal and plant oils 21-31 to specific vibrations of molecules or groups thereof. However, the majority of the available literature pertains to FTIR analyses of specific plant (e.g., rape) and animal oils, while only a few such studies have been carried out on the types of samples discussed in this work. Furthermore, a precise assignment of bands to a specific functional group is often problematic. Table 4 presents a detailed analysis of the characteristic band frequencies, with the most important widening observed in the oil spectra, and the correlations with their respective functional groups (including a review of relevant literature data 21,22,28-31 ).
Also, a subscript was used to account for the intensity of bands of the typical spectra in the infrared region. It is noteworthy that identifying stretching vibrations is significantly easier in this type of biological sample, especially when compared to deformation vibrations, which often overlap. In the general characteristics of the selected oil samples spectra, vibrations of the methylene group located within the spectral range from 1350 to 1165/cm were observed 21,22 . In the case of our samples, these bands represented the stretching vibrations originating from the -C-H group bound to the -CH3 group (usually approx. 1350-60/cm, in our samples approx. 1348/cm) as well as deformation vibrations of the same group (present at approx. 1160/cm, in our case 1157/cm). It is noteworthy that the stretching vibrations of the (C-O) ester bond composed of two combined asymmetric vibrations are, in this case, vibrations of the C-C(=O)-O and O-C-C groups 31 . In the former case, the intensity of vibration is significantly higher 30 . The bands are present in the region from 1300 (as C-C(=O)-O, in our case approx. 1271/cm, as enhancement of the band with the maximum at approx. 1238/cm) and at approx. 1000/cm (in our case approx. 1027/cm for this group). In turn, the bands associated with saturated esters such as C-C(=O)-O are found between 1240 and 1160/cm (in the case of the grapeseed oil samples selected for the study at approx. 1238/cm), while in the case of unsaturated esters, the vibrations usually emerge at lower frequencies 21 . At the same time, however, the O-C-O band often associated with primary alcohols is observed in the region from 1090 to 1020/cm (for the functional groups analyzed in our study, it was at approx. 1027/cm). In the case of secondary alcohols, the band usually emerges with the maximum at approx. 1100/cm (in our study approx. 1099/cm).
Both types of esters described above are present in triglyceride molecules. However, authors often associate the band mentioned above (at approx. 1238/cm) exclusively with the out-of-plane bending vibrations of the methylene group 32 . The subsequent two bands presented in Table 4 (and in Fig. 4) have the maxima at approx. 1421 and 1315/cm, respectively (band widening, see Fig. 4, both for samples from the first and second measurement year). The first of said groups of vibrations (with the maximum at approx. 1421/cm) may originate from the vibrations of methyl groups in the aliphatic chains of the selected oil samples 21,32 . The second group of bands (i.e., the band widening) with the maximum at approx. 1315/cm (in all analyzed samples) was observed simultaneously with weak bands with maxima at approx. 965 and 905/cm. The 905/cm band present in all oil samples is associated with the stretching vibrations of cis-substituted olefinic groups 21 and may also be associated with vibrations of the vinyl group. The selected samples of grapeseed oil obtained in the two experimental years produced largely similar infrared spectra, but it should be noted that depending on the cultivar, certain differences were nonetheless observed that seem to be relatively characteristic and easily identifiable. Firstly, the studies revealed noticeably significant differences in terms of the respective bands' intensity (not represented as the band levels were equalized at the peak related to the vibrations of the carbonyl group C=O to facilitate easier interpretation of the results), which seems to be related to the differences between the respective cultivars.
Another very characteristic region of vibrations contained bands with the maximum at approx. 1745/cm characteristic of stretching vibrations of the C=O carbonyl group 21 in esters. Apart from the band characteristic for vibrations of the carbonyl group, on the lower wavenumber side there was also an enhancement with the maximum at approx. 1709/cm (distinctly less intensive in samples from, e.g., the Pinot Gris 2015 cultivar), which also corresponded to vibrations of the carbonyl group but occurred in the acid groups of the oil samples selected for the study 21,23,30 . The next band, with the maximum at 1652/cm, corresponded to the stretching vibrations of the -C=C- group (from the cis-transformation) 21,28 . A characteristic region also contains vibrations with the maximum at 1462/cm originating from the deformation vibrations of the -C-H groups in -CH2 and -CH3 (bending vibrations). One should also mention vibrations in the region from 900 to 650/cm, which represent characteristic deformation vibrations associated with the -HC=CH- groups (cis-conformation, out of plane) as well as the rocking vibrations of said groups (-(CH2)n- and -HC=CH- (cis-)) 21,28 .
As we proceed to vibrations in higher wavenumber regions, one should also mention the very significant stretching vibrations =C-H (trans-transformation) with the maximum at approx. 3066/cm (Table 4, very low intensity) originating from vibrations of the triglyceride fraction 21,33 (in Fig. 4 with very low intensity, primarily in the Zweigeltrebe 2015 cultivar). In turn, the stretching vibrations of =C-H in the cis-configuration were observed as very characteristic and intensive vibrations with the maximum at approx. 3011/cm (Fig. 4 and Table 4). The vibrations with the maxima at approx. 2934 and 2863/cm originate from the stretching -C-H vibrations in the -CH3 and -CH2 groups belonging to triglyceride aliphatic groups 21-29 . It should also be noted that the spectra of the analyzed oil samples produced from the seeds of various grape cultivars (and from different years of the experiment) (Fig. 4) reveal noticeable differences in the shape of bands in the region from 1770 to 1660/cm. For most of the analyzed samples, one can clearly observe a slight band enhancement at 1745/cm (corresponding to the vibrations of the C=O, as already discussed above) on the lower wavenumber side, with a clear maximum at approx. 1709/cm 34 , which can also be correlated with forming a hydrogen bond between the C=O⋯H-O- groups (more intensive in the first year for the Pinot Gris 2015 group). Simultaneously with the emergence of the band at 1709/cm, we can observe a distinct change in the intensity of bands at approx. 1150-1070 and 721/cm 28 , which can also be correlated to the stretching vibrations of C-O and C-C groups (described above). The bands, given the possibly decreasing affinity of the associated molecules with the formation of the C=O⋯H-O-H hydrogen bond, may suggest a slight increase in intensity thereof.
The spectral changes seem to correlate very well with the changes in the fatty acid profile presented in Table 1 and discussed in the first part of this section. Apart from the visible differences in the bands with the maxima at approx. 1710-1715, one should also emphasize the possibly most important observation, i.e., the emergence of a very clearly visible band with the maximum at approx. 840/cm (Fig. 4, Table 4) that may originate from the stretching vibrations on bonds existing between various acid fractions in the analyzed samples.
Chemometric analysis of FTIR spectra of selected grape seed oils. PCA. Following the previously adopted procedure, the PCA method was first applied to approximate the dimensionality of the spectral data in a linear manner. Based on the Kaiser criterion, three components with eigenvalues higher than 1 were determined. The first three principal components explained 98.46% of the total variance, and the first two components explained 95.18% of it. Therefore, we proceed further with our analysis using the first two components. According to the loadings, the highest contribution to PC1 comes from the w(-HC=CH-, trans-) out-of-plane deformation vibration in the range 700-1500/cm, while the highest contribution to PC2 comes from the (-C=O vst) vibration in esters located in the region from 1600 to 2000/cm.
GMM for clustering. The GMM parameters were extracted based on the number of clusters obtained with k-Means clustering. The estimation of the optimal Gaussian model for five Gaussian components using BIC is presented in Fig. 6. Following the BIC criterion, the optimal Gaussian model for five components is 'full' (full covariance matrix). The values of the parameters from the fit are presented in Table 5 and the split into clusters is presented in Fig. 7.

Conclusions
This study presents an effort to characterize the selected grape seed oils by unsupervised classification based on their fatty acid composition, physical parameters, and FTIR spectroscopy. To this end, a Gaussian Mixture Model based on Principal Component Analysis was applied. Two different approaches were compared. The first approach was based on the fatty acid profile linked with physical parameters such as the apparent viscosity and mass density, while the second approach was based on the FT-IR spectroscopic data. The results obtained from both approaches are consistent and complementary. The results of the correlation analysis demonstrate that the concentration of MUFA is related to apparent viscosity.
In conclusion, the techniques associated with GMM-based clustering, applied to classify features and characterize the grape oils, may be considered new tools for solving the characterization and clustering problems. Therefore, there is a promising prospect that the methods used in this work will provide a suitable basis for addressing issues arising from the differentiation and unsupervised clustering of grape seed oils.
Table 5. GMM parameters for standardized 2 PCs for data of FTIR spectra.

Materials and methods
Samples preparation. For the purpose of this work, grape seed oils from 10 grape types and 2 years (2015 and 2017) were used. The 2015 samples included the varieties Dornfelder, Pálava, Pinot Gris, Riesling, Tramin, and Zweigeltrebe, and the 2017 samples the varieties Hibernal, Neuburger, Sauvignon, and Zweigeltrebe. The relevant permission was obtained by the authors prior to harvesting the samples from plants cultivated in South Morava, Czech Republic. A prototype of a vibratory separator was used to separate the seeds from the marc. For successful pressing and storage of the seeds, their initial moisture content was lowered to about 10% in a chamber dryer. The temperature in the chamber dryer did not exceed 40 °C. The material was kept in a closed bag at room temperature until screw pressing. All methods were performed in accordance with the relevant EU guidelines/regulations/legislation.
Oil extraction from grape seeds. The oil was pressed on the screw press UNO FM 3F produced by the Farmet Company (Česká Skalice, CZ). This press model is designed for cold pressing of all oily seeds at 80 rpm. The components of the pressing device are: a matrix, a 220 mm screw, a head, a heating mantle, a nozzle holder, and a nozzle 10 mm in diameter. After pressing, the oils were settled by gravity, then filtered, and poured into glass jars (volume 500 ml). The oils were not technologically treated or stabilized in any way.
Physical properties. The density of oils was determined pycnometrically according to ISO 6883:2017 35 .
The rheological evaluation of the grape seed oils was performed according to a previous article 33 . The Anton Paar MCR 102 rheometer (Graz, Austria) with a cone-plate measuring geometry was used. The gap between the cone and the plate was set at a constant value of 0.103 mm. The diameter of the cone was 50 mm, with an angle of 1°. Rheological tests were performed at a temperature of 20 °C. The apparent viscosity was measured at a shear rate of 5/s. Each physical property analysis was performed in triplicate.
Fatty acid profile. For our research we used the second part of the ISO 12966 norm 36 , which specifies methods of preparing the methyl esters of fatty acids. Specifically, the boron trifluoride (BF3) transmethylation procedure was used. The isooctane solution thus obtained was analyzed by GC according to part four of the ISO 12966 norm 37 . The profile of fatty acids was determined using a GC Hewlett Packard 4890D (Palo Alto, CA) with a flame ionization detector (FID). The separation was performed on a DB-23 column (60 m × 0.25 mm with a 0.25 μm film thickness) from Agilent Technologies (Santa Clara, CA). The temperature program was as follows: the initial temperature of 100 °C was held for 3 min, then increased at 10 °C/min to 170 °C, then increased at 4 °C/min to 230 °C and held for 8 min, and then increased at 5 °C/min to 250 °C and held for 15 min. The injector temperature was 270 °C, while the detector temperature was set to 280 °C. The injection volume was 2 µl at a split ratio of 40/1. Helium was used as the carrier gas with a flow rate of 1.0 ml/min. Retention times of FAME standards were used to identify individual fatty acid methyl esters. The resulting chromatograms were processed using the station CSW (version 1.7, Data Apex, Praha, CZ). Results are reported as % fatty acid (area under the peak of a particular fatty acid) of total fatty acids (total area under the peaks of all fatty acids). Each GC analysis was performed in triplicate. Chemicals used in the analysis were from VWR International (Radnor, Pennsylvania, USA) and FAME standards were from Supelco (Sigma-Aldrich, Saint-Louis, Missouri, USA).
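The area-percent reporting and the grouping into SFA, MUFA, and PUFA can be illustrated with a short calculation. The peak areas below are hypothetical values in arbitrary units, chosen only to show the arithmetic; they are not chromatographic results from this study.

```python
# Hypothetical peak areas (arbitrary units), NOT measured values.
peak_areas = {
    "palmitic (C16:0)": 7.5,
    "stearic (C18:0)": 4.0,
    "oleic (C18:1)": 16.5,
    "linoleic (C18:2)": 71.0,
    "linolenic (C18:3)": 1.0,
}

# Group membership by number of double bonds in the chain.
groups = {
    "SFA": ["palmitic (C16:0)", "stearic (C18:0)"],
    "MUFA": ["oleic (C18:1)"],
    "PUFA": ["linoleic (C18:2)", "linolenic (C18:3)"],
}

# Each fatty acid is reported as its peak area relative to the total area.
total_area = sum(peak_areas.values())
percent = {name: 100.0 * area / total_area for name, area in peak_areas.items()}

# Group percentages are sums over the member fatty acids.
group_percent = {g: sum(percent[name] for name in names) for g, names in groups.items()}

for name, value in percent.items():
    print(f"{name}: {value:.2f}%")
print({g: round(v, 2) for g, v in group_percent.items()})
```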

FT-IR measurements.
Measurements of ATR-FTIR background-corrected spectra (25 scans for each sample) were carried out with the use of a HATR Ge trough (45° cut, yielding 10 internal reflections) crystal plate at 20 °C, and were recorded with a 670-IR spectrometer (Agilent, USA). The Ge crystal was cleaned with ultra-pure organic solvents (Sigma-Aldrich). The instrument was continuously purged with argon for 40 min. before and during measurements. Absorption spectra at a resolution of one data point per 1/cm (to the highest measurement accuracy) were obtained in the region between 4000 and 400/cm. Scans were Fourier-transformed and averaged with Grams/AI 8.0 software (Thermo Fisher Scientific, USA).
Chemometric methods. The data were analyzed as follows: correlations among variables were evaluated, and principal component analysis (PCA) and cluster analysis on normalized PCs (k-Means and Gaussian Mixture Models (GMM)) were used to group the oil samples according to their acid and spectroscopy profiles. Multivariate data analysis methods have found increased use during the last decades in all fields of spectroscopy-related research. Such methods represent the state of the art in mathematical analysis. They reduce the dimensionality of the data set and allow the visualization of the underlying structure in experimental data, and of the relationships between data and samples, by identifying the directions in which most of the information is retained. The FTIR spectroscopy characterization of oils from grape seeds was combined with statistical analysis, with PCA and GMM being considered as unsupervised learning methods of classification. The samples were characterized using the measured relative intensity of the absorption bands corresponding to the main classes of chemical compounds identified in the IR spectrum 21,22 . The spectral range was divided into four areas. The first spectral area, between 3050 and 4000/cm, was not taken into account. It is known that this spectral range contains information that is not significant for oil discrimination (water absorbance) and that it can also be a source of noise. The second spectral range, 2605-3050/cm, provided eight values of the absorption band intensity (every 50/cm) for analysis. The next spectral ranges, between 1600 and 2000/cm (every 50/cm) and between 700 and 1500/cm (every 50/cm), gave 9 and 17 values of the absorption band intensity for analysis, respectively 38 . Finally, we represented each IR spectrum as a vector with 34 values.
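The reduction of a spectrum to a 34-value vector can be sketched as below. The text gives the three spectral windows, the 50/cm spacing, and the number of values per window, but not the exact grid points, so the sketch assumes that each grid starts at the lower bound of its window. The synthetic spectrum and all names are hypothetical.

```python
import math

def synthetic_spectrum():
    """Hypothetical absorbance at every integer wavenumber from 400 to 4000 /cm."""
    return {
        w: math.exp(-((w - 1745) / 15.0) ** 2) + 0.6 * math.exp(-((w - 2934) / 20.0) ** 2)
        for w in range(400, 4001)
    }

# (start wavenumber in /cm, number of points) for each window used in the analysis.
# Assumption: each grid starts at the lower bound of the window quoted in the text.
WINDOWS = [(2605, 8), (1600, 9), (700, 17)]
STEP = 50  # spacing in /cm

def spectrum_to_vector(spectrum):
    """Sample the spectrum every 50 /cm inside the three windows."""
    grid = [start + STEP * i for start, count in WINDOWS for i in range(count)]
    return grid, [spectrum[w] for w in grid]

grid, vector = spectrum_to_vector(synthetic_spectrum())
print(len(vector), "values; first grid points:", grid[:3])
```

Because the spectra are stored at one data point per 1/cm, each grid wavenumber can be looked up directly without interpolation.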
In order to discover the underlying classes into which the data set of ten different oils splits, the standard clustering methods from unsupervised learning were used in the following sequence.
1. Determine the optimal Principal Components in PCA that explain above 70% of the total variance.
2. Normalize the PCs using a standard scaler.
3. Use the k-Means algorithm and the elbow rule to determine the optimal number of clusters into which the normalized PCs split.
4. Use the optimal number of clusters from the previous step to fix the number of Gaussian distributions in the Gaussian Mixture Model (GMM) and determine the optimal Gaussian parameters from the following 39,40 :
(a) 'full': each component has its own covariance matrix;
(b) 'tied': one general covariance matrix shared by all components;
(c) 'diag': diagonal covariance matrices for each component;
(d) 'spherical': each component has its own diagonal covariance matrix with equal eigenvalues.
The GMM has the following parametrization
$$p(x) = \sum_{k=1}^{n} \pi_k \, p_k(x \mid \mu_k, \Sigma_k),$$
where $\pi_k$ is the weight of the k-th Gaussian with normalization $\sum_{k=1}^{n} \pi_k = 1$, $\mu_k$ is the mean of the Gaussian (its center), and $\Sigma_k$ is the covariance matrix.
The optimal choice is made by minimizing the Bayesian Information Criterion (BIC) 38,41 for a given number of components and models. Using k-Means clustering to choose the optimal number of components ensures that the BIC selection excludes the one-cluster-per-point model. The trained model and the whole analysis pipeline can also be used for classifying new data. However, since the sample is small, this was not done here. In the analysis, version 0.20.0 of the Scikit-Learn library 39,40 was used.
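To make the GMM parametrization concrete, the sketch below evaluates a mixture density with weights π_k, means μ_k, and diagonal covariances, i.e. the weighted sum of the component Gaussian densities, together with the posterior probability of each component for a point x. It assumes the 'diag' covariance model, and the parameter values are hypothetical, not the fitted values from Table 3 or Table 5.

```python
import math

def diag_gaussian_pdf(x, mean, var):
    """Gaussian density with diagonal covariance: a product of 1-D normal densities."""
    p = 1.0
    for xi, mi, vi in zip(x, mean, var):
        p *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return p

def mixture_pdf(x, weights, means, variances):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * diag_gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def responsibilities(x, weights, means, variances):
    """Posterior probability of each component for the point x."""
    parts = [w * diag_gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(parts)
    return [p / total for p in parts]

# Hypothetical parameters for n = 2 components in two dimensions.
weights = [0.6, 0.4]                   # pi_k, must sum to 1
means = [(0.0, 0.0), (3.0, 3.0)]       # mu_k
variances = [(1.0, 1.0), (0.5, 0.5)]   # diagonals of Sigma_k

x = (0.2, -0.1)
density = mixture_pdf(x, weights, means, variances)
resp = responsibilities(x, weights, means, variances)
print(f"p(x) = {density:.4f}; responsibilities = {[round(r, 4) for r in resp]}")
```

A point is assigned to the cluster with the largest responsibility; here x lies close to the first mean, so the first component dominates.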

Data availability
The samples of each material used in this study, namely the variety of grape seed oils, are available on request from the M.V. and P.B. laboratories.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.