Introduction

Achieving untargeted structural elucidation, isomeric differentiation, and quantification is a paramount goal in molecular characterization and crucial to resolving many scientific and technological problems1,2,3. Surface-enhanced Raman scattering (SERS) spectroscopy emerges as a compelling qualitative and quantitative molecular sensing approach because it rapidly provides rich vibrational information for univocal chemical identification and multiplexing capabilities at ppm to ppb levels4,5,6. Currently, identifying known chemical molecules through manual referencing of existing SERS databases and literature is relatively straightforward. However, manually matching SERS peaks to vibrational modes is tedious and error-prone, especially when handling complicated spectra and large datasets. Moreover, it is a passive approach that lacks the forward-inferring capabilities needed to predict “unidentified molecules” beyond the boundaries of existing databases4,5,6,7,8,9. The challenge therefore lies in the untargeted identification of “unidentified molecules” outside existing SERS databases.

Inspired by taxonomy, based on the use of anatomical and behavioral characteristics to classify new species, we postulate the establishment of a SERS-based chemical taxonomy that can achieve untargeted identification of “unidentified molecules” by combining SERS fingerprints with a hierarchical machine learning (ML) framework10,11,12. To begin, we establish hierarchical levels within the SERS-based chemical taxonomy. Each level is linked to a molecular structural characteristic, such as the types and numbers of functional groups10,13. Leveraging the taxonomic ML model, we can predict individual structural attributes in a stepwise manner. This progressive process can be done by analyzing and pairwise profiling similarities and differences in structure and SERS spectra. Crucially, this approach facilitates unprecedented forward prediction, allowing for the deduction of “unidentified molecules” situated beyond the boundaries of the ML model. Specifically, our proposed process systematically excludes alternative structural possibilities when the SERS spectra traverse the hierarchical levels of the chemical taxonomy, culminating in the precise identification of the exact molecular structure. In contrast, such forward prediction remains elusive through a single classification ML model, which inaccurately classifies the “unidentified molecules” as one of the pre-existing labeled classes in that model.

One biomolecule class that directly benefits from such a SERS chemical taxonomy model is the cerebrosides. In particular, the epimeric glucocerebrosides (GlcCerX:Y) and galactocerebrosides (GalCerX:Y) differ in the spatial orientation of their C4 OH groups (C4 site of isomerism) in their glucosyl/galactosyl moiety and consist of ceramide moieties with varied carbon chain lengths (X) and saturation degrees (Y)2,14. Due to their structural diversity, they possess different bioactivities and play distinct functional and constitutional roles in cellular signaling and metabolism. For instance, an increase in GlcCer24:1 is associated with endometriosis and Gaucher disease, whereas GalCer24:1 is implicated in Fabry and Krabbe diseases2,14,15,16,17,18,19. At present, a rapid, point-of-care platform for their untargeted identification is pertinent because isomeric differentiation and quantification using gold standard gas/liquid chromatography-mass spectrometry remain arduous due to fragmentation pattern issues and inefficient prior derivatization2.

Herein, we establish a SERS-based chemical taxonomy using hierarchical ML with forward-inferring capabilities for untargeted structural elucidation of eleven (11) GlcCers and GalCers, attaining classification accuracy >90% using SERS spectra that the model has not been trained on. Our SERS-ML framework is proficient in single and multiplex quantification, achieving <10% errors at physiologically relevant concentrations (Fig. 1). To achieve this SERS-based chemical taxonomy, we first develop an Ag SERS platform functionalized with 4-mercaptophenylboronic acid (4-MPBA) to specifically capture the epimers at their C4 site of isomerism, yielding unique epimer-MPBA adducts, each with distinct SERS fingerprints (Fig. 1a, b). Corroborating the SERS fingerprints with DFT simulations allows us to identify five key spectral features, each corresponding to a key structural characteristic, i.e., (1) the presence or absence of epimers, (2) monosaccharide vs. cerebroside, (3) saturated vs. unsaturated ceramide, (4) glucosyl vs. galactosyl moieties, and (5) GlcCer or GalCer’s carbon chain lengths. We then perform feature engineering of the SERS spectra to extract individual peak spectral features, such as position, intensity, full width at half maximum, skew, and ratio, as ML inputs for accurate and efficient modeling (Fig. 1c)20. Using spectral features as ML inputs, we build a hierarchical ML framework consisting of four sequential random forest classifiers (RF-C1–4) and two support vector machine regressors (SVM-R 5.1 and 5.2) (Fig. 1d). This framework elucidates the five identified structural characteristics from each model and then aggregates the information gained to reconstruct the complete molecular structure. 
Collectively, the four RF-Cs yield classification accuracies surpassing 90%, whereas the two SVM-Rs accurately predict the carbon chain lengths up to 1 carbon difference, all when using untrained spectra for blind tests, highlighting the generalizability of our framework for structural elucidation of all 11 cerebrosides. Importantly, we prove that although the model is established using spectra of cerebrosides at 10−4 M, it is still effective in predicting “unidentified cerebrosides” at concentrations 1–6 orders of magnitude lower than those in the trained model (i.e., at 10−5–10−10 M) with accuracies ranging from 87 to 100% with <1 carbon chain length discrepancy. This demonstrates the robustness and applicability of our chemical taxonomy framework for practical SERS sensing applications, where the concentration of analytes is frequently unknown. Furthermore, our integrated ML framework allows seamless identification and quantification with <10% errors for all 11 cerebrosides from 10−4 to 10−10 M (Fig. 1e). We further achieve multiplex quantification of biomarkers GlcCer24:1 and GalCer24:1 in binary mixtures at μM range with an absolute error of <4% between predicted and actual concentrations in 30 blind samples. Overall, our forward-predictive SERS-based chemical taxonomy marks a pivotal advance from existing molecular identification methods, realizing the longstanding goal of rapid (<30 min), untargeted structural elucidation and quantification. In this work, we create a localized SERS molecular space, within which our ML framework can both interpolatively and extrapolatively predict 11 gluco- and galactocerebrosides. We envision high-throughput testing of more probes and analyte combinations to further extend the framework and create a global SERS molecular space capable of elucidating other classes of isomeric compounds to meet escalating demands for rapid, point-of-need analytical tools.

Fig. 1: Schematics of a forward-predictive integrated SERS-based chemical taxonomy machine learning framework.

a The SERS probe 4-MPBA first covalently captures the epimeric cerebrosides in unique configurations to form specific cerebroside-MPBA complexes. b The unique cerebroside-MPBA complexes have distinctive SERS spectra. c We perform feature engineering to extract SERS spectral features from each SERS peak to form the machine learning input. d To perform untargeted structural elucidation, the spectral features of an “unidentified” epimeric cerebroside at unknown trace concentration are fed into our chemical taxonomy, which elucidates the chemical structure and nomenclature. e Next, the spectral features are passed into our pretrained quantification models for single and multiplex quantification. Glc-MPBA glucosyl-MPBA, Gal-MPBA galactosyl-MPBA, GlcCer glucocerebroside, GalCer galactocerebroside, RF random forest, SVM support vector machine.

Results and Discussion

SERS platform and biomolecule characterization

To establish a reliable SERS-based chemical taxonomy ML framework for untargeted structural elucidation of unknown epimeric cerebrosides, it is pertinent to generate a SERS database with distinctive and strong SERS signals to facilitate their unambiguous identification and differentiation. We utilize a “capture and confine” strategy to first covalently capture the epimers onto 4-MPBA grafted on Ag nanocubes before physically concentrating the mixture using a hydrophobic perfluorothiol-Ag substrate to amplify SERS signals (analytical enhancement factor = 3.2 × 10⁵) (Suppl. Notes 1–4). We employ this indirect SERS detection due to the small Raman cross sections of the epimers. The MPBA probe effectively forms covalent boronate ester bonds with the epimers’ 1,2-diol group directly at their C4 site of isomerism, which generates distinct SERS fingerprints from each unique epimer-MPBA adduct, facilitating their differentiation (Fig. 2a, b)9,21,22. In our investigation, we study 11 cerebrosides, including five glucocerebrosides (GlcCerX:Y) and six galactocerebrosides (GalCerX:Y) at 10−4 M, where X represents ceramide chain length and Y denotes saturation degree. We further compare them to glucose and galactose, which are identified as primary interferences since they have the exact same hexose moiety and are expected to bind to MPBA in similar orientations/configurations as gluco- and galactocerebrosides. We categorize cerebrosides according to five specific structural characteristics. In category 1 (blank vs. epimers), we first determine the presence or absence of epimers (Fig. 2c). In category 2 (monosaccharide vs. cerebroside), we differentiate monosaccharides from cerebrosides, which have ceramide chains glycosidically linked to the glycosyl moiety. In category 3 (saturated vs. unsaturated), the cerebrosides are distinguished by the ceramide’s saturation degree (Y = 0 or 1). In category 4 (glucosyl vs. galactosyl), the epimeric cerebrosides are differentiated based on the presence of either a glucosyl (C4 equatorial OH) or galactosyl (C4 axial OH) moiety. Finally, in category 5 (GlcCer or GalCer carbon chain length), the precise ceramide carbon chain length is identified. In the experimental epimer-MPBA adduct SERS spectra, we note varied degrees of red/blueshifts and intensity changes at five regions due to differential covalent interactions between MPBA and epimers (Fig. 2d).

Fig. 2: Key structural characteristics of 13 glucosyl(glc) and galactosyl(gal)-analytes enabling their SERS differentiation.

a Molecular structures of 4-mercaptophenylboronic acid (4-MPBA) before and after differential covalent bonding to C4 of glucosyl (glc) and galactosyl (gal) analytes. b The general structure of cerebrosides with various categorized structural characteristics. c Molecular structures of the 13 epimers, d their key structural characteristics, and e their corresponding differential normalized SERS spectra.

Elucidating epimer-specific SERS fingerprints

To confirm the chemical relevance of spectral features driving the differentiation among the 11 cerebrosides, we further scrutinize the high-veracity SERS fingerprints and corroborate them with density functional theory (DFT) simulations to elucidate the molecular origins of various spectral variations. We identify five vital spectral regions from the spectra relating to the epimers’ structural characteristics (Fig. 3a). They include (1) the presence or absence of epimers at 1330 cm−1, (2) monosaccharide vs. cerebroside at 1300 cm−1, (3) saturated vs. unsaturated ceramide at 1595–1603 cm−1, (4) glucosyl vs. galactosyl at 414–419 cm−1, and (5) ceramide carbon chain lengths of GlcCer at 1023 cm−1 or GalCer at 687 cm−1. First, to confirm that the SERS platform has successfully captured the epimers (category 1), we note a sharp decrease of the BOH bending (νCH + βCH + βBOH) at 1330 cm−1 from 0.6 to <0.2 arbitrary units (arb. u.) for all epimers (Fig. 3b), indicating the formation of boronate ester bonds with their 1,2-diol moiety via a condensation reaction4,21,22. Importantly, we can utilize this mode to differentiate between monosaccharides-MPBA (I1330/I1567 < 0.07 arb. u.) and cerebrosides-MPBA (0.07 to 2.2 arb. u.), whereby their difference in peak intensity ratio is statistically significant (p < 0.05, category 2, Fig. 3b). For monosaccharides, the 1330 cm−1 peak intensity is significantly lower because they have (1) more OH groups available per molecule to interact with MPBA and (2) more compact molecular structures with no steric hindrance from bulky side chains, which increases accessibility to MPBA for preferential binding. Next, we note that the differentiation between the saturation vs. unsaturation in the ceramide (category 3) is correlated to the totally symmetric νCC stretching mode at 1584–1601 cm−1 (Fig. 3c). 
Compared to MPBA blanks’ νCC mode at 1603 cm−1, saturated GalCer24 slightly redshifts to 1601 cm−1, whereas the unsaturated GalCer24:1 and GlcCer24:1 experience an increase in intensity and more pronounced redshifts to 1595 and 1591 cm−1, with respect to the non-totally symmetric νCC peak at 1575 cm−1. This trend agrees well with DFT, where we notice an increase in intensity and a redshift of the same SERS band from 1621 cm−1 for the blank to 1616 cm−1 for saturated GalCer24 and 1607 cm−1 for unsaturated GalCer24:1 and GlcCer24:1, with respect to the non-totally symmetric νCC peak at 1525 cm−1. This is attributed to the presence of a distal C=C in unsaturated GalCer24:1 and GlcCer24:1, which is capable of forming π-π interactions with MPBA’s benzene ring and disrupting MPBA’s molecular symmetry, resulting in redshifting and intensity increase in the totally symmetric νCC mode due to Herzberg–Teller contribution23,24. In category 4, GlcCer and GalCer can be differentiated using the peak position of the βCCC + νCS band in the range of 414–419 cm−1. Compared to MPBA-blanks at 414 cm−1, GlcCer-MPBA adducts blueshift slightly to between 414 and 417 cm−1, whereas GalCer-MPBA adducts undergo significant blueshifts to 417–419 cm−1 (Fig. 3d). This is in good accordance with our simulated spectra, where the same βCCC + νCS vibrational mode blueshifts more for GalCer-MPBA than GlcCer-MPBA, from 488 cm−1 in blank-MPBA to 593–595 cm−1 for the former and to 507–509 cm−1 for the latter. This is because the 5-membered ring formed in GalCer-MPBA is less strained due to the axial OH group, with bond angles of 102.8°−106.2° that are overall closer to the ideal 107° for five-membered rings compared to GlcCer-MPBAs’ 102.7°−105.4°, which collectively experience more ring strain (Suppl. Note 5). 
A decrease in ring strain leads to a more substantial induction effect, which in turn leads to an increase in both benzene electron cloud delocalization and polarizability of the C−S bond, resulting in a larger blueshift for GalCer-MPBA24,25. Finally, for category 5, the C–H bending (βCH) mode at 1023 cm−1 shows strong positive correlations and hence is used to elucidate the ceramide carbon chain length for Glc-MPBA adducts (Fig. 3e). Comparing the I1023/I1567 peak ascribed to βCH of MPBA’s benzene against increasing carbon chain length, we observe positive correlations in the experimental and our DFT-simulated SERS spectra. The increase in βCH intensity is likely due to the more significant extent of symmetry breaking of the 4-MPBA from nearly C2v to Cs after binding to the various Glc-MPBA with increasing carbon chain lengths9,21,24. The non-ideal bond angles of Glc-MPBA adducts likely aggravate this symmetry-breaking effect. Similarly, for Gal-MPBA, the I691/I1567 peak indexed to the βCCC + νCS mode positively correlates to increasing carbon chain length, which agrees with DFT spectral changes (Fig. 3f). Notably, this increase in intensity is prominent in Gal-MPBA adducts due to the aforementioned near-ideal bond angles effectuating a strong induction effect, causing a concomitant increase in benzene electron cloud delocalization and C−S bond polarization for Gal-MPBA with longer carbon chain lengths9,21,24,25. The differential peak intensities for various epimers thus reflect their carbon chain length specific differences and underscore the high propensity for quantitative chemical structure-spectra correlations. These five unique structural characteristics captured in the SERS fingerprints are evidence of the 13 epimers’ differential covalent interactions with MPBA, which paves the way for their unambiguous identification and differentiation using ML.

Fig. 3: SERS analysis of 11 cerebrosides and 2 monosaccharides.

a Representative SERS spectrum of 4-MPBA, GlcCer24:1, and GalCer24:1. Key SERS regions corresponding to the structural characteristics 1−5, where υ = stretching, β = bending, γ = wagging, a1 = totally symmetric, b2 = non-totally symmetric. b Categories 1 and 2 compare the experimental and calculated spectra of MPBA, monosaccharides-MPBA, and cerebroside-MPBA adducts at 1330 cm−1 ascribed to βBO + βCH + βOH. The peak intensity ratio difference for each of the 13 analytes is extracted from 60 individual SERS spectra and plotted in standard boxplots with interquartile ranges shaded, mean values indicated, and the whiskers indicating min and max, respectively. c Category 3 is a comparison of the experimental and calculated spectra as well as the intensity ratio difference of the a1,υCC mode between saturated vs. unsaturated cerebroside-MPBA, which shows differential degrees of redshifting compared to MPBA. d Category 4, comparison of the experimental and calculated βCCC + υCS peak position difference between epimeric GlcCer-MPBA and GalCer-MPBA adducts. GlcCer-MPBA adducts with bond angles of the five-membered ring between 102.7° and 105.4° experience higher collective ring strain compared to GalCer-MPBA adducts with bond angles between 102.8° and 106.2° which are closer to the ideal 107°. e Category 5.1 elucidates Glc-MPBA carbon chain length effects by comparing the experimental and calculated peak intensity ratio of βCH of the various Glc-MPBA with different chain lengths. The peak intensity ratio difference for each of the five analytes is extracted from 60 individual SERS spectra and plotted in standard boxplots with interquartile ranges shaded, mean values indicated, and the whiskers indicating min and max, respectively. f Category 5.2 elucidates Gal-MPBA carbon chain length effects by comparing the experimental and calculated peak intensity ratio of βCCC + υCS of the various Gal-MPBA with different chain lengths. 
The peak intensity ratio difference for each of the six analytes is extracted from 60 individual SERS spectra and plotted in standard boxplots with interquartile ranges shaded, mean values indicated, and the whiskers indicating min and max, respectively.

Supervised and unsupervised ML for structural elucidation

Leveraging the strong correlations between the cerebrosides’ SERS fingerprints and their structural characteristics, we create a universal SERS-based chemical taxonomy, achieving >90% classification accuracy and <1 carbon chain length difference. This employs ML’s advanced capabilities to discern underlying data patterns, enabling instantaneous, untargeted structural elucidation across any concentration20,26,27,28,29. To enhance ML accuracy and efficiency by reducing modeling time, we parameterize the spectra, isolate 19 peaks from each cerebroside spectrum, and derive five attributes per peak: position, intensity, full width at half maximum, skew (symmetry or degree of asymmetry), and ratio (degree of Gaussian/Lorentzian character). This effectively reduces the input features from 1200 variables in a single SERS spectrum to just 19 × 5 = 95 features for ML (Suppl. Note 6). To begin, we use unsupervised t-distributed stochastic neighbor embedding (t-SNE) to confirm that SERS fingerprints can differentiate all epimers (Fig. 4a). t-SNE primarily serves to visualize high-dimensional data by projecting it into two-dimensional space. From this t-SNE analysis, we observe distinct clusters of cerebrosides, indicating that the cumulative differences encoded in the spectra are significant (Fig. 4a)29. Cerebrosides are distinguished from the interfering monosaccharides (glucose and galactose). Individual cerebrosides also cluster according to their structural characteristics in our SERS molecular space. Moreover, we note clear segregations in the subsequent four t-SNE plots, which were segmented based on their structural characteristics into (1) the presence vs. absence of epimers, (2) monosaccharide vs. cerebroside, (3) saturated vs. unsaturated ceramide, and (4) GlcCer vs. GalCer (Fig. 4b–e). Overall, our unsupervised t-SNE results provide unambiguous differentiation among the various epimers without human input. 
This sets the stage for untargeted structural elucidation using supervised ML models.
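The peak parameterization described above can be illustrated with a minimal sketch. It is not the authors' feature-engineering code: the function name `peak_descriptors`, the window arguments, and the linear-interpolation estimate of the full width at half maximum are all assumptions made for illustration, and only three of the five attributes (position, intensity, FWHM) are computed because skew and ratio depend on a line-shape fit that the text does not specify.

```python
import math

def peak_descriptors(shifts, intensities, lo, hi):
    """Position, height and FWHM of the tallest point inside [lo, hi] cm^-1.

    Hypothetical helper: assumes a baseline-corrected trace on an
    ascending Raman-shift axis with a single peak in the window.
    """
    window = [i for i, s in enumerate(shifts) if lo <= s <= hi]
    apex = max(window, key=lambda i: intensities[i])
    height = intensities[apex]
    half = height / 2.0

    def crossing(step):
        # Walk away from the apex while the trace stays at or above half
        # maximum, then linearly interpolate the half-maximum crossing.
        i = apex
        while 0 <= i + step < len(shifts) and intensities[i + step] >= half:
            i += step
        j = i + step
        if not 0 <= j < len(shifts):
            return shifts[i]  # peak is cut off by the edge of the spectrum
        frac = (intensities[i] - half) / (intensities[i] - intensities[j])
        return shifts[i] + frac * (shifts[j] - shifts[i])

    left, right = crossing(-1), crossing(+1)
    return {"position": shifts[apex], "intensity": height, "fwhm": right - left}

# Synthetic Gaussian band centred at 1330 cm^-1 (sigma = 5 cm^-1), for illustration only.
shifts = [float(s) for s in range(1300, 1361)]
trace = [math.exp(-((s - 1330.0) ** 2) / (2 * 5.0 ** 2)) for s in shifts]
features = peak_descriptors(shifts, trace, 1320, 1340)

# 19 peaks x 5 attributes per peak gives the 95-element feature vector.
n_features = 19 * 5
```

For a Gaussian band the FWHM should come out close to 2.355 times sigma, which gives a quick sanity check on the interpolation.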

Fig. 4: Unsupervised and supervised machine learning results to forward-predict “unidentified” cerebrosides.

a Visualization of the cerebroside SERS molecular space using unsupervised t-distributed stochastic neighbor embedding (t-SNE), showing distinct clustering in b blank-MPBA vs. epimer-MPBA, c monosaccharides-MPBA vs. cerebrosides-MPBA, d unsaturated vs. saturated cerebrosides-MPBA, and e GlcCer-MPBA vs. GalCer-MPBA. f Forward prediction of “unidentified” cerebroside structures using the five-level SERS-based chemical taxonomy framework consisting of a hierarchical ensemble of six ML models. Contrasting incomplete structural elucidation when g RF-C4 and h SVM-R 5 are removed, respectively.

A five-level chemical taxonomy model is at the core of our forward-inferring ML framework for untargeted elucidation. It comprises four sequential random forest classifiers (RF-C1–4), each determining a specific structural characteristic, and two support vector machine regressors (SVM-R 5.1 and 5.2) to estimate GlcCer and GalCer’s carbon chain lengths (Fig. 4f)26. We chose the RF-C because tree-based algorithms are robust for deciphering both linear and non-linear relationships amongst variables and data patterns that are challenging to resolve by manual analysis30. The SVM-R excels in handling complex boundary-specific problems. In our case, the goal is to find the hyperplane that best segregates GlcCer and GalCer chain lengths in the dataset28. The hierarchical ML-driven framework is designed for progressive structural elucidation of cerebrosides based on their five-tiered structural characteristics, akin to biological taxonomic analysis. After confirming the presence of epimers, the framework sequentially predicts the following structural characteristics: (category 2) monosaccharide vs. cerebroside, (3) saturated vs. unsaturated ceramide, (4) glucosyl vs. galactosyl moieties, and (5.1 or 5.2) GlcCer or GalCer’s carbon chain lengths. We emphasize that a sequential approach is necessary for untargeted identification. It systematically reconstructs individual structural characteristics while eliminating other structural possibilities to elucidate the exact molecular structure and identity. This is impossible with the commonly employed single ML classification model, which would erroneously force unknown samples into existing labeled classes.

Using GlcCer8 samples for blind testing, we evaluate their SERS spectra within the chemical taxonomy framework. Our RF-C1 correctly identifies each spectrum as an epimer-MPBA adduct rather than unreacted MPBA, with 100% certainty (probability = 1) (Fig. 4f). Next, the spectra are directed to RF-C2, where they are compared with the monosaccharide-MPBA and cerebroside-MPBA groups and classified as belonging to the biomolecule class of cerebrosides. The spectra then proceed to RF-C3, which predicts whether they are saturated or unsaturated cerebrosides (Y = 0 or 1). Our RF-C3 correctly predicts that they are saturated cerebrosides. In the final classifier, RF-C4 accurately predicts that the spectra belong to GlcCer-MPBA. After determining the structural characteristics using these four different classifiers, the spectra are channeled to the regressor (SVM-R 5.1 for GlcCer), which determines that the carbon chain length of the bonded ceramide moiety is 8. The overall probability scores for all blind test spectra from each classifier/regressor are input into a custom Python program tasked to make an executive decision regarding the chemical identity and nomenclature (i.e., cerebroside, saturated, glucosyl, carbon chain length 8 = GlcCer8). Overall, our SERS-based chemical taxonomy can distinguish multiple key structural characteristics of unknown molecules, even down to the functional group level. It can also provide accurate predictions of the molecules’ complete molecular structure and identity. If we omit any of the RF-Cs or SVM-Rs, we lose the ability to determine a specific structural characteristic, precluding holistic structural elucidation (Fig. 4g, h). For instance, removing RF-C4 would leave the identity of the glycosyl moiety unidentified. Removing the SVM-Rs would mean we could not predict the cerebroside’s ceramide carbon chain length. 
Notably, the predictions are instantaneous due to ML’s ability to analyze chemically relevant spectral information quickly and accurately.
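The stepwise walk through RF-C1 to RF-C4 and then SVM-R 5.1 or 5.2 can be sketched as a simple cascade. This is an illustrative outline under stated assumptions, not the custom Python program mentioned above: the function `elucidate`, the dictionary keys, the feature names, and the toy threshold rules in `demo_models` are all hypothetical stand-ins for the trained random forest and support vector models, and the cut-off values are arbitrary.

```python
def elucidate(features, models):
    """Walk one spectrum's features down the five-level taxonomy.

    `models` maps hypothetical keys to callables that stand in for the
    trained RF classifiers and SVM regressors.
    """
    if models["rf_c1"](features) == "blank":
        return "MPBA blank"                      # level 1: no epimer captured
    if models["rf_c2"](features) == "monosaccharide":
        return "monosaccharide"                  # level 2: not a cerebroside
    unsaturated = models["rf_c3"](features) == "unsaturated"   # level 3
    head = models["rf_c4"](features)             # level 4: "Glc" or "Gal"
    # Level 5: route to the chain-length regressor for that head group.
    regressor = models["svm_r_5_1"] if head == "Glc" else models["svm_r_5_2"]
    chain = round(regressor(features))
    return f"{head}Cer{chain}" + (":1" if unsaturated else "")

# Toy rules with arbitrary cut-offs, for illustration only (not fitted models).
demo_models = {
    "rf_c1": lambda f: "blank" if f["i1330_ratio"] > 0.4 else "epimer",
    "rf_c2": lambda f: "monosaccharide" if f["i1330_ratio"] < 0.07 else "cerebroside",
    "rf_c3": lambda f: "unsaturated" if f["nu_cc_cm1"] <= 1595 else "saturated",
    "rf_c4": lambda f: "Gal" if f["cs_cm1"] > 417 else "Glc",
    "svm_r_5_1": lambda f: 8.2,    # pretend GlcCer chain-length estimate
    "svm_r_5_2": lambda f: 23.6,   # pretend GalCer chain-length estimate
}

name = elucidate({"i1330_ratio": 0.15, "nu_cc_cm1": 1601, "cs_cm1": 415}, demo_models)
```

Because each level only narrows the remaining possibilities, dropping the `rf_c4` entry would leave the head group undetermined, and dropping the regressors would leave the chain length undetermined, which mirrors the ablation described in the text.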

We demonstrate full generalizability of our SERS-based chemical taxonomy in forward predicting the identity of all 11 gluco- and galactocerebrosides using their SERS spectra. First, we systematically remove individual cerebroside sets of 60 spectra from the total 840 spectra (60 each for MPBA blanks, two monosaccharides, and 11 cerebroside samples at 10−4 M) for blind predictions (Table 1). To mitigate the risk of chance error, we randomly stratify the remaining 780 spectra into training and test sets using 5-fold cross-validation over 100 iterations (Suppl. Note 7, Suppl. Tables 2–12). Collectively, the four RF-Cs achieve >90% cumulative classification accuracy across all models and test instances, successfully determining various structural characteristics. The two SVM-Rs for GlcCer and GalCer also accurately predict the carbon chain length of epimeric cerebrosides, differing by at most one carbon, thus allowing for complete structural elucidation.
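The hold-out scheme described above (60 blind spectra of one cerebroside removed, 780 spectra left for 5-fold cross-validation) can be checked with a short sketch. The helper `hold_out_analyte` and the placeholder class names are assumptions for illustration, and the folds here come from a plain seeded shuffle rather than the stratified split used in the study.

```python
import random

def hold_out_analyte(labels, analyte, n_folds=5, seed=0):
    """Split spectrum indices into cross-validation folds and a blind set.

    All spectra of `analyte` become the blind set; the rest are shuffled
    and dealt into `n_folds` folds (not stratified, unlike the paper).
    """
    blind = [i for i, name in enumerate(labels) if name == analyte]
    rest = [i for i, name in enumerate(labels) if name != analyte]
    random.Random(seed).shuffle(rest)
    folds = [rest[k::n_folds] for k in range(n_folds)]
    return folds, blind

# Placeholder class names: blanks, two monosaccharides and 11 cerebrosides,
# each with 60 spectra, for 840 spectra in total.
classes = ["blank", "glucose", "galactose"] + [f"cerebroside_{k}" for k in range(1, 12)]
labels = [name for name in classes for _ in range(60)]

folds, blind = hold_out_analyte(labels, "cerebroside_1")
```

Repeating the call once per cerebroside gives the 11 leave-one-analyte-out experiments; the held-out class never appears in any training fold.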

Table 1 Fully generalizable forward prediction of all 11 cerebrosides when the respective dataset of the individual cerebrosides is excluded in the training set and used as blind tests

Next, we demonstrate that even when our chemical taxonomy is built using cerebrosides at 10−4 M concentrations, it can effectively predict the identity of “unidentified molecules” ranging from 10−5 to 10−10 M, which is 1–6 orders of magnitude lower than the trained model (Table 2). Such concentration-independent prediction capability is critical for practical SERS sensing applications where the concentration of analytes is often unknown. In this case, we input untrained GalCer16 SERS spectra measured at 10−5–10−10 M into the chemical taxonomy built using the spectra of other epimers and blanks at 10−4 M. The chemical taxonomy returns 87–100% classification accuracies and <1 carbon length difference for the entire concentration range (Table 2, Suppl. Note 7, Suppl. Tables 13–18). Critically, we achieve accurate prediction over such a wide dynamic range of concentrations because of the robust spectral differences in each epimer-MPBA SERS fingerprint. These results are direct evidence of the generalizability of our model built upon sound chemical knowledge using chemically relevant features as inputs. Overall, our SERS-based chemical taxonomy ML framework enables end-to-end structural elucidation and identification of 11 cerebrosides at any concentration not trained in the model through stepwise elucidation of their multifarious structural characteristics from their SERS fingerprints. This finding signifies a step forward for ML-driven SERS as a toolkit for instantaneous, untargeted chemical identification and differentiation of isomeric (bio)molecules that were previously an elusive class of epimer for SERS due to their small Raman cross-sections.

Table 2 Forward prediction results of GalCer16 at 10−5–10−10 M used as blind tests, which are 1−6 orders of magnitude lower than the concentration used in the trained model (10−4 M)

Quantification and multiplex quantification

After developing a chemical taxonomy for the molecular structure elucidation and identification of cerebrosides, we proceed to construct separate SVM-R models for quantification. These models are designed to accurately quantify the concentrations of all 11 pure cerebrosides from 10−4 to 10−10 M. Our models show near-ideal linearity spanning seven orders of magnitude with R2 of 0.95–1.00 and low RMSEprediction of 0.09–0.44 for each epimer, confirming the ultratrace sensitivity of our SERS platform with a detection limit of 10−10 M (Fig. 5a–l, Suppl. Note 5). In contrast, the complex derivatization procedure required for gold standard LC-MS analysis has typically hindered the accurate quantification of these biomolecules due to poor epimer recovery2,3.

Fig. 5: Single and multiplex SERS quantification results.

a Schematics of the pure cerebrosides of different concentrations. SERS quantification of different pure b–f GlcCers and g–l GalCers from 10−4 to 10−10 M using SVM-R models. m Scheme of the three multiplex mixtures of epimeric GlcCer24:1 and GalCer24:1. n Multiplex quantification of the three mixtures with different percentage compositions of the two cerebrosides constituting a total concentration of 100 μM using an SVM-R model.

In addition, we achieve multiplex quantification within the physiologically relevant μM range of the epimeric GlcCer24:1 and GalCer24:1, which coexist in the human body and are vital biomarkers for endometriosis and Gaucher disease as well as Fabry and Krabbe diseases, respectively14,15,16,17,18. To construct a multiplex quantification model for binary mixtures of GlcCer24:1 and GalCer24:1, we first build a calibration curve by varying the mol% of GalCer24:1 from 0 to 100% (and vice versa for GlcCer24:1) using 60 SERS spectra for each calibration set. The total concentration is maintained at 100 μM (Fig. 5m, n, Suppl. Note 7, Suppl. Tables 19–21). Our calibration curve exhibits near-ideal linearity with a cross-validation R2 value of 0.99 and a low RMSEcalibration = 2.18, indicating good predictive accuracy (Fig. 5n). Composition predictions of three binary mixtures as blind tests comprising 90%, 60%, and 40% GalCer24:1, respectively, also exhibit a good linear coefficient R2 > 0.93, RMSEprediction = 5.5, and an absolute difference of <4 μM or 4% between the predicted and actual concentrations (Fig. 5m, n, Table 3). Importantly, we demonstrate excellent detection sensitivity even with minute changes in the concentrations of the two cerebrosides, underpinning the potential of our SERS strategy to quantify them concurrently in biofluid mixtures.
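Because the total concentration is fixed at 100 μM, a predicted mol% of GalCer24:1 maps directly onto a pair of concentrations, and the quoted error metrics follow from simple arithmetic. The sketch below shows that bookkeeping only; the helper names are hypothetical and the "predicted" values are made-up numbers chosen for illustration, not results from the study.

```python
import math

def composition_to_conc(gal_mol_percent, total_um=100.0):
    """Convert mol% GalCer24:1 into concentrations (in uM) of both epimers."""
    gal = total_um * gal_mol_percent / 100.0
    return {"GalCer24:1": gal, "GlcCer24:1": total_um - gal}

def rmse(actual, predicted):
    """Root-mean-square error between paired actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

actual = [90.0, 60.0, 40.0]        # mol% GalCer24:1 in the three blind mixtures
illustrative = [93.0, 57.0, 40.0]  # invented predictions, for illustration only

mix = composition_to_conc(actual[0])
abs_errors = [abs(a - p) for a, p in zip(actual, illustrative)]
error = rmse(actual, illustrative)
```

At a 100 μM total, an absolute error of 1 mol% is the same as 1 μM, which is why the text can quote the discrepancy as either <4 μM or <4%.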

Table 3 Multiplex quantification results for 3 sets of binary mixtures with total cerebroside concentration of 100 μM as blind tests consisting of 10 individual samples for each set of mixtures

Finally, we combine the chemical taxonomy framework with these 11 cerebroside SVM-R quantification models to demonstrate simultaneous structural elucidation, molecular identification, and quantification of “unidentified cerebrosides” across concentrations ranging from 10−4 to 10−10 M (Tables 4, 5, Suppl. Note 7, Suppl. Tables 22–24). As a proof-of-concept, we test 30 blind samples of three different cerebrosides with various concentrations near the detection limits. We achieve predictive performance with >80% cumulative classification accuracies and <10% quantification errors. Our results demonstrate that the majority of spectra in each blind test can be correctly classified and identified. We note a slight decrease in classification accuracy when the samples are at or near the limit of detection (LOD) of 10−10 M. For instance, when we test 10 blind samples of GalCer12 at 10−8 M, one sample is wrongly classified as unsaturated, resulting in a drop in cumulative accuracy to 90%. Nevertheless, it is essential to highlight that by implementing a majority voting scheme, where the predictions are based on the results of most samples, we can confidently identify all cerebrosides, even at their LODs of 10−10 M. Overall, we perform quantitative and multiplex quantitative cerebroside detection at the physiologically relevant micromolar level with high predictive accuracies. In our quantitative detection system, we only require 5 μL of sample, and the whole procedure requires <1 h, including sample mixing and drying, SERS measurements, and ML predictions, which is significantly faster than conventional LC-MS analyses (hours to days) for cerebroside profiling. This finding establishes our ML-driven SERS approach as a promising tool for ultra-trace quantitative detection of large biomolecules with small Raman scattering cross-sections for biomedical applications.
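The majority voting scheme mentioned above can be sketched in a few lines. This is a minimal illustration, assuming one predicted label per blind sample; the function name `majority_vote` is hypothetical, and the example labels reproduce the GalCer12 case from the text in which one of ten samples is misclassified as unsaturated.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label and the fraction of samples that agree."""
    winner, count = Counter(predictions).most_common(1)[0]
    return winner, count / len(predictions)

# Ten blind samples: nine predicted saturated, one wrongly predicted unsaturated.
saturation_calls = ["saturated"] * 9 + ["unsaturated"]
label, agreement = majority_vote(saturation_calls)
```

The consensus label is still correct even though the per-sample accuracy falls to 90%, which is the behaviour the voting step is meant to provide near the detection limit.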

Table 4 Results of our proof-of-concept simultaneous structural elucidation, molecular identification, and quantification of 3 cerebrosides (10 samples each) at various concentrations using our integrated SERS-based chemical taxonomy framework
Table 5 A detailed breakdown of the ML results obtained by our integrated SERS-based chemical taxonomy framework using 10 blind test samples of GalCer12 with an actual concentration of 1 × 10⁻⁸ M. The probability scores for each sample are recorded below

In conclusion, the establishment of an integrated SERS-based chemical taxonomy ML framework demonstrates the capacity for predictive modeling, enabling untargeted structural elucidation and identification with >90% classification accuracies, together with quantification of 11 epimeric cerebrosides at trace concentrations with <10% error, using their SERS spectra. We also provide an in-depth understanding of the spectral regions contributing to the differentiation of all 13 epimers by corroborating with DFT simulations through systematic investigation and profiling of their molecular structures and SERS spectral characteristics. Using the five-level hierarchical chemical taxonomy ML framework, (bio)molecules are sequentially classified according to confirmable differences and similarities in molecular structural characteristics and finally identified by piecing together the collective information gained from each model. These structural characteristics include (1) blank vs. epimer, (2) monosaccharides vs. cerebrosides, (3) saturated vs. unsaturated, (4) glucosyl vs. galactosyl, and (5) the exact carbon chain length of the ceramide. Our research underscores the high cumulative classification accuracy of >90% for the four RF-C models and a discrepancy of at most 1 carbon when predicting the carbon chain length using the SVM-Rs. Moreover, the integrated ML pipeline enables quantitative detection of all 11 pure cerebrosides after identifying them from 10⁻⁴ to 10⁻¹⁰ M, showing good predictive accuracy and near-ideal linearity spanning seven orders of magnitude. We further demonstrate multiplex SERS quantification of 30 blind test epimer binary mixtures (total concentration of 100 μM) with <4% difference between the actual and predicted concentrations.
Overall, our concentration-independent ML-driven SERS chemical taxonomy can forward-predict epimeric cerebrosides over a wide concentration range of 10⁻⁴–10⁻¹⁰ M and allows rapid, one-step untargeted structural elucidation and quantification of “unidentified” isomeric biomolecules. We envision the exploration of biomolecules characterized by higher degrees of saturation (>2) and the ability to elucidate further the exact location of the C=C bonds along the carbon chain and their cis-trans isomerism. The presence of multi-isomeric sites in complex diastereomers may also be probed. Lastly, to extend the framework to the untargeted elucidation of other classes of isomeric compounds beyond the 13 epimers used in this study, we posit the creation of a global SERS molecular space using high-throughput platforms to test various probe-analyte combinations. This innovation can synergize effectively with miniaturized SERS spectrometers and microfluidic chips to realize the point-of-need lab-on-a-chip concept by streamlining sample separation and pretreatment to improve SERS detection in complex and heterogeneous media.

Methods

Chemicals

Silver nitrate (AgNO3, ≥99%), anhydrous 1,5-pentanediol (PD, ≥97%), poly(vinylpyrrolidone) (PVP, average Mw = 55,000 g mol⁻¹), 1H,1H,2H,2H-perfluorodecanethiol (PFDT, ≥97%), 4-mercaptophenylboronic acid (4-MPBA, ≥90%), dodecane (C12H26, anhydrous, ≥99%), D-(+)-galactose (C6H12O6, ≥99%), and D-(+)-glucose (dextrose C6H12O6, ≥99.5%, GC) were purchased from Sigma Aldrich. Copper (II) chloride was purchased from Alfa Aesar. Glucosylceramides (GlcCer C8), GlcCer(β) ceramide (d18:1/8:0 ≥ 99%); GlcCer C12, GlcCer(β) ceramide (d18:1/12:0 ≥ 99%); GlcCer C16, GlcCer(β) ceramide (d18:1/16:0 ≥ 99%); GlcCer C18, GlcCer(β) ceramide (d18:1/18:0 ≥ 99%); GlcCer C24:1, GlcCer(β) ceramide (d18:1/24:1(15Z) ≥ 99%) and galactosylceramides (GalCer C8), GalCer(β) ceramide (d18:1/8:0 ≥ 99%); GalCer C12, GalCer(β) ceramide (d18:1/12:0 ≥ 99%); GalCer C16, GalCer(β) ceramide (d18:1/16:0 ≥ 99%); GalCer C18, GalCer(β) ceramide (d18:1/18:0 ≥ 99%); GalCer C24, GalCer(β) ceramide (d18:1/24:0 ≥ 99%); GalCer C24:1, GalCer(β) ceramide (d18:1/24:1(15Z) ≥ 99%) were purchased from Avanti® Polar Lipids, Inc. Ethanol (C2H6O, ACS, ISO, Reag. Ph Eur), acetone (C3H6O, HPLC, ≥99.9%) and potassium hydroxide (KOH, ACS reagent, ≥85%, pellets) were obtained from Merck. Milli-Q water (>18.0 MΩ cm) was purified with a Sartorius Arium® 611 UV ultrapure water system. All reagents were used without further purification.

Synthesis and purification of silver nanocubes

Ag nanocubes were synthesized via the polyol method31. Two precursor solutions were first prepared. Precursor solution 1 consisted of silver nitrate (0.50 g) and copper (II) chloride (0.86 μg) dissolved in PD in a scintillation vial. Precursor solution 2 consisted of PVP (0.25 g) dissolved in PD. 20 mL of PD was added to a 100 mL round-bottom flask and heated at 190 °C for 10 min in a temperature-controlled silicone oil bath. Subsequently, aliquots of the PVP (250 μL) and silver nitrate (500 μL) precursor solutions were injected alternately into the reaction flask at different rates, namely 500 μL every min for the silver nitrate solution and 250 μL every 30 s for the PVP solution, until the reaction mixture turned reddish-brown. The as-synthesized Ag nanocubes were purified via several rounds of centrifugation at 12,000 × g and sonication in acetone and ethanol, then stored in ethanol. Ag nanocubes were further subjected to vacuum filtration using polyvinylidene fluoride filter membranes (Durapore®) with pore sizes of 5 μm, 0.65 μm, 0.45 μm, and 0.22 μm to remove impurities before use.

SEM and UV characterization of Ag nanocubes

The synthesized Ag NCs were subjected to scanning electron microscopy (SEM) using the JEOL JSM-7600F Schottky field emission electron microscope at an accelerating voltage of 5 kV. Measurements were randomly taken at 5 different spots on the SEM substrate to get a representative group of images for each Ag nanocube sample. For each sample of Ag NCs, the size (edge length) of 100 randomly selected nanocubes was measured using the ImageJ freeware. The UV-vis spectra were taken on the Agilent Technologies Cary 60 UV/visible spectrophotometer.

Surface functionalization of Ag nanocubes

The purified and filtered Ag nanocubes underwent 2 rounds of ligand exchange with 4-MPBA. Briefly, Ag nanocubes (ethanolic, 4 mg mL⁻¹, 400 μL) were added to ethanol (1600 μL) while stirring at 750 rpm for 1 min. Next, 4-MPBA (ethanolic, 1 mM, 200 μL) was injected and the mixture was stirred for 3 h at 750 rpm in the dark. After the first ligand exchange cycle of 3 h, the 4-MPBA functionalized Ag nanocubes were purified via two rounds of centrifugation at 12,000 × g and sonication in ethanol to remove any excess 4-MPBA and redispersed in 400 μL of ethanol. The ligand exchange process was repeated but stirred for 2 h instead of 3 h in the second cycle. The 4-MPBA functionalized Ag nanocubes (ethanolic, 4 mg mL⁻¹) were stored at 4 °C.

Analyte preparation and reaction

Stock solutions of glucose (aq., 5 mM), galactose (aq., 5 mM), and the 11 cerebrosides of different chain lengths (98:2 ethanol:dodecane, 5 mM) were prepared. Serial dilution was carried out to yield 7 concentrations per cerebroside (1 mM to 1 nM). 4-MPBA functionalized Ag nanocubes (ethanolic, 4 mg mL⁻¹) were sonicated and redispersed in pH 11 KOH solution before reaction with the analytes and immediate SERS measurements. 4-MPBA-Ag nanocubes (aq., pH 11, 4 mg mL⁻¹, 10 μL) were added to pH 11 KOH solution (aq., 35 μL) before adding the respective analytes (5 μL). The mixtures were sonicated immediately after each addition so that they were well mixed and further shaken (Eppendorf ThermoMixer C, 1000 rpm at 25 °C) for 30 min to ensure thorough mixing during the reaction, including three sonication and vortex steps at the 10, 20, and 30 min marks. After the reaction, 5 droplets of 2 μL of the mixture were drop-cast on different locations on the perfluorothiol Ag hydrophobic substrate and dried under ambient conditions. SERS spectra were then collected on the 5 dried spot areas using a laser beam with the conditions listed below.
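The volumes above imply a tenfold dilution of each analyte in the reaction mixture (5 μL of analyte in 50 μL total). The short sketch below works through that arithmetic. It assumes decade steps in the serial dilution and that the concentrations quoted elsewhere in the text refer to the final mixture; neither point is stated explicitly in this section, and the function name is illustrative.

```python
def final_concentration(stock_molar, analyte_ul=5.0, nanocube_ul=10.0, koh_ul=35.0):
    """Concentration of the analyte after mixing with the nanocube and KOH solutions."""
    total_ul = analyte_ul + nanocube_ul + koh_ul  # 50 uL with the volumes listed above
    return stock_molar * analyte_ul / total_ul


# Assumed decade steps: 1 mM, 100 uM, ..., 1 nM (7 concentrations).
stock_series = [10.0 ** -k for k in range(3, 10)]
final_series = [final_concentration(c) for c in stock_series]

for stock, final in zip(stock_series, final_series):
    print(f"{stock:.0e} M stock -> {final:.0e} M in the reaction mixture")
```

Under these assumptions, the 7 stock concentrations map onto final concentrations from 10⁻⁴ M down to 10⁻¹⁰ M.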

SERS measurement of 13 analytes

SERS measurements were performed using x-y hyperspectral imaging modes of the Ramantouch microspectrometer (Nanophoton Inc., Osaka, Japan) with a 532 nm excitation laser (power = 0.10 mW). A 20× objective lens was used with a 30 s acquisition time. The spectral window of 400–1800 cm⁻¹ was used for data analyses. The spectra were preprocessed using baseline correction via the adaptive iteratively reweighted penalized least squares (airPLS) algorithm and min-max normalization (max = 1). For each of the 13 analytes, 3 different spectra were measured from different locations within each of the 5 drop-cast droplets (total 15 spectra per analyte) and the experiment was repeated 4 times (total 60 spectra per analyte per concentration) to ensure reproducibility. Representative SERS spectra were obtained by averaging 60 individual SERS spectra per analyte per concentration and data analysis was completed using Origin 9.0 software (OriginLab Corporation, Northampton, MA, USA).
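As a rough illustration of the preprocessing steps other than baseline correction, the sketch below crops a spectrum to the 400–1800 cm⁻¹ window, applies min-max normalization (max = 1), and averages replicate spectra. The airPLS step is deliberately omitted, the function names are illustrative, and the five-point spectra are made-up numbers rather than measured data.

```python
def crop_window(shifts, intensities, low=400.0, high=1800.0):
    """Keep only the points whose Raman shift lies inside [low, high] cm^-1."""
    kept = [(s, i) for s, i in zip(shifts, intensities) if low <= s <= high]
    return [s for s, _ in kept], [i for _, i in kept]


def min_max_normalize(intensities):
    """Rescale intensities so that the minimum is 0 and the maximum is 1."""
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        raise ValueError("a flat spectrum cannot be min-max normalized")
    return [(v - lo) / (hi - lo) for v in intensities]


def average_spectra(spectra):
    """Point-by-point mean of spectra that share the same Raman shift axis."""
    return [sum(column) / len(spectra) for column in zip(*spectra)]


# Toy data (not measured): two replicate spectra on a shared shift axis.
shifts = [300.0, 500.0, 1000.0, 1575.0, 1900.0]
replicates = [[5.0, 2.0, 4.0, 10.0, 7.0], [5.0, 1.0, 3.0, 5.0, 7.0]]

normalized = []
for raw in replicates:
    window_shifts, window_intensities = crop_window(shifts, raw)
    # Baseline correction (airPLS in the text) would be applied here.
    normalized.append(min_max_normalize(window_intensities))

mean_spectrum = average_spectra(normalized)
print(window_shifts, mean_spectrum)
```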

Fabrication of the hydrophobic substrate

Oxygen plasma (FEMTO SCIENCE, CUTE-MP/R, 100 W) was used to clean and prepare the silica substrates for 5 min. Next, chromium (Cr) and silver (Ag) films were deposited in sequence using a home-built Syskey sputter system via thermal evaporation deposition. An adhesion layer of Cr (12.5 nm) was first deposited, followed by a 100 nm Ag film, on the Si substrate. The deposition rates of Cr and Ag were 0.1 and 0.5 Å s⁻¹, respectively, and the rate was monitored in situ using a quartz crystal microbalance. Cr and Ag targets (99.99% purity) were purchased from Advent Research Materials, UK. The resulting coated Ag nanocube array was then functionalized by immersing it in a 5 mM 1H,1H,2H,2H-perfluorodecanethiol ethanolic solution for at least 15 h before rinsing three times with ethanol to remove any unbound PFDT, and was stored in nitrogen before use.

Contact angle measurements

Static contact angles were measured on the Theta Lite fully automated optical tensiometer equipped with a Firewire digital camera (Attension, Biolin Scientific) by drop-casting a sessile 4 μL water droplet onto the hydrophobic and Si substrates, respectively. All contact angle measurements reported were repeated at least five times across each substrate and averaged.

Determining the analytical enhancement factor (AEF)

The analytical enhancement factor is calculated by using the equation below:

$$\mathrm{AEF}=\frac{I_{\mathrm{SERS}}}{I_{\mathrm{Raman}}}\times \frac{C_{\mathrm{Raman}}}{C_{\mathrm{SERS}}}$$
(1)

where \(I_{\mathrm{SERS}}\) and \(I_{\mathrm{Raman}}\) are the signal intensities recorded in the SERS and normal Raman measurements, and \(C_{\mathrm{SERS}}\) and \(C_{\mathrm{Raman}}\) are the corresponding analyte concentrations, measured using the hydrophobic platform and the normal Si wafer Raman substrate, respectively. We chose the CC stretching (a1, ν(CC)) peak at 1598 cm⁻¹ of the 4-MPBA normal Raman spectrum and the corresponding peak at 1575 cm⁻¹ of the 4-MPBA SERS spectrum. The experiment was conducted using 2 μL of 4-MPBA solutions at different concentrations. For the normal Raman measurements, we drop-cast a 4-MPBA solution (1 M, 2 μL) on Si wafer substrates. For the SERS measurements, we drop-cast a 4-MPBA-Ag nanocube solution (10⁻² M, 2 μL) on the Si wafer substrate. For the SERS measurements on our hydrophobic substrate, we drop-cast a 4-MPBA-Ag nanocube solution (10⁻⁴ M, 2 μL) on the hydrophobic substrate. All SERS spectra in this work were collected under dry conditions after the solvent in the droplet had fully evaporated, and the AEF was calculated from the average intensities of the corresponding vibrational bands at 1575 cm⁻¹ and 1598 cm⁻¹ of 4-MPBA in the 25 spectra for each substrate. SERS measurements were taken with the SERS conditions listed above.
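Equation (1) can be evaluated as in the short sketch below. The concentrations are those quoted above for the normal Raman measurement (1 M) and the hydrophobic-substrate SERS measurement (10⁻⁴ M); the two intensity values are hypothetical placeholders, not measured data, and the function name is illustrative.

```python
def analytical_enhancement_factor(i_sers, i_raman, c_sers, c_raman):
    """AEF = (I_SERS / I_Raman) * (C_Raman / C_SERS), as in Eq. (1)."""
    if i_raman <= 0 or c_sers <= 0:
        raise ValueError("I_Raman and C_SERS must be positive")
    return (i_sers / i_raman) * (c_raman / c_sers)


# Hypothetical average band intensities (arbitrary units) at 1575 and 1598 cm^-1.
i_sers_mean = 5000.0
i_raman_mean = 100.0

aef = analytical_enhancement_factor(
    i_sers=i_sers_mean,
    i_raman=i_raman_mean,
    c_sers=1e-4,   # M, SERS on the hydrophobic substrate
    c_raman=1.0,   # M, normal Raman on the Si wafer
)
print(f"AEF = {aef:.1e}")
```

With these placeholder intensities, a 50-fold intensity ratio combined with the 10⁴-fold concentration ratio gives an AEF on the order of 10⁵.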

DFT simulations

The calculations on the interaction of the 4-MPBA-functionalized Ag surface with all 13 analytes (2 monosaccharides and 11 cerebrosides) were carried out using the unrestricted B3LYP exchange–correlation functional in the Gaussian 09 computational chemistry package. The 6-31G(d,p) basis set was used for C, H, O, B, and N. The LANL2DZ basis set was employed for Ag. The Ag surface was modeled using a reported triangle consisting of six Ag atoms. Structure optimization was carried out in 3 steps. Firstly, we optimized the geometry of all 13 analytes. Secondly, we optimized the geometry of the triangular Ag cluster; then 4-MPBA was placed on its vertex, and the whole system was reoptimized. Lastly, we introduced the optimized analyte molecules to the 4-MPBA-Ag system and formed 2 boric ester bonds with 4-MPBA via the C3 and C4 hydroxy groups, and the whole system was reoptimized with the Ag atoms fixed again.

Chemometrics analysis

Unsupervised principal component analysis (PCA) was performed using SOLO v8.8 (Stand Alone Chemometrics Software, Eigenvector Research, Inc.). The PCA model was cross-validated using Venetian blinds, with 10 splits and a blind thickness of 1. Supervised support vector machine regression (SVM-R), with prior PCA compression applied to prevent overfitting, and extreme gradient boosting classification (eta/learning rate = 0.1, max depth = 6, num_round = 500) were performed using SOLO v8.8 (Stand Alone Chemometrics Software, Eigenvector Research, Inc.). (A) For the forward prediction model (regression model 5), the training dataset containing all chain lengths except the testing dataset (e.g., train = GlcCer8, 12, 18, and 24 and test = GlcCer16) was first randomly stratified into an 80% train and 20% cross-validation split for model training/building. Once trained, the test dataset was used to assess the model accuracy by determining the R² value, root mean square error of prediction (RMSEP), predicted chain length, and % difference between the predicted value and actual value. (B) For the pure analyte regression, the dataset was first randomly stratified into a 75% train and 25% test split in one iteration. The test set was used to assess the model accuracy by determining the R² value and the root mean square error of prediction (RMSEP). (C) For the multiplex regression model, the training dataset containing all training concentrations (e.g., GlcCer24:1:GalCer24:1 ratios of 0:100, 25:75, 50:50, 75:25, 100:0) was first randomly stratified into an 80% train and 20% cross-validation split for model training/building. Once trained, the 3 testing datasets (GlcCer24:1:GalCer24:1 ratios of 10:90, 40:60, 60:40) were used to assess the model accuracy by determining the R² value, root mean square error of prediction (RMSEP), predicted chain length, and difference in percent between the predicted and actual value.
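For readers unfamiliar with the terms above, the sketch below shows how Venetian-blinds folds are commonly constructed (with 10 splits and a thickness of 1, every tenth sample falls in the same fold) and how RMSEP and R² can be computed. It is an illustration in plain Python rather than the SOLO implementation; R² is taken here as the coefficient of determination, which may differ from the definition the software reports, and the chain-length values are hypothetical.

```python
import math


def venetian_blinds(n_samples, n_splits=10, thickness=1):
    """Assign consecutive blocks of `thickness` samples to folds in rotation."""
    folds = [[] for _ in range(n_splits)]
    for i in range(n_samples):
        folds[(i // thickness) % n_splits].append(i)
    return folds


def rmsep(actual, predicted):
    """Root mean square error of prediction."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


folds = venetian_blinds(25)
print(folds[0], folds[5])

# Hypothetical actual and predicted chain lengths for four test samples.
actual = [8.0, 12.0, 16.0, 18.0]
predicted = [9.0, 12.0, 15.0, 18.0]
print(rmsep(actual, predicted), r_squared(actual, predicted))
```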

All other chemometrics analyses, including unsupervised t-distributed stochastic neighbor embedding (t-SNE) and PCA as well as the supervised random forest, decision tree, support vector machine, naïve Bayesian network, and neural network classification models, were conducted using Orange Data Mining32. The following parameters were applied for each model: t-SNE using perplexity = 30, exaggeration = 1, PCA components = 15; random forest where the number of trees = 1000, number of attributes arbitrarily considered at each split = \(\sqrt{\mathrm{number\ of\ attributes}}\), max depth = 10, no splitting of subsets <5; decision tree where the number of trees = 50, nodes = 2; support vector machine using cost = 1, loss = 0.10, RBF kernel; naïve Bayesian network using cost = 1, loss = 0.10, RBF kernel; neural network using 50 nodes per layer, 2 layers. For all classification models where all classes are represented, the dataset was first randomly stratified into a 75% train and 25% test split in one iteration. The test set was used to assess the model accuracy by determining its prediction accuracy and F1 score. This process was repeated for another 99 iterations to derive the average prediction accuracy for the class. For the RF classification models used in forward prediction (classification models 1–4), the dataset containing all classes, except the blind test class, was first randomly stratified into an 80% train and 20% cross-validation split for 100 iterations to derive the average prediction accuracy for the model. Once trained, the test datasets were used to assess the accuracy of the 4 different models by determining the classification accuracy, precision, recall, and F1 score.
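The repeated stratified 75%/25% split described above can be sketched in plain Python as follows. This is not how Orange Data Mining implements it internally; the function names, the seeding scheme, and the placeholder `evaluate` callable (which would normally train a classifier on the training indices and return its test accuracy) are all illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices so each class keeps the same train/test proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        shuffled = list(indices)
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)


def average_score(labels, evaluate, n_iterations=100, test_fraction=0.25):
    """Average a user-supplied score over repeated random stratified splits."""
    scores = []
    for seed in range(n_iterations):
        train, test = stratified_split(labels, test_fraction, seed)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)


# Toy labels: 60 spectra for each of two classes.
labels = ["GlcCer"] * 60 + ["GalCer"] * 60
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))

# Placeholder score: the fraction of samples held out in each iteration.
test_share = average_score(labels, lambda train, test: len(test) / len(labels))
print(test_share)
```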

Machine learning metrics were calculated as follows:

$$\mathrm{Classification\ accuracy}=\frac{\mathrm{True\ positive}+\mathrm{True\ negative}}{\mathrm{True\ positive}+\mathrm{False\ positive}+\mathrm{True\ negative}+\mathrm{False\ negative}}$$
(2)
$$\mathrm{Precision}=\frac{\mathrm{True\ positive}}{\mathrm{True\ positive}+\mathrm{False\ positive}}$$
(3)
$$\mathrm{Recall}=\frac{\mathrm{True\ positive}}{\mathrm{True\ positive}+\mathrm{False\ negative}}$$
(4)
$$\mathrm{F1\ score}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
(5)
$$\mathrm{Difference}=\frac{|\mathrm{Experimental\ value}-\mathrm{Predicted\ value}|}{\mathrm{Experimental\ value}}$$
(6)
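Equations (2)–(6) translate directly into code. The sketch below is a minimal illustration with hypothetical confusion-matrix counts and concentrations; it assumes all denominators are non-zero and uses illustrative function names.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def relative_difference(experimental, predicted):
    """Absolute difference relative to the experimental value, as in Eq. (6)."""
    return abs(experimental - predicted) / experimental


# Hypothetical counts for a two-class test set of 20 spectra.
metrics = classification_metrics(tp=9, fp=1, tn=9, fn=1)
print(metrics)

# Hypothetical concentrations in uM: actual 100, predicted 96.
difference = relative_difference(100.0, 96.0)
print(difference)
```

Multiplying the value from Eq. (6) by 100 gives the difference in percent.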