Introduction

Sphingolipids (SLs) are a diverse family of structural and signalling lipids that serve a broad range of biological functions crucial to normal physiology, cell signalling and trophic support1,2,3. SLs are defined by their sphingoid backbone, which in mammals consists predominantly of a 1,3-dihydroxy, 18-carbon, mono-unsaturated sphingosine (d18:1). Variations to this long-chain base have also been studied, including the saturated dihydrosphingosine or sphinganine (d18:0), the di-unsaturated sphingadiene (d18:2), as well as mono- (m18:X) and trihydroxy (t18:X) configurations2,4.

Ceramide (Cer) is the central intermediate of the sphingolipid pathway and consists of a sphingoid base amide-linked to a fatty acid of variable length, hydroxylation and degree of unsaturation5,6. Modifications to the C1 hydroxyl group of Cer allow for the formation of the more complex glycosylated SLs (e.g., cerebrosides, gangliosides), and the phosphorylated sphingomyelins (SM), which are the most abundant SL class in the plasma membrane of eukaryotes7. SM is frequently studied alongside Cer, owing to the functional importance of Cer-SM balance in cellular and inflammatory processes, as well as in ordered domain function (e.g., lipid rafts)8,9,10,11. Liquid chromatography-tandem mass spectrometry (LC–MS/MS) is the conventional approach for the analysis of Cer and SM lipids, as their fragmentation following positive ionisation ([M + H]+) generates characteristic ions of their sphingoid backbone (e.g., m/z 264 for d18:1) and side groups (e.g., m/z 184 for the choline headgroup of SM) to facilitate their identification (Fig. 1). Their fragmentation can also elucidate structural features of the fatty acyl and sphingoid base chain, including length and degree(s) of unsaturation (Fig. 1), and, with specialised instruments, can help determine the precise position of carbon–carbon double bonds4,12.

Figure 1

Basic chemical structure of sphingolipids, displaying the primary headgroups (‘R’) for ceramide and sphingomyelin, as well as the sites of characteristic fragmentation ions (‘F’) on the sphingoid base (F1-3) and the choline headgroup (F4) specific to sphingomyelin. Common sites of unsaturation (C=C) on the mammalian sphingoid backbone are also illustrated, along with their impact on the reported m/z fragment ions.

Specific precursor–product ion transitions can be employed in selected (SRM) or multiple (MRM) reaction monitoring experiments to improve the accuracy and reproducibility of SL analysis13,14,15, and are readily adapted into commercial (e.g., LipidSearch, LipidBlast, LipidAnnotator) and freeware toolboxes for the development of automated workflows (including setup, data acquisition and validation), thereby making their analysis amenable to non-experts16,17,18. However, resolving SL identities remains complex, as variations in headgroups, carbon backbone length and sites of unsaturation increase the probable number of variants with similar transitions, including isobaric and isomeric species19, that cannot be distinguished even with high-resolution instruments20. In addition, new structural species are continually being identified21,22, adding to this growing complexity.

Nevertheless, the resolution of closely related SLs, including those with similar structures or nominal mass, can be achieved through LC separation, augmenting the accuracy of MS/MS assignments23,24,25. SL identities can be further validated by matching their chromatographic separation or retention time (RT) in complex samples against pure external standards or internal (deuterated) compounds (Fig. 2)26. Though this unequivocally resolves their identities, purchasing standards for all SLs of interest is not feasible, as many are not commercially available and purchasing large numbers of SL variants imposes a considerable financial burden.

Figure 2

Representative total ion chromatograms for ceramide and sphingomyelin species from (A,B) pure compounds and (C,D) biological extracts. Loaded calibrators for (A) Cer(d18:1/24:0) and (B) SM(d18:1/24:1) allow us to resolve their position in the samples (C,D), which have multiple peaks assigned to their MRM scan, caused by interfering ions with similar precursor-product transitions.

Advancements in computational predictions have vastly improved RT estimation across various lipids (including SLs) and small molecules, for both targeted (SRM/MRM) and global lipid (untargeted) analyses27,28,29,30,31,32,33,34. However, numerous selective modifications to lipid separation (e.g., sample preparation, appropriate solvents, flow rate, chromatography matrix) are made to improve their detection and avoid batch effects. This can impose limitations on consistent RT estimations as modelling is often confined to comparable experimental conditions, including tissue type, potential class-specific biases and/or requires frequent retraining and validation30,33,34. This has precluded the capacity for robust or straightforward integration of RT estimates in lipid identification tools and experimental analyses.

To overcome these limitations in the assessment of SLs, Mass versus Relative Elution Time (MRET) profiling can be employed35, plotting two-dimensional data of each SL’s nominated precursor mass (m/z) against its elution time (RT) (Fig. 3). Standards and controls (quality or internal) are utilised as “points of reference”, with the generated 2D plots used to visually extrapolate the position of other SLs within that class, using their known mass and characteristic fragmentation ions (i.e., sphingoid backbone, C1 head group) as molecular descriptors to determine the RT of unknowns (Fig. 3). Recognising MRET patterns specific to a given SL family (Fig. 3B), as well as structural characteristics (e.g., degree of unsaturation, Fig. 3C,D), allows for the elimination of major interferences resulting from ions with indistinguishable precursor and product fragmentation.
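The extrapolation step described above can be illustrated with a small sketch. It fits a straight line of RT against precursor m/z through the reference points of one structural series and reads off the RT of an unknown. The linear form, the function names and the RT values are illustrative assumptions; they are not the manual MRET procedure or data from this study.

```python
def fit_line(points):
    """Least-squares line rt = slope * mz + intercept through (mz, rt) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two reference points are needed")
    mean_mz = sum(mz for mz, _ in points) / n
    mean_rt = sum(rt for _, rt in points) / n
    sxx = sum((mz - mean_mz) ** 2 for mz, _ in points)
    sxy = sum((mz - mean_mz) * (rt - mean_rt) for mz, rt in points)
    slope = sxy / sxx
    return slope, mean_rt - slope * mean_mz


def estimate_rt(points, mz):
    """Estimate the RT of an unknown from its precursor m/z."""
    slope, intercept = fit_line(points)
    return slope * mz + intercept


# Hypothetical reference points for one series: (precursor m/z, RT in min).
references = [(538.5, 4.0), (594.6, 6.0), (650.6, 8.0)]
print(round(estimate_rt(references, 622.6), 2))
```

Fitting each structural series separately mirrors the idea that SLs sharing a degree of unsaturation follow a common trend on the MRET plot, so an interfering peak that falls far from the fitted line can be set aside.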

Figure 3

2D plots of MRET profiles. (A) Each bubble represents a distinct ion and its relative TIC peak area (bubble size) identified during an SRM/MRM experiment, with the identity of ions for Cer and SM standards/internal controls highlighted. These values were assigned as points of reference to (B) extrapolate the elution profiles of Cer and SM, which subsequently allows for the removal of interfering ions. Enhanced views of elution profiles for (C) Cer and (D) SM demonstrate how relative RT can assist with determining structural characteristics. In this instance, relative RT helps identify the effects of increasing carbon chain length and degree of unsaturation (C=C bonds) on SLs. 2D plots were derived from data obtained in our previous study14.

Presently, MRET profiling is manually performed. In this study, we aimed to automate the process by developing a bespoke, web-based tool entitled ReTimeML, aptly named to describe the task at hand—a calculated Remeasurement of the Time, on column, to predict the elution of (sphingo)lipids. ReTimeML was constructed through Machine-Learned regression of user RT assignments, collected from our database of LC–MS/MS analyses on Cer and SM lipids, and literature sources. Herein, we verified ReTimeML’s capacity to accurately annotate (> 99% accuracy) Cer and SM RTs, compared to expert-user assignments, across multiple tissues and LC–MS/MS experimental conditions. Notably, ReTimeML was successfully applied to aid the identification of noncanonical Cer and SM structures, resolve ion interferences, and guide the accurate annotations for Cer and SM expressional differences in cerebrospinal fluid (CSF) and paired serum collected from the same healthy volunteers (HVs).

Results

Regression model development and selection

We assessed the capacity of the nominated regression algorithms (Supplementary Data 1) to learn from descriptor information for identifying Cer and SM lipids (e.g., precursor mass, fragmentation) and user RT annotations of previous work (Supplementary Data 2). RT data was sequentially increased to determine the optimal training sample size, broken down by molecular features, for which regression model performance (evaluated on validation data) yielded coefficient of determination (R2) and root mean squared error (RMSE) values at the acceptance thresholds (R2 > 0.9, RMSE < 0.25), with the fulfilment of both criteria imperative. As anticipated, performance increased across all models when augmenting the training set size (Fig. 4). Lasso (alpha = 0.001) and ridge regression (alpha = 0.4) outperformed the other machine-learned models (Fig. 4), and were assigned as the optimal regression algorithms for Cer and SM RT estimations, respectively. Though comparable, lasso achieved a slightly more favourable R2 value for Cer (lasso: 0.930; ridge: 0.929, n = 9), while ridge regression appeared to be more beneficial for SM estimations (lasso: 0.915; ridge: 0.928, n = 6). Lasso yielded marginally lower RMSE values for both Cer (lasso: 0.091; ridge: 0.102, n = 9) and SM (lasso: 0.132; ridge: 0.178, n = 6), when applying the same number of datasets to concurrently meet our R2 > 0.9 prerequisites, though both were well-below the acceptable RMSE < 0.25 threshold (Fig. 4, Supplementary Data 2).
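The acceptance test applied to each candidate model can be sketched as follows. This is a minimal illustration of checking R2 and RMSE against the stated thresholds (R2 > 0.9, RMSE < 0.25); the function names and the validation values are hypothetical, and the sketch does not reproduce the regression fitting itself.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def meets_thresholds(observed, predicted, min_r2=0.9, max_rmse=0.25):
    """Both criteria must hold, as required in the text."""
    return (r_squared(observed, predicted) > min_r2
            and rmse(observed, predicted) < max_rmse)


# Hypothetical validation RTs and model predictions.
observed = [4.0, 5.0, 6.0, 7.0, 8.0]
predicted = [4.1, 4.9, 6.1, 6.9, 8.1]
print(meets_thresholds(observed, predicted))
```

Requiring both criteria guards against a model that tracks the overall trend (high R2) while still missing individual RTs by a wide margin (high RMSE).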

Figure 4

The (A,B) coefficient of determination (R2) and (C,D) root-mean-square-error (RMSE) values for every regression model applied to the validation data with increasing training sample size. The selected regression model for (A,C) ceramide (lasso, alpha = 0.001, red line) and (B,D) sphingomyelin (ridge, alpha = 0.4, green line) has been emphasised.

Performance evaluation and model robustness

The performance of ReTimeML’s lasso and ridge regression modelling was verified in four independently performed LC–MS/MS analyses of frequently observed d18:0, d18:1 and d18:2 Cer and SM lipid species (Tables 1 and 2). LC–MS/MS of Cer and SM species was performed on various tissues/fluids of rodents and humans, under different chromatography conditions (including an isocratic vs. gradient comparison). Cer and SM species for which experimental RTs were known (i.e., calibrators, internal/quality controls, experimentally determined) were assigned as points of reference (‘Train’), with the remaining unknown RTs extrapolated for each experiment (‘Test’). ReTimeML’s output provides users with an RT list for all Cer/SMs of interest, including those listed as references, which can be downloaded as a .csv file or directly copied into Excel or a similar spreadsheet. An MRET profile plot is also generated that displays the position of each calculated SL, organised into different degrees of SL structural unsaturation. Representative figures of ReTimeML’s output are illustrated in Fig. 5A and B. MRET plots were also manually constructed from user-validated assignments (Supplementary Fig. 1), with their eluting order supporting ReTimeML’s output, as well as previous literature on the separation of these SLs under reverse-phase conditions35,36,37. ReTimeML’s estimations displayed exceptional agreement when compared to user-determined RTs (Fig. 5C,D). ReTimeML predicted the RTs of 192 Cer and SM species across the four LC–MS/MS experiments, with an average and median prediction error of 7.6 and 3.6 s, respectively, and each validation experiment achieved R2 > 0.99 when comparing ReTimeML estimations to experimentally determined RTs (Fig. 5E–H). Of the ReTimeML predicted RTs, 14 deviated more than 3% from user assignments, with errors of this magnitude occurring only when RTs were estimated under isocratic conditions (Tables 1 and 2). For gradient RT estimations, 23 of 142 deviated more than 1%, with only 2 of these estimates exceeding a 2% variance from user assignments (Tables 1 and 2).
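The deviation tallies reported above follow from a simple per-species calculation, sketched below. The percent deviation is taken here as the absolute difference relative to the user-assigned RT, which is an assumption about the exact definition used; the function names and RT pairs are hypothetical.

```python
def percent_deviation(estimated, reference):
    """Absolute deviation of an estimate, as a percentage of the reference RT."""
    return abs(estimated - reference) / reference * 100.0


def count_exceeding(pairs, threshold_pct):
    """Count (estimated, reference) pairs deviating by more than the threshold."""
    return sum(1 for est, ref in pairs
               if percent_deviation(est, ref) > threshold_pct)


# Hypothetical (estimated RT, user-assigned RT) pairs in minutes.
pairs = [(5.02, 5.00), (6.10, 6.00), (7.30, 7.00), (8.00, 8.00)]
print(count_exceeding(pairs, 1.0), count_exceeding(pairs, 3.0))
```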

Table 1 Precursor-product ions used for the analysis of ceramides and comparison between ReTimeML estimations to user-defined RTs.
Table 2 Precursor–product ions used for the analysis of sphingomyelins and comparison between ReTimeML estimations to user-defined RTs.
Figure 5

Representative ReTimeML outputs estimating RTs for (A) gradient (human serum, ceramide) and (B) isocratic (rat brain, sphingomyelin) LC–MS/MS analyses. (C,D) ReTimeML estimations (blue circle) aligned with user RT annotations (orange circle) for (C) Cer and (D) SM across our four experimental fluid/tissue analyses. Scatter plots representing ReTimeML estimations vs user-determined RT values for (E) human serum, (F) CSF, (G) mouse liver and (H) rat brain SL extracts. Squared correlation coefficients (R2) are reported. The complete list of RT values (ReTimeML and user) is provided in Tables 1 and 2.

Accuracy threshold assessment

ReTimeML’s performance in our verification studies prompted us to evaluate how its modelling accuracy responds to variations in the number and/or type (i.e., structure) of reference material employed. This was performed to ascertain the minimum requirement of reference material (‘Train’) that will achieve, on average, appropriate levels of accuracy (< 3% deviation from experimentally determined values). To investigate this potential constraint, we undertook a random sampling of our user-defined RTs from the verification LC–MS/MS experiments (Tables 1 and 2), which were then utilised as operational ‘Train’ values to evaluate their impact on ReTimeML’s subsequent extrapolation of unknowns. Nominated values were incrementally increased to assess the effects of fatty acyl chain length within a sphingoid base (e.g., d18:1/16:0 → d18:1/24:0), alongside structural variations from degree(s) of unsaturation (d18:0/XX:0, d18:1/XX:0, d18:1/XX:1, d18:2/XX:0 and d18:2/XX:1). In all random samplings, the internal control was maintained, with ReTimeML requiring a minimum of two points of reference to begin the estimation of unknowns.

Regardless of the selected RT material used, ReTimeML’s estimations on gradient LC–MS/MS experiments consistently outperformed isocratic measures, requiring fewer ‘Train’ points to achieve appropriate levels of accuracy (Fig. 6). ReTimeML estimations for gradient assessments were deemed suitable upon the employment of three structural variants (e.g., d18:0/XX:0, d18:1/XX:0 and d18:1/XX:1) as reference material, with average deviations from user-RTs ranging from 0.39 to 0.58% for Cer and from 0.3 to 2.71% for SM (Fig. 6A–C,E–G). Increasing the number of reference points beyond this value did not greatly improve RT accuracy, with the exception of SMs assessed in mouse liver tissue, where additional fatty acyl chains of the three structural variants improved accuracy to 1% and below (Fig. 6G). For isocratic LC–MS/MS, a larger number of references (n = 5–7) was required to achieve appropriate levels of accuracy (< 3%), with ReTimeML unable to attain a variance from user-RTs below ~ 1.5% (Fig. 6D,H). These results mirrored the numbers of ‘Train’ values and model accuracy achieved in our verification experiments (Tables 1 and 2). In all assessments, increasing the number of fatty acyl chain ‘Train’ points improved RT accuracy less than increasing the number of structural variants.

Figure 6

Heat maps comparing the accuracy of ReTimeML estimations against the selection and number of reference RTs employed. A random sampling of predetermined RTs was performed within each of our validation cohorts, which incrementally increased the number of references based on fatty-acyl chain length (y-axis) and sphingoid base unsaturation (x-axis). Variance (% error) in ReTimeML estimations for (A–D) Cer and (E–H) SM from user-defined RTs is represented for (A,E) human serum; (B,F) human CSF; (C,G) mouse liver and (D,H) rat brain. Accuracy was arbitrarily deemed appropriate if ReTimeML variance from user-RTs was less than 3%.

ReTimeML SL resolution in the absence of distinguishing fragmentation

ReTimeML estimations help clarify the RTs of SLs where isobaric and isotopic distributions interfere with correct peak identification in the total ion chromatogram (TIC) and cannot be resolved by supporting fragmentation data. This was particularly notable for di-unsaturated SMs that can correspond to either d18:1/XX:1 or d18:2/XX:0 in our MRM analyses, as the m/z 184 transition is insufficient to differentiate these isobars (Fig. 7A–C). To aid their resolution, secondary scanning of the m/z 262 (d18:2) and 264 (d18:1) sphingoid base was performed (Fig. 7D–I, Table 2). Sphingoid fragment selection proved useful in resolving SM (d18:2/XX:0) species (Fig. 7D–F) but presented complications in deriving the identities of SM (d18:1/XX:1) species, as the prominent peaks displayed with m/z 264 scans either did not align with MRET principles (Fig. 7G,H) or displayed multiple peaks, leading to further ambiguity over the correct identity (Fig. 7I). ReTimeML-guided RTs helped to either exclude (Fig. 7G,H) or correctly annotate (Fig. 7I) the SM (d18:1/XX:1) peaks in TICs. Peaks that did not align to ReTimeML estimates for m/z 264 transitions were attributed to isotope interference from mono-unsaturated hexosylceramides (HexCer (d18:1/XX:0)), as illustrated (Fig. 7J–L). HexCer (d18:1/XX:0) species give rise to an [M + 1] isotopic ion that can interfere with the [M + H]+ ion of SM (d18:1/XX:1) (Fig. 7M–R). A similar comparison of the isotopic distribution and ReTimeML RT estimation helped to resolve SM (d18:2/XX:1) and SM (d18:1/XX:2), the latter of which were not assessed as their signal-to-noise ratios (S/N) were below the limit of detection (S/N < 3, data not shown).
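The isotopic overlap described above can be checked with a short mass calculation. The sketch below compares the [M + H]+ ion of SM (d18:1/18:1) with the first 13C isotopologue of protonated HexCer (d18:1/18:0). The elemental formulas and atomic masses are standard reference values supplied here for illustration rather than taken from this study, and the variable names are hypothetical.

```python
# Monoisotopic atomic masses (Da); standard reference values.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
        "O": 15.9949146, "P": 30.9737620}
PROTON = 1.00727646      # mass added on protonation
C13_SHIFT = 1.0033548    # mass difference between 13C and 12C


def mono_mass(formula):
    """Monoisotopic mass of a neutral molecule given element counts."""
    return sum(MONO[element] * count for element, count in formula.items())


# Assumed elemental compositions of the neutral lipids.
SM_D18_1_18_1 = {"C": 41, "H": 81, "N": 2, "O": 6, "P": 1}
HEXCER_D18_1_18_0 = {"C": 42, "H": 81, "N": 1, "O": 8}

sm_mz = mono_mass(SM_D18_1_18_1) + PROTON
hexcer_m1_mz = mono_mass(HEXCER_D18_1_18_0) + PROTON + C13_SHIFT

delta_da = hexcer_m1_mz - sm_mz
delta_ppm = delta_da / sm_mz * 1e6
print(f"SM [M+H]+ {sm_mz:.4f}, HexCer [M+1] {hexcer_m1_mz:.4f}, "
      f"difference {delta_da:.4f} Da ({delta_ppm:.1f} ppm)")
```

Under these assumptions the two ions differ by a small fraction of a dalton, which is why unit-resolution precursor selection passes both and RT becomes the deciding evidence.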

Figure 7

ReTimeML-assigned RTs help to resolve ambiguity arising from isobaric and isotopic interferences in MRM analysis of SMs. The TICs for (A–C) MRM transitions m/z 729.6, 785.7 and 813.7 → 184.1 correspond to SM (d36:2), SM (d40:2) and SM (d42:2), respectively. ReTimeML estimated RTs for the SM (d18:1/XX:1) (green arrow/peak) and SM (d18:2/XX:0) (red arrow/peak) isobars are illustrated. We confirmed ReTimeML assignments through secondary scans of the (D–F) d18:2 (m/z 262.3) and (G–I) d18:1 (m/z 264.3) sphingoid backbones. Differentiation of isotopic SLs was also supported by ReTimeML assignments for SM (d18:1/XX:1) isobars, as their secondary scans displayed interfering peaks of similar RT, caused by the presence of [M + 1] isotopic ions from HexCer (d18:1/XX:0) SLs. These were verified by including and aligning the RTs for their (J–L) [M + H]+ MRM transitions. The isotopic distribution patterns for (M–O) SM (d18:1/18, 20 and 24:1) and (P–R) HexCer (d18:1/18, 20 and 24:1) are presented to demonstrate how their profiles overlap, allowing the latter to interfere with [M + H]+ → m/z 264 scans for SM (d18:1/XX:1).

Correlation of Cer and SM profiles between CSF and serum

ReTimeML-guided Cer/SM lipids were assessed in our paired CSF and serum samples, with both fluids collected from the same HVs. This resulted in the structural characterisation and quantification of 48 Cer and SM lipid species. Demographics for the 51 HVs included in this analysis are described in Supplementary Table 1 and their concentrations (pmol/mL) for each Cer and SM lipid quantified in both the CSF and serum are listed in Supplementary Data 3. Participants were recruited in Cologne, Germany and considered healthy at the time of body fluid withdrawal, with no close relatives having a psychiatric disorder. Participants comprised 30 women (58.8%) and 21 men (41.2%), with an average age of 27.3 (SD = 6.6) years and BMI of 23.0 (SD = 3.4), and were of Caucasian ethnicity, with the exception of two participants of African and Asian origin. Cer and SM profiles were comparable to previously reported CSF and serum concentrations of these SLs in healthy/control cohorts38,39,40,41,42. SMs displayed predominantly higher concentrations compared to Cer, with d18:1 being the most prominent sphingoid backbone for both SL classes, regardless of body fluid type (Fig. 8A,B, Supplementary Data 3).

Figure 8

Human CSF-serum SL expression profiles for (A) Cer and (B) SM. (C,D) Scatter plots of the individual pairwise CSF-serum SL comparisons for each (C) Cer (n = 606) and (D) SM (n = 945) identified in our subject cohort (n = 51). Correlations were determined by Pearson analysis, using natural log-transformed (Ln) Cer and SM levels. Correlation coefficients (r) and p values are shown. (E) Volcano plot summarising the adjusted p values (Q-value) against correlation coefficients for the Cer and SM lipids analysed. The dotted line shows the threshold for statistical significance at Q < 0.05, adjusted for multiple comparisons using the FDR approach of Benjamini, Krieger, and Yekutieli.

The comparative similarities in their CSF-serum profiles (Fig. 8A,B) prompted us to evaluate the respective associations between these two body fluids. First, pairings were treated independently by matching every SL concentration identified per individual in the CSF to that of its corresponding value in the serum. This resulted in a total of 606 individualised Cer, and 945 SM, CSF-serum pairings (Fig. 8C,D), with Pearson analysis revealing significant (p < 1 × 10−15) positive associations between the biofluids for Cer (r = 0.466) and SM (r = 0.844).
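The pairwise analysis above amounts to a Pearson correlation on natural log-transformed concentrations, which can be sketched as follows. The function names and concentration values are hypothetical and chosen only to show the calculation; they are not data from this cohort.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log_pearson(csf, serum):
    """Pearson r on natural log-transformed paired concentrations."""
    return pearson_r([math.log(v) for v in csf],
                     [math.log(v) for v in serum])


# Hypothetical paired concentrations (pmol/mL) for one lipid.
csf = [1.0, 2.0, 4.0, 8.0]
serum = [100.0, 200.0, 400.0, 800.0]
print(round(log_pearson(csf, serum), 3))
```

The log transform keeps the abundant species from dominating the correlation when concentrations span several orders of magnitude.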

Mean CSF-serum correlations for each characterised Cer and SM structure were evaluated next. From the 48 Cer and SM structures identified, 41 recorded a sufficient number of CSF-serum pairings (N > 10) to be subjected to a correlation coefficient analysis (r), with statistical values (p values) adjusted for multiple comparisons using the FDR approach of Benjamini, Krieger, and Yekutieli (Q < 0.05). As a result, 5 SLs were defined as significantly correlating between the CSF and serum (Fig. 8E). Notably, all of these significant associations involved SLs with a noncanonical sphingadiene backbone, with 4 SMs displaying positive associations (SM (d18:2/16:1), SM (d18:2/18:1), SM (d18:2/20:1) and SM (d18:2/22:1)), alongside Cer (d18:2/16:0), which showed a significant inverse correlation between the two body fluids. Correlation coefficients and statistical values for all Cer and SM associations are summarised in Supplementary Table 2.
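For readers unfamiliar with the multiple-comparison adjustment named above, the sketch below implements the two-stage linear step-up procedure of Benjamini, Krieger and Yekutieli as it is commonly described. It is an illustrative assumption that this variant matches the one applied in the analysis; the function names and p values are hypothetical.

```python
def bh_reject(pvals, q):
    """Benjamini-Hochberg step-up: flag the hypotheses rejected at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvals[index] <= rank * q / m:
            largest_rank = rank
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


def bky_two_stage(pvals, q=0.05):
    """Two-stage step-up: estimate the number of true nulls, then rerun BH."""
    m = len(pvals)
    q_stage1 = q / (1 + q)
    stage1 = bh_reject(pvals, q_stage1)
    n_rejected = sum(stage1)
    if n_rejected == 0 or n_rejected == m:
        return stage1
    m0_estimate = m - n_rejected  # estimated number of true null hypotheses
    return bh_reject(pvals, q_stage1 * m / m0_estimate)


# Hypothetical p values, one per lipid structure.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bky_two_stage(pvals)))
```

The second stage relaxes the threshold in proportion to how many hypotheses were rejected in the first, which gives slightly more power than a single Benjamini-Hochberg pass when many effects are real.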

Discussion

The objective of our study was to develop a user-friendly, analytically robust tool for estimating RTs of the major d18:X SL classes for Cer and SM, utilising our database of MRET behaviour in prior LC–MS/MS experiments, supplemented with literature sources, as the framework for building our model. Presently, no commercially available software or freeware incorporates RT data into its descriptor information to aid the identification of SLs in LC–MS/MS. However, it is strongly recommended that SL elution order be considered to reduce the likelihood of incorrect annotation of their identities16. The lack of RT incorporation is partly due to modelling overreliance on the specifics of the separation system used, with previous studies emphasising the need for future models to (1) be readily adaptable to differing experimental conditions; and (2) require as few reference values as possible28,32,43.

Applying MRET principles35, we demonstrate how ReTimeML is able to correctly assign various d18:0, d18:1 and d18:2 Cer/SM lipids based upon an understanding of their carbon chain length, degree of unsaturation and the C1 headgroup. We chose to focus on d18:X sphingoid bases as these comprise the most abundant SLs in mammalian organisms44. A similar approach has been developed for modelling the RTs of glycerophospholipids, assessing their equivalent carbon number (i.e., acyl chain composition), and expanding this across the different classes (e.g., phosphatidylcholine, phosphatidylethanolamine, phosphatidylserine)45. Both approaches have the distinct advantage of displaying no bias towards an MS-system (e.g., triple quadrupole (QqQ) or high-resolution orbitrap). This could have an added benefit for resolving Cer and SM lipids from global/untargeted LC–MS/MS analyses that employ the appropriate conditions for their separation46, particularly since high-resolution MS systems are without the added precision from MRM specifications to reduce TIC complexity. Though comparable, ReTimeML’s strength lies in its combination of learned material from previous experimental data, alongside the assignment of ‘real’ values (i.e., user-defined references, standards, internal controls) to enhance the decision-making process. This enabled ReTimeML to delineate common patterns for a given SL class, irrespective of LC–MS/MS methodology or experimental conditions, overcoming a major obstacle of RT adaptability for use in lipid identification software. Furthermore, ReTimeML is completely automated, making it accessible to those with limited knowledge of lipid biochemistry, and, thus, translatable to a base of researchers who may have shied away from the complex analyses.

Encouragingly, ReTimeML did not require an excessive number of references to estimate unknowns with a high degree of accuracy, meaning that users are not faced with the considerable costs of purchasing a copious number of standards. Moreover, for most experiments (excluding isocratic analyses), using n ≥ 4 reference RTs did not drastically improve the accuracy of unknowns, and in certain instances increased deviations from user RTs, presumably due to data overfitting (Fig. 6). We recommend that a minimum of n = 3 reference points, comprising different structural variants, be employed per SL class, as this provided the most accurate RT estimations (less than 1% variance for gradient and between 2.5 and 5.3% for isocratic, from user-defined measures, Fig. 6). ReTimeML’s lower accuracy of RT estimations in isocratic systems was attributed to the smaller amount of data available for model training, as well as increases in peak-broadening from using single solvent systems that make analyte detection more difficult, particularly when lipids of interest span a wide polarity range or require the separation of closely related species47,48. Hence, isocratic measures are more suited to low numbers of SL analytes (n < 10) to reduce complexity but have the advantage of reduced run time with no requirement of column equilibration prior to subsequent measurements49.

For most SL analyses, n = 3 reference RTs would be achievable by pooling together a given LC–MS/MS experiment’s routine internal controls per lipid class (e.g., Cer, SM), or the closest structural/chemical equivalents, that control for lipid extraction inconsistencies and LC–MS/MS normalisation across sample cohorts50,51, alongside calibrators to externally quantify samples and/or a QC mixture to control for instrument signal variability (e.g., ion suppression) and mitigate influences from matrix and batch effects during high throughput screening52,53.

Importantly, the accuracy of ReTimeML estimations was also adaptable to the reference material employed (Fig. 6), indicating that if a particular reference for a structure of interest (e.g., d18:2/XX:0) is not available, an alternate standard sourced from the same lipid class would suffice, provided our guidelines (n = 3 references, including structures) are retained. If standards cannot be procured, and we do acknowledge that certain classes of SLs (e.g., SM) are limited in their commercial availability, users can still choose to enter experimentally determined values from samples, though we strongly advise caution using this approach and recommend limiting this to RTs of SLs that are well-established or when signal interference is negligible to unequivocally resolve the correct RT on the TIC.

Herein we would also like to highlight ReTimeML’s proficiency at annotating RTs in complex TICs, particularly where ions of similar transitions interfere with correct peak annotation (Figs. 2 and 7), with their mass differentiation (~ 20 ppm) not achievable at the resolution of a QqQ system (~ 0.1–0.2 Da). ReTimeML correctly assigned RTs to the peaks of SM isobars (d18:1/XX:1 vs. d18:2/XX:0) using only the m/z 184 transition (Fig. 7A–C). This is particularly notable given this diagnostic ion is assigned only to detect the presence of a choline headgroup, with SM analyses requiring a secondary m/z 262, 264 or 266 scan(s) to determine the correct sphingoid backbone. This can be problematic in LC–MS/MS experiments for SMs as the choline headgroup is highly sensitive and its signal (up to 100× stronger) can suppress the detection of the sphingoid ion54, thereby limiting the structural information on SMs to the sum of its components (e.g., SM (36:2)) rather than at the fatty acyl/sphingoid base structure level (e.g., SM (d18:1/18:1))55. SM peak determination was further complicated by the presence of HexCer isotopic distribution, whose sphingoid transitions overlapped with di-unsaturated SMs (Fig. 7G–I). In the absence of RT annotation using ReTimeML, such signals could easily be misinterpreted, particularly if the m/z 184 transition was not in the MRM experiment or had already been assigned to a particular isobar (Fig. 7A,B).

Although a significant improvement in the resolution of SL variants, ReTimeML remains bound by the user’s LC setup. Should LC conditions not facilitate appropriate separation, interfering SL ions may be indistinguishable on the TIC. This has been shown to potentially cause artificial inflation of SL levels36, and represents a current constraint for ReTimeML in handling isomers of SLs (e.g., galactosyl- vs glucosylceramide, collectively referred to as HexCer), given the chromatography conditions ReTimeML was trained on were incapable of their separation. Additionally, misidentifications from indistinguishable transitions of hydroxylated variants, even at trace levels, could artificially inflate SL levels or cause misannotation (e.g., [M + H]+ for Cer (d18:1/24:1) and [M + H–H2O]+ for Cer (t18:1/24:0) share the same 648.6 → 264.3 transition). Though not a component of this study, resolution of these hydroxylated variants can be achieved under suitable reverse-phase conditions56, but they nonetheless remain an important consideration when establishing LC conditions. Appropriate separation is also pivotal when considering the application of ReTimeML for processing SLs using high-resolution MS. As mentioned above, these untargeted measures provide no selective bias (i.e., MRM) towards SLs of interest, which could potentially increase the risk of overlapping or interfering ions from other lipid species. Updated versions of ReTimeML will require training on additional setups that may circumvent these potential sources of interference, including normal-phase LC–MS/MS25 and next-generation ion mobility MS employing RT with collision cross-section57, capable of achieving additional SL class and structure interpretation.

ReTimeML was employed to aid the peak selection and structural annotation of Cer and SM lipids identified in HVs, providing both CSF and serum, allowing us to compare these body fluid profiles in the same subject. CSF is the closest anatomical fluid to the brain, likely to yield more applicable biomarkers for studying neuronal effects and conditions given its composition closely resembles that of the brain's extracellular space58,59,60. However, preconceived notions towards the invasiveness of the procedure (lumbar puncture) and the resulting distress to subjects have severely influenced its inclusion in clinical trials and broader use as a diagnostic fluid61,62. As SLs are enriched in the CNS and have exhibited a capacity to cross the blood–brain barrier63, their peripheral concentrations have the potential to act as surrogate markers in neurological and neuropsychiatric disorders64. To the best of our knowledge, this is the first reported CSF-serum SL comparison, and only the second paired blood-CSF lipid profiling investigation in HVs65.

Although Cer and SM concentrations in the CSF were found to be considerably lower than in serum, their relative structure-distribution patterns remained conserved and positively associated between the two body fluids (Fig. 8A–D). This is consistent with recently published CSF-plasma data65. Although Saito et al. reported contrasting results on overall lipid compositions, a closer inspection of their SM data revealed a similar conserved profile, with a positive correlation for the SM structural variants identified in their study (r = 0.811, average per lipid structure, n = 99; r = 0.760, individual SL pairwise CSF-plasma pairings, n = 2,079; both p values < 0.0001, Supplementary Fig. 2). Their number of Cer identities was insufficient for an effective comparison (data not shown).

A handful of our SL identities exhibited highly stringent correlations, conserved between the CSF and serum (Q < 0.05, Fig. 8E). Interestingly, all the positively associated comparisons were categorised as belonging to the same SM structural variant, which included the presence of a sphingadiene backbone (d18:2, m/z 262), together with a mono-unsaturated fatty acyl chain. Though the presence of d18:2 on SLs was first identified in the late 1960s66,67,68, it has taken major advancements in LC–MS/MS sensitivity to enable routine assessments of these noncanonical structures (reviewed in69), meaning our understanding of their functional importance is largely undetermined. Sphingadiene backbones are present in mouse kidney, brain, lung, and colon tissue SLs70, are reported to be the second most abundant sphingoid base in human plasma71, and are a common constituent in plants and fungi72,73. Natural (soy) sphingadienes have been reported to inhibit intestinal tumorigenesis in vivo, through disrupted Akt translocation74, and to reduce Wnt transcriptional activity in colon cancer cells75. Clinical investigations have reported that d18:2 SLs may provide protection against the development of obesity and the risk of diabetes and cardiovascular disease76,77. Our own research observed a marked accumulation of d18:2 SLs (unpublished), following the ablation of sphingosine kinase 2 (SphK2) in a mouse model for Alzheimer’s disease (J20), which led to severe defects in myelin integrity78. However, we never resolved whether this shift towards sphingadiene-based SLs was a primary cause of myelin disruption. SphK2 is the major isoform for catalysing the phosphorylation of sphingosine into sphingosine 1-phosphate (S1P), the penultimate step in SL lysosomal catabolism via irreversible degradation by S1P lyase1,2. This pathway has been reported to be less efficient for the clearance of d18:2 SLs than for their d18:1 counterparts in vitro4,70, presumably a consequence of the angled nature of the cis-double bond79. Furthermore, it has recently been shown that d18:2 predominantly converts to SMs over glycosphingolipids (i.e., HexCer)79. Hence, we speculate that the stronger associations observed for SM (d18:2/XX:1) profiles between these peripheral systems may be rationalised by their preferential formation from sphingadiene precursors and the accompanying stability from cis-double bonds on both the sphingoid base and fatty acyl chain.

In this study, we set ourselves the objective of developing freeware that could perform data transformation and feature engineering to estimate the RTs of Cer and SM identities from complex LC–MS/MS spectra, adaptable to the experimental conditions applied, and amenable to any level of LC–MS/MS experience and/or knowledge of the biochemistry of SLs. We believe that ReTimeML met this objective, assisting with the identification process across multiple LC–MS/MS conditions, including structural annotation and the removal of interfering RTs from isobaric and isotopic measures. While we recognise the advantages of employing ReTimeML, we acknowledge the existence of unforeseen circumstances and have highlighted probable instances where ReTimeML may not be applicable to a user’s LC–MS/MS design. Hence, it is always advisable for users to conduct secondary analyses/measures to confirm their SL identities, particularly if the existence of isobaric compounds or ion interferences within the TIC is likely, albeit not directly observed.

Moving forward, our objective is the continued optimisation of ReTimeML, ensuring it grows with mass spectrometry development and refining its ability to guide RT annotations for further classes of SLs, including non-canonical variants. We also plan to expand ReTimeML assessments into other lipid classes, searching relevant data repositories (e.g., MetaboLights, Metabolomics Workbench), and welcome support from users prepared to share their lipidomic data via the options provided (see ‘Data and Code availability’). In the long term, we envisage that with the incorporation of more machine-learned lipid class RT measures, ReTimeML could become an openly accessed and/or integrated function in current automated lipid identification software engines. In achieving routine RT annotations for SLs, we also drew attention to the physiological and pathophysiological importance of non-canonical SLs that are measurable with current mass spectrometric systems. It is hoped that our findings will prompt further scrutiny of their significance as molecular mediators in health and disease.

Materials and methods

Data integration

Regression models were fitted to Cer and SM data collated from our prior published material and that of the literature (Supplementary Data 2). Data incorporated met a minimum quantity of molecular and chromatography information, ensuring sufficient variance between descriptors for the models to be fitted accurately. The information incorporated includes the sphingoid base/fatty-acyl naming for Cer and SM55,80 and/or chemical formula of the sphingolipid, the m/z of the [M + H]+ precursor ion, together with a minimum of one fragmentation ion to aid the structural characterisation (e.g., m/z 264 for the sphingosine backbone), the chromatography system applied, stationary-phase column used, solvent conditions (gradient vs isocratic) and nominated flow rate. For each dataset, Cer and SM species were broken down into RTs that were either “user-defined” or “known” (e.g., standard), per study. The selected nominations were defined by each experimental dataset’s chosen controls (internal and QC), and by the compounds used for calibration or to optimise the LC–MS/MS parameters during method development.

Retention time learning algorithms

Linear and non-linear regression algorithms were assessed and evaluated based on their ability to predict RTs, alongside the number of training samples required to learn and achieve appropriate levels of accuracy. Variables associated with precursor mass, precursor mass squared, square root and the log of the precursor mass for each molecular sample were calculated for every included data point. Adopting the sphingoid base and fatty acyl notation55, a Python (version 3.10, Python Software Foundation) function was used to extract the molecular features, based on the nominated Cer/SM precursor mass and relevant fragmentation to deduce the carbon chain length, both on the sphingoid backbone and fatty acyl chain, the degree(s) of unsaturation, as well as specific modifications to the C1 head group (i.e., m/z 184 for choline phosphate of SM). Incorporated datasets were randomly divided into ‘training’ (70%) and ‘validation’ (30%) sets using the scikit-learn (sklearn) package for Python, which was also used to train linear, lasso and ridge regression models. The xgboost package for Python was used to train an XGBoost regression algorithm. RMSE and R2 were used to evaluate the performance of each regression model’s RT estimations. The number of training, and hence corresponding validation, data points was sequentially varied (± 3 datasets) to assess the minimum training size required for a given regression model to forecast RTs accurately (R2 > 0.95; RMSE < 0.25), with regression models ranked according to the number of training samples required to achieve these measurables. The complete list of regression models, along with R2 and RMSE scores per training size, is summarised in Supplementary Data 1.
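The model comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ReTimeML code: the data are synthetic placeholders rather than measured RTs, only the mass-derived descriptors are used, the feature standardisation step and the regularisation strengths are our own choices, and the XGBoost model is left out to keep the example short.

```python
import math
import random

from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def mass_features(precursor_mz):
    """Mass-derived descriptors: m, m squared, square root of m, ln(m)."""
    return [precursor_mz, precursor_mz ** 2,
            math.sqrt(precursor_mz), math.log(precursor_mz)]


# Synthetic placeholder data (NOT measured values): RT rises with m/z plus noise.
rng = random.Random(0)
masses = [500.0 + 4.0 * i for i in range(60)]
X = [mass_features(m) for m in masses]
y = [0.02 * m - 4.0 + rng.gauss(0.0, 0.05) for m in masses]

# Random 70/30 split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Standardisation and alpha values are illustrative assumptions.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.01, max_iter=100_000)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_val)
    rmse = math.sqrt(mean_squared_error(y_val, predicted))
    r2 = r2_score(y_val, predicted)
    # Acceptance thresholds taken from the text: R2 > 0.95 and RMSE < 0.25.
    results[name] = {"rmse": rmse, "r2": r2,
                     "passes": bool(r2 > 0.95 and rmse < 0.25)}

for name, scores in results.items():
    print(name, scores)
```

Repeating the loop while shrinking the training set would give the minimum training size at which each model still passes both thresholds, which is the ranking criterion described above.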

Ceramide and sphingomyelin data acquisition

Top-ranked regression models for Cer and SM were selected for secondary verification across four independently performed LC–MS/MS assessments spanning human fluids (CSF and serum), mouse liver and rat brain homogenates (all unpublished analyses). Human CSF and serum were provided by our biobank at the Central Institute of Mental Health, Mannheim, with both donated from the same healthy participants (n = 51), originally recruited at the Clinic and Outpatient Clinic of Psychiatry and Psychotherapy, University of Cologne. The Ethics Committee of the Medical Faculty Cologne, University of Cologne, Germany (00-053) approved the use of these samples for this research. Rat brain tissue was procured from our prior animal investigation exploring behavioural changes following different tetrahydrocannabinol preparations81, approved by the regional authority State Agency for Nature, Environment and Consumer Protection of the State North Rhine-Westphalia (LANIUV-NRW). Only tissue from placebo-administered rats was assessed. SLs from C57BL/6 mouse liver were extracted in accordance with protocols (#2019-033) approved by the Research Ethics and Governance Office, Royal Prince Alfred Hospital, Sydney, Australia.

SLs from human fluids were extracted using the conventional Bligh and Dyer method82, while rodent liver and brain tissue SLs were extracted using single-phase methanol/butanol (1:1 v/v83) and two-phase methyl-tert-butyl ether (MTBE)/methanol/water (10:3:2.5, v/v/v4), respectively. Prior to extractions, all samples were loaded with Cer (d18:1/17:0) and SM (d18:1/12:0) as internal controls. All samples underwent MRM analysis performed on a TSQ Altis QqQ mass spectrometer (ThermoFisher), coupled to a Vanquish UHPLC system, as previously described4. For SMs, secondary product ion scans for m/z 264 and 262 were included to help distinguish isobars (e.g., SM (d18:1/XX:1) and SM (d18:2/XX:0)) and the potentially conflicting phosphatidylcholine ions, which produce the same m/z 184 choline headgroup fragment. A complete list of Cer and SM lipids, together with their MRM transitions, is provided in Tables 1 and 2.
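The logic of using the secondary product ions to separate SM isobars from one another, and from phosphatidylcholines, can be sketched as below. This is an illustrative sketch only: the function name, the nominal m/z values used as keys and the 0.5 m/z matching tolerance are assumptions, not instrument settings from this study.

```python
# Diagnostic product ions named in the text (nominal m/z values).
BACKBONE_IONS = {264: "d18:1", 262: "d18:2"}
CHOLINE_ION = 184


def annotate_sm_candidate(product_ions, tolerance=0.5):
    """Return (has_choline_headgroup, backbone) for a list of product-ion m/z.

    backbone is None when no listed backbone ion is seen (as expected for a
    phosphatidylcholine yielding only m/z 184), or when both backbone ions
    are seen and the assignment is therefore ambiguous.
    """
    def seen(target):
        return any(abs(mz - target) <= tolerance for mz in product_ions)

    backbones = [label for mz, label in BACKBONE_IONS.items() if seen(mz)]
    backbone = backbones[0] if len(backbones) == 1 else None
    return seen(CHOLINE_ION), backbone


print(annotate_sm_candidate([184.1, 264.3]))  # choline headgroup with a d18:1 backbone
print(annotate_sm_candidate([184.1, 262.3]))  # choline headgroup with a d18:2 backbone
print(annotate_sm_candidate([184.1]))         # choline only, no sphingoid backbone ion
```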

Chromatographic and stationary phase conditions were also varied between experiments. SLs extracted from human fluids were resolved on a 3 × 150 mm Agilent XDB-C8 column (5 µm particle size), using a modified Hejazi et al.35 binary gradient as follows: 0 min, 20:80 A/B; 2 min, 20:80 A/B; 7 min, 13:87 A/B; 14 min, 0:100 A/B; 20.5 min, 0:100 A/B; 21 min, 20:80 A/B; 24 min, 20:80 A/B. Mobile phase ‘A’ consisted of 0.2% formic acid, 2 mM ammonium formate in water; mobile phase ‘B’ consisted of 0.2% formic acid, 1 mM ammonium formate in methanol. Total run time was 24 min, at a flow rate of 0.2 mL/min. Extracted SLs from mouse liver were separated on the same 3 × 150 mm Agilent XDB-C8 stationary phase, with modified gradient solvents and conditions as follows: 0 min, 20:80 A/B; 1 min, 20:80 A/B; 9 min, 5:95 A/B; 11 min, 0:100 A/B; 17.5 min, 0:100 A/B; 17.6 min, 20:80 A/B; 20.5 min, 20:80 A/B. Solvent ‘A’ comprised 0.1% formic acid, 2 mM ammonium acetate in water; solvent ‘B’ comprised 0.1% formic acid, 2 mM ammonium acetate in methanol. Total run time was 20.5 min at a flow rate of 0.3 mL/min. SLs from rat brain were resolved on a 2.1 × 100 mm Waters Acquity C18 UPLC column (1.7 µm particle size) under isocratic conditions, as previously described84, using methanol with 0.2% formic acid as the mobile phase, at a flow rate of 0.25 mL/min for 15 min.
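For readers wishing to relate an RT to the solvent composition at that point in the run, the human-fluid gradient can be encoded as a simple lookup. The sketch below assumes linear ramps between the listed time points, which the text does not state explicitly; the function name is hypothetical.

```python
# (time in min, % mobile phase B) for the human-fluid gradient listed above.
GRADIENT = [(0.0, 80.0), (2.0, 80.0), (7.0, 87.0), (14.0, 100.0),
            (20.5, 100.0), (21.0, 80.0), (24.0, 80.0)]


def percent_b(time_min, gradient=GRADIENT):
    """Percentage of mobile phase B at a given time, assuming linear ramps."""
    if not gradient[0][0] <= time_min <= gradient[-1][0]:
        raise ValueError("time is outside the programmed run")
    # Walk consecutive pairs of time points and interpolate within the match.
    for (t0, b0), (t1, b1) in zip(gradient, gradient[1:]):
        if t0 <= time_min <= t1:
            return b0 + (b1 - b0) * (time_min - t0) / (t1 - t0)


print(percent_b(4.5))   # midway through the 2 to 7 min ramp
print(percent_b(16.0))  # during the 100% B hold
```

The same list structure could hold the mouse liver gradient; the isocratic rat brain method needs no interpolation.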

ReTimeML interface

ReTimeML’s pilot version is available as a free, open-source web interface powered by Streamlit (https://mikeallwright23-retime-app-lipid3-021zpv.streamlit.app/). Users upload datasets in .csv format, consisting of the Cer/SM lipids of interest, alongside their precursor mass and whether each SL species included is a “Train” (reference with known RT included) or “Test” (unknown) value. Template .csv files are provided for users to test the interface (Supplementary Data 4 and 5), and can be adapted for their own SL analysis. Uploading data (drag and drop option) triggers the automatic calculation of RTs, utilising the nominated RTs as points-of-reference (Train) to guide the extrapolated unknowns (Test), with users free to amend the number of train/test values. Note that a minimum of two ‘Train’ values is required for ReTimeML to extrapolate an output. Functions are applied within the web interface to automatically pre-process each data field, using regex functionality in Python, to feature engineer the number of carbon atoms and degree(s) of unsaturation for structural components on the sphingoid and fatty acyl chain, programming these as one-hot encoded (OHE) variables, as well as the log of the mass, mass squared and square root of the mass. In addition, ReTimeML provides an MRET profile output of estimations, annotated using second-order polynomial trendlines. The web interface also gives users the option to voluntarily upload their own RT estimations, so that our team can evaluate and incorporate them into our working models.
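The pre-processing steps described above can be sketched as follows. This is a minimal illustration, not the code behind the web interface: the regular expression, the function names and the row layout are hypothetical, and the pattern handles only plain names such as "Cer(d18:1/24:1)" or "SM (d18:2/16:0)", without hydroxylated fatty acyl chains.

```python
import math
import re

# Hypothetical pattern: class, hydroxylation prefix (m/d/t), then
# carbons:double bonds for the sphingoid base and the fatty acyl chain.
NAME_RE = re.compile(r"^(Cer|SM)\s*\(([mdt])(\d+):(\d+)/(\d+):(\d+)\)$")


def parse_name(name):
    """Extract structural features from a sphingoid base/fatty acyl name."""
    match = NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognised sphingolipid name: {name!r}")
    lipid_class, hydroxyls, base_c, base_db, acyl_c, acyl_db = match.groups()
    return {
        "class": lipid_class,
        "hydroxyls": hydroxyls,
        "base_carbons": int(base_c),
        "base_double_bonds": int(base_db),
        "acyl_carbons": int(acyl_c),
        "acyl_double_bonds": int(acyl_db),
    }


def one_hot(value, categories):
    """One-hot encode a value against an ordered list of categories."""
    return [1 if value == category else 0 for category in categories]


def mass_terms(precursor_mz):
    """Log of the mass, mass squared and square root of the mass."""
    return [math.log(precursor_mz), precursor_mz ** 2, math.sqrt(precursor_mz)]


def split_rows(rows):
    """Split (name, precursor_mz, role) rows into Train and Test lists."""
    train = [row for row in rows if row[2] == "Train"]
    test = [row for row in rows if row[2] == "Test"]
    if len(train) < 2:
        raise ValueError("at least two 'Train' rows are required")
    return train, test


features = parse_name("SM (d18:2/24:1)")
print(features)
print(one_hot(features["base_double_bonds"], [0, 1, 2]))
```

One-hot encoding treats each chain length or double-bond count as its own category, so the model does not have to assume that RT changes by a fixed amount per added carbon or double bond.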

Statistical analysis

Verified human Cer and SM lipid datasets underwent peak integration using Xcalibur 4.4.16.14 software (ThermoFisher Scientific, San Jose, CA, USA), with Cer and SM species normalised as ratios to their class-specific internal control. A separate Cer/SM mixture, comprising various compounds (Supplementary Table 3), was run every 20 samples. This external mixture acted as a QC and provided additional RT references for Cer and SM. All Cer/SM mixtures were prepared in aqueous/organic proportions reflecting the starting conditions of the respective LC–MS/MS experiments. All Cer and SM values were first log-transformed (natural log) to obtain a normal distribution for Pearson correlations (r). For the assessment of individual lipids between CSF and serum, the resultant p values were adjusted for multiple comparisons using the Benjamini, Krieger, and Yekutieli false discovery rate (FDR) approach, with Q < 0.05 considered significant (GraphPad Prism software, Version 10.0.3, Dotmatics, Boston, MA, USA).
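The normalisation and correlation steps can be sketched in a few lines. The analysis reported here was performed in GraphPad Prism, so the code below is only an illustrative equivalent: the function names are hypothetical, the input values are arbitrary placeholders, and the FDR adjustment is not reproduced.

```python
import math


def normalise(peak_areas, internal_standard_areas):
    """Express each lipid peak area as a ratio to its internal control."""
    return [area / standard
            for area, standard in zip(peak_areas, internal_standard_areas)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


def log_pearson(csf_ratios, serum_ratios):
    """Natural-log transform both series, then compute Pearson r."""
    return pearson_r([math.log(v) for v in csf_ratios],
                     [math.log(v) for v in serum_ratios])


# Placeholder ratios (NOT study data): serum is a constant multiple of CSF,
# so the log-transformed values are perfectly correlated.
print(log_pearson([1.0, 2.0, 4.0, 8.0], [10.0, 20.0, 40.0, 80.0]))
```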