High sensitivity analysis of nanogram quantities of glycosaminoglycans using ToF-SIMS

Glycosaminoglycans (GAGs) are important biopolymers that differ in the sequence of saccharide units and in post-polymerisation alterations at various positions, making these complex molecules challenging to analyse. Here we describe an approach that enables small quantities (<200 ng) of over 400 different GAGs to be analysed within a short time frame (3–4 h). Time-of-flight secondary ion mass spectrometry (ToF-SIMS) together with multivariate analysis is used to analyse the entire set of GAG samples. The resultant spectra are derived from the whole molecules, and the samples do not require pre-digestion. All 6 possible GAG types are successfully discriminated, both alone and in the presence of fibronectin. We also distinguish between pharmaceutical grade heparin derived from different animal species and from different suppliers, to a sensitivity as low as 0.001 wt%. This approach is likely to be highly beneficial in the quality control of GAGs produced for therapeutic applications and for characterising GAGs within biomaterials or from in vitro cell culture.

1 μl of a 2 mg/ml solution of either HS or HA was manually pipetted onto each substrate and allowed to air dry. These two GAGs were selected in order to assess both sulphated and non-sulphated biomolecules. The dried spots were then incubated overnight in 10 ml of water, before drying and analysis by ToF-SIMS. The intensity of both positively and negatively charged characteristic ions for the GAGs was compared on as-received materials, after deposition of GAGs and after washing (Supplementary Figures 1-2). For all substrates and for both GAGs, the intensity of the characteristic ions was low for as-received materials and increased after the addition of either HS or HA. On almost all samples the characteristic ions returned to intensities similar to those of the as-received materials after washing, suggesting that the washing procedure had completely removed the GAG. The exceptions were the cationic samples, poly-L-lysine (PLL) and allylamine plasma polymer, both of which maintained an intensity of characteristic GAG ions above baseline levels after washing. The PLL-coated glass slide maintained a higher intensity of characteristic peaks for GAGs after washing than the allylamine plasma polymer coating and was thus selected for formation of subsequent arrays.
cycle was replaced in each print run by the use of two nozzles, one that delivered a GAG solution and the other that delivered water.
The resulting array is shown in Supplementary Figure 3. All spots were visible after drying. The array was analysed by SIMS and regions of interest were used to extract spectra for each spot. The resulting spectra were analysed by PCA to reduce the dimensionality of the dataset. A sparse feature dataset was created by recursive feature elimination, using the total separation of the pure GAGs as a weighting (at each step, the feature eliminated was the one whose removal produced the largest separation of points in PC1 and PC2). The dataset was successfully reduced to 4 variables without a significant reduction in sample separation. The final scores plot for PC1 and PC2 is shown in Supplementary Figure 4. For each of the single GAGs, 7 repeats were used for the PCA, whilst 3 samples were used as a test set. The PCA showed that the 4 single GAGs were readily chemically discernible, with the clusters associated with replicates of each GAG being clearly distinct. In all cases the test sets fell within the 95 % confidence ellipses determined for the training samples, suggesting that the variance captured by PC1 and PC2 was real and that the data had not been over-fitted. Mixed GAG samples, when plotted onto the scores plot, fell outside the ellipses determined for the single GAG samples; however, their positions between the ellipses for the single samples did not scale quantitatively with composition. This is expected due to matrix effects compromising the quantitative nature of SIMS data (ref. 16).
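The over-fitting check described above (test samples falling within the 95 % confidence ellipse of the training samples) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the ellipse is defined by the sample covariance of the (PC1, PC2) training scores and a chi-squared threshold with 2 degrees of freedom, and the scores and function names are hypothetical. For small numbers of repeats a Hotelling T² threshold would give a somewhat larger ellipse.

```python
# Assumed threshold: 95th percentile of chi-squared with 2 degrees of freedom.
CHI2_95_2DOF = 5.991


def mean_and_cov(points):
    """Return the mean and the 2x2 sample covariance of (PC1, PC2) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def inside_ellipse(point, mean, cov, threshold=CHI2_95_2DOF):
    """True if the squared Mahalanobis distance of point is within threshold."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form with the inverse of the 2x2 covariance matrix.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 <= threshold


# Hypothetical training scores for one sample (7 repeats).
train_scores = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
                (0.5, 0.5), (0.2, 0.8), (0.8, 0.2)]
mean, cov = mean_and_cov(train_scores)

print(inside_ellipse((0.6, 0.4), mean, cov))  # test point close to the cluster
print(inside_ellipse((3.0, 3.0), mean, cov))  # test point far from the cluster
```

A model would be flagged as over-fitted under this check if the test repeats of a sample fell outside the ellipse built from that sample's training repeats.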
The 4 ions used for the PCA, their possible assignments and their loadings for PC1 and PC2 are shown in Supplementary Table 1. Apart from the ion SNO-, the assignment of the ions is ambiguous due to the mass resolution of ToF-SIMS. However, a number of the possible assignments include sulphur, suggesting that the variance captured by the PCA is associated with the sulphation pattern of the GAGs.
Supplementary Figure 3. Brightfield image of a GAG microarray, pre-printed with 15 nL of water and then a total volume of 17 nL of 5 mg/ml GAG solutions, either hyaluronic acid (HA), heparan sulphate (HS), chondroitin sulphate (CS) or dermatan sulphate (DS). The bar below each sample indicates the amount of each GAG type added, in proportion to the fraction coloured. Array printed at 65 % relative humidity. The 5×5 array of mixed GAGs was repeated 4 times.

Supplementary Note 3. PCA sparse dataset generation optimisation
In order to optimise the generation of a sparse dataset for PCA, a concentration series of heparin samples, prepared by mixing heparin derived from porcine mucosa (PM) and from bovine lung (BL), was printed onto an array and analysed by ToF-SIMS. There were 11 technical repeats of each sample, and these were split into training and test sets at a training:test ratio of 8:3. Principal component analysis was then performed on the full dataset and on sparse datasets generated by either recursive feature elimination (RFE) or recursive feature addition (RFA). Three selection criteria were used: minimisation of the overlap of 95% confidence ellipses, or maximisation of the distance between the means of the sample sets in either Euclidean or Mahalanobis geometry. The total number of features selected was the minimum number required to optimise the associated cost function, or to reach an equilibrium state at which the addition of further features produced no further separation of sample sets. The resulting scores plots are shown in Supplementary Figure 5.
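The feature addition procedure can be illustrated with a simplified greedy sketch. Two assumptions depart from the description above and are made only to keep the example short: the cost function (sum of pairwise Euclidean distances between sample set means) is evaluated directly on the selected features rather than on PC1 and PC2 scores, and addition stops when the gain falls below a hypothetical tolerance `min_gain`. The toy intensities and all function names are illustrative.

```python
import math
from itertools import combinations


def group_means(groups, features):
    """Mean of each sample set over the chosen feature indices."""
    return {
        label: [sum(row[f] for row in rows) / len(rows) for f in features]
        for label, rows in groups.items()
    }


def separation(groups, features):
    """Sum of pairwise Euclidean distances between sample set means."""
    means = group_means(groups, features)
    return sum(math.dist(means[a], means[b])
               for a, b in combinations(sorted(means), 2))


def forward_addition(groups, n_features, min_gain=1e-3):
    """Greedily add the feature that increases the separation the most."""
    selected, best = [], 0.0
    remaining = set(range(n_features))
    while remaining:
        scores = {f: separation(groups, selected + [f]) for f in remaining}
        f_best = max(sorted(scores), key=scores.get)
        if scores[f_best] - best < min_gain:
            break  # equilibrium: further features add no separation
        selected.append(f_best)
        best = scores[f_best]
        remaining.remove(f_best)
    return selected, best


# Toy training data: only feature 0 differs between the two sample sets.
training = {
    "PM": [[1.0, 5.0, 2.0], [1.2, 5.1, 2.0], [0.8, 4.9, 2.0]],
    "BL": [[3.0, 5.0, 2.0], [3.2, 4.9, 2.0], [2.8, 5.1, 2.0]],
}
selected, best = forward_addition(training, n_features=3)
print(selected, round(best, 3))
```

Recursive feature elimination follows the same pattern in reverse, starting from all features and removing the one whose removal most improves the cost function.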
Only the 100% BL heparin was separated from all other samples using the entire dataset (Supplementary Figure 5a-b), with or without variance scaling. With variance scaling, the separation of 100% BL heparin from all other samples was aligned with PC1 (Supplementary Figure 5a), whereas without variance scaling the separation was a combination of PC1 and PC2 (Supplementary Figure 5b). PCA of a sparse dataset generated by RFA using the minimisation of overlapping 95% confidence ellipses as a cost function enabled all samples to be separated to 95% confidence after consideration of PC1 and PC2 (Supplementary Figure 5c). However, this PCA model was over-fitted, as the test sets did not fall within the 95% confidence ellipses of their respective samples. Randomly selecting a different training and test set and reapplying the PCA to the same sparse dataset described the samples without over-fitting (all test data fell within the 95% confidence limit associated with the training set); however, only the 100% and 50% BL heparin samples were successfully separated. This suggested that minimising overlap as a cost function was highly effective at identifying features that could differentiate between samples to 95% confidence but was susceptible to over-fitting, owing to the selection of features that describe random variance within sample sets rather than features that differentiate between sample sets.
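As a brief illustration of variance scaling, the sketch below mean-centres each feature and divides it by its sample standard deviation, so that high-intensity ions do not dominate the principal components. It assumes that "variance scaling" here means this unit-variance scaling (autoscaling); the intensities and the function name are hypothetical.

```python
import statistics


def autoscale(rows):
    """Mean-centre each column and divide by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return [
        # Constant columns (zero spread) are mapped to 0.0.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


# Two hypothetical ions with very different absolute intensities.
spectra = [[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]]
scaled = autoscale(spectra)
print(scaled)
```

After scaling, both ions span the same range and therefore contribute equally to the variance that PCA decomposes.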
Sparse datasets generated by RFA using the maximisation of the distance between the means of sample sets as a cost function successfully identified features that described variance between the different sample sets, as indicated by reduced overlap of sample sets in the scores plots of PC1 and PC2 (Supplementary Figure 5e-f). This was observed for both Euclidean and Mahalanobis geometry. However, PCA of either dataset was unable to differentiate between all of the sample sets to 95% confidence. No sample sets could be differentiated using PCA of the dataset generated using Mahalanobis geometry (Supplementary Figure 5f), whereas the 100% BL heparin sample could be differentiated from all other samples using the dataset generated using Euclidean geometry (Supplementary Figure 5e). Although no other sample set could be differentiated from all other samples, this dataset was able to differentiate between some samples. For example, both the 50% and 0% BL heparin samples were successfully differentiated from 5 of the other sample concentrations. Moreover, no over-fitting was observed. Therefore, maximising the distance between sample set means within Euclidean geometry selected features that described the variance between samples, although separation to 95% confidence was not achieved in all cases.
RFE was also explored as a route to generating a sparse dataset. Use of the minimisation of overlap of 95% confidence limits produced a sparse dataset that did not separate samples better than the original dataset. In this case the feature selection was not over-fitted. However, this approach selected for features that described different variance within sample sets rather than variance between sample sets, as indicated by the elongation and varied orientation of the ellipses in the scores plot of PC1 and PC2 without achieving separation between the samples (Supplementary Figure 5g).
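The difference between the two geometries can be made concrete with a small sketch. It assumes a two-dimensional scores space and a Mahalanobis distance computed with the pooled within-set covariance of the two sample sets being compared; whether the original analysis pooled the covariance in this way is not stated, so this is an assumption, and the data are hypothetical.

```python
import math


def mean2(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter2(points):
    """Sums of squares and cross-products about the set mean."""
    mx, my = mean2(points)
    sxx = sum((p[0] - mx) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    return sxx, sxy, syy


def pooled_cov(a, b):
    """Pooled within-set covariance (sxx, sxy, syy) of two sample sets."""
    dof = len(a) + len(b) - 2
    return tuple((sa + sb) / dof for sa, sb in zip(scatter2(a), scatter2(b)))


def euclidean(a, b):
    """Euclidean distance between the means of two sample sets."""
    return math.dist(mean2(a), mean2(b))


def mahalanobis(a, b):
    """Distance between the means, scaled by the pooled covariance."""
    sxx, sxy, syy = pooled_cov(a, b)
    det = sxx * syy - sxy ** 2
    (ax, ay), (bx, by) = mean2(a), mean2(b)
    dx, dy = bx - ax, by - ay
    return math.sqrt((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)


# A set that is broad along x and narrow along y, and two shifted copies.
base = [(0.0, 0.0), (2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
along_x = [(x + 2.0, y) for x, y in base]  # shifted along the broad axis
along_y = [(x, y + 2.0) for x, y in base]  # shifted along the narrow axis

print(euclidean(base, along_x), euclidean(base, along_y))
print(mahalanobis(base, along_x), mahalanobis(base, along_y))
```

Both shifted sets lie at the same Euclidean distance from the base set, but the Mahalanobis distance is larger for the shift along the axis with less within-set spread, which is why the two cost functions can select different features.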
RFE using the maximisation of the distance between sample sets as a cost function identified a sparse dataset that was not able to differentiate between samples as well as the respective RFA sparse dataset for both Euclidean and Mahalanobis geometries (Supplementary Figure 5h and i).
Of the 140 different ions selected for the 6 different sparse datasets, 28 ions were common to at least 2 different datasets. These ions are listed in Supplementary Figure 5i. A comparison of common ions between sparse datasets is shown in Supplementary Table 1. A maximum of 9 common features was identified when using the distance between sample means in Euclidean geometry as a cost function for RFE and RFA. A similar level of commonality was not observed for sparse datasets generated with the other two cost functions when using RFE or RFA: only 2 common features were identified for minimising overlap of confidence ellipses and 0 common features were identified for maximising the distance between sample means using Mahalanobis geometry.
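The comparison of common ions between sparse datasets amounts to set intersections, as in the sketch below. The dataset names and ion labels are placeholders invented for illustration and do not correspond to the ions actually selected.

```python
from collections import Counter
from itertools import combinations


def shared_features(datasets):
    """Map each pair of dataset names to the features they have in common."""
    return {
        (a, b): datasets[a] & datasets[b]
        for a, b in combinations(sorted(datasets), 2)
    }


def features_in_at_least(datasets, k=2):
    """Features that appear in at least k of the sparse datasets."""
    counts = Counter(f for feats in datasets.values() for f in feats)
    return sorted(f for f, c in counts.items() if c >= k)


# Placeholder sparse datasets (hypothetical names and ion labels).
sparse = {
    "RFA_euclid": {"ion_a", "ion_b", "ion_c"},
    "RFE_euclid": {"ion_b", "ion_c", "ion_d"},
    "RFA_overlap": {"ion_e"},
}

pairs = shared_features(sparse)
common = features_in_at_least(sparse, k=2)
print(len(pairs[("RFA_euclid", "RFE_euclid")]), common)
```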
Minimising ellipse overlap was the most successful at producing a dataset that could separate samples to 95% confidence using PCA, but was prone to over-fitting or to selecting features that described variance within a dataset. Maximising the distance between sample means using Euclidean geometry was therefore used first, as a cost function for RFA, to generate a sparse dataset containing only features that describe the separation between sample sets. RFE using the minimisation of ellipse overlap was then applied to generate a secondary sparse model. Using this approach, a sparse dataset was produced with optimal separation of samples that was not over-fitted. To further ensure that this approach did not over-fit data, it was applied to generate a sparse dataset for randomly produced data. A total of 1,200 initial features were included, which exceeded the typical number of positive and negative ions identified from a ToF-SIMS spectrum. No separation of samples was observed after generating a sparse dataset (Supplementary Figure 6), further confirming that the method was not over-fitting the data. This approach was applied for the further generation of sparse datasets.
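The random-data control can be sketched as follows. The example is deliberately simplified and is not the two-stage RFA/RFE procedure described above: the selection step is a stand-in that keeps the k features with the largest difference between training means, the data are Gaussian noise with no real difference between the two sample sets, and all names, the seed and the choice of k are hypothetical. It only illustrates the logic of the control, namely that any separation found on random training data should be compared against held-out repeats.

```python
import math
import random


def random_groups(n_repeats, n_features, seed=0):
    """Two sample sets of pure Gaussian noise (no real difference)."""
    rng = random.Random(seed)
    return {
        label: [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
                for _ in range(n_repeats)]
        for label in ("A", "B")
    }


def column_means(rows):
    """Mean of every feature across the repeats of one sample set."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def mean_gap(groups):
    """Per-feature difference between the two sample set means."""
    return [b - a for a, b in zip(column_means(groups["A"]),
                                  column_means(groups["B"]))]


def separation(groups, features):
    """Euclidean distance between the two means over chosen features."""
    gap = mean_gap(groups)
    return math.sqrt(sum(gap[f] ** 2 for f in features))


def top_k_features(groups, k):
    """Stand-in selection step: the k features with the largest mean gap."""
    gap = mean_gap(groups)
    return sorted(range(len(gap)), key=lambda f: abs(gap[f]), reverse=True)[:k]


data = random_groups(n_repeats=11, n_features=1200)
train = {g: rows[:8] for g, rows in data.items()}  # 8 training repeats
test = {g: rows[8:] for g, rows in data.items()}   # 3 test repeats

selected = top_k_features(train, k=5)
train_sep = separation(train, selected)
test_sep = separation(test, selected)
print(f"training separation: {train_sep:.2f}, held-out separation: {test_sep:.2f}")
```

A selection method that is prone to over-fitting would report a large apparent separation on the training repeats of such data that is not reproduced by the held-out repeats.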