Multi-block data integration analysis for identifying and validating targeted N-glycans as biomarkers for type II diabetes mellitus

Plasma N-glycan profiles have been shown to be defective in type II diabetes Mellitus (T2DM) and holds a promise to discovering biomarkers. The study comprised 232 T2DM patients and 219 healthy individuals. N-glycans were analysed by high-performance liquid chromatography. The multivariate integrative framework, DIABLO was employed for the statistical analysis. N-glycan groups (GPs 34, 32, 26, 31, 36 and 30) were significantly expressed in T2DM in component 1 and GPs 38 and 20 were related to T2DM in component 2. Four clusters were observed based on the correlation of the expressive signatures of the 39 N-glycans across T2DM and controls. Cluster A, B, C and D had 16, 16, 4 and 3 N-glycans respectively, of which 11, 8, 1 and 1 were found to express differently between controls and T2DM in a univariate analysis \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(p < 0.05)$$\end{document}(p<0.05). Multi-block analysis revealed that trigalactosylated (G3), triantennary (TRIA), high branching (HB) and trisialylated (S3) expressed significantly highly in T2DM than healthy controls. A bipartite relevance network revealed that HB, monogalactosylated (G1) and G3 were central in the network and observed more connections, highlighting their importance in discriminating between T2DM and healthy controls. Investigation of these N-glycans can enhance the understanding of T2DM.

Inclusion criteria. T2DM was established based on the international classification of disease 10 (ICD- 10) criteria and known history of anti-diabetes medication use. The controls, however, were individuals who were not suffering from T2DM and/or hypertension and had no history of anti-diabetes or antihypertensive medication use. In both groups, we excluded participants who were suffering from other chronic diseases related to the genitourinary, digestive, respiratory and haematological systems. The age range for all participants was 30-80 years.
Anthropometric examination. Participants supplied their demographic information by completing a brief questionnaire after which anthropometric measurements including weight, height, Body mass index (BMI), Waist-to-hip ratio (WHR), Waist-to-height ratio (WHtR), systolic blood pressure (SBP) and diastolic blood pressure (DBP) were measured by standard methods (Fig. 1).
Clinical data. Briefly N-glycan release and labelling. Plasma samples were first randomised on multiple plates to avoid bias and experimental errors. Plasma samples aliquoted in 96-well plates were denatured, following which, glycans were released, fluorescently labelled, purified/washed/cleaned up as described in our previous studies 5, 31,32 . Hydrophilic interaction liquid chromatography on a Waters Acquity ultra-performance liquid chromatography (UPLC) instrument (Waters Corporation, Milford, MA, USA) was employed for the separation and analysis of eluted glycans. This high throughput instrument generated a total plasma N-glycome chromatogram of 39 N-glycan peaks. Each glycan peak's relative abundance was expressed as a percentage of the total integrated area (Fig. 2). Twenty-one (21) derived traits were calculated then calculated from the 39 N-glycan peaks (Supplementary Table 1).

Statistical analysis.
Batch correction and normalisation on the UPLC data was performed to control for non-biological variability. To explore batch effects, data tables were created for each plate that compared T2DM, age and gender. Glycome experiments were designed to control for important factors between the different plates. These factors (type II diabetes vs healthy control, male vs female, age separated at the median, timepoint of data collection) were evenly distributed among the experimental batches. During data processing, experimental artefacts were removed by using the ComBat method for batch correction. Thereafter, data was normal-  33 . Kolmogorov Smirnoff test and QQ plots was viewed to ascertain where the data was normally distributed or not. Continuous data was represented as mean ± standard deviation (Mean ± SD) while categorical variables were expressed as frequencies (percentages). Groups comparisons for continuous variables performed either by Student-t tests or Mann-Whitney U-tests whereas categorical variables were compared using Chi-square tests. The Benjamini-Hochberg (BH) method was used to control the false discovery rate (q). Spearman correlational analysis was carried out to establish associations among the biochemically N-glycan measurements. Agglomerative hierarchical clustering was derived using the Euclidean distance as the similarity measure and Ward methodology. The dendrogram for the columns indicated four possible clusters for the biochemical measurements.
Multivariate integrative framework. DIABLO extends the ideas of sparse generalized canonical correlation analysis (sGCCA). Let X (1) , . . . , X (J) denote J normalized, cantered, and scaled datasets of dimensions (N × P 1 ), . . . , N × P J , measuring the expression levels of P 1 , . . . , P J multi-block variables on the same sample N . sGCCA identifies relevant dimensions, d = 1, . . . , D, of the multi-block dataset by maximizing the variancecovariance function  . C = c i,h i,h is a J × J design matrix that indicates the connections among multi-block dataset. Elements in C can be interpreted as correlations where zero indicates that the blocks of data are not connected, and one indicates that they are fully connected. (j) is a non-negative parameter that controls the amount of shrinkage, indicating the number of non-zero coefficients in a (j) d , for each component score s The data was partitioned to create two separate sets of data, one for training the models and one for testing their predictive performance. This division occurred at an 80/20 proportion of the data. K-fold cross-validation (CV) was used to evaluate and compare the different models to each other. Analysis was performed in R statistical software. DIABLO was implemented in the 'mixOmics' R Bioconductor package which has functions for parameters' choice and visualization to assist in the interpretation of the integrative analyses. Table 1. Characteristics of participants with and without T2DM. Data presented as Mean ± SD and n (%). ^χ 2 test of independence, t Student's t-test, u Mann Whitney U tests. Tests of significance were two tailed (*p < 0.05); **q < 0.05F significant after correction for FDR and are bold.

Results
The demographic and anthropometric information detailed in Table 1 shows that there were more female participants (61.4%), along with a mean age of controls and cases been 56.54 ± 9.89 and 55.10 ± 9.27, respectively. Majority of the participants in both groups had some form of education (χ 2 = 9.83, q = 0.0812) and employment (χ Discriminating signature for N-glycans on T2DM and healthy controls. The nature of associations and patterns of clustering for the 39 N-glycans were explored for cases and controls. Hierarchical cluster analysis identified four clusters in the correlation of the expressive signatures of the 39 N-glycans across healthy controls and T2DM cases (Fig. 3A). For example, cluster A had 16 N-glycans (Table 2), of which 11 were found to express significantly different (Fig. 3B) between healthy controls and T2DM cases (K = 16; k = 11, p < 0.05) . Similarly, 16 glycans were clustered as B, of which 8 were found to express significantly different between healthy controls and T2DM cases (K = 16; k = 8, p < 0.05) . Cluster C and D had 4 and 3 glycans respectively and in each of them only 1 was found to express significantly different between healthy controls and T2DM cases ( Feature selection is important in the refinement of biological and biochemical hypotheses. We identified a combination of discriminative features from a disparate block of glycans. N-glycans loaded differently along the two principal components (PC), with estimates of positive and negative weights (Fig. 4). A large absolute value indicates the importance of the variable to the PC and the colour codes indicate how prominent the biomarker expressed in the cases of Type II diabetes mellitus and healthy controls. To discriminate between T2DM and healthy controls, the optimal model identified the N-glycans signatures and expressed their contributions in classifying between T2DM and healthy controls for 10 out of 39 N-glycans in each component. The top 10

Multivariate integrative analysis.
We investigated the clustered image map (CIM) to highlight the strength and direction of pair-wise association structures between the two groups and the canonically derived traits. We then selected the important features between multi-block derived traits of the N-glycan measurements. CIM based on a hierarchical clustering simultaneously operated on the rows and columns using a similarity matrix to produce a 2-dimensional coloured image (Fig. 5). The dendrogram for the columns indicated six possible clusters for the canonically derived traits of biochemical N-glycans measurements, reflecting the six Multi-block analysis of the canonically N-glycan derived traits is presented in a circos plot (Fig. 6A), with links between or within blocks indicating positive and negative correlations at a cut-off correlation of |0.5|. This threshold was arbitrarily chosen to obtain interpretable networks that were neither too sparse nor too dense. We observed significant difference of expression (p < 0.05) of S1, BAMS, A2G and G2 being expressed highly in the healthy control group compared to T2DM cases, whereas G3, TRIA, HB and S3 expressed significantly highly in T2DM cases compared to healthy controls (Fig. 6A).
Using a pairwise similarity matrix directly obtained from outputs of sGCCA and PLS, bipartite network was inferred (Fig. 6B). One relevant component was obtained when the threshold was set to 0.5, linking the corresponding correlated subsets in the independent and dependent data. The network model representing the bivariate partial correlation matrix between the 21 canonically derived traits, comprises both positive (red lines) and negative (green lines) correlations (Fig. 6B). HB, G1 and G3 were central in the network and observed more connections than the others, highlighting their importance in discriminating between T2DM and healthy controls.

Discussion
Comprehensive understanding of spectral N-glycan data from UPLC analysis is anchored on advanced statistical methods. Integrative methods offer comprehensive means to dissect data, with the goal of transforming the data into a clinically useful information. For the first time, we have applied a powerful and advanced integrative method DIABLO, to explore N-glycan profiles interaction in real time.
Prior to applying the DIABLO method, univariate, and multivariate statistical methods (e.g., student t tests and Mann Whitney U tests and chi-square test) have been used to reveal the association between T2DM and biochemical measures such as plasma glucose and lipid profiles. Surprisingly, the control group had a higher blood pressure than the cases, and this can be attributed to the medication use (glucose, lipid and blood pressure lowering drugs) among the cases (Table 1). Moreover, this highlights the proportion of the population who have www.nature.com/scientificreports/ raised blood pressure, but they are unaware of it. WHO reports that an estimated 45% of hypertensive adults are not aware of it, although the control group in the current study cannot be said to be hypertensive. This is because hypertension is established after repeated measures of blood pressure above normal threshold (140 mmHg). In the present study, blood pressure was only measured once. It is not clear why the control group had a lower HDL-c but it may be attributed to genetic factors or defects in cholesterol efflux. Medication use in T2DM can potentially affect their N-glycome. Singh et al. 30 found that statin use was linked to a decrease in all fucosylated traits including diantennary and triantennary structures (A2EF, A2LF, A3EF, A3L0F). In addition, statin use was associated with an increased galactosylation in diantennary non-fucosylated (A2F0G) and in sialylated diantennary (A2SG) glycans. The study further stated that statin use negatively correlated with Alpha2,6-sialylation of triantennary (A3E) and fucosylated tetra-antennary glycans (A4FGE). Similarly, metformin correlated with a decreased fucosylation in diantennary, triantennary and tetra-antennary traits and an increase of galactosylation in diantennary glycans 34 .It is widely known that T2DM develops several years before clinical diagnosis. Mild symptoms such as weight loss or weight gain, fatigue, increased hunger would progressively result in persistent high plasma glucose and complications. However, because of limited sensitive, and robust biomarkers, T2DM diagnosis is often delayed. This problem appears to be solved with the advent of N-glycans. First, the GST2D score was used to predict T2DM development 6-8 years before clinical manifestation 35 . In another study, Cvetko et al. 36 reported that individuals who were healthy at baseline but developed insulin resistance and T2DM over time, were characterised by complex and highly branched N-glycan structures. Specifically, the study identified alterations in eight N-glycans: GP10, GP16, GP18, GP19, GP20, GP26, GP32 and GP34 36 ; with GP 32 and GP34 being the most significant in the continuum of insulin resistance and T2DM. Increasing evidence shows that T2DM patients can be distinguished from healthy individuals depending on the composition of their respective total N-glycome 5,18 . Thus, we explored the N-glycan traits whose expression were different in cases and controls. The present study validates that of Cvetko et al. 36 and Clemens et al. 35 , we identified GPs 34, 32, 26, 31, 36 and 30 to be highly expressed in T2DM in the first principal axis and on the second principal axis, GPs 38, 1, 2, 25 and 20 were dominant in T2DM. Sialylated glycans (GP26, GP32, GP35 and GP36) are expressed on a1-acid glycoprotein, whereas GP18 and GP20 originates from glycoproteins a-antitrypsin. A-antitrypsin is a protease inhibitor with at least three glycosylation sites for biantennary glycans without fucosylation (site asparagine 70), bi-, tri-and tetra-antennary glycans with core and antennary fucosylation (at site asparagine 107) and site asparagine 271 is occupied by bi-and tri-antennary glycans with core-and antennary-fucosylation 37 . A-antitrypsin protects β-cells from apoptosis and triggers insulin secretion, hence important for preventing type I diabetes 38 .
Clerc et al. 39 , further states that triantennary (GP 30, GP 31 and GP 32) and tetraantennary (e.g., GP 26,GP34,36) glycans are expressed on kininogen-1 and histidine-rich glycoproteins. Kininogens are proteins with multiple functions including antidiuretic, antiangiogenic, antithrombotic, profibrinolytic and proinflammatory proteins. Abnormal expression of this glycoprotein is linked to diabetes 40 . Histidine-rich glycoproteins bind to ligands including heparan sulfate, plasminogen, heme amongst others and regulates multiple processes such as cell adhesion, fibrinolysis, cell chemotaxis. A deficiency of this protein has been associated with thrombosis, but its role in www.nature.com/scientificreports/ diabetes has also been reported 41 . Similarly, abnormal activities of a-antitrypsin, transferrin and hemopexin are all implicated in diabetes. Of particular interest is three glycan groups (GP30, GP36 and GP38) that have been shown to have clinical relevance in maturity onset diabetes of the young (MODY) 42 . In fact, Juszczak et al. 42 , documented that GP30, GP36 and GP38 had the best discriminative power between HNF1A-MODY and earlyonset type 2 diabetes. The authors explained that HNF1A is a transcription factor for the inflammatory marker C-reactive protein (CRP) and a master regulator of fucosylation; with variations in HNF1A triggering MODY. With a sensitivity of 88% and specificity of 80%, was the best amongst the three glycan groups in discriminating between individuals with damaging HNF1A alleles from those with early-onset nonautoimmune diabetes but lacked HNF1A variants. The study showed that subjects with deleterious HNF1A allele had reduced levels of these glycans than those who lacked the rare HNF1A allele 42 .
The findings of the current study build upon that of Keser et al. 17 who also suggested that the increased branched N-glycans in T2DM can be due to dysregulation of the hexosamine biosynthesis pathway (HBP). HBP has been found to be involved in the metabolism of glucose. This pathway under normal conditions, metabolises up to 3% glucose of the total glucose in the body. However, when homeostatic mechanism is disturbed, such as in T2DM, the metabolism of glucose is heightened, producing uridine diphosphate N-acetylglucosamine (UDP-G1cNAc). UDP-G1cNAc is a substrate for glycosyltransferases that catalyses the elongation and branching of glycan chains in glycosylation. GNTs are encoded by MGAT3 [mannosyl (β-1,4-)-glycoprotein β-1,4-Nacetylglucosaminyltransferase] but specifically, GNT-I, -II, -IV and -V catalyses the biosynthesis of mono, bi, tri and tetra-antennary glycans whereas GNT extends the 1-6 arm of the glycan core with GlcNAc residue. A defective GNT glycosyltransferase in the pancreatic islets results in impaired insulin action, impaired glucose tolerance and eventually, hyperglycaemia.
Aberration of fucosylation, be it core or antennary has been implicated in our results just as stated in multiple chronic diseases [43][44][45] . For example, Herrera et al. 46 identified core-fucosylated tetra-antennary glycan to be associated with poor breast cancer prognosis. Then Testa et al. 44 , showed that core-α-1,6-fucosylated diantennary glycans was associated with T2DM. Sialic acids (N-acetylneuraminic acids) are pinned to the non-reducing ends of N-glycans by way of 2,3-, ,2,6-linkages. When bound, they play crucial roles in the pathological conditions including cancers and viral infections, while sialic acid complex glycans have been suggested to have antiinflammatory properties 47 . Removal of UDP-N-GlcNAc 2-epimerase/ N-acetylmannosamine (ManNAc) kinase, an enzyme required for the biosynthesis sialic acids, led to glomerula proteinuria in mice 48 . In addition, other studies have found that upregulation of β-galactoside α-2,6-sialyltransferase 1, an enzyme that catalyses terminal α2,6-sialylation, was associated with worse patient outcomes in cancer 49 . Other studies have also indicated that an increase in α-(2 → 3)-sialic acid correlates with tumor metastasis. For example, intravenous administration of a sialidase (enzyme that cleaves sialic acids) blocking agent caused an increase release of insulin in pancreatic islets 50 . It is known that hyposialylated IgG glycans stimulates endothelial FcγRIIb, which has been previously associated with insulin resistance in obese mice. In the present study, the T2DM was associated with terminal sialylation. Recently, increased sialic acids on N-glycans has been implicated in T2DM development 17 . The absence of sialic acids on plasma LDL-c has been shown to induce cholesterol ester accumulation in cells and hence implicated in cardiometabolic diseases. This could be a possible reason why plasma LDL-c was highly loaded in cases compared to controls 51 .
The main limitation of the study relates to the small sample, and which means, the results cannot be generalised. Also, there is a possibility of biological variations related to gene expression in the samples, but that was not investigated. Already a genome wide association study has identified HNFA1 α as the master regulator of fucosylation 52 . Moreover, Cohain et al. 27 analysis on cardiometabolic tissues revealed multiple genes that code for clinical markers including total cholesterol (DHCR7, FADS1, FADS2, MMAB, and MVK), (FLVCR1, LSS, MMAB, MVK, DHCR7, FADS1, FADS2 and VPS37D), LDL-c (FADS1, FADS2, and LSS), HDL-c (FLVCR1, MMAB, MVK, FADS1, FADS2), and TG (VPS37D, FADS1, FADS2). Zaytseva 53 also reported that most of the highly heritable N-glycan peaks such as GP1, GP2, GP4-6, GP10-11, GP16, and GP17 were core-fucosylated biantennary with reduced sialylation whereas GP 20 and GP 14 had a low heritability. We intend to use path analysis and confirmatory factor analysis to determine gene-glycan relationships.
The present study has only provided information about glycans in biological samples (glycome), without highlighting downstream changes in the transcriptome, metabolome, lipidome and proteome. Thus, combining and analysing multiomics simultaneously will provide a clearer understanding of the mechanism that underly T2DM pathogenesis.

Conclusion
DIABLO is a robust method that captures the N-glycan-glycan interactions in T2DM and healthy controls. T2DM is associated with highly branched N-glycan structures including trigalactosylated, triantennary, high branching (HB) and trisialylated that are derived from glycoproteins. Glycan groups identified to discriminate T2DM from healthy controls can be exploited further to unearth their potential for T2DM diagnosis and prognosis.