Glycosyltransferase Gene Expression Profiles Classify Cancer Types and Propose Prognostic Subtypes

Aberrant glycosylation in tumours stem from altered glycosyltransferase (GT) gene expression but can the expression profiles of these signature genes be used to classify cancer types and lead to cancer subtype discovery? The differential structural changes to cellular glycan structures are predominantly regulated by the expression patterns of GT genes and are a hallmark of neoplastic cell metamorphoses. We found that the expression of 210 GT genes taken from 1893 cancer patient samples in The Cancer Genome Atlas (TCGA) microarray data are able to classify six cancers; breast, ovarian, glioblastoma, kidney, colon and lung. The GT gene expression profiles are used to develop cancer classifiers and propose subtypes. The subclassification of breast cancer solid tumour samples illustrates the discovery of subgroups from GT genes that match well against basal-like and HER2-enriched subtypes and correlates to clinical, mutation and survival data. This cancer type glycosyltransferase gene signature finding provides foundational evidence for the centrality of glycosylation in cancer.


Supplementary Information: Methods Details and Data
Though next generation sequencing can produce more accurate data with higher sensitivity 1 , we used microarray data because of the availability of a larger number of samples with adequate follow up information, which were retrieved from The Cancer Genome Atlas (TCGA) data portal (https://tcgadata.nci.nih.gov/tcga/) and also availability of an independent dataset, GSE20624 2 , as external test for the developed classifier. The Cancer Genome Atlas (TCGA) Research Network has improved our understanding in cancer biology through profiling and analyzing large numbers of human tumors. The resulting rich data offer a great opportunity to improve a coherent picture of variation across tumors 3,4 . For the purpose of this study, Agilent Microarray (Agilent 244K custom gene expression) data of 1893 samples from published TCGA (http://tcga-data.nci.nih.gov/tcga/) representing six cancer types including breast: breast invasive carcinoma [BRCA,n=531], ovary: ovarian serous cystadenocarcinoma [OV, n=578], brain: glioblastoma multiforme [GBM, n=403], kidney: kidney renal clear cell carcinoma [KIRC,n=72], colon: colon adenocarcinoma [COAD, n=154] and lung: lung squamous cell carcinoma [LUSC,n=155]) were combined, while normal and control samples were excluded.
A list of human glycosyltransferase (GT) genes was retrieved through filtering several publicly available databases such as Kyoto Encyclopedia of Genes and Genomes database (KEGG/GENES) (http://www.genome.jp/kegg/genes.html), Carbohydrate-Active Enzymes database (CAZy) databases (http://www.cazy.org/), and literature search. KEGG/GENES is a pool of manually curated genes retrieved mainly from NCBI RefSeq [5][6][7] . Furthermore, CAZy provides an online and regularly updated access to family classification of CAZymes corresponding to proteins released in the daily releases of GenBank (ftp://ftp.ncbi.nih.gov/genbank/daily-nc) 8 . Supplementary Table 1 contains a list of human glycosyltransferase gene symbols with their Entrez numbers.
The expression dataset of glycosyltransferases was built through combining the TCGA expression datasets of six investigated cancer types (i.e. breast, brain, colon, kidney, lung and ovary) and further retrieving the expression of 210 glycosyltransferase genes from the combined dataset.
Batch effects are commonly observed systematic non-biological variation between groups of samples due to experimental artifacts, such as processing date, lab, or technician. Combining samples from multiple batches can cause the true biological variation in a high-throughput experiment to be obscured by variation due to batch 9 . However, the correlations of batch effects (technical and biological artifacts) with the outcome are common and critical to address 10 , correcting for batch effects when there is no significant effects may result in removing biological variation instead of the systematic non-biological variation due to batch 9 . Therefore, a simple test was performed to evaluate existing of batch effects in combined dataset with comparing box plots, QQ-plots and applying a t-test analysis before and after using an Empirical Bayes batch effect correction method, i.e. ComBat, implemented in 'sva' package 11 in R 12 . The result of this analysis clearly illustrated that no significant batch effects in the dataset (t-test p-value=0.5, Supplementary Figure 1a) and also there is no specific color grouping observed based on the batch effects in the principal component loading plot for PC1 to PC3 in the combined dataset (Supplementary Figure 1b). However a significant grouping is observed while samples are colored based on the cancer type ( Fig. 1a) confirming the batch effects do not stimulate cancer type grouping. In addition, separate principal component analyses were performed to investigate the batch effects in each TCGA cancer type while samples are colored based on the batch numbers and they have not shown any grouping based on the batch numbers in each cancer type (Supplementary Figure 1c).
To separate cancer types based on the expression of glycosyltransferase genes, a principal component analysis was performed using 'psych' package 13 in R. Furthermore, a hierarchical average linkage clustering performed on GT genes and cancer types across the complete 1893 sample set using 'cluster' package 13 in R. The result of this analysis reveals that the expression profile of GT genes not only separates six cancer types but also represents a unique molecular entity with similarity to lung cancer for basal-like samples (TNBC, n=83, colored in black in the TNBC sidebar in Supplementary  Figure 2), which is in line with the result of Prat and colleagues (2013) investigating the expression of 3486 most variable genes across six different cancer types from TCGA data 3 .
To better understand how the expression of glycosyltransferase genes contribute to separation of cancer types from each other and to investigate dominant glycan-specific changes occur in carcinogeneic process of each cancer type, the expression of glycosyltransferase genes was compared among the cancer types and the association of glycosyltransferase genes to patient survival was studied. For this purpose, differential expression analyses were carried out using 'limma' package 14 in R. Genes with q-value > 0.005 and 2 fold change were considered as a differentially expressed gene in pairwise comparisons, while a 'decideTests' function in 'limma' package was used to assigning binary values (i.e. 1: up-regulated, -1: down-regulated and 0: not detected) to these genes. Finally, a gene in a specific cancer types was considered to be up-regulated if the median of all pairwise comparisons was 1 and it is down-regulated (-1) in none of the comparisons and wise versa (Supplementary Table 2). Correlation between patient survival and glycosyltransferase gene expression was performed using log rank test implemented in 'survival' package 15 in R, while samples in all cancer types were divided into two groups for each gene (0: samples that showed gene expression value above median and 1: below median), and then compared to each other in terms of overall outcome (Supplementary Table 2). In addition, Supplementary Table 3 shows the expression of glycosyltransferase genes between normal and malignant in various cancers in several studies.
Having established that the expression profile of glycosyltransferase genes are able to separate six cancers we explore the development of a GT gene classifier using shrunken centroid approach 16 in 'pamr' package (http://cran.r-project.org/web/packages/pamr) in R, which is able to identify cancer type from a random sample. Furthermore, 'caret' package 17 in R was used to rank the gene importance in a supervised learning model (pam model).
For the purpose of error estimation of training model (pam classifier) in the assignment of samples to the right cancer types, a 10-fold cross validation technique was carried out using 'pamr' package in R. In addition, internal and independent/external tests were carried out to evaluate the performance of the pam classifier using the expression of glycosyltransferase genes. For this purpose, the glycosyltranseferases' expression dataset was randomly split hundred times into training (70%) and test (30%) sets. Training sets were used to build a model, which were then applied to the testing sets. Finally, the median values were used to assign each sample to a specific cancer type. The result of this analysis was used for accuracy measurement calculation summarized in Supplementary Table 4. Given a classifier and a sample, there are four possible outcomes: true positive, true negative, false positive and false negative. It is true positive if the sample is positive and it is classified as positive and it is false negative if it is classified as negative. It is true negative if the sample is negative and it is classified as negative and it is a false positive if it is classified as positive. Given a classifier and a set of samples (the test set), a two-by-two confusion matrix (also called a contingency table) can be constructed representing the dispositions of the set of samples, see Fawcett (2006) for more information and equations 16 . Furthermore, the ROC (Receiver Operating Characteristics) curve has been extensively studied and applied in medical diagnosis since the 1970s 18,19 and the area under the ROC (AUC) 20 has become an important performance measure in this regard, since it is invariant to operating conditions 21 . The accuracy measures derived from a confusion matrix, the area under the receiver operating characteristic (ROC) curve and its confidence interval (CI) for internal test (Supplementary Table 4), clearly shows the potential of gene expression profiling of glycosyltransferase in tumor type identification/separation with high accuracy, sensitivity and specificity for all investigated cancer types.
Since training algorithms look for patterns in the training dataset, a classifier that relies on these spurious patterns will have higher accuracy on the training examples than it will on the whole population. Therefore, it is extremely critical to evaluate the performance of a classifier on an independent test set. For this purpose, training sets of previous test (internal test) have been used for an external (independent) test that examines 293 breast cancer samples existing in GPL1390 platform of GSE20624 2 . GSE20624 (GPL1390) data is not included in TCGA while it uses the same microarray platform with TCGA datasets, however, only 177 glycosyltransferase are common between training (TCGA based) and this dataset.
In terms of breast cancer subtyping, to provide a quantitative evidence for the prediction of a number of possible clusters within the TCGA breast cancer dataset, consensus clustering plus class discovery technique 24 was conducted using 'ConsensusClusterPlus' package 22 in R. Consensus clustering is a clustering framework that has been widely used for cancer subtyping. In this technique, the same clustering algorithm is applied multiple times to different subsets of the data and a consensus result is collected to better describe the similarities between samples 23 . The consensus Cumulative Distribution Function (CDF) and Delta area plots are the graphical representations to illustrate at what number of clusters, the CDF reaches an approximate maximum and at which k (number of groups) there is no significant increase in CDF curve, respectively. The result of consensus clustering analysis was graphically represented as heatmaps for the consensus matrices of k=2 to k=10. Accordingly, microarrays are placed in both rows and columns of the consensus matrices and consensus value ranges are colored by white to dark blue, indicating that samples never cluster together and always cluster together, respectively ( Supplementary Figures 3a and b). Furthermore, to group samples into subtypes based on the expression of glycosyltransferase genes, a k-means clustering was performed using 'cluster' package in R. Cluster significance was evaluated using 'SigClust' package 24 in R, and all class boundaries were statistically significant (Supplementary Figure 3c). To investigate whether the identified groups (using k-means clustering), specific to breast cancer may represent clinically distinct subgroups of patients, univariate survival analyses (comparing subtypes, k=2 to k=10, with respect to the overall survival) was performed (Supplementary Figure 3d) using 'survival' package in R, while previously identified normal-like 25

N-glycans
Precursor synthesis

Lewis antigens
Sialyl Le(a)