Finite Dimension: A Mathematical Tool to Analyse Glycans

There is a need to develop widely applicable tools to understand glycan organization, diversity and structure. We present a graph-theoretical study of a large sample of glycans in terms of finite dimension, a new metric which is an adaptation to finite sets of the classical Hausdorff “fractal” dimension. Every glycan in the sample is encoded, via finite dimension, as a point of Glycan Space, a new notion introduced in this paper. Two major outcomes were found: (a) the existence of universal bounds that restrict the universe of possible glycans and show, for instance, that the graphs of glycans are a very special type of chemical graph, and (b) how Glycan Space is related to biological domains associated to the analysed glycans. In addition, we discuss briefly how this encoding may help to improve search in glycan databases.

where $\Gamma(\text{glycan})$ denotes the underlying graph (Fig. 1b), and $\dim_f(\Gamma(\text{glycan}))$ is the finite dimension of this graph. The finite dimension of a glycan is defined by the last equality in (1). In line with this definition, we abuse language and apply graph-theoretical notions directly to glycans; for example, we say that two glycans are isomorphic when their associated graphs are isomorphic.
The Hausdorff dimension of finite sets is zero. In contrast, the finite dimension of finite sets is highly non-trivial, making it suitable to classify finite sets. We call it finite dimension because it is defined only on finite sets; its values, however, can be any real number ≥ 0, or infinity [6].
Finite dimension is actually defined on finite metric spaces (for the general definition and properties see [6]). Graphs can be used, in several ways [7], to give a metric structure to the set of vertices. Here we use the simplest way, standard in Graph Theory, which is to count the smallest number of hops between two vertices, moving only between adjacent vertices. The finite dimension of this metric space is called the finite dimension of the graph and, in this paper, of the glycan in question. An important fact is that, with this metric, isomorphic graphs have equal finite dimension [7].
In the graphs obtained from GTC most edges have the same length, but ramified glycans contain chemical linkages whose length is approximately 1.5 times the usual length (e.g. 1-6 linkages). We disregard this difference and assume that all edges have length one. This has an important technical implication that simplifies the calculation of $\dim_f$. In fact, if Γ denotes a graph with vertex set V, we have:
$$\dim_f(\Gamma) = \frac{\ln N}{\ln D},$$
where N is the smallest number of cliques (i.e. sets of vertices of diameter 1) that are needed to cover V, and D is the diameter of Γ. It turns out that N = ϑ(Γ), a classical graph parameter called the clique covering number. For perfect graphs, Lovász has shown that ϑ(Γ) = α(Γ), the independence number of Γ [10,11]. The vast majority of the glycans we study are trees, a class of perfect graphs. Hence:
$$\dim_f(\Gamma) = \frac{\ln \alpha(\Gamma)}{\ln D}.$$
This is the formula used in the paper to compute finite dimension. In doing so, we use the open-source mathematical software system SAGEMATH [12].
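As a rough illustration of the computation for trees, the following sketch evaluates ln α(Γ)/ln D in plain Python rather than SageMath. It assumes the formula has the form ln α(Γ)/ln D, which agrees with the values quoted in the text such as ln(15)/ln(10). The helper names, the adjacency-dictionary input format, and the conventions for graphs of diameter 0 and 1 are assumptions made for this example only.

```python
import math
from collections import deque


def bfs_farthest(adj, start):
    """Return (vertex, hops) for a vertex farthest from `start`."""
    dist = {start: 0}
    queue = deque([start])
    far = start
    while queue:
        u = queue.popleft()
        if dist[u] > dist[far]:
            far = u
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return far, dist[far]


def tree_diameter(adj):
    """Diameter of a tree, using two breadth-first searches."""
    far, _ = bfs_farthest(adj, next(iter(adj)))
    _, diameter = bfs_farthest(adj, far)
    return diameter


def tree_independence_number(adj):
    """Independence number of a tree by dynamic programming on a rooted tree."""
    root = next(iter(adj))
    parent = {root: None}
    order = [root]
    for u in order:  # breadth-first order; the list grows while we iterate
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
    take = {u: 1 for u in adj}  # best size of an independent set containing u
    skip = {u: 0 for u in adj}  # best size of an independent set avoiding u
    for u in reversed(order):   # children are processed before their parents
        p = parent[u]
        if p is not None:
            take[p] += skip[u]
            skip[p] += max(take[u], skip[u])
    return max(take[root], skip[root])


def finite_dimension_tree(adj):
    """ln(alpha)/ln(D) for a tree given as {vertex: [neighbours]}."""
    d = tree_diameter(adj)
    if d == 0:
        return 0.0        # single point (convention taken from the text)
    if d == 1:
        return math.inf   # a single edge (assumed convention for diameter 1)
    return math.log(tree_independence_number(adj)) / math.log(d)


def path(n):
    """Path with n vertices 0..n-1."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def star(n):
    """Star with n vertices; vertex 0 is the centre."""
    adj = {0: list(range(1, n))}
    adj.update({i: [0] for i in range(1, n)})
    return adj
```

For instance, `finite_dimension_tree(path(11))` evaluates ln(6)/ln(10), and `finite_dimension_tree(star(5))` evaluates ln(4)/ln(2).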
Notation and examples. In this paper, graph means finite, undirected, simple and connected graph. The path $P_n$ has n vertices and n − 1 edges; it has diameter n − 1. The star $St_n$ has n vertices, with a central one which is adjacent to all other vertices; it has diameter 2. Let $C_n$ denote the cycle with n vertices and n edges; it has diameter $\lfloor n/2 \rfloor$, the largest integer ≤ n/2. The complete graph on n ≥ 2 vertices is denoted $K_n$; its diameter is 1, for all n. The finite dimension of a graph is zero if and only if (iff) the graph is a single point, and $\dim_f(K_n) = \infty$ for all n ≥ 2. For all other graphs, $\dim_f$ is a positive real number. If Γ is triangle-free, the cliques used to cover V(Γ) are pairs of adjacent vertices and, hence, n ≤ 2N, or $N \ge \lceil n/2 \rceil$, where $\lceil n/2 \rceil$ denotes the smallest integer ≥ n/2.

Results
We study a sample $\mathcal{S}$ that consists of two different sets: a portion of the glycan database GTC, and a set of 500 simulated ("synthetic") glycogen molecules. The complete sample $\mathcal{S}$ is available online (see Supplementary information below).
The contents of GlyTouCan. We study the contents of GTC from the point of view of the graphs associated to each entry. We read all entries of GTC with WURCS codes [13], and considered alternative linkage information (alternative, statistically or range), but we did not consider alternative units or alternative repetitions. We obtained 52,374 entries, 25 of which were discarded as they were disconnected. The remaining set of 52,349 connected graphs, denoted gtc, is the disjoint union of a set T that contains only trees and a set Cy of graphs that are not trees, i.e. graphs that are, or contain, cycles. In its turn, T is the disjoint union (denoted $\sqcup$) of B, the set of branched or ramified trees, and L, the set of linear trees, i.e. paths of different lengths. We have:
$$\text{gtc} = T \sqcup Cy = B \sqcup L \sqcup Cy.$$
The sizes of these sets, and their percentages in gtc, are: |B| = 28,424 (54.4%), |L| = 23,715 (45.3%), and |Cy| = 210 (0.4%). We can further subdivide Cy as the disjoint union of C and Cb, where C denotes the set of graphs that are pure cycles, and Cb the set of ramified ones, i.e. the graphs that contain, but are not themselves, cycles. We have |C| = 130 (61.9%) and |Cb| = 80 (38.1%).
Some of the entries in GTC contain information about the biological species where their associated glycan was found. We follow [14] and group these species into three domains: Eukaryota, denoted EU, Bacteria, BA, and Archaea, AR. We could read 15,230 glycan structures included in this taxonomy. Of these, 876 entries were not uniquely categorized, and were removed from tax, the set of entries uniquely classified. We have: For bacteria, we have |BAB| = 1,631 (51.1%), |BAL| = 1,561 (48.9%) and |BAC| = 2 (0.06%). Finally, for archaea, |ARB| = 3 (8.3%), |ARL| = 33 (91.7%) and |ARC| = 0 (0.0%). We summarize these facts in Table 1 below. Table 1 consists of three rectangles. One is called gtc and consists of the three columns B, L, Cy. The second rectangle is called tax and consists of three rows labelled EU, BA, AR, and six columns. The third rectangle is labelled GTC and consists of four columns (B, L, Cy, GTC) and four rows. The main portion of the table consists of the 9 entries defined by EU, BA, AR and B, L, Cy. For instance, there are 4,589 branched eukaryotic glycans and 3,138 linear ones.
Outside the table proper there is a column and a row: 8,411, for instance, is the sum of all elements of EU, including 684 eukaryotic glycans that belong to GTC but not to gtc, etc. Similarly, 6,223 is the number of branched glycans that are categorized, i.e. that belong to tax. The number 22,201 in the cell labelled B gives the number of branched elements of gtc that are not categorized, i.e. that do not belong to tax, and similarly for the cells labelled L and Cy.
Branched glycans in gtc. The set of dimensions of elements of B is denoted dimB, and its statistical structure is summarised in Tables 2 and 3. It follows that $0.7737 \le \dim_f \le 2.0$ for all ramified glycans in gtc. We can also note that dimB has only 115 different values.
Linear glycans in gtc. There are 23,715 linear glycans, i.e. paths $P_n$, in gtc. Of these, 4,015 are segments $P_2$. In general, many non-isomorphic graphs can have the same finite dimension, but linear graphs have a very simple structure: two paths $P_n$, $P_m$ are isomorphic iff n = m, iff $P_n$ and $P_m$ have the same diameter, iff $\dim_f(P_n) = \dim_f(P_m)$. In particular, their finite dimension depends only on the diameter of the path. Indeed,
$$\dim_f(P_n) = \frac{\ln \lceil n/2 \rceil}{\ln (n-1)}.$$
These dimensions are always < 1, except for the case n = 3, where it equals 1. On the other hand, their limit (as n → ∞) is 1. We note here that there are 24 different values of the finite dimension of linear glycans in gtc, 23 finite ones and infinity. Thus, the total of 23,715 linear glycans falls into only 24 different isomorphism classes.
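The behaviour of paths can be checked numerically. The sketch below assumes the closed form ln⌈n/2⌉ / ln(n − 1) for n ≥ 3, together with the conventions mentioned in the text for the single point (dimension 0) and the segment $P_2$ (infinite dimension); the function name is chosen here for illustration.

```python
import math


def path_dimension(n):
    """Finite dimension of the path P_n, assuming ln(ceil(n/2)) / ln(n - 1)."""
    if n == 1:
        return 0.0          # single point
    if n == 2:
        return math.inf     # the segment P_2 has diameter 1
    clique_cover = (n + 1) // 2   # ceil(n/2): pairs of adjacent vertices
    diameter = n - 1
    return math.log(clique_cover) / math.log(diameter)


if __name__ == "__main__":
    # Print the first few values: only n = 3 reaches 1, the rest stay below it.
    for n in range(3, 13):
        print(n, round(path_dimension(n), 5))
```

Evaluating the function for large n shows the slow approach to the limit 1 described above.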
Since the finite dimension of linear glycans gives no more information than their diameter, it may seem that considering $\dim_f$ only complicates matters unnecessarily. However, there are advantages in treating all glycans uniformly, the most important of which is to discover the unexpected way in which the linear glycans fit in $\gamma(\text{gtc})$ and $\gamma(\mathcal{S})$ (cf. Section 3.5 and Figs 2 and 3). We let Lfin denote the 19,700 glycans that are paths of length ≥ 2, and let dimL denote the set of their finite dimensions. From Table 3 we can read, for example, that more than 30% of the linear glycans of length ≥ 2 consist of paths of length 2 (because $P_3$ is the only path with $\dim_f = 1$).
, when k → ∞. In contrast to the case of paths and pure cycles, the values in dimCb lie on both sides of 1. There are 80 elements in Cb, and 11 different values in dimCb.
Glycogen. Using NumPy [15] we generated 500 synthetic graphs that satisfy the following specification: the graphs are ramified trees with up to 120,000 nodes, and branches of length 20-23 nodes that sprout every 12-19 nodes [16,17]. We let S500 denote the set of their finite dimensions (see Table 2).
Universal bounds. From Table 2 one can read universal bounds $a \le \dim_f \le b$ for the different types of glycans of the sample $\mathcal{S}$. For instance, the choice a = 0.6309, b = 2.0 works for $\mathcal{S}$, and a = 0.9341, b = 1.2680 for S500. By Equation (2), the inequalities $a \le \dim_f(\Gamma) \le b$ are equivalent to:
$$D^a \le N \le D^b. \tag{3}$$
Since both N and D can be regarded as different measures of size for Γ, Equation (3) establishes relations between these measures that restrict, both qualitatively and quantitatively, Γ's form and, ultimately, the kind of graphs that can pertain to glycans. More on this in Section 3.4.
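The equivalence between bounds on the dimension and bounds on N can be illustrated with a small check. This is only a sketch: it assumes the dimension is ln N / ln D with D ≥ 2, and the function names are hypothetical. The pair (N, D) = (15, 10) is the one quoted in the text for the glycan G06222QR.

```python
import math


def dimension(N, D):
    """ln N / ln D for clique covering number N and diameter D >= 2."""
    return math.log(N) / math.log(D)


def within_bounds_by_dimension(N, D, a, b):
    """Check a <= dim_f <= b directly on the dimension."""
    return a <= dimension(N, D) <= b


def within_bounds_by_size(N, D, a, b):
    """Check the same bounds as a relation between N and D: D^a <= N <= D^b."""
    return D ** a <= N <= D ** b


if __name__ == "__main__":
    bounds = {"sample": (0.6309, 2.0), "S500": (0.9341, 1.2680)}
    for name, (a, b) in bounds.items():
        print(name,
              within_bounds_by_dimension(15, 10, a, b),
              within_bounds_by_size(15, 10, a, b))
```

Away from the exact boundary values, where floating-point rounding may matter, the two tests agree, because the logarithm is increasing and ln D > 0 for D ≥ 2.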
Glycan Space. We enrich the information provided by the finite dimension of glycans by adding information on size, in the form of the glycan's diameter. We feel this is an appropriate way to compactly codify glycan information, or rather, the glycan information that is the focus of this paper. To this effect, we introduce Glycan Space (GS), a subset of the plane $\mathbb{R}^2$ where all glycans of $\mathcal{S}$ can be represented or coded. In fact, our methods are so general that all glycans in any future sample will have a coding in GS, as long as we can obtain their underlying graphs. Consider the lattice in $\mathbb{R}^2$ of all points (n, m) with n, m integers ≥ 2, and let $\phi$ be the map from this lattice to $\mathbb{R}^2$ given by $\phi(n, m) = (\ln n / \ln m,\ m)$; GS is the image of $\phi$. The coding of $\mathcal{S}$ is defined to be $\gamma(\mathcal{S})$, where $\gamma: \mathcal{S} \to GS$ sends a glycan g to the point $(\dim_f(g), D)$, with D the diameter of Γ(g). We can restrict γ to gtc or to S500, to obtain codings of these sub-samples. Figure 2 shows $\gamma(\text{gtc})$, the coding of gtc, and a neat structure in it. More precisely, Fig. 2 includes B, Cb and Lfin. This set of 52,219 glycans is coded in GS using 145 points.
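To make the coding concrete, here is a minimal sketch that maps pairs (N, D) to points of GS and counts how many graphs share each point. It assumes a code has the form (dim_f, D) with dim_f = ln N / ln D, as in the codes (1.17609, 10) and (0.77815, 10) quoted in the text; the five-digit rounding, the function names and the toy sample are assumptions made for this illustration.

```python
import math
from collections import Counter


def code(N, D, digits=5):
    """Point of GS for a graph with clique covering number N and diameter D >= 2."""
    return (round(math.log(N) / math.log(D), digits), D)


def coding(sample):
    """Count how many graphs of the sample fall on each point of GS."""
    return Counter(code(N, D) for N, D in sample)


if __name__ == "__main__":
    # Hypothetical toy sample of (N, D) pairs, not taken from GTC.
    toy_sample = [(15, 10), (15, 10), (6, 10), (2, 3)]
    for point, count in sorted(coding(toy_sample).items()):
        print(point, count)
```

Rounding makes the codes usable as dictionary keys; many graphs may share one point, which is why a large sample can be coded with comparatively few points.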
Horizontal lines in GS. Let D ≥ 2, and consider the horizontal line in GS defined by D, or D-line for short, i.e. the set of points of GS whose second coordinate equals D. We are interested in obtaining information about the endpoints $L_D$, $R_D$ of D-lines when we restrict them to specific classes of graphs. Since the second coordinate of these points is fixed to D, we abuse notation and let $L_D$, $R_D$ denote both the points of the plane, and their corresponding first coordinates. Note also that the actual values of $L_D$, $R_D$ depend crucially on the class of graphs under consideration. We have:
1. For D-lines of triangle-free graphs, the leftmost point $L_D$ coincides with the code of the path $P_{D+1}$. If the graphs contain triangles (and D ≥ 3), then $L_D = \ln 2/\ln 3$, as shown in Theorem 5.2 of [9]. The rightmost point $R_D$ (for arbitrary graphs) does not exist: there are graphs of diameter D whose finite dimension is as large as desired. For a proof, see (A2) of the Appendix.
2. For D-lines of chemical graphs (i.e. graphs whose vertices have degree ≤ 4), $R_D$ is finite, but tends to infinity with D. See (A3) of the Appendix.
Summarising, for a graph g of diameter D, we have: $L_D$ is the code of $P_{D+1}$ if g is triangle-free; $R_D$ does not exist for arbitrary graphs; and $R_D < \infty$, but $R_D \to \infty$ as $D \to \infty$, if g is triangle-free and chemical. The last case applies notably to glycans, since the vast majority of them are triangle-free, chemical graphs. But for glycans, we already know, from Table 3, that $R_D \le 2$. In the next section we show, moreover, that for $g \in \mathcal{S}$, $R_D \to 1$ as $D \to \infty$. In other words, glycans are indeed a very special subset of the chemical graphs.
The shape of $\gamma(\text{gtc})$ and $\gamma(\mathcal{S})$. Figure 2 shows that $\gamma(\text{gtc})$ has a shape that resembles that of a Christmas tree. In stark contrast to the general results of the last section, the rightmost bound of D-lines is always ≤ 2 and, moreover, tends to decrease as D grows. Since all elements of the sample $\mathcal{S}$ are triangle-free, we already knew that the left boundary of $\gamma(\text{gtc})$ is given by the codes of paths. Figure 2 shows that glycans with D ≤ 20 come quite close to filling up the space to this theoretical boundary. Another interesting feature of $\gamma(\text{gtc})$ is that, if one disregards the special case D = 2 and joins the triangles coding the remaining linear glycans, one obtains two "lines" that get closer as D increases. Thus, the structural simplicity of linear glycans noted earlier is reflected in GS by the fact that they form a "1-dimensional" subset of the plane. In contrast, the far more complex ramified glycans form a "2-dimensional" pattern. Figure 3 shows the coding of the complete sample $\mathcal{S}$. The large disparity in diameter between gtc and S500 accounts for the fact that gtc appears completely flattened near D = 0. The Christmas tree pattern, however, remains unchanged. The reader may contrast the rich information contained in Figs 2 and 3 with the more classical statistical summaries of Table 2.
Next, we use Equation (3) to explain the apparent invariance of the Christmas tree shape. We study the left and right "curves" in GS that delimit the region inside which $\gamma(\mathcal{S})$ lies. We start with another derivation and formulation of the leftmost boundary. Suppose that Γ is a ramified tree with n vertices and diameter D, and $\dim_f(\Gamma) \ge a$, so that $D^a \le N$ by Equation (3). The bound N ≤ n − 1, while not very sharp, is true for all graphs. Thus, $D^a \le n - 1$ or, roughly speaking, on the D-line, for $\dim_f$ to be "far" from 1 it is necessary to have a glycan with "many" nodes in a "small" space (i.e. the diameter must still equal D). Or, conversely, a glycan with diameter D and "few" nodes must have $\dim_f$ close to 1. For example, for D = 10 and a = 1.5, the above condition gives that n ≤ 32 implies $\dim_f(\Gamma) < 1.5$. In actual fact, the rightmost point of $\gamma(\text{gtc})$ on the 10-line corresponds to 6 graphs Γ with n = 28 (the largest n on the 10-line) and $\dim_f(\Gamma) = \ln(15)/\ln(10) \approx 1.17609$. It turns out that these 6 graphs are all isomorphic. The glycan with Accession Number G06222QR in GTC, shown in Fig. 1, is one such example: it has code $(1.17609, 10) \in GS$.
The implication of this condition for the rightmost boundary of $\gamma(\mathcal{S})$ is that existing glycans satisfy the following condition, which summarises the qualitative and quantitative aspects discussed here and in subsections 3.3 and 3.4:
(4) glycans do not have a large number of nodes in relation to the molecule's diameter.
As long as this condition remains true for glycans discovered in the future, the Christmas tree shape will persist. The simulated molecules of glycogen are archetypical in relation to this property: they consist of a long path, with length in the range 1,900-70,900, from which short paths ramify every so often.
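The arithmetic behind the D = 10, a = 1.5 example can be reproduced as follows. The sketch assumes the chain of inequalities $D^a \le N \le n - 1$ discussed above; the function name is hypothetical.

```python
import math


def min_vertices_for_dimension(D, a):
    """Smallest n compatible with dim_f >= a on the D-line, from D^a <= n - 1."""
    return math.ceil(D ** a) + 1


if __name__ == "__main__":
    D, a = 10, 1.5
    threshold = min_vertices_for_dimension(D, a)
    # Any graph of diameter 10 with fewer vertices than this has dim_f < 1.5.
    print("vertices needed:", threshold)
    # The rightmost glycans on the 10-line have n = 28 and N = 15.
    print("observed dimension:", round(math.log(15) / math.log(10), 5))
```

Since 10^1.5 is about 31.6, at least 33 vertices are needed, which is the same statement as "n ≤ 32 implies a dimension below 1.5".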
In order to get a feeling for the meaning of $\dim_f$, we invite the reader to take a look at the glycans with Accession Numbers G60741HS and G94498MI in GTC. Both have the same code $(0.77815, 10) \in GS$.
Finite dimension and taxonomy. We discuss the way in which the finite dimension of ramified tree glycans is related to glycan taxonomy. We use the notation BAB for branched BA, and EUB for branched EU; |BAB| = 1,631 and |EUB| = 4,589. We consider a sort of "symmetric difference" of these sets with respect to finite dimension. By definition, BAB_EUB consists of elements of BAB whose $\dim_f$ is exclusive to BA; similarly, EUB_BAB consists of glycans in EUB whose $\dim_f$ is not the finite dimension of any element of BAB. The sets contain, respectively, 32 and 317 glycans (bold figures in Fig. 5). The "intersection" consists of elements in the union BAB ∪ EUB (as sets) whose finite dimension is shared by BA and EU. The total number is, of course, 6,220. The set of different values of the finite dimension of these 6,220 glycans has a total of 86 elements, shown in parentheses in Fig. 5, of which 20 are exclusive to BA, 31 exclusive to EU, and 35 are shared by BA and EU. Figure 6 shows the position in GS of the differences BAB_EUB and EUB_BAB of Fig. 5. We see a rather clear-cut separation between Bacteria and Eukaryota, as well as a shift to the right and down as we move from Bacteria to Eukaryota. In other words, glycans from Bacteria have "large" diameter and "small" finite dimension, and those from Eukaryota have "smaller" diameters and "large" dimension. This means, roughly speaking, "long and sparse" glycans for Bacteria, and "short and packed" (i.e. with many edges in the given diameter) glycans for Eukaryota. Figure 6 suggests several questions; for instance, about exceptions and outliers. There are four exceptional EUB-points that have $\dim_f < 1$. What can be said about the properties of glycans coded by these four points? And what about the BAB-points of diameters 14 and 39 that have $\dim_f > 1$? Or glycans coded by the 3 points with diameter ≥ 35?
Prospective uses of the glycan coding. An analysis of the advantages and disadvantages of existing methods to classify and retrieve glycan structures in databases is certainly beyond the scope of this work. However, we foresee possible applications of the new methodology for this purpose, i.e. to search (by coding/decoding) a glycan database (DB), and hence we refer briefly to this point.
Let us assume we wish to decide whether or not a glycan g, with underlying graph Γ(g), is an entry of a given DB. We start by encoding it as a point $P = \gamma(g)$ of GS. If P does not belong to the coding of DB, we conclude that g is not in DB. If it does, we have to decode P and find a unique entry of DB that corresponds to g, provided one exists. To decode, the first step is to find out whether or not DB contains a graph Γ isomorphic to Γ(g), Γ ≅ Γ(g). It suffices to search through the isomorphism classes of graphs in DB. This is trivial if g is linear: we can just compute the length of Γ(g). To decode P when g is branched, recall that, of the 145 points of $\gamma(\text{gtc})$, 18 are exclusive to linear glycans, so we concentrate on the 127 remaining ones. None of these points contains more than 63 isomorphism classes of graphs. Searching through these few classes, we can decide very fast whether or not DB contains Γ ≅ Γ(g).
This suggests an algorithm to decide whether or not a given glycan g is in DB: we first compute its code $\gamma(g)$ and, if it is not in $\gamma(\text{gtc})$, we conclude that g is not in DB. If the code is in $\gamma(\text{gtc})$, we search for an isomorphic graph with this code (the largest such search-set contains 63 elements). If we cannot find an isomorphic graph, we conclude that g is not in the database. If we do find one, say Γ, then we search through all labelled graphs with Γ as underlying graph (the largest such search-set contains at most 1,994 elements) and again, if we find a labelled graph with the same labels as g, then g is in DB, and not otherwise.
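The three-stage search just described might be organised as a nested index, as in the sketch below. It is not the authors' implementation. It assumes that every entry has already been reduced to a triple: its code in GS, a canonical key `shape` for the isomorphism class of its unlabelled graph, and a hashable summary `labels` of its labelling. How `shape` and `labels` are computed is left open; all names and the toy entries are hypothetical.

```python
from collections import defaultdict


def build_index(entries):
    """Group entries as index[code][shape] = set of labellings."""
    index = defaultdict(lambda: defaultdict(set))
    for code, shape, labels in entries:
        index[code][shape].add(labels)
    return index


def lookup(index, code, shape, labels):
    """Follow the three stages: code, isomorphism class, then labels."""
    if code not in index:
        return "not in DB: code absent"
    classes = index[code]            # isomorphism classes sharing this code
    if shape not in classes:
        return "not in DB: no isomorphic graph"
    if labels not in classes[shape]:
        return "not in DB: labels differ"
    return "in DB"


if __name__ == "__main__":
    # Hypothetical entries: (code, canonical shape key, label summary).
    db = build_index([
        ((1.17609, 10), "shape-A", "labels-1"),
        ((1.17609, 10), "shape-A", "labels-2"),
        ((0.77815, 10), "shape-B", "labels-3"),
    ])
    print(lookup(db, (1.17609, 10), "shape-A", "labels-2"))
    print(lookup(db, (1.17609, 10), "shape-C", "labels-2"))
    print(lookup(db, (2.0, 2), "shape-A", "labels-1"))
```

Each stage narrows the search set before the next, more expensive comparison, which is the point of using the code as the first filter.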

Conclusions and Questions
Via finite dimension we obtained a compact coding in GS of the sample $\mathcal{S}$. The shape of $\gamma(\mathcal{S}) \subset GS$ resembles that of a Christmas tree, and we gave a mathematical explanation of why this is so. It turns out that having this shape is a consequence of condition (4) of Section 3.5. In fact, we conjecture that all glycans, present and future, do satisfy (4), perhaps because of stereochemical restrictions and/or biochemical reasons. In addition, the coding reveals a rather clear-cut distinction between Bacteria and Eukaryota. The generality of our methods allows for a similar coding of future glycan DBs. Also, the coding might be of help in retrieving glycan structures in databases.
Our work suggests several questions: (a) There are "holes" in Fig. 2, e.g. around finite dimension 1 and diameters in the range 15-20; "should" there exist glycans to fill the hole? Based on their position in GS, what properties would they have (as graphs, biochemical, biological (taxonomy), etc.)? (b) There are some exceptional points in Fig. 6. For example, four EUB-points have $\dim_f < 1$. What can be said about the properties of glycans coded by these four points? And what about the BAB-points of diameters 14 and 39 that have $\dim_f > 1$? Or glycans coded by the 3 points with diameter ≥ 35? (c) More generally, is there a connection between the position of glycans in GS and their properties?