Singular manifolds of proteomic drivers to model the evolution of inflammatory bowel disease status

The conditions used to describe the presence of an immune disease are often represented by interaction graphs. These informative, but intricate structures are susceptible to perturbations at different levels. The mode in which that perturbation occurs is still of utmost importance in areas such as cell reprogramming and therapeutics models. In this sense, module identification can be useful to well characterise the global graph architecture. To help us with this identification, we perform topological overlap-related measures. Thanks to these measures, the location of highly disease-specific module regulators is possible. Such regulators can perturb other nodes, potentially causing the entire system to change behaviour or collapse. We provide a geometric framework explaining such situations in the context of inflammatory bowel diseases (IBD). IBD are severe chronic disorders of the gastrointestinal tract whose incidence is dramatically increasing worldwide. Our approach models different IBD status as Riemannian manifolds defined by the graph Laplacian of two high throughput proteome screenings. It also identifies module regulators as singularities within the manifolds (the so-called singular manifolds). Furthermore, it reinterprets the characteristic nonlinear dynamics of IBD as compensatory responses to perturbations on those singularities. Then, particular reconfigurations of the immune system could make the disease status move towards an innocuous target state.


Full Description of Methods
Below, we describe the methods and notation used in this work that were borrowed and adapted from previous results introduced by Horvath et al., 2008 (pg 5), Belkin et al., 2012, Belkin et al., 2015, and Cornelius et al., 2013 (pgs 11-12) among others. Intellectually the work was already done independently, then, we only glued the pieces!
Peptides were analyzed in the Orbitrap cell, in MS, at a resolution of 120,000 (at m/z 200), with a mass range of m/z 350-1550 and an AGC target of 2 x10 5 .
Fragments were obtained by high collision-induced dissociation (HCD) activation with a collisional energy of 30%. MS/MS data were gathered from the ion trap in the top-speed mode, with a total cycle of 3 seconds, with an AGC target of 5x10 4 , a dynamic exclusion of 50 seconds and exclusion duration of 60 seconds. The maximum ion accumulation times were set to 100 ms for MS acquisition and 30 ms for MS/MS acquisition in parallelization mode.

Peptides and proteins identifications and quantification by LC-MS/MS
Identification procedure: The Proteome Discoverer software (Thermo Scientific, version 2.1) and with the Mascot search engine (Matrix Science, version 5.1) were used in the MS and MS/MS data processing. For the precursor ion we set to 7 ppm the mass tolerance maximum threshold and 0.5 Da for fragments. The following variable modifications (2 maximum per peptide) were allowed: carbamidomethylation (Cys), oxidation (Met), phosphorylation (Ser, Thr, Tyr), acetylation (N-term of protein). More than two missed cleavages were not considered for the trypsin protease. The MS/MS identification step performed on the SwissProt database (02/16 version) using the Homo sapiens taxonomy as base. Peptide Identifications were validated using a 1% FDR (False Discovery Rate) threshold calculated with the Percolator algorithm. The relative quantification of protein abundances is measured by means of Progenesis QI -Proteomics software (version 4.0, Waters) -using co-detection in order to eliminate missing values.
The relative quantitation of proteins according to the five groups (CTRL, active Crohn, quiescent Crohn, active UC, quiescent UC) was performed using a between subject analysis and a Hi-3 method (for which the three most abundant peptides were used for protein quantification). Abundance variations of proteins with an ANOVA p-value under 0.05 and with at least two identified peptides were further considered.

How to perform the Weighted Gene co-Expression Network Analysis
The WGCNA R package enables the construction of co-expression network using the normalised data of the 40 LC-MS/MS acquired runs in the two replica samples just described in the previous section. We set the threshold power to 10 since it was the supremum in the scale-free ! fit of 0.8. Briefly, the network was created calculating topologic overlap (TO) with bicor correlation function, then proteins were hierarchically clustered using 1-TO (dissTOM) as the distance measure. Initial module assignments were determined by using dynamic tree-cutting, using default parameters except deepSplit =0, WGCNA network gene-level kME values (gene expression correlation to the module eigengene for the module to which each gene belonged) were used to rule out the measured gene list to genes with kME > 0.65 and membership in any of 4 modules (resp. 2 in CD) of particular interest in UC. R code is provided online via synapse.org (https://www.synapse.org/XXXXXX).

Manifold Assumption of Data Representation in IBD
In many biological systems data explicitly lies on a manifold. It is also normal that those data exhibit structures of high dimensionality and nonlinearity. Still, they can be described by a number of components much lower than their current features. Manifolds, herein noisy Riemannian manifolds with a measure, provide a natural framework to understand the structure of highdimensional data in Biology.
We enquire how our data shape might affect that we state about similarity.
Thus, we consider a function : → ℝ that potentially learns the shape of a data set. Then by the manifold assumption stated in Belkin et al., 2012, we claim the inner product , ∆ ! ! is small. Subsequently, data shape can be closely traced using such function f.

Laplace-Beltrami Operator Basis
The Laplace operator acts, in a certain way, as precursor of the basis required for the function used in the geometry of data. The Laplace -Beltrami operator can be inferred by generalisation of Fourier analysis. If we define : ℳ ! → ℝ and ! : ! ℳ ! → ℳ ! as the application to be estimate defined from the manifold of dimension ℝ (as describe in the previous section) and the corresponding application defined from its tangent space respectively. Then, the operator is defined as follows:

Data Representation: Graph Laplacian Eigenmaps
Consider a Graph G and its set f of sub functions ! defined as indicated in the previous section and this for each ( , ) ℎ ℎ. We smoothly can construct a functional by: Then data derived from graph can be naturally represented on a manifold preserving adjacency by optimisation: The associated best solution is provided by [3 Belkin Nigoyi 01]: = .
Therefore, data derived from the weighted gene co-expression network analysis of IBD can be smoothly represented on a manifold via the graph Laplacian eigenmaps. But how could this be done attending to the disease status?

Singular Manifolds of IBD Status
Similarly The geometric problem may be briefly described as follows: Let be the set of ℎ Riemannian submanifolds, smoothly selected and being compact, of intrinsic dimension 8 embedded in ℝ !" . Among the space of possible singularities, we are basically interested in two general classes: 1. intersection-type (control-active scenario): That amounts to the in pairwise smooth intersection of submanifolds. The simplest context occurs as such an intersection is done between the interior of the submanifolds, namely: 2. edge-type (active-quiescent scenario): Extreme distance points that lie on within the boundaries of two smooth submanifolds. That is Thus, the singular manifold of the IBD status can be defined as the union of 3 smooth manifolds endowed with these types of singularities.

Graph Laplacian of IBD Status
Let = { ! ,· · · , !" } be the set of i.i.d. random samples derived from the two cohorts of IBD patients distribution with density ( ) on . Then, enables to build a weighted graph ( , ) of each cohort by simply matching the ithsample point ! with the ith-vertex ! and making correspond a weight !" to edge !" . We consider in this study the Gaussian weight function with the Euclidean distance as indicates:

Belkin et al., 2012 suggests normalization by
! !! in order to hold the limit analysis. With the best fitted Gaussian bandwidth of the selected protein candidates as ℎ = 0.0384.
Let's define now the edge weight matrix of the graph G as !,! with !,! ( , ) = !,! ( ! , ! ), and !,! as the diagonal matrix defined by We may want to introduce the graph Laplacian without normalization scale as the × matrix !,! = !,! − !,! . Under all these considerations we can apply the graph Laplacian to any couple of fixed smooth function and point ∈ as follows

This is the graph Laplacian in the interior points of IBD status manifolds. But
more importantly, what would its behaviour be nearby our protein candidates where a phenotypic change produces an intersection-type singularity?

Notes on the Limit of Graph Laplacian defined upon Singular Manifolds
The limit behaviour of graph Laplacians should be observed under a double prism composed by the sample size n and the (Gaussian) kernel bandwidth .
To this end, we first address the limit ℎ → 0 for data at very great scale and later gaining insights onto finite sample and rates as → ∞ by means of concentration inequalities.
In particular, herein we study how the infinite graph Laplacian ! ( )acts when x lies on or nearby a singular point with an ℎ small and a fixed function

ODE System for IBD Status Dynamic
The dynamics of a complex network can often be represented by a set of coupled ordinary differential equations. We thus consider a -node network whose -dimensional dynamical state x is governed by = .
In general, the heterogeneity in the physiopathology of IBD progression tends to perturb any of its graphical representation. Let's get this modification underway at a time prior to a given ! , driving it to a status ! = ( ! ) in the attraction basin ℬ( ! ) of an undesirable phenotype ! . We seek then to compensate this effect by wisely identifying a perturbation ! → ! ! to be implemented at time ! , where ′ ! lies on the basin of attraction ℬ( * ) of a desirable status * . For clarity, we assume that ! * are fixed points, although the same holds to other types of attractors.

Optimisation of the Dissipation Parameter
We fit the dissipation parameters of our IBD model to data (i.e. ). In this analysis both the initial condition, s, and parameters, p, are randomly picked out thanks to a normal distribution. Upon extraction, we fit the model to the data (see Fig. 6B).
Then we calculate the estimated parameters, confidence ranges, and correlations between the parameters. Additionally, the initial condition is read from the data (instead of estimating it).
We also fit simultaneously several data sets by provide a list of data sets to the first data option. When fitting several data sets, some of the parameters could be the same across all data sets, whereas others could differ, and have a unique value in each data set (see Fig. 6B). Finally, we show how one can bootstrap the data by sampling (with recruitment) from every individual data set, and refit the samples using the best parameters as an initial guess (Fig. 6A). These calculations were implemented by tailored in-house functions and the R function derived from C-code GRIND [de Boer, R, 2017].

Compensatory Perturbations (Nonlinear Optimisation)
How to iteratively rectify unwanted paths resulted in those spaces by debugging from small compensations, as shown in Fig.7? Given a dynamical system as introduced above and an initial state ! at time ! , a small identifying the compensation ! regarding the system initial condition ! that, in the small compensation sub-space verifying ! ≤ ! , will approximate ( ! ) + ! to * the most (Fig. 7, left hand side red arrows).

Figure S2. Significance and preservation of CR graph modules. (A)
The module significance (average protein significance) of the modules. The underlying protein significance is defined with respect to the patient disease status. (B) The consensus dendrogram for replica 1 and replica 2 CR coexpression graphs. (C) The composite statistic Zsummary (Eq.9.1 in (12)). If Zsummary > 10 the probability the module is preserved is high (13). If Zsummary < 2, we can say nothing about the module preservation. In the light of the Zsummary, it is apparent there is a high correlation with the module size. The green UC module shows high evidence of preservation in its two replica graphs. Immunoglobulin mediated immune response. The protein interaction graphs were constructed using Genemania (Warde-Farley et al., 2010). Edge colouring of the graphs amount to: purple stands for Co-expression, orange for Predicted, light-blue for Pathway, light-red for Physical interactions, green for Shared protein domains and blue for Co-localisation. Boxplots of expression for the most attractive drivers respect to UC status are indicated by floating arrows at the top of each subgraph. Initial P upon the boxplot title amount to p-value associated to the IBD status correlation, whereas PC stands for positive control, i.e., proteins already described as IBD related. For the sake of simplicity, we highlighted some candidates simply by its identifiers. Figure S4. Similar patterns of protein expression in the UC green WGCNA selected module. CTSH and CDV3 barely present the same expression levels between active and quiescent UC status. The units of expression stand for normalised protein abundance.  Figure S7. Abstraction of the IBD progression. We describe the domain of IBD progression as intersection of three manifolds eventually with boundaries namely: Ωc, Ωq and Ωa to control, quiescent and active status respectively. The data geometry of IBD status can be learnt through the IBD potential introduced in the main manuscript by using in an on top space the differential structure defined by their WGCNA selected proteins.