Conserved transcriptomic profile between mouse and human colitis allows temporal dynamic visualization of IBD-risk genes and unsupervised patient stratification

Despite the fact that ulcerative colitis (UC) patients show heterogeneous clinical manifestation and diverse response to biological therapies, all UC patients are classified as one group. Therefore, there is a lack of tailored therapies. In order to design these, an unsupervised molecular re-classification of UC patients is evoked. Classical clustering approaches based on tissue transcriptomic data were not able to classify UC patients into subgroups, likely due to associated covariates. In addition, while genome wide association studies (GWAS) have identified potential new target genes, their temporal dynamic revealing the optimal therapeutic window of time remains to be elucidated. To overcome the limitations, we generated time-series transcriptome data from a mouse model of colitis, which was then cross-compared with human datasets. This allowed us to visualize IBD-risk gene expression kinetics and reveal that the expression of the majority of IBD-risk genes peak during the inflammatory phase, and not the recovery phase. Moreover, by restricting the analysis to the most differentially expressed genes shared between mouse and human, we were able to cluster UC patients into two subgroups, termed UC1 and UC2. We found that UC1 patients expressed higher copy of genes involved in neutrophil recruitment, activation and degranulation compared to UC2. Of note, we found that over 87% of UC1 patients failed to respond to two of the most widely-used biological therapies for UC. This study serves as a proof of concept that cross-species comparison of gene expression profiles enables the temporal annotation of disease-associated gene expression and the stratification of patients as of yet considered molecularly undistinguishable.


Introduction
colonic affected areas 1,2 . 48 Although there is no definitive cure for UC, there are biological therapies available which 49 target the inflammatory response during UC by means of inhibiting pro-inflammatory 50 cytokines or by blocking immune cell migration 3 . Among these, the most frequently used 51 biological therapies in UC patients block tumor necrosis factor (TNF) with anti-TNF antibodies 52 (such as infliximab, IFX) 4 or leukocyte migration (such as vedolizumab, VDZ) 5 6 . However, 53 about 35% 4,6 and 50% 5,6 of patients poorly achieve clinical response to IFX and VDZ, 54 respectively. Patients that do not respond develop adverse effects, most notably increased 55 risk of infections, thus requiring continuous medical monitoring and ultimate surgical 56 intervention 7,8 . 57 In an attempt to identify genes/pathways as a potential novel therapeutic target, genome wide 58 association studies (GWAS) have identified more than 200 polymorphisms associated with 59 higher susceptibility to IBD 9,10 . However, the function and temporal expression of IBD risk 60 genes during experimental colitis are yet to be elucidated 9,10 . 61 Furthermore, while there is an obvious clinical heterogeneity among UC patients as seen for 62 example by the location affected (i.e. distal colitis, left-sided and pancolitis, and responder 63 and non-responder) and the extent of the severity, initial treatment for these patient 64 subgroups is identical and modified only if the patients have not responded 6,8 . Biomarkers 65 that could distinguish the different entities of the UC spectrum are currently lacking and they 66 are required in order to achieve the highly needed stratification of UC patients into 67 molecularly functional subgroups 8,11 . Moreover, an unbiased stratification of UC subtypes 68 has never been accomplished at the molecular and functional levels. Here, using 69 transcriptomic data from a well-characterized experimental model of colitis we were able to 70 identify conserved genes between mouse and UC patients. As a result, we were able to gain 71 insights into IBD-risk gene kinetics and to molecularly stratify UC patients in an unsupervised 72 manner.

Human UC is highly variable at the transcriptome level 75
In order to molecularly stratify UC patients into subgroups, we combined 4 publicly available 76 human UC cohort datasets (n=102 patients), in which transcriptomic microarrays of total 77 colonic biopsies was performed 12-15 (Table1 and Fig S1). We ranked genes using the top 78 100 most variable genes and further tested whether molecular subgroups exist (Fig 1a). 79 Analysis by visual assessment of cluster tendency (VAT) 16 indicated that biopsies presented 80 high inter-sample dissimilarities (Fig1b), suggesting a poor overall tendency to form 81 consistent clusters. Dimensionality reduction analysis by tSNE using the top highly variable 82 genes also indicates the formation of a single group with no apparent subdivisions (Fig 1c). 83 Then, we further statistically tested whether multi-cluster substructures were present in the 84 dataset, since most clustering algorithms define subgroups even on random noise [17][18][19] . 85 However, bootstrapping analysis using the Hartigan's Dip test 19,20 presented a low cluster 86 substructure trend (p > 0.9), regardless of the gene ranking metrics used (Fig 1d). 87 Independently of the clustering tendency results, we forced patient subdivision using 88 hierarchical clustering and tested for cluster stability using bootstrapping 17,18,21 . In line with 89 previous results, formed clusters were highly unstable using the list of highly variable genes 90 (AU ≈ 0%) (Fig 1e). These results indicate that without prior knowledge of patient subdivision, 91 standard gene ranking strategies do not allow clustering of UC patients into molecularly 92 distinct subgroups. 93 94

Time-series reveals processes underlying colon inflammation and repair 95
One cause of such inter-patient variability can be attributed to the sampling procedure, which 96 contributes largely to the total data variance and masks real biological differences 22,23 . To 97 overcome the total data variance, we sought to identify the genes that contribute to 98 inflammation in an independent and unsupervised manner. To this end, we focused the analysis on a list of evolutionarily conserved genes that best discriminate the nuances of 100 inflammation in a well-characterized colitis mouse model 24 . 101 To identify these evolutionarily conserved genes, we first elucidated through an unbiased 102 genes compared to UC2 (Fig 3f). We also observed that neither UC1 nor UC2 subtypes were 241 discriminated by the overall macroscopic disease severity (Fig 3g) In order to characterize UC1 and UC2 beyond conserved genes, we performed differential 247 expression analysis using all genes present in the human dataset. We were able to identify 248 205 highly differentially expressed genes, among which 187 were up-regulated in UC1 and 18 249 were up-regulated in UC2 (Fig 4a). Detailed tables with information on all DEGs comparing 250 UC1 and UC2 are available for exploration (Table S8 and Fig S7a). Among those, cytokines 251 (TNF, IL11), enzymes (NOX1, MMP3, CYP26B1), calcium-binding proteins (S100A8, 252 S100A9), chemokines (TREM1, CXCL8) and other proteins related to the inflammatory 253 response (NR3C2, BCL2A1, PARM1, TNFSF13B) were clearly able to discriminate UC1 from 254 UC2 (Fig 4b and Fig S7). Enrichment analysis for cell types, GO, and KEGG pathways 255 revealed that genes highly expressed in UC1 (187) were associated with terms related to 256 neutrophil, neutrophil degranulation and cytokine-cytokine receptor interaction, respectively 257 ( Fig S7b). Venn diagram of the top enriched terms revealed many overlapping genes are 258 shared among these pathways (Fig 4c), suggesting that UC1 patients present a distinct 259 transcriptional signature enriched in neutrophil activity and cytokine signaling compared to 260 UC2 patients. 261 We trained a logistic regression classifier using each of the DEGs between UC1 and UC2 to 262 identify key genes that could be further used in the clinics for distinction of UC1 and UC2. 263 Genes were tested and scored individually using the area under the curve (AUC) as a 264 combined measure of sensitivity and specificity (Fig 4d). We observed that genes such as 265 used the patient-specific treatment response obtained 4 to 8 weeks after the biopsy was 274 taken and treatment with IFX started (Table 1). Interestingly, we observed that on average, 275 70% of the patients belonging to the subtype UC2 responded to infliximab therapy (Fig 5a) in 276 contrast to less than 10% of the patients classified as UC1, regardless of the dataset 277 analyzed (Fig 5a). 278 To extend the applicability of our findings, we made use of another set of UC patients 279 receiving vedolizumab and repeated the same procedure as before (Table 1). Transcriptomic 280 data from UC patients were analyzed using the most relevant genes identified in our mouse 281 colitis model and then clustered as described above to reveal UC1 and UC2. Between them, 282 UC1 presented a higher expression of the conserved inflammatory genes (Fig 5b). We 283 observed that about 60% of the patients belonging to the UC2 subgroup responded to VDZ, in 284 comparison to about 13% of the patients belonging to the UC1 subgroup (Fig 5c). Taken 285 together, the data indicates that patients belonging to the UC2 subgroup, which present a 286 higher percentage of response, respond to either IFX or VDZ treatment. Importantly, our 287 approach actually allows a more accurate identification of those patients with UC1, in which 288 87% of the patients are refractory to both IFX and VDZ. 289 290

291
A systematic study demonstrated that biopsy sampling was the major source of inter-patient 292 variability 22 . Therefore, such technical variations can mask real biological differences, even 293 though UC is known to present a high level of variability in macroscopic and endoscopic scoring among patients 1,2,8 . To solve this, we limited the analysis to the relevant genes for 295 inflammation including the phases of tissue repair and regeneration. By using the key DEGs 296 obtained by a mouse model of colitis, we were able to "ignore" genes that were highly variable 297 between patients (e.g. as a result of technical variation), and focus only on those that 298 contribute to inflammation. This allowed us to temporally classify IBD risk genes and 299 molecularly sub-classify UC patients into two subgroups; one of these characterized by genes 300 involved in neutrophil recruitment, activation and degranulation, and by low response to 301 biologicals. 302 Different experimental models to study mucosal immune processes associated with the 303 pathogenesis of UC are available 33,34 . Among them, the DSS-induced colitis model is broadly 304 used due to its simplicity and applicability with different therapeutic drugs 35 . Early studies 305 characterized the temporal changes by qPCR for a handful of inflammatory markers 36 , but 306 how non-inflammatory (i.e. repair-related) genes fluctuate over time during tissue repair was 307 unknown. Others had previously performed a kinetic microarray analysis only during the acute 308 inflammation phase of DSS (from days 0 to 6) 37 , but whether those genes continue to be 309 expressed during tissue repair remained unclear. Moreover, although the DSS-induced colitis 310 model has been extensively used for the study of UC, an open reference for gene expression 311 during intestinal inflammation and tissue repair was still missing. Here, we used a time-series 312 transcriptional characterization of colitis, which allowed us to identify which genes contribute 313 to most of the nuances of inflammation over time. In addition, this manuscript provides an 314 open data source that can be further investigated by others with different questions. As an 315 example, we provided a temporal assignment of IBD risk genes that might offer insight into 316 their potential functions. Finally, our data show that the DSS mouse model is a relevant model 317 for studying certain aspects of human UC. 318 Previous studies identified the molecular differences among responder and non-responder 319 IBD patients 13 . These studies were purposely biased by an a priori knowledge of the 320 responder and non-responder IBD samples. In contrast, we successfully classified the 321 patients using a completely unsupervised approach and therefore, we have potentially 322 identified genes that go beyond the responsiveness to the therapy by describing the 323 molecular signature of the identified subgroups. We were able to do this by using the key 324 DEGs found in the mouse model of colitis, by "debiasing" the human analysis, by "ignoring" 325 genes that were highly variable between patients, and by focusing only on those genes that 326 contribute to inflammation. Consequently, we identified two subpopulations of UC patients 327 responses by cytokine secretion such as TNF and IL-6 40 , which are also up-regulated in UC1 346 patients. B cell depletion using anti-CD20 antibody in a small cohort showed a trend in 347 reducing inflammation, although non-significant 41 . However, it remains possible that B cell 348 depletion might affect only UC1 patients, but not UC2. Similarly, we also observed that UC1 349 patients have a higher expression of genes involved in the JAK/STAT signaling pathway 350 (PTP4A3, SOCS3) and cytokine signaling (IL6 and IL1B), suggesting a potential role of other 351 therapies for this subgroup, such as canakinumab (anti IL-6 mAb), siltuximab (anti IL-1β 352 mAb), JMS-053 (PTP4A3 inhibitor) and others might apply. 353 In summary, we have performed an unbiased characterization of the inflammatory and tissue 354 repair processes using a mouse colitis model, providing a useful resource for understanding 355 colonic inflammation. Many of the genes identified in mice were also detected in human UC 356 patients, thus allowing us to explore the temporal expression of IBD risk genes during the 357 course of inflammation and gain useful insights into their potential function. Furthermore, they 358 allowed us to identify for the first time two clinically relevant molecular ulcerative colitis 359 subsets (UC1 and UC2) in an unsupervised manner. Thus, our methodology identified two 360 molecularly distinct UC subgroups and will serve as a proof of concept for the use of 361 transcriptomic data from highly controlled mice experiments to perform unsupervised and 362 biologically-driven analysis of highly variable human datasets. 363 364

365
All methods used in this paper are described in the Online Methods linked to this manuscript. 366 367 Acknowledgements 368 We would like to thank Stefan Bonn, Samuel Huber and Charlotte Hedin for critical reading 369 and suggestions on the manuscript. We thank Elaine Hussey for editorial assistance.    further proceeded for abundance estimation using Kallisto 5 . 46 read count less than 5 in at least 3 samples were considered with low expression and filtered 48 out. Differences in tissue cell composition that could affect transcriptional pools were 49 balanced by means of removing unwanted variation based on negative control genes using 50 the RUVg function implemented in RUVseq package 6 . Analysis revealed that library sizes 51 strongly correlated with several known intestinal housekeeping genes, such as Hprt (r=0.87) 52 and Gapdh (r=0.85), but not Actb (r=0.68). Moreover, genes such as Cd63 (0.94), Trappc 53 (r=0.97), and Cpped1 (0.97) and Slc25a3 (r=0.96) correlated even more strongly to the library 54 sizes, indicating potentially novel housekeeping genes during colonic inflammation. Negative 55 controls genes were thus defined as genes with positive Pearson correlation above 0.9 to 56 their respective sample library sizes. Estimated unwanted variation vectors were then used as 57 covariates for calculation of differentially expressed genes (DEGs) using EdgeR package 7 . 58 EdgeR is specialized in performing time-series differential expression by means of 59

RNA-seq dataset 75
To identify which genes are shared between mouse and human ulcerative colitis, we 76 compared the list of DEGs identified by in the DSS dataset and the list of genes identified by 77 Taman et al. 11 . Mapping of IBD risk genes was done using the list of IBD risk genes identified 78 by fine-mapping at the single loci resolution 12 . Identification of enriched GO processes and 79 KEGG pathways was done using enrichR 10 . 80 81

principal components 83
To investigate whether the nuances of inflammation observed in the mouse model could also 84 be found in humans, we made use of four human microarray datasets from GSE12251 13 , 85 GSE73661 14 , GSE23597 15 and GSE16879 16 . Combined, these datasets contain gene 86 expression and metadata of 447 patients, containing information such as disease type (UC or 87 CD), Mayo macroscopic score, the therapy given, when the sample was collected and the 88 response to infliximab (IFX) or to vedolizumab (VDZ). Across all datasets, patients were 89 considered inflamed if presenting a Mayo score of 2 or 3 (out of 3). Similarly, patient were 90 considered to respond to therapy when it respective Mayo score reduced to 0 or 1, between 91 4-8 weeks of treatment with IFX or between 6-52 weeks of treatment with VDZ. For this study, 92 we included only patients with UC before receiving any therapy (either IFX or VDZ), 93 comprising a total transcriptional profiles of 143 patients, of which 102 received IFX and 41 for 94 VDZ. The list of samples used in this study is supplied as metadata table (Table S9). 95 Probes with log2 fluorescence count lower than 6 in at least 10 samples were excluded from 96 the analysis. Batches between dataset were observed and corrected using the ComBat 97 function in SVA package 17 . Selection of genes for further exploration was done by different 98 approaches: 1) using all genes; 2) using only the top 100 highly variable genes; 3) using the 99 genes with top 100 high dispersion; 3) The gene with high loading in principal component 1 100 and; 4) The gene with high loading in principal component 2.