ColPortal, an integrative multiomic platform for analysing epigenetic interactions in colorectal cancer

Colorectal cancer (CRC) is the third leading cause of cancer mortality worldwide. Different pathological pathways and molecular drivers have been described and some of the associated markers are used to select effective anti-neoplastic therapy. More recent evidence points to a causal role of microbiota and altered microRNA expression in CRC carcinogenesis, but their relationship with pathological drivers or molecular phenotypes is not clearly established. Joint analysis of clinical and omics data can help clarify such relations. We present ColPortal, a platform that integrates transcriptomic, microtranscriptomic, methylomic and microbiota data of patients with colorectal cancer. ColPortal also includes detailed information of histological features and digital histological slides from the study cases, since histology is a morphological manifestation of a complex molecular change. The current cohort consists of Caucasian patients from Europe. For each patient, demographic information, location, histology, tumor staging, tissue prognostic factors, molecular biomarker status and clinical outcomes are integrated with omics data. ColPortal allows one to perform multiomics analyses for groups of patients selected by their clinical data.


Methods
Web platform. Our proposal is based on our previous results and experience working with biomedical data 20,21 . ColPortal (https://colportal.imib.es) aims at providing an open web platform where users can download, visualize and analyse clinical and omic data in an integrated way. One of the main features of ColPortal is the possibility of making integrated analyses in real time. ColPortal has been developed using Java and R technologies and deployed in a HP Proliant DL360 G9 server with two Intel xeon processors with 12 cores (hyperthreading) and 256 GB RAM. Figure 1 shows the technological architecture of ColPortal, whose main modules are described below: • Web Interface. This is a web application developed using Java Server Faces technologies. Web standards (HTML, CSS) have been used to display the data. This means that any user can use ColPortal without requiring any third party software. • Integration and Analysis Engine. This module allows the communication between the web interface and the several data sources. The clinical data are used for generating the cohorts for the personalized analyses. In the current implementation, methylome data is used in every analysis, because this is the omics type for which we have the largest amount of samples. Methylome is actually the omic data that requires more computational time. For example, the analysis of normal/tumoral cases may take up to 20 minutes using the server described above. Consequently, methylome data have been pre-analysed to ensure that real-time multiomics analyses can be executed in ColPortal. Furthermore, this module is able to recover the data filtered by the user, perform statistical analyses using R libraries, and load the results in an in-memory database, so the data can be used by visualization methods or reused in new studies. • Relational Database. A MySQL relational database contains information about clinical cases, molecular features, tissue images and the links to the omics files for the integrative analysis. • Linked Data Database. We use Virtuoso as a Linked Data database. In this case, all the data generated in differential methylation analysis are stored in RDF to facilitate the exploitation of this large dataset (450,000 probes by clinical case). • In-memory Database. All the data results from analysis generated in real time are stored in an Sqlite database. This database engine allows the use of an in-memory temporary database. This feature is very interesting to increase performance. • RAW data repository. The entire raw dataset from the different omics is stored in a server folder. The corresponding paths are stored together with the clinical cases in the relational database. • R scripts repository. The analysis of omics data is supported by R scripts. Parameterized R scripts are stored in a folder of the server. When the user selects the information to be analysed, we use the R scripts as templates to generate the concrete code. • Zoom-in images repository. All the images that are zoomed in the visualization tool are stored in a server folder and indexed with the clinical cases in the relational database.
types of analysis. Currently, ColPortal allows the following types of analysis: • Multiple Correspondence Analysis (MCA). FactoMineR package is used for multiple correspondence analysis 22 . Factoextra package 23 is used to visualize the results of MCA analyses. • Differential gene and miRNA expression analysis. Limma from Bioconductor is used for differential expression for microarray data from genes and miRNAs 24 . In addition, it allows us to standardize the data before performing the analysis. In this context, the Tidyverse package 25 is used for the efficient loading of large data files. • Differential methylation analysis. Minfi from Bioconductor is used to analyse Illumina DNA methylation arrays 26 . We use Minfi to standarize the data before performing analysis. IlluminaHumanMethyla-tion450kanno.ilmn12.hg19 27 and IlluminaHumanMethylation450kmanifest 28 packages are used to annotate each probe with its region (1st exon, 5′ UTR, etc) and type (CpG island, OpenSea,…) in the genome. To perform the differential methylation analysis we also use Limma as a tool for contrasting differences between the two groups. • Correlation analysis between microbiome and methylome. For this analysis, we use the corrplot package 29 .
This method returns a table with the selected bacterial genus and genes and the correlation value between their abundance and methylation values. With these results, we use the FactoMineR and Factoextra packages for principal component analysis (PCA) and their visualization, and we also use the cluster 30 and ape 31 packages for hierarchical clustering. www.nature.com/scientificdata www.nature.com/scientificdata/ External resources. To enrich the results of the analysis, ColPortal integrates the following external resources: • Human Phenotype Ontology (HPO). This is an ontology whose main objective is to offer a common vocabulary for the annotation of genes and proteins 32 . HPO is used to link human phenotypic abnormalities with the genes/proteins that cause them or have some kind of influence on them. • KEGG. This is a set of databases to perform functional analysis of genes 33 . In the context of ColPortal, this database will allow us to know the metabolic pathways in which a gene or protein participates. • MirTarBase. This is a database that stores the results of experiments that have been performed to validate the interaction between miRNA and its targets 34 . This tool classifies evidence as strong and weak depending on the type of validation experiment performed.
Patients and tumour samples. In order to have a wide representation of different histological subtypes of CRC, a previously described series enriched in less frequent CRCs was included in this platform 9,35 . CCs were diagnosed as previously described 36 ; SACs were diagnosed on the basis of criteria proposed by Mäkinen et al.
(epithelial serrations, clear or eosinophilic cytoplasm, abundant cytoplasm, vesicular nuclei, absence of or less than 10% necrosis of the total surface area, mucin production, and cell balls and papillary rods in mucinous areas of a tumour) 8 and hmMSI-Hs according to prior established criteria (mucinous, signet-ring cell and medullary carcinoma, tumour infiltrating and peritumoural lymphocytes, "Crohn-like" inflammatory response, poor differentiation, tumour heterogeneity, and "pushing" tumour border) 5 . None of the hmMSI-H showed serrated morphology. The study cases consisted of a cohort of 48 SACs, 41 CCs 15 hMSI-H and 58 colorectal polyps retrieved from the Santa Lucia University Hospital, Cartagena, and 18 SACs and 9 CCs from Oulu University Hospital, Oulu, Finland. From these subjects, data from 61 adjacent normal mucosa samples were also included. From all the cases, multiomic data was obtained from frozen tissue specimens. Overall survival (OS) and disease free survival (DFS) were calculated in months as previously described 9,11 . The study was approved by the Hospital Ethics Committee and was carried out in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments. Written informed consent was obtained from all patients. In order to diminish the bias of tumor heterogeneity all samples for DNA-and RNA-based high throughput techniques were taken from the center of the tumor after performing a staining from the frozen tissue block to ensure that no necrotic areas were included.
Preparation of digital histological slides. Hematoxylin eosin histological preparations were obtained from selected tumor paraffin-embedded blocks showing tumor invasive front and representative morphological features. These slides were digitized at x20 (1.5 GB mean size) using an Aperio AT2 scanner. Digital images are accessible through Clinical cases/List view. The size of the generated images is between 500 MB and 3 GB. If we upload directly the original images it would be unfeasible for users to see them from the portal. They would have to download them, and install software to see them correctly. We avoid such actions by zooming images using Libvips (https://jcupitt.github.io/ libvips/) and the web viewer OpenSeadragon https://openseadragon.github.io/. These technologies have been applied in similar previous works 37-39 . DNa extraction. A volume of approximately 10 mm 3 was extracted from each frozen tissue using the disposable sterile biopsy punch. DNA was extracted following the manufacturer's instructions (QIAGEN, Hilden, Germany). Briefly, tissue was disrupted and homogenized in ATL buffer using a Tissueruptor (QIAGEN) incubated with proteinase K, and the homogenate was subjected to automatic DNA extraction using the Qiacube equipment and the QiaAmp DNA Mini Kit (Cat No: 51306), both provided by QIAGEN.
Oncogene mutation testing. These methods are expanded versions of descriptions in our related work 35 .
DNA samples were diluted to 5 ng/μl concentration and subjected to allelic discrimination using TaqMan probes for BRAF V600E detection and those cases with no V600E mutation were directly sequenced for BRAF exon 15 as described previously 35 . KRAS mutations at codons 12 and 13 were determined by denaturing high-performance liquid chromatography (dHPLC). A fragment of 92 bp was amplified in a 25 μl volume containing 2 mM MgCl2, 1 mM dNTPs, 10% DMSO and 2U of TaqGold polymerase (Applied Biosystems, Foster City, CA), and 1 μM of primers KRASf and KRASr. To improve the detection of sequence changes in the amplified product, a Guanine-Cysteine clamp was anchored to the 5′ end of the forward primer. The reverse primer contained the M13  Before dHPLC analysis, PCR products were heated to 95 °C for 10 min and then slowly cooled to room temperature to allow heteroduplex formation. Five microliters of the PCR product were then injected into a preheated reverse-phase column (Helix DVB, Varian Analytical Instruments, 2700 Mitchell Drive, CA) equilibrated by triethylammonium acetate (TEAA) 0.1 M in a Helix ProStar dHPLC instrument (Varian Analytical Instruments, 2700 Mitchell Drive, CA) as previously described 5 . All cases with a curve profile different than KRAS native were confirmed by sequencing using M13 universal primer. The mutation status of exons 9 and 20 in the PI3KCA gene was determined by direct sequencing after a nested-PCR. External and internal PCR were performed in a total volume of 20 μl containing 0.2 mM dNTPs, 2 mM MgCl2 0.025 U/μl GoTaq Hot Start Polymerase (ref: M5001, Promega, Madison, WI) and 0.5 μM of each primer. In the first PCR, 2 μl of DNA template were used with the primers PI542-5EF and PI542-5ER for the exon 9 and with PI1047EF and PI1047ER for the exon 20. Nested PCR was performed using 2 μl of the first PCR and the following primers: PI542-5IF and PI542-5IR for exon 9 and PI1047IF and PI1047IR for exon 20. According to the manufacturer's instructions, amplicons were purified using the QiAquick 96 PCR purification kit (ref: 28181, Qiagen, Hilden, Germany) and were subsequently subjected to direct sequencing using the primer PI542-5IF for exon 9 and the PI1047seq. Primer sequences are provided in Table 1.

Microsatellite instability and CpG island methylation phenotype (CiMP)
. MSI was evaluated, as previously described 35  CpG island methylation phenotype (CIMP) assessment. A hundred nanograms of DNA were denatured in a total volume of 5 ml Tris-EDTA buffer, and further performed as recommended by the supplier (MRC-Holland, Amsterdam, the Netherlands). Methylation-specific multiplex ligation-dependent probe amplification (MS-MLPA) is a method for the simultaneous detection of methylation at eight genes firmly associated with CIMP (CACNA1G, IGF2, NEUROG1, RUNX3, SOCS1, CDKN2A, MLH1, and CRABP1) in one reaction 41 . In short, a mixture of probe-mix (ME042-B1 CIMP) and buffer were added to the denatured DNA, and probes  www.nature.com/scientificdata www.nature.com/scientificdata/ were allowed to hybridize to the DNA at 60 °C for 16 hours. Each sample was divided into two tubes, in which one half was ligated, and the other was ligated and digested using the methylation-sensitive restriction enzyme HhaI. Both samples were subsequently subjected to a PCR reaction using a thermal cycler (GeneAmp 2700, Applied Biosystems, Foster City, CA, USA), and fragment analysis performed on a capillary sequencer (ABI 3130xl, Applied Biosystems, Foster City, CA, USA). DNA from normal colonic mucosa was used as normal reference. The output from the analysis, after inter-and intrasample normalization, is a percentage of methylation in the sample. Partially methylated genes were considered as methylated. The Ogino and Weisenberger criteria for CIMP status assessment were applied and compared (CIMP(O) and CIMP(W), respectively) 42,43 considering only 5 genes from the panel (CACNA1G, IGF2, NEUROG1, RUNX3 y SOCS1), the difference between no-CIMP and high-CIMP being ≤3 and >3 methylated genes, respectively.
Bisulfite treatment and DNA methylome assay. These methods are expanded versions of descriptions in our related work 12,13 . HumanMethylation450K BeadChip (Illumina, Inc., San Diego, CA), using Infinium HD Methylation assay for genome-wide DNA methylation screening, was employed. In brief, genomic DNA (1000 ng) from each sample was bisulfite converted with the EZ DNA Methylation Kit (Zymo Research, Orange, CA) according to the manufacturer's recommendations. Bisulfite-treated DNA was isothermally amplified at 37 °C (20-24 h), and the DNA product was fragmented by an endpoint enzymatic process, then precipitated, resuspended, applied to an Infinium Human Methylation450K BeadChip (Illumina, San Diego, CA, USA) and hybridized at 48 °C (16-24 h). The fluorescently stained chip was imaged by the Illumina i-SCAN and Illumina's Genome Studio program (Methylation Module) was used to analyse BeadArray data to assign site-specific DNA methylation β values to each CpG site. Total RNA was quantified by spectrometry (NanoDrop ND1000, NanoDrop Technologies, Wilminton, DE) and fragment size distribution was analysed by RNA 6000 Pico Bioanalyzer assay (Agilent Technologies, Palo Alto, CA). RNA (150 ng) was concentrated in a SpeedVac to a working dilution and used to produce cyanine 3-CTP-labeled cRNA using the Low Input Quick Amp Labeling Kit, One-Color (Agilent p/n 5190-2305) according to the "One-Color Microarray-Based Gene Expression Analysis" protocol Version 6.0 (Agilent p/n G4140-90040). This method uses T7 RNA polymerase which simultaneously amplifies the target material and www.nature.com/scientificdata www.nature.com/scientificdata/ incorporates cyanine 3-labeled-CTP. A 2,000 ng cRNA product was hybridized with a Whole Human Genome Oligo Microarray Kit (Agilent p/n G2519F-014850) containing 41,000 unique human genes and transcripts. Arrays were scanned in an Agilent Microarray Scanner (Agilent G2565BA) according to the manufacturer's protocol and data extracted using Agilent Feature Extraction Software 10.7.1 following the Agilent grid template 014850_D_F_20100430 protocol GE1_107_Sep09 and the QC Metric Set GE1_QCMT_Sep09. miRNa microarray assay. miRNA microarray profiling was conducted using Affymetrix GeneChip miRNA 3.0 arrays (Affymetrix, Inc., Santa Clara, CA, USA) containing 5,607 probe sets for human pre-miRNA, miRNAs, small nuclear RNA, and small Cajal body-specific RNA. These Affymetrix miRNA arrays provide 100% coverage of the miRNAs in the miRBase (version 17; http://www.mirbase.org). Probe sets that were deleted in a more recent version of miRBase (version 21) were excluded from the analysis. All steps of the procedure were performed according to the Affymetrix standardized protocol for miRNA 3.0 arrays.  www.nature.com/scientificdata www.nature.com/scientificdata/ Strengths and limitations. Based on the principle that the histology is a complex manifestation of genetic and epigenetic changes, the main goal of the included series is to provide a representation of different histological subtypes of colorectal cancers with their associated "omic" profiles. Those profiles have been obtained enriching certain subtypes, such as serrated adenocarcinoma and colorectal carcinomas, with histological features of microsatellite instability. Therefore, it does not represent the general frequency of these subtypes found amongst non-selected patients with colorectal carcinoma. The benefits of ColPortal for clinicians is the possibility of correlating not very common CRC subtypes with clinical features such as tumor presentation and prognosis; for pathologists it allows to associate histological characteristic of CRC with particular molecular changes. Finally, researchers can generate proof of concepts and establish relationships amongst genetics and epigenetics features at different levels (e.g. microbiota and microtranscriptomics; microbiota and methylomics, etc). The main advantage compared to the TCGA database is that ColPortal allows the integration of microbiota and methylome data in cancer and precursor lesions. On the other hand, the main limitation of ColPortal is the number of cases included for each platform. Nevertheless, once the frame and data acquisition has been created new cases and datasets will be incorporated. Another limitation of the portal is the absence of proteomic data especially post-translational modifications which are important elements affecting protein functionality and therefore the tumor histology and biology.

Data Records
In Table 2 we detail the types of datasets and their number of cases/samples used in the analyses. Next, we describe this data and their availability:

technical Validation
The technical validation of this cohort includes the use of ColPortal as an anonymous user. A glossary of the terminology used in ColPortal is shown in Table 3.

Clinical cases.
First of all, we will validate whether the data referring to clinical cases are capable of correctly classifying the different types of samples from the portal. To perform this task, from the clinical cases section, the user can perform an MCA analysis selecting the classification variables (Classes) and the variables that will be used for generating that classification (Variables). Figure 2 shows the result of an MCA analysis 22 with the tumoral status classification (variable disease status as class) using age group, sex, general localization and TNM staging (age, sex, general localization, T stage, N stage, M stageas variables). In this case, the two groups are clearly separated using data from the clinical cases.
Gene regulation. ColPortal allows the generation of customizable analyses over RNA and miRNA data from Transcriptome and MicroTranscriptome sections. In this validation case, we have used the preloaded methylome information, where we have differential methylation data between normal and tumoral samples. The cohort selection includes normal and tumoral samples with RNA and Methylome data. In this case we have 2 normal cases and 24 tumoral cases. We use the Limma package 24 to perform the differential expression analysis. The analysis returns 5,309 differentially expressed genes with adjusted p-values of less than 0.05. The data can be filtered by gene name, KEGG pathway 33 or by human phenotype (from Human Phenotype Ontology 32 ) and ordered using different criteria.
For this example, we first filter the genes using keywords such as cancer, colon, rectal and rectum for human phenotypes. Then we ordered the genes by absolute value for log of fold change and visualized (pressing the eye button) the genes with the highest value. As a result we obtained the gene expression heatmaps for the resulting 8 genes and the median methylation for each gene. Figure 3 shows the gene expression.
The results show that the expression of some genes are inversely correlated with methylation values, which validates the quality of data. However, the correlation with methylation for genes such as CDKN1C and EPCAM is not clear. For these cases, the user has two options: • A more detailed exploration of the methylation pattern of this gene. This can be done in the "Methylome section", "Methylome Gene Visualization" tab, filtering by gene regions, CpG islands, transcription start site, etc. Figure 4 shows the result of filtering by gene EPCAM. The user can also see the comparison between normal and tumoral (see Fig. 5). • A differential expression analysis of miRNAs and check if those genes are regulated by this molecular mechanism. This can be done in the "MicroTranscriptome" section, where you can filter by different features. Figure 6 shows the hsa-miR-221 expression. Our platform uses data from MirTarBase 34 to know the targets validated in laboratories and CDKN1C is one of them. EPCAM cannot be validated because methylation and miRNAs do not only regulate gene expression.
Methylome and microbiome interaction. In this case, we use the Microbiome section for selecting microbiome genus and methylated genes using the following filters: (1) genus abundance greater than 20 and (2) genes with an absolute value of log of fold change greater than 1 in the colorectal cancer pathway. With this filter we recovered 8 genus and 34 genes. Figure 7 shows the correlation plot between microbiome and methylome using the previous filtered data. www.nature.com/scientificdata www.nature.com/scientificdata/ Now, the user can filter correlated genus and gene and perform a Principal Component Analysis 22 and a Hierarchical Clustering 30 analysis. Figure 8 shows that the pseudomonas genus (the genus with more correlations) allows us to correctly separate normal and tumour samples using also their methylation values. In 57 we found a study that describes the prevalence of a pseudomonas species in patients with cancer.

Usage Notes
ColPortal allows to analyse in an integrated way all the data of this cohort. However, if users want to reuse the data for other studies in Figshare all the data and R scripts are available to perform the analyses 46,48,49,51,53,56,58 . All the omic raw data can be linked with the metadata from clinical cases using the field labeled Id.

Code availability
These are the scripts 58 used in the different analyses: • mRNA expression script. This script reads preprocessed and normalized expression data and makes a differential expression (DE) analysis between the samples defined in the parameter URL ### ###. As a result, a table of DE genes is generated. A second table with the normalized gene expression values per sample is also generated.
• miRNA expression script. This script reads preprocessed and normalized miRNA expression data and makes a differential expression (DE) analysis between the samples defined in parameter URL ### ###. As a result, a table of DE miRNA is generated. Likewise, a second table with the normalized gene expression values per sample is generated.
• Microbiome correlation script. This script generates a correlation plot between microbial abundance and methylation data f rom selected DM genes obtained f rom samples previously f iltered.
− microbioma Genes Methylated ### ### is a table whose variables are the microbiome data and the methylation levels of the genes for the selected samples. An extra variable called "class" contains the labels of the classes. It is also necessary to know the number of microbiome variables (###numgenus###) and the number of genes (###numGenes###).
• Microbiome PCA script. This script generates a plot resulting from a principal components analysis (PCA) using the methylation values of the selected genes in the filtered samples, and the selected genus(s) (or families, or species). The color distinguishes the different groups of samples.
• Methylation script. This script performs a differential methylation (DM) analysis using raw data from Illumina microarrays. The results are a DM gene table and the methylation values (beta values) of each sample.
• MCA plot script. This performs a multiple correspondence analysis with the selected variables. The result is a graph with the relations between categories.
All scripts are developed in R 59 and they are used as templates, which are instantiated using the values defined by the user. The following R packages have been used: limma 24