DNA methylation profiling reveals common signatures of tumorigenesis and defines epigenetic prognostic subtypes of canine Diffuse Large B-cell Lymphoma

Epigenetic deregulation is a hallmark of cancer characterized by frequent acquisition of new DNA methylation in CpG islands. To gain insight into the methylation changes of canine DLBCL, we investigated the DNA methylome in primary DLBCLs in comparison with control lymph nodes by genome-wide CpG microarray. We identified 1,194 target loci showing different methylation levels in tumors compared with controls. The hypermethylated CpG loci included promoter, 5′-UTRs, upstream and exonic regions. Interestingly, targets of polycomb repressive complex in stem cells were mostly affected suggesting that DLBCL shares a stem cell-like epigenetic pattern. Functional analysis highlighted biological processes strongly related to embryonic development, tissue morphogenesis and cellular differentiation, including HOX, BMP and WNT. In addition, the analysis of epigenetic patterns and genome-wide methylation variability identified cDLBCL subgroups. Some of these epigenetic subtypes showed a concordance with the clinical outcome supporting the hypothesis that the accumulation of aberrant epigenetic changes results in a more aggressive behavior of the tumor. Collectively, our results suggest an important role of DNA methylation in DLBCL where aberrancies in transcription factors were frequently observed, suggesting an involvement during tumorigenesis. These findings warrant further investigation to improve cDLBCL prognostic classification and provide new insights on tumor aggressiveness.


SUPPLEMENTAL MATERIAL AND METHODS cDLBCL cohort
Clinical features of dogs affected by DLBCL are reported in Table S1.

Sample labeling and hybridization
Enriched-fraction and total gDNA (reference) obtained from 48 lymph node samples (40 cDLBCLs and 8 control dogs) were labelled independently with cyanine 5-deoxyuridine triphosphate (dUTP) and cyanine 3-dUTP, respectively. Sample labeling was performed by using SureTag Complete DNA Labeling Kit (Agilent Technologies) following the manufacturer's recommendations. For each sample, equal amount of enriched (Cy5-labeled) and reference (Cy3-labeled) DNAs were cohybridized to the microarray platform. Arrays were scanned at 3µm resolution using an Agilent G2565CA scanner, and image data were processed using Feature Extraction version 10.7 with CGH-1200-Jun14 protocol (Agilent Technologies).

Data Quality Control and Preprocessing
A total of 58 probes exhibiting signal saturation and one cDLBCL sample not valid according to Methylation Microarray QC metrics (Agilent Technologies) were filtered out. The MedianSignal of the probes was considered for further preprocessing. The ProcessedSignal provided by Agilent Feature Extraction algorithm was not employed since it was characterized by a higher overall variability. Differences on signal variability were observed between Cy3 and Cy5 signals, probably due to the capture/enrichment step performed on DNA of Cy5-labeled samples. To adjust the Cy3/Cy5 dye bias, Loess normalization was applied to each dye using the information between-array to remove intensity-dependent trends, but without scaling the overall median signal towards zero. Specifically, the Loess curve was estimated keeping the within-array median value calculated across the probes. After dye bias correction, quality assessment of the resulting log2-signal ratios was performed using the arrayQualityMetrics package in Bioconductor (http://www.bioconductor. org). Principal Component Analysis (PCA) was also considered to evaluate anomalies among the samples. Three samples (i.e. Dog#1, Dog#10 and one control dog) failing quality controls on MA plots, box-plots and between-array distances were excluded from the analysis. On the remaining arrays, between-samples Quantile normalization was applied to the corresponding log2-signal ratios. The distribution of the median log2-ratios calculated across samples was characterized by two clear peaks ( Figure S1): the first one (log2-ratios > 2) represented hyper-methylated probes, while the second one (0 <log2-ratios <1) represented probes showing a methylation level similar to the reference. The latter was not centered to zero since the MedianSignal included the background noise that may be different between the dyes. Therefore, in order to subtract the background noise and having this peak centered to zero, the overall signal was finally scaled by a factor equal to 0.85.
To estimate the peaks, we used the function findPeaks of R package quantmod. Figure S1. Density plot of the median array calculated from the normalized data.

Details on statistical analyses
F-test: Since cDLBCL is a heterogeneous disease, driven by perturbations of different molecular pathways, and varying from individual to individual, epigenetic instability or the loss of epigenetic control of important genomic domains can lead to increased methylation variability, not always associated to a difference of methylation levels. Recently, it has been found that differential variability between normal and cancer tissues can be very useful for identifying methylation markers of cancer (Hansen et al., 2011). Therefore, for the clinical factors differential variability was tested using the F-test, one of the most popular approaches for testing the equality of variances.
Multivariate linear regression: Linear combinations of clinical/pathological factors significantly associated to methylation level were investigated through a multivariate linear regression model.
Specifically, the factors of the final model were selected with a step-down procedure: all the factors were initially included in the full model considering main effects only, then they were sequentially removed if their removal did not result in a significant change in fitting the data, using F-test.
Analysis of methylation disruption: Starting from methylation data in cDLBCL samples and control lymph nodes a matrix X was defined describing the methylation changes for each sequence j and each cDLBCL sample i as xij = yijzj, which is the methylation difference between the sample i and the median methylation zj calculated across the 7 control lymph nodes at sequence j. PCA analysis was performed on xij values as preliminary analysis of the variability in methylation changes. The methylation variability profile for each cDLBCL sample i (MVP) was then defined as the density function fi(x) across all the regions represented on the array. The function was estimated using the density() function in R with bandwidth parameter 0.1 33 .
To define a distance matrix for the clustering, the squared L2-distance between the MVP density functions were calculated for all pairs of patient samples. This distance represents the squared difference in the area under the curve between two samples and is approximated using the Trapezoidal rule (Supplementary Material in Chambwe et al. 33 ).
Consensus clustering was then performed on this matrix applying Ward's linkage, using R package ConsensusClusterPlus (Wilkerson et al., 2010). Specifically, HCL was performed 1,000 times on resampled subsets of the cDLBCL samples (using 80% of samples as subset) and evaluated the number of clusters k=2,3……15. We note that the relative change in area under the cumulative distribution functions of the consensus matrix (described in Methods) for each k is maximum at 3, indicating the best separation of the clusters.
Finally, to provide also a quantitative measure of the magnitude of methylation disruption observed in each sample, Methylation Variability Score of cDLBCL sample i was defined as the deviation of each cDLBCL MVP describe by fi(x) to that of the expected MVP of a control lymph node, described by the mean density function ( ) 33 :

Annotation and functional analysis
In order to improve the biological interpretation of the significant sequences, CanFam3 annotations from both RefSeq and Ensembl retrieved from UCSC table browser were associated to each sequence. In particular, we first checked whether each sequence overlaps at least one of the following genomic locations: 5'-UTR, 3'-UTR, exonic, intronic, promoter/upstream (i.e. 2k/10k bases upstream from transcription start site, respectively), downstream (i.e. 2k bases from transcription start site). If the sequence did not overlap any of these locations (i.e. it is an intergenic region), the nearest gene was associated, assuming possible distal regulatory effects on the associated gene.

Figure S2. Distribution of target sequences (CpG and CDS) across the dataset. Percentages
with respect to the total number of sequences in the chip are reported. The sum of these percentages is not equal to 100% since each sequence can overlap more than one genomic region.
In order to identify enriched genomic locations with respect to the selection of the differentially methylated sequences, Fisher's Exact test was performed on the number of the selected sequences with respect to the total number of sequences available in the microarray platform.
Finally, the biological terms from Gene Ontology (GO) and KEGG pathways able to significantly characterize the selected sequences were identified by performing an enrichment analysis.  Shaffer et al. 2006). Pathway Enrichment analysis was performed on each collection independently, T-test metric was employed for gene ranking, and 1,000 permutations were applied for p-value assignment.

Methylation specific PCR (MSP)
A technical validation of microarray platform by methylation-specific PCR (Hernández et al. 2013) was performed on 5 differentially methylated genes (FGFR2, HOXD10, RASAL3, CYP1B1 and ITIH5). On the CpG islands of these genes (Table S2) Statistical analyses were performed using a commercially available statistical software program (SPSS v20.0). Data were analysed using a non-parametric statistical method because of the limited number of cases. Sample methylation levels, were evaluated for significant differences between controls and cDLBCLs using the Mann-Whitney test.

Gene expression analysis of CLBL1 cells treated with hypomethylating agents
A functional validation of microarray data was performed evaluating the mRNA expression For each target transcript, gene-specific primers encompassing one intron were designed (see Table   S3). Two internal control genes (ICGs: GOLGA1 and CCZ1) previously published in Giantin et al. (2013) and Giantin et al. (2016) were selected.
The qPCR reaction was performed in a final volume of 10 μL, using 12.5 ng of cDNA, the Power SYBR Green PCR Master Mix (Life Technologies, Carlsbad, California, United States) and a Stratagene Mx3000P thermal cycler (Agilent Technologies, Santa Clara, California, United States).
Standard qPCR conditions were used, except for the analysis of CADM1 and CDH11, for which Ct values were acquired at 78°C to eliminate primer dimers contribution to the amplification plot.
Different concentrations of forward (F) and reverse (R) primers were tested. The presence of specific amplification products was confirmed by dissociation curve analysis. For each qPCR assay, negative controls (with total RNA or water as template) and positive controls (the cDNA of 6 canine control lymph nodes) were run. Standard curves were obtained using the best performing primer concentration and serial dilutions of control lymph node cDNA. Each dilution was amplified in duplicate. The ∆∆Ct method (Livak and Schmittgen, 2001) was used for the analysis of gene expression results.
Statistical analysis was performed using GraphPad Prism version 5.00 for Windows (GraphPad Software, San Diego, USA). Data were analysed using unpaired t-test. A P value < of 0.05 or less was considered as statistically significant.

Data preprocessing
The combining Loess and Quantile normalization pre-process used to normalize the methylation data, was able to overcome some pitfalls of the data distribution. In Fig. S3, the position of Dog#19 is shown both in the PCA and MA plots with respect to the median across the samples, highlighting the difference obtained by the two normalization approaches. In Fig S3 A. and C., only Quantile normalization is applied, whereas in Fig S4 B. and D. Loess approach is combined with Quantile. In the latter, Dog#19 clustered with the cDLBCLs group. Indeed, this sample showed the most evident dye-bias ( Figure S4) generated by the fact that Cy5 signal was altered by the experimental enrichment step generating an additional bias compared to the Cy3-labeled reference. This trend was observed in more than half of the samples of the dataset. The Loess normalization step was able to determine the adjustment of this bias using the information from each dye. Furthermore, since Quantile normalization assumes a common distribution of data, the Loess-normalized data were characterized by a more similar between-array distribution compared to the raw data ( Figure S5), thus allowing Quantile-normalization to have the best fit.

Microarray data technical validation
In order to quantify the ratio of methylated to unmethylated alleles, the ΔCT (=CT_Meth -CT_NoMeth) value was determined (Table S3) as described by Zeschnigk et al. (2004). The Mann-Whitney test comparing Meth/No-Meth primer pairs showed a significant hypermethylation between the two groups for HOXD10, RASAL3 (p<0.001), CYP1B1 and ITIH5 (p<0.01), while FGFR2 was only marginally significant (p=0.07).

Microarray data functional validation
All primer pairs for gene expression analysis had an acceptable efficiency (range 90 % ÷ 110 %), and a slope in the range of -3.6/-3.1. The main features of the validated qPCR assays are reported in Table S4. The effect of AZA and DEC treatment on CADM1, CDH11 and ABCB1 mRNA expression are summarized in Figure S6-S8, respectively.
Selected genes were all constitutively and highly expressed in the control lymph nodes, while in the B-cell lymphoma cell line (CLBL1) they were almost completely silenced. Following the treatment with AZA, the mRNA expression was significantly restored (P<0.05 or less). Conversely, DEC affected the re-expression of ABCB1 (P<0.01), while did not exert any effect on CADM1 and CDH11 (data not shown).    Table S2.xls Differentially hyper-and hypo-methylated sequences on cDLBCL samples with respect to normal lymph nodes. Columns H-I report the median methylation level across cDLBCL and normal samples, respectively. Columns K-R report the overlapping Refseq (K-N) and Ensembl (O-R) transcripts, considering for each transcript the following genomic locations: exon, intron, 5'-UTR, 3'-UTR, "proxUP" (i.e. until 2kb upstream from transcription start site), "upstr" (i.e. from 2kb to 10kb upstream from transcription start site), "proxDOWN" (i.e. until 2kb downstream from transcription start site). Columns S-V: nearest Refseq or Ensembl transcripts calculating the distance from the transcription start site (TSS). Table S3   Table S3.xls List of significant probes/genes obtained after categorical division of the Beta values in two classes. Column A reports the exact genomic location of the probe and Column B reports the gene symbol. Table S4   Table S4.xls Significantly enriched GO terms and KEGG pathways for the differentially hyper-and hypo-methylated sequences on cDLBCL samples with respect to normal lymph nodes. Table S5   Table S5.xls Significantly enriched gene signatures highlighted by Gene Set Enrichment Analysis (GSEA) on the entire dataset of probes. Table S6   Table S6.xls Sequences showing differential methylation variability on one or more clinical factors across the cDLBCL samples. Columns G-N report the overlapping Refseq (G-J) and

Excel file Supplementary
Ensembl (K-N) transcripts, considering for each transcript the following genomic locations: exon, intron, 5'-UTR, 3'-UTR, "proxUP" (i.e. until 2kb upstream from transcription start site), "upstr" (i.e. from 2kb to 10kb upstream from transcription start site), "proxDOWN" (i.e. until 2kb downstream from transcription start site). Columns O-R: nearest Refseq or Ensembl transcripts calculating the distance from the transcription start site (TSS). Table S7   Table S7.xls Results from multivariate linear regression analysis, investigating different combination of clinical factors across the cDLBCL samples. Columns H-R: clinical factors considered for the analysis; "1" indicates the presence of that factor in the linear regression model. Columns S-Z report the overlapping Refseq (S-V) and Ensembl (W-Z) transcripts, considering for each transcript the following genomic locations: exon, intron, 5'-UTR, 3'-UTR, "proxUP" (i.e. until 2kb upstream from transcription start site), "upstr" (i.e. from 2kb to 10kb upstream from transcription start site), "proxDOWN" (i.e. until 2kb downstream from transcription start site).

Excel file Supplementary
Columns AA-AD: nearest Refseq or Ensembl transcripts calculating the distance from the transcription start site (TSS).