Molecular characterization of colorectal adenomas with and without malignancy reveals distinguishing genome, transcriptome and methylome alterations

The majority of colorectal cancer (CRC) arises from precursor lesions known as polyps. The molecular determinants that distinguish benign from malignant polyps remain unclear. To molecularly characterize polyps, we utilized Cancer Adjacent Polyp (CAP) and Cancer Free Polyp (CFP) patients. CAPs had tissues from the residual polyp of origin and contiguous cancer; CFPs had polyp tissues matched to CAPs based on polyp size, histology and dysplasia. To determine whether molecular features distinguish CAPs and CFPs, we conducted Whole Genome Sequencing, RNA-seq, and RRBS on over 90 tissues from 31 patients. CAPs had significantly more mutations, altered expression and hypermethylation compared to CFPs. APC was significantly mutated in both polyp groups, but mutations in TP53, FBXW7, PIK3CA, KIAA1804 and SMAD2 were exclusive to CAPs. We found significant expression changes between CAPs and CFPs in GREM1, IGF2, CTGF, and PLAU, and both expression and methylation alterations in FES and HES1. Integrative analyses revealed 124 genes with alterations in at least two platforms, and ERBB3 and E2F8 showed aberrations specific to CAPs across all platforms. These findings provide a resource of molecular distinctions between polyps with and without cancer, which have the potential to enhance the diagnosis, risk assessment and management of polyps.


Whole Genome Sequencing
For library construction, total DNA was quantified in triplicate using the Quant-iT™ PicoGreen® DNA Assay Kit and normalized to 2ng/uL minimum concentration. An aliquot of 100ng for each sample was transferred into library preparation utilizing the Broad Institute developed one-well protocol. All biochemistry occurs in a single well without the need for sample transfer (the sample is reversibly immobilized to and released from magnetic beads, allowing washes and reagent addition). The one-well protocol streamlines the process and greatly reduces sample input requirements. The product provides one library (typical median insert size of library is 330bp) 1 . Details on the library preparation workflow including general information on the adapters can be found at, provided by Illumina: https://www.illumina.com/content/dam/illuminamarketing/documents/products/datasheets/datasheet_truseq_sample_prep_kits.pdf.
Samples were sequenced on the Illumina HiSeq X instruments producing 150 base pair, paired-end reads to meet a goal of 30x mean coverage. Using the Picard Informatics Pipeline, all data from a particular sample was aggregated into a single BAM file which included all reads, all bases from all reads, and original/vendor-assigned quality scores. A pooled Variant Call Format (VCF) file using the latest version of Picard GATK software was generated and provided for each sample batch. All whole genome sequencing data analyzed in this manuscript are available in the dbGaP database with Study Accession number: phs001384.v1.p1. Accession numbers for each WGS BAM file are located in Table S15.

Genomic Alteration Detection
Before calling germline and somatic mutations, we followed GATK's best practice to preprocess the data. Reads were first quality-controlled and then mapped to the reference. Duplicates were marked by Picard, and then GATK was used for later analyses, including base recalibration and variant calling. CNVs were called by CNVnator 2 . In order to detect somatic single nucleotide variants (SNVs) between the polyp or tumor and matched normal tissue or PBL, 4 different somatic variant callers were used: MuTect2,SomaticSniper,Strelka,and VarScan 3,4,5,6 . Those callers were run with default options for normal and polyp or tumor samples from each patient. We included common SNVs detected by at least 2 different callers. Variant allele frequencies for those SNVs were calculated from sample BAM files for each patient using an in-house script. To annotate mutations, we used Variant Effect Predictor (http://www.ensembl.org/Tools/VEP) and ANNOVAR 7 .

RNA-Seq and processing
Total RNA was quantified using the Quant-iT™ RiboGreen® RNA Assay Kit and normalized to 5ng/ul. 200 ng of RNA was used to prepare libraries, using an automated version of the. mRNA was selected from the total RNA samples using oligo dT beads.
The cDNA that resulted was indexed using Broad Institute designed indexed adapters substituted in for multiplexing. After enrichment the libraries were quantified with qPCR using the KAPA Library Quantification Kit for Illumina Sequencing Platforms and then pooled equimolarly. Each sequencing run was101bp paired-end with barcoding. Pooled libraries were normalized and denatured prior to sequencing. Flow cell cluster amplification and sequencing were performed according to the manufacturer's protocols using either the HiSeq 2000 or HiSeq 2500. Data was analyzed using the Broad Picard Pipeline, which includes de-multiplexing and data aggregation. and analyze paired-end RNA-Seq data. Read alignment was performed with Tophat 8 , using Bowtie 9 . Reads were aligned to the transcriptome (Ensembl GTF) and genome (hg19), and expression was quantified using featureCounts 10 . RPKM values were calculated from the raw gene counts to assess the relative abundance of each gene.
Within each sample, RSeQC software was used to detect unsymmetrical gene body coverage, high levels of read duplication, and low saturation levels of known exon junctions 11 . Reads were additionally normalized using conditional quantile normalization, which adjusts for gene length, GC content and library size 12 . All RNAseq data analyzed in this manuscript are available in the dbGaP database with Study Accession number: phs001384.v1.p1. Accession numbers for each RNA-seq BAM file are located in Table S15.

Reduced Representation Bisulfite Sequencing (RRBS) and processing
RRBS was performed at the Mayo Clinic Genotyping Shared Resource facility. Briefly, DNA (250ng) was digested with Msp1 (New England Biolabs, Catalog Number: R0106M) and purified using Qiaquick Nucleotide Removal Kit (Qiagen, Catalog Number: 28004). End-repair A tailing was performed (New England Biolabs, The RRBS data was processed using a streamlined analysis and annotation pipeline for reduced representation bisulfite sequencing, SAAP-RRBS 13 . Briefly, FASTQ are trimmed to remove adaptor sequences, and any reads with less than 15bp are discarded. Trimmed Fastqs are then aligned against the reference genome using BSMAP 14 ; this tool converts the reference genome to align the bisulfite treated reads. Samtools is used to get mpileup and custom PERL scripts to determine CpG methylation and bisulfite conversion ratios 15 . Methylation is reported along with custom CpG annotation for the one with minimum of five read support. All RRBS data analyzed in this manuscript are available in the dbGaP database with Study Accession number: phs001384.v1.p1. Accession numbers for each RRBS BAM file are located in Table S15.

Determining Differentially Methylated Regions
Tiled units of CpGs were created based on distance between adjacent CpG site locations (within 100 base pairs of the last observed CpG) and the level of background methylation in the control group (not to exceed 5%; control group were the CFPs).
Regions of chromosomes satisfying these criteria with more than 5 CpGs were considered regions of interest. Each CpG must also be observed in at least 50% of the samples of each disease group to be considered. Statistical significance of these regions were determined by logistic regression using the ratio of methylated and total read counts within the region as a response and disease group as a covariate. To account for varying read depths across individual subjects, an over-dispersed logistic regression model was used, where dispersion parameter was estimated using the Pearson Chi-square statistic of the residuals from the fitted model.  Figure S1. The somatic mutation frequency of 10 genes found to be commonly mutated in CRC by the TCGA. We compared the mutation frequencies of these genes for the CAPs, Cancer tissues (of CAPs) and CFPs.

VILLOUS LOW V30L
NON-AGG CFP revealed fragments of tubular adenoma with low grade dysplasia. In the transverse colon, a pedunculated, 1.2 cm polyp was found. Biopsy also revealed fragments of tubular adenoma with low grade dysplasia. Only the fragments from the ascending polyp are used in this study. A27 X X CFP NORMAL EPITH

NON-AGG CFP
Three polyps were found. In the cecum, there was an 8 mm, sessile polyp. A similar-appearing, sessile polyp, measuring 7 mm in size, was found in the ascending colon. The biopsies for these two polyps both showed fragments of tubular adenoma with low grade dysplasia. Within the sigmoid colon, there was a 5 mm, sessile polyp that was shown to be a tubular adenoma with low grade dysplasia. Only the fragments from the cecal polyp are used in this study.