Genomic landscape of diploid and aneuploid microsatellite stable early onset colorectal cancer

Although colorectal cancer (CRC) remains the second leading cause of cancer-related death in the United States, the overall incidence and mortality from the disease have declined in recent decades. In contrast, there has been a steady increase in the incidence of CRC in individuals under 50 years of age. Hereditary syndromes contribute disproportionately to early onset CRC (EOCRC). These include microsatellite instability high (MSI+) tumors arising in patients with Lynch Syndrome. However, most EOCRCs are not associated with familial syndromes or MSI+ genotypes. Comprehensive genomic profiling has provided the basis of improved more personalized treatments for older CRC patients. However, less is known about the basis of sporadic EOCRC. To define the genomic landscape of EOCRC we used DNA content flow sorting to isolate diploid and aneuploid tumor fractions from 21 non-hereditary cases. We then generated whole exome mutational profiles for each case and whole genome copy number, telomere length, and EGFR immunohistochemistry (IHC) analyses on subsets of samples. These results discriminate the molecular features of diploid and aneuploid EOCRC and provide a basis for larger population-based studies and the development of effective strategies to monitor and treat this emerging disease.


Flow cytometry
Excess paraffin was removed from each FFPE sample with a scalpel from either side of scrolls then processed according to our published methods 21 .We used one to three 50 µm scroll(s) from each FFPE tissue block to obtain sufficient numbers of intact nuclei for sorting and molecular assays.Frozen tissue samples were minced

Quality control (Q.C.) measures of single nuclei and of genomic DNA
Nuclei from pre and post sorted samples were inspected with a Countess 3 FL Automated Cell Counter to confirm the quality and yields of each tissue and sorted fraction.DNAs from sorted samples were extracted using Qiagen micro kits (Qiagen Valencia, CA) then assayed with an Agilent Tape Station, to measure the yield and molecular weight of extracted DNA.

NGS exome
Sorted tumor populations and patient matched control samples were sequenced within the Mayo Clinic Medical Genome Facility (MGF) using established protocols for whole exome analysis 22 .Pair-ended Illumina FASTQ reads were processed with GENOMEGPS-the internal Mayo Clinic secondary data processing pipeline.Briefly, reads QC and adapter trimming were performed by CUTADAPT (https:// cutad apt.readt hedocs.io/ en/ stable/), with alignment to reference HG38 by BWA-MEM, followed by reads de-duplication and base quality recalibration by GATK 3.6.For somatic mutation calling from tumor-normal matching samples, MuTect2 from GATK 4.3 were used.The Genome Aggregation Database (gnomAD) was used to build a mutation calling statistics model.For mutation filtering, a panel of normal sequences (somatic-hg38_1000g_pon.hg38.vcf.gz) was used to filter out commonly seen sequencing noise.Orientation biases (i.e., FFPE artefacts) were annotated by a mixture model (LearnReadOrientationModel) from GATK.We used GATK recommended tool (FilterMutectCalls) to filter raw somatic mutations and keep mutations annotated as "PASS" only.We also applied additional filters to reduce false positives including variant allele frequency (VAF) ≥ 10% in tumors, read counts of ≥ 6 for each variant.VCF files were then converted to MAF format with vcftools (https:// vcfto ols.sourc eforge.net/) and subjected to R package maftools for tertiary analysis.For mutation signature analysis, we used the default parameters in maftools.The optimal number of signatures was determined using Cophenetic correlation.Extracted signatures were compared to known signatures from COSMIC database, and cosine similarity was calculated to identify best matches in our data.

Immunohistochemical staining EGFR
Tissue sectioning at 5 μm and IHC staining for the 20 FFPE cases was performed on-line at the pathology research core (Mayo Clinic, Rochester, MN) using the Leica Bond RX stainer (Leica Biosystems).Slides were retrieved for 20 min using BOND Epitope Retrieval Solution 1 (Leica Biosystems).The EGFR primary antibody (Rabbit Monoclonal, Cell Signaling #4267, clone Erb B/Her) was diluted to 1:50 in background reducing Diluent (Dako Products, Agilent) and incubated for 30 min.Slides were incubated for 10 min in DAB and DAB buffer (1:19 mixture) from the bond polymer refine detection system (Leica Biosystems), then rinsed between steps with 1X Bond Wash Buffer.Slides were counterstained for five minutes using Schmidt hematoxylin and molecular biology grade water (1:1 mixture), followed by several rinses in 1X Bond wash buffer and distilled water, rinsed in tap water for three minutes, then dehydrated in increasing concentrations of ethyl alcohol and cleared in 3 changes of xylene prior to permanent coverslipping in xylene-based medium.EGFR scoring was performed by a GI pathologist (RPG) with the following criteria: 0 = no staining, 1+ faint cytoplasmic staining in > 10% tumor cells, 2 + moderate membranous staining, 3 + strong membranous staining.Negative samples included 0 and 1+ cases, while positive samples were 2+ and 3+.

aCGH
All aCGH was done according to our published protocols [21][22][23][24] .Briefly, DNAs from frozen tissue and FFPE samples were treated with DNAse 1 prior to Klenow-based labeling.High molecular weight templates were digested for 30 min while DNAs from FFPE samples were digested for 1 min.In each case 1 µl of 10 × DNase 1 reaction buffer and 2 μl of DNase 1 dilution buffer were added to 7 μl of DNA sample and incubated at room temperature then transferred to 70 °C for 30 min to deactivate DNase 1. Sample and reference templates were then labeled with Cy-5 dUTP and Cy-3 dUTP respectively using a BioPrime labeling kit (Invitrogen, Carlsbad, CA) according to our published protocols 23 .All labeling reactions were assessed using a Nanodrop assay (Nanodrop, Wilmington, DE) prior to mixing and hybridization to 400 k CGH arrays (Agilent Technologies, Santa Clara, CA) for 40 h in a rotating 65 °C oven.After washing microarrays were scanned using an Agilent 2565C DNA scanner and the images were analyzed with Agilent Feature Extraction version 11.0 using default settings.The aCGH data was assessed with a series of QC metrics then analyzed using an aberration detection algorithm (ADM2) 25 .

Telomere measure
Telomere length measure of sorted nuclei was performed with a monochrome multiplex quantitative polymerase chain reaction (MMqPCR) assay that has been described previously 26 .The MMqPCR assay uses a telomere repeat primer and single copy gene primer for calculation of the relative ratio of telomere quantity based on cycle number to single copy gene (human beta globin gene, HBB) quantity (T/S ratio).Each sample was run in triplicate, and the final T/S ratio was based on the mean of the three measurements.

Flow sorting and ploidy
We detected aneuploid populations in 11/21 (52%) EOCRC samples.The ploidy of these ranged from near diploid (2.3N) to hyper triploid (3.5N).The remaining 10 tumors were diploid by DNA content flow cytometry.In all cases we collected a minimum of two populations based on their ploidy.These included 2N(G 0 /G 1 ) fractions from each tumor, as well as 4N(G 2 /M) and aneuploid fractions when present (Fig. 1).In one case, EOCRC12, we identified and collected a diploid, a 4N(G 2 /M), and two distinct aneuploid populations (2.5N and 3.2N) from the same FFPE sample (Fig. 1B).

Mutation profiles
The exomes of flow sorted populations from the 21 tumor samples were sequenced with our established workflow.These included the two distinct aneuploid populations in EOCRC12.Peripheral blood samples were available from 20 of 21 cases while the sorted diploid fraction was used as patient matched control in EOCRC21.The mean coverage for targeted regions across all 43 normal tumor paired samples was 40 with a range of 12.4-140.1 and a median of 40.5.Pathogenic somatic variants in known driver genes of CRC were detected in diploid and aneuploid tumors (Fig. 2A).These included APC (62%), TP53 (33%), and KRAS (24%) (Supplemental Fig. 1A-C).
The APC variants included 3 cases with R216X and two cases with R283X, both well characterized nonsense variants that are found in familial and sporadic CRCs.The somatic TP53 variants included missense (n = 5) and  www.nature.com/scientificreports/nonsense (n = 1) variants targeting the DNA binding domain, as well as a splice site variant, that were present exclusively in aneuploid cases in our cohort.The KRAS variants targeted codon 12 in each case including a G12D variant present in both aneuploid populations (AN1, AN2) from EOCRC12.Strikingly, the latter sorted tumor populations from the same biopsy, had distinct TMBs with a > threefold increase in the number of mutations in the 2.5N (AN1) population relative to the 3.2N (AN2) population (Fig. 2B).
In addition, we detected somatic variants in Leucine zipper-like transcription regulator 1 (LZTR1) in three cases (Fig. 2A, Supplemental Fig. 1D).The latter is a Kelch-BTB-BACK domain-containing protein that functions as substrate adaptor of a CRL complex, CRL3LZTR1 implicated in rare neurodevelopmental disorders 27 .Mutations and deletions targeting LZTR1 have been reported in multiple cancers including colorectal carcinoma 10,12 .The three variants, each a variant of unknown significance (VUS) include G169R that occurs in a highly conserved residue within a KELCH 3 domain, S382L, and L809P adjacent to R810W that compromises LZTR1 protein degradation 28 .Previous studies have shown that pathogenic variants in LZTR1 fail to promote degradation of EGFR 29 .Thus, the somatic LZTR1 variants suggested that similar to adult cases, EGFR signalling is activated in EOCRC 30 .
We detected 3 of the 4 single base substitution (SBS) mutation signatures associated with but not exclusive to CRC in our EOCRC samples (Fig. 4A, Supplemental Fig. 1E) 32 .SBS1 and SBS5 are clock-like signatures that correlate with age whereas SBS6 is one of seven mutational signatures associated with defective DNA mismatch repair and MSI.However, none of the 21 cases were MSI+ based on our exome analyses.Total mutational burden (TMB) distinguished relatively high and low TMB EOCRCs in our cohort that were independent of ploidy and TP53 mutation (Figs. 2, 4B,C).Notably, SBS5 was present only in higher mutation signature EOCRCs.In contrast, SBS10 with a proposed etiology of polymerase epsilon (POLE) exonuclease domain mutations and typically associated with large numbers of somatic mutations (> 100 mutations per Mb) in samples termed hypermutators, was absent in our cohort despite two patients, EOCRC5 and EOCRC7, having somatic variants in POLE (Fig. 2).The first of these, R793H, is associated with a germ line polymorphism of uncertain significance (rs1422986795), while the second L1245I, has not been previously reported (Supplemental Fig. 1F).Notably, both of these cases, one diploid tumor and the other an aneuploid tumor, had the highest somatic TMBs including the highest numbers of POLE associated variants in their tumor genomes (Fig. 2B) 33 .However, these were below thresholds for SBS10.

Copy number profiles
The copy number variant (CNV) profiles in the six cases with sufficient tumor available after sorting and sequencing included a diploid tumor and five aneuploid cases (Fig. 5).For the diploid tumor we used the 4N(G 2 /M) fraction from the sort for CNV analysis, while the aneuploid tumors had distinct DNA contents.Notably, two cases, diploid (EOCRC3) and aneuploid (EOCRC12), had a focal deletion of PTEN.In addition, aneuploid case EOCRC17, a 37 year old female, had focal HDs that included known and novel cancer related genes (Supplemental Fig. 2).These included a 5q34 HD targeting TENM2/ODZ2 (Teneurin Transmembrane Protein 2) that enables cell adhesion molecule binding activity and signaling receptor binding activity, and a 18q11 HD that simultaneously targets five genes with distinct functions; PSMA8 (Proteasome 20S Subunit Alpha 8, TAF4B (TATA-Box Binding Protein Associated Factor 4b), SS18 (SS18 Subunit Of BAF Chromatin Remodeling Complex), ZNF521 Transcription factor, and KCTD1 (potassium channel tetramerization domain containing 1).To our knowledge these two HDs have not been previously reported in primary tumor samples 34 .www.nature.com/scientificreports/

Telomere lengths
In each available case we compared the telomere lengths of diploid (P2) with either 4N(G 2 /M) or aneuploid(s) tumor populations (P3, P4, or P5).We detected a significant shortening of telomere lengths in the tumor fractions of the 9 samples assayed (two tailed paired sample t-test P = 0.00132) (Supplemental Fig. 3).Our application of single tube MMqPCR with patient matched flow sorted fractions controls for sample and assay variabilities as previously described 26 .Notably, the coefficient of variation (CV) within triplicate samples for each flow sorted population was less than 9.34%.

Discussion
The prevalence of diploid (48%) and aneuploid (52%) tumors in our cohort is consistent with prior FISH studies supporting the enriched presence of diploid tumors in MSS EOCRC 17 .Our findings are also consistent with TCGA studies of later onset CRC that identified the WNT, MAPK, PI3K, TGF-band and p53 pathways as targets of somatic genomic lesions 10 .These included recurrent pathogenic mutations in APC, TP53, and KRAS, as well as deletions targeting PTEN that are present in our EOCRC samples.Targeted panel-based sequencing of large cohorts demonstrated similar mutational rates and TMBs in EOCRC and late onset cases with the most frequent alterations targeting APC, TP53, and KRAS 35 .In addition, multi-omic studies that include exome and genome analyses of different cohorts showed similar mutational profiles between early and late onset CRC but with enrichment of PTEN mutations in EOCRC 36 .The presence of focal PTEN deletions in two of six samples profiled for CNVs is notable given the association of PTEN mutations with EOCRC 36 .However, larger studies incorporating our flow cytometry methods are needed to determine whether PTEN lesions, both deletions and mutations, are enriched in EOCRC relative to average onset CRC.CRC is distinguished by 4 distinct single base pair (SBS) mutational signatures 37 .We confirmed 3 of these signatures, SBS1, SBS5, and SBS6, but an absence of SBS10 in our EOCRC cohort.SBS10, defined by the presence of huge numbers of mutations in subsets of colorectal and uterine cancer, has been associated with altered activity of the error-prone polymerase Pol ε consequent on mutations in the gene 32 .The two POLE variant cases, EOCRC5 and EOCRC7 both somatic in nature, had the highest TMBs including the highest number of POLE associated mutations in our cohort (Supplemental Table 1).However, the total number of these variants did not meet criteria for scoring SBS10.Furthermore, the allele fraction for each variant (R793H, L1245I) was only 0.32 in both cases, one diploid (EOCRC5) and one aneuploid (EOCRC7) tumor, suggesting a sub clonal population in each case.Future single cell level studies with flow sorted clinical samples and preclinical models will provide additional insights related to the role of POLE variants in EOCRC.
Our TMB analyses discriminated 9 cases (43%) with relatively high mutation burden and 12 cases (57%) with low mutation burden (Fig. 2B).These were further distinguished by the presence of SBS5 in the high TMB cases and its absence in the low TMB cases (Fig. 4B,C).Notably, neither TP53 mutation nor aneuploidy was associated with TMB.The presence of multiple aneuploid populations has been associated with accelerated progression in both premalignant and invasive carcinomas 38,39 .Although limited to single biopsies per case, we detected two co-occurring ploidies in EOCRC12 (Fig. 1B).Strikingly, although sharing the same mutation signatures, and KRAS G12D and APC D1486Yfs*27 driver mutations, the 2.5N population had a > threefold increase in TMB relative to the 3.2N population (Fig. 2B).Future studies will include additional biopsies from individual cases to explore of multiple aneuploid populations and accelerated mutational processes in the evolution of EOCRC.
Our CNV analysis, although limited, highlights additional distinguishing features of EOCRC.These include PTEN deletions in two cases (Fig. 5).In addition, EOCRC17 a stage III sigmoid tumor from a 37 year old woman, had two unique HDs at 5q34 and 18q11 in the aneuploid tumor genome.The 5q34 deletion targets TENM2 a member of the Teneurin gene family of transmembrane proteins that mediate cell-cell and cell-extracellular matrix interactions associated with important functions in development and nervous system function 40,41 .Genebased expression-profile analyses from the Human Protein Atlas data sets suggest that TENM2 expression has potential relevance as a prognostic marker in a range of tumors.Notably, low levels of TENM2 expression are correlated with lower patients' overall survival for colorectal, pancreatic, prostate and ovarian cancers.However, to our knowledge genomic lesions in TENM2 have not been reported in EOCRC.
The five genes targeted by the 18q11 HD include the transcription factor ZNF521 that has been previously reported in CRC based on single nucleotide polymorphism array profiling of patient derived xenografts that were used to overcome stromal tissue in patient samples 42 .In contrast, our flow-sorting samples provides discrimination of HDs directly in patient samples regardless of tumor/normal cell content in each biopsy of interest.This increased resolution mapped the four additional genes, PSMA8, TAF4B, SS18, and KCTD1 within the same 18q11 HD (Supplemental Fig. 2).ZNF521 regulates expression of RNA polymerase II, is involved in BMP signaling, and in the regulation of the immature compartment of the hematopoietic system and can both act as an activator or a repressor depending on the context 43 .It associates with SMADs in response to BMP2 leading to activate transcription of BMP target genes.PSMA8 is a testis specific proteosome that promotes acetylation dependent degradation of histones and the degradation of meiotic proteins RAD51 and RPA1 44 .High protein levels are associated with good prognosis in breast cancer 45 .TAF4B is a component of a highly conserved regulatory network that promotes oocyte development 46 .This includes the proper development and morphogenesis of the embryonic intestinal endoderm 47 .Loss of TAF4 in a mouse model led to increased PRC2 activity in cells of adult crypts associated with modification of the immune/inflammatory microenvironment that potentiated APC-driven tumorigenesis.Notably, EOCRC17 has an APC somatic R283X pathogenic nonsense variant.Genomic lesions targeting SS18, synovial sarcoma translocation chromosome 18, are associated with a variety of soft tissue tumors as well as a subset of CRC 48 .The most frequent events targeting SS18 are fusions of amplifications.However, deletion of SS18 has also been reported in a small subset (0.14%) of cancers.The fifth gene in the HD interval, KCTD1, negatively regulates the AP-2 family of transcription factors and the WNT signaling www.nature.com/scientificreports/pathway 49,50 .Thus, this single HD deletes a series of genes and targets pathways with potential roles in EOCRC that will be explored in future studies.
Our current study confirms that EOCRCs present with different combinations of genomic lesions including aneuploidy, CNVs and variable TMB in the presence of driver mutations associated with average onset CRC related genes and pathways.These include a tumor (EOCRC12) with two distinct aneuploid populations that shared clonal driver mutations but distinct TMBs, a diploid tumor (EOCRC5) with relatively high TMB, and an aneuploid tumor (EOCRC7) with a relatively high TMB but a lack of CNVs (Supplemental Fig. 4).Furthermore, the presence of clock-like and age related mutation signatures SBS1 and SBS5 in the absence of MSI+, and the relative shortening of telomeres in tumor populations within a biopsy, add support to the model whereby EOCRCs result from aberrant accelerated biological aging 51 .However, despite the heterogenous nature of EOCRC, our IHC results and the presence of LZTR1 variants and PTEN focal deletions, suggest that therapies targeting EGFR and AKT/mTOR pathway signalling may be of clinical benefit for distinct subsets of patients with this disease.
Our flow sorting approach provides a framework for detailed analyses of EOCRC.This includes discriminating and comparing diploid and aneuploid tumors with variable mutational and CNV burdens from archived tissue banks and/or prospectively collected biopsies.A challenge in the use of archived tumor samples is the presence of sequence artefacts and variable DNA quality in FFPE tissue.However, our established flow sorting protocols to isolate intact nuclei, GATK package tools for filtering FFPE artefacts from NGS data, and matching diploid and aneuploid fractions from single biopsies, provide a rigorous control and pipeline for genomic and telomere length analyses in our study 21,22,52,53 .The presence of unique HDs in the small number of samples profiled for CNVs suggests that more comprehensive studies incorporating whole genome sequencing of flow sorted samples will reveal additional examples of clinically relevant pathways targeted by genomic lesions in EOCRC.Previous studies have suggested that DNA methylation signatures contribute to EOCRC 54 .Of significant interest in our ongoing studies will be to integrate epigenetic profiling in our EOCRC analyses with flow sorted samples similar to our studies of pancreatic cancer 22 .Notably, the role of environmental exposures including microbiomes as well as predisposition genes and pathways will be explored in both premalignant and malignant models.These will provide novel mechanistic studies of the genes and pathways targeted in EOCRC, and promote the development of effective interventions to prevent and treat EOCRC.

Figure 2 .
Figure 2. Mutational landscape of EOCRC.(A) Oncomap summary of somatic lesions in ASCP genomes.(B) Distribution of mutation frequency across EOCRC samples.Blue arrow EOCRC12 AN1, red arrow EOCRC12AN2.Red dashed horizontal line mean number of mutations.Black vertical line.