Identifying colorectal cancer caused by biallelic MUTYH pathogenic variants using tumor mutational signatures

Carriers of germline biallelic pathogenic variants in the MUTYH gene have a high risk of colorectal cancer. We test 5649 colorectal cancers to evaluate the discriminatory potential of a tumor mutational signature specific to MUTYH for identifying biallelic carriers and classifying variants of uncertain clinical significance (VUS). Using a tumor and matched germline targeted multi-gene panel approach, our classifier identifies all biallelic MUTYH carriers and all known non-carriers in an independent test set of 3019 colorectal cancers (accuracy = 100% (95% confidence interval 99.87–100%)). All monoallelic MUTYH carriers are classified with the non-MUTYH carriers. The classifier provides evidence for a pathogenic classification for two VUS and a benign classification for five VUS. Somatic hotspot mutations KRAS p.G12C and PIK3CA p.Q546K are associated with colorectal cancers from biallelic MUTYH carriers compared with non-carriers (p = 2 × 10⁻²³ and p = 6 × 10⁻¹¹, respectively). Here, we demonstrate the potential application of mutational signatures to tumor sequencing workflows to improve the identification of biallelic MUTYH carriers.

The MCCS is a prospective cohort study of 41,513

Nurses' Health Study (NHS) 13
The NHS cohort began in 1976 when over 121,000 female registered nurses aged 30 to 55 years returned the initial questionnaire, which ascertained a variety of important health-related exposures. Colorectal cancer and other outcomes were reported by participants or next of kin and followed up through physician review of medical and pathology records. Overall, more than 97% of self-reported colorectal cancers were confirmed by medical-record review. Participants have been sent questionnaires biennially to update information on lifestyle factors and newly diagnosed disease. Data on histology and primary location were abstracted. FFPE tissue blocks were collected from hospitals where participants with colorectal carcinoma had undergone tumor resection or endoscopic biopsy (for preoperatively treated rectal cancer).

Nurses' Health Study II 14
The Nurses' Health Study II is an ongoing cohort of over 116,000 female registered nurses in the US, aged 25 to 42 years at baseline in 1989. Demographic, lifestyle and health-related information was obtained from participants at baseline and updated every 2 years using self-administered questionnaires. Study participants who had not previously reported a diagnosis of cancer and had responded to the 1995 study questionnaire were invited to provide blood samples between 1996 and 1999. Blood samples were collected from over 29,000 participants, aged 32 to 54 years at the time of blood draw. Similarly, between 2004 and 2006, active study participants who had not previously provided a blood sample were invited to provide buccal samples. Swish-and-spit samples of buccal cells were received from nearly 30,000 participants. Participants with a prior history of any cancer (except non-melanoma skin cancer), ulcerative colitis, or familial polyposis syndromes were excluded. FFPE tissue blocks were collected from hospitals where participants with colorectal carcinoma had undergone tumor resection or endoscopic biopsy (for preoperatively treated rectal cancer).

PLCO 15
PLCO is a large, randomized, two-arm trial that enrolled over 154,000 men and women between the ages of 55 and 74 years at ten centers to determine the effectiveness of screening in reducing cancer mortality. Half of the participants were randomized into the screening arm and half into the control arm. Participants in the screening arm received annual screens for the four cancers for the first 6 years; participants in the control arm received usual care. Enrollment began in 1993 and concluded in 2001. Both arms were followed for cancer incidence and mortality for at least 13 years from baseline. Details of this study have been previously described and are available online (http://dcp.cancer.gov/plco). In 2006, FFPE pathology tissue samples were collected from PLCO participants who developed selected cancers, including colorectal cancer.

Women's Health Initiative (WHI) 16
The WHI study is a large, multi-center study of postmenopausal women aged 50 to 79 years at recruitment from 40 US clinical centers between 1993 and 1998, including over 68,000 women who participated in four overlapping trials evaluating: menopausal hormone therapy (HT: two trials), dietary modification (DM) and calcium-vitamin D (CaD) supplementation.
Participants in the CaD trial were recruited from those who were either in the HT or the DM trial. Details of the WHI study design have been described elsewhere and are available online (https://www.whi.org/). FFPE pathology tissue samples were collected from WHI participants who developed selected cancers, including colorectal cancer. Patients with sufficient material and consent were included in this study.

Whole Exome Sequencing
The training dataset of whole-exome sequenced (WES) samples was processed as described previously 17 . Formalin-fixed paraffin-embedded (FFPE) tissues from CRCs were macrodissected and DNA was extracted using the QIAamp DNA FFPE Tissue kit (Qiagen, Hilden, Germany) following standard protocols. Peripheral blood-derived DNA was extracted using the DNeasy Blood and Tissue kit (Qiagen) and sequenced as the germline reference. Whole-exome capture was performed using the Agilent Clinical Research Exome V2 (Agilent, Santa Clara, CA), with sequencing on an Illumina NovaSeq 6000 (San Diego, CA) generating 150 bp paired-end reads at the Australian Genome Research Facility.
Mean on-target coverage across MUTYH was 581.2 ± 156.9 (mean ± SD) for the tumor DNA samples and 372.0 ± 148.3 for blood-derived DNA samples.

Targeted Sequencing
The panel-sequenced tumors were processed as described previously 18 . Paired-end reads were aligned to the human reference genome (GRCh37/hg19) using the Burrows-Wheeler Aligner (BWA-MEM version 0.7.9a). Local realignment and base quality recalibration were performed on the aligned data. Only reads aligned uniquely to the GRCh37/hg19 assembly were used in downstream analysis.

Bioinformatics Pipelines and Analysis
WES reads were aligned to the GRCh37 human reference genome using BWA 0.7.12, after FASTQ files had been trimmed with Trimmomatic 0.38 19 to remove adapter sequences. Somatic single-nucleotide variants (SNVs) and short insertions and deletions were called with Strelka 2.9.2 20 and Mutect2 21 . PASS variants reported by both callers were retained, then further filtered to those with a variant allele fraction ≥ 0.1 and a tumor depth ≥ 25 bases.
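The consensus filter described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `Call` record and its field names are hypothetical, and because the text does not say which caller's depth and allele counts were used, the sketch assumes they are taken from the Strelka record.

```python
from dataclasses import dataclass

MIN_VAF = 0.1          # variant allele fraction cut-off from the text
MIN_TUMOR_DEPTH = 25   # tumor depth cut-off from the text


@dataclass(frozen=True)
class Call:
    """Hypothetical, simplified somatic variant record."""
    chrom: str
    pos: int
    ref: str
    alt: str
    filter_status: str
    tumor_depth: int
    alt_reads: int

    @property
    def key(self):
        # Identity used to match the same variant across callers.
        return (self.chrom, self.pos, self.ref, self.alt)

    @property
    def vaf(self):
        return self.alt_reads / self.tumor_depth if self.tumor_depth else 0.0


def consensus_calls(strelka_calls, mutect_calls):
    """Keep PASS variants reported by both callers that meet both cut-offs."""
    mutect_pass = {c.key for c in mutect_calls if c.filter_status == "PASS"}
    return [
        c for c in strelka_calls
        if c.filter_status == "PASS"
        and c.key in mutect_pass
        and c.vaf >= MIN_VAF
        and c.tumor_depth >= MIN_TUMOR_DEPTH
    ]
```

Matching on chromosome, position, reference and alternate allele is one reasonable definition of "reported by both callers"; indel representations may need normalizing before such a comparison.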
Somatic variants were generated from the panel-sequenced tumors as described previously 18 .
Loss of heterozygosity (LOH) across all cohorts was assessed using the LOH calculation tool LOHdeTerminator v0.5 29 , which determines likely regions of LOH by identifying heterozygous germline variants with somatic equivalents skewed towards homozygosity.
With the lower mutation count across panel-sequenced data, we additionally required at least one somatic variant suggestive of LOH to occur within 100,000 bases of MUTYH.
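As a rough illustration of the proximity requirement for panel data, the check below tests whether any LOH-supporting variant lies within 100,000 bases of a gene interval. The function name and inputs are hypothetical, positions are assumed to be on the same chromosome as the gene, and the gene coordinates are supplied by the caller rather than taken from the text.

```python
LOH_WINDOW = 100_000  # maximum distance from the gene, in bases (from the text)


def has_nearby_loh_variant(loh_positions, gene_start, gene_end, window=LOH_WINDOW):
    """Return True if any LOH-supporting position is within `window` bases of the gene.

    `loh_positions` are positions of variants suggestive of LOH on the gene's
    chromosome; `gene_start` and `gene_end` are inclusive gene coordinates.
    """
    lower = gene_start - window
    upper = gene_end + window
    return any(lower <= pos <= upper for pos in loh_positions)
```

A variant inside the gene counts as distance zero under this definition, so it always satisfies the requirement.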
We assessed the prevalence of copy number loss across MUTYH in CRCs with publicly available data from Pan-Cancer Analysis of Whole Genomes (PCAWG) 30 and The Cancer Genome Atlas (TCGA) 31 . For PCAWG, copy number loss was considered to be any segment spanning MUTYH with predicted copy number less than two. For TCGA, loss was considered to be any segment spanning MUTYH with mean log2(copy-number/2) less than -0.3 32 .
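The two copy number loss definitions can be expressed as a short sketch. The segment tuple layout and function names are assumptions made for illustration, and "spanning" is interpreted here as a segment that fully covers the gene interval, which the text does not state explicitly.

```python
TCGA_LOG2_CUTOFF = -0.3  # threshold on mean log2(copy-number/2), from the text


def spans_gene(seg_start, seg_end, gene_start, gene_end):
    """True if the segment fully covers the gene interval (assumed meaning of 'spanning')."""
    return seg_start <= gene_start and seg_end >= gene_end


def pcawg_loss(segments, gene_start, gene_end):
    """PCAWG rule: any spanning segment with predicted copy number below two.

    `segments` is an iterable of (start, end, copy_number) on the gene's chromosome.
    """
    return any(
        spans_gene(s, e, gene_start, gene_end) and cn < 2
        for s, e, cn in segments
    )


def tcga_loss(segments, gene_start, gene_end, cutoff=TCGA_LOG2_CUTOFF):
    """TCGA rule: any spanning segment with mean log2(copy-number/2) below the cutoff.

    `segments` is an iterable of (start, end, segment_mean) on the gene's chromosome.
    """
    return any(
        spans_gene(s, e, gene_start, gene_end) and mean < cutoff
        for s, e, mean in segments
    )
```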
Microsatellite instability (MSI) status was determined using MSIseq v1.0 33 ; tumors with an MSIseq value >1.9 were considered MSI-high.
Mutational signatures were calculated using a simulated annealing method described previously.
The reconstruction error was calculated as the cosine distance between the observed mutational context counts and the mutational context counts predicted from the calculated mutational signatures. This value is bounded by 0 and 1, with 0 indicating maximal similarity 42 .
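A minimal sketch of the reconstruction error follows, assuming the predicted counts are the exposure-weighted sum of the signature profiles. The function name and argument layout are illustrative and are not taken from the cited method.

```python
import math


def reconstruction_error(observed, signatures, exposures):
    """Cosine distance between observed and predicted mutational context counts.

    observed:   counts per mutational context
    signatures: list of signature profiles, each the same length as `observed`
    exposures:  one non-negative weight per signature
    """
    n = len(observed)
    # Predicted counts: exposure-weighted sum of the signature profiles.
    predicted = [
        sum(e * sig[i] for sig, e in zip(signatures, exposures))
        for i in range(n)
    ]
    dot = sum(o * p for o, p in zip(observed, predicted))
    norm_obs = math.sqrt(sum(o * o for o in observed))
    norm_pred = math.sqrt(sum(p * p for p in predicted))
    if norm_obs == 0 or norm_pred == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_obs * norm_pred)
```

Because counts, signature profiles and exposures are all non-negative, the cosine similarity lies between 0 and 1, so the distance does too.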

Variant Classification
All variants detected within MUTYH, both somatic and germline, were classified into three categories:
Pathogenic: a variant annotated as pathogenic or likely pathogenic in ClinVar 28 ;
Potentially pathogenic or uncertain: a variant that has not been annotated as pathogenic in ClinVar, but has a gnomAD frequency of less than 1%, and exhibits any of the following:
• ClinVar classification as either variant of uncertain clinical significance or with conflicting interpretations of pathogenicity (VUS);
• Computational prediction of pathogenicity: REVEL>0.6 25 ; or
• Computational prediction of pathogenicity: CADD>20 24 .
Not pathogenic: all other variants.
Each tumor was classified based on these variant categories.
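The three variant categories can be written as a small decision function. This is a sketch under stated assumptions: the `Variant` record, the exact ClinVar label strings, and the treatment of variants absent from gnomAD as rare are all illustrative choices not specified in the text.

```python
from dataclasses import dataclass
from typing import Optional

PATHOGENIC = "pathogenic"
UNCERTAIN = "potentially pathogenic or uncertain"
NOT_PATHOGENIC = "not pathogenic"

# Assumed ClinVar label strings (lower-cased); real annotations may differ.
PATHOGENIC_LABELS = {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}
VUS_LABELS = {"uncertain significance", "conflicting interpretations of pathogenicity"}


@dataclass
class Variant:
    """Hypothetical annotated variant; None means the annotation is missing."""
    clinvar: Optional[str] = None
    gnomad_af: Optional[float] = None
    revel: Optional[float] = None
    cadd: Optional[float] = None


def classify_variant(v: Variant) -> str:
    clinvar = (v.clinvar or "").strip().lower()
    if clinvar in PATHOGENIC_LABELS:
        return PATHOGENIC
    # Assumption: a variant absent from gnomAD is treated as rare.
    rare = v.gnomad_af is None or v.gnomad_af < 0.01
    flagged = (
        clinvar in VUS_LABELS
        or (v.revel is not None and v.revel > 0.6)
        or (v.cadd is not None and v.cadd > 20)
    )
    if rare and flagged:
        return UNCERTAIN
    return NOT_PATHOGENIC
```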

Determining a confidence threshold for TMSs
The training dataset consisted of 102 whole-exome sequenced CRCs, of which eight were known MUTYH positives and 92 were known MUTYH negatives. To establish a panel-specific threshold, we simulated panel-based results by restricting variant calls to the capture region of the panel, then calculating mutational signatures on this reduced set of variant calls.
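Restricting exome variant calls to a panel's capture region might look like the sketch below. The tuple layouts and function names are assumptions; capture regions are taken to be 1-based, inclusive and non-overlapping within each chromosome.

```python
from bisect import bisect_right


def build_index(regions):
    """Group (chrom, start, end) capture regions by chromosome, sorted by start."""
    index = {}
    for chrom, start, end in sorted(regions):
        index.setdefault(chrom, []).append((start, end))
    return index


def in_panel(index, chrom, pos):
    """True if `pos` falls inside a capture region on `chrom`."""
    intervals = index.get(chrom, [])
    # Locate the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


def restrict_to_panel(variants, regions):
    """Keep variants, given as (chrom, pos, ref, alt), that lie in the capture region."""
    index = build_index(regions)
    return [v for v in variants if in_panel(index, v[0], v[1])]
```

Mutational signatures would then be recomputed on the returned subset, so the threshold reflects the smaller mutation counts expected from panel data.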