Improved structural variant interpretation for hereditary cancer susceptibility using long-read sequencing

Purpose: Structural variants (SVs) may be an underestimated cause of hereditary cancer syndromes given the current limitations of short-read next-generation sequencing. Here we investigated the utility of long-read sequencing in resolving germline SVs in cancer susceptibility genes detected through short-read genome sequencing.

Methods: Known or suspected deleterious germline SVs were identified using Illumina genome sequencing across a cohort of 669 advanced cancer patients with paired tumor genome and transcriptome sequencing. Candidate SVs were subsequently assessed by Oxford Nanopore long-read sequencing.

Results: Nanopore sequencing confirmed eight simple pathogenic or likely pathogenic SVs, resolving three additional variants whose impact could not be fully elucidated through short-read sequencing. A recurrent sequencing artifact on chromosome 16p13 and one complex rearrangement on chromosome 5q35 were subsequently classified as likely benign, obviating the need for further clinical assessment. Variant configuration was further resolved in one case with a complex pathogenic rearrangement affecting TSC2.

Conclusion: Our findings demonstrate that long-read sequencing can improve the validation, resolution, and classification of germline SVs. This has important implications for return of results, cascade carrier testing, cancer screening, and prophylactic interventions.


Illumina sequencing
Germline genome sequencing, tumour genome sequencing and tumour transcriptome sequencing (RNA-seq) were performed for 669 adult patients with primarily metastatic cancers participating in BC Cancer's Personalized OncoGenomics (POG) program in Vancouver, British Columbia, Canada (NCT02155621). Tissue collection, nucleic acid extraction and short-read sequencing library preparation have been previously described 1. Briefly, DNA was extracted from peripheral blood and from tumour biopsy sections embedded in optimal cutting temperature compound. PCR-free genome libraries were prepared for paired-end genome sequencing, which was performed on the Illumina HiSeq2000, HiSeq2500 or HiSeqX to an average coverage of 40X for peripheral blood and 80X for tumour samples. mRNA was purified from tumour biopsy specimens and converted to cDNA, and paired-end sequencing of strand-specific libraries was performed on Illumina HiSeq instruments to a mean depth of approximately 200 million reads. All Illumina and Nanopore sequencing data for the POG cohort have been deposited in the European Genome-phenome Archive (EGA) under accession EGAS00001001159. Accession numbers for the individuals described in this study are provided in Table S6.

Germline structural variant calling
Illumina genome sequencing reads were aligned to the human reference genome version hg19 using Burrows-Wheeler Aligner (BWA)-MEM v0.7.6, and duplicate reads were removed using Picard tools v1.92 (http://broadinstitute.github.io/picard/) 2. To improve the sensitivity of structural variant (SV) detection, two computational pipelines were implemented to identify potential pathogenic and likely pathogenic germline SVs. In the first pipeline, large copy number variants were called using the read depth-based tool Control-FREEC, and region-based filtering was used to identify variants overlapping 98 cancer predisposition genes (Table S1) 3. Known and recurrent technical artifacts were subsequently filtered prior to manual review. In the second pipeline, SV calling was performed using DELLY v0.7.3, Manta v1.0.0 and Trans-ABySS v1.4.10, and putative variants identified by each tool were compared, merged and annotated with gene and functional information using MAVIS 4-7. Gene-based filtering and filtering based on predicted impact on protein-coding regions were performed to identify non-synonymous variants in candidate cancer predisposition genes.
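As an illustration of the region-based filtering step, the following minimal Python sketch retains only those calls that overlap at least one gene interval. It is not the pipeline used in this study: the `Interval` class, the function names, the half-open coordinate convention and all coordinates and gene names are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A named genomic interval (0-based start, exclusive end; assumed convention)."""
    chrom: str
    start: int
    end: int
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def filter_calls_by_genes(calls, genes):
    """Return (call name, [overlapping gene names]) for calls hitting any gene."""
    hits = []
    for call in calls:
        matched = [g.name for g in genes if overlaps(call, g)]
        if matched:
            hits.append((call.name, matched))
    return hits


# Hypothetical gene panel and variant calls (not real coordinates).
genes = [
    Interval("chr17", 1000, 5000, "GENE_A"),
    Interval("chr11", 20000, 30000, "GENE_B"),
]
calls = [
    Interval("chr17", 4500, 9000, "del_1"),    # overlaps GENE_A
    Interval("chr17", 5000, 9000, "del_2"),    # starts exactly at the GENE_A end: no overlap
    Interval("chr11", 100, 200, "dup_1"),      # far from GENE_B
    Interval("chr11", 19000, 31000, "del_3"),  # fully contains GENE_B
]

result = filter_calls_by_genes(calls, genes)
print(result)
```

A linear scan over the gene list is adequate for a panel of 98 genes; an interval tree or a sorted sweep would be the usual choice for genome-wide annotation sets.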
Manual review of germline and tumour Illumina genome sequencing data was performed using the Integrative Genomics Viewer (IGV) v2.7.0 to flag suspected technical artifacts and prioritize candidate variants for assessment by Oxford Nanopore long-read sequencing. Five carriers of pathogenic SVs were previously identified through clinical testing and referred to the BC Cancer Hereditary Cancer Program. These variants were used to determine the sensitivity of SV calling from short-read genome sequencing and to guide manual data curation of novel variants.

Breakpoint sequence analysis
Repetitive elements overlapping breakpoints predicted by Illumina and/or Nanopore genome sequencing were identified using the annotated RepeatMasker dataset obtained from the University of California Santa Cruz (UCSC) Table Browser for the reference genome version hg19 (http://genome.ucsc.edu/) 8,9. Sequence identity within ±150 bp of predicted breakpoints was evaluated through pairwise sequence alignment using EMBOSS Needle 10. Percent identity and gaps in pairwise alignments between each corresponding 5' and 3' breakpoint were noted, and each alignment was manually reviewed for regions of microhomology. Genomic features at breakpoint junctions were similarly evaluated through pairwise sequence alignment and manual review, comparing short-read contig sequences, when available, and expected junctional sequences based on the reference genome.
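The percent identity and gap counts noted above can be illustrated with a minimal global (Needleman-Wunsch) alignment in Python. This is a sketch under simplifying assumptions, not a reimplementation of EMBOSS Needle: it uses a linear gap penalty and arbitrary match and mismatch scores, whereas EMBOSS Needle uses affine gap penalties and a substitution matrix. The function names, the `flank` helper and the example sequences are hypothetical.

```python
def flank(seq, pos, half_width=150):
    """Extract the window of +/- half_width bases around a breakpoint position."""
    return seq[max(0, pos - half_width):pos + half_width]


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (assumed scores, for illustration)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def identity_stats(top, bottom):
    """Percent identity over the full alignment length, plus the number of gap columns."""
    length = len(top)
    if length == 0:
        return {"identity_pct": 0.0, "gaps": 0, "length": 0}
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    gaps = sum(1 for x, y in zip(top, bottom) if x == "-" or y == "-")
    return {"identity_pct": 100.0 * matches / length, "gaps": gaps, "length": length}


# Hypothetical short sequences standing in for 5' and 3' breakpoint flanks.
aligned_5p, aligned_3p = needleman_wunsch("ACGTACGT", "ACGTCGT")
stats = identity_stats(aligned_5p, aligned_3p)
print(aligned_5p)
print(aligned_3p)
print(stats)
```

Percent identity is computed here over the full alignment length including gap columns, which is one common convention; other definitions use the shorter sequence length as the denominator and will give different values.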

RNA-seq analysis
Paired-end RNA-seq reads were aligned to the hg19 reference genome using Trans-ABySS v1.4.10, and duplicate reads were marked with Picard tools v1.92. mRNA read support for aberrant splicing and fusion transcript expression associated with germline SVs was computed using TAP, a pipeline for targeted assembly and realignment 18. Briefly, we classified and filtered RNA-seq reads matching target gene reference sequences and performed de novo assembly using Trans-ABySS. Contigs were aligned to the reference genome and transcriptome using BWA-MEM to characterize splicing events and fusion transcripts, and read support across known and novel splice and fusion junctions was calculated from the number of reads mapping to each contig sequence.

Supplementary Figure S1. Illumina and Oxford Nanopore genome sequencing data indicating a recurrent intronic inverted duplication on chromosome 16p13 in Case 1
Illumina and Oxford Nanopore genome sequencing data for Case 1 visualized using IGV at the loci of IFT140 and TSC2. Paired-end reads mapping to intron 30 of IFT140 and intron 16 of TSC2 are shown in parallel and coloured by strand. 133 bp and 136 bp insertions were found in two Nanopore reads, with sequences mapping to Alu elements at the locus of the TSC2 breakpoint predicted by Illumina short-read sequencing.

Supplementary Figure S2. Illumina and Oxford Nanopore genome sequencing data indicating a recurrent intronic inverted duplication on chromosome 16p13 in Case 2
Illumina and Oxford Nanopore genome sequencing data for Case 2 visualized using IGV at the loci of IFT140 and TSC2. Paired-end reads mapping to intron 30 of IFT140 and intron 16 of TSC2 are shown in parallel and coloured by strand. 133 bp and 136 bp insertions were found in two Nanopore reads, with sequences mapping to Alu elements at the locus of the TSC2 breakpoint predicted by Illumina short-read sequencing.

Supplementary Figure S3. Illumina and Oxford Nanopore genome sequencing data indicating a recurrent intronic inverted duplication on chromosome 16p13 in Case 3
Illumina and Oxford Nanopore genome sequencing data for Case 3 visualized using IGV at the loci of IFT140 and TSC2. Paired-end reads mapping to intron 30 of IFT140 and intron 16 of TSC2 are shown in parallel and coloured by strand. 133 bp and 136 bp insertions were found in two Nanopore reads, with sequences mapping to Alu elements at the locus of the TSC2 breakpoint predicted by Illumina short-read sequencing.

Supplementary Figure S4. Illumina and Oxford Nanopore genome sequencing data supporting a likely benign complex rearrangement on chromosome 5q35
Illumina and Oxford Nanopore genome sequencing data for Case 4 visualized using IGV at the locus of UIMC1 and NSD1. Split Nanopore reads spanning the breakpoint junctions are shown mapping to flanking regions of the predicted breakpoints, denoted by black arrows, and connected by a thin gray line. Read segments coloured red and blue denote split reads mapping to both plus and minus strands, indicating a probable inversion event.

Supplementary Figure S5. Illumina and Oxford Nanopore genome sequencing data supporting a pathogenic complex rearrangement on chromosome 16p13
Illumina and Oxford Nanopore genome sequencing data for Case 5 visualized using IGV at the locus of TSC2 and NTHL1. Split Nanopore reads spanning the breakpoint junctions are shown mapping to flanking regions of the predicted breakpoints (black arrows) connected by a thin gray line. Read segments coloured red and blue denote split reads mapping to both plus and minus strands, indicating a probable inversion event.

Supplementary Figure S6. Illumina and Oxford Nanopore genome sequencing data supporting a 96 kb deletion in ATM
Illumina and Oxford Nanopore genome sequencing data for Case 6 visualized using IGV at the locus of ATM. One Nanopore read spanning the breakpoint junction from two independent sequencing runs is shown mapping to flanking regions of the predicted breakpoints (black arrows) and is connected by a thin gray line.

Supplementary Figure S7. Illumina and Oxford Nanopore genome sequencing data supporting a single-exon inversion in RAD51C
Illumina and Oxford Nanopore genome sequencing data for Case 7 visualized using IGV at the locus of RAD51C. Split Nanopore reads spanning the breakpoint junctions are shown mapping to flanking regions of the predicted breakpoints (black arrows) connected by a thin gray line. Read segments coloured red and blue denote split reads mapping to both plus and minus strands, indicating a probable inversion event.

Supplementary Figure S8. Illumina and Oxford Nanopore genome sequencing data supporting a single-exon deletion in ATM
Illumina and Oxford Nanopore genome sequencing data for Case 8 visualized using IGV at the locus of ATM. Split Nanopore reads spanning the breakpoint junctions are shown mapping to flanking regions of the predicted breakpoints (black arrows) connected by a thin gray line.

Supplementary Figure S9. Illumina and Oxford Nanopore genome sequencing data supporting a 77 kb deletion with breakpoints in BRCA1 and NBR2
Illumina and Oxford Nanopore genome sequencing data for Case 9 visualized using IGV at the locus of BRCA1. Split Nanopore reads spanning the breakpoint junctions are shown mapping to flanking regions of the predicted breakpoints (black arrows) connected by a thin gray line.

Supplementary Figure S10. Illumina and Oxford Nanopore genome sequencing data supporting a multiexon deletion in BRCA1
Illumina and Oxford Nanopore genome sequencing data for Case 10 visualized using IGV at the locus of BRCA1. Several split Nanopore reads from two PromethION sequencing runs spanning the breakpoint junctions are shown mapping to flanking regions of the predicted breakpoints (black arrows) and are connected by a thin gray line.

Supplementary Figure S11. Illumina and Oxford Nanopore genome sequencing data supporting a 129 kb deletion encompassing EPCAM and part of MSH2
Illumina and Oxford Nanopore genome sequencing data for Case 11 visualized using IGV at the locus of EPCAM and MSH2. Split Nanopore reads spanning the breakpoint junctions are shown mapping to flanking regions of the predicted breakpoints (black arrows) connected by a thin gray line.

Supplementary Figure S12. Illumina and Oxford Nanopore genome sequencing data supporting a 24 kb deletion in FANCA
Illumina and Oxford Nanopore genome sequencing data for Case 12 visualized using IGV at the locus of FANCA. Split Nanopore reads spanning the breakpoint junction were identified in only one of two independent PromethION sequencing runs and are shown mapping to flanking regions of the predicted breakpoints (black arrows).

Supplementary Figure S13. Illumina and Oxford Nanopore genome sequencing data supporting a 3.4 kb deletion in PALB2
Illumina and Oxford Nanopore genome sequencing data for Case 13 visualized using IGV at the locus of PALB2. Split Nanopore reads spanning the breakpoint junction are shown mapping to flanking regions of the predicted breakpoints (black arrows).

Supplementary Figure S14. Illumina genome sequencing data supporting a clinically confirmed multiexon deletion in TP53
Germline and tumour Illumina short-read genome sequencing data for Case 14, who carries a clinically validated germline deletion in TP53 but for whom germline DNA was insufficient for long-read sequencing. Breakpoints characterized by Illumina genome sequencing are denoted by black arrows, and deletions are observed as a decrease in read coverage highlighted by blue shaded boxes.

Supplementary Figure S15. Contribution of characterized somatic SNV signatures to tumourigenesis in cases with known genetic associations
Somatic SNV signatures were characterized in tumours from carriers of pathogenic germline structural variants. The number of somatic SNVs in each of 96 possible trinucleotide contexts is shown for cases with known associations according to the Catalog of Somatic Mutations in Cancer (COSMIC) version 2, and the percent contribution of relevant signatures to global somatic single nucleotide variation is noted. BER, base excision repair; HR, homologous recombination; MMR, mismatch repair.