Prevalence of germline pathogenic variants in 22 cancer susceptibility genes in Swedish pediatric cancer patients

Up to 10% of pediatric cancer patients harbor pathogenic germline variants in one or more cancer susceptibility genes. A recent study from the US reported pathogenic variants in 22 out of 60 analyzed autosomal dominant cancer susceptibility genes, implicating 8.5% of pediatric cancer patients. Here we aimed to assess the prevalence of germline pathogenic variants in these 22 genes in a population-based Swedish cohort and to compare the results to those described in other populations. We found pathogenic variants in 10 of the 22 genes covering 3.8% of these patients. The prevalence of TP53 mutations was significantly lower than described in previous studies, which can largely be attributed to differences in tumor diagnosis distributions across the three cohorts. Matched family history for relatives allowed assessment of familial cancer incidence, however, no significant difference in cancer incidence was found in families of children carrying pathogenic variants compared to those who did not.


Supplementary
: Distribution of (A) variant type and (B) effect classification for the 25 variants determined to be pathogenic or likely pathogenic. Actual numbers are given above the bars.

Target enrichment and sequencing
Fluidigm Juno target-enriched libraries were prepared for all 797 DNA samples. Since the final sequence yield of different libraries showed a degree of random variation not related to sample input DNA concentration, with some libraries having very low yield, at least two replicate libraries were prepared for each DNA sample (Supplementary Table "Library Info").
In total, 1706 sequencing libraries were prepared and sequenced. Two replicate libraries were sequenced for 773 of the DNA samples, four libraries for 23 of the samples, and 20 libraries for one of the samples. All samples passed our minimum base quality score requirement of 80% of bases at base quality 30 or higher. However, seven samples had less than 90% of the assay target region covered by 30 high-quality aligned reads. These seven samples were therefore excluded from further analyses. For the remaining samples, 94.6% of the target region was covered by 30 or more high-quality aligned reads, on average, and the mean sequence coverage was 1741 reads. Only 1.1% of the assay target region had no coverage, on average per sample (Supplementary Table "Sample Info")

Target-enrichment panel primer design
We designed a Fluidigm Juno targeted DNA sequencing assay (Fluidigm, South San Francisco, California, USA) for 22 genes previously found to carry pathogenic or likely pathogenic variants predisposing for cancer in a cohort of pediatric cancer patients (1). The assay was designed to target the coding region, 5¢ untranslated region (5¢ UTR) and 20 bp of intron adjacent to each exon/intron border of all complete coding transcripts from RefSeq and Gencode version 24 with transcript support level 1 (Supplementary Methods Tables "Target   Transcripts" and "Target Summary"). The assay was performed using Fluidigm D3 software.
Overlapping amplicons between 150 and 290 bp long were tiled across the target regions taking into account SNPs from dbSNP 137 with ³1% frequency or flagged as clinically associated. The forward and reverse primers included a universal 5¢ tail sequence required for adding sample-specific index and adapter sequences for Illumina sequencing. The design was manually reviewed, and five amplicons with both primers designed in repeat regions were removed. Primer pairs were divided into multiplex pools based on melting temperature, and to avoid unintended complementarity between primers and primers and products. The final assay contained primer pairs for 809 amplicons divided into twenty primer pools (Supplementary Methods

DNA extraction, library preparation, target enrichment and sequencing
DNA was extracted from 797 samples using the QIAmp DNA Blood Maxi kit (Qiagen).
Library preparation and target enrichment were performed using the Fluidigm Juno targeted

Sequence preprocessing, alignment, and variant calling
See Supplemental figure S1 with bioinformatics processing workflow

Sequence read preprocessing
We trimmed adapter sequences from reads using Trimmomatic version 0.33 (2)

Alignment
Sequence reads were aligned to the human reference genome using the base quality aware Novoalign version 3.02.13 (Novocraft Technologies Sdn Bhd, Petaling Jaya, Selangor, Malaysia, http://www.novocraft.com/products/novoalign/). We used the human reference genome GRCh37 with decoy sequences from the 1000 Genomes Project as reference (see ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_seq uence/README_human_reference_20110707). To reduce allelic bias in the alignment, substitution variant alleles in dbSNP version 37 with an allele frequency of 1% or more in 1000 Genomes Project phase 1 genotype data were added to the reference genome as ambiguity codes. The Novoalign genome index was created with a k-mer length of 14 (-k 14) and step size of 2 (-s 2). Novoalign was run in amplicons mode with a delta value of 2 (amplicons <amplicon primer coordinates bed> 2). This means that sequences corresponding to the PCR primers are utilized for the alignment, but then soft-clipped from the ends of readpairs mapping within 2 bp from the coordinates of the primer binding sites. Soft-clipping was enabled and a soft-clip bonus of 20 was added to the alignment score for alignments extending to the end of reads (--softclip 20). Other non-default parameters were -r R -Q2Off and -i +-145-295 (the range of amplicon sizes in the assay is 150-290 bp). Read group information was added (AddOrReplaceReadGroups) and bam files merged per library (MergeSamFiles) using Picard version 2.8.2.

Sequence and alignment quality control
The yield (total reads and bases that pass the Illumina quality control "chastity" filter) and base qualities (percent bases that achieve a quality score of 30 or higher) of the demultiplexed, trimmed and filtered sequence reads were assessed using Picard CollectQualityYieldMetrics. Alignment quality control metrics were calculated using Picard TargetedPcrMetrics with MINIMUM_MAPPING_QUALITY set to 20, MINIMUM_BASE_QUALITY set to 20, and CLIP_OVERLAPPING_READS set to true.
Evaluated quality metrics included PCT_PF_UQ_READS_ALIGNED, PCT_OFF_AMPLICON, MEAN_TARGET_COVERAGE, ZERO_CVG_TARGETS_PCT, and PCT_TARGET_BASES_30X (for definition of metrics, see http://broadinstitute.github.io/picard/picard-metric-definitions.htm). With these minimum mapping and base quality settings, the PCT_TARGET_BASES_30X can be regarded as a measure of the fraction of the target region that is "callable", meaning that the coverage of high-quality aligned reads is sufficient for a variant caller to detect genetic variants.
Samples were excluded from further analyses if less than 80% of bases had a quality score of 30 or higher, or if less than 90% of the target region was covered with 30 or more high quality aligned reads.
-For VarDict the following parameters were set: -F 0x500 -I 120 -k 1 -z 1 -c 1 -S 2 -E 3 The vast majority of pathogenic variants in genes associated with hereditary cancers have allele frequencies well below 0.01% in large outbred general population databases (21). The very few pathogenic variants with allele frequencies above 0.01% in the general population are likely to be well known, already characterized in the literature, and represented in databases of pathogenic variants such as ClinVar (21). We therefore applied the ACMG/AMP BS1 criterion with an allele frequency cutoff of 0.01%, except for variants already reported as (likely) pathogenic by any submitter to ClinVar. To account for sampling variance, we used a so-called "filtering allele frequency" threshold, the maximum true population allele frequency that is consistent with a particular allele count observed in a population based on the Poisson 95% confidence interval (22). Filtering allele frequencies from populations known to have gone through a bottleneck (Ashkenazi Jewish and Finnish) were not used. We applied the BA1 criterion to variants with a filtering allele frequency greater than 1%. The BS2 criterion was applied as supporting evidence (BS2_P) in combination with BS1 if homozygous individuals were reported in gnomAD or ExAc populations.

Confirmation of variants classified as (likely) pathogenic
The supporting alignment data for all variants classified as pathogenic or likely pathogenic were visually reviewed in Integrative Genomics Viewer (IGV) to help identify potential false positives and characterize complex variants (23,24). All variants classified as pathogenic and likely pathogenic that passed the review in IGV were Sanger sequenced to confirm true variant calls.

Validation of methods
We have validated the Fluidigm Juno method using a larger assay covering 31 genes and 1349 amplicons in 20 primer pools. That assay included 10 of the 22 genes and 56% of the targeted bases of the assay used in this study. The methods used for assay design, library preparation, target enrichment, sequencing, and bioinformatic analyses were identical to those used in this study and described above with one exception: we removed amplicons with both primers in repeat regions when designing the panel used in this study. This reduced the offtarget capture in this study compared to the validation study. Amplicon lengths were similar with an average of 219 bp in the validation design and 222 bp in this study design.
We used 95 DNA samples extracted from blood as positive controls. These DNA samples had previously been screened for genetic variants using a clinically validated hybrid selection assay (Agilent SureSelect) and were selected to represent a broad variety of sequence variants. The overlap between the target regions of the clinical hybrid selection assay and the Fluidigm Juno validation assay included the coding region and 10 bp adjacent introns of 21 genes (78,643 bp), and the positive controls together contained 314 different genetic variants in that region. These variants included both common polymorphisms and rare variants. The longest deletion was 21 bp and the longest insertion was 11 bp combined with a deletion of 1 bp. Most control variants were also Sanger sequenced for confirmation (96% of the indels and 89% of the SNVs).
The sequence yield and coverage were similar for the positive control libraries prepared with the validation assay and the libraries prepared with the assay used in this study. On average, 95.1% of the bases targeted by the validation panel were covered with 30 or more highquality aligned reads, compared to 94.6% in this study.
The sensitivity of the validation assay to detect the positive control variants was 96.7%, calculated as the mean sensitivity across variants (Smean(var), see below). The number of false positive variants (FP) detected was 41 (11.5%). All the false positives were identified as such when visually reviewed in IGV. The fraction of false positives identified when variants detected in this study were reviewed in IGV was similar (13.4%) to the false positive rate estimated for the validation assay.
The sensitivity per variant (Svar) was calculated as: where TPvar is the number of libraries with true positive calls, and FNvar is the number of libraries with false negative calls for the variant.
The mean sensitivity across variants (Smean(var)) was calculated as ! ./#0("#$) = ! "#$ , "#$ where Nvar is the number of distinct variants. This mean gives equal weight to each distinct variant, rare or common, and is therefore a good measure of our ability to detect the complete spectrum of variants.