Genetic landscape of interval and screen detected breast cancer

Interval breast cancers (IBCs) are cancers diagnosed between screening episodes. Understanding the biological differences between IBCs and screen-detected breast-cancers (SDBCs) has the potential to improve mammographic screening and patient management. We analysed and compared the genomic landscape of 288 IBCs and 473 SDBCs by whole genome sequencing of paired tumour-normal patient samples collected as part of the UK 100,000 Genomes Project. Compared to SDBCs, IBCs were more likely to be lobular, higher grade, and triple negative. A more aggressive clinical phenotype was reflected in IBCs displaying features of genomic instability including a higher mutation rate and number of chromosomal structural abnormalities, defective homologous recombination and TP53 mutations. We did not however, find evidence to indicate that IBCs are associated with a significantly different immune response. While IBCs do not represent a unique molecular class of invasive breast cancer they exhibit a more aggressive phenotype, which is likely to be a consequence of the timing of tumour initiation. This information is relevant both with respect to treatment as well as informing the screening interval for mammography.


Sample curation
Tumour and germline sequencing data were obtained using version 14 of the main programme release of the 100,000 Genomes Project (100kGP), an NHS initiative for high-throughput sequencing of cancers and rare diseases 1,2 .Participant recruitment was coordinated by 13 Genomic Medicine Centres (GMCs) and affiliated hospitals from around the UK and written informed consent was provided by all patients.Ethical approval for the 100kGP was granted to Genomics England by the East of England -Cambridge South Research Ethics Committee (REC reference: 14/EE/1112).Tissue collection and the subsequent preparation, extraction and quantification of DNA was undertaken locally.Cubes or slices of tissue were cut from the tumour with a requirement that at least 40% of the tissue nuclei were malignant, and less than 20% of the area being necrotic.A minimum of 2μg of tumour DNA and 10μg of germline DNA was required for processing.DNA was then transferred to a central biorepository where whole genome sequencing of the paired tumour/normal DNA was performed by Illumina.The resulting processed BAM files were delivered to Genomics England, who performed additional quality checks.

Whole genome sequencing
Samples were prepared using the Illumina TruSeq DNA PCR-free library preparation kit and sequencing was performed using HiSeq X, producing 150 base pair (bp) paired-end reads.The average sequencing depth of normal blood and tumour samples were 30x and 100x respectively.Samples identified as having poor sequencing quality, considering: AT/CG dropout; percentage of mapped reads; percentage of chimeric DNA fragments; average insert size; local coverage unevenness, were excluded.Alignment of sequences to the Homo sapiens GRCh38Decoy assembly was achieved using Isaac (version 03.16.02.19) 3.

Small variant calling
Germline SNVs and indels were called using Starling 3 while somatic SNVs and small indels were called using Strelka (version 2.4.7) 4 .Somatic variants were removed using the default Strelka filters along with the following additional criteria: • A population germline allele frequency ≥1% according to gnomAD 5 or the 100kGP dataset.
• Classified as a simple repeat by Tandem Repeats Finder 6 .
• An indel for which ≥10% of base calls in a window of 50 bases on either side of the indel have been filtered by Strelka.This filter is indicative of a high level of sequencing noise.
• A majority of overlapping 150bp reads that align to multiple loci will result in those reads being discarded.
• Statistical evidence of variant being a result of a calling artefact.This was evaluated by comparing the ratio of tumour allele depths at a given site to that of allele depths in a panel of normal samples present in the 100kGP dataset.Individuals that did not carry the relevant alternate allele at a given site were included when computing the normal allele depth.In order to replicate the Strelka default filters, duplicated reads were removed and quality thresholds were set at base quality ≥5 and mapping quality ≥5.Furthermore, variants with a phred score <80, as calculated using Fisher's exact test, were removed.
High impact pathogenic germline variants were identified by further filtering based on CADD 7 scores and ClinVar 8 annotations.A threshold of CADD score > 30 was imposed, while variants were required to have a ClinVar annotation that was not "Benign".

Copy number alterations
Copy number alterations (CNAs) were called using an iterative procedure that utilised Battenberg v2.2.8 9 .The four steps to call CNAs are as follows: Step 1) Clonal and subclonal CNAs were profiled using Battenberg, along with an estimation of sample purity and tumour ploidy.Briefly, the number of reads supporting SNV reference and alternate alleles were counted for both tumour and normal samples, using alleleCount-FixVAF 10 .Heterozygous SNVs were phased using SHAPEIT2 v2.r904 11 .The phased SNVs were then segmented using piece-wise constant fitting 12 and subclonal segments were identified using t-tests.Sample purity and tumour ploidy were estimated using the method described by Van Loo, et al. 13 As sequencing data were aligned to hg38, it was necessary to convert SNV positions to hg37 before phasing, and convert output segments back to hg38.
Step 2) Variant allele frequency distributions are used to evaluate copy number profile concordance.
The expected variant allele frequency is dependent on a number of factors, including: the fraction of tumour cells containing the variant; the tumour copy number profile; the number of chromosome copies with the variant (multiplicity); the sample purity 14 .Given the tumour copy number profile and an estimated sample purity, both estimated by Battenberg in step 1, we can expect to observe enrichment of variants with allele frequencies approximating particular values which represent clonal variants present in all tumour cells 13 .A failure to observe such an enrichment would suggest that either the copy number profile or sample purity is incorrect.We therefore assessed the Battenberg output validity via a comparison with the SNV variant allele frequency (VAF) distributions.
Each of these five copy number states was evaluated separately, as the possible variant multiplicities and expected clonal SNV VAFs differ between states 13 .Copy number states which corresponded to genomic regions containing <5% of all SNVs were not considered.Expected locations of peaks in the VAF distribution were estimated as: where  !"##$%&$'( is the sample purity as estimated by Battenberg,  * is the ploidy of the tumour at the variant site, and  is the variant multiplicity.The multiplicity can take the value of 1 or 2 in copy number states of 2:2, 2:1 and 2:0, and only 1 in the remaining states.VAF distribution peaks were called using peakPick v0.11 15 , which utilises kernel density estimation.Peaks that corresponded to a density <0.3 were excluded.Iterating over the copy number states, the expected location of the peak corresponding to the highest variant multiplicity was matched to the largest VAF of the observed distribution.All other expected peaks were matched to the observed peak with the most similar VAF.
Tumour heterogeneity must be considered as it can reduce VAF peak detection capabilities.
Therefore, for samples where ≥1 expected peak locations were considered, the expected peak that was furthest from the respective matched observed peak was removed.Sample purity,  + , was reestimated for each remaining expected peak location (with VAF ) using the matched observed peak VAF, : where  , is the ploidy of the respective copy number state.This allowed a single new purity estimate,  %$-, to be computed as the weighted average of the peak-wise purity estimates: where  + is the number of SNVs in genomic regions with the copy number state, N is the number of SNVs in genomic regions of all considered copy number states, and  + is the number of considered multiplicities for the copy number state.
Step 3) Quality assessment of CNA profiles.A metric to aid in evaluating the CNA profile quality is the weighted average of the difference between the purity estimated by Battenberg and the peak-wise purity estimates: The following criteria were used to evaluate the CNA profiles:

•
The location of VAF distribution peaks were estimated correctly (defined as  <5%).
• DPClust 9 identified a cluster containing ≥5% of all SNVs with a CCF of between 0.9 and 1.1.

•
In the case that Battenberg estimates that most of the genome is tetraploid (2:2), a peak in the SNV VAF distribution in 2:2 regions corresponding to a variant multiplicity of 1 was observed.
• No single homozygous deletion >10Mb is called.
Samples satisfying all criteria were deemed to pass and their CNA profiles and purity estimates were used in downstream analyses.Samples that did not pass these criteria were re-profiled (Step 4).
Step 4) CNA re-profiling occurred a maximum of three times using Battenberg with new purity and ploidy estimates.Samples that did not pass the quality assessment criteria (Step 3) after three re-profiling attempts were not considered in downstream analyses.The new purity,  %$-, was estimated in Step 2, whilst the new ploidy,  %$-, was estimated using 13 :

Structural variant calling
Somatic structural variants (SVs) were called using Delly 16 , Lumpy 17 and Manta 18 , taking a consensus of all three while also taking into account the identified copy number alterations.The default parameters were used for all three SV callers.In addition, Delly was run with post-filtering of somatic SVs using all normal samples, as described in the Delly documentation.SVs from the three individual callers were subject to the following additional filters and removed if satisfying any condition: any reads supporting the variant were also identified in the matched normal sample; the variant was supported by <2% of tumour reads; a variant breakpoint was located in a telomeric or centromeric region; a variant breakpoint was located on a non-standard reference contig (i.e., not chromosomes 1-22, X or Y).The identified SVs were combined using a modified version of PCAWG Merge SV, which utilises a graph-based approach to identify and merge SVs identified by multiple callers, allowing a 400bp window for the breakpoint positions 19 .SVs were retained if they were identified by at least two of the SVs callers.Additionally, a SV was also retained if identified by a single SV caller but with a breakpoint that lies within 3kb of a called CNA segment boundary.

Whole genome duplication
The average genome copy number state,  ")$ , was used as a metric to determine the threshold used to classify whether a tumour had undergone whole genome duplication (WGD). ")$ is defined as , where  is the number of copy number genome segments,  :,+ is the fraction of tumour cells carrying copy number state ,  :,+ <": and  :,+ <+% are the major and minor allele copy numbers and  + is the base pair length of a genome segment.In the absence of a subclonal alteration,  =,+ = 1 and  >,+ = 0.

Identification of driver mutations
Candidate protein-coding driver genes were identified using The Integrative OncoGenomics pipeline (IntOGen) 21 .Tumour samples were flagged for exclusion from driver gene identification if they were hypermutated (>10,000 mutations) or had an outlier mutation count when compared to the rest of the cohort.This threshold was defined as: upper quartile + 1.5 × interquartile range.Additionally, mutations that were identified in the Hartwig Consortium Panel of Normal samples were also excluded 22 .
IntOGen incorporates seven driver gene identification methods: dNdSCV 23 ; OncodriveFML 24 ; OncodriveCLUSTL 25 ; cBaSE 26 ; MutPanning 27 ; HotMaps3D 28 ; smRegions 29 .The results of each of the driver identification methods were combined by first generating a "truth set" of driver genes using Tier 1 and 2 genes in the COSMIC Cancer Gene Census 30 .The relative enrichment of genes in the truth set is used to generate a weighting for each method.A consensus ranking of driver genes was generated using Schulze's voting method, taking the ranked lists of genes from each of the seven methods.P-values were estimated using a weighted Stouffer Z-score method.With the consensus ranking and combined P-values estimated, driver genes were classified into four tiers of descending confidence: Tier 1 -genes where the consensus ranking is higher than the ranking of the first gene with Stouffer Q > 0.05; Tier 2 -genes which are part of the truth set and show a combined Stouffer Q < 0.25; Tier 3 -genes with a Stouffer Q < 0.05; Tier 4 -genes with Stouffer Q > 0.05.Candidate genes are classified according to their highest possible tiering, e.g a gene that satisfies the criteria for tiers 1 and 3 will be classified as tier 1.
Additional filtering was performed on the candidate driver genes as a final step of the identification process.Genes were excluded from the final list of driver genes if they had any of the following properties: classified as tier 4 by the combination method; only significant (Q < 0.1) in one of the seven identification methods; the gene has very low expression in The Cancer Genome Atlas breast cancer samples; the gene is in a list of olfactory receptor genes; the gene is in a known list of artefacts or long genes.Further details of the driver identification process are given by Kinnersley et al. 31

Polygenic risk scores
Polygenic scores (PGS) were calculated on a per patient basis using genome-wide association studies (GWAS) summary statistics for European populations.Scores were calculated by plink using the '-scores' argument with the 'sum' argument enabled.The per patient PGS were aggregated and normalised such that the distribution for the whole cohort had mean of 0 and standard deviation of 1.The normalised scores for IBCs and SDBCs were extracted and their difference assessed using a t test.
We used the GWAS as reported by Mavaddat et al. 32 for breast cancer risk.We also considered a number of well-established modifiable risk factors for breast cancer and used GWAS from the following resources: GSCAN consortium meta-analysis of smoking initiation (ever vs never status)32 UK biobank (UKBB) meta-analysis of body mass index (BMI)33, and those relating to diabetes, such as fasting glucose and fasting insulin, were obtained from UKBB studies35.We used GWAS relating to breast density as reported by Chen et al.34.

Assessment of immune evasion
Neoantigen prediction HLA-typing was performed using POLYmorphic loci reSOLVER 33 (POLYSOLVER), resulting in all six HLA class I alleles (from the HLA-A, HLA-B and HLA-C genes) identified for all samples.Neoantigens were predicted using personalized Variant Antigens by Cancer Sequencing (pVAC-Seq) 34 .pVAC-Seq uses the predicted binding affinities of peptides, arising due to non-synonymous mutations, to major histocompatibility complex class I molecules.This is achieved by combining the results of eight methods (NetMHC Peptides are classified as neoantigens if the peptide meets the following conditions: • Has a binding affinity ≤500nM (mean of all 8 methods).
• Corresponds to a canonical transcript.
• Is novel with respect to the human proteome.

Immune escape mechanisms
We considered three mechanisms of genetic immune escape: a non-synonymous mutation in any of the three HLA Class I genes; LOH in any of the three HLA-I genes; any inactivating mutation in a list of 22 genes essential to antigen presentation.POLYSOLVER was used to identify somatic mutations in the HLA genes.This uses a combination of MuTect 43 to check for non-synonymous SNVs and Strelka for insertions and deletions in HLA-aligned reads.HLA LOH was predicted using Loss of Heterozygosity in Human Leukocyte Antigen 44 (LOHHLA).LOHHLA was run with the number of mismatch sites between any two allele pairs set to >10 and the minimum coverage filter at these sites set to 10. LOHHLA did not make predictions of LOH for genes of patients that did not meet either of these thresholds, thus homozygous alleles were neglected.To define LOH of HLA we used the same definition as Cornish et al. 45 • Presence of allelic imbalance, with the difference in evidence of the two alleles fulfilling P <0.01.

•
The copy number of the lost allele was <0.5 with a confidence interval <0.7.

•
The copy number of the kept allele was >0.7.

•
The number of mismatched sites between alleles was >10.
A list of 22 antigen presenting genes was created by considering the genetic components of antigen presenting machinery.Specifically, the IFN- pathway, the PF-L1 receptor, the CD58 receptor, and epigenetic escape via SETDB1 were considered resulting in the following genes 46,47

Extraction of mutational signatures
De novo extraction of single-base-substitution (SBS), doublet-base-substitution (DBS), insertion and deletion (ID), copy number (CN) and structural variant (SV) signatures, including decomposition to known COSMIC signatures 49 (v3.3), was performed using SigProfilerExtractor 50 by Everall et al. 51 Single base substitutions were classified by considering their tri-nucleotide and transcriptional context, resulting in 288 unique classes.Double base substitutions and small indels were classified into 78 and 83 classes as is the case in COSMIC.Copy number events were classified into 48 classes that depended on the length of the sequence, type of copy number change and whether LOH was present 52 .Thirty two classes of SVs were based on the size of the SV and whether it existed as part of a cluster 53 .
All signatures were extracted using random initialization, 500 NMF replicates, and between 10,000 and 1,000,000 NMF iterations.We assumed the presence of between 1 and 25 SBS and ID signatures, between 1 and 20 DBS signatures, and between 1 and 15 CN and SV signatures.Optimal solutions were manually chosen considering solution stability across NMF replicates and the observed mutational profile reconstruction error.