Analysis of 10,478 cancer genomes identifies candidate driver genes and opportunities for precision oncology

Tumor genomic profiling is increasingly seen as a prerequisite to guide the treatment of patients with cancer. To explore the value of whole-genome sequencing (WGS) in broadening the scope of cancers potentially amenable to a precision therapy, we analysed whole-genome sequencing data on 10,478 patients spanning 35 cancer types recruited to the UK 100,000 Genomes Project. We identified 330 candidate driver genes, including 74 that are new to any cancer. We estimate that approximately 55% of patients studied harbor at least one clinically relevant mutation, predicting either sensitivity or resistance to certain treatments or clinical trial eligibility. By performing computational chemogenomic analysis of cancer mutations we identify additional targets for compounds that represent attractive candidates for future clinical trials. This study represents one of the most comprehensive efforts thus far to identify cancer driver genes in the real world setting and assess their impact on informing precision oncology.


Study sample selection
Tumour samples were excluded if clinical data were missing or if unresolvable conflicts existed between data sources (Supplementary Table 1).2,251 of the 14,129 (15.9%) samples were excluded because: (1) Sex reported by PHE-NCRAS, NHSD and/or the GMC conflicted with the sex inferred from sequencing data; (2) It was not possible to assign the cancer to one of the 35 tumour groups, either because of missing or conflicting originating tissue or histology data, or because the disease was not represented by one group; (3) Ambiguity as to whether a primary tumour, a metastasis or a recurrence of a primary tumour was sampled; (4) It was not possible to assign a date of sampling due to missing or conflicting data; (5) Patient was <18 years old at time of sampling.
Tumour sample purity and sequencing data quality affect variant calling precision and sensitivity and we therefore applied additional quality control (QC) criteria to the WGS data (Supplementary Table 1) 1 .Overall, 267/11,878 (2.2%) of tumour samples were excluded based on the following sequencing data QC criteria: (1) Tumour or matched germline cross-contamination was >1%, as determined by VerifyBamID 2 ; (2) Number of SNVs called in a tumour was a low outlier for the associated tumour group (SNV number outliers were defined as tumors with a tumor-groupspecific log-transformed SNV number Z-score <-3).To ensure that no individual was represented in the same tumour group, we removed duplicated samples from the same individual, keeping primary tumor samples of highest purity, as estimated by Ccube 3 .A further 505 non-solid tumour samples were excluded as these tumour types were not the subject of the present analysis.After imposing these QC metrics, 10,478 tumour samples were suitable for analysis: 9,693 primary tumours, 634 metastases and 151 primary tumour recurrences from 10,470 individuals.Eight patients were represented in more than one tumour group (Supplementary Table 1).

Whole genome sequencing
Paired tumour-normal (T/N) samples were obtained as part of the 100kGP programme 4,5 .
Recruitment of patients was through 13 Genomic Medicine Centres (GMCs) and affiliated hospitals (Supplementary Fig. 1).All patients provided written informed consent.Tissue collection and preparation, extraction and quantification of DNA was undertaken locally, and DNA transferred to a central national biorepository.Whole genome sequencing of paired T/N DNA was conducted by Illumina.Additional processing, quality checking and data storage was performed by Genomics England.
Sample preparation was conducted using Illumina TruSeq DNA PCR-free library preparation kits.
Sequencing was performed using HiSeq X, generating 150bp paired-end reads.Tumour and germline samples were sequenced to average depths of 100x and 30x respectively.Poor sequencing quality outliers were identified using principal component analysis and excluded (based on the following quality metrics: average insert size, AT/CG dropout, unevenness of local coverage, percentage of mapped reads and percentage of chimeric DNA fragments).Sequencing quality outliers were not included in the 100kGP main programme releases and were therefore not considered herein.Reads were aligned to the Homo sapiens GRCh38Decoy assembly using Isaac v03.16.02.19 6 .Paired T/N sequencing data for 14,129 cancer samples were obtained from the 100kGP main program version 11 release.

Somatic variant calling
Somatic single nucleotide variant (SNV) and small insertion and deletion (indel) calling was performed using Strelka (v2.4.7) 7 .Variants were excluded if they failed any of the default Strelka filters or met any of the following criteria: (1) population germline allele frequency ≥1% in the gnomAD or 100kGP cohorts 8 ; (2) somatic frequency ≥5% in 100kGP tumour samples; (3) overlapped a simple repeat as defined by Tandem Repeats Finder 9 ; (4) the indel was in a region with high levels of sequencing noise (high sequencing noise being defined as ≥10% of base calls in a window extending 50 base pairs to either side of the indel call being excluded by Strelka).
SNVs likely to be an artefact caused by systematic mapping or calling issues were identified by computing the ratio of tumour allele depths at each somatic SNV site and comparing against the ratio of allele depths at the same site in a panel of 7,000 normal germline samples.Allele depth at each site was counted, using bcftools mpileup function (version 1.9), considering only individuals not carrying the corresponding alternate allele.Duplicate reads were removed prior to counting and mapping quality ≥5 and base quality ≥5 thresholds were applied to replicate Strelka filters.SNVs with a Phred quality score ≤50 computed using Fisher's exact test were excluded.Copy number alterations (CNAs) were called using Battenberg following variant allele frequency correction with alleleCount-FixVAF 10,11 .
Pre-processing of mutations -Somatic mutations passing filtering criteria described above were subject to initial sample and mutation pre-processing.In the case of multiple tumours from the same patient, the primary tumour was analysed (in the case of primary/recurrence pair), alternatively the tumour with the highest purity was anlaysed.In each cohort, hypermutated tumours were flagged for exclusion from downstream driver gene identification if containing > 10,000 mutations and having an outlier mutation count (count >1.5* interquartile range (IQR) + upper quartile (UQ)).Mutations found to be present in a Hartwig Panel of Normals were further excluded.Unless otherwise specified, mutations were mapped to canonical protein-coding transcripts from Ensembl v101.
Combining driver identification methodologies -The driver combination procedure considered the top-100 ranked genes and their association Pand Q-values in each of the driver identification methods.Briefly, genes assigned as Tier 1 or Tier 2 somatically mutated genes in the COSMIC Cancer Gene Census (https://cancer.sanger.ac.uk/census; v92 downloaded February 2021) were designated as "CGC" genes and represented a "truth" set of known drivers 25 .Through comparison of the relative enrichment of known drivers in the top ranked gene lists a per-method weighting was obtained.Per-method ranked lists were combined using Schulze's voting method to generate a consensus ranking, with combined P-values estimated using a weighted Stouffer Zscore method.
Driver candidates were classified into the following tiers: Tier 1 -Candidates where the consensus ranking is higher than the ranking of the first gene with Stouffer Q >0.05 (high confidence drivers); Tier 2 -Candidates not meeting the criteria for Tier ; candidate drivers where 0.1 represent those with an excess of missense to nonsense mutations and are assigned as "Oncogenes"; candidate drivers where  < 0.1 represent those with an excess of nonsense to missense mutations and are assigned as "Tumor suppressor genes (TSGs)"; otherwise, the role of the candidate driver is unclear and was assigned as "Ambiguous".
In the case of multiple cohorts being run representing subsets of a given tumour type, a "consensus" role was designated comparing between each subtype role ("Oncogene" if 1+ cohort and no other cohorts assigned as "TSG", "TSG" if in 1+ cohort and no other cohorts assigned as "Oncogene", otherwise "ambiguous").
Gene candidates were annotated by their overlap with any IntOGen cohorts from a previous 2020.02.01 pan-cancer analysis (https://www.intogen.org/download?file=IntOGen-Cohorts-20200201.zip) and from the pan-cancer TCGA analysis reported by Bailey et al., 2018 27 .

Mutations exhibiting extreme strand bias
SNV mutations that otherwise pass filtering criteria as previously detailed were scrutinised if they showed excessive strand bias (i.e.Strelka INFO field "SNVSB=" >10).This highlighted the large number of mutations causing the missense change in CACNA1E (p.Ile95Leu) as being likely falsepositive calls and therefore CACNA1E was excluded as a driver candidate.Similarly, we did not consider WDR64, VCP, GOLGA6L10, NBPF1, TUBB8, TRIM64B, HLA-DQB2 and KIR3DL2 as being bona fide driver genes (Supplementary Table 14).

OncoKB annotation of driver mutations
Nonsynonymous mutations in 684 gene transcripts considered by OncoKB v3.11 were annotated using the OncoKB API (https://www.oncokb.org/) 28.In the first instance, the HGVSg identifier was used, however in rare instances if this failed a combination of gene symbol, consequence and HGVSp were used to map mutations to OncoKB annotations.

Lollipop plots of driver gene mutations -
Lollipop plots of driver gene mutations were generated using the R package trackViewer (https://github.com/jianhong/trackViewer 29.Pfam protein domains mapping to the Ensembl v101 canonical transcripts were plotted.The protein position was taken from the first position in the HGVSp annotation, other than for splice donor and acceptor mutations where the codon nearest to the HGVSc transcript position was assigned as the protein position. Power estimates for driver gene estimation -Power to detect a driver assumes a driver gene has a higher non-silent mutation rate compared to the corresponding background mutation rate 30 .A binomial model was used to theoretically compare a driver gene with a non-driver conditional on the sample size.The mutation rate factor,  + = 3.9, is defined such that driver genes in the 90th percentile of gene-specific background mutation rates (as calculated by MutsigCV 31 ) are included.
Letting  be the exome-wide background mutation rate (mutations per base) of a tumour cohort, the general gene mutation background rate for 90th percentile of genes is  =   . is the nonsilent mutation rate above the background rate.The effective gene length is defined as  -.. = 3/4 assuming the ratio of non-silent to silent mutations is 3 to 1 and  = 1,500 represents the average gene length.The power from the binomial model considers the hypotheses; , where  213 is the non-silent mutation rate of a suspected driver gene.The null hypothesis is rejected at a P-value of 5 × 10 (6 .

Comparison of candidate driver gene mutation rates in different histologies
To compare the rate of driver somatic mutations in different histologies per cancer tissue, expected mutation rate and uncertainties were estimated by taking a uniform distribution prior with a binomial likelihood function with the number of samples which are mutated () vs total samples () for the given cancer site.This generates a beta distribution posterior with parameters  =  + 1,  =  −  + 1 such that the expected mutation rate is ( + 1)/( + 2).For each histology, P-values are evaluated using a binomial test of the mutation rate of the samples with the given histology against the mutation rate of all samples for the given organ.

Assessment of domain specific mutations
We assessed domain specific mutations by considering the cancer drivers where smRegions is a significant bidder (Q-value <0.1) and the driver is annotated in multiple cancer types.

Predicting mismatch repair deficiency
Samples with evidence of defective mismatch repair were detected using mSINGS, following the previously described procedure for background model generation 32,33 .
Copy number profiling and whole genome doubling Clonal and subclonal somatic copy number alterations (CNAs) were identified using an iterative process incorporating Battenberg v2.2.8 11 .
Whole genome duplicated tumours were identified based on the methodology described in Gerstung, et al. ( 2020) 34 .Further details can be found in our accompanying manuscripts 32,35 .

Structural variants calling
We identified structural variants using a graph-based consensus approach including Delly, Lumpy and Manta, and support from CNAs [36][37][38] .Additional details can be found in our accompanying manuscripts 32,35 .

Predicting homologous recombination deficiency
Evidence of homologous recombination deficiency (HRD) was assessed using HRDetect 39 .
HRDetect requires CNA data and was therefore run only on 9,207 tumours passing CNA calling.
To compute the HRDetect score we determined the following input features: exposures of single base substitution signatures, SBS3 and SBS8, as well as COSMIC rearrangement signatures 3 and 5, the proportion of short deletions at microhomology, and the HRD-LOH index 39,40 .SBS3 and SBS8 contribution estimates were obtained from SigProfiler 41 .We used a probabilistic cutoff of 0.7, which translates to 98.7% sensitivity for predicting BRCA1/BRCA2 deficiency and has a high efficacy when applied to multiple cancer types 39 .

Comparison of driver mutation frequencies between Genomics England and MSK
To determine the sensitivity of WGS data from the 100kGP we first compared driver mutation frequencies with those from MSK-IMPACT and MSK-MET, a combined cohort of ~25,000 cancer patients whose tumours have been panel sequenced to identify driver mutations 42,43  and the fraction of samples with an oncogenic mutation in a given driver gene were compared.

Assessing the sensitivity of WGS to detect driver mutations
By analysing the distribution of allelic depths in called PASS mutations in the 100kGP cohort we found that the rate of calls falls when <6 reads support the alternate allele (Supplementary Fig. 13).We therefore used 6 reads as a minimum coverage threshold to approximate the sensitivity of the WGS samples to somatic mutations.Per-base coverage was extracted from tumour bam files using GATK v4.4.0.0 DepthOfCoverage 44 .A gene panel of 43 representative driver genes was obtained from the NHS Genomic Test Directory for Cancer (2021-22 v5.0 published 31 October 2022).Genomic regions were defined as per driver gene identification (i.e.coding sequence (CDS) from the canonical ensembl v101 transcript including essential splice sites).The TERT promoter region was defined as GRCh38 chr5:1295019-1295268. Coverage was mapped to gene panel regions using bedops v2.4.26 and bedtools v2.3.0 to obtain per-gene coverage statistics 45,46 .
Assuming a heterozygous clonal mutation the expected number of reads is given by 0.5 ×  × .The distribution of coverage across samples for three common pancancer drivers is given in Supplementary Fig. 14.The sensitivity is the probability of measuring at least 6 reads given the expected number of reads which we assume is Poisson distributed.We estimated this for all genes and samples shown in Supplementary Fig. 15 & 16.
In a realistic worst case, for a read coverage of 75 (i.e.lower 5th percentile of genes in samples) with a tumour purity of 0.2, the sensitivity is 76% however, this rises to 99.98% for a purity of 0.5.We also estimated the fraction of each driver gene with an expected alt read count > 6.
Results for TP53, KRAS and PIK3CA are given in Supplementary Fig. 17.

Comparison of actionability between WGS and panel
Actionable driver genes were defined from the OncoKB and COSMIC databases.This list was compared with the NHS Genomic Test Directory for Cancer (2021-22 v5.0 published 31 October 2022) at a mutation-specific level.

Timing candidate driver gene mutations
The relative evolutionary timing of candidate driver mutations was obtained using MutationTimeR (https://github.com/gerstung-lab/MutationTimeR) 34.MutationTimeR times somatic mutations relative to clonal and subclonal copy number states and calculates the relative timing of copy number gains.Hence, the number of clonal and subclonal copy number states can influence estimates of the timing of somatic mutations across different tumor types.

Preparing MutationTimeR input files
Copy number input for MutationTimeR was prepared from Battenberg segmentation files, with the clonal frequency of each segment taken as the tumour purity.In the case of subclonal calls, the clonal frequency was calculated by multiplying the tumour purity by the clonal fraction.
The clusters input for MutationTimeR was prepared from DPClust cluster estimates.The VAF proportion was calculated by multiplying the estimated cluster CCF by the tumour purity.
VCF input for MutationTimeR was obtained from the small somatic SNV/indel variant VCFs which had been filtered as previously described.For SNVs, alt and ref depths were obtained using FixVAF (https://github.com/danchubb/FixVAF).For indels, ref and alt depths were obtained from Tier2 Strelka TAR and TIR fields respectively.Only mutations within Battenberg copy-number segments were retained (note: for male XY tumours with only 1 copy of the X chromosome copy number information is restricted to the pseudoautosomal region (PAR) and battenberg was not run on the Y chromosome).

Creating a canSAR interactome
The canSAR interactome features interactions where there are: (i) at least two publications with experimental evidence of binary interaction between the two proteins; (ii) 3D protein evidence of a complex; (iii) at least two reports that one protein is a substrate of the other; (iv) at least two publications reporting that one protein is the product of a gene under the direct regulatory control of the other.Each tumour-specific interactome was seeded using cancer driver proteins, retrieving interacting proteins that had supporting experimental evidence.To ensure that additional proteins are likely to function primarily through interaction with proteins in the network, we adopted the following strategy: starting with the input list of proteins we obtained all possible first neighbours.We then computed, for each new protein the proportion of its first neighbours in the original input list.To define proteins likely to function through the network, we calculated the probability of these occurring randomly by permuting the interactome 10,000 times.We corrected empiric P-values for multiple testing retaining only proteins having a FDR < 0.05.For each cancer type we minimised the network by retaining only proteins connected to more than one cancer protein, or whose only connection was to a cancer-specific protein.