Identification of new putative driver mutations and predictors of disease evolution in chronic lymphocytic leukemia

Dear Editor, The analysis of hundreds of chronic lymphocytic leukemia (CLL) exomes has shed new light on the heterogeneous genomic background characterizing this disease. At the same time, the increased availability of exome-sequencing data comes along with a big bottleneck in the interpretation of its results, which is related to the remarkable heterogeneity in mutation detection between different bioinformatic protocols. Differences in clonality, purity, sequencing coverage, and quality constitue difficulties for most variant callers. The methods with the highest sensitivity are frequently accompanied by lower precision, leading to remarkable differences in mutation detection. Therefore, we hypothesize that numerous variants in large sequencing projects have passed unnoticed. Here, we report the results of a complementary analysis performed on the International Cancer Genome Consortium (ICGC) CLL cohort. The final analysis included 49 monoclonal B cell lymphocytosis and 390 treatmentnaive CLL samples. Mutation detection was performed with two different methods: VarsCan2, which uses a heuristic/statistical method for variant detection; and Platypus, which implements a Bayesian approach and local realignment of reads for indel and complex mutation detection. Variant quality was recalibrated using a logistic model, and drivers were detected by integrating the results of methods based on mutation frequency (MuSiC2), functional impact (OncodriveFM), colocalization (OncodriveClust and Mutation3D), and pathogenicity prediction (VEST and CHASM) (Supplementary Methods). Cox regression was used for survival analysis. Assumption of proportional hazards was checked with Schoenfeld’s method. An unadjusted model was used to test the association of each mutated gene/pathway with time to treatment and overall survival. Similarly, we created an adjusted model which included variables associated with outcomes of interest at a nomial p-value < 0.2 (IGHV status, sex, and stage at diagnosis for time to treatment analysis; and IGHV status, age and stage at diagnosis for overall survival analysis). In the case of pathways analysis, the total number of mutations in genes belonging to each pathway were used as input. P-values were adjusted for multiple testing using the Benjamini–Hochberg (BH) method. A total of 28,350 mutations were detected in 439 treatment-naive patient samples, of which 12,057 affected protein-coding regions (Supplementary Table 1). There were 8,965 non-silent and 3,095 silent mutations. The large majority of the non-silent mutations were missense (7,558 events). Point mutations were the most frequent (21,180), followed by short deletions (3,240) and insertions (2,041). There were 1,888 multi-nucleotide mutations (involving 2 or more consecutive nucleotides) (Supplementary Fig. 1). Sixty-six genes were detected as putative drivers (Fig. 1, Supplementary Table 2, Supplementary Tables 3–8), of which thirty-two had been previously described by Puente et al. Among the novel ones, the most frequently mutated were DTX1, LPHN3, LRP1B, LTB, and WDFY3. LPHN2 and SI were mutated in six patients; BIRC6, DOCK1, MLL3, PCDH15, PTPN13, PTPRM, RELN, and TFEB were mutated in five patients and the remaining putative drivers were mutated in four different cases.

Exome-seq was performed by Puente et al. (2015) as described in their original paper. Briefly, 3 μg of genomic DNA were used for paired-end sequencing library construction, followed by enrichment in exomic sequences using the SureSelect Human All Exon 50Mb kit (Agilent Technologies). Next, DNA was pulled down using magnetic beads with streptavidin, followed by 18 cycles of amplification. Sequencing was performed on an Illumina GAIIx or on a HiSeq2000 sequencer (2x76bp). Reads were previously aligned to the reference genome (GRCh37.75) using bwa [1]. We performed duplicate read removal, sorting and indexing using samtools [2]. Base quality score recalibration was made with BamUtil [3] using a logistic regression model. Samtools parameters were the following: " C=50 "; " d=250 ", " Q=13 ", " R ", " O ", " e=10 ", " F=0.002 ", " h=100 ", " m=1 ", " o=20 ". The remaining parameters were run as in default mode. Next, VarScan2 [4] was run on paired-end mode with default parameters and the " strand-filter " option. For somatic variant detection, variants were filtered according to the following specifications: Fisher's p-value for variant frequency distribution between tumor and normal samples below 0.05, minimum coverage of 10x in tumor and control samples and a minimum mutation VAF of 10%. We discarded those mutations with a VAF in the control above 5%, more than 5 absolute reads covering the variant in the control and less than 5 variant reads in the tumoral sample (note that a small contamination of the control sample by CLL leukocytes is expected).
Variants from the two different callers were normalized and fused into single vcf files using CombineVariants functions implemented in the Genome Analysis Toolkit (GATK) [6]. Mutations were annotated using the Variant Effect Predictor (VEP) [7] and converted to MAF format [8]. Each mutation was annotated to dbSNP and ExAC databases. Mutation plots were produced using maftools [9].

Driver detection tools 2.1 MuSiC2 Analysis
We ran Music2 [10] analysis on coding and non-coding regions covered by Agilent Exome SureSelect All Exon v4 kits with at least x10 depth in tumor and control samples. The GC-adjusted Convolution Test test was used as a measure of significance. Genes mutated in at least 4 patients with a Benjamini-Hochberg (BH)-adjusted p-value <0.1 were selected as potential new drivers. A similar procedure was used to analyzed intron mutation enrichment. Exome-seq also covers intronic regions in the neighbourhood of the exons, and mutations at these levels may be functional. To analyze gene enrichment in intronic mutations, we created background mutation statistics including all intronic regions that were covered with at a 10x depth in both tumor and normal samples.

OncodriveFM Analysis
We ran OncodriveFM [11] analysis with default parameters. Putative drivers were considered if they were mutated in at least 4 patients and had BH-adjusted p-values <0.1.

OncodriveClust
OncodriveClust [12] analysis for detection of mutation "hotspots" was run as implemented in maftools with default parameters. We set a mutation threshold of 4 events and a BH-adjusted p-value <0.1 in order to consider new drivers.

mutation3D analysis
Missense mutations were analyzed using mutation3D [13] to find genes whose missense mutations co-occur on specific tridimensional domains of the protein. The algorithm was run with the following parameters: minimum missense mutation number of 3, minimum number of mutations per cluster of 2, maximum intracluster distance of 20Å and MPQS >1.1. A BH-adjusted p-value threshold of 0.1 was selected as significance threshold.

CRAVAT analysis
We analyzed non-synonymous mutations with the CRAVAT pipeline [14], which included both the Cancer-Specific High-throughput Annotation of Somatic Mutations (CHASM) [15] and the Variant Effect Scoring Tool (VEST) [16] methods. Both of them are based on machine-learning (ML) classification of tumorigenic mutations based on a known set of driver mutations. Significant genes were selected as those with a BH-adjusted composite p-value <0.05.
Silent mutations at putative driver genes were analyzed with Human Splicing Finder [17] in search for donor or acceptor cryptic splice sites. Putative drives affecting less than 4 patients (circa 1% of the whole sample) and with more than 1 splice-neutral silent mutation were not considered for this analysis unless otherwise specified in the text. Genes were assessed for expression in lymphoid tissues using the Human Protein Atlas [18], and those without expression were discarded, except in the case of known human cancer drivers.

Low frequency putative drivers
CHASM and VEST methods were used to prioritize candidate driver mutations according to their predicted functional impact. Low frequency mutated genes present in less than 4 patients and with at least 2 linkely functional mutations (BH-adjusted p-value <0.25) were considered likely to be drivers if: 1) they represented >50% of the detected mutations in that gene; 2) the gene is expressed in lymphoid tissues and 3) had no synonymous mutations in the same gene, except for the case of BCR and PTPN11 , which are known CLL drivers.

Pathways Analysis
PathScore analysis [19] was run with default mutation background frequency. To limit potential false findings due to false mutations we discarded all those that had a MAF above 0.01 in the Non-Finish European Population of ExAc. Significantly mutated pathways were labeled as those with Bonferroni adjusted p-value <0.1.

Mutation visual inspection and validation
Mutations were manually visualized using the Integrative Genomics Viewer [20]. For validation purposes, we used a subset of 88 whole genome sequencing (WGS) data that had matched exome sequencing data. We visually analyzed the subset of mutations falling at candidate driver genes (those that obtained BH q-values <0.25 in driver detection method, including intronic mutations, as well as those mutations in the list of new putative low-frequency drivers).

Code availability
The software used for this analysis is available in public repositories. Interested readers can contact the first author of this manuscript in order to obtain any further information.