Detection of Somatic Mutations in Exome Sequencing of Tumor-only Samples

Due to lack of normal samples in clinical diagnosis and to reduce costs, detection of small-scale mutations from tumor-only samples is required but remains relatively unexplored. We developed an algorithm (GATKcan) augmenting GATK with two statistics and machine learning to detect mutations in cancer. The averaged performance of GATKcan in ten experiments outperformed GATK in detecting mutations of randomly sampled 231 from 241 TCGA endometrial tumors (EC). In external validations, GATKcan outperformed GATK in TCGA breast cancer (BC), ovarian cancer (OC) and melanoma tumors, in terms of Matthews correlation coefficient (MCC) and precision, where MCC takes both sensitivity and specificity into account. Further, GATKcan reduced high fractions of false positives detected by GATK. In mutation detection of somatic variants, classified commonly by VarScan 2 and MuTect from the called variants in BC, OC and melanoma, ranked by adjusted MCC (adjusted precision) GATKcan was the top 1, followed by MuTect, VarScan 2 and GATK. Importantly, GATKcan enables detection of mutations when alternate alleles exist in normal samples. These results suggest that GATKcan trained by a cancer is able to detect mutations in future patients with the same type of cancer and is likely applicable to other cancers with similar mutations.

Advances in both next generation sequencing (NGS) technologies and computational tools have transformed biological and medical research over the past few years. In particular, calling somatic mutations from DNA sequencing data of tumor samples has become essential for characterizing cancer genomes and clinical genome typing 1,2 . Exome-sequencing (exome-seq) has enabled rapid detection of mutations that altered protein functions across hundreds of patients. However, identifying small-scale mutations consisting of somatic single nucleotide variations (SNVs) and insertions and deletions (indels) of exome-seq data is challenging, because sequencing coverage is non-uniform across target regions and among samples, the genomes of primary tumors are genetically heterogeneous, and so on 3 . Several algorithms have been developed to tackle these challenges, and they can be classified into two groups: (1) calling variants in tumor and normal samples separately, then identifying tumor-specific variants by a simple subtraction method, e.g., GATK 4 ; and (2) analyzing tumor-normal samples simultaneously by heuristic methods or statistical models, e.g., Strelka 5 , VarScan 2 3 and MuTect 6 . The algorithms in the second category can detect small-scale mutations with enhanced accuracy. In particular, MuTect focuses on detecting low-allele-frequency somatic mutations, which are often missed by existing methods, in exome-seq data requiring only a few supporting reads. VarScan 2 outperformed MuTect and other tools for variants with allele frequency >0.35, while MuTect outperformed the other five algorithms for identifying mutations with allele frequency ≤ 0.35, as shown by simulated data 7 . The algorithms in the second category can be applied only when tumor-normal paired samples are available. However, most of the exome-seq data from clinical diagnosis and formalin-fixed, paraffin-embedded samples are tumor-only. As artifacts of called variants generated either from next-generation sequencing machines (accuracy limited to one error in 100 or 1000 bases) or from variant-calling algorithms remain inevitable, and validation of variants is costly (US$5-10 per variant in Taiwan), developing an algorithm to accurately detect somatic mutations of exome-seq from tumor-only samples is of interest. Moreover, detecting mutations with high accuracy may provide clues to identify driver genes in cancer [8][9][10] , which may reveal the mechanism of carcinogenesis.
GATK is good at discovering all potential variants across diverse sequencing technologies and experimental designs. GATK trained by known polymorphic sites performs well in capturing true single nucleotide polymorphisms (SNPs), but may produce false positives in detecting somatic mutations in exome-seq of tumor-only samples (a pilot study of endometrial tumors in Taiwan; unpublished data). Here, we developed an algorithm based on GATK 4,11 and partial reported mutations of endometrial cancer (EC) in The Cancer Genome Atlas (TCGA) 2 , and named it GATK for cancer (GATKcan). Specifically, we incorporated two statistics to filter false mutations and detect true mutations from called variants, in addition to four statistics in hard filtering of GATK. Next, we trained the thresholds of the six statistics using partial randomly sampled TCGA endometrial tumors and machine learning. To evaluate the stability of GATKcan's performance, we repeated the training procedure ten times and compared the averaged performance of GATKcan in detecting mutations of the remaining 231 TCGA endometrial tumors to that of GATK. We further compare GATKcan to GATK, VarScan 2 and MuTect in predicting somatic variants, classified commonly by VarScan 2 and MuTect, from the called variants. Moreover, the four algorithms were compared using exome-seq data of 215 ovarian tumors, 503 breast cancer tumors and 342 samples in melanoma of TCGA [12][13][14] . Detecting small-scale mutations when alternative alleles in normal samples exist has been a bottleneck in the area. Because GATKcan does not require normal samples, this problem is circumvented using our approach.

Results
For this study, we incorporated exome-seq of EC tumors (~95% non-Asian) from TCGA 2 , which was part of 373 endometrial carcinomas consisting of genomic, transcriptomic and proteomic profiling 2 . This integrated characterization provided key molecular insights into tumor classification. We first applied HaplotypeCaller to yield variant calls, that HaplotypeCaller compared with a reference genome (hg19) to sift variants. Specifically, a total of 64,295 variants (base quality ≥ 10 and MQ ≥ 20) were called by GATK from exome-seq of 241 samples (focusing on ~800 cancer genes (~1GB per sample)); seven of the 248 files were damaged after downloading. The list of cancer genes studied is shown in Supplementary Table S1. Of these called variants, 64,183 were classified as point mutations and 112 were indels by GATK. Calling variants of each sample took GATK ~2 h using a multi-core cluster (2 Xeon 2.67 GHz CPUs and 24GB RAM). The details of the datasets are shown in Table 1.
GATK is very good at uncovering potential variants and filtered machine artifacts of DNA sequencing data. HaplotypeCaller of GATK is very useful for calling single nucleotide polymorphisms (SNPs) and indels of DNA sequencing data from diseased-only samples and paired samples. After applying HaplotypeCaller to call variants, we excluded known SNPs in the HapMap3 and the 1000 Genomes Project 15 to result in potential somatic variants in tumors. Note that we did not use dbSNP, because it contained some verified somatic mutations which were of interest to us. We then inputted these variants to hard filtering of GATK, using the following five statistics to identify somatic mutations; HaplotypeScore was excluded because it had been taken into account during the calling process.
• QualByDepth (QD): this is the quality of the variant divided by the unfiltered depth of non-reference samples.  • ReadPosRankSum Test (ReadPosRankSum): the z-approximation from the Mann-Whitney rank sum test 16 for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error.
Hard filtering classifies a called variant with QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < −12.5 or ReadPosRankSum < −8.0 (QD < 2.0, FS > 200.0 and ReadPosRankSum < −20.0) to be an artifact, and otherwise to be a SNP (an indel). To see whether all five statistics were useful for identifying somatic mutations, we conducted a pilot study by applying hard filtering with the aforementioned default thresholds 4,11 to 241 EC tumors from TCGA. The artifacts filtered by ReadPosRankSum were a subset of those filtered by FS, so we only incorporated QD, FS, MQ and MQRankSum into our method.
The proposed method-GATK for cancer (GATKcan). Because validation of somatic mutation is costly, we introduced two more statistics to filter false mutations and identify true mutations from called variants, in addition to the above four statistics of hard filtering. Further, we were able to assess Level 1 exome-seq data of 241 endometrial tumors from TCGA, thus we trained the thresholds of the six statistics using known mutations of partial TCGA EC, reported mutations in 19 TCGA cancer types and applied GATKcan to detect somatic mutations of the remaining EC tumors. Further, we also applied GATKcan to detect mutations in similar carcinoma (ovarian cancer and breast cancer) and a tumor of a different carcinoma (melanoma; squamous cell).
For a called variant, if the differences in the number of reads from 5′ and 3′ deviate from those of its corresponding REF a lot, then it is likely to be a false mutation. We assume that mutation sites for some tumors of a cancer are the same. For each called variant, we incorporated the Mann-Whitney statistic to test whether its differences in the number of reads from both strains have the same distribution as those of the corresponding REF across tumors; ideally the latter have median zero. If the hypothesis is rejected, then we predict this variant as a false mutation. Moreover, somatic mutation rate is rare (~2.8 × 10 −7 per base 17 ), thus the probability of an adjacent mutation existing within the neighborhood of a true mutation is very small. The intermutation distance (IMD) is defined as the distance from one mutation to the next one 18 . The IMDs calculated from cancer genes of 248 EC, 510 BC, 316 OC and 346 melanoma tumors are huge, and their median are ~17.5, 25.3, 2.7 and 3.4 Mb, respectively. Therefore, for a given variant if the genomic distance from its nearest reported mutation is small but >0 (=0), then we can classify it as a false (true) mutation. We also incorporated this distance (called dNM) and determined its threshold by a machine learning approach. Therefore, in addition to QD, FS, MQ and MQRankSum (QD and FS) for detecting point mutations (indels), we incorporated the following two statistics into our algorithm GATKcan.  1. Mann-Whitney test, and 2. dNM: a genomic distance of a called variant from its nearest true mutation.
Note that when a dNM equaled to zero, namely this variant coincided with a reported mutation in 19 TCGA cancer types, we predicted it as a true mutation and did not filter it further. Note that dNM and these trained thresholds enabled GATKcan to identify true mutations, in addition to filter false mutations.
Because this study was originally motivated by an analysis of ten EC tumors in Taiwan (results not shown), when we gained access to exome-seq data of TCGA endometrial tumors, it was reasonable to train the thresholds of the statistics in GATKcan using these EC tumors, instead of using the fixed cutoffs of hard filtering. To compare with training using Taiwanese ECs and to reduce training time, we used (~2788 called variants of) ten randomly selected TCGA EC tumors for training. Specifically, we trained the thresholds of the six statistics of GATKcan using the reported mutations and false mutations (false positives) of the called variants in randomly sampled 2, 3, 2 and 3 EC tumors of stage I-IV, respectively (namely adopting a stratified sampling scheme). The 64,295 called variants did not contain any reported indel, thus we used ~9% of 539 reported indels (from all reported mutations) and 112 false mutations (from the called variants) to train the four cutoffs of GATKcan for detection of indels; see Methods for further details of the training procedure. To assess stability of GATKcan's performances, we repeated the training procedure ten times and obtained ten sets of trained thresholds in Table 2.
For each cancer under study, we focused on a few hundreds of cancer genes, whose DNAs consisted of reference (non-mutated) sites and called variants. Excluding the reported SNPs and germline mutations in ExAC 19 , we defined a true mutation as a true positive (TP) and a false mutation in called variants as a true negative (TN). The true positive rate (TPR, namely sensitivity) is defined as the ratio of the identified TPs to the total number of TPs (reported by TCGA). Similarly, precision is the ratio of the identified TPs to all predicted mutations, false positive rate (FPR, namely − 1 specificity) is the ratio of the predicted mutations to all REF sites of cancer genes in all tumors under test, and conditional FPR (cFPR) is the ratio of predicted mutations to all TNs in the called variants. Note that cFPR is of interest in clinical diagnoses, in addition to TPR. Because there is only ~10% true mutations in the EC tumors, we further adopted Matthews correlation coefficient (MCC). MCC is a balanced measure which takes into account true and false positives (negatives), and

TPs TNs FPs FNs TPs FPs TPs FNs TNs FPs TNs FNs
where FPs and FNs denote false positives and false negatives, respectively. Next, we compared GATKcan to hard filtering (denoted by GATK henceforth) in detection of mutations of the remaining 231 tumors in ten repeats. The averaged TPR (cFPR) of GATK and GATKcan (trained by ten tumors) for detecting mutations of the 61,507 called variants are ~88.2% (~65.1%) and ~96.1% (~12.2%), respectively. Further, the averaged MCC (precision) of GATK and GATKcan are ~15% (~13%) and ~62% (46%), respectively. Let Mb denote megabase. The FPR of GATK and GATKcan (with the first set of cutoffs) was ~313 − Mb 1 and ~189 − Mb 1 , respectively which were computed over randomly selected 10% of ~. × 6 04 10 8 reference sites in cancer genes of 241 EC tumors. The detailed results of the ten repeats are shown in Supplementary Table S2.
To investigate the performance of GATKcan on more training samples, we further trained GATKcan using the called variants in ~18% (44 randomly selected) of 241 TCGA EC tumors. GATKcan trained by 44 tumors performed similarly to GATKcan trained by 10 tumors, which may be due to the six statistics captures the patterns of TPs and TNs well and training by ~2788 variants (of ten samples) been sufficient; the results are shown in Table 3 and the cutoffs in Table S2. Although GATK can identify small-scale mutations reasonably well for tumor-only samples, it is known to be limited to mutations with median to high allelic fractions 4,20 , where allelic fraction was defined as the ratio of the variant reads to the total reads at a given site. Thus, it is of interest to compare GATK and GATKcan in identifying mutations of EC by allelic fractions. This may provide insights into which cases both methods can be applied adequately. Figure 1 demonstrates that averaged over 10 repeats, GATKcan identifies true mutations well with allelic fractions ≥ 0.2, while GATK requires allelic fractions ≥ 0.3 to perform well. The TPR of GATKcan is close to those of GATK for variants with allelic fractions in (0.2, 1.0], but TPR of GATKcan is 22% and 70% higher than that of GATK for variants with allelic fraction in (0, 0.1] and (0.1, 0.2], respectively (Fig. 1a). For all variants with allelic factions in (0, 1.0], GATKcan is more powerful to detect artifacts, because of 6% to 79% lower cFPR than GATK (Fig. 1b).
Further, we compared GATK and GATKcan to VarScan 2 and MuTect; the latter two outperformed the remaining five tools in detecting small-scale mutations with allele frequency >0.35 and ≤0.35, respectively in In general, detection of low allele frequency mutations requires sufficient coverage, while exome-seq of ECs from TCGA had coverage of 20x only; therefore, it is of interest to compare the four algorithms.
In each of the 10 repeats, on average GATK and GATKcan detected mutations from 61,507 (~52,291) variants of 231 (197) tumors. Of these variants, VarScan 2 and MuTect (both default settings) classified ~2,102 (~1,741) as common somatic variants. Of these somatic variants, the high confidence mutations classified by VarScan 2 (MuTect with high-confidence mode) were treated as somatic mutations predicted by VarScan 2 (MuTect). We checked the predictions against the mutations reported by TCGA research network, and computed adjusted TPR and adjusted cFPR, which was the number of predicted true mutations (the predicted mutations) over the total true mutations (the total false mutations) of the somatic variants, in addition to adjusted precision (MCC). Of the ten experiments, the averaged adjusted TPR of GATK, GATKcan (both using tumor-only samples), VarScan 2 and MuTect were 98.8%, 98.6%, 99.6% and 94.3%, respectively, while their averaged adjusted cFPR were 81.0%, 5.7%, 60.5% and 37.9%, respectively. Ranked by adjusted MCC and adjusted precision, GATKcan was the top 1, followed by MuTect, VarScan 2 and GATK; some detailed results are shown in Table 4. The performances of the four algorithms on ~1,741 somatic variants (shown in Table 4) were similar to those on ~2,102 somatic variants; all results of the ten repeats are shown in Supplementary Table S2   pairs (~11 GB/sample) was analyzed. For GATKcan, we used the ten sets of thresholds trained by 10 and 44 randomly sampled TCGA EC tumors, respectively. Specifically, a total of 50,799 variants (base quality ≥ 10 and MQ ≥ 20) were called by GATK from exome-seq, focusing on 488 cancer genes queried from COSMIC, of the 503 BC tumors.
Application to TCGA ovarian cancer data. We further applied GATK, GATKcan and VarScan 2 to 215 ovarian cancer (OC) tumors after excluding 21 non-downloadable and damaged files 12 . The whole exome-seq of 215 tumor-normal pairs, whose size was ~14 GB (WUGSC) and ~27 GB (broad Institute) per sample, were analyzed. Specifically, a total of 27,167 variants (base quality ≥ 10 and MQ ≥ 20) were called by GATK from exome-seq, focusing on 432 cancer genes queried from COSMIC, of 215 OC tumors. Of the called variants, 19,182 and 7,984 (~29.4%) were point mutations and indels, respectively.
Application to TCGA cutaneous melanoma data. Finally, to see whether GATKcan can be applied to a cancer with different histology from EC (adenocarcinoma), we applied GATK, GATKcan and VarScan 2 to cutaneous melanoma (squamous cancer). In total, whole exome-seq of 342 tumors with 340 blood derived normal and 2 normal tissues 14 were analyzed (~9 GB/sample). For GATKcan, we used the 10 sets of thresholds trained by ten and 44 randomly sampled EC tumors of TCGA, respectively. Focusing on 498 cancer genes queried from COSMIC, GATK called a total of 33,053 variants (base quality ≥ 10 and MQ ≥ 20) from exome-seq of these tumor samples.

Discussion
Detection of mutations in exome-seq of tumor-only samples is useful for clinical diagnoses, as they can serve as a base for classifying cancer patients via molecular signatures and suggesting precision medicines. In addition to four statistics of GATK, GATKcan incorporated Mann-Whitney statistic and dNM to detect mutations, and we trained the cutoffs of GATKcan using reported mutations of ten randomly sampled TCGA endometrial tumors and reported mutations of 19 TCGA cancer types in each of ten experiments. The averaged performance of GATKcan was better than GATK in detecting mutations of the remaining 231 endometrial tumors. Further, GATKcan with such thresholds outperformed GATK in detecting mutations of 27,167 to 50,799 called variants in TCGA BC, OC and melanoma tumors in terms of MCC and precision, where MCC takes both sensitivity and specificity into account. Importantly, GATKcan reduced high fractions (about 23%, 52%, 65% and 60%) of false positives detected by GATK in the four cancers, whereas validation is costly (US$5-10 per variant in Taiwan). Ranked by adjusted MCC and adjusted precision, GATKcan was top 1, followed by MuTect, VarScan 2 and GATK in mutation detection of somatic variants classified commonly by VarScan 2 and MuTect from the called variants of BC, OC and melanoma. Note that GATKcan does not require normal samples, thus it reduces sequencing costs to half. Further, it enables detection of mutations when alternate alleles exist in normal samples, which remains a bottleneck in the area.
To investigate the performance of GATKcan on more training samples, we further trained GATKcan using called variants in 44 random samples of 241 TCGA EC in each of ten experiments. GATKcan trained by 44 tumors performed similarly to GATKcan trained by ten tumors, which may be because GATKcan captures patterns of true positives and negatives well and ~2788 called variants in ten tumors are sufficient for training. Tables 5-7 show that GATKcan trained by EC yields better prediction results in melanoma and OC than in BC, which suggests that the mutations of EC are more similar to those in melanoma and OC than to those in BC. Thus, if we know from cancer biology that cancer A is similar to a cancer to be detected, then training GATKcan by cancer A will be effective. These results suggest that GATKcan trained by a cancer is able to detect mutations in future patients with the same type of cancer and is likely applicable to other cancers whose mutations are similar. In addition to sequence-based clinical diagnoses, GATKcan is expected to have a large number of applications such as in the Precision Medicine Initiative for Oncology recently launched in the US.
In the future, the reported mutations in all types of cancers 21 of ICGC will be integrated with the mutations in TCGA as true mutations to check against. Then a pipeline based on GATKcan (for called variants in known mutation sites) and Variant Effect Predictor 22 (for variants in sites with unknown mutation status 23 ) will be built to detect somatic mutations of exome-seq of tumor-only samples. Finally, GATKcan may be improved further in identifying true mutations of tumors, e.g., by training GATKcan with partial reported mutations of all other cancers in TCGA and adding other statistics to capture characteristics of true mutations. We leave these topics for future research.

Methods
Training the cutoffs of the statistics in GATKcan. To train the six thresholds of GATKcan for detection of point mutations, the reported mutations of 241 endometrial tumors of TCGA were regarded as true mutations, and we randomly sampled 10 tumors (stratified by stage; 2, 3, 2 and 3 for stage I-IV), from which the called variants were inputted to the optimization algorithm particle swarm optimization 24 . Ten randomly sampled tumors were used for training, because CPU time was proportional to the number of variants inputted. The six cutoffs were optimized in the sense that they maximized the fitness function α(1-cFPR) + ( α − 1 )TPR, with α varying from 0 (0.1) to 1 using 10-fold. For a given α, the training set consisted of the true mutations and artifacts from 90% of the ten endometrial tumors, and the called variants of the remaining tumor made up the test set of the training (also called cross-validation). This step was iterated through each tumor being set as the test set, then we selected an α value to obtain the six thresholds for GATKcan. In each training, we used 4,000 seed points and 500 generations to run PSO, and partitioned the ranges of dNM, FS, MQ, MQRankSum, QD and P value of Mann-Whitney test into 1000, 60, 50, 100, 20 and 6 segments, respectively. In the training of dNM, we also used the reported mutations of 19 cancer types of TCGA; some details are in Supplementary Note 1. Similarly, to train the four cutoffs of GATKcan for detection of indels, we randomly sampled 50 of all 539 reported indels and 10 artifacts from the 64,295 called variants, respectively, because the called variants contained no true indels; the remaining 489 reported indels and 102 false indels constituted the test set. Then, the rest training procedure followed that of the six cutoffs of GATKcan. PSO is a well-known optimization method; for details of the method and computer complexity, please see Section 2.3.5 of Chuang et al. (2008) 25 . On average, each training procedure of GATKcan took ~15 h (~5 h) for the six (four) cutoffs, and was conducted by a high memory computing cluster (256GB RAM, limited to 11 cores and each with Xeon CPU 2.5 GHz).