Multigene testing of cancer susceptibility is widely applied in clinical practice to attempt to predict the risk of developing cancer. Estimating the effect of DNA variants in these large gene panels is a major clinical challenge. Because it is impractical to functionally classify every variant identified, several in silico tools have been developed to predict the pathogenicity of single-nucleotide variants (SNVs). Many of these tools focus on protein-coding regions of the genome (summarized in ref. 1). However, the number of noncoding variants far outstrips coding variants in the human genome,2,3 and approximately 88% of trait/disease-associated SNVs in collective genome-wide association studies are in intronic or intergenic regions.4 The combined annotation-dependent depletion (CADD) method is designed to predict the pathogenicity of SNVs at any location in the genome. Kircher et al.5 described the receiver-operating characteristics (ROC) curves of CADD scores for curated, pathogenic mutations defined by the ClinVar database and showed that a CADD score has a greater area under the curve (AUC) than GerpS, PhCons, and phyloP scores for a set of defined variants. They also examined two enhancers and one promoter in which saturation mutagenesis had been previously performed and showed that CADD had the highest Spearman rank correlation between the predictive score and the observed changes in protein expression (ref. 5, Supplementary Figure S17 of that reference), stating that CADD provides, “in principle, a genome-wide, data-rich, functionally generic and organismally relevant estimate of variant effect.”5 Based on claims of genome-wide relevant estimates of variant effects, we sought to test the clinical validity of CADD scores by comparing their distributions in common and rare variants identified in 624 patients tested in our large cancer-risk gene panel, with specific attention on nonexonic variants that did not alter protein coding or canonical splice sites. We evaluated the rare variants with the highest CADD scores in these noncoding regions where CADD score distributions were significantly different than expected. We also explored the hypothetical sensitivity and specificity cutoffs that would be required to achieve meaningful clinical positive or negative predictive values.

Materials and Methods


We evaluated a total of 624 unique, consecutively submitted DNA samples clinically requested for germ-line cancer susceptibility testing using the University of Washington (UW) BROCA assay6 between June 2014 and February 2015. All variant data were de-identified prior to release to the investigators in this study. De-identified minimal cancer and family history phenotypes were retained with the data to aid in the interpretation of potential variant significance. This project was deemed nonhuman-subjects research consistent with ongoing quality improvement and assurance activities as a component of clinical testing.

Targeted deep sequencing by BROCA

Library construction, gene capture, and massively parallel sequencing were performed using clinical ColoSeq and BROCA assays as previously described6 and detailed online. Briefly, DNA was sonicated, purified, and subjected to end repair, A-tailing, and ligation with Illumina paired-end adapters. The adapter-ligated library was amplified, and individual paired-end libraries were hybridized to a custom design of complementary RNA biotinylated oligonucleotides covering all exons and nonrepetitive intronic regions of 49 genes (Supplementary Table S3 online). The library–bait hybrids were purified and washed. Each library was amplified by polymerase chain reaction using primers with a unique index. After amplification, libraries were quantified and equimolar concentrations were pooled, denatured, and cluster-amplified on a single lane of an Illumina flow cell. Sequencing was performed with 2 × 101-bp paired-end reads and a 7-bp index on a HiSeq 2000 (Illumina, San Diego, CA). Mean sequencing depth was more than 100 for all samples.

We used a custom targeted sequencing bioinformatics pipeline.7 Reads were mapped to human reference genome build 19 (hg19, GRCh37), and alignment was performed using the Burrows–Wheeler Aligner (BWA) and SAMtools. SNV calling was performed with GATK and VarScan. The entire pipeline was validated and shown to have more than 99.9% accuracy for single-nucleotide changes.7

Variant curation

Variant evaluation was limited to probable germ-line mutations, defined as SNVs with variant read fraction >30%. For this project, rare variants were defined as those identified at a minor allele frequency of less than 1% by the 1000 Genomes Project (1KG).8 All variants with computed CADD scores were included in the analysis.
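The curation rules above can be sketched in a few lines of Python. This is a minimal illustration, not the clinical pipeline: the `Variant` record, its field names, and the `curate` helper are hypothetical, and treating a variant that is absent from 1KG as rare is an assumption that the text does not state.

```python
from dataclasses import dataclass
from typing import Optional

MIN_VARIANT_READ_FRACTION = 0.30  # germ-line threshold stated in the text
RARE_MAF_CUTOFF = 0.01            # 1KG minor allele frequency cutoff stated in the text


@dataclass
class Variant:
    # Hypothetical record layout; the field names are illustrative only.
    variant_id: str
    variant_read_fraction: float   # fraction of reads supporting the variant
    maf_1kg: Optional[float]       # 1KG minor allele frequency, None if absent from 1KG
    cadd_scaled: Optional[float]   # scaled CADD score, None if no score was computed


def is_probable_germline(v: Variant) -> bool:
    """Probable germ-line SNV: variant read fraction above 30%."""
    return v.variant_read_fraction > MIN_VARIANT_READ_FRACTION


def frequency_class(v: Variant) -> str:
    """Label a variant as rare (MAF < 1%) or common."""
    # Assumption: a variant with no 1KG record is counted as rare.
    if v.maf_1kg is None or v.maf_1kg < RARE_MAF_CUTOFF:
        return "rare"
    return "common"


def curate(variants):
    """Keep probable germ-line variants that have a CADD score, split by frequency."""
    groups = {"rare": [], "common": []}
    for v in variants:
        if is_probable_germline(v) and v.cadd_scaled is not None:
            groups[frequency_class(v)].append(v)
    return groups
```

Calling `curate` on a list of `Variant` records returns a dictionary with a `"rare"` and a `"common"` list, which correspond to the two patient-derived groups compared in the statistical analysis.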

Statistical analysis

Distribution of variant-scaled CADD scores was compared for three variant types: rare variants in patient samples, common variants in patient samples, and all possible variants as defined by Kircher et al.5 in their Supplementary Table S8. We further grouped variants by genomic region to determine whether CADD performed effectively in different genomic contexts. ANNOVAR9 was used to define genomic regions as downstream, intronic, intergenic, nonsynonymous, splice site, synonymous, stopgain, upstream, 3ʹ untranslated region (UTR), or 5ʹ UTR. We compared the sample medians using the Wilcoxon rank-sum test to evaluate the significance of differences. Because we were testing 30 comparisons, we chose a P value of 0.001 as our cutoff for significance. We calculated ratios between the proportion of variants in a group with a given CADD score and plotted these to visually evaluate differences in CADD score for different groups. Statistical tests were performed using built-in R functions.10
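For readers who want to see what the rank-sum comparison computes, the following Python sketch implements a two-sided Wilcoxon rank-sum test with a tie-corrected normal approximation. It is an illustration only: the analysis in this study used built-in R functions, and this sketch applies no continuity correction and no exact small-sample method, so its P values may differ slightly from those reported by R. The function names are hypothetical. For orientation, a Bonferroni correction of 0.05 across 30 comparisons would give a threshold of about 0.0017, so the cutoff of 0.001 is slightly stricter.

```python
import math
from collections import Counter

ALPHA = 0.001  # significance cutoff stated in the text


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are zero-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (U statistic for x, two-sided P value) by normal approximation."""
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    n = n1 + n2
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical, no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_value


def is_significant(p_value, alpha=ALPHA):
    return p_value < alpha
```

In use, `x` and `y` would be the scaled CADD scores of two variant groups within one genomic region, for example rare intronic and common intronic variants.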

Evaluation of validity of CADD scores for noncoding variants

To evaluate possible causative variants, we used several criteria to narrow the list of variants of interest (VOI) for further analysis. First, rare variants had to be in genes broadly consistent with the patient phenotype. For example, we excluded rare variants in known breast cancer–risk genes if they were seen only in patients with a personal history of colorectal cancer (or patients with a family history excluding breast cancer for those without a personal cancer history). For SNVs that were present in multiple samples, variants were considered if the majority of those patients had cancer phenotypes consistent with the gene mutated. Second, if a pathogenic mutation consistent with patient phenotype was present, then other rare variants for that patient were considered unlikely to be causative and were excluded. Finally, the variant base was compared with the reference base in up to 100 vertebrate species (the default of the UCSC genome browser11). If the variant base was present as the reference base in any of the species for which data were available, then the variant was excluded. Remaining variants after this step were considered VOI.
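The three exclusion criteria can be read as a filter cascade applied in order. The Python sketch below is a simplified illustration under stated assumptions: the `RareVariant` record and its fields are hypothetical, phenotype consistency is reduced to a single precomputed flag, and the majority rule for SNVs seen in multiple samples is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RareVariant:
    # Hypothetical record; each field summarizes one criterion from the text.
    gene_matches_phenotype: bool            # gene broadly consistent with patient phenotype
    patient_has_known_pathogenic: bool      # a pathogenic mutation already explains the phenotype
    variant_base: str                       # the variant allele, e.g. "T"
    reference_bases_in_other_species: List[str] = field(default_factory=list)


def exclusion_reason(v: RareVariant) -> Optional[str]:
    """Return the first criterion that excludes the variant, or None if it is a VOI."""
    if not v.gene_matches_phenotype:
        return "gene not consistent with patient phenotype"
    if v.patient_has_known_pathogenic:
        return "patient carries another pathogenic mutation"
    if v.variant_base in v.reference_bases_in_other_species:
        return "variant base is the reference base in another species"
    return None


def select_voi(variants):
    """Variants that survive all three exclusion steps."""
    return [v for v in variants if exclusion_reason(v) is None]
```

Returning the reason for exclusion, rather than a bare true or false, makes it straightforward to tabulate how many variants each step removes.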

For CADD scores to be clinically useful in noncoding regions, we first expected the distribution of CADD scores for rare variants to be different from the null distribution of CADD scores, particularly for variants with high CADD scores. We evaluated rare variants with the highest 10% of CADD scores for intronic variants to determine whether these variants might possibly explain patient disease phenotypes. We used the pROC program in R12 to create a ROC curve for the results of this analysis. The PhyloP score13 for pairwise alignment of 100 vertebrate species (the default of the UCSC genome browser) was also calculated for the 10% of intronic variants with the highest CADD scores (PhyloP, Supplementary Table S3 online). Each VOI was evaluated along with the 50 bases preceding and the 50 bases following the variant base using the Berkeley Drosophila Genome Project,14 Human Splicing Finder 3.0,15 and NetGene16 splice-site prediction algorithms to predict changes in splice sites along the transcribed strand (Supplementary Table S4 online).
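The AUC of a ROC curve has a simple interpretation that can be computed directly: it is the probability that a randomly chosen positive (here, a VOI) has a higher score than a randomly chosen negative, with ties counted as one half. The Python sketch below computes the AUC this way. It is an illustration of the quantity only, not the pROC implementation used in this study, and it does not compute confidence intervals; the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    scores: numeric predictor values (for example, scaled CADD scores)
    labels: truthy for positives (for example, VOI), falsy for negatives
    """
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to a predictor that ranks positives and negatives no better than chance, and an AUC of 1.0 to perfect separation.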

Modeling of the sensitivity and specificity needed to achieve clinically acceptable identification of possible pathogenic variants

For an in silico predictive tool to be clinically useful, it must either rule out benign variants with high certainty or identify pathogenic variants with modest certainty, so as to minimize the follow-up functional or cosegregation studies necessary to definitively classify variants. For our practice, we determined that an optimal rule-out predictor would have at least 95% negative predictive value (NPV) consistent with accepted definitions of what constitutes a likely benign variant.17,18 Because of the extensive work that is required to confirm a pathogenic variant, we desire at least a 50% positive predictive value (PPV) to minimize unnecessary follow-up of unknown variants.

For given sensitivity and specificity, PPV and NPV vary with the prevalence of pathogenic mutations within the set of variants evaluated. We calculated the sensitivity and specificity required to achieve a minimum PPV of 50% and a minimum NPV of 95% using approximate representative mutation prevalence estimates: 50% for evaluation of coding elements in a single gene, 10% for evaluation of coding and noncoding elements of a single gene, 5% for evaluation of coding elements in panel testing, 0.5% for coding and noncoding elements in panel testing, 0.05% for exome testing, and 0.0001% for genome testing.3,19,20,21,22,23,24
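The dependence of PPV and NPV on prevalence follows from the standard definitions, and the thresholds can be solved for directly. The Python sketch below shows one way to do this. The function names are hypothetical, and fixing the other parameter at 1.0 (a perfect predictor on that axis) to obtain a lower bound is an assumption made for illustration, not necessarily the procedure used to build Table 1.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from test characteristics and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from test characteristics and prevalence."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


def min_specificity_for_ppv(target_ppv, prevalence, sensitivity=1.0):
    """Smallest specificity giving PPV >= target at the given sensitivity."""
    # Rearranging PPV >= t gives:
    #   (1 - spec) <= sens * prev * (1 - t) / (t * (1 - prev))
    max_false_pos_rate = (
        sensitivity * prevalence * (1.0 - target_ppv)
        / (target_ppv * (1.0 - prevalence))
    )
    return max(0.0, 1.0 - max_false_pos_rate)


def min_sensitivity_for_npv(target_npv, prevalence, specificity=1.0):
    """Smallest sensitivity giving NPV >= target at the given specificity."""
    # Rearranging NPV >= t gives:
    #   (1 - sens) <= spec * (1 - prev) * (1 - t) / (t * prev)
    max_false_neg_rate = (
        specificity * (1.0 - prevalence) * (1.0 - target_npv)
        / (target_npv * prevalence)
    )
    return max(0.0, 1.0 - max_false_neg_rate)
```

With a prevalence of 0.5% and perfect sensitivity, `min_specificity_for_ppv(0.5, 0.005)` gives about 0.995, in line with the 99.5% specificity reported in the Results. With a prevalence of 50% and perfect specificity, `min_sensitivity_for_npv(0.95, 0.5)` gives about 0.947, close to the 94.8% quoted in the Discussion.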


Comparison of CADD score distribution between rare, common, and all possible variants

We identified 12,391 unique SNVs with computed scaled CADD scores in the 624 patient samples. The specific numbers of variants in downstream, intergenic, intronic, nonsynonymous, splicing, synonymous, upstream, 3ʹ UTR, and 5ʹ UTR regions are shown in Supplementary Table S1 online.

We compared rare, common, and all possible variants in each category to each other using the Wilcoxon rank-sum test. There were statistically significant differences between common and all possible variants for intergenic, nonsynonymous, and upstream SNVs (Supplementary Table S2 online). As shown in Supplementary Figure S1 online, when the proportion of common variants at any given CADD score was graphed over the proportion of all possible variants at that score for these significant regions, nonsynonymous variants with CADD scores less than 10 were significantly overrepresented in the common variants (P = 4.8 × 10−14). This is consistent with the hypothesis that common nonsynonymous variants have been subject to evolutionary selection and are thus enriched for benign variants. Surprisingly, there was an upward trend from the lowest to the highest CADD scores for the intergenic (P = 2.6 × 10−5) and upstream variants (P = 1.9 × 10−12). This suggests that high CADD score variants in these regions are more likely to occur in our patient samples than would be expected by chance.

There were statistically significant differences between rare and all possible variants for downstream, intergenic, intronic, upstream, and 5ʹ UTR SNVs (Supplementary Table S2 online). As shown in Supplementary Figure S2 online, when the proportion of rare variants was graphed over the proportion of all possible variants at each possible CADD score for these significant regions, rare downstream variants with CADD scores more than 15 were overrepresented compared with all possible variants (P = 6 × 10−8), as were intronic variants with CADD scores more than 25 (P = 2.2 × 10−16) and 5ʹ UTR variants with CADD scores more than 17 (P = 2.2 × 10−16). Therefore, rare variants in these regions were more likely to have high CADD scores than would be expected by chance. Rare intergenic variants with CADD scores less than 4 were underrepresented (P = 1.3 × 10−11), as were rare upstream variants with CADD scores less than 5 (P = 2.2 × 10−16) and rare 5ʹ UTR variants with CADD scores less than 6 (P = 2.2 × 10−16). This means that rare variants with low CADD scores are statistically less frequent in our patient population than would be expected by chance. These findings are consistent with the hypothesis that rare variants have not been subjected to extensive selective pressure and are more likely to be functionally deleterious.

Comparison of rare and common variants showed statistically significant differences for intronic and nonsynonymous variants (Supplementary Table S2 online). Graphing the proportion of common variants over the proportion of rare variants at each possible CADD score for these significant regions ( Figure 1 ) revealed that SNVs with higher CADD scores were proportionally underrepresented for common intronic variants when compared with rare variants (P = 5 × 10−6). Common SNVs with CADD scores less than 6 were proportionally overrepresented for nonsynonymous variants when compared with rare variants (P = 5 × 10−11). These findings are consistent with the hypothesis that rare variants from these regions are more likely to be deleterious than common ones and are thus more likely to have high CADD scores.

Figure 1

Ratio of common to rare variants with significant differences by Wilcoxon rank-sum test. The proportion of common variants at any given combined annotation-dependent depletion (CADD) score was compared with that of rare variants at the same CADD score (rounded to the nearest 1). Only genomic regions with significant differences by Wilcoxon rank-sum test were evaluated graphically.
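The quantity plotted in this type of figure can be sketched as follows in Python: scores are rounded to the nearest integer, the proportion of each group falling at each integer score is computed, and the two proportions are divided. The function names are hypothetical. Rounding halves upward and skipping scores at which the denominator group has no variants are assumptions, because the text does not specify how these cases were handled.

```python
import math
from collections import Counter


def round_half_up(score):
    """Round to the nearest integer, with halves rounded upward (assumption)."""
    return math.floor(score + 0.5)


def proportions_by_score(scores):
    """Proportion of a group's variants at each rounded CADD score."""
    counts = Counter(round_half_up(s) for s in scores)
    total = len(scores)
    return {score: count / total for score, count in counts.items()}


def proportion_ratio(numerator_scores, denominator_scores):
    """Ratio of proportions at each rounded score present in the denominator group."""
    num = proportions_by_score(numerator_scores)
    den = proportions_by_score(denominator_scores)
    # Scores absent from the denominator group are skipped to avoid division by zero.
    return {score: num.get(score, 0.0) / den[score] for score in sorted(den)}
```

For the comparison described in this caption, the common-variant scores would be passed as `numerator_scores` and the rare-variant scores as `denominator_scores`. A ratio above 1 at a given score indicates that common variants are proportionally overrepresented at that score.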

Evaluation of validity of CADD scores for noncoding variants

If CADD scores are to have clinical validity for the identification of novel pathogenic variants in noncoding regions, then the subset of rare variants with the highest CADD scores in genomic regions with significantly different CADD scores between rare and common variants should be enriched for pathogenic variants. Because introns were the only noncoding region that had statistically different CADD scores between rare and common variants, we specifically looked at the 10% of rare intronic variants with the highest CADD scores to evaluate whether these mutations could possibly cause disease in our patient population. Two hundred eighty-six of 690 variants evaluated were in genes not known to cause the type of cancer found in the patient or patient’s family and were therefore excluded. Thirty-eight of the 404 remaining rare variants were seen in patients with other known pathogenic variants and were therefore considered unlikely to cause the phenotype in those patients. Three hundred five of the remaining 366 variants were present as the conserved base in one or more of the vertebrate species evaluated by MULTIZ alignment of up to 100 vertebrate species, which was used as evidence that there was no functional consequence to the variant. This left 61 VOI.

There was no significant enrichment of VOI as the CADD score cutoff increased. Forty-two of 517 variants with CADD scores between 10.51 and 14.99 (8.1%) were VOI. Sixteen of 145 variants with CADD scores between 15 and 19.99 (11%) were VOI, and for variants with CADD scores of 20 or more (28 total), there were 3 VOI (10.7%). We plotted the ROC curve ( Figure 2a ) of VOI over all rare variants in the CADD score range examined to determine whether there was an optimal cutoff at which CADD score identified the most VOI (highest sensitivity) with the highest specificity. The AUC was 0.591 (95% confidence interval: 0.516–0.667), and there was no CADD cutoff at which sensitivity and specificity were optimized. The PPV of CADD score to identify VOI at a score of 10.51 or more was 8.8%.
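The proportions in this paragraph can be reproduced from the reported counts. The short Python sketch below recomputes the VOI rate in each CADD score range and the overall PPV at a cutoff of 10.51; the counts are taken from the text, and the variable and function names are illustrative.

```python
# (CADD score range, number of VOI, number of rare intronic variants), from the text
BINS = [
    ("10.51-14.99", 42, 517),
    ("15-19.99", 16, 145),
    (">=20", 3, 28),
]


def percent(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


# VOI rate within each score range
voi_rate_by_bin = {label: percent(voi, total) for label, voi, total in BINS}

# Overall PPV for identifying VOI at a CADD score of 10.51 or more
total_voi = sum(voi for _, voi, _ in BINS)
total_variants = sum(total for _, _, total in BINS)
overall_ppv = percent(total_voi, total_variants)

for label, rate in voi_rate_by_bin.items():
    print(f"CADD {label}: {rate}% VOI")
print(f"Overall: {total_voi} of {total_variants} = {overall_ppv}%")
```

The bins sum to 61 VOI among 690 variants, so the overall figure is the same 8.8% reported as the PPV, and the per-bin rates show no steady increase with the score range.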

Figure 2

Receiver-operating characteristics (ROC) curve for non-coding variants. (a) ROC curve for combined annotation-dependent depletion (CADD) score (black) and 100 vertebrate PhyloP score (grey) for variants of interest in the top 10% of rare intronic variants. (b) ROC curve for CADD score for noncoding variants from Kircher and colleagues’5 source data. Area under the curve (95% confidence interval) calculated using pROC in R.

In an effort to identify pathogenic mutations, we used three splice-site prediction algorithms (NNSplice from the Berkeley Drosophila Genome Project, Human Splicing Finder 3.0 (HSF3.0), and NetGene) to evaluate the possibility that splice-site changes on the transcribed strand caused by VOI introduced alternative splice sites. No variants were predicted to introduce novel splice sites by all three prediction algorithms tested (Supplementary Table S4 online). NNSplice and HSF3.0 splice predictions were consistent for five variants, and three of these showed a 15% or greater increase in splicing score for both predictions. However, the frequency of these variants in our overall clinical sample set was not significantly higher than the reported variant frequency in the 1KG data set, and the clinical histories of other individuals with these variants evaluated outside this study were not consistent with the gene of interest, suggesting that these variants are unlikely to substantially alter disease risk.

No definitively pathogenic variants were present in our cohort, and thus we were unable to robustly measure the false-negative rate (FNR). For this reason, we also computed the ROC of noncoding variants from the data from Figure 3 of the study by Kircher et al.5 whose data set contained well-characterized pathogenic variants ( Figure 2b ). The AUC of this ROC curve was 0.663 (95% CI: 0.607–0.720), which was not significantly better than the one from our data.

We further evaluated CADD scores for pathogenic intronic variants using 47 deep intronic variants reported to be deleterious (Supplementary Table S5 online).25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40 The CADD scores for these variants ranged from 0.356 to 19.05, with a median of 3.498. Only 5 of the 47 variants (10.6%) had CADD scores above 10.51, the cutoff at which we evaluated variants for possible pathogenicity. More than 50% of all rare intronic variants in our data set had CADD scores greater than the median for known pathogenic variants (3.498), consistent with a low AUC.

To evaluate the usefulness of CADD scores relative to other in silico predictive algorithms for intronic regions, we compared the performance of PhyloP scores from 100 vertebrate species to that of CADD scores for the identification of VOI. The AUC for the ROC curve of the PhyloP analysis was 0.666 (95% CI: 0.593–0.739), which is similar to that seen for CADD scores ( Figure 2a ). However, because our definition of VOI included evaluation of the conservation of the base between species, we cannot exclude ascertainment bias.

Sensitivity and specificity needed to achieve clinically acceptable identification of possible pathogenic variants

For maximal clinical validity, a predictive tool should identify benign variants with confidence while minimizing the number of variants that require further workup. We calculated the minimum sensitivity and/or specificity needed for a predictor to achieve a PPV of at least 50% and an NPV of at least 95% ( Table 1 ). If pathogenic mutations represent 0.5% of all rare mutations (our estimate for the prevalence of pathogenic mutations in coding and noncoding regions of genes in panel testing), then the required minimum specificity for an in silico tool to identify pathogenic mutations at this level of confidence is 99.5%, although the sensitivity of the tool is less critical at that mutation prevalence ( Table 1 ). The sensitivity required for an in silico prediction tool to meet clinically meaningful NPV requirements increases as the prevalence of pathogenic mutations increases, whereas the specificity becomes less critical for overall performance. However, if the number of genes and rare variants tested increases, then the specificity required to achieve a clinically meaningful PPV increases and the sensitivity becomes less critical for overall performance.

Table 1 Minimum sensitivity and specificity of an in silico predictive tool needed for clinical validity


The comparison of the CADD scores of common and rare variants in different genomic areas with the CADD scores of all possible variants in these areas as defined by Kircher et al.5 suggested that CADD scores may have modest predictive power for nonsynonymous variants. We found that in a clinical population, common nonsynonymous variants have significantly lower CADD scores than those produced by random mutations, whereas the distribution of CADD scores for nonsynonymous rare variants is no different than the null distribution. This suggests that CADD scores may correlate with functional consequences because common nonsynonymous variants—which are likely to be functionally benign, having gone through extensive natural selection—have lower CADD scores.

For both intergenic and upstream variants, common variants with low CADD scores were proportionally underrepresented and those with higher CADD scores were proportionally overrepresented compared with all possible variants in these regions, which was unexpected. If CADD scores are representative of evolutionary selection, then this suggests evolutionary pressure supporting promoter diversity. Alternatively, this could represent a statistical anomaly because our panel does not cover a large proportion of the intergenic and upstream regions present in the human genome. For downstream, intronic, intergenic, upstream, and 5ʹ UTR regions, the rare variants with higher CADD scores are overrepresented compared to those expected by chance, which is the pattern one would expect if these variants were more likely to be functionally deleterious, not having undergone extensive selection.

Differences in observed distributions between CADD scores for common and rare intronic and nonsynonymous variants suggest that CADD scores may be useful for identification of pathogenic intronic or nonsynonymous variants in targeted testing situations when used in combination with other data. However, our data suggest that CADD scores are unlikely to be useful for identifying disease-causing mutations in other noncoding regions in cancer-risk genes. Evaluation of the 10% of rare intronic variants with the highest CADD scores revealed 61 (of 690 examined) variants in genes consistent with patient presentation. Additional investigation suggested that these variants are unlikely to cause substantial disease risk (Supplementary Table S3 online). The absence of any convincing pathogenic or likely pathogenic variants in our clinical data set was a major limitation of our analysis. For this reason, we also evaluated the noncoding data from the original article used to describe CADD scoring,5 which had known positive variants as well as a set of 47 previously reported pathogenic deep intronic variants. The ROC curve for noncoding variants from the original data set from Kircher et al.5 and the one generated from our data set are similar, supporting our conclusion about the very low PPV of CADD score for noncoding variants. This conclusion is further supported by our evaluation of known pathogenic intronic variants from the literature.

For an in silico predictive tool to be useful in clinical interpretation of unique variants, it should have high NPV to avoid missing truly pathogenic variants and moderate PPV to minimize further clinical workup. Our analysis of PPV and NPV in different clinical situations suggests that for a single gene test in which 50% of the identified rare variants are pathogenic, the required sensitivity of a predictor must be very high (94.8%) to achieve appropriately high NPV, but the required specificity is low. Given the reported sensitivity and specificity of CADD in this scenario from the work of Kircher et al.,5 it is possible that a CADD score cutoff value for nonsynonymous mutations could approach this level of sensitivity. The number of variants in noncoding regions is higher, however, and there is a lower density of pathogenic mutations in nonexonic regions. In our patient population, for example, there were more than eight times as many rare variants in noncoding regions as there were in coding regions and splice sites (Supplementary Table S1 online). Evaluating the ROC curves generated for noncoding variants using the data from Kircher et al.5 ( Figure 2b ) in the context of PPV, it becomes clear that there is no cutoff at which CADD score is clinically useful for nonexonic variants. Additionally, if more genes are added to a panel (thereby increasing the number of rare coding or noncoding variants to evaluate), then the gap between the current performance of CADD and the performance required for clinical usefulness increases. We thus conclude that while CADD scores are, “in principle, a genome-wide, data-rich, functionally generic and organismally relevant estimate of variant effect,”5 in clinical practice for hereditary cancer panels (or, most likely, in any larger genomic test) they lack predictive power. There may be situations in which CADD scores or other in silico scores can be combined with other predictors to produce clinically useful predictions; these combined analysis situations will need to be evaluated separately to determine how much CADD scores independently improve predictions.

Another difficulty in interpreting CADD scores (or other predictive scores) is the distinction between changes that are functionally deleterious and clinically pathogenic. The data underlying CADD scores are evolutionary and functional predictors. There are many situations in which a deleterious variant does not cause clinical phenotype. This separation between functional prediction and clinical consequence reduces the real-world predictive value of predictive scores.

Although sensitivity and specificity of CADD have been shown to be high in data sets balanced for known pathogenic and benign variants, sensitivity and specificity are test values that are agnostic to population prevalence. The real-world PPV of CADD score and other in silico tests is not high enough to effectively classify individual nonexonic variants or reduce the number of potential pathogenic variants to those that could be efficiently followed up in the context of a hereditary cancer panel. This finding supports the idea that currently available in silico predictive scores should be used, at most, as supporting evidence of pathogenicity, as is currently recommended by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology.18


The authors declare no conflict of interest.