Abstract
Purpose:
Several in silico tools have been shown to have reasonable research sensitivity and specificity for classifying sequence variants in coding regions. The recently developed combined annotation-dependent depletion (CADD) method generates predictive scores for single-nucleotide variants (SNVs) in all areas of the genome, including noncoding regions. We sought to determine the clinical validity of CADD scores for noncoding variants.
Methods:
We evaluated 12,391 unique SNVs in 624 patient samples submitted for germ-line mutation testing in a cancer-related gene panel. Stratifying by genomic region, we compared the distributions of CADD scores of rare SNVs, SNVs common in our patient population, and the null distribution of all possible SNVs.
Results:
The median CADD scores of intronic and nonsynonymous variants were significantly different between rare and common SNVs (P < 0.0001). Despite these different distributions, no individual variants could be identified as plausibly causative among the rare intronic variants with the highest scores. The receiver-operating characteristics (ROC) area under the curve (AUC) for noncoding variants was modest, and the positive predictive value of CADD for intronic variants in panel testing was 0.088.
Conclusion:
Focused in silico scoring systems with much higher predictive value will be necessary for clinical genomic applications.
Genet Med 18(12):1269–1275.
Introduction
Multigene testing of cancer susceptibility is widely applied in clinical practice to attempt to predict the risk of developing cancer. Estimating the effect of DNA variants in these large gene panels is a major clinical challenge. Because it is impractical to functionally classify every variant identified, several in silico tools have been developed to predict the pathogenicity of single-nucleotide variants (SNVs). Many of these tools focus on protein-coding regions of the genome (summarized in ref. 1). However, the number of noncoding variants far outstrips coding variants in the human genome,2,3 and approximately 88% of trait/disease-associated SNVs in collective genome-wide association studies are in intronic or intergenic regions.4 The combined annotation-dependent depletion (CADD) method is designed to predict the pathogenicity of SNVs at any location in the genome. Kircher et al.5 described the receiver-operating characteristics (ROC) curves of CADD scores for curated, pathogenic mutations defined by the ClinVar database and showed that a CADD score has a greater area under the curve (AUC) than GerpS, PhCons, and phyloP scores for a set of defined variants. They also examined two enhancers and one promoter in which saturation mutagenesis had been previously performed and showed that CADD had the highest Spearman rank correlation between the predictive score and the observed changes in protein expression (ref. 5, Supplementary Figure S17 of that reference), stating that CADD provides, “in principle, a genome-wide, data-rich, functionally generic and organismally relevant estimate of variant effect.”5 Based on claims of genome-wide relevant estimates of variant effects, we sought to test the clinical validity of CADD scores by comparing their distributions in common and rare variants identified in 624 patients tested in our large cancer-risk gene panel, with specific attention on nonexonic variants that did not alter protein coding or canonical splice sites. 
We evaluated the rare variants with the highest CADD scores in these noncoding regions where CADD score distributions were significantly different than expected. We also explored the hypothetical sensitivity and specificity cutoffs that would be required to achieve meaningful clinical positive or negative predictive values.
Materials and Methods
Samples
We evaluated a total of 624 unique, consecutively submitted DNA samples clinically requested for germ-line cancer susceptibility testing using the University of Washington (UW) BROCA assay6 between June 2014 and February 2015. All variant data were de-identified prior to release to the investigators in this study. De-identified minimal cancer and family history phenotypes were retained with the data to aid in the interpretation of potential variant significance. This project was deemed nonhuman-subjects research consistent with ongoing quality improvement and assurance activities as a component of clinical testing.
Targeted deep sequencing by BROCA
Library construction, gene capture, and massively parallel sequencing were performed using clinical ColoSeq and BROCA assays as previously described6 and detailed online (http://web.labmed.washington.edu/tests/COLOSEQ, http://tests.labmed.washington.edu/BROCA). Briefly, DNA was sonicated, purified, and subjected to end repair, A-tailing, and ligation with Illumina paired-end adapters. The adapter-ligated library was amplified, and individual paired-end libraries were hybridized to a custom design of complementary RNA biotinylated oligonucleotides spanning all exons and nonrepetitive intronic regions spanning 49 genes (Supplementary Table S3 online). The library–bait hybrids were purified and washed. Each library was amplified by polymerase chain reaction using primers with a unique index. After amplification, libraries were quantified and equimolar concentrations were pooled, denatured, and cluster-amplified on a single lane of an Illumina flow cell. Sequencing was performed with 2 × 101-bp paired-end reads and a 7-bp index on a HiSeq 2000 (Illumina, San Diego, CA). Mean sequencing depth was more than 100 for all samples.
We used a custom targeted sequencing bioinformatics pipeline.7 Reads were mapped to human reference genome 19 (hg19, GRCh37), and alignment was performed using Burrows–Wheeler alignment and SAMtools. SNV calling was performed with GATK and VarScan. The entire pipeline was validated and shown to have more than 99.9% accuracy for single-nucleotide changes.7
Variant curation
Variant evaluation was limited to probable germ-line mutations, defined as SNVs with variant read fraction >30%. For this project, rare variants were defined as those identified at a minor allele frequency of less than 1% by the 1000 Genomes Project (1KG).8 All variants with computed CADD scores were included in the analysis.
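The curation rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in the study: the function name and argument names (`vaf`, `maf_1kg`, `cadd`) are hypothetical, and treating a variant that is absent from 1KG as rare is an assumption that the text does not state.

```python
from typing import Optional


def classify_variant(vaf: float,
                     maf_1kg: Optional[float],
                     cadd: Optional[float]) -> Optional[str]:
    """Return 'rare', 'common', or None when the variant is excluded."""
    # Keep only probable germ-line calls: variant read fraction above 30%.
    if vaf <= 0.30:
        return None
    # Only variants with a computed CADD score entered the analysis.
    if cadd is None:
        return None
    # Assumption: a variant not reported in 1KG (maf_1kg is None) counts as rare.
    if maf_1kg is None or maf_1kg < 0.01:
        return "rare"
    return "common"


if __name__ == "__main__":
    print(classify_variant(vaf=0.48, maf_1kg=0.0004, cadd=12.3))  # rare
    print(classify_variant(vaf=0.51, maf_1kg=0.22, cadd=1.7))     # common
    print(classify_variant(vaf=0.12, maf_1kg=0.0004, cadd=12.3))  # None
```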
Statistical analysis
Distribution of variant-scaled CADD scores was compared for three variant types: rare variants in patient samples, common variants in patient samples, and all possible variants as defined by Kircher et al.5 in their Supplementary Table S8. We further grouped variants by genomic region to determine whether CADD performed effectively in different genomic contexts. ANNOVAR9 was used to define genomic regions as downstream, intronic, intergenic, nonsynonymous, splice site, synonymous, stopgain, upstream, 3ʹ untranslated region (UTR), or 5ʹ UTR. We compared the sample medians using the Wilcoxon rank-sum test to evaluate the significance of differences. Because we were testing 30 comparisons, we chose a P value of 0.001 as our cutoff for significance. For each CADD score, we calculated the ratio of the proportion of variants with that score in one group to the proportion in another group and plotted these ratios to visually evaluate differences in CADD score between groups. Statistical tests were performed using built-in R functions.10
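The ratio-of-proportions comparison described above can be sketched as follows. This is an illustrative Python sketch rather than the R code used in the study; the function names and the toy scores are hypothetical, and rounding half up to the nearest integer is an assumption about how scores were binned. For context, a Bonferroni correction of a 0.05 family-wise level across 30 tests would give roughly 0.0017, so the 0.001 cutoff is slightly stricter; the source does not state how the cutoff was derived.

```python
from collections import Counter


def score_proportions(scores):
    """Proportion of variants at each CADD score rounded to the nearest integer."""
    # Scaled CADD scores are non-negative, so int(s + 0.5) rounds half up.
    counts = Counter(int(s + 0.5) for s in scores)
    total = len(scores)
    return {score: n / total for score, n in counts.items()}


def proportion_ratio(group_a, group_b):
    """Ratio of group_a to group_b proportions at each score seen in both groups."""
    pa = score_proportions(group_a)
    pb = score_proportions(group_b)
    return {score: pa[score] / pb[score] for score in sorted(pa) if score in pb}


# Reference point only: Bonferroni threshold for 30 tests at a 0.05 level.
bonferroni_threshold = 0.05 / 30

if __name__ == "__main__":
    common = [1.2, 1.4, 2.6, 9.9]   # toy values, not study data
    rare = [1.1, 3.0, 3.4, 10.2]    # toy values, not study data
    print(proportion_ratio(common, rare))
    print(round(bonferroni_threshold, 4))
```

A ratio above 1 at a given score means that score is proportionally more frequent in the first group than in the second.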
Evaluation of validity of CADD scores for noncoding variants
To evaluate possible causative variants, we used several criteria to narrow the list of variants of interest (VOI) for further analysis. First, rare variants had to be in genes broadly consistent with the patient phenotype. For example, we excluded rare variants in known breast cancer–risk genes if they were seen only in patients with a personal history of colorectal cancer (or patients with a family history excluding breast cancer for those without a personal cancer history). For SNVs that were present in multiple samples, variants were considered if the majority of those patients had cancer phenotypes consistent with the gene mutated. Second, if a pathogenic mutation consistent with patient phenotype was present, then other rare variants for that patient were considered unlikely to be causative and were excluded. Finally, the variant base was compared with the reference base in up to 100 vertebrate species (the default of the UCSC genome browser11). If the variant base was present as the reference base in any of the species for which data were available, then the variant was excluded. Remaining variants after this step were considered VOI.
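The three exclusion criteria above amount to a sequential filter. A minimal Python sketch is given below, assuming each rare variant has already been annotated with three yes/no answers; the class and field names are hypothetical, and in the study these judgments were made by review of phenotype and alignment data rather than by code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RareVariant:
    name: str
    gene_matches_phenotype: bool            # criterion 1: gene consistent with phenotype
    carrier_has_pathogenic_mutation: bool   # criterion 2: another explanatory mutation present
    variant_base_seen_as_reference: bool    # criterion 3: variant base is reference in any species


def is_voi(v: RareVariant) -> bool:
    """A variant remains a variant of interest only if it passes all three criteria."""
    return (v.gene_matches_phenotype
            and not v.carrier_has_pathogenic_mutation
            and not v.variant_base_seen_as_reference)


def select_voi(variants: List[RareVariant]) -> List[RareVariant]:
    return [v for v in variants if is_voi(v)]


if __name__ == "__main__":
    demo = [
        RareVariant("v1", True, False, False),   # kept
        RareVariant("v2", False, False, False),  # gene inconsistent with phenotype
        RareVariant("v3", True, True, False),    # patient has a known pathogenic mutation
        RareVariant("v4", True, False, True),    # variant base is reference in another species
    ]
    print([v.name for v in select_voi(demo)])
```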
For CADD scores to be clinically useful in noncoding regions, we first expected the distribution of CADD scores for rare variants to be different from the null distribution of CADD scores, particularly for variants with high CADD scores. We evaluated rare variants with the highest 10% of CADD scores for intronic variants to determine whether these variants might possibly explain patient disease phenotypes. We used the pROC program in R12 to create a ROC curve for the results of this analysis. The PhyloP score13 for pairwise alignment of 100 vertebrate species (the default of the UCSC genome browser) was also calculated for the 10% of intronic variants with the highest CADD scores (PhyloP, Supplementary Table S3 online). Each VOI was evaluated along with the 50 bases preceding and the 50 bases following the variant base using the Berkeley Drosophila Genome Project (NNSplice),14 Human Splicing Finder 3.0,15 and NetGene16 splice-site prediction algorithms to predict changes in splice sites along the transcribed strand (Supplementary Table S4 online).
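The AUC of a ROC curve equals the probability that a randomly chosen positive (here, a VOI) scores higher than a randomly chosen negative, with ties counted as one half. The sketch below computes the AUC that way in plain Python. It is not the pROC implementation used in the study, and the function name and toy inputs are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    if not positive_scores or not negative_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Toy scores, not study data.
    print(roc_auc([3.0, 4.0], [1.0, 2.0]))  # perfect separation
    print(roc_auc([1.0, 3.0], [2.0, 4.0]))  # mostly mis-ranked
    print(roc_auc([2.0], [2.0]))            # a single tie
```

An AUC of 0.5 corresponds to a score that ranks positives and negatives no better than chance, which is the reference point for interpreting the values reported in the Results.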
Modeling of the sensitivity and specificity needed to achieve clinically acceptable identification of possible pathogenic variants
For an in silico predictive tool to be clinically useful, it must either rule out benign variants with high certainty or identify pathogenic variants with modest certainty to minimize the follow-up functional or cosegregation studies necessary to definitively classify variants. For our practice, we determined that an optimal rule-out predictor would have at least 95% negative predictive value (NPV) consistent with accepted definitions of what constitutes a likely benign variant.17,18 Because of the extensive work that is required to confirm a pathogenic variant, we desire at least a 50% positive predictive value (PPV) to minimize unnecessary follow-up of unknown variants.
For given sensitivity and specificity, PPV and NPV vary with the prevalence of pathogenic mutations within the set of variants evaluated. We calculated the sensitivity and specificity required to achieve a minimum PPV of 50% and a minimum NPV of 95% using approximate representative mutation prevalence estimates: 50% for evaluation of coding elements in a single gene, 10% for evaluation of coding and noncoding elements of a single gene, 5% for evaluation of coding elements in panel testing, 0.5% for coding and noncoding elements in panel testing, 0.05% for exome testing, and 0.0001% for genome testing.3,19,20,21,22,23,24
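The dependence of PPV and NPV on prevalence follows from Bayes' rule. The Python sketch below implements the standard formulas and evaluates them across the prevalence estimates listed above. The sensitivity and specificity of 0.90 used in the demonstration are hypothetical values chosen only to show the trend; they are not CADD performance figures.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: true positives / all positive calls."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value: true negatives / all negative calls."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


if __name__ == "__main__":
    # Prevalence estimates from the text, expressed as fractions.
    prevalences = [0.5, 0.10, 0.05, 0.005, 0.0005, 0.000001]
    sens, spec = 0.90, 0.90  # hypothetical test characteristics
    for prev in prevalences:
        print(f"prevalence={prev:<9g} PPV={ppv(sens, spec, prev):.6f} "
              f"NPV={npv(sens, spec, prev):.6f}")
```

With test characteristics held fixed, PPV falls sharply as prevalence drops while NPV approaches 1, which is why the PPV target becomes the binding constraint for panel, exome, and genome testing.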
Results
Comparison of CADD score distribution between rare, common, and all possible variants
We identified 12,391 unique SNVs with computed scaled CADD scores in the 624 patient samples. The specific numbers of variants in downstream, intergenic, intronic, nonsynonymous, splicing, synonymous, upstream, 3ʹ UTR, and 5ʹ UTR regions are shown in Supplementary Table S1 online.
We compared rare, common, and all possible variants in each category to each other using the Wilcoxon rank-sum test. There were statistically significant differences between common and all possible variants for intergenic, nonsynonymous, and upstream SNVs (Supplementary Table S2 online). As shown in Supplementary Figure S1 online, when the proportion of common variants at any given CADD score was graphed over the proportion of all possible variants at that score for these significant regions, nonsynonymous variants with CADD scores less than 10 were significantly overrepresented in the common variants (P = 4.8 × 10⁻¹⁴). This is consistent with the hypothesis that common nonsynonymous variants have been subject to evolutionary selection and are thus enriched for benign variants. Surprisingly, there was an upward trend from the lowest to the highest CADD scores for the intergenic (P = 2.6 × 10⁻⁵) and upstream variants (P = 1.9 × 10⁻¹²). This suggests that high CADD score variants in these regions are more likely to occur in our patient samples than would be expected by chance.
There were statistically significant differences between rare and all possible variants for downstream, intergenic, intronic, upstream, and 5ʹ UTR SNVs (Supplementary Table S2 online). As shown in Supplementary Figure S2 online, when graphed over the proportion of all possible variants at each possible CADD score for these significant regions, we found that rare downstream variants with CADD scores more than 15 were overrepresented compared with all possible variants (P = 6 × 10⁻⁸), as were intronic variants with CADD scores more than 25 (P = 2.2 × 10⁻¹⁶) and 5ʹ UTR variants with CADD scores more than 17 (P = 2.2 × 10⁻¹⁶). Therefore, rare variants in these regions were more likely to have high CADD scores than would be expected by chance. Rare intergenic variants with CADD scores less than 4 were underrepresented (P = 1.3 × 10⁻¹¹), as were rare upstream variants with CADD scores less than 5 (P = 2.2 × 10⁻¹⁶) and rare 5ʹ UTR variants with CADD scores less than 6 (P = 2.2 × 10⁻¹⁶). This means that rare variants with low CADD scores are statistically less frequent in our patient population than would be expected by chance. These findings are consistent with the hypothesis that rare variants have not been subjected to extensive selective pressure and are more likely to be functionally deleterious.
Comparison of rare and common variants showed statistically significant differences for intronic and nonsynonymous variants (Supplementary Table S2 online). Graphing the proportion of common variants over the proportion of rare variants at each possible CADD score for these significant regions (Figure 1) revealed that SNVs with higher CADD scores were proportionally underrepresented for common intronic variants when compared with rare variants (P = 5 × 10⁻⁶). Common SNVs with CADD scores less than 6 were proportionally overrepresented for nonsynonymous variants when compared with rare variants (P = 5 × 10⁻¹¹). These findings are consistent with the hypothesis that rare variants from these regions are more likely to be deleterious than common ones and are thus more likely to have high CADD scores.
Figure 1. Ratio of common to rare variants with significant differences by Wilcoxon rank-sum test. The proportion of common variants at any given combined annotation-dependent depletion (CADD) score was compared with that of rare variants at the same CADD score (rounded to the nearest integer). Only genomic regions with significant differences by Wilcoxon rank-sum test were evaluated graphically.
Evaluation of validity of CADD scores for noncoding variants
If CADD scores are to have clinical validity for the identification of novel pathogenic variants in noncoding regions, then the subset of rare variants with the highest CADD scores in genomic regions with significantly different CADD scores between rare and common variants should be enriched for pathogenic variants. Because introns were the only noncoding region with statistically different CADD scores between rare and common variants, we specifically looked at the 10% of rare intronic variants with the highest CADD scores to evaluate whether these mutations could possibly cause disease in our patient population. Two hundred eighty-six of 690 variants evaluated were in genes not known to cause the type of cancer found in the patient or patient’s family and were therefore excluded. Thirty-eight of the 404 remaining rare variants were seen in patients with other known pathogenic variants and were therefore considered unlikely to cause the phenotype in those patients. Three hundred five of the remaining 366 variants were present as the conserved base in one or more of the vertebrate species evaluated by MULTIZ alignment of up to 100 vertebrate species, which was used as evidence that there was no functional consequence to the variant. This left 61 VOI.
There was no significant enrichment of VOI as the CADD score cutoff increased. Forty-two of 517 variants with CADD scores between 10.51 and 14.99 (8.1%) were VOI. Sixteen of 145 variants with CADD scores between 15 and 19.99 (11%) were VOI, and for variants with CADD scores of 20 or more (28 total), there were 3 VOI (10.7%). We plotted the ROC curve (Figure 2a) of VOI over all rare variants in the CADD score range examined to determine whether there was an optimal cutoff at which CADD score identified the most VOI (highest sensitivity) with the highest specificity. The AUC was 0.591 (95% confidence interval: 0.516–0.667), and there was no CADD cutoff at which sensitivity and specificity were optimized. The PPV of CADD score to identify VOI at a score of 10.51 or more was 8.8%.
Figure 2. Receiver-operating characteristics (ROC) curve for noncoding variants. (a) ROC curve for combined annotation-dependent depletion (CADD) score (black) and 100 vertebrate PhyloP score (grey) for variants of interest in the top 10% of rare intronic variants. (b) ROC curve for CADD score for noncoding variants from Kircher and colleagues’5 source data. Area under the curve (95% confidence interval) calculated using pROC in R.
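The counts reported in this section can be cross-checked with simple arithmetic. The short Python sketch below uses only numbers stated in the text (the filtering steps from 690 variants down to 61 VOI, and the VOI counts per CADD score range); the variable names are ours.

```python
# Filtering steps reported in the text.
evaluated = 690
after_gene_filter = evaluated - 286                      # genes inconsistent with phenotype
after_pathogenic_filter = after_gene_filter - 38         # patient has a known pathogenic variant
voi_total = after_pathogenic_filter - 305                # variant base conserved in another species

# (VOI count, total variants) for each CADD score range reported in the text.
score_bins = {
    "10.51-14.99": (42, 517),
    "15-19.99": (16, 145),
    ">=20": (3, 28),
}

voi_fraction = {label: voi / total for label, (voi, total) in score_bins.items()}
overall_ppv = (sum(voi for voi, _ in score_bins.values())
               / sum(total for _, total in score_bins.values()))

if __name__ == "__main__":
    print(after_gene_filter, after_pathogenic_filter, voi_total)
    for label, fraction in voi_fraction.items():
        print(f"{label}: {fraction:.1%}")
    print(f"overall PPV at CADD >= 10.51: {overall_ppv:.3f}")
```

The per-range VOI fractions stay between roughly 8% and 11%, and the pooled fraction of 61/690 reproduces the 8.8% PPV quoted above.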
In an effort to identify pathogenic mutations, we used three splice-site prediction algorithms (NNSplice, Human Splicing Finder 3.0 [HSF3.0], and NetGene) to evaluate the possibility that splice-site changes on the transcribed strand caused by VOI introduced alternative splice sites. No variants were predicted to introduce novel splice sites by all three prediction algorithms tested (Supplementary Table S4 online). NNSplice and HSF3.0 splice predictions were consistent for five variants, and three of these showed a 15% or greater increase in splicing score for both predictions. However, the frequency of these variants in our overall clinical sample set was not significantly higher than the reported variant frequency in the 1KG data set, and the clinical histories of other individuals with these variants evaluated outside this study were not consistent with the gene of interest, suggesting that these variants are unlikely to substantially alter disease risk.
No definitively pathogenic variants were present in our cohort, and thus we were unable to robustly measure the false-negative rate (FNR). For this reason, we also computed the ROC of noncoding variants from the data from Figure 3 of the study by Kircher et al.,5 whose data set contained well-characterized pathogenic variants (Figure 2b). The AUC of this ROC curve was 0.663 (95% CI: 0.607–0.720), which was not significantly better than the one from our data.
We further evaluated CADD scores for pathogenic intronic variants using 47 deep intronic variants reported to be deleterious (Supplementary Table S5 online).25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40 The CADD scores for these variants ranged from 0.356 to 19.05, with a median of 3.498. Only 5 of the 47 variants (10.6%) had CADD scores above 10.51, the cutoff at which we evaluated variants for possible pathogenicity. More than 50% of all rare intronic variants in our data set had CADD scores greater than the median for known pathogenic variants (3.498), consistent with a low AUC.
To evaluate the usefulness of CADD scores relative to other in silico predictive algorithms for intronic regions, we compared the performance of PhyloP scores from 100 vertebrate species to that of CADD scores for the identification of VOI. The AUC for the ROC curve of the PhyloP analysis was 0.666 (95% CI: 0.593–0.739), which is similar to that seen for CADD scores (Figure 2a). However, because our definition of VOI included evaluation of the conservation of the base between species, we cannot exclude ascertainment bias.
Sensitivity and specificity needed to achieve clinically acceptable identification of possible pathogenic variants
For maximal clinical validity, a predictive tool should identify benign variants with confidence while minimizing the number of variants that require further workup. We calculated the minimum sensitivity and/or specificity needed for a predictor to achieve a PPV of at least 50% and an NPV of at least 95% (Table 1). If pathogenic mutations represent 0.5% of all rare mutations (our estimate for the prevalence of pathogenic mutations in coding and noncoding regions of genes in panel testing), then the required minimum specificity for an in silico tool to identify pathogenic mutations at this level of confidence is 99.5%, although the sensitivity of the tool is less critical at that mutation prevalence (Table 1). The sensitivity required for an in silico prediction tool to meet clinically meaningful NPV requirements increases as the prevalence of pathogenic mutations increases, whereas the specificity becomes less critical for overall performance. However, if the number of genes and rare variants tested increases, then the specificity required to achieve a clinically meaningful PPV increases and the sensitivity becomes less critical for overall performance.
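The minimum requirements described above can be obtained by solving the PPV and NPV formulas for specificity and sensitivity. The Python sketch below does this under a simplifying assumption of ours: the complementary metric is taken as perfect (sensitivity of 1 when solving for specificity, specificity of 1 when solving for sensitivity), which gives the most lenient possible bound. The function names are hypothetical, and the output is not a reproduction of Table 1.

```python
def min_specificity(target_ppv, prevalence, sensitivity=1.0):
    """Lowest specificity giving PPV >= target_ppv at this prevalence."""
    # From PPV = s*p / (s*p + (1 - spec)*(1 - p)) >= t, solved for spec.
    bound = 1.0 - (sensitivity * prevalence * (1.0 - target_ppv)
                   / (target_ppv * (1.0 - prevalence)))
    return max(0.0, bound)  # a negative bound means any specificity suffices


def min_sensitivity(target_npv, prevalence, specificity=1.0):
    """Lowest sensitivity giving NPV >= target_npv at this prevalence."""
    # From NPV = spec*(1 - p) / (spec*(1 - p) + (1 - s)*p) >= t, solved for s.
    bound = 1.0 - (specificity * (1.0 - prevalence) * (1.0 - target_npv)
                   / (target_npv * prevalence))
    return max(0.0, bound)  # a negative bound means any sensitivity suffices


if __name__ == "__main__":
    # Prevalence estimates from the Methods, expressed as fractions.
    for prev in [0.5, 0.10, 0.05, 0.005, 0.0005, 0.000001]:
        print(f"prevalence={prev:<9g} "
              f"min specificity (PPV>=50%)={min_specificity(0.50, prev):.6f}  "
              f"min sensitivity (NPV>=95%)={min_sensitivity(0.95, prev):.6f}")
```

Under these assumptions, a prevalence of 0.5% requires a specificity of about 99.5% to reach a PPV of 50%, while the NPV target places no constraint on sensitivity at that prevalence, in line with the pattern described in the text.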
Discussion
The comparison of the CADD scores of common and rare variants in different genomic areas with the CADD scores of all possible variants in these areas as defined by Kircher et al.5 suggested that CADD scores may have modest predictive power for nonsynonymous variants. We found that in a clinical population, common nonsynonymous variants have significantly lower CADD scores than those produced by random mutations, whereas the distribution of CADD scores for nonsynonymous rare variants is no different than the null distribution. This suggests that CADD scores may correlate with functional consequences because common nonsynonymous variants—which are likely to be functionally benign, having gone through extensive natural selection—have lower CADD scores.
For both intergenic and upstream variants, common variants with low CADD scores were proportionally underrepresented and those with higher CADD scores were proportionately overrepresented compared with all possible variants in these regions, which was unexpected. If CADD scores are representative of evolutionary selection, then this suggests evolutionary pressure supporting promoter diversity. Alternatively, this could represent a statistical anomaly due to the fact that our panel does not cover a large proportion of the intergenic and upstream regions present in the human genome. For downstream, intronic, intergenic, upstream, and 5ʹ UTR regions, the rare variants with higher CADD scores are overrepresented compared to those expected by chance, which is the pattern one would expect if these variants were more likely to be functionally deleterious, not having undergone extensive selection.
Differences in observed distributions between CADD scores for common and rare intronic and nonsynonymous variants suggest that CADD scores may be useful for identification of pathogenic intronic or nonsynonymous variants in targeted testing situations when used in combination with other data. However, our data suggest that CADD scores are unlikely to be useful for identifying disease-causing mutations in other noncoding regions in cancer-risk genes. Evaluation of the 10% of rare intronic variants with the highest CADD scores revealed 61 (of 690 examined) variants in genes consistent with patient presentation. Additional investigation suggested that these variants are unlikely to cause substantial disease risk (Supplementary Table S3 online). The absence of any convincing pathogenic or likely pathogenic variants in our clinical data set was a major limitation of our analysis. For this reason, we also evaluated the noncoding data from the original article used to describe CADD scoring,5 which had known positive variants as well as a set of 47 previously reported pathogenic deep intronic variants. The ROC curve for noncoding variants from the original data set from Kircher et al.5 and the one generated from our data set are similar, supporting our conclusion about the very low PPV of CADD score for noncoding variants. This conclusion is further supported by our evaluation of known pathogenic intronic variants from the literature.
For an in silico predictive tool to be useful in clinical interpretation of unique variants, it should have high NPV to avoid missing truly pathogenic variants and moderate PPV to minimize further clinical workup. Our analysis of PPV and NPV in different clinical situations suggests that for a single gene test in which 50% of the identified rare variants are pathogenic, the sensitivity of a predictor must be very high (94.8%) to achieve appropriately high NPV, but the required specificity is low. Given the reported sensitivity and specificity of CADD in this scenario from the work of Kircher et al.,5 it is possible that a CADD score cutoff value for nonsynonymous mutations could approach this level of sensitivity. The number of variants in noncoding regions is higher, however, and there is a lower density of pathogenic mutations in nonexonic regions. In our patient population, for example, there were more than eight times as many rare variants in noncoding regions as there were in coding regions and splice sites (Supplementary Table S1 online). Evaluating the ROC curves generated for noncoding variants using the data from Kircher et al.5 (Figure 2b) in the context of PPV, it becomes clear that there is no cutoff at which CADD score is clinically useful for nonexonic variants. Additionally, if more genes are added to a panel (thereby increasing the number of rare coding or noncoding variants to evaluate), then the gap between the current performance of CADD and the performance required for clinical usefulness increases. We thus conclude that while CADD scores are, “in principle, a genome-wide, data-rich, functionally generic and organismally relevant estimate of variant effect,”5 in clinical practice for hereditary cancer panels (or, most likely, in any larger genomic test) they lack predictive power.
There may be situations in which CADD scores or other in silico scores can be combined with other predictors to produce clinically useful predictions; these combined analysis situations will need to be evaluated separately to determine how much CADD scores independently improve predictions.
Another difficulty in interpreting CADD scores (or other predictive scores) is the distinction between changes that are functionally deleterious and those that are clinically pathogenic. The data underlying CADD scores are evolutionary and functional predictors. There are many situations in which a deleterious variant does not cause a clinical phenotype. This separation between functional prediction and clinical consequence reduces the real-world predictive value of such scores.
Although sensitivity and specificity of CADD have been shown to be high in data sets balanced for known pathogenic and benign variants, sensitivity and specificity are test values that are agnostic to population prevalence. The real-world PPV of CADD score and other in silico tests is not high enough to effectively classify individual nonexonic variants or reduce the number of potential pathogenic variants to those that could be efficiently followed up in the context of a hereditary cancer panel. This finding supports the idea that currently available in silico predictive scores should be used, at most, as supporting evidence of pathogenicity, as is currently recommended by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology.18
Disclosure
The authors declare no conflict of interest.
References
Cooper GM, Shendure J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet 2011;12:628–640.
1000 Genomes Project Consortium; Abecasis GR, Altshuler D, Auton A, et al. A map of human genome variation from population-scale sequencing. Nature 2010;467:1061–1073.
Pelak K, Shianna KV, Ge D, et al. The characterization of twenty sequenced human genomes. PLoS Genet 2010;6:e1001111.
Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 2009;106:9362–9367.
Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 2014;46:310–315.
Pritchard CC, Smith C, Salipante SJ, et al. ColoSeq provides comprehensive Lynch and polyposis syndrome mutational analysis using massively parallel sequencing. J Mol Diagn 2012;14:357–366.
Pritchard CC, Salipante SJ, Koehler K, et al. Validation and implementation of targeted capture and sequencing for the detection of actionable mutation, copy number variation, and gene rearrangement in clinical cancer specimens. J Mol Diagn 2014;16:56–67.
1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, et al. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491:56–65.
Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38:e164.
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/. 2014. Accessed 5 January 2015.
Kent WJ, Sugnet CW, Furey TS, et al. The human genome browser at UCSC. Genome Res 2002;12:996–1006.
Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011;12:77.
Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 2010;20:110–121.
Reese MG, Eeckman FH, Kulp D, Haussler D. Improved splice site detection in Genie. J Comput Biol 1997;4:311–323.
Desmet FO, Hamroun D, Lalande M, Collod-Béroud G, Claustres M, Béroud C. Human Splicing Finder: an online bioinformatics tool to predict splicing signals. Nucleic Acids Res 2009;37:e67.
Brunak S, Engelbrecht J, Knudsen S. Prediction of human mRNA donor and acceptor sites from the DNA sequence. J Mol Biol 1991;220:49–65.
Plon SE, Eccles DM, Easton D, et al.; IARC Unclassified Genetic Variants Working Group. Sequence variant classification and reporting: recommendations for improving the interpretation of cancer susceptibility genetic test results. Hum Mutat 2008;29:1282–1291.
Richards S, Aziz N, Bale S, et al.; ACMG Laboratory Quality Assurance Committee. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 2015;17:405–424.
Newman B, Mu H, Butler LM, Millikan RC, Moorman PG, King MC. Frequency of breast cancer attributable to BRCA1 in a population-based series of American women. JAMA 1998;279:915–921.
Ng PC, Levy S, Huang J, et al. Genetic variation in an individual human exome. PLoS Genet 2008;4:e1000160.
Ramus SJ, Gayther SA. The contribution of BRCA1 and BRCA2 to ovarian cancer. Mol Oncol 2009;3:138–150.
Kim H, Choi DH. Distribution of BRCA1 and BRCA2 mutations in Asian patients with breast cancer. J Breast Cancer 2013;16:357–365.
Foley SB, Rios JJ, Mgbemena VE, et al. Use of whole genome sequencing for diagnosis and discovery in the cancer genetics clinic. EBioMedicine 2015;2:74–81.
Tung N, Battelli C, Allen B, et al. Frequency of mutations in individuals with breast cancer referred for BRCA1 and BRCA2 testing using next-generation sequencing with a 25-gene panel. Cancer 2015;121:25–33.
Hübner CA, Utermann B, Tinschert S, et al. Intronic mutations in the L1CAM gene may cause X-linked hydrocephalus by aberrant splicing. Hum Mutat 2004;23:526.
Gámez-Pozo A, Palacios I, Kontic M, et al. Pathogenic validation of unique germline intronic variants of RB1 in retinoblastoma patients using minigenes. Hum Mutat 2007;28:1245.
Takeshima Y, Yagi M, Okizuka Y, et al. Mutation spectrum of the dystrophin gene in 442 Duchenne/Becker muscular dystrophy cases from one Japanese referral center. J Hum Genet 2010;55:379–388.
Castoldi E, Duckers C, Radu C, et al. Homozygous F5 deep-intronic splicing mutation resulting in severe factor V deficiency and undetectable thrombin generation in platelet-rich plasma. J Thromb Haemost 2011;9:959–968.
Kulseth MA, Lyle R, Rødningen OK, Sorte H, Prescott T. Exon trapping analysis of c.301-19G>A in intron 1 of the SHH gene in a patient with a microform of holoprosencephaly. Eur J Med Genet 2011;54:130–135.
Meeths M, Chiang SC, Wood SM, et al. Familial hemophagocytic lymphohistiocytosis type 3 (FHL3) caused by deep intronic mutation and inversion in UNC13D. Blood 2011;118:5783–5793.
Richards AJ, McNinch A, Whittaker J, et al. Splicing analysis of unclassified variants in COL2A1 and COL11A1 identifies deep intronic pathogenic mutations. Eur J Hum Genet 2012;20:552–558.
Spier I, Horpaopan S, Vogt S, et al. Deep intronic APC mutations explain a substantial proportion of patients with familial or early-onset adenomatous polyposis. Hum Mutat 2012;33:1045–1050.
Cavalieri S, Pozzi E, Gatti RA, Brusco A. Deep-intronic ATM mutation detected by genomic resequencing and corrected in vitro by antisense morpholino oligonucleotide (AMO). Eur J Hum Genet 2013;21:774–778.
Costantino L, Rusconi D, Soldà G, et al. Fine characterization of the recurrent c.1584+18672A>G deep-intronic mutation in the cystic fibrosis transmembrane conductance regulator gene. Am J Respir Cell Mol Biol 2013;48:619–625.
Steele-Stallard HB, Le Quesne Stabej P, Lenassi E, et al. Screening for duplications, deletions and a common intronic mutation detects 35% of second mutations in patients with USH2A monoallelic mutations on Sanger sequencing. Orphanet J Rare Dis 2013;8:122.
Bholah Z, Smith MJ, Byers HJ, Miles EK, Evans DG, Newman WG. Intronic splicing mutations in PTCH1 cause Gorlin syndrome. Fam Cancer 2014;13:477–480.
Bonifert T, Karle KN, Tonagel F, et al. Pure and syndromic optic atrophy explained by deep intronic OPA1 mutations and an intralocus modifier. Brain 2014;137(Pt 8):2164–2177.
Bach JE, Wolf B, Oldenburg J, Müller CR, Rost S. Identification of deep intronic variants in 15 haemophilia A patients by next generation sequencing of the whole factor VIII gene. Thromb Haemost 2015;114:757–767.
Liquori A, Vaché C, Baux D, et al. Whole USH2A gene sequencing identifies several new deep intronic mutations. Hum Mutat 2016;37:184–193.
Palagano E, Blair HC, Pangrazio A, et al. Buried in the middle but guilty: intronic mutations in the TCIRG1 gene cause human autosomal recessive osteopetrosis. J Bone Miner Res 2015;30:1814–1821.
Acknowledgements
Funding for this project was provided in part by the University of Washington Department of Laboratory Medicine and development funds from the Fred Hutchinson/University of Washington Cancer Consortium Cancer Center Support Grant from the National Cancer Institute (5P30 CA015704-39 to B.H.S.). B.H.S. is a Damon Runyon-Rachleff Innovator supported in part by the Damon Runyon Cancer Research Foundation (DRR-33-15). C.C.P. is supported by CDMRP award PC131820 and a 2013 Young Investigator Award from the Prostate Cancer Foundation.
Supplementary information
Supplementary Figures and Tables (DOC 1622 kb)
Supplementary Table S8 (XLS 196 kb)
Cite this article
Mather, C., Mooney, S., Salipante, S. et al. CADD score has limited clinical validity for the identification of pathogenic variants in noncoding regions in a hereditary cancer panel. Genet Med 18, 1269–1275 (2016). https://doi.org/10.1038/gim.2016.44
DOI: https://doi.org/10.1038/gim.2016.44
Keywords
- combined annotation-dependent depletion
- CADD score
- in silico predictor
- noncoding sequences
- predictive algorithm