To the Editor:
The majority of the gene variants discovered by next-generation sequencing (NGS) projects are either intronic or synonymous. These variants are difficult to interpret because their effects on protein expression and function tend to be less obvious than those of missense or nonsense variants. Here we present MutationTaster2 (http://www.mutationtaster.org/), the latest version of our web-based software MutationTaster (ref. 1), which evaluates the pathogenic potential of DNA sequence alterations. It is designed to predict the functional consequences of not only amino acid substitutions but also intronic and synonymous alterations, short insertion and/or deletion (indel) mutations and variants spanning intron-exon borders.
MutationTaster2 includes all publicly available single-nucleotide polymorphisms (SNPs) and indels from the 1000 Genomes Project (ref. 2; hereafter referred to as 1000G) as well as known disease variants from ClinVar (ref. 3) and HGMD Public (ref. 4). Alterations found more than four times in the homozygous state in 1000G or in HapMap (ref. 5) are automatically regarded as neutral. Variants marked as pathogenic in ClinVar are automatically predicted to be disease causing, and the disease phenotype is displayed. We have integrated tests for regulatory features, including data from the ENCODE project (ref. 6) and JASPAR (ref. 7), and score the evolutionary conservation around DNA variants (Supplementary Methods). To reduce the number of false positive splice-site predictions, MutationTaster2 considers loss or decreased strength of splice sites only at existing intron-exon borders. A sequence change within 2 base pairs of an intron-exon junction is regarded as the loss of a splice site. As a further improvement, MutationTaster2 is able to analyze sequence alterations spanning an intron-exon junction; these are likely to perturb normal splicing and hence have considerable pathogenic potential.
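The automatic rules described above can be illustrated with a minimal sketch. It is not MutationTaster2's code: the `Variant` record, its field names and the function names are hypothetical, and two details are assumptions that the text does not settle, namely that the neutral rule is checked before the ClinVar rule and that "within 2 base pairs" includes a distance of exactly 2.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold from the text: found MORE than four times in the homozygous state.
HOMOZYGOUS_NEUTRAL_THRESHOLD = 4
# Window from the text: within 2 bp of an intron-exon junction (inclusive: assumption).
SPLICE_SITE_WINDOW_BP = 2


@dataclass
class Variant:
    """Hypothetical record; field names are illustrative only."""
    homozygous_count_1000g: int = 0
    homozygous_count_hapmap: int = 0
    clinvar_pathogenic: bool = False
    # Signed distance to the nearest existing junction, None if not applicable.
    distance_to_junction_bp: Optional[int] = None


def automatic_call(v: Variant) -> Optional[str]:
    """Return an automatic call, or None if the classifier has to decide."""
    # Rule order is an assumption; the text does not say which rule wins.
    if (v.homozygous_count_1000g > HOMOZYGOUS_NEUTRAL_THRESHOLD
            or v.homozygous_count_hapmap > HOMOZYGOUS_NEUTRAL_THRESHOLD):
        return "neutral"
    if v.clinvar_pathogenic:
        return "disease causing"
    return None


def loses_splice_site(v: Variant) -> bool:
    """Treat a change close to an existing junction as loss of the splice site."""
    if v.distance_to_junction_bp is None:
        return False
    return abs(v.distance_to_junction_bp) <= SPLICE_SITE_WINDOW_BP
```

A variant seen homozygously exactly four times is left to the classifier in this sketch, because the text specifies "more than four times".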
We were able to substantially increase the speed of MutationTaster2 by caching BLASTP results from protein-conservation analysis and by implementing our own function to search for changes in the amino acid sequence. A single analysis now takes less than 0.10 seconds on average.
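The effect of caching can be illustrated with a small sketch. The text does not say how the BLASTP results are stored, so the in-memory memoization below is an assumption, and `run_blastp` is a hypothetical stand-in for the expensive search rather than a real BLAST interface.

```python
from functools import lru_cache


def run_blastp(protein_id: str) -> str:
    """Hypothetical stand-in for an expensive BLASTP conservation search."""
    run_blastp.calls += 1  # count how often the expensive step really runs
    return f"conservation-result-for-{protein_id}"


run_blastp.calls = 0


@lru_cache(maxsize=None)
def cached_blastp(protein_id: str) -> str:
    """Return the stored result if this protein was already analysed."""
    return run_blastp(protein_id)
```

Repeated queries for the same protein then reach the expensive step only once; a persistent store such as a database table would serve the same purpose across sessions.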
For the rapid and user-friendly analysis of NGS results, we created a dedicated query engine. Users can upload VCF files and adjust several parameters, such as confining consideration to homozygous variants or certain regions and filtering for known polymorphisms. Job-scheduler software processes the genotypes in a highly parallel fashion (500,000 alterations per hour). Users can opt to be notified by e-mail when the process is complete. The results can be filtered, prioritized and inspected in a web browser or downloaded. We integrated our candidate-gene search engine, GeneDistiller (ref. 8), to let users determine the most likely candidate genes among the potentially deleterious variants. In addition, we developed a web interface for single queries using chromosomal positions. MutationTaster2 automatically maps the variant to all suitable genes and transcripts, analyzes the variant in all of them and displays a table summarizing the predictions for all transcripts and detailed results for each transcript.
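The kind of pre-filtering offered by the query engine (homozygous variants only, optionally within a region) can be sketched as follows. This is an illustration under stated assumptions, not the query engine itself: it reads a single-sample, tab-separated VCF with the standard column order, the function names and the `(chromosome, start, end)` region tuple are hypothetical, and "homozygous" is taken to mean two identical non-reference alleles.

```python
from typing import Iterable, Iterator, Optional, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive


def is_homozygous_alt(genotype: str) -> bool:
    """True for diploid genotypes such as 1/1 or 1|1."""
    alleles = genotype.replace("|", "/").split("/")
    return (len(alleles) == 2
            and alleles[0] == alleles[1]
            and alleles[0] not in ("0", "."))


def filter_vcf(lines: Iterable[str],
               region: Optional[Region] = None) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (chrom, pos, ref, alt) for homozygous variants, optionally in a region."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns, so no genotype to inspect
        chrom, pos = fields[0], int(fields[1])
        if region is not None:
            r_chrom, start, end = region
            if chrom != r_chrom or not (start <= pos <= end):
                continue
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        genotype = fields[9].split(":")[format_keys.index("GT")]
        if is_homozygous_alt(genotype):
            yield chrom, pos, fields[3], fields[4]
```

For example, `list(filter_vcf(open("sample.vcf"), region=("1", 1, 1_000_000)))` would keep only homozygous calls in the first megabase of chromosome 1; the file name is a placeholder.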
As with its predecessor, MutationTaster2 uses a Bayes classifier to generate predictions. Because alterations with different effects on the protein sequence require different tests, we use three classification models, designed for alterations that lead to single amino acid substitutions ('simple_aae'), involve more than one amino acid ('complex_aae') or are noncoding or synonymous ('without_aae'). MutationTaster2 was trained and tested with single base exchanges and short indels, comprising >6,000,000 validated polymorphisms from 1000G and (with permission from BIOBASE) >100,000 known disease mutations from HGMD Professional (Supplementary Table 1). We were able to improve the accuracy in all classification models, with a slight increase in the simple_aae model (from 87.2% in MutationTaster to 88.6% in MutationTaster2) and substantial changes in the without_aae model (from 82.7% to 92.2%) and the complex_aae model (from 79.3% to 90.7%) (Supplementary Table 2).
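The division into three models and the Bayesian scoring can be sketched as follows. Several points are assumptions: the text says only "Bayes classifier", so the naive (feature-independence) formulation is assumed here; selecting the model by the number of affected amino acids is a simplification of the three categories; and the function names, the feature table layout and the equal prior are hypothetical.

```python
import math
from typing import Dict, Hashable, Tuple

# likelihoods[feature][value] = (P(value | disease), P(value | polymorphism))
Likelihoods = Dict[str, Dict[Hashable, Tuple[float, float]]]


def choose_model(n_affected_amino_acids: int) -> str:
    """Pick one of the three model names given in the text (simplified rule)."""
    if n_affected_amino_acids == 0:
        return "without_aae"   # noncoding or synonymous
    if n_affected_amino_acids == 1:
        return "simple_aae"    # single amino acid substitution
    return "complex_aae"       # more than one amino acid involved


def naive_bayes_posterior(features: Dict[str, Hashable],
                          likelihoods: Likelihoods,
                          prior_disease: float = 0.5) -> float:
    """P(disease | features), assuming independent features and nonzero likelihoods."""
    log_disease = math.log(prior_disease)
    log_neutral = math.log(1.0 - prior_disease)
    for name, value in features.items():
        p_disease, p_neutral = likelihoods[name][value]
        log_disease += math.log(p_disease)
        log_neutral += math.log(p_neutral)
    # Normalise in log space to avoid underflow when many features are combined.
    top = max(log_disease, log_neutral)
    e_disease = math.exp(log_disease - top)
    e_neutral = math.exp(log_neutral - top)
    return e_disease / (e_disease + e_neutral)
```

In such a scheme each model would carry its own likelihood table, estimated from the training variants relevant to that class of alteration.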
We compared the predictions of the web versions of MutationTaster2, SIFT (http://sift.jcvi.org/), PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/) and PROVEAN (http://provean.jcvi.org/index.php) on 1,100 polymorphisms and 1,100 disease mutations with variants causing single amino acid exchanges. MutationTaster2 had the highest accuracy (88%) of the tools tested (Table 1). The actual performance of MutationTaster2 is even better because the program automatically detects and categorizes confirmed polymorphisms and known disease mutations. In a real-world example using exome data, MutationTaster2 yielded a false positive rate of 1% for homozygous alterations (Supplementary Table 3 and Supplementary Methods).
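For readers who want to reproduce this kind of comparison, the usual measures can be computed from a confusion matrix as sketched below. The class name and the counts in the usage note are toy values chosen for illustration; they are not taken from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a two-class test set (disease mutations vs. polymorphisms)."""
    tp: int  # disease mutations predicted as disease causing
    fn: int  # disease mutations predicted as neutral
    tn: int  # polymorphisms predicted as neutral
    fp: int  # polymorphisms predicted as disease causing

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fn + self.tn + self.fp)

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)
```

With the toy counts `Confusion(tp=9, fn=1, tn=8, fp=2)`, accuracy is 17/20. On a balanced test set such as the one used here (equal numbers of polymorphisms and disease mutations), accuracy equals the mean of sensitivity and specificity.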
The major drawback of MutationTaster2 is its limitation to intragenic variants. With the advance of whole-genome sequencing projects, it should be possible to overcome this limitation in the future. It should be noted that MutationTaster2 has been designed specifically to aid the identification of rare variants with severe impact (as in monogenic disorders) and is not intended to predict the consequences of common variants with small effects.
This work is supported by grants from the Deutsche Forschungsgemeinschaft (SFB665 TP-C4) to M.S., the Einsteinstiftung Berlin (A-2011-63) to J.M.S. and M.S. and the German Bundesministerium für Bildung und Forschung (mitoNET 01GM1113D) to D.S. and M.S. M.S. is a member of the NeuroCure Center of Excellence (Exc 257).
Supplementary Tables 1–3 and Supplementary Methods