SIFT missense predictions for genomes

Journal name: Nature Protocols
Volume: 11
Pages: 1–9
Year published:
DOI: doi:10.1038/nprot.2015.123
Published online

Abstract

The SIFT (sorting intolerant from tolerant) algorithm helps bridge the gap between mutations and phenotypic variations by predicting whether an amino acid substitution is deleterious. SIFT has been used in disease, mutation and genetic studies, and a protocol for its use has previously been published in Nature Protocols. This updated protocol describes SIFT 4G (SIFT for genomes), a faster version of SIFT that makes computations on whole reference genomes practical. Users can obtain predictions for single-nucleotide variants in their organism of interest by using the SIFT 4G annotator with SIFT 4G's precomputed databases. The scope of genomic predictions has been expanded, with predictions available for more than 200 organisms. Users can also run the SIFT 4G algorithm themselves. Once the database has been downloaded, SIFT predictions can be retrieved for 6.7 million variants in 4 min. If precomputed predictions are not available, the SIFT 4G algorithm can compute predictions at a rate of 2.6 s per protein sequence. SIFT 4G is available from http://sift-dna.org/sift4g.
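
Because the annotator works on single-nucleotide variants supplied in a VCF file, it can be useful to check how many records in a file are SNVs before annotation. The sketch below is an illustration only and is not part of the SIFT 4G distribution: the function names `is_snv` and `count_snvs` and the example records are hypothetical, and the code assumes a standard tab-delimited VCF in which the fourth and fifth columns hold REF and ALT.

```python
from typing import Iterable, Tuple

BASES = {"A", "C", "G", "T"}


def is_snv(ref: str, alt: str) -> bool:
    """Return True if REF is one base and every ALT allele is a different single base."""
    ref = ref.upper()
    alts = alt.upper().split(",")  # multi-allelic sites list ALT alleles separated by commas
    return ref in BASES and all(a in BASES and a != ref for a in alts)


def count_snvs(vcf_lines: Iterable[str]) -> Tuple[int, int]:
    """Count (snv, other) records, skipping blank lines and '#' header lines."""
    snv = other = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns are CHROM, POS, ID, REF, ALT, ...; REF and ALT are fields 3 and 4 (0-based).
        if len(fields) >= 5 and is_snv(fields[3], fields[4]):
            snv += 1
        else:
            other += 1
    return snv, other


# Hypothetical records: one SNV, one deletion, one multi-allelic SNV.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\t.\t.",
    "1\t200\t.\tAT\tA\t.\t.\t.",
    "2\t300\t.\tC\tT,G\t.\t.\t.",
]
print(count_snvs(example))
```

To use it on a real file, pass an open file handle, for example `count_snvs(open("variants.vcf"))`, where the file name is a placeholder.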

Figures

  1. Figure 1: Comparison of the SIFT and SIFT 4G algorithms.

    (a) The steps of the SIFT and SIFT 4G algorithms are shown on the left and right, respectively. The principle of each step has been preserved, but the first two steps have been optimized for speed in the SIFT 4G algorithm (see refs. 23–26). (b) Matthews correlation coefficient (MCC) of SIFT (light-colored bars) and SIFT 4G (dark-colored bars) on four data sets: HumDiv (human, red), HumVar (human, green), LacI (E. coli, brown) and lysozyme (bacteriophage, blue). Accuracy is the percentage of correct predictions. MCC is a balanced measure of the true and false positives and negatives. Panel b is reproduced under a Creative Commons license from http://sift-dna.org/sift4g/AboutSIFT4G.html.

  2. Figure 2: Workflow for the SIFT 4G annotator.

    After downloading the SIFT 4G annotator, the user selects a list of variants to be annotated and the appropriate SIFT 4G database. The SIFT 4G annotator will generate two output files with SIFT annotations (a VCF file and an XLS file).

  3. Figure 3: The SIFT 4G annotator graphical user interface.

    The user's steps are numbered. The user selects a VCF file (1), selects the database for the desired organism (2), decides on the option to annotate with multiple transcripts (3) and then clicks the start button (4).

  4. Figure 4: Select the database for the desired organism.

    The database must be downloaded locally the first time it is used. To view the available SIFT 4G databases, click 'Select database to download' in the 'Database' dropdown menu; a list of organisms with their assembly and gene annotation versions will appear. Users can find their organism by scrolling down the list or by using the search box near the top of the list. Clicking on the organism name and then the 'OK' button installs the database locally for current and subsequent uses.

  5. Figure 5: View of the SIFT 4G annotator after annotation has been completed.

    The number of variants with and without SIFT annotation is displayed for each chromosome. To view annotations, users can click on links to output files.

  6. Supplementary Fig. 1: Sensitivity and specificity of SIFT and SIFT 4G.

    The algorithms were applied to four data sets: HumDiv (red), HumVar (green), LacI (brown) and lysozyme (blue). The performance of SIFT is shown in light-colored bars and that of SIFT 4G in dark-colored bars. Reproduced under a Creative Commons license from http://sift-dna.org/sift4g/AboutSIFT4G.html.

  7. Supplementary Fig. 2: ROC comparison of SIFT and SIFT 4G.

    The algorithms were applied to four datasets: HumDiv (red), HumVar (green), LacI (beige), and lysozyme (blue). SIFT’s performance is depicted with dashed lines; SIFT 4G with solid lines. Reproduced under a Creative Commons license from http://sift-dna.org/sift4g/AboutSIFT4G.html.
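
The evaluation measures named in the captions above (accuracy, sensitivity, specificity and MCC) can all be computed from the four confusion-matrix counts. The following is a minimal sketch using the standard textbook definitions, with 'deleterious' treated as the positive class (an assumption made here). The function name `classification_metrics` and the example counts are invented for illustration and are not results from the paper.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, sensitivity, specificity and MCC from confusion counts.

    tp/fn: deleterious substitutions predicted deleterious/tolerated.
    tn/fp: neutral substitutions predicted tolerated/deleterious.
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")

    def ratio(numerator: float, denominator: float) -> float:
        # A measure is undefined (NaN) when its denominator is zero.
        return numerator / denominator if denominator else float("nan")

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,           # fraction of correct predictions
        "sensitivity": ratio(tp, tp + fn),       # true-positive rate
        "specificity": ratio(tn, tn + fp),       # true-negative rate
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Invented counts, for illustration only.
for name, value in classification_metrics(tp=80, fp=20, tn=70, fn=30).items():
    print(f"{name}: {value:.3f}")
```

MCC ranges from -1 to 1 and uses all four counts, which is why it remains informative when the deleterious and neutral classes differ in size, whereas accuracy alone can be dominated by the larger class.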

References

  1. Xia, Q. et al. Complete resequencing of 40 genomes reveals domestication events and genes in silkworm (Bombyx). Science 326, 433–436 (2009).
  2. The Bovine HapMap Consortium. Genome-wide survey of SNP variation uncovers the genetic structure of cattle breeds. Science 324, 528–532 (2009).
  3. Huang, X. et al. A map of rice genome variation reveals the origin of cultivated rice. Nature 490, 497–501 (2012).
  4. McNally, K.L. et al. Sequencing multiple and diverse rice varieties. Connecting whole-genome variation with phenotypes. Plant Physiol. 141, 26–31 (2006).
  5. The 3,000 rice genomes project. The 3,000 rice genomes project. Gigascience 3, 7 (2014).
  6. Herper, M. Gene Machine (Forbes, 2010).
  7. Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450, 203–218 (2007).
  8. Atanur, S.S. et al. The genome sequence of the spontaneously hypertensive rat: analysis and functional significance. Genome Res. 20, 791–803 (2010).
  9. Seppälä, E.H. et al. LGI2 truncation causes a remitting focal epilepsy in dogs. PLoS Genet. 7, e1002194 (2011).
  10. Kumar, P., Henikoff, S. & Ng, P.C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4, 1073–1081 (2009).
  11. Ng, P.C. & Henikoff, S. Accounting for human polymorphisms predicted to affect protein function. Genome Res. 12, 436–446 (2002).
  12. Ng, P.C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 31, 3812–3814 (2003).
  13. Sim, N.L. et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 40, W452–W457 (2012).
  14. Henikoff, S., Till, B.J. & Comai, L. TILLING. Traditional mutagenesis meets functional genomics. Plant Physiol. 135, 630–636 (2004).
  15. Mitsui, J. et al. CSF1R mutations identified in three families with autosomal dominantly inherited leukoencephalopathy. Am. J. Med. Genet. B Neuropsychiatr. Genet. 159B, 951–957 (2012).
  16. Tennessen, J.A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012).
  17. Lamichhaney, S. et al. Evolution of Darwin's finches and their beaks revealed by genome sequencing. Nature 518, 371–375 (2015).
  18. Leida, C. et al. Variability of candidate genes, genetic structure and association with sugar accumulation and climacteric behavior in a broad germplasm collection of melon (Cucumis melo L.). BMC Genet. 16, 28 (2015).
  19. Moreira, G.C. et al. Variant discovery in a QTL region on chromosome 3 associated with fatness in chickens. Anim. Genet. 46, 141–147 (2015).
  20. Ortega, R., Guzmán, C. & Alvarez, J. Wx gene in diploid wheat: molecular characterization of five novel alleles from einkorn (Triticum monococcum L. ssp. monococcum) and T. urartu. Mol. Breeding 34, 1137–1146 (2014).
  21. Renaut, S. & Rieseberg, L.H. The accumulation of deleterious mutations as a consequence of domestication and improvement in sunflowers and other Compositae crops. Mol. Biol. Evol. 32, 2273–2283 (2015).
  22. Choi, J.W. et al. Whole-genome resequencing analyses of five pig breeds, including Korean wild and native, and three European origin breeds. DNA Res. 22, 259–267 (2015).
  23. Schensted, C. Longest increasing and decreasing subsequences. Can. J. Math. 13, 179–191 (1961).
  24. Korpar, M., Sosic, M., Blazeka, D. & Sikic, M. SW#db: GPU-accelerated exact sequence similarity database search. doi:10.1101/013805 (14 January 2015).
  25. Ng, P.C. & Henikoff, S. Predicting deleterious amino acid substitutions. Genome Res. 11, 863–874 (2001).
  26. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
  27. Suzek, B.E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C.H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007).
  28. Adzhubei, I.A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
  29. Pace, H.C. et al. Lac repressor genetic map in real space. Trends Biochem. Sci. 22, 334–339 (1997).
  30. Rennell, D., Bouvier, S.E., Hardy, L.W. & Poteete, A.R. Systematic mutation of bacteriophage T4 lysozyme. J. Mol. Biol. 222, 67–88 (1991).
  31. Kent, W.J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
  32. Goodstein, D.M. et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 40, D1178–D1186 (2012).
  33. Pruitt, K.D., Tatusova, T. & Maglott, D.R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33, D501–D504 (2005).

Author information

  1. These authors contributed equally to this work.

    • Robert Vaser &
    • Swarnaseetha Adusumalli

Affiliations

  1. Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia.

    • Robert Vaser &
    • Mile Sikic
  2. Computational and Systems Biology Group, Genome Institute of Singapore, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore.

    • Swarnaseetha Adusumalli,
    • Sim Ngak Leng &
    • Pauline C Ng
  3. Bioinformatics Institute, Agency for Science, Technology and Research, Singapore, Singapore.

    • Mile Sikic

Contributions

M.S. and P.C.N. conceived the project. R.V. implemented and tested the performance of the SIFT 4G algorithm. S.A. and S.N.L. implemented the SIFT 4G annotator. S.A. and P.C.N. wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to:

Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: Sensitivity and specificity of SIFT and SIFT 4G. (227 KB)

  2. Supplementary Figure 2: ROC comparison of SIFT and SIFT 4G. (163 KB)

PDF files

  1. Supplementary Text and Figures (1,142 KB)

    Supplementary Figures 1 and 2, Supplementary Tables 1 and 2
