Abstract
Detecting genomic structural variants from high-throughput sequencing data is a complex and unresolved challenge. We have developed a statistical learning approach, based on Random Forests, that integrates prior knowledge about the characteristics of structural variants and leads to improved discovery in high-throughput sequencing data. The implementation of this technique, forestSV, offers high sensitivity and specificity coupled with the flexibility of a data-driven approach.
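As a rough illustration of the idea of learning structural-variant signatures from data, the following minimal sketch trains a small bagged ensemble on synthetic genomic windows. It is not the forestSV implementation: the two features (normalized read depth and fraction of discordant read pairs), their distributions, and all function names are assumptions made for this example. Each tree is reduced to a single-split stump on one randomly chosen feature, which is a simplification of Random Forests.

```python
import random

def make_windows(n, rng):
    """Synthetic windows as (features, label); label 1 = deletion-like.
    Both features and their distributions are hypothetical."""
    data = []
    for _ in range(n):
        is_del = rng.random() < 0.5
        depth = rng.gauss(0.5 if is_del else 1.0, 0.1)          # normalized read depth
        discordant = rng.gauss(0.30 if is_del else 0.02, 0.05)  # discordant-pair fraction
        data.append(((depth, discordant), int(is_del)))
    return data

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(sample, feature):
    """Find the threshold on one feature that minimizes weighted Gini impurity."""
    best = None
    for t in sorted({x[feature] for x, _ in sample}):
        left = [y for x, y in sample if x[feature] <= t]
        right = [y for x, y in sample if x[feature] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(sample)
        if best is None or score < best[0]:
            left_vote = int(sum(left) * 2 > len(left))
            right_vote = int(sum(right) * 2 > len(right)) if right else left_vote
            best = (score, t, left_vote, right_vote)
    _, t, left_vote, right_vote = best
    return (feature, t, left_vote, right_vote)

def fit_forest(train, n_trees, rng):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    n_features = len(train[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(train) for _ in train]
        forest.append(fit_stump(sample, rng.randrange(n_features)))
    return forest

def predict_proba(forest, x):
    """Fraction of stumps voting 'deletion-like' for feature vector x."""
    votes = sum(lv if x[f] <= t else rv for f, t, lv, rv in forest)
    return votes / len(forest)

rng = random.Random(0)
train = make_windows(200, rng)
test = make_windows(100, rng)
forest = fit_forest(train, 25, rng)
accuracy = sum((predict_proba(forest, x) >= 0.5) == bool(y) for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The fraction of votes returned by `predict_proba` can be read as a confidence score for each window, so a caller could rank candidate regions rather than apply a fixed cutoff. Real sequencing data would be far less cleanly separated than these synthetic windows.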
Acknowledgements
This work was performed under US National Institutes of Health grants HG005725 and MH076431 and with support from the Beyster Family Foundation. We also thank the 1000 Genomes Project for access to data and J. Wang, H. Zheng, Y. Li, X. Jin and Y. Shi from BGI-Shenzhen for their roles in producing the unpublished autism sequencing data.
Author information
Contributions
J.J.M. conceived of and implemented forestSV. J.J.M. and J.S. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–14, Supplementary Tables 1 and 2 and Supplementary Results (PDF 2737 kb)
Supplementary Data 1
Genomic regions used for training (TXT 13315 kb)
Supplementary Data 2
Structural variant calls produced by forestSV in NA12878, NA12891, NA12892, NA19240, NA19238 and NA19239 (TXT 18421 kb)
About this article
Cite this article
Michaelson, J., Sebat, J. forestSV: structural variant discovery through statistical learning. Nat Methods 9, 819–821 (2012). https://doi.org/10.1038/nmeth.2085
This article is cited by
- Detecting genomic deletions from high-throughput sequence data with unsupervised learning. BMC Bioinformatics (2023)
- InDel marker detection by integration of multiple softwares using machine learning techniques. BMC Bioinformatics (2016)