novoBreak: local assembly for breakpoint detection in cancer genomes


We present novoBreak, a genome-wide local assembly algorithm that discovers somatic and germline structural variation breakpoints in whole-genome sequencing data. novoBreak consistently outperformed existing algorithms on real cancer genome data and on synthetic tumors in the ICGC-TCGA DREAM 8.5 Somatic Mutation Calling Challenge primarily because it more effectively utilized reads spanning breakpoints. novoBreak also demonstrated great sensitivity in identifying short insertions and deletions.

Figure 1: The novoBreak workflow.
Figure 2: novoBreak performance on DREAM 8.5 Mutation Calling Challenge data.

Accession codes

Primary accessions

Sequence Read Archive




We thank the ICGC-TCGA DREAM SMC Challenge organizers and participants for providing data and evaluations, and A.K. Eterovic and G.B. Mills for assistance with the experiment and manuscript. This study was supported in part by the National Institutes of Health (grant numbers R01 CA172652 to K.C. and U41 HG007497 to C. Lee, Jackson Lab), the National Cancer Institute Cancer Center Support Grant (P30 CA016672 to R. Depinho, MD Anderson Cancer Center), the Andrew Sabin Family Foundation (to K.C.), and a training fellowship from the Computational Cancer Biology Training Program of the Gulf Coast Consortia (CPRIT grant number RP140113) to Z.C. The results published here are in part based upon data generated by TCGA established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at

Author information

Z.C., J.R., and K.C. conceived the algorithm. Z.C. developed the software. Z.C. and K.C. designed and analyzed the experiments. M.G. and J.C. designed and performed the validation experiments. W.Z. designed the scoring statistics. M.G., T.C., X.F., L.D., A.Y.L., and P.B. tested the algorithm and performed additional analyses. K.C. supervised the projects. Z.C. and K.C. wrote the manuscript with input from all authors. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Ken Chen.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Illustration of the fate of the breakpoint spanning read pairs in alignment-based methods versus that in our k-mer targeted assembly method.

The alignment-based approaches underutilize read pairs spanning a breakpoint (top), while our k-mer targeted assembly strategy can fully utilize them (bottom).

Supplementary Figure 2 Illustration of clustering based on novo k-mers and associated read pairs.

At each breakpoint, there are k-1 k-mers covering the breakpoint. If a read fully covers a breakpoint, it must contain several k-mers (< k-1 if there are sequencing errors) covering the breakpoint. On the other hand, there should be several read pairs sharing identical k-mers, given sufficient coverage. Based on this relationship, a union-find algorithm is applied to accomplish the clustering procedure.
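
The clustering idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the novoBreak implementation: the names `UnionFind`, `spanning_kmers`, and `cluster_reads` are hypothetical, the set of novel k-mers is taken as a given input, and single reads stand in for read pairs. The sketch shows that a junction yields k-1 spanning k-mers and that reads sharing any such k-mer end up in the same cluster.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def spanning_kmers(left, right, k):
    """k-mers of the junction left|right that contain bases from both sides.

    Assumes k >= 2 and that both flanks hold at least k-1 bases.
    """
    joined = left[-(k - 1):] + right[:k - 1]
    return [joined[i:i + k] for i in range(len(joined) - k + 1)]


def cluster_reads(reads, novel_kmers, k):
    """Group read indices that share at least one novel k-mer (transitively).

    Reads without any novel k-mer are left out of the result.
    """
    uf = UnionFind(len(reads))
    first_seen = {}   # novel k-mer -> index of the first read containing it
    members = set()   # reads that carry at least one novel k-mer
    for idx, read in enumerate(reads):
        for km in kmers(read, k) & novel_kmers:
            members.add(idx)
            if km in first_seen:
                uf.union(first_seen[km], idx)
            else:
                first_seen[km] = idx
    clusters = defaultdict(list)
    for idx in sorted(members):
        clusters[uf.find(idx)].append(idx)
    return sorted(clusters.values())


# Toy example with k = 4 and two made-up junctions.
k = 4
novel = set(spanning_kmers("AAAAA", "CCCCC", k)) | set(spanning_kmers("GGGGG", "TTTTT", k))
reads = ["AAAAACC", "AACCCCC", "GGGGTTT", "GGTTTTT", "CCCCCCC"]
print(cluster_reads(reads, novel, k))  # [[0, 1], [2, 3]]
```

Indexing each novel k-mer by the first read that contains it keeps the number of union operations linear in the number of (read, k-mer) hits, rather than comparing every pair of reads.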

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1 and 2, Supplementary Tables 1, 2 and 6, and Supplementary Notes 1–4. (PDF 674 kb)

Supplementary Table 3

Previously experimentally validated somatic SV breakpoints in COLO-829 cell line. (XLSX 67 kb)

Supplementary Table 4

novoBreak predicted breakpoints in the COLO-829 whole genome sequencing data. (XLSX 47 kb)

Supplementary Table 5

Experimentally validated novel breakpoints found by novoBreak in COLO-829. (XLSX 41 kb)

Supplementary Table 7

All filtered novoBreak calls of the 22 TCGA samples. (XLSX 1169 kb)

Supplementary Table 8

Sensitivity of novoBreak in detecting gene fusions from the whole-genome sequencing data of 22 breast cancer patients from TCGA. (XLSX 48 kb)

Supplementary Software

novoBreak source code and binary distribution. (ZIP 2005 kb)

About this article

Cite this article

Chong, Z., Ruan, J., Gao, M. et al. novoBreak: local assembly for breakpoint detection in cancer genomes. Nat Methods 14, 65–67 (2017).
