novoBreak: local assembly for breakpoint detection in cancer genomes


We present novoBreak, a genome-wide local assembly algorithm that discovers somatic and germline structural variation breakpoints in whole-genome sequencing data. novoBreak consistently outperformed existing algorithms on real cancer genome data and on synthetic tumors in the ICGC-TCGA DREAM 8.5 Somatic Mutation Calling Challenge, primarily because it made more effective use of reads spanning breakpoints. novoBreak also showed high sensitivity in identifying short insertions and deletions.


Figure 1: The novoBreak workflow.
Figure 2: novoBreak performance on DREAM 8.5 Mutation Calling Challenge data.

Accession codes

Primary accessions

Sequence Read Archive




We thank the ICGC-TCGA DREAM SMC Challenge organizers and participants for providing data and evaluations, and we thank A.K. Eterovic and G.B. Mills for assistance with the experiment and manuscript. This study was supported in part by the National Institutes of Health (grant numbers R01 CA172652 to K.C. and U41 HG007497 to C. Lee, Jackson Lab), the National Cancer Institute Cancer Center Support Grant (P30 CA016672 to R. DePinho, MD Anderson Cancer Center), the Andrew Sabin Family Foundation (to K.C.), and a training fellowship from the Computational Cancer Biology Training Program of the Gulf Coast Consortia (CPRIT grant number RP140113) to Z.C. The results published here are in part based upon data generated by TCGA established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at

Author information




Z.C., J.R., and K.C. conceived the algorithm. Z.C. developed the software. Z.C. and K.C. designed and analyzed the experiments. M.G. and J.C. designed and performed the validation experiments. W.Z. designed the scoring statistics. M.G., T.C., X.F., L.D., A.Y.L., and P.B. tested the algorithm and performed additional analyses. K.C. supervised the projects. Z.C. and K.C. wrote the manuscript with input from all authors. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Ken Chen.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Illustration of the fate of the breakpoint spanning read pairs in alignment-based methods versus that in our k-mer targeted assembly method.

The alignment-based approaches underutilize read pairs spanning a breakpoint (top), while our k-mer targeted assembly strategy can fully utilize them (bottom).

Supplementary Figure 2 Illustration of clustering based on novo k-mers and associated read pairs.

At each breakpoint, there are k − 1 k-mers covering the breakpoint. A read that fully covers a breakpoint must therefore contain several of these k-mers (fewer than k − 1 if there are sequencing errors). Conversely, given sufficient coverage, several read pairs should share identical k-mers. Based on this relationship, a union-find algorithm is applied to cluster the read pairs.
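The relationship described in this legend can be illustrated with a minimal Python sketch. It is not the novoBreak implementation: the function names (`junction_kmers`, `cluster_by_shared_kmers`), the toy sequences, and the read-pair identifiers are all hypothetical, and the sketch assumes error-free reads and that each read pair has already been reduced to its set of breakpoint-covering k-mers. The first function enumerates the k − 1 k-mers that span a junction; the second groups read pairs that share at least one k-mer, directly or transitively, using union-find.

```python
from collections import defaultdict


def junction_kmers(left: str, right: str, k: int) -> list[str]:
    """Return the k-mers of left+right that contain bases from both sides."""
    seq = left + right
    b = len(left)  # index of the first base after the breakpoint
    # A k-mer starting at s spans the junction when s <= b - 1 and s + k - 1 >= b.
    starts = range(max(0, b - k + 1), min(b, len(seq) - k + 1))
    return [seq[s:s + k] for s in starts]


class UnionFind:
    """Disjoint-set structure over arbitrary hashable items."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_by_shared_kmers(pair_kmers: dict[str, set[str]]) -> list[set[str]]:
    """Group read pairs that share at least one k-mer (transitively)."""
    uf = UnionFind()
    first_owner = {}  # k-mer -> first read pair seen carrying it
    for pair_id, kmers in pair_kmers.items():
        uf.find(pair_id)  # register the pair even if it shares nothing
        for kmer in kmers:
            if kmer in first_owner:
                uf.union(first_owner[kmer], pair_id)
            else:
                first_owner[kmer] = pair_id
    groups = defaultdict(set)
    for pair_id in pair_kmers:
        groups[uf.find(pair_id)].add(pair_id)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Toy example: a junction between two made-up flanking sequences, k = 5.
spanning = junction_kmers("ACGTACGTAC", "TTGGCCAATT", 5)

# Hypothetical read pairs, each reduced to the k-mers it carries.
pairs = {
    "p1": {"GTACT", "TACTT"},
    "p2": {"TACTT", "ACTTG"},
    "p3": {"ACTTG"},
    "p4": {"GGGGG"},  # unrelated k-mer, so it stays in its own cluster
}
clusters = cluster_by_shared_kmers(pairs)
```

Here `spanning` holds k − 1 = 4 k-mers, and `p1`, `p2`, and `p3` fall into one cluster even though `p1` and `p3` share no k-mer directly, which is the transitive merging that union-find provides.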

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1 and 2, Supplementary Tables 1, 2 and 6, and Supplementary Notes 1–4. (PDF 674 kb)

Supplementary Table 3

Previously experimentally validated somatic SV breakpoints in COLO-829 cell line. (XLSX 67 kb)

Supplementary Table 4

novoBreak predicted breakpoints in the COLO-829 whole genome sequencing data. (XLSX 47 kb)

Supplementary Table 5

Experimentally validated novel breakpoints found by novoBreak in COLO-829. (XLSX 41 kb)

Supplementary Table 7

All filtered novoBreak calls of the 22 TCGA samples. (XLSX 1169 kb)

Supplementary Table 8

Sensitivity of novoBreak in detecting gene fusions from the whole-genome sequencing data of 22 breast cancer patients from TCGA. (XLSX 48 kb)

Supplementary Software

novoBreak source code and binary distribution. (ZIP 2005 kb)

About this article

Cite this article

Chong, Z., Ruan, J., Gao, M. et al. novoBreak: local assembly for breakpoint detection in cancer genomes. Nat Methods 14, 65–67 (2017).
