Automated assembly of centromeres from ultra-long error-prone reads

Abstract

Centromeric variation has been linked to cancer and infertility, but centromere sequences contain multiple tandem repeats and can only be assembled manually from long error-prone reads. Here we describe the centroFlye algorithm for centromere assembly using long error-prone reads, and apply it to assemble human centromeres on chromosomes 6 and X. Our analyses reveal putative breakpoints in the manual reconstruction of the human X centromere, demonstrate that the human X chromosome is partitioned into repeat subfamilies and provide initial insights into centromere evolution. We anticipate that centroFlye could be applied to automatically close the remaining multimegabase gaps in the reference human genome.

Fig. 1: centroFlye_HOR pipeline.
Fig. 2: centroFlye_mono pipeline.
Fig. 3: Information about cenX assemblies.
Fig. 4: Comparison of read mappings between the centroFlye, centroFlye_del, T2T4 and T2T6 assemblies.
Fig. 5: Coverage plots for centroFlye, centroFlye_del, T2T4 and T2T6 assemblies.
Fig. 6: Coverage of various cenX assemblies by discordant reads.

Data availability

The centroFlye centromere 6 and X assemblies and all supporting data are available at Zenodo: https://doi.org/10.5281/zenodo.3897531. The ONT reads that were generated by the T2T consortium are deposited under accession number PRJNA559484.

Code availability

The codebase of the algorithm is available at https://github.com/seryrzu/centroFlye. The version of centroFlye that generates the assemblies described in the paper is in the branch cF_NatBiotech_paper_Xv0.8.3-6v0.1.3. Jupyter notebooks for reproducing all figures in this study are provided in the GitHub repository https://github.com/seryrzu/centroFlye_paper_scripts.

Acknowledgements

We are indebted to I. Alexandrov, M. Kolmogorov, K. Miga and V. Shepelev for many insightful comments that improved the centroFlye algorithm. We are grateful to A. Bankevich, A. Bzikadze, T. Dvorkina, A. Mikheenko, A. Phillippy, C. Wu and J. Yuan for helpful discussions and suggestions.

Author information

Contributions

Both authors contributed to developing the centroFlye algorithm and writing the paper. A.V.B. implemented the centroFlye algorithm. P.A.P. directed the work.

Corresponding author

Correspondence to Pavel A. Pevzner.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–7 and Figs. 1–5.

Reporting Summary

About this article

Cite this article

Bzikadze, A.V., Pevzner, P.A. Automated assembly of centromeres from ultra-long error-prone reads. Nat Biotechnol 38, 1309–1316 (2020). https://doi.org/10.1038/s41587-020-0582-4
