Abstract
Centromeric variation has been linked to cancer and infertility, but centromere sequences contain multiple tandem repeats and can only be assembled manually from long error-prone reads. Here we describe the centroFlye algorithm for centromere assembly using long error-prone reads, and apply it to assemble human centromeres on chromosomes 6 and X. Our analyses reveal putative breakpoints in the manual reconstruction of the human X centromere, demonstrate that the human X chromosome is partitioned into repeat subfamilies and provide initial insights into centromere evolution. We anticipate that centroFlye could be applied to automatically close the remaining multimegabase gaps in the reference human genome.
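To make concrete why tandem repeats complicate assembly, the following toy sketch counts k-mers in a perfect tandem array and in the same array carrying a single substitution. It is an illustration only and not the centroFlye algorithm: the repeat unit, the array length, the value of k and the function names are all hypothetical choices made for this example.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every k-mer (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def unique_kmer_fraction(seq: str, k: int) -> float:
    """Fraction of distinct k-mers that occur exactly once in seq."""
    counts = kmer_counts(seq, k)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Hypothetical 12-bp repeat unit copied five times in tandem (toy data).
unit = "ACGTTGCAAGCT"
perfect_array = unit * 5

# Same array with one substitution (C -> T) inside the third copy.
pos = 2 * len(unit) + 6
mutated_array = perfect_array[:pos] + "T" + perfect_array[pos + 1:]

K = 8  # assumed k-mer size, shorter than the repeat unit
print("perfect array:", unique_kmer_fraction(perfect_array, K))
print("mutated array:", unique_kmer_fraction(mutated_array, K))
```

In this toy, no 8-mer is unique in the perfect array, so a read drawn from it cannot be placed unambiguously, whereas the single substitution creates eight 8-mers that occur exactly once and can serve as positional anchors.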
Data availability
The centroFlye assemblies of centromeres 6 and X and all supporting data are available at Zenodo: https://doi.org/10.5281/zenodo.3897531. The ONT reads generated by the T2T consortium are deposited under accession number PRJNA559484.
Code availability
The codebase of the algorithm is available at https://github.com/seryrzu/centroFlye. The version of centroFlye that generated the assemblies described in the paper is in the branch cF_NatBiotech_paper_Xv0.8.3-6v0.1.3. Jupyter notebooks for reproducing all figures in this study are provided in the GitHub repository https://github.com/seryrzu/centroFlye_paper_scripts.
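As a minimal sketch of how the paper-specific version might be retrieved, the snippet below builds a `git clone` command for the branch named above. The helper name `clone_command` and the destination directory are assumptions introduced for this example; only the repository URL and the branch name come from the statement above, and the command is printed rather than executed.

```python
import shlex
from typing import List

REPO_URL = "https://github.com/seryrzu/centroFlye"
PAPER_BRANCH = "cF_NatBiotech_paper_Xv0.8.3-6v0.1.3"


def clone_command(repo_url: str, branch: str, dest: str) -> List[str]:
    """Build a git command that clones only the requested branch into dest."""
    return ["git", "clone", "--branch", branch, "--single-branch", repo_url, dest]


# "centroFlye" is a hypothetical destination directory name.
cmd = clone_command(REPO_URL, PAPER_BRANCH, "centroFlye")
print(shlex.join(cmd))

# To run the clone (requires git and network access), one could use:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list, rather than a single string, avoids shell quoting problems if the destination path contains spaces.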
Acknowledgements
We are indebted to I. Alexandrov, M. Kolmogorov, K. Miga and V. Shepelev for many insightful comments that improved the centroFlye algorithm. We are grateful to A. Bankevich, A. Bzikadze, T. Dvorkina, A. Mikheenko, A. Phillippy, C. Wu and J. Yuan for helpful discussions and suggestions.
Author information
Contributions
Both authors contributed to developing the centroFlye algorithm and writing the paper. A.V.B. implemented the centroFlye algorithm. P.A.P. directed the work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Notes 1–7 and Figs. 1–5.
About this article
Cite this article
Bzikadze, A.V., Pevzner, P.A. Automated assembly of centromeres from ultra-long error-prone reads. Nat Biotechnol 38, 1309–1316 (2020). https://doi.org/10.1038/s41587-020-0582-4
DOI: https://doi.org/10.1038/s41587-020-0582-4
This article is cited by
- Hybrid-hybrid correction of errors in long reads with HERO. Genome Biology (2023)
- HiCAT: a tool for automatic annotation of centromere structure. Genome Biology (2023)
- UniAligner: a parameter-free framework for fast sequence alignment. Nature Methods (2023)
- GALA: a computational framework for de novo chromosome-by-chromosome assembly with long reads. Nature Communications (2023)
- Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology (2022)