Cycles of satellite and transposon evolution in Arabidopsis centromeres

Abstract

Centromeres are critical for cell division, loading CENH3 or CENPA histone variant nucleosomes, directing kinetochore formation and allowing chromosome segregation (refs. 1,2). Despite their conserved function, centromere size and structure are diverse across species. To understand this centromere paradox (refs. 3,4), it is necessary to know how centromeric diversity is generated and whether it reflects ancient trans-species variation or, instead, rapid post-speciation divergence. To address these questions, we assembled 346 centromeres from 66 Arabidopsis thaliana and 2 Arabidopsis lyrata accessions, which exhibited a remarkable degree of intra- and inter-species diversity. A. thaliana centromere repeat arrays are embedded in linkage blocks, despite ongoing internal satellite turnover, consistent with roles for unidirectional gene conversion or unequal crossover between sister chromatids in sequence diversification. Additionally, centrophilic ATHILA transposons have recently invaded the satellite arrays. To counter ATHILA invasion, chromosome-specific bursts of satellite homogenization generate higher-order repeats and purge transposons, in line with cycles of repeat evolution. Centromeric sequence changes are even more extreme in comparison between A. thaliana and A. lyrata. Together, our findings identify rapid cycles of transposon invasion and purging through satellite homogenization, which drive centromere evolution and ultimately contribute to speciation.

Fig. 1: High genetic diversity in the Arabidopsis pan-centromere.
Fig. 2: Dynamic genetic and epigenetic evolution of the Arabidopsis centromere arrays.
Fig. 3: Invasion of the Arabidopsis satellite arrays by centrophilic ATHILA retrotransposons.
Fig. 4: Inter-specific centromere satellite turnover and transposon dynamics between A. thaliana and A. lyrata.
Fig. 5: Cycles of satellite homogenization and ATHILA retrotransposon invasion drive Arabidopsis centromere evolution.

Data availability

A. thaliana and A. lyrata accessions used for sequencing are held in the authors’ laboratories and seeds are freely available on request. The genome assemblies analysed in this study are available under the following accession numbers: (1) 48 A. thaliana HiFi assemblies have been submitted to the ENA under project number PRJEB55353 (ERP140242); (2) 15 A. thaliana HiFi assemblies have been submitted to the ENA under project number PRJEB55632 (ERA17524869); (3) 2 A. thaliana HiFi assemblies (Col-0 and Ey15-2) are available at the ENA under project number PRJEB50694 (ERP135313; ref. 7); (4) 1 A. thaliana HiFi assembly (Kew-1) from the Darwin Tree of Life is available under project accession PRJEB51511 (refs. 14,15) and can also be accessed at https://portal.darwintreeoflife.org/data/root/details/Arabidopsis%20thaliana; (5) ONT reads from the Ler-0, Cvi-0 and Tanz-1 accessions have been submitted as ArrayExpress accession E-MTAB-12009, while those for the accession Col-0 were previously available as ArrayExpress accession E-MTAB-10272 (ref. 6); (6) CENH3 Illumina ChIP–seq reads from Col-0, Ler-0, Cvi-0 and Tanz-1 have been submitted as ArrayExpress accession E-MTAB-11974; and (7) 2 A. lyrata HiFi assemblies are available at the ENA under project number PRJEB50329 (ERP134897; ref. 27).

Code availability

The TRASH algorithm is available at https://github.com/vlothec/TRASH, the ATHILAfinder algorithm is available at https://github.com/eliasprim/ATHILAfinder and additional custom code associated with the manuscript is available at https://github.com/vlothec/pancentromere.

References

  1. McKinley, K. L. & Cheeseman, I. M. The molecular basis for centromere identity and function. Nat. Rev. Mol. Cell Biol. 17, 16–29 (2016).

  2. Talbert, P. B., Masuelli, R., Tyagi, A. P., Comai, L. & Henikoff, S. Centromeric localization and adaptive evolution of an Arabidopsis histone H3 variant. Plant Cell 14, 1053–1066 (2002).

  3. Melters, D. P. et al. Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biol. 14, R10 (2013).

  4. Henikoff, S., Ahmad, K. & Malik, H. S. The centromere paradox: stable inheritance with rapidly evolving DNA. Science 293, 1098–1102 (2001).

  5. Miga, K. H. & Alexandrov, I. A. Variation and evolution of human centromeres: a field guide and perspective. Annu. Rev. Genet. 55, 583–602 (2021).

  6. Naish, M. et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374, eabi7489 (2021).

  7. Rabanal, F. A. et al. Pushing the limits of HiFi assemblies reveals centromere diversity between two Arabidopsis thaliana genomes. Nucleic Acids Res. 50, 12309–12327 (2022).

  8. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

  9. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 (2022).

  10. 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491 (2016).

  11. Durvasula, A. et al. African genomes illuminate the early history and transition to selfing in Arabidopsis thaliana. Proc. Natl Acad. Sci. USA 114, 5213–5218 (2017).

  12. Novikova, P. Y. et al. Sequencing of the genus Arabidopsis identifies a complex history of nonbifurcating speciation and abundant trans-specific polymorphism. Nat. Genet. 48, 1077–1082 (2016).

  13. Schmickl, R., Jørgensen, M. H., Brysting, A. K. & Koch, M. A. The evolutionary history of the Arabidopsis lyrata complex: a hybrid in the amphi-Beringian area closes a large distribution gap and builds up a genetic barrier. BMC Evol. Biol. 10, 98 (2010).

  14. Darwin Tree of Life Project Consortium. Sequence locally, think globally: the Darwin Tree of Life Project. Proc. Natl Acad. Sci. USA 119, e2115642118 (2022).

  15. Christenhusz, M. J. M. et al. The genome sequence of thale cress, Arabidopsis thaliana (Heynh., 1842). Wellcome Open Res. 8, 40 (2023).

  16. Langley, S. A., Miga, K. H., Karpen, G. H. & Langley, C. H. Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA. eLife 8, e42989 (2019).

  17. Dover, G. Molecular drive: a cohesive mode of species evolution. Nature 299, 111–117 (1982).

  18. Rudd, M. K., Wray, G. A. & Willard, H. F. The evolutionary dynamics of alpha-satellite. Genome Res. 16, 88–96 (2006).

  19. Wijnker, E. et al. The genomic landscape of meiotic crossovers and gene conversions in Arabidopsis thaliana. eLife 2, e01426 (2013).

  20. Smith, G. P. Evolution of repeated DNA sequences by unequal crossover. Science 191, 528–535 (1976).

  21. Talbert, P. B. & Henikoff, S. Centromeres convert but don’t cross. PLoS Biol. 8, e1000326 (2010).

  22. Shi, J. et al. Widespread gene conversion in centromere cores. PLoS Biol. 8, e1000327 (2010).

  23. Slotkin, R. K. The epigenetic control of the Athila family of retrotransposons in Arabidopsis. Epigenetics 5, 483–490 (2010).

  24. Mable, B. K., Robertson, A. V., Dart, S., Di Berardo, C. & Witham, L. Breakdown of self-incompatibility in the perennial Arabidopsis lyrata (Brassicaceae) and its genetic consequences. Evolution 59, 1437–1448 (2005).

  25. Foxe, J. P. et al. Reconstructing origins of loss of self-incompatibility and selfing in North American Arabidopsis lyrata: a population genetic context. Evolution 64, 3495–3510 (2010).

  26. Hu, T. T. et al. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat. Genet. 43, 476–481 (2011).

  27. Kolesnikova, U. et al. Genome of selfing Siberian Arabidopsis lyrata explains establishment of allopolyploid Arabidopsis kamchatica. Preprint at bioRxiv https://doi.org/10.1101/2022.06.24.497443 (2022).

  28. Berr, A. et al. Chromosome arrangement and nuclear architecture but not centromeric sequences are conserved between Arabidopsis thaliana and Arabidopsis lyrata. Plant J. 48, 771–783 (2006).

  29. Tsukahara, S. et al. Centromere-targeted de novo integrations of an LTR retrotransposon of Arabidopsis lyrata. Genes Dev. 26, 705–713 (2012).

  30. Malik, H. S. & Eickbush, T. H. Modular evolution of the integrase domain in the Ty3/Gypsy class of LTR retrotransposons. J. Virol. 73, 5186–5190 (1999).

  31. Nijman, I. J. & Lenstra, J. A. Mutation and recombination in cattle satellite DNA: a feedback model for the evolution of satellite DNA repeats. J. Mol. Evol. 52, 361–371 (2001).

  32. Chatterjee, B. & Lo, C. W. Chromosomal recombination and breakage associated with instability in mouse centromeric satellite DNA. J. Mol. Biol. 210, 303–312 (1989).

  33. Wolfgruber, T. K. et al. High quality maize centromere 10 sequence reveals evidence of frequent recombination events. Front. Plant Sci. 7, 308 (2016).

  34. Mahtani, M. M. & Willard, H. F. Pulsed-field gel analysis of α-satellite DNA at the human X chromosome centromere: high-frequency polymorphisms and array size estimate. Genomics 7, 607–613 (1990).

  35. Brown, S. D. & Dover, G. A. Conservation of segmental variants of satellite DNA of Mus musculus in a related species: Mus spretus. Nature 285, 47–49 (1980).

  36. Durfy, S. J. & Willard, H. F. Concerted evolution of primate α satellite DNA. Evidence for an ancestral sequence shared by gorilla and human X chromosome α satellite. J. Mol. Biol. 216, 555–566 (1990).

  37. Coen, E., Strachan, T. & Dover, G. Dynamics of concerted evolution of ribosomal DNA and histone gene families in the melanogaster species subgroup of Drosophila. J. Mol. Biol. 158, 17–35 (1982).

  38. Liao, D., Pavelitz, T., Kidd, J. R., Kidd, K. K. & Weiner, A. M. Concerted evolution of the tandemly repeated genes encoding human U2 snRNA (the RNU2 locus) involves rapid intrachromosomal homogenization and rare interchromosomal gene conversion. EMBO J. 16, 588–598 (1997).

  39. Shepelev, V. A., Alexandrov, A. A., Yurov, Y. B. & Alexandrov, I. A. The evolutionary origin of man can be traced in the layers of defunct ancestral α satellites flanking the active centromeres of human chromosomes. PLoS Genet. 5, e1000641 (2009).

  40. Armstrong, S. J. & Jones, G. H. Female meiosis in wild-type Arabidopsis thaliana and in two meiotic mutants. Sex. Plant Reprod. 13, 177–183 (2001).

  41. Akera, T., Trimm, E. & Lampson, M. A. Molecular strategies of meiotic cheating by selfish centromeres. Cell 178, 1132–1144 (2019).

  42. Fishman, L. & Saunders, A. Centromere-associated female meiotic drive entails male fitness costs in monkeyflowers. Science 322, 1559–1562 (2008).

  43. Kursel, L. E. & Malik, H. S. The cellular mechanisms and consequences of centromere drive. Curr. Opin. Cell Biol. 52, 58–65 (2018).

  44. Hall, S. E., Luo, S., Hall, A. E. & Preuss, D. Differential rates of local and global homogenization in centromere satellites from Arabidopsis relatives. Genetics 170, 1913–1927 (2005).

  45. Russo, A. et al. Low-input high-molecular-weight DNA extraction for long-read sequencing from plants of diverse families. Front. Plant Sci. 13, 883897 (2022).

  46. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).

  47. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  48. Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol. 23, 258 (2022).

  49. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  50. Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).

  51. Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods 19, 687–695 (2022).

  52. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  53. Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).

  54. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

  55. Yun, T. et al. Accurate, scalable cohort variant calls using DeepVariant and GLnexus. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa1081 (2021).

  56. Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 43, 11.10.1–11.10.33 (2013).

  57. Browning, B. L. & Browning, S. R. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics 194, 459–471 (2013).

  58. van der Loo, M. P. J. The stringdist package for approximate string matching. R J. 6, 111 (2014).

  59. Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).

  60. Vollger, M. R., Kerpedjiev, P., Phillippy, A. M. & Eichler, E. E. StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps. Bioinformatics https://doi.org/10.1093/bioinformatics/btac018 (2022).

  61. Buisine, N., Quesneville, H. & Colot, V. Improved detection and annotation of transposable elements in sequenced genomes using multiple reference sequence sets. Genomics 91, 467–475 (2008).

  62. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).

  63. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

  64. Liu, K., Linder, C. R. & Warnow, T. RAxML and FastTree: comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS ONE 6, e27731 (2011).

  65. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021).

  66. Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275 (2019).

  67. Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa1016 (2020).

  68. Pertea, G. & Pertea, M. GFF Utilities: GffRead and GffCompare. F1000Res 9, 304 (2020).

  69. Lischer, H. E. L. & Excoffier, L. PGDSpider: an automated data conversion tool for connecting population genetics and genomics programs. Bioinformatics 28, 298–299 (2012).

  70. Kozlov, A. M., Darriba, D., Flouri, T., Morel, B. & Stamatakis, A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35, 4453–4455 (2019).

  71. Yu, G., Smith, D. K., Zhu, H., Guan, Y. & Lam, T. T.-Y. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol. Evol. 8, 28–36 (2017).

  72. Wang, L.-G. et al. Treeio: an R package for phylogenetic tree input and output with richly annotated and associated data. Mol. Biol. Evol. 37, 599–603 (2020).

  73. Ni, P. et al. Genome-wide detection of cytosine methylations in plant from Nanopore data using deep learning. Nat. Commun. 12, 5976 (2021).

  74. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10 (2011).

  75. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

  76. Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A. & Manke, T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 42, W187–W191 (2014).

Acknowledgements

We thank S. Henikoff and P. Talbert (Fred Hutchinson Cancer Research Center, USA) for kindly providing anti-CENH3 antibodies. We thank R. Durbin, C. Zhou (University of Cambridge, UK) and the Darwin Tree of Life Project for the A. thaliana ddAraThal4 Kew-1 assembly. This work was supported by BBSRC grants BB/S006842/1, BB/S020012/1 and BB/V003984/1, European Research Council Consolidator Award ERC-2015-CoG-681987, Marie Curie International Training Network ‘MEICOM’ and Human Frontier Science Program award RGP0025/2021 to I.R.H.; EMBO long-term postdoctoral fellowship ALTF224-2022 to R.B.; a Human Frontier Science Program (HFSP) Long-Term Fellowship (LT000819/2018-L) to F.A.R.; the Max Planck Society to D.W.; European Research Council (ERC) Synergy Grant PATHOCOM (951444) from the European Union’s Horizon 2020 program to F.R. and D.W.; an ERA-CAPS 1001G+ grant to M. Nordborg and D.W.; Royal Society awards UF160222, URF\R\221024, RGF/R1/180006 and RGF/EA/201030 to A.B.; European Research Council award ERC HOW2DOBLE 101041354 to P.Y.N.; Czech Science Foundation grant no. 21-03909S to M.A.L.; a BBSRC DTP Studentship to N.G.; a Broodbank Fellowship to M. Naish; and grant PID2022-136893NB-I00 from the Ministerio de Ciencia e Innovación of Spain/Agencia Estatal de Investigación/10.13039/50110001103/FEDER, EU, to C.A.-B.

Author information

Contributions

P.W., F.A.R., M. Naish, G.S., F.R., C.A.-B., M.A.L., P.Y.N., A.B., D.W. and I.R.H. designed the study. Genomic DNA extractions and sequencing were performed by F.A.R., M. Naish, A.S., N.G., K.F., A.H., C.L., T.S., M.C., M.M. and G.S. FISH experiments were performed by T.M. CENH3 ChIP–seq and DNA methylation profiling were performed by M. Naish. P.W., F.A.R., R.B., M. Naish, E.P., A.S., T.M., N.G., A.J.T., C.P., G.S., M.C., M. Nordborg, M.A.L., D.H., P.Y.N., A.B., D.W. and I.R.H. analysed the results. P.W., F.A.R., R.B., P.Y.N., A.B., D.W. and I.R.H. wrote the paper, with input from all authors.

Corresponding authors

Correspondence to Alexandros Bousios, Detlef Weigel or Ian R. Henderson.

Ethics declarations

Competing interests

D.W. holds equity in Computomics, which advises plant breeders. D.W. consults for KWS SE, a plant breeder and seed producer. All other authors declare no competing interests.

Peer review

Peer review information

Nature thanks Vincent Colot, Pierre Baduel and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Validation of A. thaliana centromere assemblies.

The coverage of primary (dark blue) and secondary (light blue) alleles for PacBio HiFi read sets (Col-0, Ler-0, Cvi-0 and Tanz-1) aligned to their corresponding genome assembly. AthCEN178 array coordinates, from Supplementary Table 3, are indicated by grey shading. Assembly gaps are shown by red shading, 45S rDNA by purple shading, 5S rDNA by orange shading and organelle insertions by green shading. Equivalent plots for the remaining genome assemblies can be found at the following website: https://github.com/vlothec/pancentromere.

Extended Data Fig. 2 Sampling of the AthCEN178 satellome and geographic distributions of centromere similarity groups.

a, Discovery of non-redundant unique AthCEN178 variants as a function of sampled accessions, determined with 1,000 permutations. The centre of each boxplot is the median number of non-redundant unique AthCEN178 variants in the 1,000 permutations. Blue shading = 95% confidence interval. b, Heat map showing the average value of exact AthCEN178 sequence sharing between all pairs of the indicated chromosomes. c, Geographic maps are shown with accession origin coloured according to AthCEN178 similarity group, shown separately for each of the five chromosomes. d, Pairwise geographical distance (km) vs. the proportion of shared AthCEN178 sequences, for all 2,145 accession pairs. e, Histogram showing the number of AthCEN178 similarity groups that are shared when all accession pairs were compared.
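The permutation sampling described for panel a can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name discovery_curve, the toy accession names and the toy sequences are all hypothetical, and the sketch assumes each accession is represented simply as a set of exact repeat sequences. The last line checks that 66 accessions give the 2,145 accession pairs mentioned for panel d.

```python
import math
import random
from statistics import median


def discovery_curve(variants_by_accession, n_permutations=1000, seed=0):
    """Median number of distinct variants seen after sampling 1..N accessions.

    variants_by_accession maps a (hypothetical) accession name to the set of
    exact repeat sequences found in that accession.
    """
    rng = random.Random(seed)
    names = list(variants_by_accession)
    # counts[k] collects, per permutation, the variants seen in the first k+1 accessions
    counts = [[] for _ in names]
    for _ in range(n_permutations):
        rng.shuffle(names)
        seen = set()
        for k, name in enumerate(names):
            seen |= variants_by_accession[name]
            counts[k].append(len(seen))
    return [median(c) for c in counts]


# Toy input standing in for real repeat annotations (assumption, not real data).
toy = {
    "acc1": {"AAGC", "AAGT"},
    "acc2": {"AAGC", "TTGC"},
    "acc3": {"AAGC"},
}
curve = discovery_curve(toy, n_permutations=200)

# Number of unordered accession pairs among 66 accessions.
n_pairs = math.comb(66, 2)
```

The curve is non-decreasing by construction, and its final value equals the number of distinct variants across all accessions, which is the saturation behaviour such a plot is meant to show.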

Extended Data Fig. 3 Variation in AthCEN178 copy number and CENH3 occupancy between A. thaliana genetic lineages and accessions.

a, Total AthCEN178 copies per accession, coloured according to Eurasian (blue), Iberian non-relict (orange), non-Iberian relict (green) and Iberian relict (pink) chromosome arm SNP-PCA groups. b, Corrected total AthCEN178 FISH fluorescence intensity from anther nuclei, leaf nuclei, or mitotic chromosomes, in Col-0 (Eurasian, blue) and Tanz-1 (relict, red). All Tanz-1 samples showed significantly greater fluorescence intensity compared to Col-0 (Wilcoxon tests all P < 1.04×10−6). c, Representative FISH micrographs for AthCEN178 (red) and ATHILA2 (green) on pachytene chromosomes of Col-0 and Tanz-1. Insets on the left are DAPI-stained images of the same cells. Scale bars = 10 μm. d, StainedGlass sequence identity heat maps for CEN1 of Eurasian (Col-0 and Ler-0) and non-Iberian relict (Cvi-0 and Tanz-1) accessions. e, CENH3 log2(ChIP/input) values (upper row) were plotted along all AthCEN178 repeats in the Col-0, Ler-0, Cvi-0 and Tanz-1 accessions. Beneath (lower row) are plots of AthCEN178 sequence variants against the consensus repeat for Col-0, Ler-0, Cvi-0 and Tanz-1.

Extended Data Fig. 4 AthCEN178 HORs and dynamic centromere evolution in A. thaliana.

a, Density plot of AthCEN178 HOR scores versus edit distances from the chromosome consensus, across all accessions. b, The copy number of each AthCEN178 repeat was calculated within each chromosome individually. For each chromosome, all AthCEN178 repeats were divided into 100 bins with an equal number of repeats in each bin. The counts of AthCEN178 with copy numbers of 1, 2, 3, 4, 5 and >5 were divided by the number of repeats per bin, and by the total number of chromosomes. These values were summed for each chromosome to give a total value of 1 per bin. c, Histogram of AthCEN178 HOR scores per centromere. d, Scatterplots of AthCEN178 HOR scores for each of the five chromosomes. e, StainedGlass sequence identity heat maps comparing within- and between-accession sequence identity for Ru-2 and BANI-C-1 CEN2, Cvi-0 and Med-0 CEN3 and HR-10 and 11C1 CEN5. f, Pairwise comparison of the proportion of shared versus private AthCEN178 HORs between IP-Ini-0 and BARC-A-17 CEN1, along the length of each centromere. Red lines represent a smoothing spline. g, Dot plot showing intra-centromere duplications (diagonal red lines) detected within MERE-A-13 CEN5. Horizontal and vertical dotted lines indicate intact (red) and soloLTR (blue) ATHILA.
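The binning described for panel b might look like the sketch below for a single chromosome. It is an illustration under stated assumptions rather than the published pipeline: copy number is taken to mean the count of exactly identical repeat sequences on that chromosome, leftover repeats that do not fill a whole bin are dropped, and the names copy_number_profile and CATEGORIES are hypothetical. Averaging the per-bin fractions over chromosomes from many accessions, as the legend describes, would be a further step not shown here.

```python
from collections import Counter

# Copy-number categories named in the legend.
CATEGORIES = ("1", "2", "3", "4", "5", ">5")


def copy_number_profile(repeats, n_bins=100):
    """Fraction of repeats per copy-number category in equal-sized bins.

    repeats is the ordered list of repeat sequences along one chromosome.
    """
    copies = Counter(repeats)  # exact-match copy number within this chromosome
    per_bin = len(repeats) // n_bins  # assumption: trailing remainder is dropped
    if per_bin == 0:
        raise ValueError("fewer repeats than bins")
    profile = []
    for b in range(n_bins):
        chunk = repeats[b * per_bin:(b + 1) * per_bin]
        tally = Counter(">5" if copies[r] > 5 else str(copies[r]) for r in chunk)
        profile.append({c: tally[c] / per_bin for c in CATEGORIES})
    return profile


# Toy chromosome with single letters standing in for 178-bp repeats.
toy_repeats = ["A", "A", "B", "C", "C", "C", "D", "A"]
profile = copy_number_profile(toy_repeats, n_bins=2)
```

Because every repeat in a bin falls into exactly one category, the fractions in each bin sum to 1, which matches the legend's statement that the values total 1 per bin.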

Extended Data Fig. 5 DNA methylation, CENH3 ChIP-seq enrichment, and AthCEN178 higher-order repeat (HOR) structure within the centromere regions of Col-0, Ler-0, Cvi-0 and Tanz-1.

a, CENH3 ChIP-seq enrichment (log2[ChIP/input], black) compared with AthCEN178 density in 10 kb windows on forward (red) or reverse (blue) strands along the indicated chromosome and accession. Beneath, CENH3 ChIP-seq enrichment (black) is plotted against DNA methylation (%) in CG (red), CHG (blue) and CHH (green) sequence contexts, along the entire chromosome. Beneath are close-ups of the centromere regions with AthCEN178 density (red, blue), CENH3 ChIP-seq enrichment (black) and AthCEN178 HOR score (orange) plotted. A StainedGlass sequence identity heat map is shown at the bottom (ref. 60). The centromeres in (a) are grouped on the basis of having a single AthCEN178 array that is occupied by CENH3. b, As for (a), but showing centromeres that are grouped on the basis of having distinct AthCEN178 arrays and CENH3 occupying more than one array. c, As for (a), but showing centromeres with multiple AthCEN178 arrays, only one of which is occupied by CENH3.

Extended Data Fig. 6 Variation in CENH3 coding sequence in relation to centromere AthCEN178 similarity groups.

A phylogenetic tree based on CENH3 nucleotide sequences is shown for the 66 A. thaliana accessions (left). To the right of the tree is a coloured key indicating the AthCEN178 similarity group membership for each of the five chromosomes (CEN1-CEN5) for each accession.

Extended Data Fig. 7 Centrophilic and centrophobic ATHILA in A. thaliana.

a, Pie charts of the proportions of centrophilic ATHILA families: (i) inside vs. outside the AthCEN178 arrays, (ii) inside vs. outside AthCEN178 arrays by chromosome, and (iii) intact vs. soloLTR located inside or outside the AthCEN178 arrays. b, Phylogenetic trees constructed with full-length centromeric ATHILA from each chromosome. The clades representing different ATHILA families are indicated by background shading, and the coloured branch tips represent AthCEN178 similarity groups. c, Representative FISH micrographs for ATHILA2 (green) and ATHILA5 (red) on pachytene chromosomes in the Col-0, Rab-1 and IP-Bus-0 accessions. Scale bars = 10 μm. d, ATHILA integration frequency along the length of the CEN178 consensus repeat. e, Counts of intact (left) and soloLTR (right) ATHILA located outside (top) or inside (bottom) the AthCEN178 arrays, ordered by chromosome arm SNP-PCA groups. f, Distribution of sequence identity between LTRs of intact ATHILA elements, comparing those located inside (red) or outside (blue) the centromeres, according to chromosome arm SNP-PCA group. Intact ATHILA within the AthCEN178 arrays had significantly higher LTR identity in the Eurasians and Iberian non-relicts, compared with the Iberian and non-Iberian relicts (Wilcoxon tests all P < 1.78×10−6).

Extended Data Fig. 8 ATHILA diversification via de novo integration and intra-centromere duplication within A. thaliana.

a, StainedGlass sequence identity heat maps for ANGE-B-2 and ANGE-B-10 CEN4, with % GC content (green) and the density of AthCEN178 per 10 kb on forward (red) and reverse (blue) strands plotted beneath. X-axis ticks indicate intact ATHILA (pink) and soloLTR (green) insertions. ‘*’ marks insertions that are shared between ANGE-B-2 and ANGE-B-10, whereas ‘!’ indicates those unique to ANGE-B-10. b, CEN5 is shown for FERR-A-8 and FERR-A-12 with % GC content (green) and the density of AthCEN178 per 10 kb on forward (red) and reverse (blue) strands shown beneath. X-axis ticks indicate intact ATHILA (pink) and soloLTR (green) insertions. ‘*’ marks insertions that are shared between FERR-A-8 and FERR-A-12, whereas ‘!’ marks those that correspond to post-integration duplications unique to FERR-A-8. c, StainedGlass sequence identity heat maps comparing FERR-A-8 and FERR-A-12 CEN5. d, The coverage of primary FERR-A-12 (dark blue) or FERR-A-8 (brown), and secondary FERR-A-12 (light blue) or FERR-A-8 (orange) alleles of PacBio HiFi reads aligned to chromosome 5 of the FERR-A-12 (upper), or FERR-A-8 (lower), genome assemblies. AthCEN178 array coordinates, from Supplementary Table 3, are indicated by grey shading. Assembly gaps are shown by red shading. e, CENH3 log2(ChIP/input) values were plotted over ATHILA elements located within the AthCEN178 arrays of the Col-0, Ler-0, Cvi-0 and Tanz-1 accessions (n = 100), in addition to 2 kb flanking regions. Windowed mean values are shown as solid lines, with 95% confidence intervals indicated by the shaded ribbons. This is compared to 100 randomly selected loci within the AthCEN178 arrays, with the same widths as the ATHILA. Also shown are profiles across ATHILA elements located outside the AthCEN178 arrays (n = 426), and Gypsy elements located outside the AthCEN178 arrays (n = 21,487).

Extended Data Fig. 9 Autonomous and non-autonomous ATHILA elements in the collection of A. thaliana centromeres.

a, The size distribution of intact ATHILA5 elements across the 66 A. thaliana accessions is plotted. Bar plots are coloured to indicate the number of elements inside (red), or outside (blue) the AthCEN178 arrays. Three ATHILA size classes were defined: Class I for elements <8 kb, Class II for elements of 8–12 kb and Class III for elements >12 kb. b, The distribution of ORF sizes (bp) in the ATHILA5 elements, in total, or by the indicated size class. Red text indicates the position of the ATHILA-ORF and intact or truncated ORFs for GAG-POL. c, A representative diagram of an intact Class III 13.3 kb autonomous ATHILA5 element, compared to a Class II 10.5 kb non-autonomous derivative. In this example, a single ~2.8 kb fragment that contains the reverse transcriptase, RNaseH and integrase genes is absent in the non-autonomous element. The green shaded areas indicate levels of sequence identity between the matching regions. d, Multiple sequence alignment of ATHILA integrase amino acid sequence from centrophilic (ATHILA1, ATHILA2, ATHILA5, ATHILA6b, rows 1–4) and centrophobic (ATHILA0, ATHILA4c, ATHILA7, ATHILA9, rows 5–8) families. The alignment starts immediately downstream of the RNase-H domain (not shown), to ensure that the N-terminus of integrase is included.
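The three size classes in panel a amount to a simple threshold rule, sketched below. The function name athila_size_class is hypothetical, and treating elements of exactly 8 kb or 12 kb as Class II is an assumption, because the legend does not say how the boundaries are handled. The two example lengths are the 13.3 kb and 10.5 kb elements mentioned for panel c.

```python
def athila_size_class(length_bp):
    """Assign an intact ATHILA5 element to a size class from its length in bp."""
    kb = length_bp / 1000
    if kb < 8:
        return "I"
    if kb <= 12:  # assumption: boundaries at 8 kb and 12 kb fall in Class II
        return "II"
    return "III"


# Lengths taken from the panel c example in the legend.
autonomous_bp = 13_300
non_autonomous_bp = 10_500
classes = (athila_size_class(autonomous_bp), athila_size_class(non_autonomous_bp))

# Size of the fragment missing from the non-autonomous element, in bp.
missing_bp = autonomous_bp - non_autonomous_bp
```

The difference between the two example lengths, 2,800 bp, agrees with the ~2.8 kb fragment that the legend reports as absent from the non-autonomous element.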

Extended Data Fig. 10 Phylogenetic analysis of centromere satellites and ATHILA in A. thaliana and A. lyrata.

a, Sequence identity dot plots comparing syntenic centromeres between A. thaliana Col (AthCol) and A. lyrata MN47 (AlyMN47), or NT1 (AlyNT1), using 80 bp windows. Red and blue indicate strand similarity (red is same, blue is opposite). b, Maximum-likelihood phylogenetic tree of Arabidopsis satellites, using randomly sampled AlyCEN168 and AlyCEN179 from A. lyrata, and AthCEN178 and AthCEN159 from six A. thaliana accessions (Bon-1, IP-Bus-0, IP-Alo-19, IP-Cas-6, Rab-1 and Tanz-1), and using Capsella rubella satellites as a root. Branch tips are coloured by satellite repeat family. A grey circle was placed on nodes where UFBoot support value exceeds 95%. c, A maximum-likelihood phylogenetic tree of AlyCEN168 and 450 AlyCEN179 satellites sampled from A. lyrata accession MN47. Thirty AthCEN178 from the A. thaliana Col-0 accession were used as an outgroup. Tree tips are coloured according to chromosome, with the exception of the outgroup sequences, which are shaded in black. A grey circle is placed on nodes where UFBoot support value exceeds 95%. d, Phylogenetic tree of full-length ATHILA elements identified in A. lyrata. Elements were assigned to families based on their relationship to A. thaliana ATHILA, as shown in Fig. 4i. The tree was rooted using a maize Huck Ty3 element. Bootstrap support is shown for key nodes. e, Distribution of lengths (kb) of non-satellite sequence gaps within the A. lyrata centromere satellite arrays. f, Total number of non-satellite gaps between 1 and 200 kb for each chromosome, across the two A. lyrata accessions MN47 and NT1.

Supplementary information

Reporting Summary

Peer Review File

Supplementary Table 1

Genome assembly parameters and metrics for 66 A. thaliana accessions. The table provides, for each of the 66 A. thaliana accessions analysed, the accession name, PCA group membership based on chromosome-arm SNPs, ecotype ID, country of origin, latitude and longitude of collection, whether DNA was extracted from multiple or single individuals, DNA extraction and shearing method, version of sequencing binding kit used, barcode name and sequence when multiplexed, Hifiasm version used for primary assembly, additional scaffolding information, information regarding data submission to ENA (project and sample IDs), HiFi read metrics (N50, total sum, number of reads and longest read), assembly metrics on scaffolded contigs (number of contigs that constituted the scaffolded chromosomes, remaining assembly gaps, largest and shortest contig, contig N50, total length, read depth mean and standard deviation), assembly quality metrics specific to centromeres (centromeres with assembly gaps, total sum of collapsed and expanded regions in megabases and sum of heterozygosity stretches in total and per centromere). The first column includes a numerical code that can be matched with the first column of Supplementary Tables 2 and 3.

Supplementary Table 2

Key to centromere analysis of 66 A. thaliana accessions. This table provides, for each of the 66 A. thaliana accessions analysed, the name of the associated fasta file, chromosome-arm SNP-PCA genetic group membership, accession code where available, country of origin, latitude and longitude of origin, the total number of AthCEN178 repeats and their number per chromosome, the AthCEN178 similarity group for each chromosome, the total number of AthCEN159 tandem repeats, the number of intact ATHILA and soloLTRs in total and per chromosome, the total number of AthCEN178 HORs and AthCEN178 per chromosome, and AthCEN178 HOR score in total and per chromosome. Note that the following three pairs of accessions were collected at single sites and had identical or nearly identical genomes on the basis of chromosome-arm SNPs: (1) CAMA-C-2 and CAMA-C-9, (2) BARC-A-12 and BARC-A-17, and (3) BELC-C-10 and BELC-C-12. The first column includes a numerical code that can be matched with the first column of Supplementary Tables 1 and 3.

Supplementary Table 3

Centromere AthCEN178 array coordinates in A. thaliana and A. lyrata. For each A. thaliana and A. lyrata accession and each chromosome, the start and end coordinates of contiguous centromere satellite arrays are listed, in addition to the widths of each array. The first column includes a numerical code that can be matched with the first column of Supplementary Tables 1 and 2.

Supplementary Table 4

AthCEN178 HOR parameters per accession. This table reports the average number of AthCEN178 copies, HORs, average HOR scores, average HOR length (in bp) and the average distance between HOR pairs (in bp), for each PCA group defined by chromosome-arm SNPs.

Supplementary Table 5

ATHILA annotation across 66 A. thaliana and 2 A. lyrata genomes. For each ATHILA family, we provide the number and proportion of intact elements and soloLTRs in the AthCEN178 arrays versus the chromosome arms, the mean percentage of LTR identity, the number of elements with identical LTRs, the distribution across chromosomes, the genic coding capacity based on the presence of Pfam hidden Markov models and the intact to soloLTR ratio. Centrophilic families that have invaded the A. thaliana AthCEN178 satellite arrays are highlighted in red. The mean of LTR identity and intact to soloLTR ratios is shown only for families with >10 intact elements.

Supplementary Table 6

Matching ATHILA insertions for pairs of A. thaliana accessions and estimation of divergence on the basis of sequence comparisons. Intact and soloLTR ATHILA insertions are shown for the centromeres of the following pairs of accessions: BARC-A-12 and BARC-A-17, BELC-C-10 and BELC-C-12, CAMA-C-2 and CAMA-C-9, SALE-A-10 and SALE-A-17, ANGE-B-2 and ANGE-B-10, the duplicated region of FERR-A-8 CEN5 and the shared element in CEN1 of BARC-A-17 and IP-Ini-0. Listed for each ATHILA, in columns B–S, are their start and stop coordinates, strand, quality (intact or soloLTR), total length (in bp), LTR length (in bp), target site duplication (TSD; when a TSD was not identified, the pentamer in column L shows the upstream 5 nucleotides of the ATHILA), LTR identity (%), ATHILA family assignment and a code that indicates whether elements are matched between the accessions, are unmatched (that is, ambiguous to classify) or represent a new insertion event in a specific accession. Columns T–AA contain information on the alignment of every matching ATHILA pair (substitutions, indels and alignment length), which was used to calculate divergence in generations using the formula T = K/(2 × μ), where K is the divergence calculated as (substitutions + indels)/global_alignment_length (indels of any length count as 1 event) and μ is the estimated mutation rate of 7.0 × 10⁻⁹ mutations per site per generation.
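The divergence estimate described for columns T–AA can be illustrated with a short worked calculation. This is a minimal Python sketch, not the authors' script: it reads the formula as T = K/(2μ), that is, K divided by twice the mutation rate, and the function name `divergence_generations` and the example counts are hypothetical.

```python
MUTATION_RATE = 7.0e-9  # mutations per site per generation, as stated in the table description


def divergence_generations(substitutions: int, indels: int,
                           alignment_length: int,
                           mu: float = MUTATION_RATE) -> float:
    """Estimate divergence time in generations as T = K / (2 * mu).

    K = (substitutions + indels) / alignment_length, where each indel
    counts as a single event regardless of its length.
    """
    if alignment_length <= 0:
        raise ValueError("alignment length must be positive")
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    k = (substitutions + indels) / alignment_length
    return k / (2 * mu)


# Hypothetical pair: 10 substitutions and 4 indel events over a 10,000 bp alignment.
# K = 14 / 10,000 = 0.0014, so T = 0.0014 / 1.4e-8, roughly 100,000 generations.
print(round(divergence_generations(10, 4, 10_000)))
```

The factor of 2 reflects that mutations accumulate independently along both lineages after the two copies diverge, so a pair with identical sequences (K = 0) yields T = 0.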

Supplementary Video 1

StainedGlass gallery of 330 A. thaliana centromeres. An animation of StainedGlass sequence identity heat maps, generated using a 10-kb window size, for CEN1, CEN2, CEN3, CEN4 and CEN5. For each chromosome, the centromeres are shown in order of similarity, as inferred from the AthCEN178 sequence sharing heat maps in Fig. 1d. Beneath each heat map is a histogram of sequence identity percentages, showing colour assignments used in the sequence identity heat map.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wlodzimierz, P., Rabanal, F.A., Burns, R. et al. Cycles of satellite and transposon evolution in Arabidopsis centromeres. Nature 618, 557–565 (2023). https://doi.org/10.1038/s41586-023-06062-z

  • DOI: https://doi.org/10.1038/s41586-023-06062-z
