Centromeres are critical for cell division, loading CENH3 or CENPA histone variant nucleosomes, directing kinetochore formation and allowing chromosome segregation (refs. 1,2). Despite their conserved function, centromere size and structure are diverse across species. To understand this centromere paradox (refs. 3,4), it is necessary to know how centromeric diversity is generated and whether it reflects ancient trans-species variation or, instead, rapid post-speciation divergence. To address these questions, we assembled 346 centromeres from 66 Arabidopsis thaliana and 2 Arabidopsis lyrata accessions, which exhibited a remarkable degree of intra- and inter-species diversity. A. thaliana centromere repeat arrays are embedded in linkage blocks, despite ongoing internal satellite turnover, consistent with roles for unidirectional gene conversion or unequal crossover between sister chromatids in sequence diversification. Additionally, centrophilic ATHILA transposons have recently invaded the satellite arrays. To counter ATHILA invasion, chromosome-specific bursts of satellite homogenization generate higher-order repeats and purge transposons, in line with cycles of repeat evolution. Centromeric sequence changes are even more extreme in comparison between A. thaliana and A. lyrata. Together, our findings identify rapid cycles of transposon invasion and purging through satellite homogenization, which drive centromere evolution and ultimately contribute to speciation.
A. thaliana and A. lyrata accessions used for sequencing are held in the authors’ laboratories and seeds are freely available on request. The genome assemblies analysed in this study are available under the following accession numbers: (1) 48 A. thaliana HiFi assemblies have been submitted to the ENA under project number PRJEB55353 (ERP140242); (2) 15 A. thaliana HiFi assemblies have been submitted to the ENA under project number PRJEB55632 (ERA17524869); (3) 2 A. thaliana HiFi assemblies (Col-0 and Ey15-2) are available at the ENA under project number PRJEB50694 (ERP135313; ref. 7); (4) 1 A. thaliana HiFi assembly (Kew-1) from the Darwin Tree of Life is available under project accession PRJEB51511 (refs. 14,15) and can also be accessed at https://portal.darwintreeoflife.org/data/root/details/Arabidopsis%20thaliana; (5) ONT reads from the Ler-0, Cvi-0 and Tanz-0 accessions have been submitted as ArrayExpress accession E-MTAB-12009, while those for the accession Col-0 were previously available as ArrayExpress accession E-MTAB-10272 (ref. 6); (6) CENH3 Illumina ChIP–seq reads from Col-0, Ler-0, Cvi-0 and Tanz-0 have been submitted as ArrayExpress accession E-MTAB-11974; and (7) 2 A. lyrata HiFi assemblies are available at the ENA under project number PRJEB50329 (ERP134897; ref. 27).
McKinley, K. L. & Cheeseman, I. M. The molecular basis for centromere identity and function. Nat. Rev. Mol. Cell Biol. 17, 16–29 (2016).
Talbert, P. B., Masuelli, R., Tyagi, A. P., Comai, L. & Henikoff, S. Centromeric localization and adaptive evolution of an Arabidopsis histone H3 variant. Plant Cell 14, 1053–1066 (2002).
Melters, D. P. et al. Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biol. 14, R10 (2013).
Henikoff, S., Ahmad, K. & Malik, H. S. The centromere paradox: stable inheritance with rapidly evolving DNA. Science 293, 1098–1102 (2001).
Miga, K. H. & Alexandrov, I. A. Variation and evolution of human centromeres: a field guide and perspective. Annu. Rev. Genet. 55, 583–602 (2021).
Naish, M. et al. The genetic and epigenetic landscape of the centromeres. Science 374, eabi7489 (2021).
Rabanal, F. A. et al. Pushing the limits of HiFi assemblies reveals centromere diversity between two Arabidopsis thaliana genomes. Nucleic Acids Res. 50, 12309–12327 (2022).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 (2022).
1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491 (2016).
Durvasula, A. et al. African genomes illuminate the early history and transition to selfing. Proc. Natl Acad. Sci. USA 114, 5213–5218 (2017).
Novikova, P. Y. et al. Sequencing of the genus Arabidopsis identifies a complex history of nonbifurcating speciation and abundant trans-specific polymorphism. Nat. Genet. 48, 1077–1082 (2016).
Schmickl, R., Jørgensen, M. H., Brysting, A. K. & Koch, M. A. The evolutionary history of the Arabidopsis lyrata complex: a hybrid in the amphi-Beringian area closes a large distribution gap and builds up a genetic barrier. BMC Evol. Biol. 10, 98 (2010).
Darwin Tree of Life Project Consortium. Sequence locally, think globally: the Darwin Tree of Life Project. Proc. Natl Acad. Sci. USA 119, e2115642118 (2022).
Christenhusz, M. J. M. et al. The genome sequence of thale cress, Arabidopsis thaliana (Heynh., 1842). Wellcome Open Res. 8, 40 (2023).
Langley, S. A., Miga, K. H., Karpen, G. H. & Langley, C. H. Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA. eLife 8, e42989 (2019).
Dover, G. Molecular drive: a cohesive mode of species evolution. Nature 299, 111–117 (1982).
Rudd, M. K., Wray, G. A. & Willard, H. F. The evolutionary dynamics of alpha-satellite. Genome Res. 16, 88–96 (2006).
Wijnker, E. et al. The genomic landscape of meiotic crossovers and gene conversions in Arabidopsis thaliana. eLife 2, e01426 (2013).
Smith, G. P. Evolution of repeated DNA sequences by unequal crossover. Science 191, 528–535 (1976).
Talbert, P. B. & Henikoff, S. Centromeres convert but don’t cross. PLoS Biol. 8, e1000326 (2010).
Shi, J. et al. Widespread gene conversion in centromere cores. PLoS Biol. 8, e1000327 (2010).
Slotkin, R. K. The epigenetic control of the Athila family of retrotransposons in Arabidopsis. Epigenetics 5, 483–490 (2010).
Mable, B. K., Robertson, A. V., Dart, S., Di Berardo, C. & Witham, L. Breakdown of self-incompatibility in the perennial Arabidopsis lyrata (Brassicaceae) and its genetic consequences. Evolution 59, 1437–1448 (2005).
Foxe, J. P. et al. Reconstructing origins of loss of self-incompatibility and selfing in North American Arabidopsis lyrata: a population genetic context. Evolution 64, 3495–3510 (2010).
Hu, T. T. et al. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat. Genet. 43, 476–481 (2011).
Kolesnikova, U. et al. Genome of selfing Siberian Arabidopsis lyrata explains establishment of allopolyploid Arabidopsis kamchatica. Preprint at bioRxiv https://doi.org/10.1101/2022.06.24.497443 (2022).
Berr, A. et al. Chromosome arrangement and nuclear architecture but not centromeric sequences are conserved between Arabidopsis thaliana and Arabidopsis lyrata. Plant J. 48, 771–783 (2006).
Tsukahara, S. et al. Centromere-targeted de novo integrations of an LTR retrotransposon of Arabidopsis lyrata. Genes Dev. 26, 705–713 (2012).
Malik, H. S. & Eickbush, T. H. Modular evolution of the integrase domain in the Ty3/Gypsy class of LTR retrotransposons. J. Virol. 73, 5186–5190 (1999).
Nijman, I. J. & Lenstra, J. A. Mutation and recombination in cattle satellite DNA: a feedback model for the evolution of satellite DNA repeats. J. Mol. Evol. 52, 361–371 (2001).
Chatterjee, B. & Lo, C. W. Chromosomal recombination and breakage associated with instability in mouse centromeric satellite DNA. J. Mol. Biol. 210, 303–312 (1989).
Wolfgruber, T. K. et al. High quality maize centromere 10 sequence reveals evidence of frequent recombination events. Front. Plant Sci. 7, 308 (2016).
Mahtani, M. M. & Willard, H. F. Pulsed-field gel analysis of α-satellite DNA at the human X chromosome centromere: high-frequency polymorphisms and array size estimate. Genomics 7, 607–613 (1990).
Brown, S. D. & Dover, G. A. Conservation of segmental variants of satellite DNA of Mus musculus in a related species: Mus spretus. Nature 285, 47–49 (1980).
Durfy, S. J. & Willard, H. F. Concerted evolution of primate α satellite DNA. Evidence for an ancestral sequence shared by gorilla and human X chromosome α satellite. J. Mol. Biol. 216, 555–566 (1990).
Coen, E., Strachan, T. & Dover, G. Dynamics of concerted evolution of ribosomal DNA and histone gene families in the melanogaster species subgroup of Drosophila. J. Mol. Biol. 158, 17–35 (1982).
Liao, D., Pavelitz, T., Kidd, J. R., Kidd, K. K. & Weiner, A. M. Concerted evolution of the tandemly repeated genes encoding human U2 snRNA (the RNU2 locus) involves rapid intrachromosomal homogenization and rare interchromosomal gene conversion. EMBO J. 16, 588–598 (1997).
Shepelev, V. A., Alexandrov, A. A., Yurov, Y. B. & Alexandrov, I. A. The evolutionary origin of man can be traced in the layers of defunct ancestral α satellites flanking the active centromeres of human chromosomes. PLoS Genet. 5, e1000641 (2009).
Armstrong, S. J. & Jones, G. H. Female meiosis in wild-type Arabidopsis thaliana and in two meiotic mutants. Sex. Plant Reprod. 13, 177–183 (2001).
Akera, T., Trimm, E. & Lampson, M. A. Molecular strategies of meiotic cheating by selfish centromeres. Cell 178, 1132–1144 (2019).
Fishman, L. & Saunders, A. Centromere-associated female meiotic drive entails male fitness costs in monkeyflowers. Science 322, 1559–1562 (2008).
Kursel, L. E. & Malik, H. S. The cellular mechanisms and consequences of centromere drive. Curr. Opin. Cell Biol. 52, 58–65 (2018).
Hall, S. E., Luo, S., Hall, A. E. & Preuss, D. Differential rates of local and global homogenization in centromere satellites from Arabidopsis relatives. Genetics 170, 1913–1927 (2005).
Russo, A. et al. Low-input high-molecular-weight DNA extraction for long-read sequencing from plants of diverse families. Front. Plant Sci. 13, 883897 (2022).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol. 23, 258 (2022).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. Nat. Methods 16, 88–94 (2019).
Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods 19, 687–695 (2022).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Yun, T. et al. Accurate, scalable cohort variant calls using DeepVariant and GLnexus. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa1081 (2021).
Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 43, 11.10.1–11.10.33 (2013).
Browning, B. L. & Browning, S. R. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics 194, 459–471 (2013).
van der Loo, M. P. J. The stringdist package for approximate string matching. R J. 6, 111 (2014).
Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
Vollger, M. R., Kerpedjiev, P., Phillippy, A. M. & Eichler, E. E. StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps. Bioinformatics https://doi.org/10.1093/bioinformatics/btac018 (2022).
Buisine, N., Quesneville, H. & Colot, V. Improved detection and annotation of transposable elements in sequenced genomes using multiple reference sequence sets. Genomics 91, 467–475 (2008).
Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Liu, K., Linder, C. R. & Warnow, T. RAxML and FastTree: comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS ONE 6, e27731 (2011).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021).
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275 (2019).
Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa1016 (2020).
Pertea, G. & Pertea, M. GFF Utilities: GffRead and GffCompare. F1000Res 9, 304 (2020).
Lischer, H. E. L. & Excoffier, L. PGDSpider: an automated data conversion tool for connecting population genetics and genomics programs. Bioinformatics 28, 298–299 (2012).
Kozlov, A. M., Darriba, D., Flouri, T., Morel, B. & Stamatakis, A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35, 4453–4455 (2019).
Yu, G., Smith, D. K., Zhu, H., Guan, Y. & Lam, T. T.-Y. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol. Evol. 8, 28–36 (2017).
Wang, L.-G. et al. Treeio: an R package for phylogenetic tree input and output with richly annotated and associated data. Mol. Biol. Evol. 37, 599–603 (2020).
Ni, P. et al. Genome-wide detection of cytosine methylations in plant from Nanopore data using deep learning. Nat. Commun. 12, 5976 (2021).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10 (2011).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A. & Manke, T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 42, W187–W191 (2014).
We thank S. Henikoff and P. Talbert (Fred Hutchinson Cancer Research Center, USA) for kindly providing anti-CENH3 antibodies. We thank R. Durbin, C. Zhou (University of Cambridge, UK) and the Darwin Tree of Life Project for the A. thaliana ddAraThal4 Kew-1 assembly. This work was supported by BBSRC grants BB/S006842/1, BB/S020012/1 and BB/V003984/1, European Research Council Consolidator Award ERC-2015-CoG-681987, Marie Curie International Training Network ‘MEICOM’ and Human Frontier Science Program award RGP0025/2021 to I.R.H.; EMBO long-term postdoctoral fellowship ALTF224-2022 to R.B.; a Human Frontiers Science Program (HFSP) Long-Term Fellowship (LT000819/2018-L) to F.A.R.; the Max Planck Society to D.W.; European Research Council (ERC) Synergy Grant PATHOCOM (951444) from the European Union’s Horizon 2020 program to F.R. and D.W.; an ERA-CAPS 1001G+ grant to M. Nordborg and D.W.; Royal Society awards UF160222, URF\R\221024, RGF/R1/180006 and RGF/EA/201030 to A.B.; European Research Council award ERC HOW2DOBLE 101041354 to P.Y.N.; Czech Science Foundation grant no. 21-03909S to M.A.L.; a BBSRC DTP Studentship to N.G.; a Broodbank Fellowship to M. Naish; and grant PID2022-136893NB-I00 from the Ministerio de Ciencia e Innovación of Spain/Agencia Estatal de Investigación/10.13039/50110001103/FEDER, EU, to C.A.-B.
D.W. holds equity in Computomics, which advises plant breeders. D.W. consults for KWS SE, a plant breeder and seed producer. All other authors declare no competing interests.
Peer review information
Nature thanks Vincent Colot, Pierre Baduel and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Extended data figures and tables
The coverage of primary (dark blue) and secondary (light blue) alleles for PacBio HiFi read sets (Col-0, Ler-0, Cvi-0 and Tanz-1) aligned to their corresponding genome assembly. AthCEN178 arrays coordinates, from Supplementary Table 3, are indicated by grey shading. Assembly gaps are shown by red shading, 45S rDNA by purple shading, 5S rDNA by orange shading and organelle insertions by green shading. Equivalent plots for the remaining genome assemblies can be found at the following website: https://github.com/vlothec/pancentromere.
Extended Data Fig. 2 Sampling of the AthCEN178 satellome and geographic distributions of centromere similarity groups.
a, Discovery of non-redundant unique AthCEN178 variants as a function of sampled accessions, determined with 1,000 permutations. The centre of each boxplot is the median number of non-redundant unique AthCEN178 variants in the 1,000 permutations. Blue shading = 95% confidence interval. b, Heat map showing the average value of exact AthCEN178 sequence sharing between all pairs of the indicated chromosomes. c, Geographic maps are shown with accession origin coloured according to AthCEN178 similarity group, shown separately for each of the five chromosomes. d, Pairwise geographical distance (km) vs. the proportion of shared AthCEN178 sequences, for all 2,145 accession pairs. e, Histogram showing the number of AthCEN178 similarity groups that are shared when all accession pairs were compared.
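The permutation-based sampling curve in panel a can be illustrated with a short Python sketch. This is only an assumption-laden illustration of the general idea (shuffle the accession order, then count cumulative non-redundant variants), not the authors' actual pipeline; the function names, the toy input and the fixed random seed are hypothetical. The sketch also checks that 66 accessions give the 2,145 pairs mentioned in panel d.

```python
import random
from math import comb
from statistics import median


def accumulation_curves(variants_by_accession, n_permutations=1000, seed=0):
    """Return one cumulative-discovery curve per random accession ordering.

    variants_by_accession: dict mapping accession name -> set of repeat variants.
    (Hypothetical input layout, chosen for illustration only.)
    """
    rng = random.Random(seed)
    accessions = list(variants_by_accession)
    curves = []
    for _ in range(n_permutations):
        rng.shuffle(accessions)  # new random sampling order
        seen = set()
        curve = []
        for acc in accessions:
            seen |= variants_by_accession[acc]  # add newly discovered variants
            curve.append(len(seen))             # non-redundant variants so far
        curves.append(curve)
    return curves


def median_curve(curves):
    """Median number of variants at each sampling depth across permutations."""
    return [median(column) for column in zip(*curves)]


# Toy data: three made-up accessions sharing some made-up variants.
toy = {"acc1": {"v1", "v2"}, "acc2": {"v2", "v3"}, "acc3": {"v4"}}
toy_curves = accumulation_curves(toy, n_permutations=100)
toy_median = median_curve(toy_curves)

# Number of unordered accession pairs among 66 accessions.
n_pairs = comb(66, 2)
```

Each curve is non-decreasing and ends at the total number of distinct variants, so the spread between permutations reflects only the order in which accessions are sampled.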
Extended Data Fig. 3 Variation in AthCEN178 copy number and CENH3 occupancy between A. thaliana genetic lineages and accessions.
a, Total AthCEN178 copies per accession, coloured according to Eurasian (blue), Iberian non-relict (orange), non-Iberian relict (green) and Iberian relict (pink) chromosome arm SNP-PCA groups. b, Corrected total AthCEN178 FISH fluorescence intensity from anther nuclei, leaf nuclei, or mitotic chromosomes, in Col-0 (Eurasian, blue) and Tanz-1 (relict, red). All Tanz-1 samples showed significantly greater fluorescence intensity compared to Col-0 (Wilcoxon tests, all P < 1.04×10−6). c, Representative FISH micrographs for AthCEN178 (red) and ATHILA2 (green) on pachytene chromosomes of Col-0 and Tanz-1. Insets on the left are DAPI-stained images of the same cells. Scale bars = 10 μm. d, StainedGlass sequence identity heat maps for CEN1 of Eurasian (Col-0 and Ler-0) and non-Iberian relict (Cvi-0 and Tanz-1) accessions. e, CENH3 log2(ChIP/input) values (upper row) were plotted along all AthCEN178 repeats in the Col-0, Ler-0, Cvi-0 and Tanz-1 accessions. Beneath (lower row) are plots of AthCEN178 sequence variants against the consensus repeat for Col-0, Ler-0, Cvi-0 and Tanz-1.
a, Density plot of AthCEN178 HOR scores versus edit distances from the chromosome consensus, across all accessions. b, The copy number of each AthCEN178 repeat was calculated within each chromosome individually. For each chromosome, all AthCEN178 repeats were divided into 100 bins with an equal number of repeats in each bin. The counts of AthCEN178 with copy numbers of 1, 2, 3, 4, 5 and >5 were divided by the number of repeats per bin, and by the total number of chromosomes. These values were summed for each chromosome to give a total value of 1 per bin. c, Histogram of AthCEN178 HOR scores per centromere. d, Scatterplots of AthCEN178 HOR scores for each of the five chromosomes. e, StainedGlass sequence identity heat maps comparing within- and between-accession sequence identity for Ru-2 and BANI-C-1 CEN2, Cvi-0 and Med-0 CEN3 and HR-10 and 11C1 CEN5. f, Pairwise comparison of the proportion of shared versus private AthCEN178 HORs between IP-Ini-0 and BARC-A-17 CEN1, along the length of each centromere. Red lines represent a smoothing spline. g, Dot plot showing intra-centromere duplications (diagonal red lines) detected within MERE-A-13 CEN5. Horizontal and vertical dotted lines indicate intact (red) and soloLTR (blue) ATHILA.
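The binning described for panel b can be sketched in Python for a single chromosome. This is a simplified illustration under stated assumptions: copy number is taken to mean the number of exact-sequence matches of a repeat within the same chromosome, bins are formed along array order, and the averaging across chromosomes described in the caption is omitted. The function name and input layout are hypothetical.

```python
from collections import Counter

COPY_CLASSES = ["1", "2", "3", "4", "5", ">5"]


def copy_number_profile(repeats, n_bins=100):
    """Per-bin fractions of repeats in each copy-number class.

    repeats: list of repeat sequences for one chromosome, in array order
    (assumed layout). Copy number = occurrences of the exact sequence
    within this chromosome (assumed definition).
    """
    counts = Counter(repeats)
    n = len(repeats)
    profile = []
    for b in range(n_bins):
        # Integer arithmetic keeps bins as equal in size as possible.
        chunk = repeats[b * n // n_bins:(b + 1) * n // n_bins]
        tally = Counter()
        for seq in chunk:
            c = counts[seq]
            tally[str(c) if c <= 5 else ">5"] += 1
        size = len(chunk)
        # Fractions within a non-empty bin sum to 1.
        profile.append(
            {k: (tally[k] / size if size else 0.0) for k in COPY_CLASSES}
        )
    return profile


# Toy example: four made-up repeats split into two bins.
toy_profile = copy_number_profile(["A", "A", "B", "C"], n_bins=2)
```

In the toy example the first bin holds two copies of the same repeat (class "2") and the second bin holds two singletons (class "1").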
Extended Data Fig. 5 DNA methylation, CENH3 ChIP-seq enrichment, and AthCEN178 higher-order repeat (HOR) structure within the centromere regions of Col-0, Ler-0, Cvi-0 and Tanz-1.
a, CENH3 ChIP-seq enrichment (log2[ChIP/input], black) compared with AthCEN178 density in 10 kb windows on forward (red) or reverse (blue) strands along the indicated chromosome and accession. Beneath, CENH3 ChIP-seq enrichment (black) is plotted against DNA methylation (%) in CG (red), CHG (blue) and CHH (green) sequence contexts, along the entire chromosome. Beneath are close-ups of the centromere regions with AthCEN178 density (red, blue), CENH3 ChIP-seq enrichment (black) and AthCEN178 HOR score (orange) plotted. A StainedGlass sequence identity heat map is shown at the bottom (ref. 60). The centromeres in (a) are grouped on the basis of having a single AthCEN178 array that is occupied by CENH3. b, As for (a), but showing centromeres that are grouped on the basis of having distinct AthCEN178 arrays and CENH3 occupying more than one array. c, As for (a), but showing centromeres with multiple AthCEN178 arrays, only one of which is occupied by CENH3.
Extended Data Fig. 6 Variation in CENH3 coding sequence in relation to centromere AthCEN178 similarity groups.
A phylogenetic tree based on CENH3 nucleotide sequences is shown for the 66 A. thaliana accessions (left). To the right of the tree is a coloured key indicating the AthCEN178 similarity group membership for each of the five chromosomes (CEN1-CEN5) for each accession.
a, Pie charts of the proportions of centrophilic ATHILA families: (i) inside vs. outside the AthCEN178 arrays, (ii) inside vs. outside AthCEN178 arrays by chromosome, and (iii) intact vs. soloLTR located inside or outside the AthCEN178 arrays. b, Phylogenetic trees constructed with full-length centromeric ATHILA from each chromosome. The clades representing different ATHILA families are indicated by background shading, and the coloured branch tips represent AthCEN178 similarity groups. c, Representative FISH micrographs for ATHILA2 (green) and ATHILA5 (red) on pachytene chromosomes in the Col-0, Rab-1 and IP-Bus-0 accessions. Scale bars = 10 μm. d, ATHILA integration frequency along the length of the CEN178 consensus repeat. e, Counts of intact (left) and soloLTR (right) ATHILA located outside (top) or inside (bottom) the AthCEN178 arrays, ordered by chromosome arm SNP-PCA groups. f, Distribution of sequence identity between LTRs of intact ATHILA elements, comparing those located inside (red) or outside (blue) the centromeres, according to chromosome arm SNP-PCA group. Intact ATHILA within the AthCEN178 arrays had significantly higher LTR identity in the Eurasians and Iberian non-relicts, compared with the Iberian and non-Iberian relicts (Wilcoxon tests, all P < 1.78×10−6).
Extended Data Fig. 8 ATHILA diversification via de novo integration and intra-centromere duplication within A. thaliana.
a, StainedGlass sequence identity heat maps for ANGE-B-2 and ANGE-B-10 CEN4, with % GC content (green) and the density of AthCEN178 per 10 kb on forward (red) and reverse (blue) strands plotted beneath. X-axis ticks indicate intact ATHILA (pink) and soloLTR (green) insertions. ‘*’ marks insertions that are shared between ANGE-B-2 and ANGE-B-10, whereas ‘!’ indicates those unique to ANGE-B-10. b, CEN5 is shown for FERR-A-8 and FERR-A-12 with % GC content (green) and the density of AthCEN178 per 10 kb on forward (red) and reverse (blue) strands shown beneath. X-axis ticks indicate intact ATHILA (pink) and soloLTR (green) insertions. ‘*’ marks insertions that are shared between FERR-A-8 and FERR-A-12, whereas ‘!’ marks those that correspond to post-integration duplications unique to FERR-A-8. c, StainedGlass sequence identity heat maps comparing FERR-A-8 and FERR-A-12 CEN5. d, The coverage of primary FERR-A-12 (dark blue) or FERR-A-8 (brown), and secondary FERR-A-12 (light blue) or FERR-A-8 (orange) alleles of PacBio HiFi reads aligned to chromosome 5 of the FERR-A-12 (upper) or FERR-A-8 (lower) genome assemblies. AthCEN178 array coordinates, from Supplementary Table 3, are indicated by grey shading. Assembly gaps are shown by red shading. e, CENH3 log2(ChIP/input) values were plotted over ATHILA elements located within the AthCEN178 arrays of the Col-0, Ler-0, Cvi-0 and Tanz-1 accessions (n = 100), in addition to 2 kb flanking regions. Windowed mean values are shown as solid lines, with 95% confidence intervals indicated by the shaded ribbons. This is compared to 100 randomly selected loci within the AthCEN178 arrays, with the same widths as the ATHILA. Also shown are profiles across ATHILA elements located outside the AthCEN178 arrays (n = 426), and Gypsy elements located outside the AthCEN178 arrays (n = 21,487).
Extended Data Fig. 9 Autonomous and non-autonomous ATHILA elements in the collection of A. thaliana centromeres.
a, The size distribution of intact ATHILA5 elements across the 66 A. thaliana accessions is plotted. Bar plots are coloured to indicate the number of elements inside (red), or outside (blue) the AthCEN178 arrays. Three ATHILA size classes were defined; Class I for elements <8 kb, Class II between 8–12 kb, and Class III >12 kb. b, The distribution of ORF sizes (bp) in the ATHILA5 elements, in total, or by the indicated size class. Red text indicates the position of the ATHILA-ORF and intact or truncated ORFs for GAG-POL. c, A representative diagram of an intact Class III 13.3 kb autonomous ATHILA5 element, compared to a Class II 10.5 kb non-autonomous derivative. In this example, a single ~2.8 kb fragment that contains the reverse transcriptase, RNaseH and integrase genes is absent in the non-autonomous element. The green shaded areas indicate levels of sequence identity between the matching regions. d, Multiple sequence alignment of ATHILA integrase amino acid sequence from centrophilic (ATHILA1, ATHILA2, ATHILA5, ATHILA6b, rows 1–4) and centrophobic (ATHILA0, ATHILA4c, ATHILA7, ATHILA9, rows 5–8) families. The alignment starts immediately downstream of the RNase-H domain (not shown), to ensure that the N-terminus of integrase is included.
Extended Data Fig. 10 Phylogenetic analysis of centromere satellites and ATHILA in A. thaliana and A. lyrata.
a, Sequence identity dot plots comparing syntenic centromeres between A. thaliana Col (AthCol) and A. lyrata MN47 (AlyMN47), or NT1 (AlyNT1), using 80 bp windows. Red and blue indicate strand similarity (red is same, blue is opposite). b, Maximum-likelihood phylogenetic tree of Arabidopsis satellites, using randomly sampled AlyCEN168 and AlyCEN179 from A. lyrata, and AthCEN178 and AthCEN159 from six A. thaliana accessions (Bon-1, IP-Bus-0, IP-Alo-19, IP-Cas-6, Rab-1 and Tanz-1), and using Capsella rubella satellites as a root. Branch tips are coloured by satellite repeat family. A grey circle was placed on nodes where UFBoot support value exceeds 95%. c, A maximum-likelihood phylogenetic tree of AlyCEN168 and 450 AlyCEN179 satellites sampled from A. lyrata accession MN47. Thirty AthCEN178 from the A. thaliana Col-0 accession were used as an outgroup. Tree tips are coloured according to chromosome, with the exception of the outgroup sequences, which are shaded in black. A grey circle is placed on nodes where UFBoot support value exceeds 95%. d, Phylogenetic tree of full-length ATHILA elements identified in A. lyrata. Elements were assigned to families based on their relationship to A. thaliana ATHILA, as shown in Fig. 4i. The tree was rooted using a maize Huck Ty3 element. Bootstrap support is shown for key nodes. e, Distribution of lengths (kb) of non-satellite sequence gaps within the A. lyrata centromere satellite arrays. f, Total number of non-satellite gaps between 1 and 200 kb for each chromosome, across the two A. lyrata accessions MN47 and NT1.
Genome assembly parameters and metrics for 66 A. thaliana accessions. The table provides, for each of the 66 A. thaliana accessions analysed, the accession name, PCA group membership based on chromosome-arm SNPs, ecotype ID, country of origin, latitude and longitude of collection, whether DNA was extracted from multiple or single individuals, DNA extraction and shearing method, version of sequencing binding kit used, barcode name and sequence when multiplexed, Hifiasm version used for primary assembly, additional scaffolding information, information regarding data submission to ENA (project and sample IDs), HiFi read metrics (N50, total sum, number of reads and longest read), assembly metrics on scaffolded contigs (number of contigs that constituted the scaffolded chromosomes, remaining assembly gaps, largest and shortest contig, contig N50, total length, read depth mean and standard deviation), assembly quality metrics specific to centromeres (centromeres with assembly gaps, total sum of collapsed and expanded regions in megabases and sum of heterozygosity stretches in total and per centromere). The first column includes a numerical code that can be matched with the first column of Supplementary Tables 2 and 3.
Key to centromere analysis of 66 A. thaliana accessions. This table provides, for each of the 66 A. thaliana accessions analysed, the name of the associated fasta file, chromosome-arm SNP-PCA genetic group membership, accession code where available, country of origin, latitude and longitude of origin, the total number of AthCEN178 repeats and their number per chromosome, the AthCEN178 similarity group for each chromosome, the total number of AthCEN159 tandem repeats, the number of intact ATHILA and soloLTRs in total and per chromosome, the total number of AthCEN178 HORs and AthCEN178 per chromosome, and AthCEN178 HOR score in total and per chromosome. Note that the following three pairs of accessions were collected at single sites and had identical or nearly identical genomes on the basis of chromosome-arm SNPs: (1) CAMA-C-2 and CAMA-C-9, (2) BARC-A-12 and BARC-A-17, and (3) BELC-C-10 and BELC-C-12. The first column includes a numerical code that can be matched with the first column of Supplementary Tables 1 and 3.
Centromere AthCEN178 array coordinates in A. thaliana and A. lyrata. For each A. thaliana and A. lyrata accession and each chromosome, the start and end coordinates of contiguous centromere satellite arrays are listed, in addition to the widths of each array. The first column includes a numerical code that can be matched with the first column of Supplementary Tables 1 and 2.
AthCEN178 HOR parameters per accession. This table reports the average number of AthCEN178 copies, HORs, average HOR scores, average HOR length (in bp) and the average distance between HOR pairs (in bp), for each PCA group defined by chromosome-arm SNPs.
ATHILA annotation across 66 A. thaliana and 2 A. lyrata genomes. For each ATHILA family, we provide the number and proportion of intact elements and soloLTRs in the AthCEN178 arrays versus the chromosome arms, the mean percentage of LTR identity, the number of elements with identical LTRs, the distribution across chromosomes, the genic coding capacity based on the presence of Pfam hidden Markov models and the intact to soloLTR ratio. Centrophilic families that have invaded the A. thaliana AthCEN178 satellite arrays are highlighted in red. The mean of LTR identity and intact to soloLTR ratios is shown only for families with >10 intact elements.
Matching ATHILA insertions for pairs of A. thaliana accessions and estimation of divergence on the basis of sequence comparisons. Intact and soloLTR ATHILA insertions are shown for the centromeres of the following pairs of accessions: BARC-A-12 and BARC-A-17, BELC-C-10 and BELC-C-12, CAMA-C-2 and CAMA-C-9, SALE-A-10 and SALE-A-17, ANGE-B-2 and ANGE-B-10, the duplicated region of FERR-A-8 CEN5 and the shared element in CEN1 of BARC-A-17 and IP-Ini-0. Listed for each ATHILA, in columns B–S, are their start and stop coordinates, strand, quality (intact or soloLTR), total length (in bp), LTR length (in bp), target site duplication (TSD; when a TSD was not identified, the pentamer in column L shows the upstream 5 nucleotides of the ATHILA), LTR identity (%), ATHILA family assignment and a code that indicates whether elements are matched between the accessions, are unmatched (that is, ambiguous to classify) or represent a new insertion event in a specific accession. Columns T–AA contain information on the alignment of every matching ATHILA pair (substitutions, indels and alignment length), which was used to calculate divergence in generations using the formula T = K/(2 × μ), where K is the divergence calculated as (substitutions + indels)/global_alignment_length (indels of any length count as 1 event) and μ is the estimated mutation rate of 7.0 × 10−9 mutations per site per generation.
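The divergence estimate described for this table can be sketched in a few lines of Python. This is a minimal illustration, assuming the formula is read as T = K/(2μ), with K = (substitutions + indels)/alignment length and each indel counted as one event regardless of its length. The function name and the example counts are hypothetical and are not taken from the supplementary table.

```python
MUTATION_RATE = 7.0e-9  # mutations per site per generation, as stated in the text


def divergence_generations(substitutions, indels, alignment_length,
                           mu=MUTATION_RATE):
    """Estimate divergence time in generations as T = K / (2 * mu).

    K is the per-site divergence between two aligned elements; each indel
    is counted as a single event regardless of its length.
    """
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    k = (substitutions + indels) / alignment_length
    return k / (2 * mu)


# Hypothetical example: 7 substitutions and no indels over a 10,000 bp alignment.
example_t = divergence_generations(substitutions=7, indels=0,
                                   alignment_length=10_000)
```

The factor of 2 reflects that mutations accumulate independently along both lineages since the two copies diverged, so identical elements (K = 0) give an estimate of zero generations.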
StainedGlass gallery of 330 A. thaliana centromeres. An animation of StainedGlass sequence identity heat maps, generated using a 10-kb window size, for CEN1, CEN2, CEN3, CEN4 and CEN5. For each chromosome, the centromeres are shown in order of similarity, as inferred from the AthCEN178 sequence sharing heat maps in Fig. 1d. Beneath each heat map is a histogram of sequence identity percentages, showing colour assignments used in the sequence identity heat map.
Cite this article
Wlodzimierz, P., Rabanal, F.A., Burns, R. et al. Cycles of satellite and transposon evolution in Arabidopsis centromeres. Nature 618, 557–565 (2023). https://doi.org/10.1038/s41586-023-06062-z