Chromosome-level assemblies of multiple Arabidopsis genomes reveal hotspots of rearrangements with altered evolutionary dynamics

We report chromosome-level, reference-quality assemblies of seven Arabidopsis thaliana accessions selected across the global range of this predominately ruderal plant. Each genome revealed between 13-17 Mb rearranged and 5-6 Mb novel sequence introducing copy-number changes in ∼5,000 genes, including ∼1,900 genes which are not part of the current reference annotation. Analyzing the collinearity between the genomes revealed ∼350 regions (4.1% of the euchromatin) where accession-specific tandem duplications destroyed the syntenic gene order between the genomes. These hotspots of rearrangements were characterized by the loss of meiotic recombination in hybrids within these regions and the enrichment of genes implicated in biotic stress response. Together this suggests that hotspots of rearrangements are governed by altered evolutionary dynamics as compared to the rest of the genome, which are based on new mutations and not on the recombination of existing variation, and thereby enable a quick response to the ever-evolving challenges of biotic stress.


2
The individual genomes of sexually reproducing species are typically highly collinear to 29 enable physical exchange of alleles (chromosome arms) during meiosis. This exchange 30 ensures the generation of diversity and the removal of deleterious alleles 1-3 and at the time 31 protects the offspring from major mutations changing the karyotype of a genome 4,5 . Despite 32 the obvious importance to preserve a common karyotype, the presence of genomic 33 rearrangements suggests that the genomes are in fact not entirely collinear. Genomic 34 rearrangements (and the resulting lack of allelic exchange) have been shown to contribute 35 to population diversification including the evolution of different sexes 6,7 or life-history traits 8 . 36 But even though the absence of collinearity can have drastic effects, there is hardly 37 anything known about the actual degree of collinearity within populations as most of the 38 current genome studies are not based on chromosome-level assemblies. The first complete 39 assembly of a plant genome was the reference sequence of A. thaliana (Col-0), which was 40 based on a minimal tiling path of BACs sequenced with Sanger technology 9 . Since then 41 multiple hundred Arabidopsis genomes have been studied, however, most of these studies 42 relied on short-read based resequencing or reference-guided assembly, where the 43 identification of genomic rearrangements remained challenging 10-16 . In contrast, reference-44 independent, chromosome-level assemblies with almost complete reconstruction of the 45 nucleotide sequence enable accurate identification of all sequence differences and would 46 therefore reveal the degree of synteny across the genome 17 . So-far, however, there are only 47 6 genome (Fig. 5c,d and Supplementary Fig. 5). Moreover, reduced recombination combined 176 with geographic isolation can provide the basis for the development of alleles, which are 177 incompatible with distantly related haplotypes leading to intra-species incompatibilities 40 . To 178 test this, we searched the location of nine recently reported genetic incompatible loci 41 179 (DM1-9) and found that all except of one overlapped with HOT regions, while DM3, the 180 locus which did not overlap with a HOT region, was closely flanked by two HOT regions 181 (Supplementary Fig. 6-12). In addition, we also checked the locus of a recently published 182 single-locus genetic incompatibility 42 and found that it was also residing in a HOT region 183 ( Supplementary Fig. 13). 184 The high structural diversity of the HOT regions was reminiscent of the patterns that 185 have been described for R gene clusters 43-50 . In fact, the 808 reference genes in HOT 186 regions were significantly enriched for genes involved in defense response, signal 187 transduction and secondary metabolite biosynthesis (Fig. 5e)  assembly was based on the Falcon assembly as these assemblies always showed highest 232 assembly contiguity. A few contigs were further connected or extended based on whole 233 genome alignments between Falcon and Canu or MECAT assemblies. Contigs were 234 labelled as organellar contigs if they showed alignment identity and coverage both larger 235 than 95% when aligned against the mitochondrial or chloroplast reference sequences. A few 236 of contigs aligned to multiple chromosomes and were split if no Illumina short read 237 alignments supported the conflicting regions. Assembly contigs larger than 20kb were 238 combined to pseudo-chromosomes according to their alignment positions when aligned 239 against the reference sequence using MUMmer4 60 . Contigs with consecutive alignments 240 were concatenated with a stretch of 500 Ns. To note, the assembly of the Ler accession 241 was already described in a recent study 17 . 242

Assembly evaluation 243
We evaluated the assembly completeness by aligning the reference genes against each 244 of the seven genomes using Blastn 61 . Reference genes which were not aligned or only 245 partially aligned might reveal genes which were missed during the assembly. To examine 246 whether they were really missed, we mapped Illumina short reads from each genome 247 against the reference genome using the BWA 58 and checked the mapping coverage of 9 coverage (Supplementary Table 7). 250 Centromeric and telomeric tandem repeats were annotated by searching for the 178 bp 251 tandem repeat unit 62 and the 7 bp tandem repeat unit of TTTAGGG 63 . rDNA clusters were 252 annotated with Infernal version 1.1 64 . 253 The assembly contiguity of Cvi-0 and Ler were further tested using three previously 254 published genetic maps 65-67 (Supplementary Table 3). For this we aligned the marker 255 sequences against the chromosome-level assemblies and checked the order of the markers 256 in the assembly versus their order in the genetic map. The ordering of contigs to 257 chromosomes was perfectly supported by all three maps. Overall, only six (out of 1,156) 258 markers showed conflicts between the genetic and physical map. In all six cases we found 259 evidence that the conflict was likely caused by structural differences between the parental 260 genomes. 261

Gene annotation 262
Protein-coding genes were annotated based on ab initio gene predications, protein 263 sequence alignments and RNA-seq data. Three ab initio gene predication tools were used 264 including Augustus 68 , GlimmerHMM 69 and SNAP 70 . The reference protein sequences from 265 the Araport 11 24 annotation were aligned to each genome assembly using exonerate 71 with 266 the parameter setting "--percent 70 --minintron 10 --maxintron 60000". For five accessions 267 (An-1, C24, Cvi-0, Ler-0, and Sha) we downloaded a total of 155 RNA-seq data sets from 268 the NCBI SRA database (Supplementary Table 8). RNA-seq reads were mapped to the 269 corresponding genome using HISAT2 72 and then assembled into transcripts using 270 StringTie 73 (both with default parameters). All different evidences were integrated into 271 consensus gene models using Evidence Modeler 74 . 272 The resulting gene models were further evaluated and updated using the Araport 11 24 273 annotation. Firstly, for each of the seven genomes, the predicted gene and protein 274 sequences were aligned to the reference sequence, while all reference gene and protein 275 sequences were aligned to each of the other seven genomes using Blast 61 . Then, 276 potentially mis-annotated genes including mis-merged (two or more genes are annotated as 277 a single gene), mis-split (one gene is annotated as two or more genes) and unannotated 278 genes were identified based on the alignments using in-house python scripts. Mis-annotated 279 or unannotated genes were corrected or added by incorporating the open reading frames 280 generated by ab initio predications or protein sequence alignment using Scipio 75  . NB-LRR R gene clusters were defined based on the annotation from a previous study 78 . 285

Pan-genome analysis 286
Pan-genome analyses were performed at both sequence and gene level. To construct a 287 pan-genome of sequences, we generated pair-wise whole genome sequence alignments of 288 all possible pairs of the eight genomes using the nucmer in the software package 289 MUMmer4 60 . A pan-genome was initiated by choosing one of the genomes, followed by 290 iteratively adding the non-aligned sequence of one of the remaining genomes. Here, non-291 aligned sequences were required to be longer than 100bp without alignment with an identity 292 of more than 90%. The core genome was defined as the sequence space shared by all 293 sampled genomes. Like the pan-genome, the core-genome analysis was initiated with one 294 genome. Then all other genomes were iteratively added, while excluding all those regions 295 which were not aligned against each of the other genomes. The pan-and core-genome of 296 genes was built in a similar way. The pan-genome of genes was constructed by selecting 297 the whole protein coding gene set of one of the accessions followed by iteratively adding the 298 genes of one of the remaining accessions. Likewise, the core-genome of genes was defined 299 as the genes shared in all sampled genomes. 300 For each pan or core genomes analysis, all possible (∑ ! !( ) ) combinations of 301 integrating the eight genomes (or a subset of them) were evaluated. The exponential 302 regression model y = A e Bx + C to was then used to model the pan-genome/core-genomes 303 by fitting medians using the least square method implemented in the nls function of R. 304

Analysis of structural rearrangements and gene CNV 305
All assemblies were aligned to the reference sequence using nucmer from the 306 MUMmer4 60 toolbox with parameter setting "-max -l 40 -g 90 -b 100 -c 200". The resulting 307 alignments were further filtered for alignment length (>100) and identity (>90). Structural 308 rearrangements and local variations were identified using SyRI 17 . The functional effects of 309 sequence variation were annotated with snpEff 79 . The gene CNV were identified according 310 to the gene family clustering using the tool OrthoFinder 28 based on all protein sequences 311 from the eight accessions. 312

Synteny diversity, hotspots of rearrangements and diversity estimates 313
Synteny Diversity was defined as the average fraction of non-syntenic sites found within 314 all pairwise genome comparisons within a given population. Here we denote Synteny 315 Diversity as 316 = ∑ , 317 where and refer to the frequencies of sequence and and to the average 318 probability of a position in to be non-syntenic. Note, π syn can be calculated in a given 319 region or for the entire genome. However even when calculated for small regions the 320 annotation of synteny still needs to established within the context of the whole genomes to 321 avoid false assignments of homologous but non-allelic sequence. Here we used the 322 annotation of SyRI to define syntenic regions. π syn values can range from 0 to 1, with higher 323 values referring to a higher average degree of non-syntenic regions between the genomes. 324 For the analyses, we calculated π syn in 5-kb sliding windows with 1kb step-size across 325 the entire genome. HOT regions were defined as regions with π syn larger than 0.5.

326
Neighboring regions were merged into one HOT region if their distance was shorter than 327 2kb. 328 The nucleotide and haplotype diversity were calculated with the R package 329  (80-. ). 294, 555-559 (2001).    core-genome (red) estimations were fitted using an exponential model. 587