Patterns of polymorphism, selection and linkage disequilibrium in the subgenomes of the allopolyploid Arabidopsis kamchatica

Although genome duplication is widespread in wild and crop plants, little is known about genome-wide selection due to the complexity of polyploid genomes. In allopolyploid species, the patterns of purifying selection and adaptive substitutions would be affected by masking owing to duplicated genes or ‘homeologs’ as well as by effective population size. We resequenced 25 distribution-wide accessions of the allotetraploid Arabidopsis kamchatica, which has a relatively small genome size (450 Mb) derived from the diploid species A. halleri and A. lyrata. The levels of nucleotide polymorphism and linkage disequilibrium decay were comparable to those of A. thaliana, indicating the feasibility of association studies. A reduction in purifying selection compared with the parental species was observed. Interestingly, the proportion of adaptive substitutions (α) was significantly positive, in contrast to the majority of plant species. A recurrent pattern observed in both frequency- and divergence-based neutrality tests is that the genome-wide distributions of both subgenomes were similar, but the correlation between homeologous pairs was low. This may increase the opportunity for different evolutionary trajectories, as seen in the HMA4 gene involved in heavy metal hyperaccumulation.


Genome duplication is a widespread evolutionary force in plants. As [...]

To sort Illumina reads of A. kamchatica to their parentally derived subgenomes, we generated long mate-pair de novo assemblies of A. lyrata subsp. petraea (also called A. petraea subsp. umbrosa) in addition to East Asian A. halleri subsp. gemmifera, which we previously reported 39.

Assembly statistics indicated that the A. lyrata and A. halleri reference genomes have scaffold N50 of 1.2 Mb and 0.7 Mb, comprising 1,675 and 2,239 scaffolds respectively (Table 1, Supplementary Table 1 [...]

We examined the patterns of nucleotide diversity for ca. 21,000 coding sequences of both halleri- and lyrata-derived homeologs in A. kamchatica that could be aligned to A. thaliana orthologs as the outgroup. We found that both subgenomes showed similar mean values of nucleotide diversity (π) (π_coding = 0.0014 bp⁻¹ for the halleri-subgenome, π_coding = 0.0015 bp⁻¹ for the lyrata-subgenome, and π_coding = 0.0015 bp⁻¹ when combined), although the lyrata-derived homeologs showed slightly broader ranges in π (Table 2, Fig. 1A). Nucleotide diversity at synonymous sites (π_syn) was also similar for the two subgenomes, with a slightly higher value for the lyrata-subgenome (π_syn = 0.0049) than for the halleri-subgenome (π_syn = 0.0044). The nucleotide diversity in A. kamchatica is about six times lower than in European A. halleri and A. lyrata (π_syn = 0.029 for A. halleri and 0.028 for A. lyrata, estimated using resequencing data from ref. 30) and is more similar to that of A. thaliana (π_syn = 0.0059–0.007) 17,30,40. Sliding window analysis including non-coding regions also showed comparable values (Supplementary Table 4).
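As a minimal sketch of how a per-site π value of the kind reported above can be obtained, the following Python function averages the number of differences over all pairs of aligned sequences and divides by the alignment length. The function name and the toy alignment are illustrative assumptions; the sketch ignores missing data and does not separate synonymous from non-synonymous sites, so it is not the pipeline used for π_syn here.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Return per-site nucleotide diversity (pi) for aligned sequences.

    pi is the mean number of pairwise differences divided by the
    alignment length. All sequences must have the same length.
    """
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned and non-empty")

    total_diffs = 0
    for a, b in combinations(seqs, 2):
        # Count mismatching columns for this pair of sequences.
        total_diffs += sum(1 for x, y in zip(a, b) if x != y)

    n_pairs = n * (n - 1) // 2
    return total_diffs / n_pairs / length


# Toy alignment (hypothetical data, not from A. kamchatica).
toy_alignment = ["AAAA", "AAAT", "AATT"]
print(nucleotide_diversity(toy_alignment))
```

In practice each homeolog alignment would be processed separately, giving one π value per gene and subgenome.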

We calculated the effective population size, N_e, using our empirical estimates of π for A. kamchatica and both diploid species and two different mutation rates 41 [...]
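The relationship between π and N_e can be illustrated with the standard neutral expectation π = 4·N_e·μ, rearranged to N_e = π / (4μ). The sketch below applies it to the π_syn values reported for the two subgenomes. Whether this exact form was used here is an assumption, and the mutation rate of 7 × 10⁻⁹ per site per generation is only an illustrative placeholder of the order reported for A. thaliana, not necessarily one of the two rates used in this study.

```python
def effective_population_size(pi, mu):
    """Return N_e = pi / (4 * mu) under the neutral expectation pi = 4 * N_e * mu.

    pi : per-site nucleotide diversity (ideally at neutral sites)
    mu : mutation rate per site per generation
    """
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return pi / (4.0 * mu)


# Illustrative placeholder rate; an assumption, not a value taken from this paper.
ASSUMED_MU = 7e-9

# pi_syn values as reported in the text for each subgenome.
for label, pi_syn in (("halleri-subgenome", 0.0044), ("lyrata-subgenome", 0.0049)):
    print(label, round(effective_population_size(pi_syn, ASSUMED_MU)))
```

Because N_e scales inversely with μ, using two mutation rates gives a range of estimates rather than a single value.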

Higher proportions of non-synonymous mutations were found to be at low frequency compared with synonymous mutations, and no significant differences in the relative proportions were found between subgenomes (Fig. 1B). This suggests purifying selection on a large proportion of amino-acid changing substitutions in both subgenomes. Frequency-based test statistics clearly show significant departures from neutrality for both subgenomes (Fig. 1C) [...]
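One widely used frequency-based neutrality statistic is Tajima's D, which contrasts the mean number of pairwise differences with the number of segregating sites. The sketch below implements the standard formula as an illustration; it is an assumption that this statistic is among those summarized in Fig. 1C, and the function name and example inputs are hypothetical.

```python
import math


def tajimas_d(n, seg_sites, pi_total):
    """Return Tajima's D for one locus.

    n         : number of sampled sequences (at least 4)
    seg_sites : number of segregating sites (S)
    pi_total  : mean number of pairwise differences over the whole locus
                (not divided by the locus length)
    """
    if n < 4:
        raise ValueError("need at least 4 sequences for a non-zero variance")
    if seg_sites == 0:
        return float("nan")  # D is undefined without polymorphism

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between the two estimators of theta, scaled by its standard deviation.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi_total - seg_sites / a1) / math.sqrt(variance)


# Hypothetical locus: 25 sequences, 40 segregating sites, 6 mean pairwise differences.
print(tajimas_d(25, 40, 6.0))
```

Negative values indicate an excess of low-frequency variants, the pattern expected under purifying selection or population expansion.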

We found the means of the distributions for most summary statistics to be very similar between the two subgenomes, but when pairs of all homeologs were compared, correlations were generally low for diversity and neutrality estimators (Table 2). The correlations of π_syn and θ_W,syn were both nearly zero (Table 2) [...]

Long scaffold assemblies allowed us to estimate genome-wide LD for each subgenome to evaluate the feasibility of association mapping in A. kamchatica. We found that mean LD decay was between 5 and 10 kb for both subgenomes (Fig. 1D), which is similar to the self-fertilizing [...]
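LD decay is typically summarized by computing a pairwise LD measure such as r² between sites and plotting it against physical distance. As a hedged illustration, the sketch below computes r² for two biallelic sites from phased haplotypes coded as 0/1; the function name and coding scheme are assumptions, and the LD software and settings actually used are not specified in this section.

```python
def r_squared(site_a, site_b):
    """Return the LD statistic r^2 between two biallelic sites.

    site_a, site_b : equal-length sequences of 0/1 alleles, one entry
                     per haplotype, in the same haplotype order.
    """
    n = len(site_a)
    if n == 0 or n != len(site_b):
        raise ValueError("sites must have the same non-zero number of haplotypes")

    p_a = sum(site_a) / n  # frequency of allele 1 at site A
    p_b = sum(site_b) / n  # frequency of allele 1 at site B
    # Frequency of haplotypes carrying allele 1 at both sites.
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")

    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / denom


# Hypothetical haplotypes: complete LD versus no association.
print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))
```

Averaging r² within distance bins along each scaffold gives a decay curve from which a distance such as 5-10 kb can be read.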

The distribution of π in the HMA4-M region for H-origin genes showed low diversity (π_mean = 0.0007), but it was not significantly lower than that of the background genes (Fig. 2B and 2C).

We also estimated diversity of all annotated heavy metal transporters, metal ion transporters, and metal homeostasis genes for comparison with the genome-wide average (HM genes, N = 118 genes). We expected these genes to have low overall diversity in both genomes due to selective constraint, as many of these ion transporters are expected to have roles in basic metal homeostasis 46. As a contrast, we compared NBS-LRR genes (N = 39 genes), which have [...] (Table 1), providing additional support that redundant genes exhibit significant differences due to stronger positive or purifying selection on only one of the two copies.

The Distribution of Fitness Effects (DFE)

The tests above indicated that large numbers of homeologs show patterns consistent with purifying selection on amino-acid changing mutations (see Fig. 3). We quantified the genome- [...] While the differences are significant, the magnitude of the differences is not remarkable.

To examine whether subsets of either subgenome experience a reduction in purifying selection, we classified homeologs according to gene expression level, which is one of the best predictors of evolutionary rates (dN/dS) in most organisms 49. Expression level is negatively correlated with dN/dS due to strong constraint on amino acid substitutions (dN) 22 for highly expressed genes, but this has not been shown in recent polyploid species. As a test of selective constraint on highly expressed genes, we found dN/dS was negatively correlated with expression for both homeologs (Fig. 4B). We would therefore expect genes that are highly expressed to show the strongest purifying selection, and lowly expressed genes to show relaxed constraint. We estimated the DFE again to quantify purifying selection and relaxed constraint [...] Table 6). The α estimates for the H- and L-origin subgenomes of A. kamchatica were lower than those of the corresponding diploid species but significantly greater than zero (0.12 and 0.09, respectively) (Fig. 4E). The difference in α between subgenomes was significant but subtle (3% difference using the samples above, 6% difference when all 25 A. kamchatica accessions were used; Supplementary Fig. 4).

Coding sequence (CDS) alignments

We identified homeologous genes based on reciprocal BLAST hit (best-to-best with E-values < 10 - [...] genome annotations. Using the same approach, we also detected orthologous relationships between the predicted genes in diploid A. halleri and A. lyrata annotated genome assemblies and A. thaliana genes (TAIR 10). In cases of duplicated genes of interest such as HMA4 (tandemly duplicated three times in A. halleri), we used only one copy for diversity analysis due to non-unique alignments of Illumina reads and very high sequence identity (99%) in the A. [...]
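The reciprocal best-hit criterion described above can be sketched as follows: a pair of genes is retained only if each is the other's top-scoring hit among hits passing the E-value cutoff. This is a minimal illustration under stated assumptions. The tuple layout, the function names, the gene identifiers, and the cutoff of 1e-5 in the usage example are all hypothetical; the cutoff actually used is truncated in this text and is not reproduced here.

```python
def best_hits(hits, max_evalue):
    """Map each query to its highest-bitscore subject.

    hits       : iterable of (query, subject, evalue, bitscore) tuples
    max_evalue : hits with evalue >= max_evalue are discarded
    """
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue >= max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits_ab, hits_ba, max_evalue):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(hits_ab, max_evalue)
    best_ba = best_hits(hits_ba, max_evalue)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hit tables (halleri-origin queries vs lyrata-origin subjects, and the reverse).
h_vs_l = [("h1", "l1", 1e-50, 200.0), ("h1", "l2", 1e-20, 90.0),
          ("h2", "l2", 1e-40, 150.0), ("h3", "l3", 1e-2, 20.0)]
l_vs_h = [("l1", "h1", 1e-50, 200.0), ("l2", "h1", 1e-30, 120.0),
          ("l3", "h3", 1e-2, 20.0)]

# 1e-5 is a placeholder cutoff chosen only for this example.
print(reciprocal_best_hits(h_vs_l, l_vs_h, max_evalue=1e-5))
```

Requiring reciprocity discards one-directional matches such as h2 and l2 in the example, which helps avoid pairing a gene with a paralog of its true homeolog.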

Population structure and phylogenetic analysis

We used 1000 randomly selected coding sequence (CDS) alignments from both halleri- and lyrata-derived homeologs. We then individually concatenated the halleri alignments and the lyrata alignments to use for population structure and phylogenetic analysis. The input data sets for the population structure analysis contained 21,341 and 16,223 markers from halleri- and lyrata-origin CDS respectively. We ran STRUCTURE v2.3.4 64 ten times for each K from 1 to 9 using the admixture model and 50,000 MCMC rounds for burn-in followed by 100,000 rounds to generate [...]