Sequencing of the genus Arabidopsis identifies a complex history of nonbifurcating speciation and abundant trans-specific polymorphism

The notion of species as reproductively isolated units related through a bifurcating tree implies that gene trees should generally agree with the species tree and that sister taxa should not share polymorphisms unless they diverged recently and should be equally closely related to outgroups. It is now possible to evaluate this model systematically. We sequenced multiple individuals from 27 described taxa representing the entire Arabidopsis genus. Cluster analysis identified seven groups, corresponding to described species that capture the structure of the genus. However, at the level of gene trees, only the separation of Arabidopsis thaliana from the remaining species was universally supported, and, overall, the amount of shared polymorphism demonstrated that reproductive isolation was considerably more recent than the estimated divergence times. We uncovered multiple cases of past gene flow that contradict a bifurcating species tree. Finally, we showed that the pattern of divergence differs between gene ontologies, suggesting a role for selection.

The notion of species as reproductively isolated units related through a bifurcating tree implies that gene trees should generally agree with the species tree and that sister taxa should not share polymorphisms unless they diverged recently and should be equally closely related to outgroups. It is now possible to evaluate this model systematically. We sequenced multiple individuals from 27 described taxa representing the entire Arabidopsis genus. Cluster analysis identified seven groups, corresponding to described species that capture the structure of the genus. However, at the level of gene trees, only the separation of Arabidopsis thaliana from the remaining species was universally supported, and, overall, the amount of shared polymorphism demonstrated that reproductive isolation was considerably more recent than the estimated divergence times. We uncovered multiple cases of past gene flow that contradict a bifurcating species tree. Finally, we showed that the pattern of divergence differs between gene ontologies, suggesting a role for selection.
The genus Arabidopsis has been proposed to consist of as many as 26 taxa in addition to the model plant A. thaliana 1 . To clarify relationships within this important genus, we resequenced 94 individuals, covering all taxa, sampled throughout their geographic range ( Table 1, Supplementary Data Set 1 and Supplementary Fig. 1). Because A. thaliana was known to be most distantly related to the other taxa and should thus be equally related to all of them under a bifurcating speciation model, the reads were mapped to the A. thaliana reference genome. On average, 43% were mapped, covering about 66% of the genome (Supplementary Data Set 1). To produce a highquality data set for between-species comparisons across the genus, we focused on genic regions, requiring that at least half of all exons be covered in each individual, and excluded genes that showed evidence of duplication. The resulting data are thus biased toward conserved exons, as well as with respect to A. thaliana. In total, 9,119 genes fulfilled our criteria across the entire sample, comprising genus-wide alignments for 25 Mb of the genome. The data were analyzed together with published SNPs from 337 A. thaliana lines [2][3][4] . In total the alignment contains 7.5 million SNPs, 6.6 million of which are biallelic.
Clustering on the basis of genome-wide polymorphism revealed four major groups, corresponding to the widely distributed species A. thaliana, A. halleri, A. lyrata and A. arenosa, and three minor groups, corresponding to the geographically limited A. croatica, A. cebennensis and A. pedemontana ( Fig. 1 and Supplementary  Fig. 2). This is in broad agreement with previous results 1 . In addition, there are the two well-known allotetraploid species, A. suecica and A. kamchatica, hybrids between A. arenosa and A. thaliana and between A. lyrata and A. halleri, respectively [5][6][7] , and there is also clear evidence for the previously described admixture between tetraploid A. lyrata ssp. petraea and A. arenosa ssp. borbasii 8 .
The divergence among these species is highly variable. At the level of individual gene trees (Supplementary Fig. 3), 100% supported the monophyly of A. thaliana-as expected, given that this species is estimated to have diverged from the rest of the genus at least 6 million years ago (Myr) 9 (Supplementary Fig. 4) and is reproductively isolated as a consequence of a chromosome number of 5 rather than 8 ( Table 1). The separation of A. cebennensis and A. pedemontana from Sequencing of the genus Arabidopsis identifies a complex history of nonbifurcating speciation and abundant trans-specific polymorphism 1 0 7 8 VOLUME 48 | NUMBER 9 | SEPTEMBER 2016 Nature GeNetics l e t t e r s the remaining diploids is also strong (63% of gene trees supported this split), whereas the remaining species, at least some of which can be crossed experimentally 10,11 , are much more closely related.
To gain further insight into the relationships among the species, we searched for identical-by-descent (IBD) haplotypes using BEAGLE 12 (Fig. 2). Only 49 such haplotypes were detected in comparisons between species, compared to 14,600 within the outcrossing species (haplotype sharing in the selfing A. thaliana is much more extensive, as expected) 13 . The haplotypes shared between species were very short (median = 2.0 kb, maximum = 3.4 kb), suggesting gene flow occurring on the order of 20,000 generations ago rather than more recently (Online Methods). Much longer shared haplotypes were found within the outcrossing species (median = 2.3 kb, maximum = 16.3 kb), although most haplotypes shared within these species were also short ( Supplementary Fig. 5), probably owing to the patchiness of data that result from cross-species alignments.
Although the IBD analysis did not reveal conclusive evidence for gene flow, we used the results to guide further analyses. Specifically, we used the presence of between-species IBD blocks (Fig. 2) to identify probable cases of gene flow and then further examined these using the ABBA-BABA test 14,15 . This test uses the distribution of derived alleles to determine whether one of two sister taxa is closer to an outgroup than the other (thus violating a bifurcating species tree). We found several such cases. For example, subarctic A. lyrata was more closely related to A. halleri from East Asia than to A. halleri from continental Europe (Fig. 3a,b and Supplementary Fig. 6a-d), suggesting ancient admixture in East Asia between A. lyrata and A. halleri. Gene flow between these species was also supported by the incongruence between the chloroplast tree and the species tree ( Fig. 1 and Supplementary Figs. 3 and 4), and the allopolyploid A. kamchatica originated in East Asia through hybridization between A. lyrata and A. halleri 7 . Similarly, the common ancestor of A. pedemontana and A. cebennensis was more closely related to A. halleri than to A. lyrata or A. arenosa (Supplementary Fig. 7a). Consistent with hybridization between these species, we observed shared haplotypes between only the geographically closest samples of A. pedemontana and A. halleri ( Fig. 2 and Supplementary Fig. 7b).
We did not detect any shared IBD blocks between A. thaliana and other species in the genus. However, the ABBA-BABA test shows that A. thaliana is closer to A. lyrata than to A. arenosa ( Fig. 3c and Supplementary Fig. 6e-j). This contradicts a species tree in which the rest of the genus would be monophyletic with respect to A. thaliana and is notable given that the average sequence divergence between the common outcrossing species is less than half of that between these species and A. thaliana (Supplementary Table 1). Thus the highly diverged model plant A. thaliana appears to have become reproductively isolated from the outcrossing species considerably more recently than the estimated divergence time of 6 Myr 9 ( Supplementary  Fig. 3). This has important implications for within-species polymorphism data. For example, the well-established pattern of local deep sequence divergence within A. thaliana 16 may well reflect ancient admixture, and the dramatic changes in genome structure that distinguish this species from the rest of the genus may have occurred even more rapidly than previously believed 17 , perhaps in conjunction with the transition to selfing 18,19 .
Significant ABBA-BABA results may reflect population subdivision in the ancestral species as well as gene flow 14,20 . Thus the fact that fossil Neanderthals share more alleles with modern non-Africans than with modern Africans may not necessarily reflect admixture between Neanderthals and anatomically modern humans outside Africa but could also indicate that both Neanderthals and the modern humans that left Africa came from the same African subpopulation or isolated region 15 . However, for this to have occurred, the ancient African subdivision would have had to persist for a very long time while still allowing sufficient gene flow for modern humans to be most closely related at most loci (and evolve into anatomically modern humans). Admixture thus seems a more likely explanation.
Analogously, the greater number of shared alleles between A. thaliana and A. lyrata than between A. thaliana and A. halleri or A. arenosa could be due to ancient population subdivision, with the same subpopulation or region giving rise to A. thaliana and A. lyrata. However, the argument against the likelihood of ancient subdivision in humans is even stronger in Arabidopsis, because the species are so highly diverged. Whereas average coalescence times in modern humans are greater than the supposed split from Neanderthals, the average coalescence times between the common outcrossing Arabidopsis species is less than half the divergence between these species and A. thaliana.
Despite the evidence for ancient gene flow, the modern species are clearly distinct. Allele frequencies for shared polymorphisms were uncorrelated (Fig. 4), and F st was very high (given the high levels of within-species polymorphism 21 Table 1). However, even the highly diverged A. thaliana shares polymorphisms with the other species (for example, more than 20,000 synonymous SNPs with A. lyrata). Many of these must be due to mutation independently npg l e t t e r s generating the same polymorphism in both species, but they could also be ancestral polymorphisms, maintained either by balancing selection 22,23 or by chance (the latter is possible only if divergence times were more recent than estimated; see Online Methods).
To investigate this further, we focused on SNPs that segregate in all four common species. Using conservative filtering to avoid paralogous SNPs caused by cryptic duplications, we observed 3,818 SNPs in 2,365 genes. These genes show substantially higher values of Tajima's D statistic 24 than random genes ( Supplementary Fig. 8), as expected for ancestral polymorphisms, regardless of whether they are maintained by selection.
To eliminate any shared polymorphism due to repeated mutations, we refined this set further by considering only shared haplotypes with at least two shared SNPs. Such haplotypes are unlikely to have arisen independently 25 . Specifically, we selected genes with shared haplotypes such that a pair of SNPs shared across species was in linkage disequilibrium (r 2 > 0.3) in A. thaliana and in at least two of the three other common species. Moreover, we required that at least one of the shared SNPs was nonsynonymous. Using these criteria, we ended up with 129 genes containing 340 shared sites (Supplementary Data Set 2). These genes showed an even stronger increase in Tajima's D value relative to background (Supplementary Fig. 8) and were significantly (FDR-corrected P = 0.006; Online Methods) enriched for genes involved in response to virus, suggesting that selection may indeed have contributed to maintaining some of the ancestral polymorphisms. We identified a total of four virus-response genes with haplotypes shared across all four species: CYCT1;4 (TAIR locus AT4G19600), which is involved in transcriptional activation of viral genes; GRIK1 (TAIR locus AT3G45240), which has a role in the metabolic regulation of virus-infected cells 26,27 ; NIG (TAIR locus AT4G13350), which is involved in nucleocytoplasmic trafficking of viral proteins 28 ; and WAK1 (TAIR locus AT1G21250), which activates an immunity response to damaged cell walls 29 .
If selection had a role in determining which genes share polymorphism across species, it may also have influenced which genes do not. Although our data are heavily biased toward genes that are conserved across the genus, we searched for genes that may have contributed to the adaptive divergence between species, primarily relying on high F st values in combination with low polymorphism and Tajima's D values. We focused on identifying gene categories that appeared to be overrepresented among divergent genes in particular species. Several such categories were found (Supplementary Table 2). For example, genes divergent in A. halleri were overrepresented in the 'heavy metal-associated domain' category, consistent with metal hyperaccumulation in A. halleri 30,31 . Less obviously, genes divergent in A. lyrata were overrepresented in the 'circadian rhythm' category, perhaps reflecting the wide latitudinal distribution of this species and adaptation to different day lengths. Indeed, a recent selective sweep in phytochromes has been described in A. lyrata populations 32 . Finally, genes identified in the morphologically highly divergent A. pedemontana and A. cebennensis were overrepresented in the 'organ morphogenesis' and 'tissue development' categories.
In conclusion, the current species of the genus Arabidopsis, which are clearly identifiable with either morphological or polymorphism data (Fig. 1), do not have a tree-like relationship. Gene trees and species trees may disagree because of 'incomplete lineage sorting' , but this discrepancy is simply a consequence of random coalescence events between closely related species; it may make species trees difficult to infer, but it does not contradict their existence 33 . The results we present here do, and the apparent pervasiveness of gene flow suggests that it is not just a case of introgression leading to minor inconsistencies. Rather, we would argue that speciation in this genus is a protracted process involving selection and long periods of partial isolation between multiple incipient species, perhaps as a consequence of the multiple glaciations that have characterized the Quaternary Period. A bifurcating species tree describes neither a li a n a 6 9 7 6 A . th a li a n a 9

0 A . s u e c ic a (t h a li a n a ) A S 5 9 A . s u e c ic a (t h a li a n a ) A S S 3 A . s u e c ic a (t h a li a n a ) A S O 5 A . th a lia n a 7 3 2 3
A . th a lia n a 6 Figure 1. We divided A. halleri into eastern (orange area) and western lineages and A. lyrata into northern (blue area) and southern lineages. (b) Northern A. lyrata (lN) are closer to eastern A. halleri (hE) than to western A. halleri (hW). Plots show the distribution of D statistics resulting from testing different individuals from each region; the significance of the D statistics for each test can be assessed using previously published methods 14,48 (supplementary Fig. 6a-d and Online Methods). All between-population comparisons discussed are significant, i.e., hE was closer to A. lyrata than was hW (two-sided Wilcoxon test; P = 9.5 × 10 −190 , blue vs. green plot), and lN was closer to hE than was southern A. lyrata (lS) (P = 8.     (Supplementary Data Set 1).
We performed correction of raw reads for the A. halleri ssp. halleri 3 sample (Supplementary Data Set 1). The ErrorCorrectReads.pl module of ALLPATHS-LG version 44837 was used for correction of raw reads (34.0 million reads). PHRED_ENCODING was set to 64, and the other parameters were set to default according to the ALLPATHS-LG manual. In total, 24.9 million corrected reads were used in downstream data analyses.
Raw reads were uploaded to NCBI SRA (numbers of BioProjects and BioSamples are available in Supplementary Data Set 1). Chloroplast assemblies were uploaded to ENA (Supplementary Data Set 1).

Read mapping and variants discovery.
We mapped reads to A. thaliana (TAIR10) reference genome using the BWA-MEM algorithm from BWA 50 (version 0.7.4) with an increased penalty for unpaired read pairs to 15, then we removed duplicated reads with Samtools 51 (version 0.1.18) rmdup function and performed local realignment with Genome Analysis Toolkit 52,53 (GATK, version 2.5.2) IndelRealigner. After filtering for uniquely and primary aligned reads, we calculated coverage distribution using GATK Pileup and chose intervals with coverage between the third and ninety-seventh percentiles for further analysis. SNPs and short indels were called with GATK UnifiedGenotyper with default quality thresholds. Diploid samples were phased using GATK ReadBackedPhasing. Called SNPs were annotated with SnpEff software 54 . We also included 337 A. thaliana accessions from already published data sets 2-4 and used mapping and SNP calling for these samples from the 1001 Genomes Project 55 .
We first separated allopolyploid samples raw data according to the parental genomes and then included them into the main pipeline for diploid and tetraploid individuals described above. A. suecica reads were mapped to A. thaliana (TAIR10) and A. lyrata 17 (version 1.0) references simultaneously using the same parameters and software. After the realignment step, reads were filtered for primary and uniquely aligned reads in proper pairs (samtools flags −F 256 −f 3 −q 10) and then split according to the scaffolds. Reads mapped to Chr1-5 of A. thaliana and scaffolds 1-8 of A. lyrata reference parts were then mapped separately to the A. thaliana reference. A. kamchatica reads were mapped to two parental species genome (A. halleri 56 and A. lyrata) and separated into two groups using the read classification method HomeoRoq 56 , and we treated each A. kamchatica sample as two different diploid individuals.
The structure of the genus Arabidopsis. Population structure analysis was performed using all the samples of A. thaliana relatives, 10 A. thaliana accessions from different geographical locations and C. rubella as an outgroup. We focused on genic regions where at least half of the total exon length was covered in each individual analyzed. Then we calculated copy number for those genes in each individual on the basis of total exon coverage depth normalized by mean coverage depth and filtered for genes with copy number between 0.4 and 1.6 in all individuals. 9,119 genes satisfied these criteria.
A neighbor-joining tree was built with R package APE 57 from the genetic distance matrix generated from the 281,305 biallelic SNP calls that had no missing data. Genealogies of individual genes were generated using the same approach. Support values for tree splits were calculated by SumTrees program from DendroPy package 58 using the percentage of the trees in which a particular split is found as a degree of support. Maximum likelihoods of individual ancestries were estimated with ADMIXTURE 47 for the same genes, allowing for missing data. In order to include autotetraploid samples in ancestry assignment, we randomly chose two alleles for each site. We chose two local minima of cross-validation errors of ancestral populations assignment (K = 5 and K = 8) for subsequent analysis (Supplementary Fig. 2). described above. A rapid bootstrap analysis and search for the best-scoring ML tree was conducted with 1,000 bootstrap replicates. The partitioned data set with per-partition estimation of branch lengths was used with GTR+I+Γ as substitution model and the clade of C. rubella, C. bursa-pastoris and Camelina sativa was set as an outgroup. The ML tree was used as a starting tree for divergence time estimation after it was made ultrametric with node ages to fit constraints using the R package APE 57 version 3.1-4. Divergence time was estimated using the software BEAST 66 version 1.7. The same partitioned alignments as for ML analysis were used, with their respective best models as substitution models. Site and clock models were set to unlinked, but a linked partition tree was used with tree prior birth-death incomplete sampling 67 . An uncorrelated lognormal relaxed clock with estimated rates was used to account for rate heterogeneity between sites 68 . Owing to the lack of fossils from the genus Arabidopsis or the Brassicaceae family we used secondary calibration for divergence time estimation. Three calibration points were extracted from Hohmann et al. 9 : The age of the outgroup (split between the genera Camelina and Capsella) was set to 7.3572 Myr; the crown age of the genus Arabidopsis was set to 5.9685 Myr, and the root height (split between outgroup and genus Arabidopsis) was set to 8.1627 Myr. Normal distributions with a s.d. = 1.0 fit the 95% confidence intervals from Hohmann et al. 9 and were thus used for all three calibration points. An additional constraint was set on the root height by truncation to 4-12 Myr.
We ran two independent MCMC runs with 5 × 10 8 generations each and sampling parameters every 5 × 10 4 generations. LogCombiner 66 version 1.7.5 was used to combine trees from the two runs and the first 5 × 10 7 generations of each run were discarded as burn-in. The resulting 18,000 trees were combined to a maximum clade credibility tree in TreeAnnotator 66  Neutral expectations for allele sharing. Consider the number of shared polymorphism expected under the standard coalescent model. A necessary condition for trans-specific polymorphism is that two lineages survive back to the species split in both sister species. The probability of this is exp[-t split /(2N e )] independently in each species, where t split is the number of generations ago the two species became isolated, and N e is the effective population size, which may differ between the two species 69 . The expected pairwise coalescence time is E(T 2 ) = 2N e generations within each species, and E(T 2 ) = 2N e + t split generations between species, where N e in the latter equation is that of the ancestral species. In what follows, we will assume that Ne of the ancestral species equaled the larger of the N e of the descendant species. Simple moments estimators for all relevant parameters can be obtained by noting that the expectation of the pairwise sequence divergence d is proportional to E(T 2 ). We can thus obtain an upper bound for the probability of a trans-specific polymorphism as where d A is the estimated pairwise sequence divergence (nucleotide diversity) within species A, d B the same for species B, and d AB is the estimated pairwise sequence divergence between species. For example, the per-site probability of a trans-specific polymorphism under neutrality between A. thaliana and A. lyrata is less than e −(11.57-3.04)/0.69 × e −(11.6-3.04)/3.04 = 2.5 × 10 −7 . This is conservative, because it ignores the probability of the right order of coalescences in the ancestral species. The pairwise sequence divergence at aligned fourfold degenerate sites was used for the calculation. Thus, if we were to choose two alleles at random from each these two species, we would expect <1 out of the 2,909,657 fourfold degenerate sites to be trans-specific by chance. In the actual data we observed 599, and we conclude that either there was gene flow more recently than what is implied by the high divergence under the simple model of splitting or polymorphisms were maintained by selection.
Shared polymorphisms between common Arabidopsis species. We applied the same criteria to choose genes for the analysis of shared variation between major Arabidopsis clades, restricting it to 80 A. thaliana relatives analyzed and excluding A. thaliana from filtering genes pipeline, which gave us 15,454 genes that passed copy number filtering. Joint allele frequency spectra were calculated for biallelic sites in two clades, unfolded using C. rubella as an outgroup and plotted with dadi 70 .
F st estimates for pairwise common species (A. thaliana, A. halleri, A. lyrata, A. arenosa), comparisons were calculated at biallelic synonymous SNPs (Supplementary Table 1) as (H total -H subsp ) / H total , where H total is heterozygosity in total population (for example, A. thaliana and A. halleri taken together as one population), H subsp is average of heterozygosities in subpopulations (for example, average of A. thaliana and A. halleri heterozygosities). Heterozygosity is calculated as 2pq, where p and q are derived and ancestral allele frequencies, respectively.
We searched for SNPs segregating in all four common Arabidopsis species. Even though we filtered out potential duplicated genes on the basis of coverage before calling variants, we noticed that some of the shared sites show: (i) only heterozygous genotypes (Aa) or (ii) heterozygous and only one of the homozygous genotypes (Aa and AA or Aa and aa) among diploid A. thaliana relatives, which can be explained by the presence of a duplicated region. After filtering out such sites, we ended up with 3,818 sites spanning 2,365 genes. Additional filtering based on linkage was applied to avoid potential back mutations among shared SNPs. Here, we selected genes with shared haplotypes such that a pair of shared SNPs were in linkage disequilibrium (r 2 > 0.3) in A. thaliana and in at least two out of the three other common species. Moreover, we required that at least one of the shared SNPs was nonsynonymous.
We used GOWINDA 71 to test enrichment for Gene Ontology categories (GO slim) for identified genes under balancing selection. Total (private and shared) coding SNPs identified for the four common species (A. thaliana, A. halleri, A. lyrata and A. arenosa) were used as a background set, and shared coding SNPs segregating in all four species (3,818 SNPs in 2,365 genes) and that passed additional LD filtering (340 SNPs in 129 genes) were used as the query sets. We ran GOWINDA with 100,000 simulations in 'gene' mode, assuming that only SNPs located inside a gene are associated with the corresponding gene (-gene-definition 'gene'). Additionally to GO_SLIM terms from TAIR10 we created and tested GO term for NB-LRR gene family previously identified 72 .
Clade-specific divergence in common Arabidopsis species. To check for clade-specific divergence we tested for enrichment of particular A. thaliana GO_SLIM and INTERPRO categories in the diverged genes for the major clades A. halleri, A. lyrata, and A. arenosa, as well as for A. cebennensis and A. pedemontana as one clade. F st , nucleotide diversity and Tajima's D for each gene (including UTR regions) in each clade were calculated using msABC 73 'observed' mode (-obs). Several cut-off combinations were tried, and for some comparisons we also considered nucleotide diversity as this resulted in more interesting results. The reported P values do not take this implicit multiple testing into account and should thus be interpreted with caution. There is no obvious way of correcting for this kind of overfitting, and there is also no way of knowing a priori what parameters might be appropriate to detect selection in a given species.
For A. cebennensis and A. pedemontana, divergent genes were identified as those with F st values in the upper eighty-fifth percentile in comparisons between this clade and A. halleri, A. lyrata and A. arenosa. For A. halleri, we used the upper eightieth percentile threshold for F st and the additional criterion of nucleotide diversity in the lower fiftieth percentile in A. halleri. For A. lyrata and A. arenosa, divergent genes were identified as having F st values in the upper eighty-fifth percentile, and we also required nucleotide diversity in the lower forty-fifth percentile and Tajima's D values in the lower thirty-fifty percentile. Individuals that showed signs of introgression between A. lyrata and A. arenosa in the ADMIXTURE analysis we excluded.
We used only general A. thaliana GO_SLIM categories (TAIR10), which were not deeper than the third node on a GO categories graph for 'biological process' root category (136 categories in total) and also used only the first node categories of INTERPRO database (627 categories). We also required a minimum of five genes from a category to be present in a candidate set to perform a test. Gene categories enrichment was tested by Fisher's exact test: all P values were adjusted for multiple testing (using R function 'p.adjust' with method = 'fdr').