Extreme genetic signatures of local adaptation during plant colonization

Colonization of new habitats is expected to require genetic adaptations to overcome environmental challenges. Here we use full genome re-sequencing and extensive common garden experiments to investigate demographic and selective processes associated with the recent colonization of Japan by Lotus japonicus. We carefully track the colonization process where L. japonicus gradually spread from subtropical conditions to much colder climates in northern Japan. We characterize the loss of diversity during this process and identify genomic regions with extreme genetic differentiation. Next, we perform population structure-corrected association mapping of phenotypic traits measured in a common garden and discover a number of genome-wide significant associations. Contrasting these analyses, we find that there is a strong concordance between phenotypic variation and extreme differentiation for overwintering and flowering time traits. Our results provide evidence that these traits were direct targets of selection by local adaptation during the colonization process and point to associated candidate genes.


Introduction
Lotus japonicus (Lotus) is a perennial wild legume. It has a relatively small genome size of ~472 Mb (Handberg and Stougaard, 1992) and is found in diverse natural habitats across East and Central Asia, including Japan, Korea, and China, and extending west into Afghanistan (Hashiguchi et al., 2011). L. japonicus natural diversity is being characterized through the establishment of a collection of wild Japanese accessions (Hashiguchi et al., 2011). They represent an interesting population sample for studying local adaptation because of the geographical isolation of the Japanese archipelago and the pronounced variation in climate between the southern and the northern parts of the country. Climates range from subtropical to hemiboreal with yearly average temperatures of 17.9 and 5.3 ˚C, respectively, while annual daylight varies from 1316 to 2202 hours, and annual precipitation ranges from 775 mm to 3250 mm (Hashiguchi et al., 2011). The topography of Japan is also very varied, and an up to 3,000 meters high mountain range traverses the central regions of the archipelago. The mountainous topography will likely limit dispersal and the steep climatic gradients could result in strong selection pressures, creating conditions that could promote population differentiation driven by local adaptation (Kawecki and Ebert, 2004).
Experimental evidence for local adaptation can be obtained using reciprocal transplants, which compare at home versus away fitness, or common garden experiments that compare fitness of locals and immigrants in the same environment. Superior fitness at home in reciprocal transplants and locals outperforming immigrants in common gardens constitute evidence of local adaptation, with the local vs. immigrant comparison arguably offering stronger evidence (Kawecki and Ebert, 2004). Such experiments are now being coupled with genotype data to begin understanding the molecular events underlying local adaptation. Once it has been demonstrated that a subpopulation displays local adaptation, genotyping of individuals from contrasting subpopulations allows genome scans for selection signatures such as high fixation index (FST) levels. Demographic history, however, can generate similar signatures through genetic drift, causing false positives.
Population structure, whether due to genetic drift or selection, has long been recognized as a major confounding effect in genome-wide association (GWA) studies, and stringent population structure correction methods have been developed to separate genuine genotype-phenotype associations from spurious associations due to population structure (Atwell et al., 2010;Kang et al., 2008). It has been suggested that combining population structure-corrected GWA analysis of phenotypic data from common garden experiments with purely genotype-based genome scans could be a powerful approach for studying local adaptation. The argument is that overlapping adaptive signals detected both using phenotype-independent genome scans and common garden phenotype data constitute independent lines of evidence for local adaptation (de Villemereuil et al., 2016).
Perhaps the most striking example of GWA application to the study of local adaptation is the investigation of human skin pigmentation. Evolution of skin pigmentation is driven by UV irradiation, with dark skin offering protection where there are high levels of UV irradiation, and light skin promoting vitamin D synthesis under low UV irradiation. SNPs in several genes involved in melanin production were identified as associated with skin pigmentation in GWA scans, and they also generally show extremely high FST levels (Crawford et al., 2017).
Relatively few studies have applied GWA to the investigation of plant local adaptation. One large Arabidopsis study used common garden fitness data in combination with GWA and correlations with climatic factors, but did not employ genome scans for population differentiation (Fournier-Level et al., 2011). A second study examined drought stress data from a laboratory experiment and observed marginally elevated FST levels for the top SNPs (Exposito-Alonso et al., 2018), while a third showed evidence for an adaptive role of a sodium transporter by GWA analysis of ion accumulation coupled with investigation of gene expression and distances to saline soils (Baxter et al., 2010). In addition, flowering time is considered a critical adaptive trait in annual Arabidopsis and has been thoroughly examined in GWA studies (1001Genomes Consortium, 2016Atwell et al., 2010). However, human dispersal of Arabidopsis seeds has eroded the link between geographic and genetic distance, complicating interpretation with respect to local adaptation (1001Genomes Consortium, 2016Bergelson et al., 1998;Johanson et al., 2000). Most of the relatively few plant traits studied have thus not revealed clear signals of local adaptation, and the power of combining GWA analysis of common garden phenotype data with genotype-based genome scans for the study of local adaptation has not been systematically explored.
Here, we genetically characterize a set of wild Japanese Lotus accessions, identify subpopulations, characterize their demographic history, and show evidence for local adaptation in a common garden experiment. We discover very pronounced overlaps in adaptive signals between phenotype-based GWA and genotype-based FST approaches, allowing us to identify traits and associated genomic loci that were direct targets of selection by local adaptation during Lotus japonicus colonization of Japan.

Kyushu Island is the center of Lotus diversity in Japan
We carried out whole-genome re-sequencing of 136 wild Lotus accessions collected throughout Japan using Illumina paired-end reads ( Figure 1A, Supplemental file 1), identifying a set of 525,800 high confidence SNPs in non-repetitive regions (Supplemental file 2, Supplemental figures 1-2).
In order to examine the genetic relatedness of the accessions, we carried out a principal component (PCA) analysis based on the accession genotypes. We found a striking correlation between the position of the accessions in the PCA plot and their geographical origin ( Figure   1A-B), similar to that observed for studies of human populations (Bryc et al., 2010;Novembre et al., 2008). PC1 clearly separated the north and south of Japan, while PC2 most strongly distinguished accessions from the eastern and southern coastline of Kyushu Island (Figure 1A-B). For both PC1 and PC2, the northern accessions clustered more tightly than the southern accessions, and the accessions from Kyushu Island were particularly well-resolved ( Figure 1A- The tight clustering of the northern accessions suggested that these could have lower levels of genetic diversity and, consequently, that Lotus could have arrived in the south of Japan and then migrated north. To examine the migration history of Lotus in detail, we carried out a Pairwise Sequentially Markovian Coalescent (PSMC) model analysis, which infers population size history from a diploid sequence based on the extent of heterozygous regions in the genome (Li and Durbin, 2011). Since Lotus is self-compatible and most accessions had low heterozygosity levels (Supplemental file 1), the individual accessions were not well suited for PSMC analysis. Instead, we generated pseudo-diploids by merging read alignments from pairs of a subset of individuals with at least 7x read coverage and called a pseudo-diploid consensus sequence based on the merged alignments, which was used for PSMC analysis. This type of analysis can infer the time of last contact of the pair of individuals forming the pseudo-diploid as the time of the last coalescence events, measured as an abrupt increase in the estimated effective population size. We exploited this feature to roughly estimate divergence times (cessation of gene flow) for the pseudodiploid pairs as the point in time when the PSMC curve abruptly rises ( Figure 1C-D), assuming a mutation rate of 6.5×10-9 per year (Ossowski et al., 2010). For example, for accession MG126 found in northern Japan, the cessation of gene flow to other accessions from northern Japan (blue lines) are 1-5 thousand years, for Eastern Kyushu (light brown lines) 7-9 thousand years and for south and west Kyushu (dark brown lines) 10-12 thousand years ( Figure 1C-D). The estimated divergence times for all pairs of accessions showed clear clustering along the inferred colonization route ( Figure 1D) with South and Northern Japan having last contact 10-18 thousand years ago, consistent with colonization after the last ice age. For the same accession pairs, we also calculated simple genetic distances ( Figure 1E). Despite the clear overall similarity of the genetic distance and PSMC-based divergence time analysis, we found marked deviations for accessions from Tsushima Island  Figure 1D-E). These central Kyushu lines, which grow at high altitudes with sub-0°C minimum temperatures, thus likely became isolated and stopped exchanging genetic material with other accessions relatively shortly after Lotus colonisation of Japan. Altogether, our data suggests that Lotus migrated north and south from a starting point near Tsushima Island and that the north was colonized most recently, since accession pairs across a large geographical area in northern Japan display relatively short divergence times ( Figure 1D and Supplemental file 1). Our results also indicate that representatives of all major genetic clusters are found on Kyushu Island, making it the center of diversity for Lotus in Japan.

The Japanese Lotus accessions can be grouped into three subpopulations
In the PCA analysis, the northern and southern accessions clustered into different groups, indicating that distinct subpopulations might exist. fastSTRUCTURE (Raj et al., 2014) analysis, which groups accessions based on allele frequencies, suggested that a large fraction of the variation could be accounted for by three subpopulations (Supplemental figure 3).
Populations 1 (pop1), 2 (pop2), and 3 (pop3) occupied southern, central, and northern Japan, respectively, and corresponded to the three vertices in the PCA plot (Figure 2A-C). The three subpopulations also overlapped with the major clusters identified in the analysis of pairwise genetic distances and divergence times ( Figure 1D-E), indicating that they reflect a robust grouping of the accessions. This is supported by Figure 2D, which shows that the genetic distance for a given physical distance is larger for comparisons between these three populations as compared to distances within populations. Examining genetic diversity (pi) for each subpopulation, we found that the southernmost populations showed higher levels of diversity ( Figure 2E), which is consistent with the hypothesis that Lotus migrated north from southern Japan, losing genetic diversity along the way.

Contrasting subpopulations identifies strong population differentiation signals
We calculated FST for all polymorphic positions, comparing pop3 accessions against accessions with no population 3 membership in order to detect markers strongly differentiated between these two groups with relatively many characterized members. There were 5,612 genes with at least four informative SNPs, for which we calculated the mean FST per gene. The median pergene FST value was 0.24, and there were 132 genes with average FST>0.65 (Supplemental figure 4A-B). The top gene Lj6g3v1790920, had a mean FST value of 0.99 across 20 SNPs and was located in a ~20 kb genomic region showing very strong fixation of the alternative allele in pop3 relative to accessions without pop3 membership (Supplemental figure 4C).

Pop3 is locally adapted
The distinct regions occupied by each of the three subpopulations, the restriction of pop3 accessions in southern Japan to a cold, high-altitude environment in central Kyushu, and the pronounced differentiation observed for specific genomic regions, suggested that population differentiation could have been influenced by local adaptation in addition to genetic drift. To test this hypothesis experimentally, we grew the accessions in a common garden using a field site at Tohoku University (38.46°N, 141.09°E), which is located in northern Japan where pop3 individuals dominate ( We also collected phenotype data for other potentially adaptive traits. Flowering time was a likely candidate, and to maximize the chances of identifying phenotypic variation of adaptive relevance, we carried out both a field and a greenhouse flowering time experiment. In the field, we found the flowering time characteristics to depend greatly on the planting date. In 2014, the plants were sown on July 4th and transplanted to the field on August 4th. This late planting date caused a very strong differentiation between the accessions. Many of the northern accessions failed to flower in the planting year, flowering instead the following spring, and this trait was strongly correlated with geographic origin, in contrast to the greenhouse flowering time data ( Figure 3C, Supplemental figures 5-6). In the field, we also measured leaf accumulation of sodium and potassium ions, which was poorly correlated with geographic origin (Figure 3C, Supplemental figures 5-6). In addition, we quantified seed properties, which showed an intermediate correlation with geographic origin (Figure 3C, Supplemental figures 5-6).

Selection acted on winter survival during pop3 local adaptation
For our population sample, linkage disequilibrium decayed to an r 2 value of 0.2 within approximately 10 kb, indicating that mapping resolution would be sufficiently high to identify a limited sets of candidate genes in GWA scans (Supplemental figure 7). With phenotype data available for potentially adaptive traits, we proceeded with GWA analysis using a method that includes stringent correction for population structure, ensuring that significant hits represent associations that cannot be explained by the overall population structure (Atwell et al., 2010;Kang et al., 2008). We identified SNPs strongly associated with a number of the examined traits, particularly for overwintering and flowering time ( Figure 3D). Overwintering 2014 displayed the most highly significant associations, which were distributed across several separate genomic loci, indicating that winter survival is a complex trait. The 57 top SNPs Whereas pop3 was locally adapted to the field site near Tohoku, the Tohoku winter survival phenotype should not be associated with pop1 vs. pop2 differentiation. To test this, we repeated the GWA/FST overlap analysis for the pop1 vs. pop2 comparison. As expected, the SNPs associated with Tohoku overwintering traits did not show skews towards high FST values for the pop1 vs. pop2 comparison (Figure 4C, Supplemental figure 9). In contrast, the top SNPs that winter survival has played a major role in pop3 local adaptation and points to specific candidate genes with potential adaptive roles.
Three candidate genes reside within the strongly differentiated chr6 region 20,048,000-20,068,000 comprising the top GWA overwintering signals. Lj6g3v1790920 encodes a RINGtype zinc finger protein, Lj6g3v1790910 is a ubiquitin ligase, while Lj6g3v1789900 is a putative 1-aminocyclopropane-1-carboxylate oxidase involved in ethylene biosynthesis. Another prominent GWA peak on chr6 was in the region 20,961,109-20,970,404, containing a single candidate gene, Lj6g3v1887780, which is annotated as an HBS1-like protein containing a predicted Alpha/Beta hydrolase fold (IPR029058). The third prominent GWA peak on chr6, 24,760,760-24,781,399 contained three candidate genes, two putative G-type lectin S-receptorlike serine/threonine-protein kinases, Lj6g3v2130140 and Lj6g3v2130160, and Lj6g3v2130130 encoding a putative Glutamyl-tRNA(Gln) amidotransferase subunit A protein. On chr1, several SNPs with very strong GWA associations and high FST values were dispersed across a large 240 kb region (13,458,698,337). A more clearly defined peak on chr1 was located at 29,029,933-29,033,958. It contained a single candidate gene, Lj1g3v2533770, which encodes a FERONIA receptor-like kinase. According to published expression data, the overwintering candidate genes were primarily expressed in roots and nodules (Supplemental figure 10) (Høgslund et al., 2009;Mun et al., 2016;Verdier et al., 2013).  (Høgslund et al., 2009;Mun et al., 2016;Verdier et al., 2013).
Seed width, which showed less strong correlations with environmental parameters and geographical origin (Figure 3C, Supplemental figures 5-6), also showed GWA hits with a skew towards high FST values, although not as extreme as for the overwintering or flowering time traits (Figure 4B). This was due to a peak on chromosome 5 (Figure 5), where the genes Lj5g3v0526490 and Lj5g3v0526510, highly expressed in seeds and flowers (Supplemental figure 10) (Høgslund et al., 2009;Mun et al., 2016;Verdier et al., 2013), but without known functions, neighboured the top scoring SNPs.

Discussion
Here, we have inferred the population structure and demographic history of wild Lotus in Japan.
We conclude that Lotus colonized Japan from Tsushima Island after the last ice ages in a very regular fashion, reaching southern Japan before the northern part with little long-distance gene flow post colonization. It was a priori conceivable that the very different climate across Japan could be a barrier to colonization and that selection on standing variation for phenotypic traits related to temperature and daylight would be detectable. Thus, we have exploited the Lotus subpopulations to study local adaptation through a combination of genotype-based FST genome scans and phenotype-based GWA analyses using data from a common garden field experiment.
Our study illustrates the power of this combination. Alone, FST genome scans identified strong population differentiation signals, but provide no hints as to the phenotypic traits or environmental variables driving the differentiation, nor do they account for the effects of population structure. GWA analysis of phenotypic traits, on the other hand, provided strong, population structure-corrected phenotype-genotype associations for a number of traits, but offered no way of determining which, if any, of the traits were important for local adaptation.
In our study, the combination of the two approaches was powerful, because we found very pronounced skews towards high FST values for certain traits, identifying these as strong candidates for driving local adaption and supplying trait labels to previously anonymous FST peaks.
The year-to-year differences in winter survival and their impact on the GWA results illustrates For a perennial plant such as Lotus, it makes intuitive sense that surviving harsh winters could have been critical for local adaptation to cold climates. Our work supported this hypothesis and identified the most prominent genomic regions and candidate genes associated with overwintering. Remarkably, 48% of the phenotypic variation for winter survival 2014 could be explained by strongly differentiated and phenotype-associated SNPs, indicating that a limited number of loci with large effects have a strong impact on the trait. Such pronounced overlaps between top GWA and FST signals have not, to our knowledge, been reported in the plant literature, but they are reminiscent of the characteristics displayed by SNPs associated with human skin pigmentation (Crawford et al., 2017). Analogously to the tradeoff between UV protection and vitamin D synthesis, our results suggest that strong divergent selection is acting on Lotus alleles that are beneficial for perennial winter survival in cold climates, but likely have detrimental effects on plant fitness in warmer regions, thus contributing to shaping and maintaining genetically differentiated populations that are locally adapted to markedly different climates. Based on the candidate genes we identify here, the molecular genetic mechanisms that underpin these fundamental adaptations can now be investigated.

Sequencing, mapping and variant calling
All samples were sequenced using Illumina paired-end reads (ENA accession PRJEB27969).
The reads were mapped to the L. japonicus genome version 3.0 (http://www.kazusa.or.jp/lotus/) using Burrows-Wheeler Aligner (BWA) mem v0.75a with default parameters (Li, 2013) (Supplemental file 6). An average of 96% of the reads were mapped and the resulting average genome coverage was 12.5 (Supplemental file 1). Duplicates were marked in the resulting alignment (BAM) files using Picard v1.96. Then, using the Genome Analysis Tool Kit (GATK) v2.7-2 pipeline (McKenna et al., 2010), reads were re-aligned in INDEL regions using the RealignerTargetCreator and IndelRealigner functions. A preliminary list of SNPs and INDELs was then generated using the re-aligned BAM files using the GATK UnifiedGenotyper function. Subsequently, base recalibration of the scores was done using the BaseRecalibrator, and the BAM files generated were compressed using the ReduceReads function followed by variant discovery, resulting in 8,716,552 putative variants that included SNPs (7, To validate our filtering, we compared these variants to a set of independently validated SNPs genotyped using the Illumina GoldenGate system. Out of 343 validated SNPs included in the initial call set, 324 matched the genotype calls based on Gifu B-129 and MG-20 sequencing reads (Supplemental file 1). Our stringent filtering had removed 93% of the initially called SNPs and 37% of the experimentally validated SNPs, resulting in a 9.5 fold enrichment for validated SNPs in the filtered set. Assuming that, since Lotus Gifu is highly inbred, all heterozygote calls in this accession would be false positives, we estimated a false positive rate of 4%. The site frequency spectrum for the high-confidence SNPs indicated that we had undercalled rare heterozygous alleles with our conservative filtering approach, most likely because these calls would only be supported by data from a small subset of the accessions (Supplemental figure 2).

Calculation of genetic diversity, genetic and geographical distances
The --window-pi function of VCF tools version 0.1.9 (Danecek et al., 2011) was used to calculate genetic diversity, with the chromosome size as the window size (Supplemental file 6). Pairwise genetic distances were calculated by assigning a score of +2 for homozygous differences, +1 for heterozygous differences and 0 for identical alleles at each polymorphic site and then dividing by the total number of polymorphic sites. Geographical distances were calculated using the distVincentyEllipsoid function from the R package geosphere. See https://github.com/ShahNiraj/JapanHistory.

Principal component and linkage disequilibrium analysis
PCA analysis was performed with EIGENSOFT v6.0beta (Price et al., 2006;Pritchard et al., 2000) using the smartpca command with default parameters (Supplemental file 6). PCA plots were generated using R version 3.4.3 (Supplemental file 7). Linkage disequilibrium (LD) was calculated using VCFtools v0.1.9 (Danecek et al., 2011) for all the SNP-pairs with a maximum physical separation of 50,000 base pairs. Chromosome 0 was ignored in the analysis. A randomly selected set of 200,000 SNP-pairs was plotted using R version 3.4.3.

Population structure analysis
To infer population structure, a Bayesian model based clustering analysis was conducted with fastSTRUCTURE version 1.0 (Raj et al., 2014) for 201,694 non-repetitive SNP markers with minor allele count > 5 in 136 accessions. The analysis was run with a number of clusters (K) ranging from 1 to 8 with default parameter settings.

PSMC analysis
For estimation of population size history, the Pairwise Sequentially Markovian Coalescent (PSMC) model was used (Li and Durbin, 2011). Accessions with more than 10x coverage were subsampled to 10x coverage. Diploid consensus sequences were made with SAMtools v1.3 and BCFtools v1.3 and SNPs which had mapping quality >50 and read depth between 5 and 100 were included in the analysis. PSMC analysis was run with each pseudodiploid consensus sequence using the recommended settings (Li and Durbin, 2011). The mutation rate per nucleotide in A. thaliana, 6.5 × 10 -9 year/site (Ossowski et al., 2010)  For seed phenotyping, seeds were collected from the plants grown in a greenhouse (Trige Aarhus, Denmark) in 2014. Seed images were obtained using Epson Perfection 600 Photo scanners. The image files were analyzed using SmartGrain (Tanabata et al., 2012) to count total number of the seeds and measure area size, length, width, length-width ratio, circularity of each seed.

F ST and GWA analysis
The --weir-fst-pop function from VCF tools version 0.1.9 (Danecek et al., 2011) was used to calculate fixation indices (FST) (Supplemental file 6). Prior to GWA analysis, missing genotype data points were imputed using Beagle version 5.0 (Browning et al., 2018), and the VCF file was converted to a simple genotype-only format (Supplemental files 6 and 8). GWA scans were carried out using a python implementation of EMMA (https://app.assembla.com/spaces/atgwas/git/source) (Atwell et al., 2010;Kang et al., 2008) with a minor allele count cutoff of 8 (Supplemental file 6). The software GCTA-GREML was used to estimate the proportion of phenotypic variance explained by significant SNPs. For selected traits, the method was applied to capture variance explained by a subset of associated SNPs GWA p-values≤5e-5. SNPs with a minor allele count below 10 were not included.

Data availability
Accessions read data have been deposited in the European Nucleotide Archive (ENA) database under primary accession number PRJEB27969 and secondary accession number ERP110110.
Genotype data is available for online GWA analysis through a website (https://lotus.au.dk/gwas/) based on the GWAPP platform (Seren et al., 2012