Population genetic structure of Oryza sativa in East and Southeast Asia and the discovery of elite alleles for grain traits

We investigated the nuclear simple sequence repeat (SSR) genotypes of 532 rice (Oryza sativa L.) accessions collected from East and Southeast Asia and detected abundant genetic diversity within the population. We identified 6 subpopulations and found a tendency towards directional evolution in O. sativa from low to high latitudes, with levels of linkage disequilibrium (LD) in the 6 subpopulations ranging from 10 to 30 cM. We then investigated the phenotypic data for grain length, grain width, grain thickness and 1,000-grain weight over 4 years. Using a genome-wide association analysis, we identified 17 marker-trait associations involving 14 SSR markers on 12 chromosome arms, and 8 of the 17 associations were novel. The elite alleles were mined based on the phenotypic effects of the detected quantitative trait loci (QTLs). These elite alleles could be used to improve target traits through optimal cross designs, with the expected results obtained by pyramiding or substituting the elite alleles per QTL (independent of possible epistatic effects). Together, these results provide an in-depth understanding of the genetic diversity pattern among rice-grain traits across a broad geographic scale, which has potential use in future research work, including studies related to germplasm conservation and molecular breeding by design.

programs 4 . Grain weight is determined largely by the grain size, which mainly includes the grain length (GL), grain width (GW) and grain thickness (GT). In addition, GL, GW and GT are important parameters for evaluating the appearance of rice, which directly relates to the trade value of rice.
Grain length, width and thickness and 1,000-grain weight are quantitative traits controlled by multiple genes. To date, many quantitative trait locus (QTL) studies of grain traits have been reported based on bi-parental family populations. The detected QTLs that control GL, GW and TGW were distributed in all twelve chromosomes [5][6][7][8][9][10][11][12][13][14][15] , whereas the detected QTLs that control GT were located on chromosomes 3, 5, 6, 9, 10 and 11 14,15 . Recently, several genes have been shown to control grain size, including GS3 16,17 , DEP1 18 and GL3 19 , which regulate grain length and grain weight, and GW2 20 (encoding a RING-type E3 ubiquitin ligase), qSW5 21 , GS5 22 (encoding a putative serine carboxypeptidase) and GW8 23 (encoding a transcription factor with squamosa promoter binding protein-like (SBP) domain), which regulate grain width. Despite the success of QTL analyses, the major limitation of linkage mapping is that only two alleles at any given locus can be studied in biparental crosses of inbred lines.
Association mapping based on linkage disequilibrium (LD) has recently emerged as an alternative approach to detecting QTLs by taking into account marker-trait associations, and it enables researchers to exploit the natural diversity of the genome and locate valuable genes 24 . Association mapping has been widely applied to the mining of excellent alleles in humans 25 ; animals such as chickadees (Poecile atricapillus) 26 and ptarmigan (Lagopus mutus) 27 , and plants, such as rice (O. sativa) [28][29][30] , maize (Zea mays) 31 , wheat (Triticum aestivum) 32 , soybean (Glycine max) 33 and foxtail (Setaria viridis) 34 .
Recent many studies have reported the discovery of elite alleles for grain traits in rice by using single nucleotide polymorphism (SNP) markers in association analyses. These studies found 19 significant associated loci for grain length, which were distributed on chromosomes 1, 3, 4, 5, 6, 7 and 12; 16 significant associated loci for grain width, which were distributed on chromosomes 2, 5, 7, 8, 10, 11 and 12; and 7 significant associated loci for grain weight, which were distributed on chromosomes 1, 2, 4, 7 and 8 28,35,36 . However, the SNPs showed a high false positive rate 34,35 . Simple sequence repeat (SSR) loci are particularly useful for the study of population structure because their high level of allelic diversity facilitates the detection of the fine diversity more efficiently than an equal number of restriction fragment length polymorphisms (RFLPs), amplified fragment-length polymorphisms (AFLPs), or SNP loci 37 .
The objectives of this study are to (1) evaluate the population structure and genetic diversity of Asian rice, (2) detect the extent of LD between pairs of SSR markers in an entire rice genome, (3) detect the QTLs controlling grain traits and mine elite alleles and (4) explore the optimal cross design for cultivar improvement using population genetic analyses and association mapping with a set of 532 rice accessions using SSRs in four environments.
Population structure and genetic relatedness. An analysis of the model-based population structure provided evidence of a significant population structure in the 532 rice accessions and identified the highest likelihood value at K = 6 for all five replicates (five runs for each K) (Fig. 1a, Supplementary  Fig. 1). The population structure data based on the Q matrix for each accession are summarized in Supplementary Table 2. A neighbour-joining tree of the 532 accessions was constructed based on Nei's genetic distance (Fig. 1b), and the results were consistent with the results from the Structure analysis.
The non-admixed accessions in each subpopulation were determined using the Q-matrix assignment of above 0.9. The number of non-admixed accessions from POP1 to POP6 was 94, 54, 147, 65, 69 and 45. The 94 accessions in POP1 were all from Vietnam; the 54 accessions in POP2 were mainly from northeastern China and Japan; the 147 accessions in POP3 represented landraces from the Taihu Lake valley; the 65 accessions in POP4 were mainly from northern China (northern Jiangsu, Anhui, Henan, and Shandong provinces and Tianjin City); the 69 accessions in POP5 were modern improved varieties mainly from the Taihu Lake valley; the 45 accessions in POP6 were mainly from South China (Taiwan, Yunnan and Hunan provinces); and the remaining 58 accessions showed admixed ancestry, and they were excluded in subsequent analyses.
Genetic relatedness analysis indicated that the accessions in this study were distantly related ( Supplementary Fig. 2), with greater than 80% of the kinship coefficient values at less than 0.05, 5.6% ranging from 0.05-0.10 and the remaining 12.4% showing various degrees of genetic relatedness. This result suggests that there was weak or absent relatedness between the pairwise rice accessions. Based on the results of the relatedness analysis, a K matrix was constructed for the association analysis.  Table 3). These results indicate a high degree of genetic differentiation across the six subpopulations.    Table 4). We used the D' value corresponding to intrachromosomal SSR loci as well as the genetic distance in each subpopulation to draw the attenuation map. Figure 3 shows that the D' values decay with increasing genetic distance (cM). Regression analysis between the D' values and genetic distances of the syntenic marker pairs revealed that the six subpopulation genomes fit the equation y = b lnx + c. The minimum distances of LD decay for POP1-POP6 were 13.6 cM, 23.7 cM, 25.2 cM, 23.4 cM, 16.5 cM and 25.3 cM. Apparently, POP1 had the highest decay velocity with the shortest decay distance, whereas POP3 and POP6 demonstrated the lowest decay velocity among the 6 subpopulations.
Phenotypic distributions and correlations between multiple traits. The phenotypic data of the GW, GT and TGW traits in the studied population followed a normal distribution, whereas GL followed a skewed distribution based on the skewness values and kurtosis statistics ( Table 2). The largest and smallest values for the GL, GW, GT and TGW traits over 4 years were approximately 13 and 6 mm, 4 and 2 mm, 3 and 1.7 mm and 33 and 16 g, respectively, and they had average values of 8 mm, 3 mm, 2 mm and 25 g in the population ( Table 2). A two-way ANOVA showed that significant differences occurred among the tested cultivars (P < 0.01), indicating a large amount of genetic variation in the 532 cultivars. The average broad-sense heritability value for the GL, GW, GT and TGW traits over 4 years was 98.26%, 95.86%, 96.31% and 95.53%, respectively.
GL was correlated negatively with GW and GT but not significantly correlated with TGW, and the correlations of TGW with GW and GT and of GW with GT were significantly positive (Supplementary Table 5).

Significant marker-trait association loci detected across the entire population. A marker-trait
association analysis based on a mixed linear model (MLM) revealed that eight markers located on chromosomes 3, 4, 6, 8, 9 and 11 were associated with GL ( Table 3). The range of phenotypic variation explained (PVE) was from 2.4% to 13.1%. RM335_Chr 4, which resides on 5.4 cM, had the maximum PVE for GL, which was 12.9% in 2010, 12.8% in 2011, 13.1% in 2012 and 10.8% in 2013 (Table 3). Four markers distributed on three chromosomes were associated with GW (Table 3) Two markers distributed on chromosomes 2 and 10 were associated with GT (Table 3), of which RM573_ Chr 2 had the highest average PVE of 6.4% over 4 years. Three markers distributed on chromosomes 1 and 6 were associated with TGW (Table 3). More than 60% of the markers were located on chromosome 6, and the corresponding PVE ranged from 4.5% to 6.5%. RM345 had the maximum PVE, which was 5.9% in 2010, 5.1% in 2011, 6.5% in 2012 and 6.0% in 2013. Three markers were synchronously associated with two traits, with RM348 associated with GL and GW, RM345 associated with GL and TGW and RM528 associated with GW and TGW.
For GW, a total of 22 significant marker-trait association loci were detected in the 6 subpopulations, and the values from POP1 to POP6 were 5, 1, 6, 4, 4 and 2. The loci RM348, RM528 and RM6863 were simultaneously detected in three different subpopulations, and the loci RM1019, RM528 and RM209 were simultaneously detected in two different subpopulations (Supplementary Table 6). Among the 6 aforementioned loci, RM348, RM528, RM1019 and RM6863 were also detected across the entire population.
For GT, only one locus was detected in each subpopulation. The locus RM573 was detected in four different subpopulations, and RM269 was detected in two different subpopulations (Supplementary Table  6). Both RM573 and RM269 were also detected across the entire population.
Discovery of elite alleles. In this study, the alleles with positive effects are considered elite alleles for all four grain traits measured. A summary of the elite alleles and their typical carrier materials is shown in Supplementary Table 7. The total numbers of elite alleles for GL, GW, GT and TGW detected across the entire population were 17, 10, 6 and 8, respectively. The allele RM335-155 bp showed the largest phenotypic effect (1.45 mm) for GL, and the typical carrier accession was Yuedao 32 (Supplementary Table  7). The allele RM348-145 bp showed the largest phenotypic effect (0.35 mm) for GW, and the typical carrier accession was Hongmangjing (Supplementary Table 7). The allele RM269-170 bp showed the largest phenotypic effect (0.14 mm) for GT, and the typical carrier accession was Wanhuangdao (Supplementary Table 7). The allele RM528-185 bp showed the largest phenotypic effect (0.67 g) for TGW, and the typical carrier accession was Xiangjing 9407 (Supplementary Table 7).
Optimal cross designs for improving grain traits. Based on the number of elite alleles that could be substituted into an individual plant and the expected phenotypic effects of elite alleles that could be pyramided, the top five cross combinations for improving GL, GW, GT and TGW were proposed (Supplementary Table 8). The elite alleles carried by the parents in excellent crosses are listed in Supplementary Table 9. Figure 4 shows the five parents in the superior cross for each trait. Certain accessions were found repeatedly in the supposed parental combinations (e.g., Yuedao 21 and Cuyingwanyangdao emerged four times in the combinations for GL and GW), indicating that these accessions possess unique elite alleles.

Discussion
The population genetic structure analysis showed that the six subpopulations occurred in the 532 accessions using the model-based method (Structure) and Nei's genetic distance method. The results showed that POP1 consists of accessions collected from Southeast Asia (Vietnam, latitudes lower than 17°N) and belongs to the subspecies indica; and POP2 consists of accessions collected from East Asia (northeastern China and Japan, latitudes higher than 45°N) and belongs to the subspecies japonica. Indica rice is primarily grown in lowland regions throughout tropical Asia, whereas japonica is typically found in temperate East Asia, upland areas of Southeast Asia, and high altitude regions in South Asia. We found that the average standardized individual SSR allele in POP1 and POP6 consisted primarily of low-latitude accessions from Vietnam and southern China, respectively, and these alleles were significantly smaller than those of POP2 and POP4, the high-latitude accessions from northeastern China and northern China (mainly Japonica subspecies). This result suggests a tendency of directional evolution in O. sativa from low (subspecies indica) to high latitudes (subspecies japonica). No remarkable difference between POP6 (mainly modern indica cultivars from southern China) and POP1 could explain the short geographical distance between the two subpopulations. The average individual SSR allele in POP3 (consisting primarily of japonica landraces from the Taihu Lake valley) was significantly shorter than those in POP2 and POP4, and this may reflect the directional evolution of O. sativa from low latitudes (warmer climate) to high latitudes (cooler climate) if the independent domestication of indica and japonica were true 2 .
In the present study, we define a rice accession with local adaptation, long history of cultivation, tall stature, susceptible to pathogen and pest epidemic currently, lack of formal genetic improvement as a rice landraces, which is somewhat different from the definition of landrace proposed by Villa et al. 38 , since rice is a strictly self-pollinated crop, and genetic diversity of intra-accession do not exist.
The proportion of rare alleles (frequency less than 5%) was 36.25% within the 2698 identified alleles. The high ratio of rare alleles in the present study might have been caused by the wide distribution of latitudes of the accessions. As the rice cultivation area expands from south to north and from the plains to plateau, new alleles appeared and certain original alleles disappeared, resulting in the emergence of accessions with rare alleles.
The average number of alleles per locus was 10.46 among the 532 accessions genotyped with 258 markers. The allele number per locus is higher than that reported by Garris et al. 37 , Agrama et al. 39 , and Vanniaraja et al. 40 but lower than that reported by Borba et al. 41 and Li et al. 42 . The average PIC value in this study was 0.7042, which is the highest value among previous studies for rice populations 41,43 , with an exception of the PIC value of 0.71 reported by Li et al. 42 and PIC value of 0.75 reported by    Borba et al. 44 . The wide range of genetic diversity and manageable number of accessions in this study indicate that this is one of the best data sets for mining valuable genes in rice. An association analysis that does not consider population structure would have a high rate of Type I errors (false positive). In this study, the 532 O. sativa accessions were classified into six model-based subpopulations based on an ancestry analysis (Fig. 1a). The dendrogram (Fig. 1b) based on Nei's genetic distance was consistent with the population structure determined by this collection. Therefore, the results obtained from these two separate analyses are consistent. In addition, the population structure was dependent on geographic origin, such as the accessions from Vietnam, which were classified into POP1, and the accessions from northeastern China, most of which were classified into POP2. The distinct geographic origins that correspond to the different ecological environments could be partially responsible for the observed genetic differentiation, which in turn contributes to the different responses to environmental factors and rare alleles in the germplasm accessions.
The significant F ST values among the subpopulations (Table 1) suggest a divergence between these subpopulations, and heterosis might be observed for the crosses between the accessions to improve yield.
The number of markers required to cover the genome in an association study is determined by the extent of LD. To date, varying results have been reported for LD patterns by researchers in rice. Olsen et al. 45 and Mather et al. 46 reported LD decay occurring at an approximately 1 cM distance using DNA sequencing, whereas Dang et al. 30 , Agrama et al. 39 , and Vanniarajan et al. 40 reported LD decay at 10-80 cM, 20-30 cM and 20-30 cM distances, respectively, using SSR markers. Compared with previous studies, the LD values for POP1-POP6 in the present study were similar to those reported by Vanniarajan et al. 40 . The variation in LD patterns across chromosomal regions observed at the subpopulation level suggests that the extent of LD varies among different rice accessions 39 , different markers 45 and different genomic regions 46 . In this study, POP1 presented the fastest decay velocity, and it was followed by PO5, POP2, POP3, POP4 and POP6. The fast observed decay might have been caused by frequent artificial hybridization that is used in breeding because of the short day length in the lower latitudes of the Northern Hemisphere. The materials in POP2, POP3, POP4 and POP6 were mostly superior and modern accessions, and the chance for improvement was limited; therefore, they presented a slow decay velocity.
In this study, we identified seventeen markers associated with grain traits using the entire set of accessions, including 8 associated with GL, 4 associated with GW, 2 associated with GT and 3 associated with TGW. Nine of the 17 associations were in regions where the QTL associated with the given trait had been identified (http://www.gramene.org/), and they are listed in Supplementary Table 10 (Table 3). For the 3 new loci in GW, the PVE averaged ranged from 2.4% to 16.1% over 4 years. The marker RM573, which is located on chromosome 2, was a new locus associated with GT, and it showed a PVE from 5.0% to 8.0% over 4 years. These results might increase the descriptive power of the QTLs associated with the grain traits in rice and provide useful information for further fine mapping or cloning. In addition, certain loci were mapped at close to gene resolution (e.g., RM6976 close to GS3), indicating that association analyses of rice accessions can provide an effective approach for gene identification.
After comparing the results of the significant marker-trait loci between the subpopulations and the entire population, two phenomena were observed. The first phenomenon is that the PVE for the same significant marker-trait loci detected in the subpopulations was larger than that detected across the entire population (Table 3, Supplementary Table 6). For instance, the PVE of the locus RM345 for GL detected across 4 years was 23.1%, 21.2%, 32.5% and 17.9% in POP4 and 30.2%, 31.1%, 31.0% and 21.5% in POP6, whereas the PVE of the locus RM345 for GL detected across the entire population for 4 years was 4.8%, 3.9%, 6.1% and 6.3% (Table 3, Supplementary Table 6). The second phenomenon is that the 17 significant marker-trait loci detected across the entire population were not detected in a single subpopulation but in different subpopulations. For example, the locus RM6863 for GW was detected in POP2, POP3 and POP5, but not in POP1, POP4 and POP6, and the locus RM269 for GT was detected in POP1, POP4, POP5 and POP6 but not in POP2 and POP3 (Table 3, Supplementary Table 6). These phenomena might be interpreted as the smoothing effect caused by population size between the entire population and subpopulations in the marker-trait association analysis. The smoothing effect means that the trait difference, which was significant within a subpopulation, may become insignificant when the subpopulation accessions are integrated into the entire population.
Comparing the results of the significant marker-trait loci among the six subpopulations, we found that certain loci were simultaneously detected in more than 2 subpopulations, but no single locus was simultaneously detected in all 6 subpopulations. This phenomenon might be interpreted as the difference in genetic differentiation among the subpopulations.
To maintain information integrity, the results of significant marker-trait association loci obtained from the entire population were used for mining elite alleles. Seventeen elite alleles for GL were mined at the eight identified loci. Among them, 17.6% of the elite alleles were carried by accessions collected from northeastern China, 29.4% were carried by accessions from central China, and 52.9% were carried by accessions from Vietnam. Similarly, certain unique elite alleles were identified in various accessions for GW, GT and TGW, and these results suggest that during the process of rice evolution from the south to north, certain alleles were lost in the process of natural or artificial selection, whereas others were retained in modern cultivars or appeared for the first time in modern cultivars. For example, RM345-165 bp was common for Vietnam accessions but not found in northeastern China accessions, whereas the allele RM345-150 bp was found only in northeastern China accessions.
Correlations between the measured traits were observed, and GL was significantly negatively correlated with GW but positively correlated with TGW. We identified one SSR marker co-associated with GL and TGW in which the allele RM345-150 bp increased the phenotypic effect values of GL and TGW simultaneously (Table 3, Supplementary Table 7). We also detected one SSR marker co-associated with GL and GW in which the alleles RM348-130 bp, RM348-145 bp and RM348-170 bp increased with GW but decreased with GL (Table 3, Supplementary Table 7). These co-associated alleles have the correct sign with respect to trait correlations, and these data illustrate the genetic basis of trait correlations. In addition, if grain length elongates rapidly, then it will likely consume more carbohydrates in the endosperm, thereby resulting in less supply for the grain width, and vice versa.
For the GL trait, the broad-sense heritability averaged across four years was 97%, which was considerably high. Thus, the expected results for improving GL could be obtained by marker-assisted selection. Among the eight SSR-associated markers detected for GL, RM335_Chr 4 had the largest PVE (12.9%, 12.8%, 13.1% and 10.8% in 2010-2013) and among the two elite alleles found at this marker locus, RM335-155 bp had the largest phenotypic effect value (1.45 mm). This elite allele was carried by 45 accessions, and Yuedao 32 was the typical carrier material. Thus, GL could be improved greatly by the crosses described in Supplementary Table 8.
For the GW trait, the broad-sense heritability averaged across four years was 95%. Among the four SSR-associated markers detected for GW, RM1019_Chr 8 had the largest PVE (15.2%, 12.1%, 8.0% and 9.0% in 2010-2013), and among the four elite alleles found at this marker locus, RM1019-150 bp had the largest phenotypic effect value (0.28 mm). This elite allele was carried by 64 accessions, and Hongmangjing was the typical carrier material. Thus, GW could be improved greatly by the crosses described in Supplementary Table 8.
For the GT trait, the broad-sense heritability averaged across four years was 93%. Among the two SSR-associated markers detected for GT, RM573_Chr 2 had the largest PVE (5.6%, 5.0%, 6.9% and 8.0% in 2010-2013), and among the four elite alleles found at this marker locus, RM573-220 bp had the largest phenotypic effect value (0.08 mm). This elite allele was carried by 32 accessions, and Si4263 was the typical carrier material. Thus, GT could be improved greatly by the crosses described in Supplementary  Table 8.
The broad-sense heritability averaged across four years for the TGW trait was 94%, which was also high. Among the three SSR markers associated with TGW, RM345_Chr 6 had the largest PVE (5.9%, 5.1%, 6.5% and 6.0% in 2010-2013, respectively), and four elite alleles-RM345-105 bp, RM345-150 bp, RM573-155 bp and RM345-160 bp-were found at this marker locus. Thus, TGW might be improved by the crosses listed in Supplementary Table 8.
If the target trait must be further improved, the best elite alleles could be pyramided into one cultivar using multi-round crossing. For example, there were 17 elite alleles detected for GL, and the 8 best elite alleles could be pyramided or substituted by the combination of the accessions Yuzhenxiang, Yuedao 82, Yuedao 21, Fengyouwan 8hao, and Nongxiang 25 (Supplementary Table 8).

Materials and methods
Plant materials. A total of 532 rice accessions from the geographical regions of East and Southeast Asia were used for the association mapping, including 121 from Vietnam (17°N -23°N), 400 from China and 11 from Japan (20°N -54°N). Detailed information on their origins is summarized in Supplementary  Table 2.
Field planting and trait measurement. The 532 accessions were planted in a paddy rice field at the Nanjing Agricultural University Experimental Farm, Nanjing, China from May to October in 2010, 2011, 2012 and 2013. The field experiments in the four consecutive years were treated as four independent environments. The field trials followed a completely randomized block design with two replicates per year. Each plot contained five rows, with 8 plants in each row, 17 cm between plants within each row and 20 cm between rows. The field management followed standard agricultural practices. At the mature stage, 5 normally developed plants from the middle of the plots were harvested each year and dried under natural conditions for the trait investigation. Fully filled grains were used for measuring the grain length (GL/mm), grain width (GW/mm), grain thickness (GT/mm) and 1,000-grain weight (TGW/g). Ten randomly chosen grains (after removing awns) from each plant were lined up length-wise along an electronic digital Vernier calliper to measure the grain length and then arranged by breadth to measure the grain width. The individual grain thickness was determined according to the maximal values for each grain using a Vernier calliper, and the values were averaged and used as the measurements for the plants. The 1,000-grain weight was calculated based on 1,000 grains. SSR marker genotyping. Genomic 48 and McCouch et al. 49 , 258 SSRs scattered on 12 chromosomes were selected. The primers were synthesized by Shanghai Generay Biotech Co. Ltd., Shanghai, China. Each 10 μ l PCR reaction contained 10 mM Tris-HCl (pH 9.0), 50 mM KCl, 0.1% Triton X-100, 1.5 mM MgCl 2 , 0.5 nM dNTPs, 0.14 pM forward primers, 0.14 pM reverse primers, 0.5 units Taq polymerase, and 20 ng genomic DNA. The DNA amplification was performed using a PTC-100 TM Peltier Thermal Cycler (MJ Research TM Incorporated, USA) under the following conditions: 1) denaturation at 94 °C for 5 min; 2) 34 cycles of denaturation at 94 °C for 0.5 min, annealing at 55-63 °C for 1 min, and extension at 72 °C for 1 min; and 3) final extension at 72 °C for 10 min. The PCR products were run on an 8% polyacrylamide gel at 150 V for 1 h and visualized using silver staining. One pair of SSR markers detected one locus, and each polymorphic band at the same marker locus in the population was recorded as one allele. After screening the polyacrylamide gel electrophoresis (PAGE) products, the molecular weight of each band was calculated by the software Quantity One.
Data analysis. All of the basic statistical analyses were performed using the SAS package (SAS Institute Inc., Cary, NC, USA). Broad-sense heritability (H 2 B ) was calculated according to the natural population through an analysis of variance using the formula H 2 B = σ 2 g / (σ 2 g + σ 2 e /n), where σ 2 g is the genetic variance, σ 2 e is the error variance, and n is the number of replicates. The PIC value was used to measure the allele diversity at a locus, and the allele number per locus and genetic diversity were calculated using PowerMarker version 3.25 software 50 to quantify the genetic variation within the 532 accessions. Nei's 51 distance was also calculated and used for the unrooted phylogeny reconstruction using the neighbour-joining method as implemented in PowerMarker with the tree viewed using MEGA 4.0 52 .
Levels of genetic variation within and among populations identified by the subpopulation analysis were estimated from allelic frequencies using AMOVA 53 . The software Arlequin 3.01 54 was used to perform the AMOVA procedure using SSR and standard multi-locus frequency data.
The software SPAGeDi 55 (Spatial Pattern Analysis of Genetic Diversity) was used to calculate the pairwise relatedness coefficients (K, kinship matrix) to estimate the genetic relatedness among individuals, and the negative value of kinship set to zero.
We calculated the average individual allele size of the SSRs as the mean of the standardization size of the 258 SSR loci following the method reported by Vigouroux et al. 56 .
Population structure analysis. The optimum number of populations (K) was selected after five independent runs of a burn-in of 50,000 iterations followed by 100,000 iterations for each value of K (from 2 to 10) using Structure version 2.2 57 . The mean log-likelihood value over 5 runs at each K value was used. If the mean log-likelihood value reached the highest value in the model parameter K, a suitable value of K was determined. The non-admixed individuals (accession that could be clearly assigned to only one group) in each genetic subpopulation were determined using the Q-matrix assignment of above 0.9. Linkage disequilibrium. LD was estimated by the D' value 58 between all pairs of SSRs with 1,000 permutations and calculated using TASSEL 3.0 software 59 . Rare alleles with an allele frequency of 5% or less were removed from the dataset before the association analysis. According to the level of LD and genetic distance among markers with intrachromosomal combinations, the regression equation of LD with the genetic distance changes was calculated by the regression analysis. The LD decay plot was drawn to observe the relation between LD and genetic distance.
Association mapping. The associations between traits and markers were calculated using an MLM as described in TASSEL 3.0 59 . An MLM can significantly reduce spurious marker-trait associations (Type I error showing false positives) resulting from the population structure because Q and K matrices are used as covariants in the analysis. The Q matrix was adapted from the analysis results obtained from Structure 2.2. The K matrix (kinship matrix) was obtained from the result of the relatedness analysis using SPAGeDi. A false discovery rate (FDR) of 0.05 was used as a threshold for significant associations using the Benjamini and Hochberg 60 correction method. Using the identified association locus, the 'null allele' (non-amplified allele) was used to determine the phenotypic effects of other alleles 32 .