Identification of domestication-related loci associated with flowering time and seed size in soybean with the RAD-seq genotyping method

Flowering time and seed size are traits related to domestication. However, identification of domestication-related loci/genes of controlling the traits in soybean is rarely reported. In this study, we identified a total of 48 domestication-related loci based on RAD-seq genotyping of a natural population comprising 286 accessions. Among these, four on chromosome 12 and additional two on chromosomes 11 and 15 were associated with flowering time, and four on chromosomes 11 and 16 were associated with seed size. Of the five genes associated with flowering time and the three genes associated with seed size, three genes Glyma11g18720, Glyma11g15480 and Glyma15g35080 were homologous to Arabidopsis genes, additional five genes were found for the first time to be associated with these two traits. Glyma11g18720 and Glyma05g28130 were co-expressed with five genes homologous to flowering time genes in Arabidopsis, and Glyma11g15480 was co-expressed with 24 genes homologous to seed development genes in Arabidopsis. This study indicates that integration of population divergence analysis, genome-wide association study and expression analysis is an efficient approach to identify candidate domestication-related genes.

T he transitions from vegetative to reproductive growth (days to flowering) and from generation to generation through seeds are important stages of the plant life cycle. Flowering time reflects the adaptation of a plant to its environment, and the time required to mature varies widely among cultivars 1 . Seeds are important in the reproduction and spread of flowering plants, and seed size partly reflects the efficiency of plant production 2 . Both flowering time and seed size are important traits involved in domestication, a process accompanied by reduction in genetic diversity and loss of important traits preserved in wild relatives.
Morphological, physiological and molecular markers have been widely used to measure the genetic diversity of wild and cultivated plants in rice 3 , soybean 4 , and wheat 5 , to deduce the geographic regions of domestication, and to screen for breeding material. Linkage and association studies have identified quantitative trait loci (QTL) associated with domestication-related (DR) traits in various plants 6,7 and animals 8 , such as the lateral branching locus teosinte branched1 (tb1) in maize 9 , the seed shattering locus qSH1 in rice 10 , and the milk-production locus DGAT1 in dairy cattle 11 . Numerous DR genes have also been identified, including GW8 in rice 12 , Q in wheat 13 , vrs1 in barley 14 , and RYR1 in pig 15 . As whole genome sequences from almost all major plants have become available, substantial progress has been achieved, including 1) Hundreds of DR genes have been identified by comparative analysis of genomes in plants 16 and animals 17 ; 2) Candidate genes resulting from selection were detected by comparative analysis and functional tests 7 , for example, the genes LOC_Os05g20290 and LOC_Os08g40710 in rice 18 and the genes GRMZM2G448355 and abph1 (GRMZM2G035688) in maize 19 ; and 3) The molecular mechanisms underlying DR traits have been reported 20,21 . In addition, the ancient DNA and genomes of fossils Significances of all the SNPs in the above three analyses were marked by red (significant) and blue ( have contributed to the histories of domestication and evolution in horses 22 , cattle 23 , and Arabidopsis thaliana 24 . However, little is known about the loci/genes underlying DR traits in soybean. The soybean plant originated in China and was first domesticated by Chinese farmers between 6,000 and 9,000 years ago 25 . The modern domesticated soybean (Glycine max) is an economically important crop because it has high protein and oil contents, can fix nitrogen through microorganisms in the soil 26,27 , and is a model plant for legume research. The cultivated soybean (G. max) differs from the wild soybean (G. soja) in many traits, for example, the cultivated  1,6,7,9,11,12,13,15,16  soybean forms flowers earlier and has larger seeds than the wild soybean 28 . Previous research has led to the discovery of many QTL correlated to DR traits such as seed size [29][30][31][32][33] and other traits (http:// www.SoyBase.org). Some genes controlling DR traits in soybean have also been discovered, e.g., E1-E4 34-37 , FLC, VRNA, ELF8, PHYE, PHYA 16 , CDF3, VRN1, SVP, AP3 and PIF3 for flowering time 38 , and Dt1 for determinate growth habit 39 . Although the genes and their molecular mechanisms for some DR traits in soybean have been investigated 38,39 , the genes/loci underlying many other DR traits remain to be addressed.
In this study, restriction-site-associated DNA (RAD) tags from 14 wild, 153 landrace, and 119 bred accessions were sequenced, and the sequence variants were analyzed to detect DR loci by testing the independence between the SNPs and soybean evolutionary classes (wild, landrace, and bred) and comparing the genetic diversity between the wild and cultivated soybeans. Genome-wide association of the detected DR loci with DR traits (flowering time and seed size) were also studied. Candidate genes predicted to be involved in these two traits were pinpointed using comparative genomics technology. Co-expression analysis for individual candidate genes was also conducted.

Results and Discussion
Phenotypic characteristics of flowering time and seed size. Distinct difference of these traits between G. soja and G. max suggested that these traits are domestication-related. Flowering time and seed size were also considered as DR traits in other reports 25,28,[40][41][42] , although domestication traits in soybean include more than these two traits, such as indeterminate habit and pod dehiscence.
Detection and distribution of domestication-associated loci. Based on the sequence obtained from 286 accessions through the RAD-seq genotyping approach, a total of 106,013 SNPs were identified. x 2 tests of independence between the SNPs and evolutionary classes (wild, landrace, and bred) of soybean showed that 198 SNPs were significant at P-value # 4.72 3 10 27 (Fig. 1a). A U-test determined that the allelic frequency for 72 of the 198 SNPs were significantly different between the wild and cultivated classes (Fig. 1b). The fixation index (F ST ) analysis demonstrated that 48 of the 72 SNPs had an F ST value greater than 0.45 ( Fig. 1c; Table S1).
A total of 12 DR loci were identified in 11 genes with 4 loci in the coding regions of the genes. Of the 11 genes whose functions were previously annotated in the Pfam and NCBI non-redundant databases, three and two genes were located on chromosomes 11 and 15, respectively. The common allele for each of the 12 loci was different in the bred and wild soybean, and common alleles for each of the 3 DR loci on chromosome 11 were the same in both the landrace and wild soybeans.
Genome-wide association studies (GWAS) for flowering time and seed size. GWAS for flowering time (first and full) and seed size (SL, SW and 100 SW) was performed based on a dataset containing 286 accessions genotyped with 55,052 SNPs (Fig. S1). Because our purpose was to investigate the association of the DR loci with the two DR traits, we only analyzed the association of the DR loci listed in Fig. 2 and Table 1.
A total of 8 QTL, each explaining 0.91-16.85% of the total phenotypic variance of seed size, were identified (P-values of 3.24E-15 to 7.76E-7). These QTL were mapped to genomic positions 10957940, 11100801, 11120981, 11121256, and 11121307 bp on chromosome 11 and 30207191, 30207175, and 30210113 bp on chromosome 16 (Fig. 2c-e). Among the five SNP loci on chromosome 11, the last four loci, which were close to the DR locus at 11111516 bp (,10 kb), were associated with the three seed size traits. The DR locus at 10957940 bp was associated with SW in 2008 to 2010 and with SL in 2009. Among the three loci on chromosome 16, one locus at genomic position 30207191 bp, which was close to the DR locus 30207175 bp, was correlated with SL and 100 SW in 2011 and 2012 and with SW in 2011. The DR locus at 30207175 bp was associated with 100 SW in 2012, and the DR locus at genomic position 30210113 bp was associated with SL and 100 SW in 2012. In summary, all four DR loci were associated with seed size traits.
Candidate DR genes for flowering time. Candidate DR genes were selected if: 1) genes containing DR loci or genes in the adjacent regions to which DR loci were significantly associated with the flowering time or seed size, and 2) genes with high level of tran-scriptome expression. The locus at 15362036 bp on chromosome 11 was in the gene Glyma11g18720, which was homologous to the gene OTLD1 regulating FLC for flowering time and seed germination in Arabidopsis [46][47][48] (Table 1, Fig. S2). The locus at 39607166 bp on chromosome 15 was in the gene Glyma15g35080 which regulated flowering time in soybean 38 . We also observed a relatively high expression of the Glyma11g18720 and Glyma15g35080 at specific stages of flowering time and seed development (Fig. 3).
The genes Glyma12g08150 and Glyma12g08160 were adjacent to the domestication regions of 5895633 to 5895785 bp on chromosome 12, and the genes Glyma12g08210 and Glyma12g08230 were in close proximity to the DR locus at 5957099 bp on chromosome 12. The likely association of these genes with domestication was also reported by Chung et al. 44 . In addition, based on the soybean transcriptome data deposited at NCBI (http://www.ncbi.nlm.nih.gov/ geo/), all the genes had relatively high expression levels in floral bud tissue (Fig. 3). Therefore, the above 10 genes may be associated with flowering time.
Candidate DR genes for seed size. The genes Glyma11g15300 and Glyma11g15310 were close to the DR locus at 10957940 bp on chromosome 11. Glyma11g15480 contained a DR locus at 11111516 bp on chromosome 11, and Glyma16g26030 and Glyma16g26050 were close to the domestication regions of 30207175 to 30210113 bp on chromosome 16. The three genes Glyma11g15300, Glyma11g15480 and Glyma16g26030 had relatively higher expression levels during seed development than did Glyma11g15310 and Glyma16g26050 (Fig. 3). The soybean gene Glyma11g15480 is homologous to the Arabidopsis gene NOT2A 49 (Fig. S8) and was found to be associated with seed size in this study. A correlation analysis of Glyma11g15480 with each selected soybean gene based on the gene expression dataset showed that 24 genes were significantly correlated with Glyma11g15480 (P-values of 1.00E-4 to 4.90E-2) ( Table 2). Among the 24 genes, 11 were homologous to five epigenetic regulation endosperm genes in Arabidopsis (FIS2, MSI1, FIE/FIS3, DDM1, and MET1) [50][51][52] (Figures S9-S13); six were homologous to four endosperm development genes (AP2, AGL61, SHB1, and                   EMS1) 50,53-55 (Figures S14-S17); five were homologous to two integument development genes (ARF2/MNT, and TTG2) [56][57] (Figures S18-S19); and two genes were homologous to the embryo development gene (ANT) 58 (Fig. S20). Therefore, the above 27 genes may be associated with seed size. In addition, significant difference of allele frequencies for the SNPs in coding region of the above eight candidate DR genes in GWAS was observed between wild and cultivated soybeans (Tables S2, S3). Although the number of wild accessions was smaller than the number of cultivated soybeans, it is close to that reported in Chung et al. 44 and Lam et al. 59 , and we adopted a stringent threshold of statistical significance in determining domestication loci (a 5 4.72E-7 and F ST . 0.45) and GWAS (a 5 9.08E-7).
Some QTL (or genes) associated with DTs in this study were consistent with those reported previously (Table S4), for example, Glyma12g08150, Glyma12g08210 and Glyma12g08230 were reported as candidate domestication genes by Chung et al. 44 .

Conclusion
A total of 48 DR loci in the soybean genome were identified. Most of these loci were on chromosomes 11 and 15. Among these DR loci, 10 loci were associated with flowering time or seed size. Eight genes near the 10 loci were associated with the two traits. Among the eight genes, three genes Glyma11g18720, Glyma11g15480 and Glyma15g35080 were homologous to Arabidopsis genes, three known DR genes Glyma12g08150, Glyma12g08210 and Glyma12g08230 were linked to flowering time; and the other two DR genes Glyma11g15300 and Glyma16g26030 were reported for the first time. Glyma11g18720 and Glyma05g28130 were co-expressed with five genes homologous to flowering time genes in Arabidopsis, and Glyma11g15480 was coexpressed with 24 genes homologous to seed development genes in Arabidopsis. The method that integrates domestication analysis, GWAS and gene expression analysis is an efficient approach for identifying potentially useful DR genes.

Methods
Germplasm for genome-wide re-sequencing. The 286 soybean accessions used here included 14 wild, 153 landrace, and 119 bred accessions that were randomly selected from 6 geographic regions in China using a stratified random sampling method (Table S5). The seeds of the accessions were obtained from the National Centre for Soybean Improvement and Linyi Academy of Agricultural Sciences, and planted in three-row plots in a completely randomised design at the Jiangpu Experimental Station of Nanjing Agricultural University during 2008-2012. The plots were 1.5 m wide and 2 m long. A total of 20 seeds from five plants of the middle row in each plot were randomly selected for the measurement of the seed size traits, including SL and SW, using digital verniercalipers. The SL and SW for each accession were averaged based on 20 seeds and 100 SW for each replicate. The first and full flowering dates for each accession were recorded in the field during 2010 and 2012.
Genome-wide re-sequencing and sequence alignment. Approximately 0.3 g of fresh leaves obtained from each accession in 2012 was used to extract genomic DNA using the cetyltrimethylammonium bromide method. The DNA was digested with the EcoRI (G ' AATTC) enzyme. The RAD-seq libraries for sequencing were prepared according to the protocols described by Baird et al. 60 . We sequenced 50 bp at each end and used SolexaPipeline 1.0 for base call of 50-bp reads from the raw fluorescent images. To ensure quality, the raw sequence data were processed in two steps. In the first step, reads with incorrect adapter sequences were deleted, and any reads containing more than 50% low quality bases (quality value ,5 5) were removed. In the second step, the remaining reads were demultiplexed according to the index of each sample. The sequences were subsequently aligned on the soybean reference genome Glyma1.1 (http://www.jgi.doe.gov) using the Burrows-Wheeler Alignment Tool (BWA) 61 . The base recalibration and determination of the SNP allele were performed using GATK 62 .
Detection of DR loci. The independence between each SNP and the evolutionary class (wild, landrace, and bred) was determined using the x 2 test 63 . The significant threshold a for each test was adjusted using the Bonferroni correction, that is, a 5 0.05/106,013 5 4.72 3 10 27 (106,013 was the total number of SNPs identified). The SNPs significantly associated with evolutionary classes were used for the subsequent analysis.
A U-test at the 0.05 significance level was used to test the significance of the allelic differences between the wild (14) and cultivated (272) accessions for each candidate SNP.
For all the SNPs with significant differences in the U-test, the fixation index (F ST ) was calculated for the purpose of measuring population differentiation between the wild (14) and bred (119) classes using the software Genepop v4.2 64 , and a F ST threshold value of 0.45 from Lam et al. 59 were adopted for the identification of DR loci.
Population structure analysis. The STRUCTURE 2.2 software was used to investigate the population structures based on all available accessions 65 . The number of subpopulations (K) was set from 2 to 7 32 . In this study, the DK method of Evanno et al. 66 was used to determine the optimal value of K. The Q matrix was calculated based on inferred K.
Genome-wide association study. A GWAS was performed using the general linear model in TASSEL V4.3 (www.maizegenetics.net/tassel) with the total average and population structure as covariates. The Manhattan plot of 2log 10 P was generated using SAS software. As P-values from the software were not corrected for multiple test, the significance level a for each test was determined after Bonferroni correction, a 5 0.05/55,052 5 9.08E-7 (55,052 was the number of SNPs with the concordance rate ,99% by the TASSEL software). Elimination of the SNPs with concordance rate .99% was conducted by the TASSEL software.
Identification of soybean DR genes and homologous Arabidopsis genes. Many genes have been annotated with high confidence in the soybean genome 26 . The most likely functions of the genes have also been determined based on the annotation of the Arabidopsis homologs. The protein sequences of the annotated genes in soybean (Glyma1.1) and Arabidopsis (TAIR9 release) were downloaded from the Phytozome (www.phytozome.net) and The Arabidopsis Information Resource (TAIR) (www. arabidopsis.org) websites, respectively.
The Pfam gene annotations were downloaded from the Phytozome database version 9.0 (http://www.phytozome.net/). The annotations of the DR genes using BLASTX (e-value lE-4 or better) against NCBI's non-redundant database have been described at http://www.ncbi.nlm.nih.gov/. The top best hits from the species of interest were extracted for each gene.
Transcriptional activity and gene expression data analysis. The transcriptome sequences were obtained from the Gene Expression Omnibus database at http://www. ncbi.nlm.nih.gov/geo/ss (accession number GSE29163), and were used to identify the expression level of soybean gene during seed development and throughout the life cycle.