The genetic architecture of water-soluble protein content and its genetic relationship to total protein content in soybean

Water-soluble protein content (WSPC) is a critical factor in both soybean protein quality and functionality. However, the underlying genetic determinants are unclear. Here, we used 219 soybean accessions and 152 recombinant inbred lines genotyped with high-density markers and phenotyped in multi-environments to dissect the genetic architectures of WSPC and protein content (PC) using single- and multi-locus genome-wide association studies. In the result, a total of 32 significant loci, including 10 novel loci, significantly associated with WSPC and PC across multi-environments were identified, which were subsequently validated by linkage mapping. Among these loci, only four exhibited pleiotropic effects for PC and WSPC, explaining the low correlation coefficient between the two traits. The largest-effect WSPC-specific loci, GqWSPC8, was stably identified across all six environments and tagged to a linkage disequilibrium block comprising two promising candidate genes AAP8 and 2 S albumin, which might contribute to the high level of WSPC in some soybean varieties. In addition, two genes, Glyma.13G123500 and Glyma.13G194400 with relatively high expression levels at seed development stage compared with other tissues were regarded as promising candidates associated with the PC and WSPC, respectively. Our results provide new insights into the genetic basis of WSPC affecting soybean protein quality and yield.

selection assisted by marker-assisted selection (MAS) would facilitate the development of soybean cultivars with improved WSPC. However, the way how soybean WSPC is genetically controlled remains largely unknown. Our current knowledge of WSPC is mainly based on the genetic studies of its correlated trait, PC. Thus, dissecting the genetic architecture of soybean WSPC and identifying the genetic relationship between WSPC and PC are urgently needed.
In the past decades, tremendous efforts have been made to dissect the genetic basis of soybean protein related traits. More than 100 QTLs related to soybean PC have been reported, including its components (7S and 11S) (http://www.soybase.org/), but most of these QTLs/genes were intensively focused on soybean total protein. Thus far, only several genes related to soybean PC have been cloned and functionally identified, but no WSPC-related genes have been reported. For example, seven major glycinin genes (G1 to G7) encode glycinin subunits, with Group-1 (G1, G2, G3) showing higher expression level 9,10 . In addition, some studies showed that the amino acid permease in Vicia narbonensis and pea can increase seed storage proteins 11,12 . Recently, QTLs underlying WSPC have been identified in soybean 8,13 . Despite a preliminary understanding of the soybean WSPC obtained, the molecular basis of natural variation in WSPC biosynthesis has not been fully elucidated because the QTL resolution is limited by the low density of molecular markers used in these studies. Additionally, the genetic relationship between WSPC and PC is unclear, resulting in difficulty in the improvement of the soybean cultivars with increased content of both WSPC and PC by MAS. Thus, a comprehensive genetic study is needed to determine the extent of genetic relevance between soybean WSPC and PC.
Genome-wide association studies (GWAS) using high-density DNA markers offer an opportunity to dissect the genetic architecture of complex traits in soybean. Compared with the QTL linkage mapping approach, GWAS can greatly increase the range of detection of natural variation, the number of genome-wide significant loci, and even QTL resolution for complex agronomic traits. By applying GWAS, many important QTLs could be narrowed down and associated candidate genes could be identified 14,15 . Recently, a soybean collection containing 367 diverse accessions has been genotyped using a high-throughput NJAU 355 K SoySNP array, which provides a high-resolution of genome-wide markers facilitating GWAS of complex traits in soybean 16 .
In this study, we conducted a high-resolution GWAS of soybean WSPC and PC within 219 diverse association accessions (a large portion of the 367 diverse accessions) genotyped with NJAU 355 K SoySNP array to dissect the underlying genetic architecture of WSPC and PC in soybean. In addition, to fully understand the genetic architecture of WSPC and its genetic relationship to PC at the QTL level, epistatic GWAS (EGWAS) were also presented in this study as previously described 17,18 . Moreover, a recombinant inbred lines (RILs) population whose parents were selected from the association panel was used to validate the significant signals identified in GWAS. Candidate genes within these significant association loci that were potentially involved in the regulation of WSPC and PC were also predicted.
Our results identified 32 loci distributed over different chromosomes significantly associated with WSPC and PC in at least three or more environments, and only four regions exhibited pleiotropic effects for WSPC and PC. This observation may explain low correlation coefficient between the two traits as observed in phenotypic correlation. Moreover, QTLs associated with soybean WSPC and PC exhibit a moderate level of genetic sharing, suggesting these two traits may be under differential directional selection during soybean domestication and improvement. Those WSPC-specific loci might be responsible for high WSPC in the low-protein soybean varieties. GqWSPC8 is a highly significant major-effect locus that specifically affects WSPC. This region contained two candidate genes encoding seed storage 2S albumin proteins (Glyma.08G112300) and amino acid permease (AAP8, Glyma.08G113400), which may be responsible for the high WSPC in soybean. In addition, other candidate genes, such as Glyma.13G123500 and Glyma.13G194400 with relatively high expression levels at seed development stage were also regarded as promising candidates associated with the PC and WSPC, respectively.

Results and Discussion
WSPC and PC exhibited significant phenotypic variation. The selection of appropriate mapping populations genotyped with saturated markers is important for the dissection of mapped QTLs and further understanding the genetic architecture 19 . In this study, the 219 accessions were collected from three different ecological habitats, which represent all geographic ranges of soybean cultivation in China, suggesting that this panel is representative and is expected to contain a great level of genetic variation. For example, a majority of accessions (approximately 175 accessions) from this panel have been used to identify QTLs associated with yield 20 , seed shape traits 21 , phosphorus efficiency, and soybean protein 8 , suggesting that this collection might contain diverse phenotypic variation in complex quantitative traits of soybean. As expected, a great level of genetic variation in WSPC and PC were observed in its expanded panel (219 accessions) that was used in the present study (Table S1, Fig. 1 and Figure S1). Similarly, the RIL population whose parents were from the association panel also exhibited significant variation in biological yield 22 , and responses to low-P stress 23,24 . The two parents also differed considerably for the PC and WSPC across environments 8 . On the other hand, the availability of the dense genome-wide markers for both populations would be also beneficial in enhancing mapping resolution. For example, the association panel has been genotyped with approximate 355,000 SNP markers 16 , which is approximately ten times more than the marker density (approximate 30,000 SNPs) that was used in recent GWAS studies in soybean 17,25 . These high-coverage markers here can increase the resolution of association mapping (one SNP/3.3 kb). In addition, we have used the RIL population to construct a high-density genetic map with 6,159 SNP markers, with an average distance of 0.49 cM between adjacent markers 24 . Thus, the SNP datasets, the association panel, and the segregating RIL population are appropriate to dissect the genetic basis of WSPC, PC, and other complex traits.
Both mapping populations exhibited significant phenotypic variation in PC and WSPC across environments. As shown in Table S1, the means, standard deviation, range, skew, and the broad-sense heritability of PC and WSPC were calculated. The mean PC for the individual accessions in the GWAS population ranged from 30.3 to 57.1%, and the maximum value for WSPC can reach 46.8% (E3), which was approximately eight times higher than the minimum value (6.3%, E1) (Table S1). In RIL population, the phenotypic values ranged from 37.4-46.5% for PC and 22.7-38.8% for WSPC. The distribution of both traits for these GWAS accessions is roughly normal, and the range exceeds those that were observed among the RILs at both sides of the distribution (Table S1 and Fig. 1). More extensive natural variation may be an advantage for GWAS, which can break the limitation of allelic diversity in a segregating population 26 . Thus, both PC and WSPC exhibited wide phenotypic variation among this association panel across different environments and displayed very high genetic diversity (Table S1, Figs 1 and S1).
The high correlation coefficient of phenotypic value for a trait between different years (Figs 1 and S1) indicates the high quality of the phenotypic dataset. However, the correlation coefficient between the PC and WSPC is relatively low in the same year, although they are significant at the 0.01 level in all of the environments (Figs 1 and S1). In addition, the correlation between PC and WSPC: PC was not significant (r = 0.12, P = 0.082).

Significant loci associated with PC and WSPC.
Of 292,053 high-quality SNPs, firstly, we selected a total of 201,994 polymorphic SNPs with minor allele frequency (MAF) ≥0.05 were used for GWAS using GAPIT package. For one trait, lead SNPs less than or around 2-Mb were considered as caused by one common gene 27 . Base on this rule, our GWAS identified a total of 25 loci (15 for PC and 10 for WSPC) comprising 589 SNP signals significantly associated with WSPC and PC at least three or more environments (containing their BLUP) at P < 4.95 × 10 −6 ( Table 1 and Fig. 2a,b and c). The full list of significant SNPs associated with both traits across six environments and the BLUP is presented in Tables S2 and S3. As shown in the quantile-quantile and Manhattan plots for BLUP of PC and WSPC (Fig. 2a,b) and individual environment ( Figures S2 and S3), we identified positive associations significant for PC and WSPC after using mixed linear model in accounting for population structure and familial relatedness 28 . Of 25 loci, more than half (18) of the loci identified by GWAS co-localized with the previously-identified QTLs associated with seed protein or protein components (Table 1). Moreover, the genomic regions of almost all these previously-identified QTLs were significantly narrowed (Table 1). In addition, we also identified seven loci (GqPC4-1, GqPC5, GqPC6, GqPC11-2, GqPC15-2, GqWSPC11-2, GqWSPC18-2) that were not found in previous reports, representing novel loci underlying soybean protein content. These results indicated that GWAS is a powerful strategy in the genome-wide identification of phenotype-genotype associations 23,29 . Moreover, the highly diverse association panel and saturated genome-wide DNA markers are both critical factors improving the mapping resolution. Both advantages enable GWAS-based mapping to finely map previously-identified QTLs and detect novel loci.  A total of 15 loci associated with PC were identified, and the average phenotypic variation explained by each loci ranged from 17.4% (GqPC5) to 29.2% (GqPC15-1) ( Table 1). This result suggests that soybean PC is a complex trait and controlled by many loci with relative larger effect. For WSPC, a total of 10 loci were detected, explaining 7.9-19.3% of the average phenotypic variation (Table 1). Compared with PC, most of the loci (except GqWSPC8) for WSPC explained <10% of the phenotypic variation (Table 1 and Table S3), suggesting that the genetic architecture of WSPC is relatively more complex than we were expected and WSPC might be controlled by a major QTL plus many relative small-effect QTLs. Notably, GqWSPC8 on Gm08, comprising 427 significant SNP signals affecting soybean WSPC, represented the strongest associated hotspot region and was detected across all six environments. This QTL explained 17.9% of the average phenotypic variation (Figs 2 and 3a).
The above GWAS was single locus-based with multiple tests, and Bonferroni correction is adopted to reduce false discovery rate. Although stringent, this correction might miss some important loci. To maximally capture the important variation underlying PC and WSPC, we also adopted several multi-locus-based GWAS methods to analyze the above datasets, such as ISIS EM-BLASSO and mrMLM, and the results were compared with GAPIT results. In the result, the major loci (GqWSPC8, GqWSPC19) detected by GAPIT could also be identified using two multi-locus methods (Tables 1, S2 and S3). For example, the locus (GqWSPC8) on Gm08 for WSPC could be identified in all the environments and their BLUP. Meanwhile, seven additional loci (Table 1) were identified by the two multi-locus methods (Tables 1, S2 and S3).

Epistatic association analyses. Previous studies associated with marker-assisted and genomic selections
have shown that epistasis should be considered in soybean breeding 17,30,31 . In this study, we found that the average values of PC and WSPC in E1 (32.1°N 118.4°E) in southern China was less than those in northern China (34.8°N 113.6°E) (Table S1). This observation indicates that the geographic environments might be an important factor affecting soybean protein related traits 32 . Therefore, additive QTL effects might only explain a limited proportion of the heritability for complex traits 33 . The interaction between genetic variants may be an important source of the missing heritability. To understand whether epistatic associations occurred between PC and WSPC loci across all environments, we performed epistatic GWAS (EGWAS).
Because high marker density may result in prohibitive computing time, only the additive SNPs (P < 1.0 × 10 −4 ) for each trait at each environment were selected as representatives and tested for epistatic interactions. In the result, 14 and 36 epistatic interactions were found to be associated with PC and WSPC, respectively, in at least two environments (Table S4). We found that most of the interactions detected occurred between the SNPs on different chromosomes. For example, the interaction between AX-93822697_T_A (MAF = 0.16) on Gm14 and AX-93952504 _G_T (MAF = 0.13) on Gm18 were associated with PC in three environments, having the strongest epistatic effect (P = 1.77 × 10 −29 ) (Table S4 and Fig. 4). Individually, each SNP was not significantly associated with PC (Fig. 4a). However, accessions with the genotype TA combination (47.5%, T*A) of the epistatic loci has the highest PC of 10%, more than that with the TT combination (43.1%) (T*T) (Fig. 4b). A further examination of all four genotype combinations for the epistatic loci revealed that they were significantly different from each other, implying selection based on the epistatic effect is still effective. However, selection using AX-93822697_T_A or AX-93952504 _G_T alone may have no effect on soybean PC if both are segregating in a population.
For WSPC, it is important to note that 19 interactions shared a common locus at the 8.6-Mb genomic region on Gm08 containing six significant SNPs (AX-94048155, AX-93751882, AX-94048176, AX-93751901, AX-93751903 and AX-94048210) within a high LD block (r 2 = 0.62-0.94). This LD block was repeatedly detected for WSPC in different environments by GWAS and EGWAS (Tables 1 and S4), implying that this genomic region has both additive and epistatic effects to WSPC. The above results suggest that a complex network of genetic effects involved in the regulation of PC and WSPC in soybean. Similar complex effects affecting soybean protein and oil were also previously observed in soybean 34 and other species such as maize 35 , rice 36 , and pea 37 . GWAS signals were supported by QTL linkage mapping. In this study, a complex genetic architecture underlying natural variation of soybean PC and WSPC was unraveled through GWAS. To validate the GWAS signals, we conducted QTL mapping for PC and WSPC using 152 F 8: 12 RILs, with parents being previously selected from the GWAS association panel. Based on the high-density genetic map (6,159 SNPs) (Zhang et al. 15 ), six and three QTLs underlying soybean PC and WSPC, respectively, were identified across at least two environments ( Table 2 and Fig. 5). Detailed information about these additive and epistatic QTLs across five environments and   Table 2. Nine QTLs associated with protein content (PC) and water-soluble protein content (WSPC) across four environments in RIL population. a The name of the QTL is defined by the abbreviation of traits and the chromosome number. b Chromosome; c confidence interval of QTL; d The interval of physical distance in soybean genome; e the logarithm of odds score; f the mean phenotypic variance explained by related QTL; g the mean additive effect of QTLs. The QTLs in bold were detected simultaneously by ICIM and GCIM, and the underlined QTLs were detected by GCIM only.
their BLUP was summarized in Tables S5 and S6. Of the nine QTLs identified via linkage mapping, six (qPC3, qPC5, qPC10, qPC19, qWSPC8, and qWSPC10) were co-located with the significant loci that were identified in GWAS. The high consistency between QTLs and GWAS results suggest the robustness of both mapping strategies, and further verify that the populations and methods for analyzing used here were appropriate for PC and WSPC analysis.
To increase the statistical power in the detection of small-effect and linked QTLs, controlling background via selecting markers in the CIM was replaced by estimating polygenic variance in GWAS. This method was named genome-wide composite interval mapping (GCIM). The GCIM was also used to analyze the above datasets. As a result, major QTLs, detected by ICIM, could still be identified in various environments (Tables 2 and S5). For example, the QTL on Gm08 could still be detected in all the environments and their BLUP. Meanwhile, additional loci were found by the GCIM. For instance, a novel QTL, qPC9, on Gm09 was detected in three environments and their BLUP.

Genetic relationship between PC and WSPC.
To evaluate the genetic relationship between PC and WSPC, we compared the mapping results for both traits. In GWAS results, four of 32 genomic regions (16 for PC and 16 for WSPC) exhibited pleiotropic effects for both PC and WSPC (Table 1 and Fig. 2), suggesting that PC and WSPC are commonly regulated at a moderate level. It was consistent with the phenotype correlation between both traits (Fig. 1). For example, the SNP, AX-93933852, on Gm10 affecting soybean both PC and WSPC was identified across all six environments and explained 13.0-32.4% of the phenotypic variation. The other significant SNP, AX-93795201, on Gm11 was detected for both PC and WSPC across five environments and the average -log P of this loci was 6.94 (ranged from 5.6 to 8.6). The third region containing the SNP, AX-94196006, on Gm19 was associated with both PC and WSPC across all six environments and explained 7.9-26.7% of the phenotypic variation. These four loci on Gm10, 11, 12, and 19 (Table 1) explained high phenotypic variance and were genetically stable across years, suggesting that these genomic regions play an important role in contributing to the variation of PC and WSPC. In addition, we also identified a total of 11 PC-specific loci on Gm03, 4,5,6,9,11,13, and 15 across at least three environments, five of these loci are novel (Table 1 and Fig. 6a). Briefly, a major PC-specific locus, GqPC3, identified here co-localized with previous identified QTLs 38, 39 and soybean storage protein-related genes, G2 and G3 9, 10 . Moreover, GqPC3 was also identified by linkage mapping and explained 11.7% of the phenotypic variance (Table 2). A comparison of the PC of 196 G-type and 23 A-type soybean accessions genotyped at the leader AX-93995056 demonstrated that the PC in the accessions with A genotype at this locus was increased by 9.5% relative to the G-type accessions (t test, P = 3.17 × 10 −8 ) (Fig. 6b).
Meanwhile, we also identified six loci specific for WSPC on Gm01, 8, 13, 17, and 18 across at least three environments, suggesting that these loci might be responsible for the low WSPC of soybean varieties with high PC. It is important to note that the WSPC-specific signal (around AX-94048210 on Gm08) that was identified across all environments by GWAS was also previously identified associated with soybean WSPC 8,13 . This region affecting soybean WSPC was stably identified across environments and could explain 19.3% (ranged from 8.7 to 33.8%) of phenotypic variance, with an average increase of 38.2% of WSPC per T allele (Fig. 6c). These WSPC-specific loci might be important for the genetic variation in WSPC and could be used in the marker-assistant breeding of soybean with high WSPC.
For the QTLs underlying PC and WSPC in linkage mapping, only one QTL commonly responsible for PC and WSPC was detected. This overlap is not significantly different from expected by chance alone (P = 0.072), suggesting that PC and WSPC might be under relatively independent genetic control in the RIL population. This result is consistent with the lower phenotypic correlation between PC and WSPC when compared with the phenotypic correlation of WSPC/PC itself (Figs 1 and S1). In addition, the analysis of the positive or negative effect of QTL revealed that the alleles at three (50%) of six QTLs for PC were associated with the effect of decreasing the soybean protein content (Table 2). However, in contrast to the allelic effect distribution of PC QTLs, all the alleles of three QTLs for WSPC tended to increase the WSPC (Table 2). This contrasting QTL allele effect direction for PC and WSPC suggests that the two traits may be under differentially directional selection during soybean improvement.
Candidate genes underlying the QTLs for PC and WSPC. In this study, the major loci for PC and WSPC identified via GWAS had considerable overlaps with previously reported QTLs related to soybean protein or its components, further supporting to our findings (Table 1). Of the 32 identified PC/WSPC loci, more than two-thirds (22/32) were located within the previously-found QTLs or overlapped with the many QTLs identified in the present study by linkage mapping (Table 1). Moreover, we also found that several of these QTLs contain previously reported genes related to soybean protein. For example, the locus (GqPC3) affecting soybean PC that was detected in both GWAS and linkage mapping localized at a genomic region ~40 kb away from a published glycinin gene, Gy2 (encoding glycinin A2B1a precursor) 9, 10 . Another major QTL (GqPC13) on Gm13 associated with PC identified in this (Tables 1 and 2) and previous studies 40 contain a glycinin gene (Glyma.13G123500) encoding glycinin A3B4 subunits related soybean protein 41,42 . The third overlap include a stably QTL (GqPC19) on Gm19 that was significantly associated with soybean PC and WSPC via association and linkage analysis in our (Table 1 and Fig. 2) and previous studies 8, 43 . This QTL contains a globulin gene (Glyma.19G236600) encoding 7S globulin precursor 44,45 . The co-localization of previously reported protein-related QTLs/genes with QTLs identified here provideds a strong evidence showing the robustness of the mapping results.
Identification of candidate gene underlying QTL relies on high mapping resolution. However, the accuracy and precision of locating QTL depend, in part, on the density of molecular markers 46 . In this study, the average SNP spacing was approximately 3.3 kb along the 20 chromosomes of soybean (975 Mb) 16 , this resolution is much higher compared with the average SNP spacing of ~850 kb in a previous study 8 . Based on the high mapping resolution, candidate genes potentially underlying these main-effect loci (identified at least across three environments) associated with PC and WSPC could be predicted (Table 1). Moreover, examination of the expression pattern of these genes during the seed developmental stages would be helpful for us to understand its relevence to the traits (Table S7). We found that five of these genes were expressed specifically in seed developmental stages (Table S7), suggesting that these genes may be involved in the accumulation of soybean seed protein. Two candidate genes (Glyma.13G123500 and Glyma.13G194400) that were located within the QTL significantly associated with both PC and WSPC were highly expressed in seeds, especially during late developmental stages. In addition, expression of three genes (Glyma.15G098100, Glyma.08G112300, and Glyma.08G113400) could also be detected in seeds, leaves, flowers, root and nodule. The roles of these genes potentially involved in the seeds protein composition need further determination.
In this study, we were particularly interested in the loci that can enhance the solubility of soybean protein or related to WSPC, such as the WSPC-specific SNP locus AX-94048210 (MAF = 0.15) on Gm08 that was repeatedly detected in both the present and previous studies 8 . In comparison with the alternative allele, the desired alleles of this locus may increase the soybean WSPC more than 30% (Fig. 6c). Further analysis showed that AX-94048210 tagged to an LD block harboring two candidate genes encoding seed storage 2S albumin protein (Glyma.08G112300) and amino acid transporter (AAP8, Glyma.08G113400), respectively (Fig. 3b,c), which may involve in the regulation of storage protein biosynthesis in soybean 47 .
Another important WSPC-specific QTL, Gq18-1, on Gm18 were detected across all environments, this QTL was previously identified to be associated with soybean protein 43 . In this region, Glyma.18G071900 encoding an amino acid permease located under the peak SNP (AX-93870261) of Gq18-1 was regarded as a promising candidate, as amino acid permease can improve plant nitrogen status and lead to higher seed protein contains by increasing seed sink strength for nitrogen 11,47 .
For epistasis loci, although some of the SNP-SNP interactions included the SNPs that were not significant in GWAS (Fig. 4a), we found that the certain allels combinations showed a strong epistatic effect by EGWAS (Fig. 4b). For example, two SNPs, AX-93822697_T_A (MAF = 0.16) and AX-93952504 _G_T (MAF = 0.13), conferring the interaction between Gm14 and Gm18, were located within known protein related loci 48 . Further examination of the gene candidates for these loci revealed that the putative gene Glyma.14G087900 containing AX-93822697_T_A on Gm14 and Glyma.18G229300 containing AX-93952504 _G_T on Gm18 encode an aminotransferase and a malate enzymes, respectively (Fig. 4c,d). Aminotransferase (AAT) is a key enzyme in the synthesis of amino acids and involved in the regulation of carbon and nitrogen metabolism in almost all organisms. Over-expression of aspartate aminotransferase genes in Arabidopsis 49 or rice 50 resulted in altered nitrogen metabolism and increased amino acid content in seeds. Malic enzymes (ME) catalyze the decarboxylation of malate generating pyruvate, CO 2 , and NADH or NADPH., ME also involved in oil biosynthesis in plants 51 , and is negatively correlated with protein synthesis and compete for the same basic substrates 52 . The roles of the two genes in the regulation of soybean protein merit further determination.

Conclusions
We identified a total of 32 additive loci and 51 epistatic interactions associated with soybean PC and WSPC at various environments by applying the high-resolution GWAS mapping, demonstrating that epistatic effects are a substantial complement to additive effects in contributing to soybean protein. We also identified ten novel loci (GqPC4-1, GqPC5, GqPC6, GqPC11-2, GqPC15-2, GqWSPC11-2, GqWSPC12-1, GqWSPC15, GqWSPC16 and GqWSPC18-2) associated with soybean PC and WSPC. Phenotypic correlation and QTL contrastive analysis exhibited a moderate level of genetic sharing and different genetic effect, suggesting that both traits tend to be under relatively independent genetic control and should be under differential directional selection during soybean improvement. These results provide important genetic insights into the high PC in soybean varieties with low WSPC and a genetic basis for improvement in soybean protein quality through marker-assisted selection and genomic selection. The putative amino acid transporter-encoding gene, AAP8, and other candidate genes potentially involved in soybean protein synthesis were the promising candidate genes meriting further determination. Further studies, such as expression profiling and functional analyses of these candidate genes, will be helpful to facilitate the elucidation of the molecular mechanisms underlying soybean protein content and its solubility.

Materials and Methods
Plant materials. The association panel for GWAS consisted of a diverse collection of 219 soybean accessions (including 195 landraces and 24 elite varieties) originated from 26 different provinces and six different ecological regions of soybean growing areas in China, ranging from latitude 53 to 24°N and longitude 134 to 97°E 16,53 .
A segregating soybean population consisting of 152 F 8:12 RILs that were derived from a cross between varieties "Nannong94-156" (male parent, low protein varieties) and "Bogao" (female parent, high protein varieties) was used to map QTL for soybean WSPC and PC. Both varieties were also included in the association panel, and a high-density genetic map containing 6,159 SNPs were used as previously described 24 .  E6). A randomized block design was used for all field trials. All accessions were planted in two replications at E1 and in three replications at all other growing environments (E2, E3, E4, E5 and E6). In all environments, each accession was planted in three rows per plot, with each row was 200-cm long and 50-cm row spacing.
Measurement of soybean WSPC and PC was conducted using a near infrared spectrophotometer (NIR) seed analyzer (DA7200, Perten Instrument, Huddinge, Sweden) as previous described 8 . Briefly, approximately 60-g dry seeds per sample was fitted in a 75-mm-diameter cup, and the cup was rotated during NIRS scanning. The average value of three scans per sample were used in data analysis. These calibrations involved more than 700 uniform soybean samples that varied in seed PC, oil content, and 146 soybean samples in seed WSPC.
Genotyping and genetic map. This association panel was genotyped using NJAU 355 K SoySNP array as previously described 16 , and a total of 292,035 high-quality single nucleotide polymorphism (SNPs) were used for association mapping. The number of SNPs is estimated to provide approximately one SNP per 3.3 kb along the 20 soybean chromosomes. In this study, SNPs with MAFs less than 5% were excluded from further analysis. The final set of 201,994 SNPs distributed over the whole soybean genome (20 chromosomes) was used to study genetic diversity, population structure, genetic relatedness and marker-trait associations in relation to genetic distance.
The linkage map used in this study was constructed as previously described 24 . This map, spanning 3020.59 cM in length, contained 6,159 SNP markers on 20 chromosomes, with an average distance of 0.49 cM between adjacent markers.
Statistical analysis. Phenotypic data for soybean seed WSPC and PC across different environments were used to an ANOVA using the PROC GLM (general linear model) mixed model of SAS (version 9.2). The linear statistical model includes the effects of genotype, environment and the environment × genotype interaction. The decomposition of variance components was evaluated using PROC VARCOMP. The correlation coefficients between WSPC and PC in soybean were calculated with PROC CORR. The BLUP for each line was predicted with PROC MIXED in SAS and used as the phenotypic input in subsequent GWAS and QTL mapping.
The linkage disequilibrium (LD) block structure was examined using Hapview4.2 software by estimating the squared allele frequency correlation (r 2 ) of alleles in each QTL region. The significance of LD between the sites was determined using Fisher's exact test.
Genome-wide association mapping. SNPs with MAF ≥ 0.05 and the number of accessions with the minor allele ≥10 in the diverse panel were used to carry out GWAS. As a result, the final set of 201,994 SNPs (MAF ≥ 0.05) and the phenotypic values for all genotypes from the association panel across six environments and the BLUP over all environments, were used to perform marker-trait association analysis. GWAS was performed with a compressed mixed linear model 28 using the GAPIT package 54 . The population structure was accounted for by principle component analysis (PCA) and the relatedness was calculated by VanRaden method 55 using GAPIT. Markers were identified as significantly associated with traits by comparison with the significant threshold of P-value < 1/n (P < 4.95 × 10 −6 ) 56 .
The genome-wide epistatic interaction test was implemented in PLINK version 1.07 57 . To remedy the shortcoming of a less stringent significance threshold (1/n), we also carried out multi-locus GWAS by using ISIS EM-BLASSO 58 and mrMLM 59 (with the software rmMLM: https://cran.r-project.org/web/packages/mrMLM/ index.html) simultaneously in natural population.
Linkage QTL mapping. For the RIL population, measurement of WSPC and PC was conducted using the same method as described above. The additive and epistatic QTLs underlying the WSPC and PC were identified by the QTL IciMapping program v4.0 60 using single environment phenotypic values and the best linear unbiased prediction (BLUP) over all environments. Briefly, for the additive QTL, the inclusive composite interval mapping (ICIM) method was used in the software, the P-values for entering variables (PIN) and removing variables (POUT) were set at 0.01 and 0.02, and the scanning step was 2 cM. The ICIM-EPI method was used to detect epistatic QTL, the PIN and POUT were set at 0.0001 and 0.0002, respectively. The LOD thresholds for each index of QTL were determined by 1000 permutation test at 95% confidence level. In addition, to evaluate the significance of correspondence of QTLs underlying soybean PC and WSPC, a previously statistical test based on the hypergeometric probability function was used to calculate the probability of QTL correspondence by chance alone as described by Lin et al. 61 . Identification of small-effect and linked QTLs were performed using GCIM as previously described 62 .
Scientific RepoRts | 7: 5053 | DOI:10.1038/s41598-017-04685-7 Prediction of candidate genes. To uncover the candidate genes underlying association signals, the predicted genes in the target genomic regions were retrieved from the annotation of the soybean reference genome (Wm82.a2.v1) in Phytozome v10.3 (http://phytozome.net) and were manually analyzed using protein blasting (http://www.ebi.ac.uk/Tools/sss/ncbiblast/). First, we selected the candidate genes in the region defined by clustering of trait-associated SNPs at LD r 2 > 0.70 or in a region of 70 kb each side of the peak SNP. Genes with known functional descriptions related to soybean protein content or participating in seed protein synthesis pathway were selected as candidate genes. The expression data of these candidate genes at soybean seed developmental stages were retrieved from Phytozome (https://phytozome.jgi.doe.gov/) and Soybase (https://soybase.org/ soyseq/) to analysis its relevance to PC and/or WSPC.