Introduction

Seed protein content is an important economic factor since whole or crushed soybeans are used as animal feed and also for human consumption. Through plant breeding, high seed protein alleles have been selected within cultivated soybean (Glycine max (L.) Merr.) germplasm or through introgression from wild G. soja germplasm1,2. Notably, the high seed protein cultivar AC Proteus3 was developed for short season Canadian conditions and it has become the parent of numerous current varieties with high seed protein4. Previous work has indicated that populations developed from AC Proteus may not exhibit the typical inverse relationship between seed yield and seed protein5. These desirable attributes of AC Proteus have not yet been investigated using molecular genetic approaches.

Molecular markers have a broad range of applications in plant breeding, including genotyping, germplasm characterization, genetic diversity studies, genetic mapping, and QTL analysis6. Molecular breeding employs marker-assisted selection (MAS), in which DNA marker detection and selection are incorporated into a traditional breeding program7,8.

Molecular markers are an important set of tools in many field crop breeding programs because they are reproducible in large quantities, stable when exposed to environmental changes, and independent of tissue or growth stage9,10. A single-nucleotide polymorphism (SNP) is a variation of a single nucleotide at a specific location in the genome among individuals10. SNPs are common in plant genomes, appearing every 100–300 bp or less6, and about ninety percent of human sequence variations are due to SNPs11. SNPs are therefore very useful as DNA markers because of their abundance, stability, efficiency, ease of automation and lower assay cost9,10,12.

In this study we used diversity arrays technology (DArT) markers and DArT markers combined with next-generation sequencing (DArTseq) for recombination mapping in soybean, and produced an integrated SSR, DArT13 and DArTseq14 marker-based recombination map for soybean15,16 to facilitate comparative mapping with the widely used soybean SSR composite map17 and other genomic studies. DArT marker genotyping has many advantages, particularly that it is a high-throughput array-based system with no prerequisite for genomic sequence information. DArT marker technology is now successfully deployed in a wide range of crop plants and was developed for soybean15,16. DArTseq markers are SNP-type markers detected on a DArT-type platform that takes advantage of the dramatic drop in sequencing cost over the last decade, and this enhanced technology has now largely replaced the original DArT. DArTseq does not depend on the availability of a reference sequence for the genome (marker data extraction is “reference-free”), but enables immediate alignment of detected markers to the reference when one is available, which is the case for soybean. The present study was designed to investigate the genetics of high seed protein in AC Proteus using molecular genetic approaches, by studying the high seed protein content loci in bi-parental populations and in AC Proteus-derived high protein cultivars.

Materials and Methods

Germplasm - Bi-parental population development and phenotyping

Three high protein parental lines were used in this study:

  1. AC Proteus is an elite high protein cultivar adapted to early maturity zones in Ontario and Quebec3,18. The pedigree of AC Proteus is Merit/PI 153293/2/PI 189950/3/3*Maple Arrow3. Merit was developed at Agriculture Canada, Ottawa in 1960. PI 153293 was a high protein introduction from Belgium. PI 189950 was a very small seeded, high protein introduction from France (originally identified as G. gracilis, now G. max).

  2. X3144-48-1-B was developed from the cross AC Proteus/Maple Glen. Maple Glen is a high yielding cultivar3. X3144-48-1-B has the same pedigree as population X3585 used in a previous study of breeding for high protein5 but was independently developed.

  3. X3145-B-B-3-15 has the pedigree BD22115/DW-8-3(X656-54)//CS-251-2(X1205-24-B-1)/3/Maple Glen, where BD is Amsoy/Portage//PI 438477; DW is Renville/Capital(M387)//(M406)Harosoy/Norchief(M62-173)/3/USDA T106, G. soja; and CS is Hardome/PI 189950, G. gracilis//Merit/PI 153293/3/PI 438475.

Three low protein parental lines were used in this study:

  1. Maple Arrow is the low protein recurrent parent used in the development of AC Proteus.

  2. 9063 (ref. 18).

  3. AC Brant3.

Five high x low seed protein and one high x high seed protein recombinant inbred line (RIL) populations were used in the present study:

  1. XH939 is AC Proteus/Maple Arrow. This is an F6 derived RIL population. AC Proteus is a backcross two line derived from Maple Arrow and this cross is a backcross three population developed by Dr. Richard Buzzell at the Harrow Research and Development Centre of Agriculture and Agri-Food Canada.

  2. X4049 is X3145-B-B-3-15/9063. This is an F5 derived RIL population.

  3. X4050 is X3145-B-B-3-15/AC Brant. This is an F5 derived RIL population.

  4. X4074 is X3144-48-1-B/9063. This is an F5 derived RIL population.

  5. X4075 is X3144-48-1-B/AC Brant. This is an F5 derived RIL population.

  6. X4038 is X3145-B-B-3-15/X3144-48-1-B. This is an F5 derived RIL population derived from a high protein x high protein cross.

Phenotyping of these populations was carried out at the Central Experimental Farm at Ottawa, Canada from 1997 to 2000. The X4050 population was also grown in 1999 at Exeter, Listowel, and Woodstock, ON and St-Cesaire, Ste-Rosalie, and Plessisville, QC. Population XH939 was only grown for three years (1998 to 2000) at Ottawa. Seed protein and oil content of field grown RIL populations were determined with infrared transmittance spectroscopy (Infratec 1241, FOSS) and expressed on a dry matter basis.

DNA extraction

DNA was extracted from frozen leaves of plants grown in the greenhouse or the field using a modified urea extraction technique19.

Markers, recombination mapping and QTL analysis

Previously designed soybean SSR primers17 were used in this study for DNA amplification. DArT and DArTseq marker analyses were performed as described elsewhere13,15,16,20. To assist with interpreting the recombination map, please note that typical nomenclature for microsatellite or Simple Sequence Repeat (SSR) markers is Satt100, for DArT markers it is soPb_100000 and for DArTseq markers it is 1000000. QTL analysis was performed with the software program MQTL21,22. Ten thousand permutations of the data were used to calculate the threshold for QTL detection. Regions with a test statistic above the threshold were considered a QTL. The major QTL was anchored and the map was re-scanned for regions that have additive or epistatic effects.
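The permutation step described above can be illustrated with a minimal Python sketch. It is not the MQTL implementation: the single-marker squared correlation used as the test statistic, the 0/1 genotype coding, and the function names `r_squared` and `permutation_threshold` are all assumptions made for illustration. The idea is to shuffle the phenotypes relative to the genotypes, record the largest statistic across all markers for each shuffle, and take an upper percentile of those maxima as the genome-wide threshold.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between marker codes x and phenotypes y."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker or constant phenotype
    return sxy * sxy / (sxx * syy)


def permutation_threshold(markers, phenotypes, n_perm=10000, alpha=0.05, seed=0):
    """Empirical genome-wide threshold from permuted phenotypes.

    markers: list of markers, each a list of 0/1 genotype codes (one per line).
    phenotypes: list of trait values in the same line order.
    The statistic is a stand-in (assumption), not the one computed by MQTL.
    """
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype association
        maxima.append(max(r_squared(m, shuffled) for m in markers))
    maxima.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return maxima[index]
```

A marker whose observed statistic exceeds the returned value would be declared significant at the chosen genome-wide error rate. Taking the maximum over markers within each permutation is what accounts for testing many markers at once.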

AC Proteus genome-wide allele analysis

A genotyping-by-sequencing (GBS) database of 155,616 SNPs characterized across 300 Canadian soybean varieties23,24 was used as a source of genotype information for SNP haplotype analysis. Tassel 5 (http://www.maizegenetics.net/tassel)25,26 was used to sort the SNP data set for rare allele frequency analysis of AC Proteus. The AC Proteus rare allele frequency varied from 0.05 to 1.1% of the entire allelic frequency present within the SNP panel, at a ratio of 1–0.6 for the AC Proteus rare allele in contrast to the other lines, where 1 represents a 100% match and 0.6 represents a 60% match. Using the Canadian soybean collection of GBS-SNP data, the first step was to identify AC Proteus alleles, at homozygous loci, that were rare in the database. These alleles were then compared with seven AC Proteus derived high protein cultivars (AC Hercule, AC Proteina, Venus, Kamichis, Krios, AAC Invest 1605, Jari), and SNPs that were common across 66% of the derived high protein lines were identified. These SNPs were then compared to Maple Arrow and AC Brant (low protein cultivars) and Maple Glen (high yield cultivar), which were parents of the populations. A second analysis was also carried out in which the criterion was that the AC Proteus allele was common across all AC Proteus derived high protein lines. Pedigree information for the key cultivars used in this analysis is presented in Fig. 1, where the high protein cultivars are shown in grey. The pedigree graph was created using Helium software27.
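The screening logic in this section can be sketched in Python as follows. This is an illustrative outline only, not the Tassel workflow used in the study: the data layout (a dictionary of homozygous allele calls per cultivar, with `None` for missing or heterozygous calls), the choice to measure rarity among lines other than the focal cultivar and its derivatives, and the names `background_frequency` and `screen_rare_retained` are assumptions. The rarity cutoff is left as a required parameter rather than fixed to a value.

```python
def background_frequency(calls, allele, background):
    """Frequency of `allele` among background cultivars with a non-missing call."""
    observed = [calls[c] for c in background if calls.get(c) is not None]
    if not observed:
        return 0.0
    return sum(1 for a in observed if a == allele) / len(observed)


def screen_rare_retained(genotypes, focal, derived, low_parents,
                         max_background_freq, min_retained=2 / 3):
    """Return SNP ids where the focal cultivar's allele is rare, retained, and contrasting.

    genotypes: {snp_id: {cultivar: allele or None}} (hypothetical layout).
    focal: the high protein cultivar of interest (AC Proteus in the text).
    derived: cultivars descended from the focal cultivar.
    low_parents: low protein lines that must NOT carry the focal allele.
    """
    selected = []
    for snp_id, calls in genotypes.items():
        allele = calls.get(focal)
        if allele is None:
            continue  # only homozygous, non-missing focal calls are considered
        background = [c for c in calls if c != focal and c not in derived]
        if background_frequency(calls, allele, background) > max_background_freq:
            continue  # allele is not rare in the panel
        retained = sum(1 for d in derived if calls.get(d) == allele) / len(derived)
        if retained < min_retained:
            continue  # allele was not kept through selection in enough derivatives
        if any(calls.get(p) == allele for p in low_parents):
            continue  # allele does not contrast with the low protein parents
        selected.append(snp_id)
    return selected
```

Setting `min_retained=1.0` corresponds to the second, more stringent analysis, in which the allele must be present in every derived cultivar.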

Figure 1

Pedigrees of high protein soybean AC Proteus and its high protein progeny. High protein cultivars used in the current SNP pedigree study are shown in grey.

Results

Protein content of parental germplasm

Values for seed protein and oil from trials at Ottawa were measured for several of the parental lines of the Ottawa derived RIL populations (Table 1). The high protein parents had about 48% seed protein while the low protein parents had about 40% seed protein.

Table 1 Least square means for seed protein and oil of parental and check cultivars grown from 1998 to 2000 at Ottawa.

The six RIL populations showed variation for seed protein and seed yield (Fig. 2A). One population (X4050) was selected for detailed QTL analysis (Fig. 2C). This population was chosen because, of the four Ottawa populations derived from high x low protein parents, the X4050 population’s frequency distribution for protein content most closely approximated a normal distribution. As a complementary cost-efficient strategy, high and low protein bulks from the other five populations (X4038, X4049, X4074, X4075 and XH939) were selectively genotyped (Fig. 2B).

Figure 2

(A) Seed protein (%) versus seed yield (kg ha⁻¹) for all six populations. (B) Mean protein content of low and high protein bulks and parents for the X4038, X4049, X4074, X4075 and XH939 populations. (C) Seed protein histogram for RIL population X4050 and parents.

Recombination mapping with SSR, DArT and DArTseq markers

In preparation for QTL analysis, a recombination map was developed in the X4050 RIL population (n = 100) using novel DArT and DArTseq markers as well as the widely used SSR markers. The resulting map (Fig. 3, Table 2) contains 264 SSR markers17, 83 DArT markers, and 297 DArTseq markers, for a total of 644 molecular markers. This is believed to be one of the very few soybean recombination maps with DArT and DArTseq markers co-mapped with SSR markers15, which facilitates comparative mapping between emerging DArT and DArTseq maps and the many published SSR based soybean maps and studies.

Figure 3

Recombination map for the X4050 RIL population. QTLs and near QTLs identified in X4050 and regions identified by BSA in the remaining populations (X4038, X4049, X4074, X4075 and XH939) have been added in this map. For comparison published protein Meta-QTLs28,30 are also shown.

Table 2 Statistics of the recombination map for soybean population X4050.

QTL analysis for protein content in X4050

QTL analysis in population X4050 for protein content was performed using a map containing SSR, DArT and DArTseq markers (Figs. 3 and 4, supplementary file 1).

Figure 4

Scans of test statistic (composite interval mapping) for declaring a QTL (or near QTL) in X4050 for soybean protein content. SSR, DArT and DArTseq markers used for QTL analysis are a subset of those on the map in Fig. 3. The vertical line indicates the test statistic threshold for significance in declaring a QTL.

As presented in Fig. 4, two QTLs for seed protein content were detected, one on chromosome 20 (LG I, at SSR marker Satt496/Sat_174, explaining 60% of the population variation) and one on chromosome 15 (LG E, Satt213, 23%). In addition, there were two genomic regions with a highly elevated test statistic, but below the statistical threshold required to declare a QTL; one on chromosome 1 (LG D1a, Satt077, 14%), and the other one on chromosome 16 (LG J, Satt287, 13%). A region on chromosome 5 (LG A1, at DArTseq marker 1368291, 1%) was detected based on its epistatic interaction with the large QTL on chromosome 20 (data not presented).

Bulk segregant analysis for protein content

An additional five populations were studied in an effort to validate QTLs found in the X4050 population, assess their applicability across germplasm, and perhaps detect additional relevant loci. Four of these populations (X4049, XH939, X4074, and X4075) were amenable to bulked segregant analysis (BSA), and therefore high and low protein bulks were selectively genotyped. Bulks were similarly genotyped in the fifth population (X4038); however, because it is a cross between two high protein parents, the results are more challenging to interpret. Classical BSA was therefore not used in the X4038 population to discover high protein loci, but the genotypes of the X4038 population bulks could be used to follow alleles at loci identified by BSA in the other four populations (Figs. 3 and 4).
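One way to picture how a single marker is scored in BSA is the short Python sketch below. The scoring rules are an assumed simplification for illustration and are not taken from the study: a marker is called positive when the high protein bulk shows only the high protein parent's allele and the low protein bulk shows only the low protein parent's allele, and it is called fixed when both bulks show only the high protein parent's allele. The function name `classify_bsa` and the category labels are hypothetical.

```python
def classify_bsa(high_parent, low_parent, high_bulk, low_bulk):
    """Classify one marker from parental alleles and the alleles seen in each bulk.

    high_parent, low_parent: allele carried by each parent (e.g. "A", "B").
    high_bulk, low_bulk: sets of alleles detected in the pooled DNA of each bulk.
    The decision rules are illustrative assumptions, not the study's criteria.
    """
    if high_parent == low_parent:
        return "monomorphic"  # parents do not differ, marker is uninformative
    if high_bulk == {high_parent} and low_bulk == {low_parent}:
        return "positive"  # bulks differ in the direction of the parents
    if high_bulk == {high_parent} and low_bulk == {high_parent}:
        return "fixed_high"  # both bulks carry only the high parent's allele
    return "negative"  # both alleles present in a bulk, or no clear contrast
```

Under these rules a marker unlinked to seed protein is expected to show both parental alleles in both bulks and is therefore scored negative.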

Several genomic regions of interest were identified by comparing the results obtained through QTL analysis and BSA (Fig. 3); in total, 37 locations were identified by BSA among the five populations. Among the four populations used for BSA, particular attention was given to positive BSA results in population XH939. This is because population XH939 (AC Proteus x Maple Arrow) is a “quasi-near isogenic population”: AC Proteus is a backcross two line with Maple Arrow as the recurrent parent, with selection in each generation for high seed protein content, and XH939 is the third backcross to Maple Arrow. Thus, the high and low protein bulks derived from the XH939 population used for BSA should be highly specific for the genetic loci and/or alleles responsible for high seed protein in AC Proteus. Results from the present study were then compared to published results from genome-wide association study (GWAS) analysis for high seed protein using some of the same germplasm23,24, as well as to published results of QTL analysis for high seed protein content (Soybase.org).
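The “quasi-near isogenic” argument can be made concrete with the standard backcross expectation. The Python sketch below assumes no selection and no linkage drag, which does not hold exactly here because high protein was selected in each generation; the function name is hypothetical. Under that assumption, a line derived after n backcrosses is expected to carry a fraction 1 − (1/2)^(n+1) of the recurrent parent genome.

```python
def expected_recurrent_fraction(n_backcrosses):
    """Expected recurrent-parent genome fraction after n backcrosses.

    Assumes no selection and no linkage drag (a simplifying assumption).
    """
    return 1 - 0.5 ** (n_backcrosses + 1)


# AC Proteus is a backcross two line with Maple Arrow as recurrent parent.
ac_proteus_recurrent = expected_recurrent_fraction(2)   # 0.875
# XH939 is the third backcross to Maple Arrow.
xh939_recurrent = expected_recurrent_fraction(3)        # 0.9375
# Share of the genome expected to differ between AC Proteus and Maple Arrow,
# and therefore to segregate in XH939.
segregating_share = 1 - ac_proteus_recurrent             # 0.125
```

On this expectation only about one eighth of the genome would segregate in XH939, which is why contrasts between its high and low protein bulks are expected to point fairly specifically to donor segments retained in AC Proteus.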

Genome-wide approach to identifying AC Proteus rare alleles

A previously described database of SNP genotypes of 300 Canadian soybean cultivars23,24 was used as a source of SNP haplotypes to investigate rare alleles in AC Proteus. Since there are a limited number of high protein lines in the Canadian SNP database, high protein alleles may appear rare but be present at higher frequency in the global germplasm and correspond to genomic regions previously reported in the high protein soybean literature.

For the initial broad analysis using the Canadian SNP database, we looked for rare AC Proteus alleles common across two-thirds or more of seven AC Proteus derived lines (AC Hercule, AC Proteina, Kamichis, Krios, Venus, Jari and AAC Invest 1605) but absent from the low protein parental lines (Maple Arrow, AC Brant and Maple Glen). AC Proteus descendants had been developed through up to three additional breeding cycles with continuous selection for high protein.

A total of 155,616 SNPs were screened for alleles present in AC Proteus but rare within the SNP database. This subset of SNPs (1,721) was further screened for those that contrasted between AC Proteus and its low protein recurrent parent Maple Arrow, and additionally for those where the AC Proteus allele was present with an allelic ratio of 0.66 or greater among the seven high protein derivatives of AC Proteus. Based on the selected ratio of 1–0.66, 0.05–1.1% of the alleles present within the SNP panel were selectively retained by AC Proteus and its derivatives. As shown in Supplementary file 2, the approximately 650 SNPs that met this set of criteria were sometimes in close physical proximity to each other and appear to define genomic blocks, which may represent haplotypes for high protein. Using linked SSR markers to bridge between the recombination map (Fig. 3) and the genomic sequence map (Soybase, assembly 2.0), it was possible to demonstrate that five of the SNP blocks correspond to either QTLs identified in X4050 or positive genomic regions identified by BSA in the other four RIL populations (Table 3). Two of those five blocks also correspond to published Meta-QTLs for protein content. An additional three blocks align with other published Meta-QTLs for protein content (Table 3). These correspondences help validate the results obtained by the three independent analytical methodologies (QTL, BSA, SNP based pedigree analysis) and support the hypothesis that these eight, and possibly more, genomic regions play a contributory but not necessarily essential role in the high protein and high yield phenotype of AC Proteus and derived breeding lines. It is also noteworthy that the blocks vary considerably in size. For example, those in Table 3 vary from 150 kb to 11,000 kb, and the larger blocks may carry multiple genetic loci that have been retained through selection for high protein.
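Grouping selected SNPs into blocks by physical proximity can be sketched as a single pass over sorted positions. The text does not state how block boundaries were drawn, so the maximum gap between neighbouring SNPs (`max_gap_bp`), the chromosome labels, and the function name `group_into_blocks` below are assumptions for illustration.

```python
def group_into_blocks(snps, max_gap_bp):
    """Group SNPs into blocks of neighbours on the same chromosome.

    snps: iterable of (chromosome, position_bp) tuples.
    max_gap_bp: largest distance between adjacent SNPs within one block
                (an assumed parameter, not a value from the study).
    Returns a list of (chromosome, start_bp, end_bp, snp_count) tuples.
    """
    blocks = []
    for chrom, pos in sorted(snps):
        if blocks and blocks[-1][0] == chrom and pos - blocks[-1][2] <= max_gap_bp:
            last = blocks[-1]
            blocks[-1] = (chrom, last[1], pos, last[3] + 1)  # extend current block
        else:
            blocks.append((chrom, pos, pos, 1))  # start a new block
    return blocks


# Hypothetical positions, for illustration only.
example = [("Gm20", 100), ("Gm20", 5000), ("Gm20", 2_000_000), ("Gm15", 300)]
blocks = group_into_blocks(example, max_gap_bp=100_000)
```

Block size is then `end_bp - start_bp`, which is the quantity that ranges from 150 kb to 11,000 kb for the blocks in Table 3.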

Table 3 Eight genomic blocks containing SNPs having high AC Proteus rare allele frequency (0.667 to 1.0) and their linked SSR loci.

In a second analysis of the SNP data, the same strategy, but with more stringent screening criteria (allele frequency of 1.0), was used to search for essential and perhaps novel alleles responsible for the desirable high protein phenotype of AC Proteus and its descendants. AC Proteus SNP alleles (not shared by Maple Arrow) that are rare in the Canadian germplasm but retained in all seven AC Proteus derived cultivars were identified. Those not already reported in Table 3 are shown in Fig. 5. These criteria were met by 7 blocks (11 genes) of SNPs, located on chromosomes 2, 17 and 18. Such putatively novel regions, which are perfectly conserved through multiple breeding cycles, may carry genes having important high protein alleles derived from AC Proteus. Also shown in Fig. 5 are those SNPs which are located within genes; however, none are implicated as candidate genes by the current analyses.

Figure 5

Genome-wide analysis of AC Proteus rare alleles which were maintained across three cycles of breeding for high protein in all seven derived high protein soybean cultivars, and which contrast with Maple Arrow, the recurrent parent of AC Proteus. Items included in Table 3 are excluded from Fig. 5.

Discussion

Taken together, the five genomic regions identified in this study account for 70% of the phenotypic variation for seed protein in this population. Major QTLs for protein content have been identified on chromosome 20; this region corresponds to the most frequently reported protein content QTL in the literature and to protein content Meta-QTL #18 (ref. 28). However, the QTL identified on chromosome 20 is distant (~4–6 cM) from the reported Meta-QTLs and is supported by BSA in three different populations (X4049, X4074, and XH939), and can thus be considered a new QTL. The second QTL for protein content is at Satt213 on chromosome 15; Satt213 is distant from the closest protein content Meta-QTL, is likely an independent locus, and is supported by BSA in close proximity. However, Satt213 is tightly linked to the QTL seed protein content 1–5 (the peak marker is RFLP pSAC-7a, also known as pSAC7_1), identified in the A81356022 (G. max) x PI468916 (G. soja) population29 and reported in SoyBase.

The major protein content QTL on chromosome 20 was detected by BSA at Satt496 in three of the four populations investigated in this study, and by BSA at the adjacent marker Satt587 in the fourth population. Additional positive BSA results at flanking markers support the hypothesis that the identified locus on chromosome 20 is likely the major locus for protein content in all five populations. The related shoulder peaks (significant peaks close to the major peak) at Satt419 and Satt562 on chromosome 20 were also detected by BSA in three of the four populations.

The second protein content QTL detected in population X4050 was at Satt213 on chromosome 15. BSA at the flanking marker Satt411 was positive for two of the four populations investigated in this study (X4049 and X4075), while the high protein parent’s allele was fixed in the other two populations. This is consistent with the hypothesis that this locus is important for achieving high protein content in all five populations. BSA also identified other loci potentially important for protein content which were not detected by QTL analysis in population X4050. Further along on chromosome 15, Satt212 was positively identified in three populations and fixed for the high protein allele in the fourth. On chromosome 8, BSA gave a positive result for Satt341 in three populations, and the fourth population was fixed for the high protein parent’s allele. Also on chromosome 8, BSA gave a positive result for two populations at Satt327. Since the two markers are approximately 30 cM apart, they may well represent different loci. Satt341 and Satt327 span a genomic region with numerous seed composition QTLs reported on SoyBase.

BSA gave positive results in two of the four populations at several additional loci. The first was at Satt066 on chromosome 14, which is tightly linked to protein content Meta-QTL6 (ref. 28). At Satt396 on chromosome 4, which is linked to protein content Meta-QTL7, a third population was fixed for the high protein parent’s allele. On chromosome 15, BSA gave positive results for Satt384 in two populations while a third was fixed for the high protein allele. The Satt384 locus is linked to protein content Meta-QTL mPO15–2 (ref. 30). Also on chromosome 15, Satt231 was highlighted by BSA and is linked to protein content Meta-QTL14 (ref. 28). In all four cases, linkage to a Meta-QTL appears to validate the identification of the locus by BSA and suggests that these loci contribute to achieving high protein content in this germplasm.

A positive BSA result in only one of the four populations might well be a false positive. However, it is worth noting that in the cases of Satt192 on chromosome 12 and Satt559 on chromosome 9, the other three populations were fixed for the high protein parent’s allele. Additionally, at Satt319 on chromosome 6, two of the three other populations were fixed for the high protein parent’s allele, and the Satt319 locus is tightly linked to Meta-QTL11 (ref. 28).

A recent study31 using the high protein parents AC Proteus and AC Proteina did not find protein QTLs in the AC Proteus population but did find QTLs on chromosomes 15 and 20, in the commonly reported regions, in the AC Proteus-derived AC Proteina population.

As presented in Table 3 and 4, AC Proteus and its derived high protein progeny carry rare alleles in comparison to Canadian low protein germplasm, but many of these regions are commonly identified in the high protein literature. Some novel regions were identified; none of the genes identified in Fig. 5 have a Meta-QTL in close proximity except for Glyma.15g197800. To facilitate comparison of our SNP allele data with our QTL and BSA data, we searched the soybean genomic sequence physical map near the AC Proteus rare alleles (SNPs) to identify the closest SSR marker (Supplementary file 2 and Fig. 3). These data are consistent with our hypothesis that AC Proteus may carry novel high protein alleles.

In summary, we developed a recombination map which integrates DArT and DArTseq markers with the widely used SSR markers. QTL analysis and bulk segregant analysis identified QTLs for high protein in our populations which correspond to important QTLs in previous research and are supported by Meta-QTL analyses. We identified two QTLs for seed protein content, on chromosomes 15 and 20, which have not been included in Meta-QTL regions (five genomic regions in total, counting the two with a highly elevated test statistic below the statistical threshold and the one with epistatic interactions). Among all the regions identified by BSA in this study (Fig. 3 and Table 3), those located on chromosomes 1, 8, 9, 14, 16, 17, 19, and 20 are considered novel, in that they were identified in this study and no reported Meta-QTLs are located in close proximity. We further identified regions on chromosomes 2, 17 and 18 which were maintained in high protein cultivars derived from AC Proteus over multiple breeding cycles. These high protein regions may prove useful for further development of high yielding high protein cultivars.