Introduction

Legume seeds are an important source of human food and animal feed. Their high protein and carbohydrate contents, together with their fiber and minerals, make legumes an essential component of human diets1. With the world population growing and the demand for plant proteins increasing, there is a strong need to produce highly nutritious seeds rich in protein, essential amino acids and minerals.

Compared to grains, legume seeds have naturally high protein contents; however, they are deficient in sulfur-containing amino acids and have lower concentrations of certain dietary minerals such as Fe, Ca and Zn compared to animal proteins2. Increasing seed protein production and improving seed nutritional quality have been a challenge in the agronomic field.

The existing natural diversity of legumes could help to meet these challenges by revealing the underlying molecular mechanisms, the key molecular players and useful molecular markers. Medicago truncatula is a plant of Mediterranean origin that has served as a model legume3,4 since 1990. Its genome has been sequenced and is still being improved, with a recent fifth release5.

Several quantitative trait loci (QTL) analyses have been performed in M. truncatula to identify loci affecting seed protein and mineral compositions6,7. Nevertheless, QTL identification relies on mapping populations derived from a few parents, which limits its use in exploratory genetic approaches. Genome-wide association studies (GWAS) use a broad panel of natural accessions with high genetic diversity and could overcome the limitations of QTL analysis8. GWAS has now become a useful approach to explore the genetics of natural accessions and agronomic traits. A Medicago HAPMAP collection of over 200 natural accessions has been developed, which contains several million single nucleotide polymorphisms (SNPs)9. This Medicago GWAS panel has been successfully employed to identify candidate loci/genes associated with various agronomic traits4 such as seed protein composition7.

In this study, we performed GWAS focusing on seed traits related to seed size and seed composition using 162 accessions from the M. truncatula HAPMAP collection. Moreover, we performed association studies using both single and multi-locus models as well as several postGWAS analyses in order to identify potential loci/genes that could be involved in seed nutritional qualities in M. truncatula.

Results

Phenotypic evaluation of seed traits among the HAPMAP seed collection

We evaluated the phenotypic variation of 162 Medicago accessions for 16 seed traits related to seed size and composition, plus 16 additional traits related to seed mineral composition in a subset of 88 accessions. Seed size was described by seed weight, area, perimeter, length (called ‘majellipse’ for major axis of ellipse) and width (called ‘minellipse’ for minor axis of ellipse)10. Seed color variations (called CH1, CH2 and CH3) potentially reflected the secondary metabolite composition of the seed coat. Global seed composition was characterized by the carbon, hydrogen, nitrogen and sulfur percentages (w/w) (called %C, %H, %N, %S). From the nitrogen and sulfur concentrations, we estimated the nitrogen and sulfur contents per seed of each accession based on individual seed weights (traits called N Content and S Content, expressed in milligrams per seed). Nitrogen concentration/content is a good indicator of the global protein content of seeds and is commonly used for total protein determination in food products. Indeed, a predefined conversion factor, the Jones factor11, is used to convert nitrogen concentration into total protein content. This factor is 6.25, but may vary between species and plant tissues. We also calculated the ratio between carbon and nitrogen (C/N), which provides an estimate of global seed composition. Sulfur concentrations/contents were also characterized, as they reflect high-quality storage proteins. Indeed, legume seeds generally have a low level of sulfur-containing amino acids, which were shown to be tightly regulated by plant sulfur status12,13. Finally, macro-elements (P, K, Mg, Ca, Na) and micro-elements (Fe, Mn, Zn, Cu, Mo, Co, Ni, V) were quantified in mature seeds from the subset of 88 accessions. All phenotypic values for the analyzed accessions are provided in the Supplemental Table S1.
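The derived composition traits described above reduce to simple arithmetic. The following is a minimal sketch, assuming the generic factor of 6.25 quoted in the text; the function names and the example accession values are hypothetical and are not taken from Supplemental Table S1.

```python
# Generic nitrogen-to-protein conversion factor quoted in the text;
# it may differ between species and tissues.
JONES_FACTOR = 6.25


def per_seed_content_mg(concentration_pct, seed_weight_mg):
    """Element content per seed (mg) from a % (w/w) concentration."""
    return concentration_pct / 100.0 * seed_weight_mg


def protein_pct(nitrogen_pct, factor=JONES_FACTOR):
    """Estimated total protein (% w/w) from nitrogen concentration."""
    return nitrogen_pct * factor


# Hypothetical accession: 45 %C, 6 %N, 0.3 %S, 4 mg per seed.
n_content = per_seed_content_mg(6.0, 4.0)    # mg N per seed
s_content = per_seed_content_mg(0.3, 4.0)    # mg S per seed
protein = protein_pct(6.0)                   # % protein (w/w)
cn_ratio = 45.0 / 6.0                        # C/N ratio

print(n_content, s_content, protein, cn_ratio)
```

Because the per-seed content is the product of concentration and seed weight, it is expected to track seed size closely whenever concentrations vary little between accessions.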

Phenotypic diversity, correlation between seed traits and impact of geographical location

A wide range of phenotypic variation was observed among the different accessions tested (Supplementary Figure S1 and Supplemental Table S1) with a coefficient of variation (CV) ranging from 1% for the most stable traits such as carbon and hydrogen concentrations, to 84% for Fe concentration. Other seed traits showed a high variability such as seed weight, N content and S content with CVs around 20%. In general, seed mineral concentrations showed the highest phenotypic diversity with Fe, Zn and Na displaying higher CV values. All the phenotypic values and CVs are provided in Supplementary Table S1.
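As an illustration of the coefficient of variation used above, here is a minimal sketch. It assumes the sample standard deviation (n − 1 denominator); the source does not state which estimator was used, and the trait values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical trait values measured in four accessions.
cv = coefficient_of_variation([10.0, 12.0, 8.0, 10.0])
print(round(cv, 2))
```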

Since the geographical origin of each accession was available, we assigned to each accession three geographical values (i.e. longitude, latitude, altitude) and 19 bioclimatic values obtained from the WorldClim database (http://worldclim.org). These bioclimatic values (called BIO1 to BIO19) mainly represent temperature and rainfall values measured monthly, quarterly or annually (see details in Fig. 1 legend). A global correlation analysis was performed to identify correlations among seed phenotypic traits and between these traits and the geographical and bioclimatic values (Fig. 1). Results showed that all seed traits related to seed size (i.e. weight, area, perimeter, minellipse and majellipse) were highly correlated (Pearson correlation coefficient, PCC > 0.9), which validated the accuracy of our measurements. Similar results were obtained for seed color values (i.e. PCC > 0.85 for CH1, CH2, CH3).

Figure 1

Correlation matrix between Medicago seed traits, and in relation to their geographical locations and climatic data. Only Pearson correlation coefficients (PCC) with adjusted p-values below 5% are indicated after BH procedure to control false discovery rate. Red color indicates PCC above 0.2 and green color indicates PCC below − 0.2. Longitude is expressed in degrees with negative degrees representing west and positive degrees representing east. Latitude is also expressed in degrees with negative degrees representing south and positive degrees representing north. Climatic data are from WorldClim. BIO1 annual mean temperature, BIO2 mean diurnal range, BIO3 isothermality, BIO4 temperature seasonality, BIO5 max temperature of warmest month, BIO6 min temperature of coldest month, BIO7 temperature annual range, BIO8 mean temperature of wettest quarter, BIO9 mean temperature of driest quarter, BIO10 mean temperature of warmest quarter, BIO11 mean temperature of coldest quarter, BIO12 annual precipitation, BIO13 precipitation of wettest month, BIO14 precipitation of driest month, BIO15 precipitation seasonality, BIO16 precipitation of wettest quarter, BIO17 precipitation of driest quarter, BIO18 precipitation of warmest quarter, BIO19 precipitation of coldest quarter.
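The two ingredients of this correlation matrix, the Pearson coefficient and the Benjamini-Hochberg (BH) adjustment, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the p-values are made-up inputs rather than values from this study, and the computation of the p-value attached to each coefficient is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
adj = bh_adjust([0.01, 0.04, 0.03, 0.005])  # hypothetical raw p-values
kept = [p < 0.05 for p in adj]              # coefficients shown in the matrix
print(r, adj, kept)
```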

Regarding seed content, we observed that nitrogen and sulfur contents were also highly correlated with seed size traits (PCC > 0.89 for N content and 0.74 for S content), which suggested that variations in seed content were predominantly determined by seed size. Regarding mineral composition in seeds, we observed positive correlations between concentrations of some elements such as Ca, Mg, Fe, Cu and Na (PCC > 0.7) but also between the macro-elements P and K (PCC > 0.75, Fig. 1).

When geographical values were added, we observed a moderate positive correlation between accession longitude and seed C/N ratio (Fig. 1), which indicated that accessions collected further east tended to have a higher C/N ratio (i.e. less nitrogen). As a possible explanation for this difference, we also observed moderate positive correlations (PCC > 0.35) between seed size, seed contents (N and S) and temperature (i.e. BIO9, BIO10) and, conversely, moderate negative correlations (PCC < −0.3) between seed weight, N content and precipitation (i.e. BIO14, BIO17, BIO18). The integration of the bioclimatic data suggested that temperature and precipitation played an important role in the adaptation of accessions through final seed size determination, with consequences for sulfur and nitrogen contents.

Genome-wide association analysis of seed traits

To perform the genome-wide association analysis, we first used the Box-Cox procedure14 to estimate the appropriate lambda for transforming each phenotype and thereby satisfy the normality assumption required for GWAS. Out of the 32 measured seed phenotypes, 26 traits were transformed using their respective lambdas and displayed a normal distribution according to the Shapiro–Wilk test (Fig. 2, Supplementary Table S1 and Supplementary Figure S1). However, six seed traits (perimeter, CH1, %C, %H, C/H ratio and arsenic (As) concentration) were discarded from subsequent GWA analyses because they did not reach normality even after transformation.
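The Box-Cox step can be sketched as below, assuming lambda is chosen by maximizing the profile log-likelihood over a grid. This is a simplified stand-in for the procedure cited in the text: the grid, the function names and the example data are hypothetical. The Shapiro–Wilk test is not reimplemented here; in practice a statistics library (for example `scipy.stats.boxcox` and `scipy.stats.shapiro`) would normally handle both steps.

```python
import math


def boxcox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def boxcox_loglik(values, lam):
    """Profile log-likelihood of lambda under a normal model."""
    n = len(values)
    t = boxcox(values, lam)
    mean = sum(t) / n
    var = sum((v - mean) ** 2 for v in t) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)


def estimate_lambda(values, grid=None):
    """Grid-search estimate of lambda (assumed grid: -2.00 to 2.00 by 0.01)."""
    if grid is None:
        grid = [i / 100 for i in range(-200, 201)]
    return max(grid, key=lambda lam: boxcox_loglik(values, lam))


# Hypothetical right-skewed trait: exactly symmetric on the log scale,
# so the estimated lambda should be close to 0 (log transform).
trait = [math.exp(z) for z in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0)]
lam_hat = estimate_lambda(trait)
transformed = boxcox(trait, lam_hat)
print(lam_hat)
```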

Figure 2

Distribution histograms of seed size and composition phenotypes in different Medicago accessions. Corresponding distribution curves are indicated on histograms. Different x-axes represent the corresponding values of the phenotypes.

In this study, two different models for genome-wide association predictions were applied to normalized phenotypes: a classical single-locus mixed linear model (MLM; EMMA15) with kinship and population structure as inputs, and a multi-locus model (FarmCPU16) with correction for population structure. With the multi-locus FarmCPU model, the QQ plots showed a better fit between observed and expected p-values under the null hypothesis (Supplementary Figure S2). These QQ plots indicated that most of the tested SNPs had no significant association, except for a few SNPs with a strong and significant effect. In contrast, QQ plots obtained with the EMMA algorithm generally showed an observed curve below the theoretical one (i.e. a deflated curve), which suggested that this model was not appropriate for this association study. The Manhattan plots also differed between EMMA and FarmCPU (Supplementary Figure S2). In general, FarmCPU gave less background noise, more precise SNP locations and lower p-values than the MLM, especially for highly significant SNPs. Manhattan plots obtained from the MLM displayed broader “peaks” made of multiple significant SNPs (i.e. SNP clusters). Overall, most of the highly significant SNPs were identified by both methods, but FarmCPU provided more detection power and accuracy to identify quantitative trait nucleotides (QTNs) (Supplementary Figure S2). Therefore, we decided to focus on the multi-locus model with FarmCPU in the subsequent analyses. All results (Manhattan and QQ plots) obtained from FarmCPU in this study are provided as Supplementary Figures S3-S7. Moreover, GWAS files directly readable in genome browsers such as the web-accessible JBrowse17 or desktop genome viewers such as the Integrative Genomics Viewer (IGV18) are also provided as Supplemental Tables S2-S5.
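The model comparison above rests on visual inspection of QQ plots. As a hedged illustration, the sketch below builds the QQ coordinates and adds the genomic inflation factor, a common numerical summary of deflation or inflation that the source does not report. The p-values are simulated placeholders, not results from EMMA or FarmCPU.

```python
import math
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def qq_points(pvalues):
    """(expected, observed) -log10(p) pairs for a QQ plot."""
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))


def genomic_inflation(pvalues):
    """Median observed chi-square (1 df) over its null expectation."""
    nd = NormalDist()
    # Two-sided p-value -> squared z-score, i.e. a 1-df chi-square statistic.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    return median(chi2) / CHI2_1DF_MEDIAN


# Placeholder p-values: a perfectly uniform set and a deflated set.
uniform = [(i + 0.5) / 1000 for i in range(1000)]
deflated = [p ** 0.5 for p in uniform]

lambda_uniform = genomic_inflation(uniform)
lambda_deflated = genomic_inflation(deflated)
print(round(lambda_uniform, 3), round(lambda_deflated, 3))
```

A value close to 1 corresponds to observed p-values following the null expectation, while a value well below 1 corresponds to the deflated curves described for the single-locus model.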

As described above, association studies and their resulting Manhattan plots fell into two contrasting situations: identification of highly significant QTNs with a clear genomic location, and identification of clusters of SNPs indicating associated loci. As a first result of these analyses, we clearly identified highly significant QTNs associated with seed size (Supplementary Figure S3) and seed composition (Supplementary Figure S4) on several chromosomes. For instance, we observed five, six, four and six QTNs highly associated with seed area, seed length, seed width and seed weight, respectively, with a -log10(p-value) > 10 (i.e. p-value < 10^-10). Regarding seed color (Supplementary Figure S5) and seed mineral concentrations (Supplementary Figures S6, S7), QTN significance was markedly lower and nearer to background noise, which allowed only the identification of specific genomic regions (i.e. SNP clusters) rather than highly significant individual QTNs.

To identify relevant QTNs, we combined association results from highly correlated seed traits. For instance, we combined FarmCPU results from weight, area, majellipse and minellipse (Fig. 3a) and identified QTNs common to several seed size traits, such as MtrunA17Chr4_56801315 on chromosome 4. Interestingly, this QTN showed highly significant p-values for all four seed size traits (10^-18, 10^-25, 10^-21 and 10^-10 for area, majellipse, minellipse and weight, respectively), suggesting a reliable QTN regulating seed size. This QTN is located within the genomic sequence encoding a protein containing an RNA binding motif (gene ID MtrunA17Chr4g0065741). Another potentially reliable QTN (MtrunA17Chr1_35506650) was identified from three different seed size phenotypes with highly significant p-values of 10^-9, 10^-8 and 10^-19 for area, minellipse and weight, respectively. This QTN is located on chromosome 1, close to a genomic sequence encoding a WD40-LIKE transcription factor (gene ID MtrunA17Chr1g0185101).
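Combining association results across correlated traits amounts to finding SNPs that pass a significance threshold in more than one trait. A minimal sketch follows, using the order-of-magnitude p-values quoted above. The threshold of 10^-7, the function name and the extra SNP `hypothetical_snp` are assumptions made for illustration and do not reproduce the authors' pipeline.

```python
from collections import defaultdict


def shared_qtns(results, threshold=1e-7, min_traits=2):
    """Return SNPs significant in at least `min_traits` traits.

    `results` maps trait name -> {SNP name: p-value}.
    """
    hits = defaultdict(list)
    for trait, snps in results.items():
        for snp, p in snps.items():
            if p < threshold:
                hits[snp].append(trait)
    return {snp: sorted(traits) for snp, traits in hits.items()
            if len(traits) >= min_traits}


# Approximate p-values quoted in the text; "hypothetical_snp" is made up.
results = {
    "area": {"MtrunA17Chr4_56801315": 1e-18, "MtrunA17Chr1_35506650": 1e-9},
    "majellipse": {"MtrunA17Chr4_56801315": 1e-25, "hypothetical_snp": 1e-8},
    "minellipse": {"MtrunA17Chr4_56801315": 1e-21, "MtrunA17Chr1_35506650": 1e-8},
    "weight": {"MtrunA17Chr4_56801315": 1e-10, "MtrunA17Chr1_35506650": 1e-19},
}

shared = shared_qtns(results)
for snp, traits in sorted(shared.items()):
    print(snp, traits)
```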

Figure 3

Genome-wide association studies of the Medicago seed traits with Manhattan plots and QQ plots obtained from FarmCPU. (A) Combination of association studies regarding seed size (weight, area, majellipse, minellipse). (B) Combination of association studies regarding seed sulfur content (mg/seed) and sulfur concentration (%, w/w). (C) Combination of association studies regarding seed protein content (nitrogen content (mg/seed); nitrogen concentration (%, w/w); carbon/nitrogen ratio).

Similarly, we compared association studies between sulfur content and sulfur concentration and identified four major QTNs shared between these two traits with low p-values (Fig. 3b). (i) MtrunA17Chr1_31627600 on chromosome 1 is located within the coding sequence of the EXPORTIN5 protein (MtrunA17Chr1g0180461), which is closely related to the Arabidopsis HASTY1 protein, shown to act as a nucleocytoplasmic transporter involved in the nuclear export of small RNAs19. (ii) MtrunA17Chr4_32623172 on chromosome 4 is located in a chromosomal region rich in transposable elements. (iii) MtrunA17Chr5_8051955 on chromosome 5 is close to a gene encoding a salicylate methyltransferase (SAMT, MtrunA17Chr5g0404631), which catalyzes the methylation of salicylic acid with S-adenosyl-L-methionine to form methyl salicylate (MeSA), mainly in response to stress20. (iv) MtrunA17Chr8_48959923 on chromosome 8 is located in the promoter region of a gene encoding a histidine kinase (MtrunA17Chr8g0392301).

Regarding nitrogen composition, we compared association studies between nitrogen concentration, nitrogen content and C/N ratio in seeds (Fig. 3c). In this comparison, it was more difficult to identify clear QTNs, as the N concentration and C/N ratio results showed genomic regions rather than individual and distinct QTNs associated with these phenotypes. However, regions mainly located on chromosomes 1, 2, 6 and 8 showed strong associations between seed nitrogen composition and accession polymorphisms, which suggested that these regions could play a role in seed nitrogen composition. Moreover, some particular QTNs were highly relevant for further analyses and are indicated in Table 1. First, we identified a highly significant QTN (MtrunA17Chr6_7310002) associated with both protein concentration and C/N ratio, which is located close to a genomic sequence encoding a putative amino acid transporter (MtrunA17Chr2g0333321). Second, in the N content association study, we also identified a highly significant p-value for the QTN MtrunA17Chr4, which was already identified in the four seed size traits. This result was predictable given the high PCC between seed size and nitrogen content. It suggested that this QTN could be a regulator of both traits, making it a potentially interesting candidate to improve seed size and seed protein content concomitantly.

Table 1 Top five QTNs significantly associated with different seed size traits (i.e. weight, area, majellipse, minellipse) and seed compositions (S content, N content, %S, %N and C/N ratio). SNP/QTN names, positions and p-values are indicated from FarmCPU. Numbers of potential associated SNP(s) and putative causal genes are indicated from PLINK analysis. Gene expression in major Medicago plant organs, as well as tentative gene annotations are indicated. A more exhaustive list of highly significant QTNs related to all seed traits is provided as Supplementary Table S6, and complete lists of SNPs and their associated p-values are provided as Supplementary Tables S2 to S5.

Regarding seed color and seed mineral concentrations, several loci were identified by combining results from CH2 and CH3 and from all macro- and micro-element concentrations. However, neither major QTNs (i.e. with p-values < 10^-10) nor precisely located SNP clusters were identified. This absence of highly significant QTNs for seed mineral concentrations could be explained by the small population size used in this specific analysis (i.e. subset of 88 accessions).

PostGWAS analyses to identify putative causal genes

To shorten the list of candidate QTNs, we used a p-value threshold of 10^-7 when association studies displayed high SNP detection power, as for seed size and seed composition phenotypes, and a p-value threshold of 10^-5 when association analyses displayed low SNP detection power, as for seed color and seed mineral concentrations. Then, linkage disequilibrium (LD) was considered to identify putative causal genes associated with selected QTNs. Considering that the average LD decay in the Medicago HAPMAP collection was estimated at around 15 kb21, we computed correlations between each selected QTN and the SNPs present within this genomic range (i.e. ± 15 kb from QTNs) using PLINK22. A correlation threshold of 0.7 was used to identify SNPs potentially in LD within these genomic regions. From this analysis, we established a list of SNPs correlated with the selected QTNs due to LD, and therefore a list of potential causal genes. From this list, we revealed 56 putative causal genes related to the 34 QTNs with highly significant p-values that are potentially involved in seed size determination, 123 putative causal genes related to the 56 QTNs potentially involved in seed composition, 90 putative causal genes related to the 45 QTNs potentially involved in seed color and 906 putative causal genes related to the 597 QTNs potentially involved in seed mineral composition (Table 1 and Supplementary Table S1). Due to the relatively low number of ecotypes used for the QTN identification related to seed nutritional composition, which might affect the statistical accuracy of the study, we decided to provide these results as supplementary data but we will not analyze them further.
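The LD filtering step can be illustrated with a small sketch. It assumes that LD is measured as the squared Pearson correlation between genotype codes (0/1/2 copies of the alternative allele), with the ± 15 kb window and 0.7 threshold given in the text. It does not reproduce PLINK's exact estimator or command line, and all SNP names, positions and genotypes are hypothetical.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:  # monomorphic SNP: correlation undefined
        return 0.0
    return sab * sab / (saa * sbb)


def snps_in_ld(qtn_pos, qtn_geno, snps, window=15_000, min_r2=0.7):
    """Names of SNPs within the window and above the r2 threshold."""
    selected = []
    for name, (pos, geno) in snps.items():
        if abs(pos - qtn_pos) <= window and r_squared(qtn_geno, geno) >= min_r2:
            selected.append(name)
    return sorted(selected)


# Hypothetical QTN and neighbouring SNPs in eight inbred accessions.
qtn_position = 1_000_000
qtn_genotypes = [0, 0, 2, 2, 0, 2, 0, 2]
neighbours = {
    "snpA": (1_005_000, [0, 0, 2, 2, 0, 2, 0, 2]),  # close, perfect LD
    "snpB": (995_000, [0, 2, 0, 2, 0, 2, 0, 2]),    # close, weak LD
    "snpC": (1_040_000, [0, 0, 2, 2, 0, 2, 0, 2]),  # perfect LD, too far
}

linked = snps_in_ld(qtn_position, qtn_genotypes, neighbours)
print(linked)
```

Genes overlapping or adjacent to the retained SNPs would then form the list of putative causal genes.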

To identify functional classes that could be involved in regulating these different seed phenotypes, we performed gene ontology (GO) over-representation analyses with the corresponding lists of putative causal genes for each phenotype (Table 2). Interestingly, we observed that the list of putative causal genes regulating seed size was enriched in GO terms related to the U12-type spliceosomal complex (GO:0005689). Similarly, using the list of putative causal genes regulating seed protein content/concentration, we observed enrichment of genes with GO terms referring to nutrient reservoir activity (GO:0045735), amino acid transport (GO:0015171, GO:0003333) and the oxalate metabolic pathway (GO:0033609, GO:0046564), which are all functional classes closely related to biosynthesis or transport of amino acids23. Among putative genes regulating seed color, GO terms referring to flavonoid biosynthesis were enriched (i.e. GO:0080043, GO:0080044, GO:0052696), and flavonoid composition/concentration has indeed been shown to be closely related to seed coat color24. Finally, we observed enrichment of the GO term related to protein amino acid autophosphorylation (GO:0046777) among genes potentially regulating mineral concentrations, which was less intuitive and presumably reflects an indirect relationship.
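Over-representation of a GO term in a gene list is commonly assessed with a one-sided hypergeometric test. The sketch below shows that test under this assumption; the source does not state which statistic or software was used, and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_genome, n_term, n_list, n_overlap):
    """P(X >= n_overlap) when drawing n_list genes without replacement.

    n_genome: annotated genes in the genome (background)
    n_term:   background genes carrying the GO term
    n_list:   genes in the candidate list
    n_overlap: candidate genes carrying the GO term
    """
    total = comb(n_genome, n_list)
    upper = min(n_term, n_list)
    tail = sum(comb(n_term, i) * comb(n_genome - n_term, n_list - i)
               for i in range(n_overlap, upper + 1))
    return tail / total


# Hypothetical counts: 20 background genes, 5 carry the term,
# 4 candidate genes of which 2 carry the term.
p_small = hypergeom_enrichment_p(20, 5, 4, 2)
print(p_small)
```

In a real analysis one such p-value would be computed per GO term and then corrected for multiple testing.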

Table 2 Enrichment analysis of Gene Ontology (GO) terms on putative causal genes regulating different seed traits (i.e. size, composition and color).

To identify potential specific regulators of seed traits, we also focused on seed expression specificity and compared the list of genes specifically expressed in seeds and pods with our list of candidate causal genes related to seed traits. Expression analysis in different Medicago plant organs was performed using publicly available data. To allow comparison with our data, we mapped the corresponding reads to version 5 of the Medicago genome5 and quantified transcript expression using the Salmon pipeline25. Out of 44,473 transcripts in the Medicago genome (v5), 375 were identified as specifically or preferentially expressed in pods/seeds (Supplementary Table S7). After combining the list of seed-specific genes with our list of putative causal genes from GWA studies, we revealed two seed-specific genes potentially regulating seed nitrogen concentration: a zinc-finger transcription factor (MtrunA17Chr7g0217321) and a CAAT-Binding Transcription factor (CBF, MtrunA17Chr2g0318461), and eight seed-specific genes potentially regulating various mineral concentrations in seeds (Supplementary Table S6).

Discussion

Improving seed protein content in M. truncatula seeds by increasing seed size

Grain legumes play a key role in providing plant proteins for food and feed. Therefore, understanding how to increase seed protein content and to produce storage proteins with high nutritional value (i.e. containing essential amino acids and sulfur-containing amino acids) represents a challenge that has yet to be overcome. In this study, we observed significant genetic variability in seed traits such as size, nitrogen content (i.e. storage protein content) and sulfur content (i.e. sulfur-containing amino acid content), which makes the Medicago HAPMAP collection a valuable tool to improve these agronomic traits. Interestingly, our correlation matrix between these different seed traits within the HAPMAP population revealed a strong correlation (PCC > 0.89) between seed size and protein content (Fig. 1), which suggested that increasing seed protein content could be directly achieved by increasing seed size. This hypothesis is, first, supported by the identification of colocalized QTLs for seed size and seed protein content in garden pea26, soybean27, common bean28 and cowpea29. In parallel, several genetic studies have already highlighted genes controlling seed size, which generally act via regulation of mitotic activity in the embryo and endosperm, such as SBT1.130 and DASH31 in M. truncatula, or via regulation of cell elongation in the endosperm and seed coat, such as ZHOUPI32 and TTG233 in A. thaliana (for review34). However, the hypothesis that increasing seed size would increase protein content is difficult to validate from the literature, because mutant lines displaying larger seeds were not tested for their protein content and, conversely, mutant lines affected in protein content were not tested for seed size. One exception is the gene AP2 in Arabidopsis, whose mutation produced larger seeds combined with an increase in protein and fatty acid content35, which supports our hypothesis.
Finally, numerous correlation analyses between seed size and protein content have been conducted on cereals and legumes, but no general trend was observed. Indeed, even if several studies reported clear positive correlations between seed size and seed protein content, in pigeon pea36, soybean37 and Medicago (this study), many others did not, suggesting genotype-environment effects. These results are undoubtedly dependent on plant genetic background, favorable growth conditions and optimal agricultural practices. Indeed, in our study, we revealed that the geographical and bioclimatic origins of Medicago accessions played an important role in plant adaptation, with correlations between seed size, seed content, temperature and precipitation during the reproductive phase (Fig. 1). These accessions showed a phenotypic adaptability to produce larger seeds with higher protein content. Moreover, the variation of these traits within the same genetic background should also be considered, as abiotic stress is known to affect proper seed development in Medicago38. Finally, one essential condition for this positive correlation between seed size and protein content is a non-limiting nitrogen supply, which could be achieved via intensive nitrogen fertilization or via nitrogen fixation in legumes, which is still active during seed filling. In this study, we highlighted genes/loci potentially involved in seed size, but also in both seed size and seed protein content, which could potentially improve seed nutritional value and agronomic performance simultaneously, as it is already well documented that larger seeds tend to improve germination vigor and plantlet establishment (for review39).

Efficiency of GWAS and post-GWAS algorithms

Over the past 10 years, owing to the rapid development of genome sequencing technologies and phenotyping capacities, numerous genome-wide association studies (GWAS) have been performed in many species. This powerful tool is becoming a standard in forward genetics to identify genes/loci controlling various traits. Its rapid development has been accompanied by the development of two main families of association methods: classical single-locus GWAS methods based on the General Linear Model (GLM) and the Mixed Linear Model (MLM) (e.g. EMMA15; SUPER40), and more recently developed multi-locus GWAS methods such as MLMM41, FASTmrEMMA42 and FarmCPU43. In single-locus methods, statistical tests are performed one locus at a time, whereas multi-locus methods consider the information of all loci simultaneously and consequently do not require false discovery rate correction, leading to higher QTN detection power44. In our study, we compared a single-locus method, EMMA, and a multi-locus method, FarmCPU, and made two observations. (i) When association studies revealed highly significant candidate QTNs, FarmCPU (i.e. the multi-locus method) resulted in more significant QTNs with lower p-values and more precise chromosome positions. Indeed, EMMA (i.e. the single-locus method) showed higher QTN p-values, closer to the background noise, which led to the identification of loci represented by broader “peaks” containing multiple significant SNPs (i.e. SNP clusters) in Manhattan plots, which were therefore more difficult to locate precisely on chromosomes (Supplementary Figure S2). However, even if FarmCPU identified more significant QTNs with more precise locations, most of the highly significant QTNs were observed using both methods (Supplementary Figure S2A-B). (ii) When association studies did not reveal significant QTNs, single and multi-locus methods performed similarly (Supplementary Figure S2C).
In conclusion, in our study FarmCPU, the multi-locus method, performed better overall than the single-locus method, which explains why we focused on this method to identify candidate QTNs. Better performance of multi-locus GWAS models has also been observed in several other studies, such as Xu et al.45 on starch properties in maize, Jaiswal et al.46 on agronomic traits in wheat, and Li et al.47 on fiber quality in cotton, making these methods attractive for association studies.

Potential regulation of seed size and protein content via RNA regulation

In order to determine reliable QTNs and mine for causal candidate genes controlling seed size and composition, we performed postGWAS analyses. First, we considered a 15 kb LD decay (r² > 0.7), as determined in the Medicago HAPMAP collection21, to identify associated SNPs due to LD. Then, depending on the association results, we used different approaches to refine candidate gene selection: combination of association results from correlated phenotypes to identify putative causal genes, use of over-representation analysis to identify key functional classes regulating phenotypes, and integration of transcriptomics.

Regarding seed size, we mined two highly significant QTNs associated with multiple seed size phenotypes by combining GWAS results of weight, area, majellipse and minellipse. First, MtrunA17Chr1_35506650, a QTN detected in three association studies (i.e. weight, minellipse and area), is near a gene encoding a WD40/BEACH domain protein (MtrunA17Chr1g0185101) (Table 1 and Supplemental Table S6). A potential ortholog of this gene in Arabidopsis, called SPIRRIG (SPI, AT1G03060), has been shown to be involved in cell morphogenesis via interaction with processing bodies (i.e. p-bodies)48, which are known to regulate mRNA processing during development or stress (for review49). In Arabidopsis, spi mutant lines displayed many developmental defects50 including reduced seed coat mucilage and plant growth impairment under salt stress51. Interestingly, the second QTN (MtrunA17Chr4_56801315), detected in all four association studies related to seed size, was close to a gene encoding an RNA-binding domain protein (RBD, MtrunA17Chr4g0065741), which is also involved in the regulation of RNA. RBD proteins belong to a large protein family known to determine RNA fate from synthesis to degradation. Few of them have been functionally characterized and, depending on their RNA targets, they could play tissue- and developmental stage-specific roles52. For instance, one member of the RBD protein family functionally characterized in Arabidopsis seed development is SUPPRESSOR OF ABI3 (SUA, AT3G54230), which controls alternative splicing of ABI3, a master regulator of seed development and maturation53. This QTN identified from several seed size association studies was also detected in association with seed nitrogen content (Table 1), which pointed to an important role of this gene in regulating both seed size and protein content.

This role of RNA processing/regulation in the control of seed size was further highlighted by the over-representation analysis of all highly significant QTNs associated with seed size, which revealed that the “U12-type spliceosomal complex” class was over-represented. This complex is part of the minor spliceosome, which plays a crucial role in the splicing of the rare U12 introns. It has been shown in Arabidopsis that homozygous mutant lines impaired in the U12 spliceosome complex displayed premature embryo abortion, whereas heterozygous mutants were defective in seed maturation, indicating an essential role of this complex during embryonic development54. Moreover, proper splicing and alternative splicing have been shown to be crucial for normal embryo formation (for review55) and embryo development, which is a key stage in controlling final seed size.

Methods

Medicago plant accession and growth

Accessions from the HapMap germplasm collection were requested from the dedicated website (http://www.Medicagohapmap.org/hapmap/germplasm). Around 200 accessions were grown in growth chambers (20 °C/18 °C, 16 h photoperiod at 200 µmol m−2 s−1) until maturity. Mature seeds of 162 accessions were collected in sufficient quantity to perform the different phenotyping experiments.

Seed size and color determination

Individual seed weights of the 162 accessions were estimated by weighing 50 seeds in triplicate on a precision balance with an accuracy of ± 0.1 mg and are expressed as mg per seed. To complete seed size phenotyping, image analyses were performed on 150 seeds of each accession using the GrainScan software10 to automatically measure individual seed areas (i.e. pixel number, called “area”), seed perimeters (“perimeter”), seed lengths (“majellip”) and seed widths (“minellip”). These seed size parameters were averaged for each of the 162 accessions and used for the subsequent analyses. Image analysis with GrainScan also allowed us to determine seed color values in three color channels (i.e. CH1, CH2, CH3) derived from raw RGB values, which reflect seed coat pigmentation.

Seed composition analysis with elemental CHNS analyzer (162 accessions)

Seed composition was characterized using a CHNS elemental analyzer, which measured the percentage (w/w) of carbon (C), hydrogen (H), nitrogen (N) and sulfur (S). Mature seeds were ground in liquid nitrogen and dried in an oven at 90 °C for 48 h. Then, triplicates of approximately 5 mg of powder were analyzed with an Elementar Vario Micro cube analyzer (Germany) by flash combustion of the sample based on the “Dumas” method. Concentrations of C, H, N and S were determined by the Elementar Vario software based on exact sample weights. From these values, carbon–nitrogen ratios (C/N ratio) were calculated to provide an accurate overview of the global seed composition. Nitrogen and sulfur contents per seed for each accession (i.e. N content, S content) were calculated using the average seed weight of each lot.
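The derived traits described above can be illustrated with a minimal Python sketch. It assumes that replicate percentages are simply averaged and that per-seed content is the mean percentage multiplied by the average seed weight; the function name and the numerical values are hypothetical and are not taken from the study.

```python
from statistics import mean

def seed_composition(c_pct, n_pct, s_pct, seed_weight_mg):
    """Summarise CHNS replicates for one accession.

    c_pct, n_pct, s_pct: lists of percentages (w/w) from replicate runs.
    seed_weight_mg: average weight of one seed, in mg.
    """
    c, n, s = mean(c_pct), mean(n_pct), mean(s_pct)
    return {
        "C/N ratio": c / n,
        # percentage (w/w) of the seed weight, expressed in mg per seed
        "N content (mg/seed)": n / 100 * seed_weight_mg,
        "S content (mg/seed)": s / 100 * seed_weight_mg,
    }

# Hypothetical triplicate values, for illustration only
result = seed_composition(
    c_pct=[45.0, 45.2, 44.8],
    n_pct=[6.0, 6.1, 5.9],
    s_pct=[0.30, 0.31, 0.29],
    seed_weight_mg=4.0,
)
print(result)
```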

Macro- and micro-element concentrations

A subset of 88 accessions was analyzed to determine elemental concentrations of P, K, Mg, Ca, Na, Fe, Mn, Zn, Cu, Mo, V, Co, Ni, Ti, As and Cr using Inductively Coupled Plasma Mass Spectrometry (ICP-MS, Perkin Elmer model NexION 300D). Seed powders were dried overnight in a heating oven at 75 °C. Approximately 5 mg of seed powder were accurately weighed and transferred to a glass container with 3 ml of concentrated nitric acid (HNO3). After digestion for 15 min at 200 °C, deionized water was added to adjust the final volume to 10.0 ml and samples were injected into the ICP-MS for measurement. A blank sample containing 5% HNO3 was used for background subtraction. Concentrations (i.e. ppb or µg/L) of each element were calculated based on an internal standard mix (Perkin Elmer, ref. 9301721) and normalized to sample weight using the NexION software (Perkin Elmer).
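The exact weight normalization applied by the instrument software is not detailed here. As a rough sketch only, a conversion from a digest concentration to a concentration per gram of seed powder could look as follows in Python, assuming that the digest concentration is expressed in µg/L, that the blank is subtracted first, and that the result is scaled by the final volume and the weighed mass. The function name and the example numbers are hypothetical.

```python
def solution_to_seed_concentration(conc_ug_per_l, blank_ug_per_l,
                                   final_volume_ml=10.0, sample_mass_mg=5.0):
    """Convert a digest concentration (ug/L) to ug of element per g of seed powder.

    Assumed procedure (not confirmed by the instrument documentation):
    blank subtraction, then scaling by digest volume and sample mass.
    """
    net = conc_ug_per_l - blank_ug_per_l             # background subtraction
    element_ug = net * (final_volume_ml / 1000.0)    # ug of element in the digest
    return element_ug / (sample_mass_mg / 1000.0)    # ug per g of powder

# Hypothetical reading: 50 ug/L in the sample, 2 ug/L in the blank
print(solution_to_seed_concentration(50.0, 2.0))
```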

Correlation analysis

A correlation matrix was computed from the average phenotype values. Each pairwise comparison was performed using Pearson correlation on pairwise-complete observations with the ‘corr.test’ function from the R package ‘psych’. P-values were adjusted using the Benjamini-Hochberg (BH) procedure to control the false discovery rate, and correlations with an adjusted p-value below 5% were considered statistically significant.
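For readers unfamiliar with the BH adjustment, the following Python sketch shows how adjusted p-values are obtained from a list of raw p-values. It is a generic illustration of the procedure, not the code used in the study (which relied on R), and the input p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top          # 1-based rank in ascending order
        # Scale by m / rank and enforce monotonicity with a running minimum
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four pairwise correlations
raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
significant = [a < 0.05 for a in adj]
print(adj, significant)
```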

Phenotype normality distributions

All traits were checked and transformed to reach normality, as required for genome-wide association studies. The Box-Cox algorithm14 was used to determine the appropriate transformation for each trait, and each trait was transformed separately according to the most suitable lambda value given by the Box-Cox function implemented in the R package MASS56. After transformation, Shapiro–Wilk tests57 were performed to validate normality, and traits that did not reach normality were excluded from the subsequent GWAS analyses. Supplementary Table S1 provides seed trait values before and after Box-Cox transformation, the lambda value for each trait and the corresponding p-value of the Shapiro–Wilk test after transformation.
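The Box-Cox step can be sketched in a few lines of Python. The sketch below applies the standard Box-Cox formula and selects lambda by a grid search on the profile log-likelihood under a normal model; the grid bounds and step are an assumption made for illustration, the data are hypothetical, and the Shapiro–Wilk check is left out.

```python
import math

def boxcox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]          # limit case lambda = 0
    return [(v ** lam - 1.0) / lam for v in values]

def boxcox_loglik(values, lam):
    """Profile log-likelihood of lambda under a normal model (up to a constant)."""
    n = len(values)
    t = boxcox(values, lam)
    mu = sum(t) / n
    var = sum((x - mu) ** 2 for x in t) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)

def best_lambda(values, grid=None):
    """Grid search for the lambda that maximizes the profile log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]       # assumed grid: -2.0 to 2.0
    return max(grid, key=lambda lam: boxcox_loglik(values, lam))

# Hypothetical right-skewed trait values
trait = [math.exp(x) for x in (-2, -1, 0, 1, 2)]
lam = best_lambda(trait)
transformed = boxcox(trait, lam)
print(lam, transformed)
```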

Genome-wide association studies and post-GWAS analyses

Single nucleotide polymorphism (SNP) data were obtained by whole-genome sequencing of the 262 Medicago accessions from the M. truncatula HapMap project9. Of the 6 million SNPs originally identified in Medicago genome version 4, 4,852,061 SNPs were successfully mapped to the fifth version of the Medicago genome (Mtv55) and were used for subsequent analyses. The population structure and the kinship matrix used in this study were the same as previously described in Bonhomme et al.58 and Le Signor et al.7, respectively. Two models were used to perform GWAS: (1) a classical single-locus method using a mixed linear model called EMMA (Efficient Mixed-Model Association15) with the kinship matrix and the population structure as inputs; (2) a multi-locus model called FarmCPU (Fixed and random model Circulating Probability Unification16) with correction for the population structure, both with a statistical test p-value threshold of 1%. The Manhattan and quantile–quantile (QQ) plots were drawn using the R package rMVP (https://github.com/xiaolei-lab/rMVP). Post-GWAS analysis was performed to correct for linkage disequilibrium (LD) using the PLINK algorithm22 with the “clump” function and the following options: clump-kb of 15, which sets the genomic radius (in kilobases) within which SNPs in LD are searched for, and clump-r2 of 0.7, which sets the r-squared threshold used to identify correlated SNPs. All GWAS result files were converted into gwas files (Supplementary Tables S2 to S5) readable in the web application JBrowse17 containing the M. truncatula genome version 5, such as https://Medicago.toulouse.inra.fr/MtrunA17r5.0-ANR/, or in a desktop genome viewer such as the freely available Integrative Genomics Viewer (IGV18, http://software.broadinstitute.org/software/igv/). Over-representation analyses (ORA) of candidate genes were performed using the clusterProfiler package available in R, with a hypergeometric test (p-values) and a Bonferroni correction (q-values)59.
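To make the LD clumping step more concrete, the following Python sketch implements a simplified greedy clumping with the same two parameters (a 15 kb radius and an r-squared threshold of 0.7). It is not the PLINK implementation: significance thresholds for index SNPs are omitted, r-squared is taken as the squared Pearson correlation of genotype codes, and all SNP records are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (e.g. 0/1/2 codes)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0    # monomorphic SNP: correlation undefined, treated as unlinked
    return sxy * sxy / (sxx * syy)

def clump(snps, kb=15, r2=0.7):
    """Greedy LD clumping.

    snps: list of dicts with keys 'id', 'chrom', 'pos' (bp), 'p', 'geno'.
    Returns {index_snp_id: [ids of SNPs clumped with it]}.
    """
    by_pvalue = sorted(snps, key=lambda s: s["p"])   # most significant first
    clumps, assigned = {}, set()
    for index in by_pvalue:
        if index["id"] in assigned:
            continue                                  # already part of a clump
        assigned.add(index["id"])
        members = []
        for other in by_pvalue:
            if other["id"] in assigned or other["chrom"] != index["chrom"]:
                continue
            if abs(other["pos"] - index["pos"]) > kb * 1000:
                continue                              # outside the clumping radius
            if r_squared(index["geno"], other["geno"]) >= r2:
                members.append(other["id"])
                assigned.add(other["id"])
        clumps[index["id"]] = members
    return clumps

# Hypothetical SNPs: 'b' is in LD with 'a'; 'c' is too far away; 'd' is weakly correlated
snps = [
    {"id": "a", "chrom": 1, "pos": 1000, "p": 1e-8, "geno": [0, 0, 2, 2, 0, 2]},
    {"id": "b", "chrom": 1, "pos": 5000, "p": 1e-6, "geno": [0, 0, 2, 2, 0, 2]},
    {"id": "c", "chrom": 1, "pos": 50000, "p": 1e-5, "geno": [0, 0, 2, 2, 0, 2]},
    {"id": "d", "chrom": 1, "pos": 2000, "p": 1e-4, "geno": [0, 2, 0, 2, 0, 2]},
]
print(clump(snps))
```

Only the index SNP of each clump would then be carried forward as a candidate locus.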

RNA-seq analysis in major plant organs

Expression of Medicago transcripts in major plant organs was determined from existing experiments. Sequenced short reads (i.e. raw fastq files) were downloaded from the Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra) for different experiments and different Medicago plant organs: nodule (SRX099057), seed pod (including seeds, SRX099058), 4-week blade (SRX099059), flower (SRX099061), 4-week root (SRX099062), whole root system (SRX2943065, SRX2943064, SRX2943063) and whole shoot system (SRX2943062, SRX2943058). Raw reads were mapped against the Medicago transcriptome version 5 (https://Medicago.toulouse.inra.fr/MtrunA17r5.0-ANR/) and quantified as counts using the Salmon algorithm25. Counts were normalized to the corresponding library sizes (equivalent to counts per million, CPM) and then to transcript lengths (transcripts per million, TPM), and are displayed as TPM in our study.
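The two normalizations mentioned above can be illustrated with a short Python sketch. It uses the common definitions of CPM and TPM; for TPM, counts are first divided by transcript length and the resulting rates are then rescaled to sum to one million. Plain transcript lengths are used here for simplicity, whereas Salmon reports TPM based on effective lengths, and the counts and lengths below are hypothetical.

```python
def cpm(counts):
    """Counts per million: counts scaled by library size."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Hypothetical library with three transcripts
counts = [100, 300, 600]
lengths = [1000, 1000, 2000]
print(cpm(counts))
print(tpm(counts, lengths))
```

In this example the second and third transcripts receive the same TPM although their raw counts differ twofold, because the third transcript is twice as long.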