Genetic dissection of grain iron and zinc, and thousand kernel weight in wheat (Triticum aestivum L.) using genome-wide association study

Genetic biofortification is recognized as a cost-effective and sustainable strategy to reduce micronutrient malnutrition. Genomic regions governing grain iron concentration (GFeC), grain zinc concentration (GZnC), and thousand kernel weight (TKW) were investigated in a set of 280 diverse bread wheat genotypes. The genome-wide association (GWAS) panel was genotyped using 35 K Axiom Array and phenotyped in five environments. The GWAS analysis showed a total of 17 Bonferroni-corrected marker-trait associations (MTAs) in nine chromosomes representing all the three wheat subgenomes. The TKW showed the highest MTAs (7), followed by GZnC (5) and GFeC (5). Furthermore, 14 MTAs were identified with more than 10% phenotypic variation. One stable MTA i.e. AX-95025823 was identified for TKW in both E4 and E5 environments along with pooled data, which is located at 68.9 Mb on 6A chromosome. In silico analysis revealed that the SNPs were located on important putative candidate genes such as Multi antimicrobial extrusion protein, F-box domain, Late embryogenesis abundant protein, LEA-18, Leucine-rich repeat domain superfamily, and C3H4 type zinc finger protein, involved in iron translocation, iron and zinc homeostasis, and grain size modifications. The identified novel MTAs will be validated to estimate their effects in different genetic backgrounds for subsequent use in marker-assisted selection. The identified SNPs will be valuable in the rapid development of biofortified wheat varieties to ameliorate the malnutrition problems.

where t 0.025,DF w = The t-critical value from the t-distribution table with α = 0.025 and DF w is the degrees of freedom within groups from the ANOVA table. MS w = The mean squares within groups from the ANOVA table. n 1 and n 2 = The sample sizes for the first and second comparing samples where σ 2 G = Genetic variance was calculated as (MS treatments -MS residuals )/ nBlock; σ 2 E = Residual variance = MS residual ; nBlock = Number of blocks where MS treatments = Treatment mean sum of square; MS residuals = Error mean sum of square; b = Number of blocks.
The CV indicates the degree of precision with which the treatments are compared and is a good index of the experimental reliability. It expresses the experimental error as percentage of the mean and if the value is high then the precision of the experiment is low and vice versa. The h 2 BS is the proportion of phenotypic variation that is attributable to an overall genetic variation for the genotypes. LSD is the value at a particular level of statistical probability, when exceeded by the difference between two genotypes means, then the two genotypes are said to be distinct for at that or lesser levels of probability. The σ 2 G is the genetic or inherent variation that remains unaltered by environmental changes, this kind of variation responds to the selection during breeding process. In contrast, σ 2 E does not respond to selection as it is non-heritable, which is entirely due to environmental effects.
Genotyping. Genomic DNA of the GWAS panel was extracted from the leaves of 21 days-old seedlings by Cetyl Trimethyl Ammonium Bromide (CTAB) method 53 . The panel was genotyped using Axiom Wheat Breeder's Genotyping Array (Affymetrix, Santa Clara, CA, United States) having 35,143 genome-wide SNPs. The monomorphic, markers with minor allele frequency (MAF) of < 5%, missing data of > 20%, and heterozygote frequency > 25% were removed from the analysis. The remaining set of 14,790 high-quality SNPs was used in GWAS analysis (Supplementary Table S3).
Population Statistics and GWAS. The pair-wise LD values (r 2 ) between the SNPs located in each chromosome were calculated with Trait Analysis by aSSociation Evolution and Linkage (TASSEL) version 5.0 54 . The LD block size in three different subgenomes as well as in the whole genome was calculated by keeping r 2 threshold at half LD decay (Fig. 3). The principal component analysis (PCA) was done through GAPIT 55 to understand the structure of the population and included in the GWAS model to correct the structure. Furthermore, Kinship relationship was calculated through GAPIT 55 and presented in Fig. 2C. Additionally, the structure of the population was evaluated through the STRU CUR E program by keeping K-value from 1 to 10. For every single K-value, 3 independent runs were used and each run was set with 10,000 burn-in iterations followed by 10,000 Markov Chain Monte Carlo (MCMC) replications after burn-in. The STRU CTU RE HARVESTER 56 was used to detect the optimal K-value based on ad-hoc method described by Pritchard et al. 2010 90 as well as Evanno's method 57 .
The suitability of the model to account for population structure was assessed using quantile-quantile (Q-Q) plots.
The phenotypic values of GFeC, GZnC, and TKW of 280 diverse genotypes along with corresponding genotyping data were used in GWAS analysis. Significant MTAs were identified using BLINK (Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway) model 58 implemented in Genome Association and Prediction Integrated Tool (GAPIT) version 3.0 80 in R software package. Determining the correct P-value threshold for statistical significance is critical to differentiate true positives from false positives. To determine the statistical significance threshold in GWAS, Bonferroni correction has been employed. To estimate Bonferroni correction, α was set to 0.05 and which is divided by total number of SNPs. The Bonferroni-corrected SNPs were considered for significant association and R 2 was used to describe the percentage variation explained (PVE) by significant MTAs.
In silico analysis. The sequence information of the significant SNPs was used to search for putative candidate genes with Basic Local Alignment Search Tool (BLAST) using default parameters in the Ensemble Plants database (http:// plants. ensem bl. org/ index. html) of the bread wheat genome (IWGSC (RefSeq v1.0)). The genes found in the overlapping region and within the region of 10 Kb intervals flanking either side of the associated marker were considered as putative candidate genes and their molecular functions were determined. In addition, www.nature.com/scientificreports/ their expression patterns were investigated using the Wheat Expression database (http:// www. wheat-expre ssion. com/) and potential links to phenotypes was determined using Knetminer tool integrated with Wheat Expression database. The role of the identified putative candidate genes in the regulation of GZnC and GFeC, and TKW was also determined with the previous reports.

Results
Variability, heritability, and correlations. The environment-wise heritability and variance components of the GWAS panel for GFeC, GZnC, and TKW are presented in Table 1. The GFeC ranged from 26.3 mg/kg to 49.9 mg/kg, whereas, the GZnC recorded a wider distribution across the environments, as it ranged from 21.3 mg/kg to 64.1 mg/kg. Similarly, TKW ranged from 26.0 gm to 59.3 gm. The trait-wise heritability was recorded highest for TKW followed by GFeC, and GZnC, whereas, the trend for the coefficient of variation (CV) was exactly opposite with the lowest recorded for TKW followed by GFeC, and GZnC. The environment-wise heritability was ranged from 45.4% (E4) to 89.7% (E1), 33.3% (E4) to 84.9% (E5), and 89.9% (E4) to 98.8% (E1) respectively, for GFeC, GZnC, and TKW. For all the three traits, E4 has been recorded as the lowest heritability, which was corroborated with the highest recorded CV for E4. The genotypic variance (σ 2 G ) and environmental variance (σ 2 E ) are presented in Table 1. The trait and environment-wise mean values are illustrated graphically through boxplots and presented in Fig. 1. The location means of GFeC were recorded as similar and highest for E3 and E5 followed by E1, E2, and E4, whereas, E5 was recorded highest pooled mean followed by E2, E1, E4, and E3 for GZnC. The E3 and E1 recorded a similar and highest mean for TKW followed by E4, E2, and E5. The frequency distribution of grain quality traits in the GWAS panel evaluated at E1-E5 during 2020-2021 is presented in Fig. 1. The genotypes in the GWAS panel showed continuous frequency distributions for all the studied traits. Partial correlation coefficient (r 2 ) of GFeC, and GZnC by keeping TKW as a controlling factor was determined. Highly significant and positive correlation was observed between GFeC and GZnC in E1 (0.296**), E2 (0.276**), E3 (0.202**), E4 (0.520**), and E5 (0.35**) and also in pooled data (0.358**).

SNP markers statistics.
The quality processing of 35,143 SNPs from 35 K array resulted in a set of 14,790 cured genome-wide SNPs. These high-quality set of SNPs were further used for GWAS analysis. The chromosome and genome-wise marker distribution are presented in Table 2. The highest number of SNPs were mapped on the B genome (5649) followed by the D genome (4590), and the A genome (4551).
Population structure and linkage disequilibrium. The PCA plot (Fig. 2B) indicated that there were no clear distinct sub-populations in the GWAS panel; however, STRU CTU RE grouped the GWAS panel into eight sub-populations ( Fig. 2A). The LD was estimated by calculating the squared correlation coefficient (r 2 ) for all the SNPs and plotted against the genetic distance (bp). The LD decay for the whole genome was 4.9 cM and it was found that the decay was rapid in the A subgenome (3.6 cM) followed by the B subgenome (5.7 cM) and the D subgenome (5.2 cM) (Fig. 3).

Genome-wide association studies.
A total of 17 Bonferroni-corrected MTAs were identified for GFeC, GZnC, and TKW. The details of the identified MTAs are presented in Table 3 and illustrated in Manhattan plots  Fig. 4A,B. The Q-Q plots depicting the observed associations of SNPs and GFeC, GZnC, and TKW compared to the expected associations after accounting for population structure are presented in Fig. 4A

Identification of putative genes associated with MTAs. The significant SNPs associated with
GFeC, GZnC, and TKW were used to identify the putative candidate genes using the annotated wheat reference sequence (RefSeq V1.0) and are presented in Table 4

Discussion
Understanding the genetic basis of complex traits such as GFeC, GZnC, and TKW through GWAS with a diverse panel of genotypes can significantly improve QTL mapping resolution compared to bi-parental populationsbased QTL mapping. Using the genome-wide SNPs and multi-environment data, several significant SNPs were identified in this study.   www.nature.com/scientificreports/ The expression of GFeC, GZnC, and TKW is significantly affected by the environment and genotype-environment interactions (GEI). Among all traits, GZnC was the most environment-sensitive trait, whereas, TKW was relatively the most stable with minimum environmental influence. The greater magnitude of the environment and GEI have also been reported in previous studies for the expression of GFeC and GZnC 10,11 , and TKW 12,13 . The magnitude of environmental interaction decides the identification of environment-specific QTL(s) as well as QTL(s) that can express stably across environments.
The highest heritability was recorded for TKW followed by GFeC, and GZnC, whereas, the trend for the coefficient of variation (CV) was exactly opposite with the lowest CV recorded for TKW and the highest CV for GZnC. The highest and lowest heritability for TKW and GZnC respectively is also concurred with earlier studies 46,72 . The associations were highly significant positive in all the environments betweenGFeC and GZnC. Significant and positive correlations found in this study have also been reported in earlier studies 25,26 . The significant positive correlations between GZnC and GFeC indicated the possibility to map the genomic regions controlling multiple traits. Such co-mapped SNPs will be much useful in marker-assisted selection for simultaneous improvement of correlated traits.
The STRU CTU RE model explained 8 sub-groups in the populations. The genotypes in GWAS panel consists of advanced breeding lines suitable for various agro-climatic and production conditions. The first subgroup consists of genotypes mostly selected from international breeding material and suited for North West and North East Plains Zone in India. Similarly, the second group consists of international selections for restricted irrigated or rainfed production conditions. The third subgroup consists of genotypes dominated by 1B.1R translocation with genes for wider adaptation. Subpopulation 4 is mainly dominated by GW322, PASTOR, and OPATA parentage, whereas, 5th subpopulation largely consists of Indian wheat varieties/germplasm in their parentage. High frequency of SOKOLL, KIRITATI, PBW65, and MILAN was present in the 6th subpopulation parentage. Genotypes in 7th subpopulation are dominated by old salinity/alkalinity tolerant varieties. Whereas, 8th subpopulation contains mainly indigenous germplasm, old landraces, and breeding lines. The PC1, PC2 and PC3 of PCA analysis were used as covariates in the GWAS analysis to identify the MTAs. The LD may vary in different populations due to population size, genetic drift, admixtures, selection, mutation, non-random mating, pollination behavior, and recombination frequency 73,74 . The LD blocks are usually larger in self-pollinated crops such as wheat and hence decay slowly 75 , whereas, in outcrossing crop species like maize 76 , the LD decays rapidly. The presence of high LD across the genome would reduce the QTL mapping resolution and vice versa 77 . In such cases, a better QTL resolution will be achieved by using genome-wide SNPs. The decay of LD was found comparable in the B and D subgenomes (~ 5 cM) compared to the A subgenome, which had a shorter decay distance of around ~ 3 cM. A similar pattern of LD decay was also observed in other GWAS studies in wheat 49,78,79,91 .
A total of seven MTAs in different environments were identified covering the three subgenomes and all were major MTAs as they explained more than 10.0% phenotypic variation. The TKW was relatively the most stable trait compared to the rest of the other two traits, as TKW recorded the highest heritability and lowest coefficient of variation which reflected in detecting the highest number of MTAs as well. All the identified MTAs were mapped on 5A, 6A, 7B, 5B, 2D, 7A, and 5D located at 444. 8 48 and 2D, 5D, 7A and 7B 91 .
The various putative candidate genes underlying MTAs with high phenotypic variation for GZnC, GFeC, and TKW were identified through BLAST search (Table 4 and Supplementary Table 2). The MTAs identified in various chromosomes were located in gene coding regions related to transcription factors, transporters, transmembrane protein and kinase-like superfamilies. For example, Multi antimicrobial extrusion protein (TraesCS3B02G562500) has a role in the translocation of iron during iron deficiency stress in bread wheat 59 85 . Also, two MATE proteins namely, GmFRD3a and GmFRD3b play a significant role in iron efficiency in soybean 86 . Cloning and characterization of an Arabidopsis gene i.e. FRD3, a member of the multidrug and toxin efflux family is involved in iron homeostasis 81 . The FRD3, which is an efflux transporter of the efficient Fe chelator citrate is involved in Fe homeostasis maintenance throughout plant growth and development. Additionally to its well-known root expression, FRD3 is also strongly expressed in seeds and flowers 82 .
One SNP i.e. AX-94699865 associated with GFeC encodes an important F-box domain (TraesCS7B02G312400) regulates STOP1 in Arabidopsis. STOP1-ALMT1 pathway promotes iron accumulation into the apoplast of root tip regions under Pi-deficient conditions 66,67 . Another SNP i.e. AX-94524014 (TraesCS5B02G257700) associated with GZnC was found to encode LEA protein, where LEA-18 was involved in the transportation of iron in the phloem of castor 68 . The binding of LEA proteins to different molecules like Zn ion, DNA and ATP binding, were the major activities for the action of upland LEA proteins 69 .
The SNP i.e. AX-95235178 encoding Leucine-rich repeat domain superfamily (TraesCS1A02G309000) was associated with TKW. A total of 32 barley orthologs were identified as potential candidate genes that determine barley grain size or weight. The barley ortholog of the rice OsBDG1 gene is mapped on 3H chromosome at 666.35 Mb (HORVU3Hr1G104350), which encodes the leucine-rich repeat receptor-like protein kinase family 70 . The rice OsBDG1 gene encoding a small protein with short leucine-rich-repeats possessing cell elongation activity, has previously been proven to positively regulate grain size in rice 71 . Therefore, HORVU3Hr1G104350could be a reliable candidate gene affecting grain size as the function of the OsBDG1 gene.
Another grain weight controlling gene i.e. FASCIATED EAR2 (FEA2) encodes the maize ortholog of CLAV-ATA2 (CLV2), encoding a leucine-rich repeat receptor-like protein that regulates meristem size by transmitting signals from CLAVATA3 (CLV3) peptide ligand to the WUSCHEL (WUS) homeodomain transcription factor. The FEA2 has a role in total kernel number and kernel size in maize 87 . Similarly, IKU pathway represents one of the well-studied genetic networks involves four major genes including HAIKU2 (IKU2), which encodes a leucinerich repeat kinase, mutational analyses of these genes in Arabidopsis revealed their physiological significance in controlling endosperm development and thereby seed size through regulating endosperm proliferation and cellularization 88 , and loss of function mutations in IKU pathway genes cause a decrease in seed size. Another SNP (AX-95117294) encoding C3H4 type zinc finger protein (TraesCS5D02G188300) was associated with the expression of TKW. Functional prediction of maize C2H2-zinc finger gene revealed its involvement mainly in the formation of important agronomic traits in maize yield 89 .
The study with 280 diverse set of bread wheat GWAS panel has shown that GFeC, GZnC, and TKW were quantitatively inherited traits. The strong positive correlation between the GFeC and GZnC suggested the possibility of improving both the traits simultaneously. A total of 17 MTAs including 5 for GFeC, 5 for GZnC, and 7 for TKW were identified from the GWAS approach. The environment-specific and pooled-data MTAs identified www.nature.com/scientificreports/ in the present investigation represented novel genomic regions associated with trait expression. Several putative candidate genes encoding important molecular functions such as iron translocation, iron and zinc homeostasis, and grain size modifications were associated with the identified MTAs. Further validation and functional characterization of the candidate genes to elucidate the role of these genes in wheat is envisaged. The identified SNPs could be useful in marker-assisted selection programs to develop biofortified varieties to reduce micronutrient malnutrition.
Declaration. The set of 280 genotypes used in the present experiment were selected from All India Coordinated Research Project on Wheat and Barley and the imported genotypes have been obtained through the nodal agency for germplasm exchange i.e. National Bureau of Plant Genetic Resources, New Delhi following the prescribed guidelines. Also, the authors have all the required permissions and rights to collect and use the genotypes for research purpose. The experimental research and field experiments in the present study are duly approved by the institute research council of ICAR-Indian Institute of Wheat and Barley Research, Karnal.

Data availability
All the phenotypic and genotypic data used in the study is given as Supplementary