Genome-wide association studies of grain quality traits in maize

High quality is the main goal of today’s maize breeding and the investigation of grain quality traits would help to breed high-quality varieties in maize. In this study, genome-wide association studies in a set of 248 diverse inbred lines were performed with 83,057 single nucleotide polymorphisms (SNPs), and five grain quality traits were investigated in diverse environments for two years. The results showed that maize inbred lines showed substantial natural variations of grain quality and these traits showed high broad-sense heritability. A total of 49 SNPs were found to be significantly associated with grain quality traits. Among these SNPs, four co-localized sites were commonly detected by multiple traits. The candidate genes which were searched for can be classified into 11 biological processes, 13 cellular components, and 6 molecular functions. Finally, we found 29 grain quality-related genes. These genes and the SNPs identified in the study would offer essential information for high-quality varieties breeding programs in maize.

Maize (Zea mays L.) has become one of the most important crops globally for food, feed, and fuel since it appeared and spread widely 1 . In the past few years, more and more people paid attention to the quality of maize grain due to the rapid development of animal husbandry and processing industry. However, the nutritional quality of maize grain remains poor, especially the deficiency of lysine in the maize grain, which can not meet the nutritional and health requirements of people 2 . Thus, the genetic enhancement of nutritional quality in maize grains is essential to increase the nutritional value and conduct high-quality maize breeding 2 .
The results of genetic studies have indicated that variations in nutritional quality in maize grain characterize quantitative traits. Over the past two decades, genetic dissection of nutritional quality in maize kernels by classical QTL mapping has resulted in the identification of numerous nutritional quality QTLs. Mangolin et al. 3 detected 13 QTLs by QTL mapping of maize kernel oil content in F 2:3 population. Liu et al. 4 detected seven QTLs associated with protein content, six QTLs associated with starch content, and five QTLs associated with oil content using F 2:3 population and BC 2 F 2 population. Wang et al. 5 detected 38 QTLs for maize grain quality traits using three RIL populations in three environments. To date, many genes related to maize proteins have been cloned, such as opaque1 (o1), floury4 (fl4) and Mucronate (Mc) [6][7][8] . Some genes such as linoleic acid1 (ln1), Oleic acid content1 (olc1), fatty acid desaturation 2 (fad2), and fad6 have been reported to influence oil content in maize [9][10][11] . A few starch content-related genes such as Shrunken1 (Sh1), Sh2 and Brittle2 (Bt2) have been identified 12,13 .
Genome-wide association study (GWAS) is becoming a powerful tool to address interspecies genotype-phenotype association based on the development of next-generation sequencing technology. In maize, GWAS has made a significant progress in the past decade. For example, Li et al. 14 used GWAS to dissect the genetic architecture of oil biosynthesis in maize kernels. Luo et al. 15 used GWAS to detect 57 loci significantly associated with salt tolerance, and 49 candidate genes from these loci. It can be seen that GWAS is widely used in maize. However, there is still a huge problem about how to obtain phenotypic data accurately and quickly. Traditionally, phenotyping method for grain quality, such as chemical method, is not only laborious and time-consuming, but also damages the integrity of maize kernels. By contrast, Near Infrared Reflectance Spectroscopy (NIRS) is a fast, reliable, and non-destructive method. NIRS has been increasingly used in plant phenotyping measurements, such as maize kernel starch content 16 and wheat protein content 17 . Therefore, NIRS can fully measure maize grain nutritional quality.
Although some potential grain quality genes and QTLs have been identified in maize, the genetic studies of grain quality are limited. In this study, we used NIRS to measure the main nutritional quality traits of 248 maize inbred lines and used 83,057 single nucleotide polymorphism (SNPs) markers to conduct GWAS. Our study was designed to accomplish the following objectives: (1) perform GWAS to identify SNPs responsible for moisture, www.nature.com/scientificreports/ Fig. 2). For moisture content, three SNPs were located on chromosomes 1, 3, and 9, which individually explained 10.53%-23.16% of the phenotypic variation. For protein content, seven SNPs were located on chromosomes 1, 2, 3, and 4, which individually explained 5.45%-32.79% of the phenotypic variation. For oil content, 21 SNPs were located on chromosomes 1, 3, 4, 5, 6, 7, 8, 9, and 10, which individually explained the phenotypic variation of 9.04-31.24%. For starch content, eight SNPs were located on chromosomes 1, 3, 4, and 6, which individually explained the phenotypic variation of 3.77-23.61%. For lysine content, ten SNPs were located on chromosomes 2, 3, 4, 5, 8, and 9, which individually accounted for 5.71%-32.42% of the phenotypic variation.
Comparing the localization results, we found eight co-localized sites and the physical position between SNPs was not more than 200 Kb (Table 4). Notably, four co-localized sites between different traits were detected. 1_60098266-60,098,312 was associated with oil content, starch content and protein content, which explained 9.60%, 16.70%, 16.64% and 13.48% of the phenotypic variation, respectively. 3_1462112-1,462,147 was associated with starch content and protein content, which explained 3.77% and 5.45% of the phenotypic variation, respectively. 3_133182128-133,206,096 was associated with oil content and starch content, which accounted for 22.52%, 4.03% and 4.16% of the phenotypic variation, respectively. 9_97538609-97,538,609 was associated with moisture content and oil content, which explained up to 23.16% and 17.51% of the phenotypic variation, respectively.
Candidate genes associated with significant SNPs. Previous study have shown that the correlation coefficient (r 2 ) between SNP markers is less than 0.1, which is considered to be no correlation 18 . Therefore, we choose r 2 = 0.1 as the LD decay distance. Candidate genes were predicted based on LD decay (r 2 = 0.1) in the MaizeGDB genome browser. A total of 208 candidate genes were found and detailed descriptions were summarized in Table S1.
The candidate genes can be classified into 11 biological processes, 13 cellular components, and six molecular functions. The number of candidate genes involved in the grain quality traits of moisture, protein, oil, starch and lysine contents was 77, 46, 103, 136 and 49, respectively. Among them, the candidate genes in biological processes were mainly concentrated in the cellular process and the metabolic process; the candidate genes in cellular component were mainly concentrated in organelle, cell and cell part; and the candidate genes in molecular function were mainly concentrated in catalytic activity and binding (Fig. 3). As for the KEGG analysis of the candidate genes, a total of 12 pathways were identified. These pathways included the biosynthesis of secondary metabolites, and that of amino acids, starch and sucrose metabolism, inositol phosphate metabolism, and phosphatidylinositol signaling system, which could be related to grain quality (Fig. 4). In addition, a protein classification analysis tool was used to classify candidate gene proteins, 46 of which matched the PANTHER database. A further analysis showed that these 46 proteins fell into nine categories (Fig. 5), which contained www.nature.com/scientificreports/ the largest number of proteins-metabolite interconversion enzyme (PC00262). Furthermore, we identified 29 candidate genes to be associated with grain quality (Table 5). Annotation information suggested that these candidate genes may control multiple traits during maize growth and development.

Discussion
Genetic basis of grain quality traits. Maize grain quality traits are complex quantitative traits, controlled by main effect genes and lots of micro effect genes. In this study, there was a wide variety of grain quality traits in the natural population, which were normally distributed. To reduce the influence of the environment on the genotype, phenotypic BLUP values across four environments were used for association studies. Phenotypic correlations were observed among the five grain quality traits. For instance, oil content had significant positive correlation with protein content and significant negative correlation with starch content, which is consistent with previous results 4, 19 . Meanwhile, starch content had significant positive correlation with protein content and lysine content, which is consistent with previous studies 20, 21 . Moreover, all of the five grain quality traits had higher broad-sense heritability. Among them, the heritability for protein content, oil content, starch content and lysine content was higher than that in previous studies 21 . The above results indicated a stable genetic association among these grain quality traits of maize. It is well known that there are hard choices between yield and quality. Previous studies showed negative correlation between quality traits and yield 22 . Therefore, how to carry out quality breeding while continuing to improve the yield of maize will be a new subject for maize breeders in the twenty-first century. At present, maize www.nature.com/scientificreports/  www.nature.com/scientificreports/ quality breeding is mainly to increase protein content and improve the composition of base acids, especially to increase the content of essential amino acids such as lysine and tryptophan 23 . In this study, a total of 83,057 SNP markers were used to scan the whole genome, combined with moisture, protein, starch, oil, lysine content and other phenotypic traits and genotypes for association analysis. The purpose of this study was to find the main genes to control the quality traits of maize, and then to introduce the genes into the parents of maize high-yield  www.nature.com/scientificreports/ hybrids from molecular level, and then to obtain high-yield and high-quality hybrids and to provide a theoretical basis for genetic improvement of the quality traits of maize.
Significant SNPs for grain quality traits. Nowadays many researchers at home and abroad apply linkage analysis to locating the QTL of regulating grain quality traits, but few GWAS studies are on grain quality traits. In addition, the QTL detected by linkage analysis and association analysis have consistency in position 24 . The GWAS analysis is performed with a Bonferroni correction, however this was found to be too strict for less significant trait associations. In order to better detect micro-effect polygenes and identify genetic sites, we reduced the significance threshold to -log 10 (P) = 4 for all traits 25 . In this study, a total of 49 SNPs significantly associated with grain quality were detected, and the phenotypic variation explained (PVE) value by a single SNP ranged from 3.77% (3_1462112 and 3_1462147 of starch content) to 32.79% (2_43189825 of protein content). In addition, four co-localized SNPs were detected by multiple traits and single phenotypic variation explained value over 3.77%, indicating that starch content, protein content, oil content, and moisture content are interrelated in the components of corn kernels. The SNPs detected in this study were compared with previous studies, and some SNPs were found to be located in the localized QTL confidence interval. Among them, five SNPs were located in the QTL interval with Zhang et al. 26 , where 1_190758142 for oil content was located on chromosomes 1 (bnlg2086-umc1122 interval), 4_85966198 for oil content was located on chromosomes 4 (phi096-bnlg1755 interval), co-localized site 1_60098266-60,098,312 for protein content and starch content was located on chromosomes 1 (phi001-umc1988 interval). Two SNPs were located in the QTL interval with Wang et al. 27 , where 1_19252635 for oil content was located on chromosome 1 (umc1685-umc1044 interval), and 1_190758142 for oil content was located on chromosome 1 (umc1395-umc2237 interval). In addition, 2_169733611 for lysine content was located on chromosome 2 (bnlg1138-umc1065 interval) of study by Zhang et al. 2 . Moreover, three SNPs were located in the QTL interval with Yang et al. 28 , where 1_190758142 for oil content was located on chromosome 1 (umc1590-bnlg1556 interval), 3_213791486 for protein content were located on chromosome 3 (umc2275-umc1594 interval), 10_127895301 for oil content was located on chromosome 10 (umc1272-bnlg1839 interval).
However, some SNPs were not found in previous studies. There are several reasons for these differences. First, the population in our investigation mightnot be different enough for grain quality. Second, different estimating methods also caused the variations. Third, these SNPs were newly discovered and needed testing further. All in all, the results of this study can serve as a reference for other studies.
Putative genes and pathways involved in grain quality. In this study, a total of 208 candidate genes were searched, of which 17 possible candidate genes for grain quality traits were predicted.
For moisture content, three candidate genes were detected. GRMZM2G069024 encoded Beta-glucosidase 11, an important component of the cellulase system 29 . Previous studies reported that dehydration rapidly induced polymerization of AtBG1, a beta-glucosidase 30 . GRMZM2G032852 encoded putative calcium-dependent protein kinase family protein and GRMZM2G321041 encoded putative RING zinc finger domain superfamily protein. www.nature.com/scientificreports/ The two proteins emerged as key proteins in response to drought stresses in plants 31 . The three enzymes are closely related to moisture content, therefore, three candidate genes may affect moisture content by influencing enzymes activities and genes expression levels.For protein content, three candidate genes were detected. One of the candidate genes (GRMZM2G047129) encoded alpha-L-fucosidase 2. Alpha-L-fucosidase has been reported in only a few plants, such as Arabidopsis ,32 and pea 33 . It was reported that alpha-L-fucosidase can hydrolyze fucose residues from glycoproteins 34 , thus this candidate gene may affect protein content. Another candidate gene (GRMZM2G466833) encoded malate dehydrogenase 3 (MDH), that is, one of the key enzymes to synthesize malic acid. MDH played a key role in many physiological metabolic pathways, such as C4 pathway, crassulacean acid metabolism, gluconeogenesis, tricarboxylic acid cycle and photosynthesis, linking the metabolism of sugars, proteins and lipids in the body 35 . Therefore, this candidate gene may affect protein content by influencing the metabolism of proteins. Finally, GRMZM2G071714 encoded lipoyl synthase (LS) that analyzes the final step of lipoyl cofactor biosynthesis 36 . Protein lipoylation was denovo lipoylation pathway in plastids, and two octanoyltransferases and one LS provided protein lipoylation autonomy to plastids of Arabidopsis 37 . Therefore, this candidate gene may affect protein content by influencing the protein lipoylation. For oil content, three candidate genes were identified. GRMZM2G433942 encoded palmitoyltransferase ZDHHC9. Serine palmitoyltransferase (SPT) is the key enzyme of sphingolipids biosynthesis, and sphingolipids are essential components of plant cells 38 . As can be seen, it has a certain effect on the oil content of maize. GRMZM2G134308 encoded putative beta-14-xylosyltransferase IRX10L. Xylose is a kind of glycosyl component widely found in plants. Plant glycosyltransferases are enzymes that are closely related to the metabolism of glycolipids, polysaccharides, glycoproteins, nucleic acids, plant secondary products, and so on 39 . Therefore, this candidate gene may affect oil content by influencing beta-14-xylosyltransferase activities and genes expression levels. GRMZM2G033544 encoded cyclopropane-fatty-acyl-phospholipid synthase that is synonymous with cyclopropane fatty acid (CFA) synthase. CFA is an important membrane fatty acid in the stress-resistant mechanism and the presence of CFA can enhance membrane rigidity 40 . CFA synthase is a key enzyme regulating synthetic CFA. Therefore, this candidate gene may affect oil content by influencing CFA synthase activities and genes expression levels.For starch content, five candidate genes were identified. Three of the candidate genes (GRMZM2G175218, GRMZM2G082034 and GRMZM2G347708) encoded beta-amylase, which was directly involved in the synthesis of starch and metabolic process of polysaccharides 41 . Two of the candidate genes (GRMZM2G000520 and GRMZM2G404453) encoded ethylene-responsive transcription factor (ERF), a member of a transcription factor family involved in plant growth and environmental stress responses 42 . In addition, ethylene has been shown to affect starch biosynthesis by influencing enzymes activities and genes expression levels involved in starch synthesis in maize 43 .
For lysine content, three candidate genes were detected. GRMZM2G050570 encoded threonine synthase which catalyzed the terminal reaction in the biosynthetic pathway of threonine 44 . From a nutritional point of view, lysine and threonine were essential amino acids in maize, and GRMZM2G050570 may indirectly affect lysine content by influencing threonine synthase activities and genes expression levels.. GRMZM2G129209 encoded omega-3 fatty acid desaturase which was a key enzyme for α-linolenic acid (ALA) biosynthesis 45 . Moreover, previous studies demonstrated that ALA was a crucial component in storing lipids in plants 46 . Therefore, the gene may regulate grain lysine content indirectly by regulating lipid synthesis. Finally, GRMZM2G076307 encoded glycosyltransferases (GTs), which belong to a multi-member genes family. According to previous studies, GTs played a very important role in the growth and development of plants, such as regulating plant hormone levels, participating in the synthesis, modification and transportation of secondary metabolites in plants, and participating in plant defense reactions 47,48 . Therefore, the gene may regulate grain lysine content indirectly by regulating plant hormone and secondary metabolites synthesis.
All in all, these candidate genes are closely related to grain quality and future work will include functional validation of these genes and illustrate the molecular mechanisms for controlling grain quality in maize plants.

Materials and methods
Association mapping panel and genotyping. The association panel consisted of 248 diverse lines, including some excellent backbone inbred lines in China and some high-quality inbred lines introduced from abroad. Details on 248 of these lines could be found in previous studies 20 . The DNA of all the maize inbred lines were extracted using CTAB method 49 and genotyped using Genotyping-By-sequencing (GBS) method 50 . The methods of SNP filtering and calculating linkage disequilibrium (LD) decay were described in previous studies 51 . A total of 83,057 SNPs were used and the LD decay was 120 kb (r 2 = 0.1) in this study.
Field experiments and phenotyping investigation. All 248 maize inbred lines of the association panel were planted in four environments, that is, Baoding in Hebei Province in 2016 and 2017, and Shijiazhuang in Hebei Province in 2016 and 2017. In each environment, all the maize inbred lines were planted in a single row plot using a randomized block design with two replications. Each experimental plot consisted of a row length of 3 m and 0.6 m between adjacent rows. After maturity, all corns except the head and tail of each row were harvested. Perten DA7200 Near Infrared Grain Analyzer was applied to determinate the moisture, protein, oil, starch and lysine contents of maize. Each material was repeatedly measured two times.
Phenotype statistical analysis. The IBM SPSS 21.0 software was used to make the descriptive statistical analysis and the analysis of variance (ANOVA). The broad-sense heritability ( h 2 ) for each trait was estimated as h 2 = σ 2 g / σ 2 g + σ 2 gy /r + σ 2 e /yr 52 , where σ 2 g , σ 2 gy and σ 2 e are genetic variance, genotype-by-environment interaction variance and error variance, respectively, y is the number of environments, and r is the number of replica- www.nature.com/scientificreports/ tions. The "PerformanceAnalytics" package in the R software was used to perform correlations analysis. For each trait, BLUP value was evaluated by using the following mixed linear model in the "lme4" package of the R software 53 : Y = (1|rep%in%env) + (1|env) + (1|lines) + (1|env:lines), where Y stands for trait data, the parentheses indicate random effects, "1|" means groups, ":" means interactions, "lines" means all materials and "env" means the environment.
Genome-wide association study of grain quality traits. The BLUP value and 83,057 SNPs were used to conduct the GWAS by using FarmCPU model 54 implemented in the GAPIT package in the R software 55 , with both K and Q matrix taken into account.

Statement.
Experimental research and field studies complies with relevant institutional, national, and international guidelines and legislation.