Introduction

It is crucial to dissect the genetic architecture of the complex traits for understanding the relationship between DNA and these traits and for benefiting genomic selection (GS) or whole-genome prediction (WGP). Genome-wide association study (GWAS) has been widely used for the identification of single nucleotide polymorphisms (SNPs), or other DNA markers, that are significantly associated with the traits of interest. It is common in GWAS that each individual trait-associated SNP (TAS) explains a very small percent of the phenotypic variation, whereas the aggregate of the effects of all associated SNPs can account for 15–45% trait variability (Manolio et al. 2009). The so-called lost trait heritability is likely ascribed to a large number of loci with small effects, many of which cannot be detected with conventional statistical approaches. To consider the loci with minor but nontrivial effects, the linear mixed models (LMMs) with a polygenic effect have been proposed to use the entire set of genomic markers, yielding increased trait heritability in GWAS or improved trait predictability in WGP (VanRaden 2008; Yang et al. 2010). In the LMMs, the genetic contributions, denoted by regression coefficients, of all the markers are treated as random effects and are assumed to follow the same distribution (Speed et al. 2012; Yang et al. 2010). Recent studies in human and crops have shown that trait-associated loci are enriched in regulatory regions, compared with the protein-coding exon regions (Hindorff et al. 2009; Maurano et al. 2012; Schork et al. 2013). Moreover, important genetic matter, such as miRNAs and long noncoding RNAs (lncRNAs), was found to be produced in the noncoding genomic regions, suggesting a biological importance of nongenic regions in genomes (Edwards et al. 2013). Maurano et al. (2012) reported that about 93% of detected variants are located in noncoding regions, and 57.1% of these nongenic variants are positioned in DNaseI hypersensitivity sites (DHSs)—an essential gene-regulatory region. This is in agreement with other studies which also suggested that trait-associated loci tend to cluster in gene-regulatory regions, including promoter regions and transcriptional enhance elements (Hindorff et al. 2009; Schork et al. 2013). The accumulation of evidences suggested that regulatory regions in genomes generally play a more important role in dictating the phenotypes than genic regions. To compare the roles of various types of the genomic regions, analysis of heritability (analogy to ANOVA) may be performed to compare the contributions from these genomic partitions and allow for distinct distributions when LMM is used to model DNA makers in different genomic regions. For example, the functional enrichment assay showed that DHS regions account for the most heritability in human genomes (Gusev et al. 2014). Similarly, a Maize study revealed that most trait-associated variants lie in nongenic regions, which are responsible for the regulations over gene expression (Li et al. 2012). This has not been proved in rice genomes.

Rice is the most widely consumed staple food for a large part of the world population (Huang et al. 2010); therefore, understanding the roles of different parts of rice genomes will advance our knowledge of the genetic architecture of important traits and facilitate genetic improvement in breeding. Rice is an ideal model for the analysis of heritability in functional partitions of the genomes because of the following reasons: (1) the size of rice genomes is relatively small (~400 MB), (2) the high-quality reference genome with functional annotations is available (Sasaki 2005), (3) millions of genomic markers can be genotyped with cost-effective sequencing technology, (4) agronomic traits have been accurately gauged by repeated measurements in multiple projects, and (5) the major genetic determinants of agronomic traits have already been characterized (Huang et al. 2010, 2016, 2012; Yano et al. 2016). This is the first analysis of heritability on rice genomes, where we investigated five agronomic traits, i.e., yield (YD), heading date (HD), plant height (PH), grain length (GL), and grain width (GW), using a diverse population including 524 lines with 3 million SNPs. The results of the analysis of heritability based on all SNP markers showed that nongenic regions consistently account for more genetic variation in five traits than genic regions. Similar comparisons between nongenic regions and genic regions were also performed by only using the SNPs identified by GWAS, yielding the same conclusion as the analysis of heritability. We also compared the predictions of traits using selected SNPs from genic and nongenic regions, respectively, which showed that nongenetic regions had better predictabilities than genic regions. We conclude that DNA variants in nongenic regions play a more important role than those in genic regions in determining the agronomic traits of rice. Future in-depth research in regulatory regions of rice genomes will uncover the mysteries of genetics behind these traits and eventually benefit genetic improvement in breeding.

Materials and methods

Rice population and sequencing data

In the study, we considered a diverse rice collection of 533 accessions, including 200 varieties from a core/minicore collection of O. sativa L. in China (Zhang et al. 2011), 132 parental lines used in the international Rice Molecular Breeding Program (Yu et al. 2003), 148 lines from a minicore subset of the US Department of Agriculture rice gene bank (Yan et al. 2009), 18 lines used for SNP discovery in the OryzaSNP project (McNally et al. 2009), and 35 lines from the Rice Germplasm Center at the International Rice Research Institute, which represent landraces and elite varieties. A total of 524 lines have been selected for the study, since the record of their phenotype and genetic variants is available. Five agnomically important traits, i.e., YD, HD, PH, GL, and GW, were considered in the study.

About 6.7 billion 90-bp paired-end reads have been generated using the Illumina HiSeq 2000 platform (Chen et al. 2014). Missing genotypes have been imputed using the genotype data of another set of 950 rice lines from Huang et al. (2012). Aligned with the rice reference genome (Nipponbare, MSU version 6.1), a total of 6,551,358 high-quality SNPs were obtained, with the minor alleles being present in at least five accessions. To simplify the computation, the SNP markers with missing data were eliminated, leaving a total of 3,616,597 SNPs in the analysis. The SnpEff software developed by Cingolani et al. (2012) was utilized for SNP annotation.

Genome partitioning and analysis of heritability

Following Speed et al's. study (Speed and Balding 2014), an LMM is used to decompose the genetic variances for a trait, i.e.,

$${\mathrm{y}} = X\beta + \mathop {\sum}\nolimits_i^M {g_i} + \varepsilon ,$$
(1)

where y is an n × 1 vector representing phenotypic values with n equal to the sample size, β is the fixed effect, X is the design matrix for fixed effects (such as locations, years, etc.), gi represents the polygenic effect of ith genomic partition type, and ε denotes the residuals which are normally distributed, i.e., N(0,Iσ2), with I being the identity matrix and σ2 being the variance of the residual. In this study, we sorted the rice genomes into four types of partitions, including regulatory regions, intron regions, exon regions, and intergenic regions (M = 4 in Eq. (1)). Different distributions are assumed for the polygenic effects in different genomic partition types, i.e., \(g_i\sim N\left( {0,K_i\phi _i^2} \right)\), where Ki is the kinship matrix calculated using the SNPs in the ith genomic partition type and \(\phi _i^2\) is the genetic variance shared by the SNPs in this genomic partition type (Gusev et al. 2014). The heritability from Eq. (1) is calculated as the ratio of the sum of the total genetic variance to phenotypic variance, i.e., \(h^2 = \mathop {\sum}\nolimits_{i = 1}^{M = 4} {\phi _i^2{\mathrm{/}}\left( {\mathop {\sum}\nolimits_i {\phi _i^2 + \sigma ^2} } \right)}\). The restricted maximum likelihood method may be used to estimate the variance parameters \(\phi _1^2,\,\phi _2^2,\,\phi _3^2,\,\phi _4^2\), and σ2, which was similar to our previous studies (Wei et al. 2018).

The LDAK software was used for the analysis of heritability, including the estimation of genetic variance and heritability for each genomic partition type and the calculation of standard errors for the estimated parameters (Speed and Balding 2014). The enrichment score for each genomic partition type was calculated as the ratio of the percentage of the heritability explained by the SNPs in this type of genomic partitions to the percentage of SNPs falling in this partition type (the expected percentage of the heritability by these SNPs). The z-statistic was adopted to test the significance level of enrichment for each genomic partition type based on Gusev et al. (2014). The z-statistic was calculated as the difference between the percentage of \(h_{gi}^2\) and the percentage of SNPs in the ith category divided by the analytical standard error (SE).

In addition, a simulation-based test was applied to evaluate the significance level of enrichment by contrasting the observed likelihood ratio test (LRT) statistic with the empirical NULL distribution of LRT for a calculation of a p value. The observed LRT is calculated using the below formula:

$$LRT = - 2\left( {L_0\left( {\widehat \beta ,\widehat \phi ^2,\widehat \sigma ^2} \right) - L_1\left( {\beta ,\widehat \phi _1^2,\widehat \phi _2^2,\widehat \phi _3^2,\widehat \phi _4^2,\widehat \sigma ^2} \right)} \right),$$
(2)

where ϕ2 in the NULL hypothesis (L0) is assumed to be identical across four genomic partitions. For any NULL model, we randomly partitioned the entire set of SNPs into four categories of the same sizes as the originally defined functional partition types, followed by the calculation of a LRT of the NULL model using exactly the same steps by which the observed LRT was calculated. This process was repeated 100 times to generate a NULL distribution of the LRT, of which the 5th percentile was used for testing the NULL hypothesis that various partition types equally contribute to the phenotypic variability.

Genome-wide association study

The GWAS method with LMM, shown as follows, was used for the identification of trait-associated loci (Yu et al. 2006; Zhang et al. 2005),

$$y = X\beta + Z_ku_k + \xi + \varepsilon ,$$
(3)

where Zk is the genotype of the kth SNP, uk is the fixed effect for the kth SNP, and ξ is a random effect to control the polygenic background with a multivariate normal distribution N(0,Kϕ2). Note that the kinship matrix K in the GWAS analysis was calculated using the entire set of SNPs for simplicity; therefore, K did not vary when different SNPs were analyzed using the GWAS model. Other elements in Eq. (3) remain the same as Eq. (1). The LRT was used to test each SNP marker, i.e.,

$$LRT = - 2\left( {L_0\left( {\widehat \beta ,\widehat \phi ^2,\sigma ^2} \right) - L_1\left( {\widehat \beta ,u_{k,}\widehat \phi ^2,\sigma ^2} \right)} \right),$$
(4)

where L0 denotes the likelihood of the Null model (without the estimated effect of the SNP marker \(\widehat u_k\)) and L1 denotes the likelihood of the model under evaluation (with \(\widehat u_k\)). Eigen decomposition was used for handling the algebra that involved the kinship matrix to increase the computational efficiency (Kang et al. 2008).

Genomic best linear unbiased prediction (GBLUP)

BLUP has been widely employed to predict breeding values in animals, when pedigree data are available for the calculation of kinship (VanRaden 2008). The kinship can be deduced using whole-genome variant data, yielding a more powerful method, i.e., GBLUP, for the prediction of breeding values in animals and plants. We use subscripts (0) and (1) to represent the training set and the validation set, respectively, in GBLUP. The estimated breeding value for the validation set \(\widehat g_{(1)}\) can be described as

$$\widehat g_{\left( {\mathrm{1}} \right)} = \mathop {\sum}\nolimits_{i = 1}^M {\widehat g_{{\mathrm{i}}\left( {\mathrm{1}} \right)}} = \mathop {\sum}\nolimits_{i = 1}^M {K_{{\mathrm{i}}\left( {{\mathrm{10}}} \right)}\widehat \phi _i^2\left( {\mathop {\sum}\nolimits_{i = 1}^M {K_{{\mathrm{i}}\left( {{\mathrm{00}}} \right)}\widehat \phi _{\mathrm{i}}^{\mathrm{2}}} {\mathrm{ + I}}\widehat \sigma ^2} \right)^{ - 1}\left( {y_{\left( 0 \right)} - X_{\left( 0 \right)}\widehat \beta } \right)} ,$$
(5)

where \(\widehat g_{i(1)}\) is the estimated breeding component for the ith genomic partition type, Ki(10) represents the genetic covariance between the validation set and training set when the SNPs in the ith genomic partition type are considered, and \({\sum} {K_{i\left( {00} \right)}\widehat \phi _i^2 + I\widehat \sigma ^2}\) is the Var (y(0)) for the training set. The parameters, including \(\widehat \phi _1^2, \ldots ,\widehat \phi _M^2,\,\widehat \sigma ^2\) and \(\widehat \beta\), are estimated from the training set and then used for genomic prediction for the validation set. Such model is analogous to MultiBLUP by Speed et al., since multiple kinship matrices are involved in BLUP (Speed and Balding 2014), in contrast to the GBLUP models, where only one kinship matrix is used. The predicted phenotypic value for individuals in the validation sets is equal to the sum of the estimated breeding value plus fixed effects, that is, \(\widehat y_{\left( 1 \right)} = X_{\left( 1 \right)}\beta + \widehat g_{\left( 1 \right)}\). The predictability of trait is defined as the squared correlation coefficient between the observed phenotypic values and the predicted phenotypic values (Xu et al. 2016; Xu et al. 2014), as shown in the following formula:

$$r^2 = \left( {\frac{{{\mathop{\rm{cov}}} \left( {y,\widehat y_1} \right)}}{{\sqrt {{\mathop{\rm{var}}} \left( y \right){\mathop{\rm{var}}} \left( {\widehat y_1} \right)} }}} \right)^2$$
(6)

The tenfold cross-validation was implemented to calculate trait predictability for model evaluation or comparison among different genomic partition types. The data sets are equally partitioned into ten portions, in which we predicted the phenotypic value in one portion (validation) using the other nine portions (training). Each individual has a predicted value after all the portions are used as a validation set. Then we calculated the predictability as formula (6).

Results

Summary of single-nucleotide polymorphism of rice genomes

In the study, about 6.5 million high-quality SNPs distributed in the 12 chromosomes were considered (Chen et al. 2014). The SNPs with missing genotypes were eliminated, leaving a total of 3,616,597 SNPs with about 9600 SNPs per million bases on average. Since the rice genome is relatively small, such SNP density is sufficient to study the differences in heritability between functional partitions of rice genomes. Figure S1 shows that SNPs are linearly arranged in each chromosome, indicating that SNP markers are evenly distributed along the entire genome and functional partitions of the genome are equally represented in the study.

We used SnpEff tool to annotate the genetic variants based on rice reference genome (MSU version 6.1). The genome-wide variants are annotated into a total of 28 categories, the annotation for which is shown in Table S1. The 28 categories were further classified into seven genomic regions including intergenic, upstream, 5′UTR, exon, intron, 3′UTR, and downstream (Fig. 1a). Since the regulatory elements are mostly positioned in the upstream, downstream regions, and flanking untranslated regions (UTRs), we therefore further reclassified these seven genomic regions into the final four functional partitions, i.e., regulatory partition (upstream, downstream, 5′UTR, and 3′UTR), intron partition, exon partition, and intergenic partition (Table S1). Four categories are redefined based on the variant function: (1) regulatory partition, including 1,864,917 variants, about 51.6% percent of total markers, plays role in the regulation of gene expression; (2) intron partition, consisting of 491,680 variants (13.6%), represents the regions mingled with exons in the open-reading frame (ORF) and is spliced out during post transcriptions; (3) exon partition, composed of 456,341 variants (12.6%), denotes the regions that are retained in the mature mRNAs and eventually translated into protein; (4) Intergenic partition, represented by 803,659 variants (22.2%), is made up of noncoding DNA, excluding the regulatory partition (refer to Fig. 1b).

Fig. 1
figure 1

Categories of 3 million of SNPs obtained based on second-generation sequence platform. a Distributions of the seven categories directly obtained from SnpEff software and b distributions of the four categories, combining upstream, downstream, 5′UTR, and 3′UTR into regulatory categories

Comparison of genetic effects across functional categories

For each trait, we estimated genetic variations of the four functional categories, jointly using the LMM, where the kinship matrix for each category was calculated using the data of SNPs in that genome partition. Table 1 presents the results of the analysis of five agronomic traits, i.e., YD, HD, PH, GL, and GW. It appears that (1) genetic variation was mainly ascribed to regulatory and intergenic regions for YD and GL, (2) regulatory regions seemed to account for the most genetic variations in HD or PH, and (3) the associated genetic variation was only detected in intergenic regions in GW. Little genetic association (approximate to zero) has been identified for genetic regions including introns and exons. These results indicated that the variations of these five traits are mostly governed by the DNA variants outside gene regions rather than the variants located within genes. We also provided standard error of heritability of each functional category in Table S2, which will be used in the significant test of enrichment analysis. A three-component population structure has been disclosed by PCA (Fig. S3). We therefore reanalyzed the data by incorporating the population structure in the regression models as covariates, but there was no difference between the models with and without population structure (Table S3 vs. Table 1). We also removed individuals with strong relatedness based on cluster analysis (h parameters setting to 100 in cutree function) and the remaining 452 individuals were used to perform an analysis of partitions of heritability across functional annotations. The results are summarized in Table S4, which are very similar to the results using the entire sample.

Table 1 Joint estimation of genetic variance components across the four functional regions based on linear mixed model (LMM)

For each trait, we calculated the value of the logarithm of likelihood based on the estimated parameters, including five variance parameters, i.e., \(\widehat \phi _1^2,\,\widehat \phi _2^2,\,\widehat \phi _3^2,\widehat \phi _4^2\), and \(\widehat \sigma ^2\), which are marked with red triangles in Fig. S2. We also calculated logarithm of likelihood values of NULL-equivalent models by using various values, from 0 to 1.2, with an incremental step of 0.08, for these five variance parameters. Since the phenotypic values for each trait have been standardized with a standard normal distribution (N(0,1)), \(\mathop {\sum}\nolimits_{i = 1}^4 {K_i\widehat \phi _i^2 + I\widehat \sigma ^2}\) is supposed to 1; however, the actual value of phenotypic variance may vary, so we constrained it in a bracket of 0.5–1.2. For each trait, the distribution of 14,587 values of the logarithm of likelihood for the NULL-equivalent models is shown in Fig. S2, indicating that the optimal results were always achieved by the estimated values of these five variance parameters.

The enrichment analysis (Gusev et al. 2014) was then performed on each of five traits, respectively, and the enrichment scores were computed for each genomic partition as the ratio of the observed proportion of heritability to the expected proportion of heritability, which is defined as the percentage of SNPs in this category of regions. For five traits, either regulatory or intergenic regions exhibited high enrichment in heritability (>1), whereas the exon and intron regions were significantly depleted (Fig. 2). For example, variants in intergenic regions for GW were highly enriched (an enrichment score of 4.5×) compared with the expected proportion of heritability, and explained ~100% of trait heritability (SE = 1%). Following the methods described in Gusev et al. (2014), we also performed a z-test on the enrichment score for each category, which suggested that the regulatory partition in PH and the intergenic partition in GW were extremely significant (p < 0.01). The remaining functional categories did not show significant enrichment, due to the large variance of estimated heritability values. The results of simulation indicated that the LRT (or likelihood) calculated from the defined functional categories is significantly larger than that of NULL models with random genomic partitions (Fig. S4).

Fig. 2
figure 2

Partition SNP heritability of the five agronomic traits according to functional annotations. The y-axis represents what proportion the functional category can account for the total genetic variation, black bars for the estimated value, and gray ones for the expectation value. ae are corresponding to five traits, YD, HD, PH, GL, and GW, respectively

Genome-wide association study

For each of the five traits, we examined the distribution of TASs in the four genomic categories through GWAS. The Manhattan plots in Fig. 3 showed that HD, GL, and GW are controlled by several major QTLs (peaks in the plots), whereas the variabilities of YD and PH are ascribed to many modest genetic loci (or polygenic effects) without obvious peaks. Because inclusion of test genetic variants in the kinship matrix would cause the loss of detection power, we carried out analysis of GWAS by leaving one chromosome out to boost detection power, the results of which are shown in Fig. S5. There are not significant differences in the pattern of association between the results of two methods. Candidate genes near the major peaks that were identified for HD, GL, and GW are listed in Table 2. Important biological functions of these genes have been previously reported and some of them agree with the traits in the study. The gene Hd3a, which is a rice ortholog of Arabidopsis FT gene and plays a key role in the regulation of flowering through the multiple-signal pathway (Bian et al. 2011; Kim et al. 2008; Kojima et al. 2002), has been identified at the downstream of the HD-associated SNP (chromosome 6, 3.484 Mb). The two novel genes OsGSTZ2 and OSIPK have been identified in the region near a HD-associated SNP (chromosome 12). However, neither gene has been reported to be a functionally relevant HD; rather, the gene OsGSTZ2 has been reported to relate to seedling cold tolerance (Kim et al. 2011), while silencing the OSIPK gene can cause a substantial reduction in phytate levels in seeds (Ali et al. 2013). The gene GS3 (GRAIN SIZE 3), a major QTL that negatively controls rice length and weight (Fan et al. 2006; Mao et al. 2010), has been detected in proximity to the SNP (chromosome 3, 16.704 Mb) that is significantly associated with GL. The GW5 gene is a major QTL that controls rice gain width and weight (Shomura et al. 2008; Wan et al. 2008). We have found the GW5 gene near the SNP in chromosome 5 (5.373 Mb), which was significantly associated with GW. A new candidate gene OsFCA for GW has been discovered in chromosome 9 and this gene is homologous to the Arabidopsis flowering time gene (FCA) (Du et al. 2006). Previous research indicated that OsFCA is involved in the autonomous flowering pathway in rice (Jang et al. 2009) rather than contributing to grain shape. Further investigation is warranted to discover the broader functions of these new candidate genes for these agronomic traits in rice.

Fig. 3
figure 3

Manhattan plots of the five agronomic traits from the GWAS results, where the y-axis is the logarithm of the p value and the horizontal dashed lines are the critical value at the 0.05 level after Bonferroni correction, −log10(0.05/3,616,597) ≈7.86. ae represent the results of YD, HD, PH, GL, and GW, respectively

Table 2 Candidate genes related to agronomic traits in the identified regions

Intergenic regions and regulatory regions are treated as nongenic partition, while introns and exons are regarded as genic partition. We compared the distributions of TASs between these two partitions. We selected TASs based on p values using six different threshold values, i.e., 10−5, 10−6, 10−8, 10−10, 10−11, and 10−12, yielding 6280, 3811, 1723, 983, 695, and 465 TASs, respectively, for five traits in total. Out of these identified TASs with various selection criterion, the majority (4716, 2857, 1333, 764, 552, and 392, respectively) were situated in nongenic regions. In order to demonstrate that TASs are enriched in nongenic regions, we performed a comprehensive simulated study. In each simulation, we randomly selected a number of SNPs from the entire genome, with the number equal to 6280, 3811, 1723, 983, 695, or 465, respectively, which corresponds to the number of identified TASs. We then found out how many of these randomly selected SNPs are located in nongenic regions. This process was repeated 100 times, for each of the six numbers of TASs selected using various thresholds, to form a NULL distribution of TASs sitting in nongenic regions (Fig. S6). The results showed that the TASs identified using various thresholds are significantly enriched in nongenic partition. We also compared the distributions of independent TASs selected using clump function of plink software. Due to less independent TASs identified by the stringent threshold values, we only focused two threshold values (10−5 and 10−6) to identify 661 and 292 TASs for the five traits. Out of these identified TASs, the majority (499 and 216, respectively) were situated in the nongenic regions but not significantly higher than expectation values. This indicates that no enrichment was observed in genic or nongenic regions. This might be due to the loss of power if only an SNP was picked from a group of effective nongenic variants.

We then compared the genetic effects of the SNPs, represented by the genetic variances (proportional to heritability values) of the SNPs, between genic and nongenic partitions. The genetic variance is defined as 2p(1−p)a2 (Falconer et al. 1996), where p represents the minor allele frequency (MAF) and a is the additive effect of each marker estimated from LMM. We focused on the most significant SNPs, i.e., the top 50, 100, or 150 SNPs for each genomic partition among independent SNP sets selected using plink software. Figure 4 shows the distributions of the genetic effects of these top SNPs for each of the five traits. It appeared that the significant SNPs in nongenic partition generally had larger genetic effects than those in the genic partition. These results are supportive of our hypothesis that nongenic regions in the genome are the main source of genetic variations that account for the variability in complex traits.

Fig. 4
figure 4

Comparison of the effects of the top-ranked SNP (50, 100, and 150) in genic (black) and nongenic regions (gray). The y-axis represents how much one SNP can explain genetic variance, which is calculated based on the formula 2p(1−p)a2. ae are corresponding to the five traits, YD, HD, PH, GL, and GW, respectively

Whole-genome prediction

We compared the predictive performance of different sets of SNPs for the five agronomic traits based on 100 replicates of tenfold cross-validation. First, we carried out the prediction using the BLUP method, in which genetic effects of SNPs were treated as random effects. The kinship matrix (or relatedness matrix) was inferred from all the SNPs included in the regression model, either the set of SNPs in each of the four functional partitions or any combination of these sets of SNPs. The results in Table 3 showed that SNPs in nongenic partition (regulatory or intergenic) produced better prediction than those in genic partition (intron or exon), when these four sets of SNPs were used in the BLUP model individually, but the differences between their prediction performances were not phenomenal. For example, in predicting HD, regulatory partition had the highest predictability (0.562), which was slightly but significantly higher than that of exon partition (0.544) based on t test (p < 0.01). The similar results were observed when genic partition (introns and exons) and nongenic (regulatory or intergenic) were compared using the BLUP, where a single kinship matrix was calculated from SNPs in two combined categories.

Table 3 Predictability of the five traits based on the BLUP model, just inclusion of one type of kinship matrix

We also compared the predictive performance of various models using a multiBLUP framework. These models use the entire set of SNPs in the genome but different numbers of kinship matrices based on how the four functional SNP sets were grouped, yielding models with two kinship matrices (genic and nongenic), four kinship matrices, variable number of kinship matrices determined by the adaptive approach, or a single kinship matrix for all genomic SNPs. The results in Table 4 showed that although the differences between various models were not large, models with two genomic partitions yielded the best predictabilities. Conversely, adaptive strategies did not achieve the highest performance in most traits, except for GL.

Table 4 Predictabilities of the five traits based on MultiBLUP framework using various combinations of the genetic relatedness matrix

We carried out an additional simulated study to demonstrate the prediction based on the SNPs in nongenic partition outperformed that based on the SNPs in genic partition. Each time, we randomly sampled 103, 5 × 103, or 104 independent SNP markers, respectively, from genic partition (model I), from nongenic partition (model II), or from the entire genome (model III—reference model), and used these selected SNPs in the prediction analysis through tenfold cross-validation. Each scenario was replicated 100 times. Figure 5 shows the relative changes in predictability for model I or model II by referencing model III, i.e., the values along the y-axis were calculated as \(\left( {R_i^2 - R_0^2} \right)/R_0^2\), where \(R_i^2\) was the predictability of model I/II when i = 1/2 and \(R_0^2\) was the predictability of model III. Positive/Negative values along the y-axis indicate better/worse prediction performance compared with the reference model III. The results indicated that (1) model II generally had higher predictabilities than model I in most of the cases, especially when fewer markers were used for prediction. (2) Prediction using genic partition (model I) was not as good as that using the entire set of SNPs in the genome (model III), whereas prediction using nongenic partition (model II) outperformed the prediction using all SNPs (model III) in most of the cases. (3) The prediction performance for three prediction models (I, II, and III) became less variable and more reliable when more SNPs were included in these models.

Fig. 5
figure 5

Comparison of the predictabilities derived from the SNPs of genic (black) and nongenic region (gray). The y-axis represents the relative difference of predictability between model I (inclusion of SNPs within the genic region) or model II (inclusion of SNPs within the nongenic region) and the reference model in which SNPs are randomly sampled from the whole genome. The x-axis represents the number of SNPs included in the model. We performed 100 replicates for each scenario. ae are corresponding to the five traits, YD, HD, PH, GL, and GW, respectively

Discussion

Mounting studies have been conducted to investigate the genetic contributions to agronomic traits across functionally partitioned genomic regions (Li et al. 2012; Xue et al. 2016). In this study, we analyzed the trait-associated genetic variations in different partitions in rice genomes and found that nongenic regions played a key role in five agronomic traits of a natural rice population. The study population is ideal for GWAS, because this rice collection includes genetically diverse varieties from different areas, including 200 lines from a core collection in China, 132 varieties from the International Rice Molecular Breeding program, and 148 varieties from the US Department of Agricultural rice gene bank. Data for over 3 millions of high-quality SNPs are available and these SNPs are evenly distributed in the 400-Mb rice genomes (about one SNP per 130 bp), which is suitable for an analysis of heritability across genomic partitions. The analysis of heritability, or variance-component method, which was proposed by Gusev et al. (2014), outperforms the regular testing approaches based on summary statistics because linkage disequilibrium (LD) between functional categories is taken into account. In this study, we used both methods, i.e., analysis of heritability across genomic partitions and GWAS, to analyze the rice data, leading to a consistent conclusion that variation of SNPs in nongenic regions contribute most to the phenotypic variation. The analysis of heritability using individual SNP sets (intron, exon, regulatory, or intergenic) showed that the genetic contributions from these four SNP sets are very similar, with the contributions from regulatory regions or intergenic regions more than those from the introns or exons (Table S5). Similarly, predictions using SNPs in intron regions or exon regions only were almost as good as the predictions made by using SNPs in regulatory regions or intergenic regions only (Table 3). However, when the four SNP sets were used simultaneously in the analysis of heritability, the genetic contributions were mainly detected from regulatory regions or intergenic regions, whereas the contributions from introns or exons were close to zero (Table 1). In summary, a larger proportion of trait heritability was explained by SNPs in nongenic partition, based on the results of the analysis of heritability. Consistently, GWAS showed that top TASs identified in nongenic partition had larger genetic effects than those in genic partition.

Several studies of human data revealed the similar results of genetic architectures underlying complex traits. For example, Maurano et al. (2012) systematically investigate the spatial distribution of TASs and found that trait-associated variants tend to cluster in regulatory regions marked by DHSs. Gusev et al. (2014) analyzed the original GWAS data using the variance-component analysis and also reported that DHS-marked regions explained most of the genetic variations of the phenotypes. Consistent results were reported in plant studies, for example, in maize (Li et al. 2012; Rodgers-Melnick et al. 2016; Xue et al. 2016). It is not surprising to detect the importance of regulatory regions in terms of genetic contribution, because lots of cis-regulatory elements, such as enhancers and silencers, in these regions directly influence gene expressions by interacting with the promoters. Moreover, we found that, like the SNPs in regulatory regions, the SNPs in intergenic regions are also responsible for a large proportion of trait heritability. This was supported by recent research where the important role of these regions was recognized (Edwards et al. 2013), and by the fact that abundant functionally important molecules, such as small RNAs, micro RNAs, or lncRNAs, are transcribed from these regions (Bartel 2009; Cunnington et al. 2010; Ghoussaini et al. 2008; Jendrzejewski et al. 2012). For example, miRNAs typically interfere gene expressions by binding to 3′UTRs of target mRNAs and repress the transcription process (Bartel 2009). The genome does not exist as a linear entity within cells, where the DNA blueprint is actually utilized. Current research shows that both cis- and trans-interactions between genetic variants, which depend on chromosome 3D configuration, influence gene expression and provide mechanistic insights into functions associated with phenotypes. The study indicated that variants in intergenic regions played a more important role than those regulatory regions for YD, GL, and GW; however, this was not the case for HD and PH. These results suggested distinct regulatory patterns for various complex traits, i.e., trans-eQTL is key player controlling YD, GL, and GW, whereas, cis-eQTL is dominant factor for HD and PH. In this study, we conclude that the genetic variants in nongenic partition, including regulatory regions and intergenic regions, are more important than the genetic variants in genic partition in determining the traits in rice.

The positive correlation between the per-SNP heritability and the MAF associated with the SNP was reported in the analysis of heritability in human schizophrenia (Loh et al. 2015). In the study, we followed this strategy to check the possible relationship between the genetic contribution and MAF for each of the five traits. The entire set of genomic SNPs were grouped into six MAF bins based on the frequency of the minor allele (Table S6), the number of SNPs in these bins ranging from 11.1 to 23.6% of the total SNPs. Table S6 does not show any significant association between the explained heritability and the binned SNPs, with various MAFs in the rice data. We examined the distribution of MAF in the four functional categories (Fig. S7), which shows that the four categories have similar allele frequency spectrum across different MAFs. Our data and results did not support the hypothesis that higher heritability in nongenic regions region is due to enrichment of either high or low MAF SNPs.

Several literatures had used the genomic annotation data to assist a fine mapping, including genomic position information (Kichaev et al. 2014; Pickrell 2014; Yang et al. 2017). Our study leveraged these annotation data to classify the genetic variants into functional partitions for a delicate analysis, using GS with MultiBLUP strategy (Speed and Balding 2014) or GWAS. The data, results, and conclusions in our study will advance our understanding of the genetic architectures behind agronomic traits in rice. We assume that the effects of genetic variants follow the normal distribution with different variances, when they fall into different functional partitions. These variance parameters need to be first estimated from the training set and then can be applied to GS. Our Multi-BLUP strategy is analogous to the popular software BayesR, in both of which each genetic variant follows a mixture of normal distributions and the posterior membership may be inferred from data (Erbe et al. 2012). Our results indicated that considering different variance parameters for various functional partitions did not significantly improve prediction performance. This may be owing to the fact that each functional partition already includes more than enough genetic variants (in millions) that can accurately capture the genetic relatedness of individuals in the sample. A simulation study, in which small numbers of SNPs were randomly selected from these functional partitions for the analysis, showed that SNPs in nongenic partition outcompeted the SNPs in the genic partition in predictability. Therefore, to reduce the genotyping cost in breeding if microarrays are used, we only need to place the SNP markers in nongenic regions onto the genotyping chips.

A limitation of the study is that only 524 individuals were used. Estimating genetic parameters from such small sample may be a problem, because of the potential sampling bias. Combining other available rice samples (Huang et al. 2010, 2016, 2012; Yano et al. 2016) in the analysis of heritability may increase power. However, various studies may be quite different in genotyping method, data quality, number of variants, etc, posing challenges to a combined study or meta-analysis. A desirable property of this study sample is that it is a highly diverse collection because each individual represents a subpopulation. Compared with other species like human, such a study sample enjoys a larger effect size for testing genetic parameters. When other omics data become available in the future, we may leverage these additional data to tune the analysis of heritability based on genomic annotations.