Introduction

With the availability of genome-wide single nucleotide polymorphism (SNP) markers, researchers are now interested in estimating SNP-heritability/genomic heritability in animal, plant and human genetics, as well as in ecology (Visscher et al. 2006; Sillanpää 2011a; de los Campos et al. 2015). However, it is generally known that SNP-heritability estimation suffers from the missing heritability problem (Manolio et al. 2009; Eichler et al. 2010; Gibson 2012). That is, the quantitative traits that are known to have a substantial genetic component and a high heritability estimated in pedigree data sets show very low values of SNP-heritability: either when they are estimated using a few single SNPs showing most strong trait associations in genome-wide studies (Jakobsdottir et al. 2009), or estimated using genome-wide sets of SNP markers (Yang et al. 2010). However, recently, for some well studied human traits, partial concordance (60% of the pedigree value) has been reached (Yang et al. 2015; Kim et al. 2017). The origin of this missing heritability is currently the subject of heated debate. Possible explanations include loose associations between SNPs and causal variants (Shen 2013; de los Campos et al. 2013a), high fraction of causal effects may be due to rare variants (Zuk et al. 2014; Goldstein 2011), population admixture/structure (Zaitlen et al. 2014; Browning and Browning 2011), epistasis (Zuk et al. 2012; Hemani et al. 2013) and unaccounted haplotypes of common SNPs (Bhatia et al. 2015; Sun et al. 2016), all of which evidently have a role in the linkage disequilibrium (LD) pattern of the genome. Additionally, small sample sizes and gene-by-environment interactions may also be possible reasons.

SNP-heritability (the proportion of genetic factors explaining the total variance) is generally estimated under a linear mixed model (Henderson 1984), in which the random effect covariance structure between individuals is replaced with a sample covariance matrix estimated from genome-wide sets of SNP-markers. Here, we call this model the genomic best linear unbiased prediction (G-BLUP) model (Meuwissen et al. 2001; Habier et al. 2007; VanRaden 2008).

It is also possible to estimate SNP-heritability by fitting thousands of SNPs simultaneously to the whole-genome regression (WGR) model and applying variable selection for them (Meuwissen et al. 2001). This process is analogous to performing the G-BLUP analysis with a trait-specific relationship matrix having own variance component for each SNP in the diagonal (Zhang et al. 2010; Resende et al. 2012; Shen et al. 2013). In general, such G-BLUP model is equivalent to one WGR model called generalized ridge regression model (Piepho et al. 2012; Shen et al. 2013; Strandén and Garrick 2009). In simulations, WGR models following the Bayesian alphabet (BayesA, Bayes B, BayesC, etc.) have shown improved performance over G-BLUP (de los Campos et al. 2013a, b). Moreover, WGR models are more interpretable and so it is possible to develop better priors which leads to the improved performance. However, in practice, the G-BLUP model has been adopted more often than the WGR model, due to its robust performance in varying scenarios including different genetic architectures (Wimmer et al. 2013). This fact is true despite the strong assumption of equal weighting of loci in the G-BLUP model. Additionally, there are various WGR models proposed (Conti and Witte 2003; Sillanpää and Bhattacharjee 2005; Malo et al. 2008; Tsai et al. 2008; Fridley and Jenkins 2010; Yang and Tempelman 2012; Yi et al. 2015) in order to account for strong LD in the data.

Both G-BLUP and WGR models rely on the LD between SNPs and QTLs. While many methods seek to increase statistical power by better modeling of the LD, in some cases, strong LD is a problem. For instance, when QTLs are in strong LD, using the unweighted genomic relationship matrix in G-BLUP can cause upward bias in the heritability estimation (Speed et al. 2012; Fernando et al. 2017; Legarra 2016). Moreover when QTLs are in heterogeneous regions with varying degree of LD between SNPs and QTLs in each, the heritability estimate can be biased (Yang et al. 2015; Gusev et al. 2013; Yang et al. 2017). Therefore, there is a need of correcting for LD in some way in G-BLUP and WGR models.

To cope with strong LD in the G-BLUP context, the following approaches have been proposed: (1) pruning or “thinning” the SNP set (Purcell et al. 2007), (2) finding and using haplotype tagging SNPs (Lin and Altman 2004; Meng et al. 2003) and (3) using the LD weighting (Speed et al. 2012). Even if human geneticists and ecologists consider SNP selection to be an option, animal and plant breeders are more or less omitting it due to its minor influence on the genomic breeding value estimates (Ober et al. 2012). In this study, our main focus is on LD-correction and LD-weighting. To improve SNP-heritability estimation from genotype data, Speed et al. (2012) utilized LD information to calculate better weights for the contribution of each SNP to the genomic relationship matrix (GRM). This approach was used to correct for uneven LD distribution between SNPs in regions in which causal variants lie. Instead of individual SNP weighting, we present a novel conceptual framework that utilizes the linkage disequilibrium pattern between SNPs to calculate the genomic relationship matrix.

Materials and methods

Let us consider the basic G-BLUP model:

$${\bf{y}} = {\bf{X}}\beta + {\bf{Zg}} + {{\epsilon }}$$
(1)

where y is an n × 1 vector of phenotypic observations, β is a n × 1 vector of fixed (environmental) effects with design matrix X, g is a n × 1 vector of random genetic effects with design matrix Z and \({\bf{\epsilon }}\) is a n × 1 vector of error terms, \({\bf{\epsilon }} \sim N\left( {{\bf{0}},{\bf{I}}\sigma _e^2} \right)\), where \(\sigma _e^2\) is the error variance. Let M be n × m the marker matrix (n is the number of individuals and m is the number of loci), for which the elements are coded as −1, 0, and 1 for the homozygote, heterozygote and the other homozygote genotype, respectively. Let the column i of matrix P contains the allele frequencies as the difference from 0.5 and multiplied by 2 (i.e., 2 (p i  − 0.5), p i is the frequency of the second allele). Then, the unscaled genomic relationship matrix G 0 can be calculated as:

$${\bf{G}}_{\bf{0}} = {\bf{ZZ{\prime}}},\quad {\bf{Z}} = {\bf{M}} - {\bf{P}}.$$
(2)

Here, the subtraction of P from M provide more credit to the rare alleles. Following VanRaden (2008), the scaled genomic relationship matrix (G), hereafter described as VanRaden genomic relationship matrix (VanRaden G matrix), can be calculated as: G = G 0 /k, where the scaling parameter k = \({\bf{2}}{\sum} {{\bf{p}}_{\bf{i}}\left( {{\bf{1}} - {\bf{p}}_{\bf{i}}} \right)} \). This scaling makes the scale of G comparable to that of the additive genetic relationship matrix calculated from the pedigree.

Here, the computation of the elements of the genomic relationship matrix (genomic relationships between individuals based on the marker information) can be accomplished in many ways (Speed and Balding 2015). Most of the current genomic relationship matrix computation methods assume equal weighting of the markers.

Let us consider the unscaled genomic relationship matrix (G 0 ) from Eq. (2): G 0  = ZZ′. Here, Z = M − P, then G 0  = (M − P)(M − P)′. Furthermore, let I be an identity matrix of order m (the number of markers). Then, G 0 can be represented as:

$${\bf{G}}_{\bf{0}} = \left( {{\bf{M}} - {\bf{P}}} \right){\bf{I}}\left( {{\bf{M}} - {\bf{P}}} \right)^\prime .$$
(3)

Thus, the construction of the unscaled genomic relationship matrix can be observed to include the same weighting (the ones on the diagonal) for all markers. However, in a breeding population, there exists correlation between the loci due to linkage and various other factors. It is thus important to consider this LD covariance structure in the computation of the genomic relationship matrix.

One of the easiest ways to account for the LD structure in genomic relationship matrix calculation is to use the squared Mahalanobis distance (Mahalanobis 1936; De Maesschalck et al. 2000; Mitchell and Krzanowski 1985), which can take the covariance structure of SNPs into account. The concept of the Mahalanobis distance has been used in various fields, including bioclimatic modeling (Farber and Kadmon 2003) and outlier detection (Hodge and Austin 2004). Given a matrix S which contains the covariance structure of linkage disequilibrium (covariance of SNPs), then the LD-corrected genomic relationship matrix G ld0 can be calculated as:

$${\bf{G}}_{{\mathbf{ld0}}} = \left( {{\bf{M}} - {\bf{P}}} \right){\bf{S}}^{ - {\mathbf{1}}}\left( {{\bf{M}} - {\bf{P}}} \right)^\prime .$$
(4)

Here, the right hand side of Eq. (4) represents the squared multivariate Mahalanobis distance between individuals. The Mahalanobis distance has the property of projecting measurements to the space where independence holds, and it measures a distance between the individuals therein (see appendix for details). Note that a density function of the multivariate normal distribution uses Mahalanobis distance and thus it is very commonly applied procedure in practice. To simplify the example analyses, we estimated the LD structure for each chromosome independently (i.e., we assumed there is no dependence between the different chromosomes) and merged them together to form a single block-diagonal matrix, which contains the LD covariance structure for all the chromosomes. VanRaden (2008) used a genome-wide scaling factor that is averaged across all SNPs. When the markers are assumed to be independent, this scaling factor can be seen as, 2pI (1 − pi), here, I is an identity matrix with the order of number of markers, p is a vector of allele frequncies. Thus, the scaled version of the LD-corrected genomic relationship matrix (G ld ) is calculated as follows:

$${\bf{G}}_{{\bf{ld}}} = \frac{{{\bf{G}}_{{\bf{ld0}}}}}{{{\bf{2p}}{\bf{S}}^{ - {\mathbf{1}}}\left( {{\bf{1}} - {\bf{p}}} \right)}}.$$
(5)

Here, matrix S contains the covariance structure of linkage disequilibrium pattern between the SNPs. Hence, in model (1), the random genetic effects can be assumed to follow a normal distribution as: \({\bf{g}} \sim N\left( {{\bf{0}},{\bf{G}}_{{\bf{ld}}}\sigma _g^2} \right)\), where \(\sigma _g^2\) is the genomic variance. The mixed model equation (Henderson 1984) for the model (1) is as follows:

$$\left[ {\begin{array}{*{20}{c}} {{\bf{X{\prime}X}}} & {{\bf{X{\prime}Z}}} \\ {{\bf{Z{\prime}X}}} & {{\bf{Z{\prime}Z}} + {\bf{G}}^{ - 1}\alpha } \end{array}} \right]\left[ {\begin{array}{*{20}{c}} \beta \\ {\bf{g}} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{\bf{X{\prime}y}}} \\ {{\bf{Z{\prime}y}}} \end{array}} \right]$$
(6)

Here, G corresponds to different GRMs (LD-corrected, VanRaden G matrix or LDAK) used in this study and \(\alpha = \sigma _{\bf{e}}^{\bf{2}}{\mathrm{/}}\sigma _{\bf{g}}^{\bf{2}}\). Scaling of GRM does not influence on the prediction accuracy of genomic breeding values but still influences on the variance components and therefore SNP-heritability estimates. We used the function kin.blup in R-package’rrBLUP’ (Endelman 2011) to solve Eq. (6) to estimate the genomic breeding values (g) and variance components (the function uses restricted maximum likelihood (Patterson and Thompson 1971) approach) using different GRMs with the G-BLUP model. For understanding the relationship between the genomic breeding values estimated using mixed model and the SNP effects estimated using WGR model (ridge regression) see Piepho et al. (2012).

See appendix for details of the different approaches to calculate the LD covariance for each marker pair in matrix S of Eq. (4).

Example analyses

To demonstrate the superiority of our new approach, we used the following publicly available datasets in maize, rice (real and simulated phenotype), mice, human (simulated phenotypes) and cattle (simulated phenotypes). Before the analysis, we removed the duplicated markers (which showed more than 99% correlation) and only retained SNPs with minimum allele frequency greater than 5%. We also compared our results to those obtained using the LDAK (version 4.9, Speed et al. 2012) and GCTA (Genome-wide Complex Trait Analysis, Yang et al. 2011) packages. LDAK uses LD weighting to improve the SNP-heritability estimation and genomic prediction accuracy whereas GCTA uses a different scaling than VanRaden approach (see Uemoto et al. 2015 for the differences in the scaling factor used by different methods). “The goal of the [LDAK weighting] is that the signal from each SNP is down weighted so that replication of its signal by neighboring SNPs can be compensated for” (Speed et al. 2012). VanRaden approach implicitly assumes SNP effects and allele frequencies are independent, which would be true under evolutionarily neutral model. GCTA implicitly assumes rare variants tend to have larger effects, consistent with a model of purifying/stabilizing selection. LDAK is available at www.ldak.org/ and GCTA under http://cnsgenomics.com/software/gcta/.

Simulated datasets

In order to show the superiority of our new approach, we simulated two datasets one with high LD (rice data) and another with relatively low LD (human data).

Simulated human dataset—low LD case

Simulated human phenotypes were generated using the GCTA package conditionally on real genotype data and selected 10% of the total loci (from the LD pruned subset) serving as QTLs, where the marker effects were generated from a standard normal distribution. For the simulation we obtained the human HapMap3 (available under www.sanger.ac.uk/resources/downloads/human/hapmap3.html) dataset and we selected population the Maasai in Kinyawa, Kenya (MKK) with 171 individuals for our study. To imitate low LD situation, we selected a subset of 3024 SNPs from the chromosome 22 based on LD pruning using the PLINK software. Here, the criterion for LD pruning was set so that we had about 3000 markers. Due to LD pruning the amount of LD in the dataset ended up being very low. Using the subset of SNPs we generated 100 simulation replicates of the phenotype with a heritability of 0.7 using the GCTA package.

Simulated rice dataset—high LD case

The rice data (following section we provide more details about the data set) showed very high LD and we used the 3290 SNPs from the second chromosome in order to simulate the phenotypes. Based on the subset of SNPs we created 100 simulation replicates of the phenotypes conditionally on real genotype data using the GCTA package with a heritability of 0.7. We selected 10% of the total loci (from the LD pruned subset) serving as QTLs and the marker effects were generated from a standard normal distribution.

Maize dataset

We used the maize’IBM 302 population’ (Sharopova et al. 2002; Lee et al. 2002), which was developed as part of the Maize Mapping Project, for the analysis. The population consists of 302 lines which had genotypes at 1252 simple sequence repeats (SSR) markers and phenotypes for the trait ’leaf greenness’. The dataset is available at http://archive.maizegdb.org/qtl-data.php. We selected this dataset because (1) it is an out-breeding population, (2) it was developed as a part of a mapping project and (3) its genetic map positions are available at high accuracy. Additionally, plotting the magnitude of LD against the genetic distance in the dataset resulted in a picture with a clear LD decay pattern (see Fig. 1) as was expected based on points 1 and 2 above. Hence, we used the maize’IBM 302 population’ to illustrate our LD decay approach together with the observed LD approach.

Fig. 1
figure 1

In maize, the linkage disequilibrium estimates (R2) between pairs of marker loci plotted against the genetic distance. To estimate the R2 values we used 128 polymorphic markers selected from the second chromosome

Rice dataset

This dataset consists of 413 diverse accessions of O. sativa (Zhao et al. 2011) collected from 82 different countries. The lines were genotyped using 36,901 SNP markers. Phenotypic information was available for 34 traits, and we analyzed the trait’amylose content’ in this study. The dataset is publicly available at http://www.ricediversity.org/data/. The phenotype information was missing for 20 lines, so the remaining 393 lines were used for our analysis. The computational complexity of estimating pairwise LD for all possible pairs of markers increases rapidly with increasing number of loci (even if it is being computed for each chromosome separately). Thus, after removing the monomorphic markers, we received 36,901 SNP markers in the rice dataset. Of these, we selected a subset of 3315 markers, which were as evenly distributed along the genome as possible, for the analysis. To ensure that we did not lose too much prediction accuracy in the smaller subset due to the marker selection, we calculated the correlation coefficient between the genome estimated breeding values (GEBVs) and the true phenotypes for the subset and the whole dataset using G-BLUP of VanRaden (VanRaden 2008). The obtained correlation coefficients were roughly comparable, at 0.86 and 0.88 for the subset (3315 SNPs) and the whole dataset (36,901 SNPs), respectively. We also examined possible exponential patterns of LD decay from the dataset but were unable to find any clear indications of this (see Fig. 2). Rice is a self-pollinating plant and moreover there is an admixture between populations in this dataset as reported by Zhao et al. (2011). So as shown in Fig. 2, it was apparent that there is no clear exponential decay of LD in the dataset. Therefore, we used this dataset to validate our observed LD approach only.

Fig. 2
figure 2

The mean LD decay plot of 2165 SNP markers selected from the 12th chromosome of the rice dataset. Here, the X-axis represents the distance in base pairs, and the Y-axis corresponds to the mean of pair wise linkage disequilibrium estimate (R2) values from a bin length of 100

Mice dataset

We selected datasets from animal studies to illustrate our new approach in the context of animal genetics. This dataset contains data from a heterogeneous stock mouse population (Valdar et al. 2006a) of 2527 individuals. In this population 1940 individuals are genotyped with 12,545 biallelic SNP markers. In this study, we concentrated on the trait ’body weight’, which was measured at the age of 6 weeks. Due to the computational complexity, for the mice dataset, we selected a subset of 2336 markers out of the total 12,545 markers. The selected markers were equally distributed along the genome. Then, we obtained the prediction accuracies for the subset (r = 0.77) and the whole dataset (r = 0.80) using G-BLUP of VanRaden (VanRaden 2008) approach, which indicated that the prediction accuracy was not significantly lower for the subset. The genotype and phenotype information of the individuals are available from http://mus.well.ox.ac.uk/GSCAN/index.shtml/.

Cattle dataset

This dataset is publicly available as a part of the R package’SynbreedData’ (Wimmer et al. 2015). In this dataset, 500 bulls were genotyped using 7250 SNP markers. The phenotype was generated using simulation and the pedigree relationships between the animals are known for the dataset. Thus, it is possible to compare the heritability estimates from the pedigree-based genetic relationships under the infinite locus model to that of marker-based genomic relationships under the finite locus model. For the cattle dataset, we took a subset of 3998 SNPs, which were equally distributed along the genome, for the analysis. Here, also we looked at the prediction accuracies for the subset and whole dataset. These were r = 0.83 and r = 0.80, respectively, according to the G-BLUP of VanRaden approach.

Results

Our primary interest was to illustrate the improved SNP-heritability estimation of our new Mahalanobis distance based LD-corrected genomic relationship matrix using the restricted maximum likelihood (REML) method. Additionally, we wanted to show the improved out-sample prediction accuracy of the LD-corrected genomic relationship matrix in the context of genome-enabled breeding value prediction using G-BLUP approach. To achieve these goals, we calculated the correlation coefficient (r) between the GEBVs and the true phenotypes and compared their values to those provided by the G-BLUP of VanRaden approach (VanRaden 2008). The results are presented in the following section.

SNP-heritability

In a recent study, Speed et al. (2012) pointed out the importance of accounting for LD while estimating the narrow-sense heritability using genome-wide SNP markers. Subsequently, they proposed a weighting approach for calculating an LD-corrected genomic relationship matrix for improved genomic heritability estimation. To this end, we compared our approaches to their approach to determine whether our Mahalanobis distance-based LD-corrected genomic relationship matrices could improve the heritability estimates using genome-wide markers. The narrow-sense SNP-heritabilities (h2) were estimated using our LD-corrected genomic relationship matrix in the G-BLUP model as: h2 = \(\sigma _g^2{\mathrm{/}}\left( {\sigma _g^2 + \sigma _e^2} \right)\). Here, \(\sigma _g^2\) and \(\sigma _e^2\) are the genomic and residual variances, respectively. In following, GCTA and LDAK estimates were obtained using GCTA and LDAK packages respectively, whereas the other estimates were obtained using’rrBLUP’ package.

Estimation accuracy of SNP heritability in simulated data sets

Analysis of human simulated data

In order to compare the estimation accuracy of SNP heritability in human simulated data, first we estimated the narrow-sense heritability using 100 simulation replicates with different approaches. Figure 3 summarizes the box plots for the estimation errors (difference between the true and estimated heritability values) to visualize the estimation accuracy of different methods. Here, Y-axis corresponds to the differences between the true simulated and the estimated SNP heritability, whereas the X-axis corresponds to different GRM estimation methods. Based on results from Fig. 3, one can conclude that in the low LD situation, our observed LD approach provided same estimation accuracy like the competing methods (VanRaden and GCTA). Moreover, the estimation accuracy of LDAK was slightly better than that of our approach, VanRaden and GCTA. This is likely due to the fact that LDAK has been optimized for low LD situations.

Fig. 3
figure 3

Low LD case: Box plots for the estimation error of heritability based on different approaches to calculate the genomic relationship matrix (GRM) using 100 simulation replicates with the human data. Here the Y-axis scale corresponds to the difference between the true simulated heritability and the estimated heritability values whereas X-axis corresponds to the different approaches to calculate the GRM

Analysis of rice simulated data

Figure 4 summarizes the box plots for the estimation errors (difference between the true and estimated heritability values) to visualize the estimation accuracy of different methods with the rice data, which corresponds to high LD situation. Here, Y-axis corresponds to the differences between the true simulated and the estimated SNP heritability, whereas the X-axis corresponds to different GRM estimation methods. From Fig. 4, one can conclude that in the presence of high LD, our observed LD approach provided better estimation accuracy than the competing methods (VanRaden, GCTA, and LDAK). Thus our approach seems to work better in presence of high LD.

Fig. 4
figure 4

High LD case: Box plots for the estimation error of heritability based on different approaches to calculate the genomic relationship matrix (GRM) using 100 simulation replicates with the rice genotypes. Here the Y-axis scale corresponds to the difference between the true simulated heritability and the estimated heritability values whereas X-axis corresponds to the different approaches to calculate the GRM

SNP heritability in other datasets

Of the 302 lines in the maize dataset phenotype information for the trait ’leaf greenness’ was available for 270 lines. We also observed a clear exponential decay of LD (Fig. 1) in the dataset. We then used an expected LD decay-based approach (as a function of the genetic distance) to estimate the LD-corrected genomic relationship matrix. As shown in Fig. 1, the LD appeared to vanish rapidly after 20 cM. Therefore, in Eq. (7), we used different values of λ (between 20 and 30) to characterize the LD covariance structure. Following Eqs. (4) and (5), we calculated the scaled LD-corrected genomic relationship matrix (G ld ) for each λ. We then estimated the correlation coefficient (r) between the GEBVs and the true phenotypes based on our new G ld matrices (for different λ values). We chose λ, which gave the highest correlation (for whole data set) and was at λ = 26.

The exponential decay of LD may be slightly different for different chromosomes, and finding the optimal λ is difficult and time-consuming. Therefore, we also used our observed LD-based approach to estimate the LD-corrected genomic relationship matrix. First, we calculated the pairwise LD estimates (D) for all marker pairs on each chromosome using the R-package called ‘genetics’ (Warnes and Leisch 2006) (note that some of the LD estimates were negative, and we kept those values unchanged). We then merged all chromosome-specific LD matrices together to form the LD covariance structure matrix S for the whole genome and estimated the unscaled LD-corrected genomic relationship matrix (G ld0 ) using Eq. (4), before scaling it using Eq. (7). By applying the REML estimation to our exponential LD decay based genomic relationship matrix, G-BLUP provided an estimated genomic heritability of 0.16, with \(\sigma _g^2\) = 3.67 and \(\sigma _e^2\) = 19.46. By contrast, the observed LD-corrected matrix led to an even higher SNP-heritability estimate of 0.22, with \(\sigma _g^2\) = 4.70 and \(\sigma _e^2\) = 16.51. The SNP-heritability estimate using a common genomic relationship matrix was approximately 0.14, with a genetic variance \(\left( {\sigma _g^2} \right)\) of 3.17 and a residual variance \(\left( {\sigma _e^2} \right)\) of 19.73. Similarly, the LDAK based SNP-heritability estimate was 0.28, with \(\sigma _g^2\) = 7.30 and \(\sigma _e^2\) = 18.66, whereas, the GCTA provided a heritability estimate of 0.14 with, \(\sigma _g^2\) = 3.14 and \(\sigma _e^2\) = 19.79. The high SNP-heritability estimate of LDAK for maize dataset, where degree of LD is low, may be due to the default settings of weighting parameters in LDAK which are optimized especially for low-LD datasets.

The rice population did not show any signs of an exponential LD decay pattern (Fig. 2), so we only used the observed LD-corrected genomic relationship matrix for the estimation of genomic heritabilities using the G-BLUP model. For the rice dataset where degree of LD is high, estimates from our approach were close to VanRaden approach. Whereas the SNP-heritability estimate of LDAK were lower than estimates of the other methods. This may be due to the fact that LDAK weighting parameters are tuned for the human datasets and these parameters may not be optimal for high-LD datasets. For the mice data, the pedigree-based heritability estimate (obtained using additive genetic relationship matrix calculated from the pedigree) provides natural comparison point for genomic heritability estimates. The genomic heritability 0.63 for mice subset of observed LD-G approach was more close to the pedigree-based estimate 0.74 as reported by Valdar et al. (2006b). The estimates of other approaches were lower (0.50–0.59). Also in cattle data, the genomic heritability 0.27 of observed LD-G approach was again the highest and most close to the pedigree-based estimate 0.41. The other approaches obtained lower values. Table 1 summarizes the genomic heritability estimates obtained using our LD-corrected and other (VanRaden, LDAK, and GCTA) genomic relationship matrices in G-BLUP model with different datasets.

Table 1 The REML-estimated variance components and narrow-sense SNP-heritability estimates of the four G-BLUP mixed models (with differing genomic matrices) for the maize, rice, mice and cattle datasets

Genomic prediction

We also performed a genomic prediction (out-sample prediction) using five-fold cross-validation (Stone 1974) with four (rice, maize, mice and cattle) datasets. For the five-fold cross-validation (CV), we used 80% of the data points as the training set and the remaining 20% as the validation set. To remove the influence of random partitions on the accuracy, we repeated the cross-validation procedure 500 times and took the mean value; Table 2 summarizes these results. Our observed LD-based genomic relationship matrix provided better prediction accuracies for all datasets expect that of the maize population. It is already known that population structure has a profound effect on the genomic prediction accuracies. Our principal component analysis (PCA) (Jolliffe 2002) based plots (Fig. 5) indicated the presence of population structures for rice, mice and cattle datasets. So we also estimated the out-sample prediction based on 500 CV replicates by including the first three principal components in the model for the rice dataset, in order to assess the effect of population structure on prediction accuracy. However, this approach did not show any improvement to the prediction accuracy in rice dataset. We also applied stratified random sampling to assess the impact of the population structure on the genomic prediction accuracies with our datasets. To this end, we estimated the subpopulations using the k-means clustering with the R-package’adegenet’ (Jombart 2008). Then, from each subpopulation we selected 20% observations and formed the validation set. However, the subpopulation-based cross-validation with the different genomic relationship matrices (common G, LDAK weighted G and observed LD-G) did not show any improvement in the genomic prediction accuracies (results are not shown) compared to the random sampling cross-validation results (Table 2). Habier et al. (2007) showed that the presence of close relatives in the training dataset may inflate the genomic prediction accuracies. So we also checked for the proportion of close relatives in our real datasets by plotting the distribution of the upper off-diagonal elements of the genomic relationship matrix (Fig. 6). The plot for the rice population is bimodal so that many individuals have genomic relationship values close to −1, which may indicate the recent admixture in the population. From Fig. 6 it can be concluded that, close relatives are present in the rice dataset while maize, mice and cattle populations seem to be more unrelated.

Table 2 Average out-sample prediction accuracy (measured as the correlation between the GEBVs and the true phenotypes) of four G-BLUP mixed models (with differing genomic matrices) determined by 5-fold cross-validation for the maize, rice, mice and cattle datasets calculated over 500 random partitions of data
Fig. 5
figure 5

Population structure plot for the maize, rice, mice and cattle datasets. This scatter plot presents the first two principal components (PC1 and PC2) with each point representing a single individual. In the rice dataset, the individuals that do not clearly belong to any of the three original populations (corners) are considered to represent the out group—admixed individuals, and the corresponding points in the plot are indicated with a red color

Fig. 6
figure 6

Histograms of the upper off-diagonal elements of the genomic relationship matrix calculated using VanRaden (2008) approach in maize, rice, mice, and cattle datasets

The PCA plot based on the rice genotype information indicated the presence of an admixed population (marked in red in Fig. 5), which was already reported by Zhao et al. (2011). So we decided to compare the genomic prediction accuracies for the admixed population with different genomic relationship matrices. For that, we selected only the lines belonging to the admixed population as the validation set and obtained genomic prediction accuracies using 10-fold cross validation (note that there were insufficient amount of observations in the admixed population to perform 20% cross-validation, so we used 10% cross-validation) with 100 re-samples. In this case, our observed LD-based genomic relationship matrix showed improvement with a genomic prediction accuracy (r) of 0.38 compared to VanRaden G matrix (r = 0.32) and LDAK weighted G (r = 0.32).

Discussion

LD is the fundamental basis of association mapping studies as well as that of “Genomic selection” (Meuwissen et al. 2001). Moreover, LD plays a crucial role in evolutionary biology and provides information about the population history and the response to selection. Many studies have pointed out the importance of understanding the patterns and distributions of LD in humans (Flint-Garcia et al. 2003) as well as in other species, including animals (Qanbari et al. 2010) and plants (Ardlie et al. 2002). Meanwhile, it is evident that the pattens of LD in a population are strongly dependent on the population substructure (e.g, Gilad et al. 2002; Fricano et al. 2012; Chen et al. 2012). There are various methods proposed to correct for the population structure in association mapping studies (Yu et al. 2006; Kang et al. 2008; Sillanpää 2011b). However, less focus has been given to the importance of accounting for population structures in the estimation of genomic breeding values and heritability. Also, the relationship between epistasis and LD needs to be clarified in the estimation of genomic breeding values and heritability (e.g, Phillips 2008, Hemani et al. 2014). Arguably, LD can be considered a confounding factor along with the epistasis, population structure and genetic admixture.

In this study, a new approach to construct the genomic relationship matrix in the presence of strong dependence between collected SNPs is presented. This approach is based on, first estimating the pairwise observed LD from the marker data and then correcting the LD away from the SNPs using a Mahalanobis distance based formulation of the genomic relationship matrix. Based on our presented analyses, this method seems to be especially helpful for SNP-heritability estimation, but it also provides improvement for genomic prediction accuracy in the presence of population structure (because many of our tested datasets included strong population substructures). Our results also support substantial improvements for genomic prediction accuracy in populations that have experienced recent events of population admixture. This result is in contrast to the large number of other studies in which disappointingly low genomic prediction accuracies have been observed for admixed individuals (e.g., Vallee et al. 2014; Hidalgo et al. 2015). Why is genomic prediction accuracy usually much lower for admixed or multi-breed populations? In brief, if two populations in Hardy-Weinberg proportions and with divergent allele frequencies are analyzed together, as is performed in multi-breed genomic evaluations, the combined (or admixed) population may have a large amount of LD simply due to the process of combination (e.g., Ewens and Spielman 1995). This LD may then cause false positive signals for some loci, which do not have any connection to the studied trait in question. In the following section, we briefly discuss about population structure, which is mainly co-founded with the pattern of LD in the genome.

To study the impact and to correct for the population structure in genomic prediction as well as in the estimation of genomic heritability, de los Campos et al. (2010), Janss et al. (2012) and Guo et al. (2014), have considered the eigenvalue decomposition of G-matrix, which can be utilized to reparametrize the G-BLUP model to the form of principal component regression, in which the population structure can be separated and controlled. We argue that the population structure problem is essentially the LD problem in the data, and our observed LD-corrected G-BLUP model may be capable of doing something similar to principal component regression by transforming the estimated pairwise LD patterns away from the G-BLUP model. Our example with the rice data, in which a strong population admixture is present, suggests this type of behavior. However, sufficient validation of this claim will need to be completed in subsequent research.

In a conclusion, we hope that our method will open new avenues for SNP-heritability estimation as well as genomic prediction. Despite the good results, our method seems to generate more questions than answers. Moreover, unlike the competing methods our new approach is not able to include hundreds of thousands of markers in SNP heritability estimation due to the computational burden: (1) to estimate the pairwise LD for all marker pairs, and (2) to invert the LD structure matrix S of Eq. (4). In our example analysis, the pairwise LD calculation of the rice data with 393 lines and 3315 markers took around 90 min. For fast inversion of large S matrix, one could consider having non-zero LD values only at nearby markers or markers within a certain window/band, resulting in tridiagonal, five-diagonal, or banded matrix, where sparse matrix inversion techniques could be applied. Of course, it is very likely that other methods together with hundreds of thousands of SNPs may outperform our method with much smaller number of SNPs. Even though, there is a clear trade off between accuracy of the heritability estimation and the computation time, we still believe that our new proposed conceptual framework is able to provide the basis for GRM development towards improved SNP-heritability estimation.