Introduction

This study analyses temporal and genomic trends of additive genetic variance in different stages of a breeding programme. Genetic variance is one of the critical parameters in a breeding programme because it determines the potential for selection (Lush 1937; Falconer and Mackay 1996; Lynch and Walsh 1998; Walsh and Lynch 2018). Estimation of genetic variance has therefore received considerable attention in the literature (Lynch and Walsh 1998; Walsh and Lynch 2018), where most of the attention is on statistical models and approaches for estimation. Surprisingly, far less attention has been given to temporal trends in genetic variance, even though such trends indicate the sustainability of a breeding programme. Recent ability to genotype individuals at scale has renewed interest in analysing genetic variance. This study extends previously proposed framework for temporal analysis of genetic variance using the pedigree-based model of Sorensen et al. (2001), to a new framework for temporal and genomic analysis of genetic variance using marker-based models.

The estimation of genetic variance in breeding programmes has a long history and a recent revival with the advent of genomic information. Historically, the genetic variance was estimated with an analysis of variance (ANOVA) method in optimised experimental designs, ranging from simple parent-offspring or sib groups to North Carolina and diallel designs (Falconer and Mackay 1996; Lynch and Walsh 1998; Bernardo 2002; Awata et al. 2018). With these designs, we partition phenotypic variance into variance between and within groups and ‘translate’ these components into genetic variance based on expected genetic relationships within and between groups. Animal breeders have soon moved from such experimental designs to a general pedigree-based model to analyse their observational data (Henderson 1976). Nearly 30 years later, plant breeders have also adopted the pedigree-based model (Oakey et al. 2006, 2007; Piepho et al. 2008). There were many reasons for this late adoption. One reason is that with the pedigree-based model, we estimate genetic variance between the founders of a pedigree (Sorensen and Kennedy 1984; Kennedy et al. 1988), while genetic variance between their descendants is arguably more relevant for breeding (Piepho et al. 2008). The advent of genomic information revived interest in the estimation of genetic variance and spurred active development of genome-based models (Bernardo 1994, 1996; Meuwissen et al. 2001; VanRaden 2008). The genome-based model replaces expected relationships from the experimental designs or pedigree with realised relationships measured by marker genotypes. The estimate of genetic variance from the genome-based model pertains to genotyped individuals (Hayes et al. 2009) or their relatives (VanRaden 2008). It can be obtained using either a genome-based model with genetic values or with marker effects (marker-based model) (Stranden and Garrick 2009). We note, though, that the resulting ‘genomic variance’ is estimating genetic variance only under certain conditions (Gianola et al. 2009; de los Campos et al. 2015; Rawlik et al. 2020). Specifically, the genome-based model assumes that markers are sufficiently linked to quantitative trait loci (QTL) to capture their effects, and that genetic values at different QTL are uncorrelated. The second assumption about independence will not hold when there are population processes that induce linkage-disequilibrium as a function of QTL (Lynch and Walsh 1998; Walsh and Lynch 2018; Rawlik et al. 2020; Bulmer 1971). We expand on this second assumption in the methods and discussion.

In parallel to the development of data sources and corresponding statistical models, there has been active development in statistical and computational approaches to estimate genetic variance. The three most used are the method of moments, likelihood, and Bayesian approach. The method of moments used with the ANOVA is computationally simple but can yield biased estimates outside of the parameter space. With the likelihood approach, we specify a probability distribution for observed data and find the most likely value of model parameters that would give rise to the observed data (Meyer 1985; Thompson et al. 2005; Thompson 2019). The Bayesian approach improves the likelihood approach in two ways. First, it incorporates prior knowledge for all model parameters (means and variances), improving estimation (Sorensen and Gianola 2007; Hem et al. 2021). Second, it treats all model parameters in a probabilistically consistent manner such that estimation uncertainty is propagated to all estimated model parameters (Sorensen and Gianola 2007). However, the full probabilistic treatment is computationally demanding, despite the availability of sampling methods such as the Monte Carlo Markov Chain (MCMC) (Gilks et al. 1995; Brooks et al. 2011). We can handle this computational issue with an empirical Bayesian approach. In the marker-based model, the empirical Bayesian approach estimates model variances from the data at hand and, conditional on these, estimates all marker effects jointly to account for the uncertainty of estimating marker effects (uncertainty of estimating model variances is ignored) (Sorensen and Gianola 2007; Efron 1996). The full Bayesian approach accounts for uncertainty in estimating model variances and marker effects; however, MCMC on genome-based models with many individuals or markers can be time-consuming. To this end, various dimensionality-reduction approaches have been proposed, for example, singular value decomposition (SVD) of marker genotypes where we fit a small number of principal components that capture a majority of variance in marker genotypes (Tusell et al. 2013; Ødegård et al. 2018).

Variances from pedigree and genome-based models do not inform about temporal and genomic trends in genetic variance because they pertain to a specific group of individuals and encompass the whole genome (Sorensen and Kennedy 1984; Kennedy et al. 1988; Hayes et al. 2009). However, these models can be used for temporal and genomic analyses of genetic variance with some post-processing. Sorensen et al. (2001) showed how to analyse a temporal trend in genetic variance. They fitted a pedigree-based model and inferred genetic variance for several time partitions by sampling realisations of genetic values from the fitted model and calculating the variance of the realisations partitioned in time groups. They used the Bayesian approach and MCMC, but their concept is general and can be used with other statistical and computational approaches. The critical distinction here is between model fitting to estimate statistical/model parameters and post-processing to estimate quantitative genetics parameters. This distinction enables flexibility to fit a generic model, for example, the LASSO (Tibshirani 1996), and to estimate quantitative genetics parameters by post-processing posterior samples or internally within an analysis programme. It also gives a potential to (partially) address issues with the interpretation of estimated ‘genomic variance’ from genome-based models (Gianola et al. 2009; de los Campos et al. 2015; Rawlik et al. 2020). Lehermeier et al. (2017) used the same approach with the marker-based model and analysed the contribution of linkage-disequilibrium to genetic variance. Recently, Allier et al. (2019) also used the marker-based model on data from a maize breeding programme to infer trends in genetic mean and genetic variance as well as the contribution of allele diversity (genic variance) and of linkage-disequilibrium to genetic variance (Lynch and Walsh 1998; Walsh and Lynch 2018; Bulmer 1971).

This work aims to (i) build and validate a flexible framework based on the work of Sorensen et al. (2001), Lehermeier et al. (2017) and Allier et al. (2019), (ii) show how to evaluate the temporal and genomic analysis of additive genetic variance in different stages of a breeding programme, and (iii) indicate population processes that change genome. We also show how different statistical approaches affect the results. To this end, we have validated our work with a simulated breeding programme, used a marker-based model to estimate marker effects and, based on these, estimated temporal and genomic trends in additive genetic variance. The results show that the framework works well and gives valuable insights, provided that the fitted model captures the trait’s genetic complexity.

Materials and methods

In this section, we present study material and methods in seven parts: (1) simulation of a breeding programme where we generate true genetic values and corresponding variances, and simulated phenotype and marker genotype data, (2) theory for the temporal and genomic analysis of genetic variance assuming we know QTL genotypes and their effects, (3) statistical analysis where we describe marker-based model fitted to the simulated phenotype and marker genotype data, (4) statistical and computational approaches to estimate marker effects, genetic values and variances, (5) validation of the framework with different genetic architectures of a simulated trait, (6) summarising the results and (7) software implementation.

Breeding programme simulation

We simulated an entire wheat breeding programme considering additive genetic architecture for a quantitative trait. We have performed one simulation replicate for most analyses to focus on one dataset, but we also evaluated the consistency of estimates for a subset of analyses on ten simulation replicates. We followed a breeding programme described by Gaynor et al. (2017) with 21 years of a conventional phenotypic selection for yield (Fig. 1). We started with a coalescent simulation of whole-genome sequences for 21 chromosome pairs and extracted random 600 biallelic single-nucleotide polymorphisms (SNP) as markers per chromosome, and randomly assigned 100 SNP as QTL per chromosome. We assumed that the 2100 QTL had an additive effect on yield and sampled their effects from a normal distribution. We coded genotypes as 0 for reference (ancestral) homozygote, 1 for heterozygote and 2 for alternative (derived) homozygote. From the simulated whole-genome sequences, we created 70 inbred lines. The additive genetic variance between these inbred lines was set to 0.1. We crossed the inbred lines to generate 100 biparental populations. Each population had 100 F1 that had their genome doubled and planted in headrows (altogether 10,000). In the headrows, we visually evaluated the lines (trait heritability of 0.1) and advanced the best 500 into a preliminary yield trial. In the preliminary yield trial, we evaluated the lines in an unreplicated trial (trait heritability of 0.2) and advanced the best 50 into an advanced yield trial. In the advanced yield trial, we evaluated the lines in a small multi-location replicated trial (trait heritability of 0.5) and advanced the best 10 into an elite yield trial. In the elite yield trial, we evaluated the lines for two consecutive years in a large multi-location replicated trial (trait heritability of 0.67) and released one variety. We used the best lines from the advanced and elite yield trials as parents to start a new breeding cycle. During the breeding programme simulation process, we generated fully inbred individuals (except in the F1) and, as a consequence, we assumed only additive genetic variation because dominance variance is not visible in inbred individuals, while epistasis variance is generally small (e.g., Hem et al. 2021; Gonzalez-Dieguez et al. 2021).

Fig. 1: Simulated wheat breeding programme with parents, F1 progeny (F1), headrows (HDRW), preliminary yield trial (PYT), advanced yield trial (AYT), elite yield trial (EYT) and a released variety.
figure 1

The shaded rectangle indicates individuals included in the training population for statistical modelling, from year 16 to year 21.

We have saved phenotype and marker genotype data throughout the simulation to generate a training population for genomic modelling. For simplicity, we did not use the genomic data in simulation of selection but only saved it for a retrospective statistical analysis of temporal and genomic trends of genetic variance. To this end, we have constructed a training population that spanned the last 6 years of the simulation, from years 16 to 21. This training population covered 3070 lines with preliminary, advanced and elite yield trial phenotypes (altogether 3420 phenotypes) and corresponding genotypes at 10,500 markers.

Temporal and genomic analysis of genetic variation

Here we describe a theoretical approach to temporal and genomic analysis of genetic variation, assuming we know the QTL genotypes and their effects. In the following sub-sections, we present a framework for temporal and genomic analysis of genetic variation that closely matches this theoretical approach; however, it uses observed phenotypes and marker genotypes to analyse the genetic variation. The theoretical approach consists of four steps. First, we define whole-genome genetic values from QTL genotypes and their effects, and then, we partition individuals and their genetic values by time to calculate genetic variances over these time partitions for temporal analysis. Finally, we partition whole-genome genetic values into chromosome and locus genetic values to calculate genetic variances and covariances over these genomic partitions for genomic analysis. This calculation involves three ‘layers’ of variances: (a) total (whole-genome) genetic variance, (b) chromosome variances alongside linkage-disequilibrium covariances between chromosomes and (c) locus genic variances alongside locus linkage-disequilibrium covariances within chromosomes and locus linkage-disequilibrium covariances between chromosomes. Fourth, we combine temporal and genomic analyses.

First, let Q be ni × nq matrix of QTL genotypes for ni individuals at nq QTL, with Q[i, l] denoting QTL genotype of individual i at locus l. Also, let α be nq × 1 vector of QTL additive effects, with α[l] denoting QTL additive effect at locus l. Whole-genome genetic values of ni individuals are a linear combination of QTL genotypes and their effects, a = Qα, with a[i] denoting genetic value of individual i and a[i : j] denoting genetic values of a set of individuals spanning from the ith individual to the jth individual, inclusive, in the a. Variance of these genetic values is genetic variance, \(Var\left({{{{{\boldsymbol{a}}}}}}\right)=1/n\mathop{\sum }\nolimits_{i = 1}^{n}{\left({a}_{i}-1/n\mathop{\sum }\nolimits_{i = 1}^{n}{a}_{i}\right)}^{2}\). Note that this variance pertains to all ni individuals and might not be an informative measure if these individuals span multiple stages and years of a breeding programme. In fact, directional selection or population structure will likely inflate this variance measure and mislead breeders in overestimating the amount of genetic variance. This is why we need temporal analysis of genetic variance.

Second, for the temporal analysis of genetic variance we partition the vector of genetic values by time and calculate variance for each time partition (Sorensen et al. 2001). For example, assume that individuals and their genetic values are ordered by time and that we partition them into time groups as \(\left\{{{{{{\boldsymbol{a}}}}}}[1:k]\right.\), a[(k + 1): l], a[(l + 1): m], \(\left.\ldots \right\}\). Then the temporal analysis of genetic variance is obtained by calculating variance for each time partition: \({\sigma }_{{a}_{1}}^{2}=Var\left({{{{{\boldsymbol{a}}}}}}[1:k]\right)\), \({\sigma }_{{a}_{2}}^{2}=Var\left({{{{{\boldsymbol{a}}}}}}[(k+1):l]\right)\), \({\sigma }_{{a}_{3}}^{2}=Var\left({{{{{\boldsymbol{a}}}}}}[(l+1):m]\right)\), ….

Third, for genomic analysis of genetic variance we initially partition whole-genome genetic values a into an ni × nc matrix of nc chromosome genetic values Ac such that the sum over chromosome genetic values gives whole-genome genetic values \({{{{{\boldsymbol{a}}}}}}=\mathop{\sum }\nolimits_{c = 1}^{{n}_{c}}{{{{{{\boldsymbol{A}}}}}}}_{c}[:,c]\). We obtain these chromosome genetic values by summing locus genetic values Aq on each chromosome, Ac[i, c] = ∑lQ[i, l]α[l] for l running over \({n}_{{l}_{c}}\) QTL on a chromosome c. Note that \({{{{{\boldsymbol{a}}}}}}=\mathop{\sum }\nolimits_{q = 1}^{{n}_{q}}{{{{{{\boldsymbol{A}}}}}}}_{q}[:,q]=\mathop{\sum }\nolimits_{c = 1}^{{n}_{c}}{\sum }_{l}{{{{{{\boldsymbol{A}}}}}}}_{q}[:,l]\) for l running over \({n}_{{l}_{c}}\) QTL on a chromosome c. To calculate genetic variances over these genomic partitions we will use the variance sum rule Var(x + y) = Var(x) + Var(y) + 2Cov(x, y) and the variance product rule Var(xa) = Var(x)a2. Partitioning of the genetic variance \({\sigma }_{a}^{2}\) by chromosomes gives the sum of nc chromosome variances \(({\sigma }_{a,c}^{2})\) and \({n}_{c}\times \left({n}_{c}-1\right)\) covariances between chromosomes \(\left({\sigma }_{(a,c^{\prime} )(a,c)}\right)\):

$$\begin{array}{ll}Var\left({{{{{\boldsymbol{a}}}}}}\right)={\sigma }_{a}^{2}=Var\left(\mathop{\sum }\limits_{c}^{{n}_{c}}{{{{{{\boldsymbol{A}}}}}}}_{c}[:,c]\right)={\sigma }_{a,1}^{2}+{\sigma }_{a,2}^{2}+\cdots +{\sigma }_{a,{n}_{c}}^{2}+\\ 2\left({\sigma }_{(a,2)(a,1)}+\cdots +{\sigma }_{(a,{n}_{c})(a,{n}_{c}-1)}\right),\end{array}$$

with covariances between chromosomes being between-chromosome linkage-disequilibrium covariances (Fig. 2). Partitioning of a chromosome genetic variance \({\sigma }_{a,c}^{2}\) by loci gives the sum of \({n}_{{l}_{c}}\) locus variances \(({\sigma }_{a,c,l}^{2})\) and \({n}_{l}\times \left({n}_{l}-1\right)\) covariances between loci \(\left({\sigma }_{(a,c,l^{\prime} )(a,c,l)}\right)\):

$${\sigma }_{a,c}^{2}={\sigma }_{a,c,1}^{2}+{\sigma }_{a,c,2}^{2}+\cdots +{\sigma }_{a,c,{n}_{{l}_{c}}}^{2}+2\left({\sigma }_{(a,c,2)(a,c,1)}+\cdots +{\sigma }_{(a,c,{n}_{{l}_{c}})(a,c,{n}_{{l}_{c}}-1)}\right),$$

with locus variances being genic variances and covariances between loci being within-chromosome linkage-disequilibrium covariances (Fig. 2) (Lynch and Walsh 1998; Walsh and Lynch 2018; Bulmer 1971). Locus genic variance is a function of variance in locus genotypes and their allele substitution effects (Falconer and Mackay 1996; Gianola et al. 2009) (using variance product rule):

$${\sigma }_{a,c,l}^{2}=Var\left({{{{{{\boldsymbol{A}}}}}}}_{q}[:,l]\right)=Var\left({{{{{\boldsymbol{Q}}}}}}[:,l]{{{{{\boldsymbol{\alpha }}}}}}[l]\right)={{{{{\boldsymbol{\alpha }}}}}}{[l]}^{T}Var\left({{{{{\boldsymbol{Q}}}}}}[:,l]\right){{{{{\boldsymbol{\alpha }}}}}}[l],$$

where we emphasise that we do not use the common Hardy–Weinberg assumption of \(Var\left({{{{{\boldsymbol{Q}}}}}}[:,l]\right)=2{p}_{l}(1-{p}_{l})\) with pl being allele frequency. Instead, we advocate to calculate empirical variance in observed locus genotypes, \(Var\left({{{{{\boldsymbol{Q}}}}}}[:,l]\right)\). We will return to this point in the discussion. Note that genetic variance at a single-locus is the same as the genic variance. Locus linkage-disequilibrium covariance (=linkage-disequilibrium between locus genetic values) is a function of covariance between genotypes at two loci (=linkage-disequilibrium between locus genotypes) and their allele substitution effects:

$${\sigma }_{(a,c,l^{\prime} )(a,c,l)}={{{{{\boldsymbol{\alpha }}}}}}{[l^{\prime} ]}^{T}Cov\left({{{{{\boldsymbol{Q}}}}}}[:,l^{\prime} ],{{{{{\boldsymbol{Q}}}}}}[:,l]\right){{{{{\boldsymbol{\alpha }}}}}}[l].$$
Fig. 2: Genomic variance partitioning.
figure 2

Illustrative scheme of genomic partitioning of whole-genome genetic variance by chromosomes and loci into genic, and within- and between-chromosome linkage-disequilibrium (LD) components.

We can now partition the whole-genome genetic variance over chromosomes and loci as a sum of locus genic variances, within-chromosome linkage-disequilibrium covariances and between-chromosome linkage-disequilibrium covariances (Fig. 2):

$$\begin{array}{lll}{\sigma }_{a}^{2}=&\mathop{\sum }\limits_{c=1}^{{n}_{c}}\mathop{\sum }\limits_{l=1}^{{n}_{{l}_{c}}}{\sigma }_{a,c,l}^{2}+&(=\,{{\mbox{genic variance}}}\,)\\ &2\mathop{\sum }\limits_{c=1}^{{n}_{c}}\mathop{\sum }\limits_{l=1}^{{n}_{{l}_{c}}-1}\mathop{\sum }\limits_{l^{\prime} =l+1}^{{n}_{{l}_{c}}}{\sigma }_{(a,c,l^{\prime} )(a,c,l)}+&(=\,{{\mbox{within-chromosome linkage-disequilibrium}}}\,)\\ &2\mathop{\sum }\limits_{c=1}^{{n}_{c}-1}\mathop{\sum }\limits_{c^{\prime} =c+1}^{{n}_{c}}\mathop{\sum }\limits_{l=1}^{{n}_{{l}_{c}}}\mathop{\sum }\limits_{l^{\prime} =l}^{{n}_{{l}_{c}}}{\sigma }_{(a,c^{\prime} ,l^{\prime} )(a,c,l)}.&(=\,{{\mbox{between-chromosome linkage-disequilibrium}}}\,)\end{array}$$
(1)

With nl = 2100 QTL spread evenly over nc = 21 chromosomes, the total number of locus combinations is nl × nl = 4,410,000 and the total number of chromosome combinations is nc × nc = 441. The theoretical approach partitions genetic variance into nl = 2100 locus genic variances (nc = 21 chromosome genic variances), \({n}_{c}\times {n}_{{l}_{c}}\times \left({n}_{{l}_{c}}-1\right)=207,900\) locus within-chromosome linkage-disequilibrium covariances (nc = 21 chromosome within-chromosome linkage-disequilibrium covariances), and \({n}_{l}\times {n}_{l}-{n}_{c}\times {n}_{{l}_{c}}\times {n}_{{l}_{c}}=4,197,900\) locus between-chromosome linkage-disequilibrium covariances (nc × nc − nc = 420 chromosome between-chromosome linkage-disequilibrium covariances). In this study we work only with the genome and chromosome level partitioning of genetic variance. For a genome region level partitioning, see Burch et al. (2021). We emphasise these numbers because we often hear colleagues saying that there is no or limited between-chromosome linkage-disequilibrium (due to the lack of physical linkage). However, selection and other population processes can generate non-zero within- and between-chromosome linkage-disequilibrium covariance (Lynch and Walsh 1998; Walsh and Lynch 2018; Rawlik et al. 2020; Bulmer, 1971). Even if the between-chromosome linkage-disequilibrium covariances are very small, there is a very large number of them and they can collectively have a sizeable effect on genetic variance as we show in results. It is important to emphasise the distinction between linkage-disequilibrium between locus genotypes \((Cov\left({{{{{\boldsymbol{Q}}}}}}[:,l^{\prime} ],{{{{{\boldsymbol{Q}}}}}}[:,l]\right))\) and linkage-disequilibrium between locus genetic values \(({{{{{\boldsymbol{\alpha }}}}}}{[l^{\prime} ]}^{T}Cov({{{{{\boldsymbol{Q}}}}}}[:,l^{\prime} ],{{{{{\boldsymbol{Q}}}}}}[:,l]){{{{{\boldsymbol{\alpha }}}}}}[l])\). When looking at a whole-genome level, to obtain a non-zero linkage-disequilibrium covariance contribution to genetic variance, we require a non-zero linkage-disequilibrium between locus genotypes and a population process that couples this linkage-disequilibrium between locus genotypes with QTL effects. Selection or assortative mating are two such population processes because they are driven by QTL effects, while drift is not (Lynch and Walsh 1998; Walsh and Lynch 2018; Rawlik et al. 2019, 2020; Bulmer 1971).

Fourth, for the joint temporal and genomic analysis, we perform genomic partitioning and variance calculations for individuals and their genetic values partitioned by time.

Statistical analysis of observed data

In the previous sub-section, we assumed we know the QTL and their effects. However, in reality, we observe phenotypes and marker genotypes and make inferences based on this information. To this end, we fitted the marker-based model (Meuwissen et al. 2001; Whittaker et al. 2000; de los Campos et al. 2013):

$$\begin{array}{l}{{{{{\boldsymbol{y}}}}}}={{{{{\boldsymbol{Xb}}}}}}+{{{{{\boldsymbol{ZWm}}}}}}+{{{{{\boldsymbol{e}}}}}},\\ {{{{{\boldsymbol{m}}}}}} \sim {{{{{\mathcal{N}}}}}}({{{{{\boldsymbol{0}}}}}},{{{{{{\boldsymbol{I}}}}}}}_{{n}_{m}}{\sigma }_{m}^{2}),\ \,{{\mbox{and}}}\,\ {{{{{\boldsymbol{e}}}}}} \sim {{{{{\mathcal{N}}}}}}({{{{{\boldsymbol{0}}}}}},{{{{{{\boldsymbol{I}}}}}}}_{{n}_{y}}{\sigma }_{e}^{2}),\end{array}$$
(2)

where, y is an ny × 1 vector of ny phenotypic values, X is an ny × nb incidence matrix associated with the intercept and nb − 1 year effects b, Z is an ny × ni incidence matrix for ni lines whose marker genotype data are in an ni × nm matrix W for nm marker effects m, and e is an ny × 1 vector of ny residuals. In this study ny was 3420, nb was 6, ni was 3070 and nm was 10,500. We assumed that marker effects are a priori uncorrelated and normally distributed with zero mean and variance component describing variation between marker effects \({\sigma }_{m}^{2}\) (Eq. (2)). We further assumed that residuals are uncorrelated and normally distributed with zero mean and residual variance \({\sigma }_{e}^{2}\) (Eq. (2)). We ignored that different yield trials had different amount of replication and therefore different error variance.

The model (Eq. (2)) has location parameters (means) b and m and dispersion parameters (variances) \({\sigma }_{m}^{2}\) and \({\sigma }_{e}^{2}\). We emphasise that \({\sigma }_{m}^{2}\) is variance between marker effects and note that the commonly used approximation for ‘genomic variance’ \({\sigma }_{m}^{2}\times 2\mathop{\sum }\nolimits_{m=1}^{{n}_{m}}{p}_{m}(1-{p}_{m})\) (VanRaden 2008; Hayes et al. 2009) is scaled variance between marker effects and not genetic variance (Gianola et al. 2009; de los Campos et al. 2015; Rawlik et al. 2020). The scaling factor is the sum of expected variances for marker genotypes assuming Hardy–Weinberg equilibrium. Comparison of this approximation with Eq. (1) shows that the approximation ignores linkage-disequilibrium and non-Hardy–Weinberg components of genetic variance as well as uses variance between marker effects instead of QTL effects. However, linkage-disequilibrium affects the estimate of variance between marker effects. At any rate, this estimate of genetic variance is not amenable for our aim of doing temporal or genomic analyses. We view variance between marker effects simply as a statistical/model parameter that facilitates model fitting to observed data. We describe the model fitting and estimation of variances in the next sub-section.

Statistical and computational approaches

We used the empirical and full Bayesian approach to estimate the model’s parameters (Eq. (2)) with marker genotypes or their leading principal components. Given the variances \({\sigma }_{m}^{2}\) and \({\sigma }_{e}^{2}\), we can estimate fixed effects b and marker effects m by solving Henderson’s mixed model equations:

$$\left[\begin{array}{ll}{{{{{{\boldsymbol{X}}}}}}}^{T}{{{{{\boldsymbol{X}}}}}}&{{{{{{\boldsymbol{X}}}}}}}^{T}{{{{{\boldsymbol{ZW}}}}}}\\ {{{{{{\boldsymbol{W}}}}}}}^{T}{{{{{{\boldsymbol{Z}}}}}}}^{T}{{{{{\boldsymbol{X}}}}}}&{{{{{{\boldsymbol{Z}}}}}}}^{T}{{{{{{\boldsymbol{W}}}}}}}^{T}{{{{{\boldsymbol{W}}}}}}{{{{{\boldsymbol{Z}}}}}}+{{{{{\boldsymbol{I}}}}}}{\sigma }_{e}^{2}{\sigma }_{m}^{-2}\end{array}\right]\left[\begin{array}{l}\hat{{{{{{\boldsymbol{b}}}}}}}\\ \hat{{{{{{\boldsymbol{m}}}}}}}\end{array}\right]=\left[\begin{array}{l}{{{{{{\boldsymbol{X}}}}}}}^{T}{{{{{\boldsymbol{y}}}}}}\\ {{{{{{\boldsymbol{Z}}}}}}}^{T}{{{{{{\boldsymbol{W}}}}}}}^{T}{{{{{\boldsymbol{y}}}}}}\end{array}\right].$$
(3)

Specifically, the solution of Eq. (3) is the conditional expectation \((\hat{{{{{{\boldsymbol{b}}}}}}},\hat{{{{{{\boldsymbol{m}}}}}}})=E\left({{{{{\boldsymbol{b}}}}}},{{{{{\boldsymbol{m}}}}}}| {{{{{\boldsymbol{y}}}}}},{\sigma }_{m}^{2},{\sigma }_{e}^{2}\right)\). With these estimates we can obtain estimates of genetic values as \(\hat{{{{{{\boldsymbol{a}}}}}}}={{{{{\boldsymbol{W}}}}}}\hat{{{{{{\boldsymbol{m}}}}}}}\). These estimates have some error and ignoring it in the framework will underestimate genetic variance. To see this, imagine we have very little phenotypic information such that marker estimates will effectively follow the prior Eq. (2). In that case, marker estimates will effectively all equal zero and any variance calculation will return a zero. As shown by Sorensen et al. (2001) and Lehermeier et al. (2017), we can account for this uncertainty by estimating genetic variances from posterior samples of genetic values or marker effects. For the model (Eq. (2) and (3)), we can obtain posterior samples from the multivariate normal distribution:

$${{{{{\mathcal{N}}}}}}\left(E\left({{{{{\boldsymbol{b}}}}}},{{{{{\boldsymbol{m}}}}}}| {{{{{\boldsymbol{y}}}}}},{\sigma }_{m}^{2},{\sigma }_{e}^{2}\right),Var\left({{{{{\boldsymbol{b}}}}}},{{{{{\boldsymbol{m}}}}}}| {{{{{\boldsymbol{y}}}}}},{\sigma }_{m}^{2},{\sigma }_{e}^{2}\right)\right),$$
(4)

where conditional variance \(Var({{{{{\boldsymbol{b}}}}}},{{{{{\boldsymbol{m}}}}}}| {{{{{\boldsymbol{y}}}}}},{\sigma }_{m}^{2},{\sigma }_{e}^{2})\) can be obtained by solving the left-hand-side of the system of equations (Eq. (3)) (Sorensen and Gianola 2007).

Once we obtained samples of marker effects from Eq. (4), we have treated marker genotypes and marker effects respectively as if they were QTL genotypes and QTL effects and analysed temporal and genomic trends in genetic variance as described in the theoretical sub-section. Specifically, for each sample of marker effects, we have estimated genetic values and their variance for each group of individuals in the breeding programme (parents, F1 progeny, headrows, etc.) for each year for the temporal analysis and further partitioned along the genome for the genomic analysis. This procedure gave us posterior distribution for the genetic variance of each group, time and genome partition.

When variances are unknown, we can use the empirical Bayesian approach (Sorensen and Gianola 2007; Efron 1996) and estimate most likely variances given the data and use them to calculate conditional expectation and variance as well as draw samples from Eq. (4). Alternatively, we can use the full Bayesian approach by specifying prior distribution for all model parameters and obtain posterior distribution \(p({{{{{\boldsymbol{b}}}}}},{{{{{\boldsymbol{m}}}}}},{\sigma }_{m}^{2},{\sigma }_{e}^{2}| {{{{{\boldsymbol{y}}}}}})\propto p({{{{{\boldsymbol{y}}}}}}| {{{{{\boldsymbol{b}}}}}},{{{{{\boldsymbol{m}}}}}},{\sigma }_{e}^{2})p({{{{{\boldsymbol{b}}}}}}| {\sigma }_{b}^{2})p({{{{{\boldsymbol{m}}}}}}| {\sigma }_{m}^{2})p({\sigma }_{b}^{2})p({\sigma }_{m}^{2})p({\sigma }_{e}^{2})\) (Sorensen and Gianola 2007).

We fitted the model (Eq. (2)) both with the full and the empirical Bayesian approach. We first used MCMC for the full Bayesian approach and used one chain with 100,000 samples, 10,000 burn-in and saved every 100th sample to obtain 900 samples of all model parameters. For the empirical Bayesian approach, we also obtained 900 samples but used posterior mean for the marker effect and residual variances estimated from the full Bayesian approach when sampling from Eq. (4).

Since genomic analyses can be time-consuming, we have also investigated the use of approximation for marker genotypes with their leading principal components. We changed the model (Eq. (2)) into:

$$\begin{array}{l}{{{{{\boldsymbol{y}}}}}}={{{{{\boldsymbol{Xb}}}}}}+{{{{{\boldsymbol{ZTs}}}}}}+{{{{{\boldsymbol{e}}}}}},\\ {{{{{\boldsymbol{s}}}}}} \sim N({{{{{\boldsymbol{0}}}}}},{{{{{{\boldsymbol{I}}}}}}}_{{n}_{p}}{\sigma }_{s}^{2}),\ \,{{\mbox{and}}}\,\ {{{{{\boldsymbol{e}}}}}} \sim N({{{{{\boldsymbol{0}}}}}},{{{{{{\boldsymbol{I}}}}}}}_{{n}_{y}}{\sigma }_{e}^{2}),\end{array}$$
(5)

where T is an ni × np score matrix obtained from a truncated SVD of genotypes with the np leading principal components such that \({{{{{{\boldsymbol{T}}}}}}}_{({n}_{i}\times {n}_{p})}={{{{{{\boldsymbol{U}}}}}}}_{({n}_{i}\times {n}_{p})}{{{{{{\boldsymbol{S}}}}}}}_{({n}_{p}\times {n}_{p})}={{{{{{\boldsymbol{U}}}}}}}_{({n}_{i}\times {n}_{p})}{{{{{{\boldsymbol{S}}}}}}}_{({n}_{p}\times {n}_{p})}{{{{{{\boldsymbol{V}}}}}}}_{({n}_{m}\times {n}_{p})}^{T}{{{{{{\boldsymbol{V}}}}}}}_{({n}_{m}\times {n}_{p})}={{{{{{\boldsymbol{W}}}}}}}_{({n}_{i}\times {n}_{m})}{{{{{{\boldsymbol{V}}}}}}}_{({n}_{m}\times {n}_{p})}\), s is an np × 1 vector of np principal component effects and \({\sigma }_{s}^{2}\) is variance between principal component effects (Tusell et al. 2013; Ødegård et al. 2018; Hastie and Tibshirani 2004). The S matrix is a diagonal matrix with np singular values (square root of non-zero eigenvalues of WTW and WWT), the columns of U are left singular vectors, and the columns of V are right singular values. This model is structurally the same as the model (Eq. (2)) and we fitted it in the same way. We approximated marker effect samples by mi = Vsi, where si is the ith sample of principal component effects. Once we approximated marker effect samples we used the same approach as described above. We investigated different number of principal components (10, 50, 100, 500, 1000, 2000 and 3420). In our simulation these numbers of principal components respectively explained 14%, 38%, 52%, 84%, 94%, 99% and 100% of marker genotype variation in the first replicate.

Sensitivity to genetic architecture

To test our framework’s sensitivity to different genetic architectures, we have done additional simulations by varying the number of QTL and by adding genotype-by-environment interactions. Namely, the framework will depend on the ability of the fitted statistical model to capture the genetic complexity of the analysed trait. We have simulated an additive trait with either 10, 100 or 1000 QTL per chromosome, respectively with 210, 2100 or 21000 QTL per genome. In addition, we have added variation in QTL effects across years for genotype-by-environment interactions across years, assuming that years represent different environments. The amount of this additional phenotype variance due to genotype-by-year interactions was set to 0.2. In total, this gave us six scenarios (three for the number of QTL and two for absence/presence of genotype-by-year interactions). In each of the scenarios, we used the standard model (Eq. (2)) that is ignorant about the number of QTL or the presence of genotype-by-year interactions.

Summarising the results

We compared how obtained posterior distributions for genetic variances and their components match the true values from simulation. We also calculated the continuous ranked probability score (CRPS) (Gneiting and Raftery 2007) to compare whole posterior distributions to true values to assess both accuracy and precision and with this account for the uncertainty of estimation. For an intuitive description of CRPS, see Selle et al. (2019). Finally, we also calculated the concordance correlation coefficient (CCC) (Lin 1989) to additionally assess agreement between the true and estimated values of genetic and genic variance in some analyses. We used CCC because it has two clear components—the Pearson correlation coefficient indicating precision (closeness to the best-fit line between the true and estimated values) and the bias correction factor indicating accuracy (closeness to the equality line between the true and estimated values).

Software implementation

We have simulated the wheat breeding programme using the R package AlphaSimR (Gaynor et al. 2021). We have fitted the model with the AlphaBayes software (source code at https://github.com/AlphaGenes/alphabayes) (Gorjanc and Hickey 2019). We used R (R Core Team 2019) for post-processing the AlphaBayes marker effect samples and further analyses. To calculate the CRPS (Jordan et al. 2019), we used the scoringRules R package. To calculate the CCC (Lin 1989), we used the DescTools R package (Signorell et al. 2021).

Results

Overall, the results show that estimates from the data following the framework were in concordance with the true values for temporal and genomic analysis, provided that the fitted model is capturing the genetic complexity of a trait. We separate the result section into four areas to facilitate presentation: (1) temporal analysis, (2) genomic analysis, (3) computational analysis and (4) sensitivity to genetic architecture.

Temporal analysis

The genetic and genic variance changed through the breeding cycle. We show this in Fig. 3 with the true and estimated genetic and genic variances for different stages of one breeding cycle (see breeding scheme on the left in Fig. 1). As expected, genetic variation in F1 progeny across multiple crosses was lower than in the parents as this variance indicates variance in parent averages between crosses. When we generated doubled haploids for these full-sib families (HDRW stage), genetic variation was regenerated to the level in parents due to recombination and complete inbreeding. Genetic variation gradually reduced through the breeding cycle due to the repeated selection from headrows to elite yield trial. This change was particularly evident for genetic variance but less for genic variance. Also, the genetic variance was consistently smaller than the genic variance. The estimated genetic and genic variance matched the true values well across all breeding stages. There was more uncertainty in the estimates of genetic variance in the elite yield trial than in other stages.

Fig. 3: Breeding analysis.
figure 3

Genetic (A) and genic (B) variance estimated with the full Bayesian approach for parents in year 16, F1 progeny (F1) in year 17, headrows (HDRW) in year 18, preliminary yield trial (PYT) in year 19, advanced yield trial (AYT) in year 20 and elite yield trial (EYT) in year 21; black lines denote the true values and densities depict posterior distributions.

Genetic variation decreased over the years and genetic variance was consistently smaller and more variable than genic variance across years. We show this in Fig. 4 with the true and estimated temporal trends of genetic and genic variances for different breeding stages (see temporal scheme on the right in Fig. 1). Variances between the breeding stages differed as mentioned before, but in Fig. 4 we also see a consistent decrease over the years, which was variable for genetic variance but not for genic variance. Furthermore, this variability increased from early to late breeding stages as fewer and fewer individuals were in a stage. Thus, the genetic and genic variance estimates have matched the true values very well across all breeding stages and years.

Fig. 4: Temporal analysis.
figure 4

Temporal trends in genetic (A) and genic (B) variance estimated with the full Bayesian approach for parents, F1 progeny (F1), headrows (HDRW), preliminary yield trial (PYT), advanced yield trial (AYT) and elite yield trial (EYT); solid lines denote the true value, dashed lines denote posterior means and polygons depict 95% posterior quantiles.

Genomic analysis

The genomic analysis enabled accurate partitioning of whole-genome genetic variance into whole-genome genic variance and whole-genome linkage-disequilibrium covariances. We show this in Fig. 5 with true and estimated variances and covariances for headrows and elite yield trial from one breeding cycle. Figure 5 shows, as previously described, differences in genetic and genic variances as well as a substantial change in the between-chromosome linkage-disequilibrium covariance, which was the main driver of change in genetic variance between headrows and the elite yield trial. Specifically, genetic variance decreased from 0.0754 in headrows in year 18 to 0.0307 in the elite yield trial in year 21, with a change of 0.0447 (59% reduction). This overall change was due to 0.01 change in genic variance (22% of the initial genetic variance), 0.0036 change in within-chromosome linkage-disequilibrium covariance (8% of the initial genetic variance) and 0.0311 change in between-chromosome linkage-disequilibrium covariance (70% of the initial genetic variance). The estimates matched the true values well. Supplementary Fig. S1 shows temporal trends for within-chromosome and between-chromosome linkage-disequilibrium. Between-chromosome linkage-disequilibrium varied more and decreased over time.

Fig. 5: Genomic analysis.
figure 5

Whole-genome genetic and genic variances, and within- and between-chromosome linkage-disequilibrium (LD) covariances with the full Bayesian approach for headrows (HDRW, year 18) and elite yield trial (EYT, year 21); genetic variance is the sum of genic variance, within- and between-chromosome LD (see Fig. 2); black lines denote true values and violins depict posterior distributions.

The genomic analysis also enabled accurate partitioning of whole-genome genetic variance by chromosomes. We show this in the Supplementary material with a series of Supplementary Tables S1S4 and Supplementary Fig. S2. These supplements show genetic variance and its components (genic variance, within-chromosome linkage-disequilibrium covariance and between-chromosome linkage-disequilibrium covariance) by 21 chromosomes and how these values add up to the whole-genome variance. Specifically, the genetic variance of a quantitative trait is composed of (i) variation at the QTL genotypes (Supplementary Table S1) and (ii) variation of QTL effects, which combined with variation at the QTL genotypes gives the genetic variance of a quantitative trait (Supplementary Table S2). However, in reality we do not know QTL, we only know SNP markers, so we can only calculate (iii) variation at the genotypes of SNP markers (Supplementary Table S3) and (iv) estimate SNP marker effects, which combined with variation at the SNP marker genotypes gives an estimate of the genetic variance of a quantitative trait (Supplementary Table S4). Therefore, we are showing partitioning of genetic variance for QTL genotypes (Supplementary Table S1), marker genotypes (Supplementary Table S3), true genetic values (Supplementary Table S2) and estimated genetic values (Supplementary Table S4). The variance at QTL genotypes (Supplementary Table S1) and SNP marker genotypes (Supplementary Table S3) will likely differ because they are respectively a function of the number of QTL and SNP markers and their respective variation. Supplementary Table S1 reports genetic variance of QTL genotypes across all chromosomes of 1213.1 (this is based on 2100 QTL), while Supplementary Table S3 reports genetic variance of SNP marker genotypes across all chromosomes of 6452.6 (this is based on 10,500 SNP markers). Our aim is that the true genetic variance for the quantitative trait (Supplementary Table S2) and its estimate from SNP markers (Supplementary Table S4) will be similar. Supplementary Table S2 reports genetic variance for the quantitative trait of 0.082 (the true value), while Supplementary Table S4 reports estimated genetic variance from our framework of 0.079. The true and estimated values match well and the same holds for individual chromosomes, but there is larger variation, which is expected because there is less information per chromosome.

Supplementary Fig. S2 compares the true and estimated genetic variances directly. The Supplementary material, along with Supplementary Tables S1S4, aims to demonstrate how we estimate variation in true genetic values, which is driven by unknown QTL and unknown QTL effects, by using marker genotypes and estimated marker effects. We make five observations. First, the analysis of QTL genotypes showed that whole-genome and chromosome genetic variance in unselected headrows is largely driven by genic variance, but there are some chromosomes with a substantial within-chromosome or between-chromosome linkage-disequilibrium covariance (Supplementary Table S1). Second, the magnitude of linkage-disequilibrium covariances increased in the elite yield trial, which reduced the whole-genome genetic variance; however, between-chromosome linkage-disequilibrium was larger than within-chromosome linkage-disequilibrium (Supplementary Table S1). Third, the analysis of marker genotypes followed the same trends, but the values were sustainability larger due to the larger number of markers than QTL (Supplementary Table S3). Fourth, the analysis of true genetic values resulted in much smaller values for variances than the analysis of QTL genotypes because most QTL have small effects, but the relative magnitude of variation and their partitioning were similar (Supplementary Table S2). Fifth, the analysis of variance of estimated genetic values followed the analysis of variance of true genetic values closely—most deviations were observed for the elite yield trial, but all posterior distributions encompassed the true value (Supplementary Table S4). This analysis pertains to one single dataset to show that estimates are reasonable for a specific dataset.

Computational analysis

Full and empirical Bayesian approaches had similar posterior mean estimates of variances, but the empirical Bayesian approach had smaller posterior standard deviation. We show this in Fig. 6 with a comparison of posterior means and posterior standard deviations for genetic and genic variance between the two approaches. The posterior means matched well for both types of variances. However, the posterior standard deviation was smaller with the empirical Bayesian approach, particularly for the genic variance. Comparison with the true values, however, showed good concordance with the empirical Bayesian posterior means (Supplementary Figs. S3 and S4).

Fig. 6: Statistical analysis.
figure 6

The empirical Bayesian approach versus the full Bayesian approach for posterior mean of genetic variance (A), posterior mean of genic variance (B), posterior standard deviation of genetic variance (C) and posterior standard deviation of genic variance (D); equal value is represented by the dashed red line.

Additional evaluation with multiple replicates showed that the full and empirical Bayesian results were consistently estimated for genetic and genic variance. We show this in Table 1 with CRPS of genetic and genic variances for full and empirical Bayesian approaches by breeding stage. Note that CRPS is negatively oriented—lower values indicate better estimates compared to the true value in terms of accuracy and precision. CRPS for genetic variance matched closely between the full and empirical Bayesian approaches. On the other hand, they differ more for genic variance, with better (lower) values for the full Bayesian approach, albeit there was considerable variability across years and replicates. Moreover, CRPS was larger (worse) for genic variance than for genetic variance.

Table 1 Continuous ranked probability score (CRPS × 1000 – lower is better: mean ± standard deviation over 6 years and ten replicates) for genetic and genic variance estimated by the full Bayesian and the empirical Bayesian for parents, F1 progeny, headrows (HDRW), preliminary yield trial (PYT), advanced yield trial (AYT) and elite yield trial (EYT).

When we used a sufficient number of principal components, approximation with leading principal components accurately estimated genetic variance, but this was never the case for genic variance. We show this in Fig. 7 with estimation error, defined as the difference between the true and estimated value, for genetic and genic variance as a function of the number of leading principal components. The estimation error decreased as we increased the number of leading principal components. It decreased quickly for the genetic variance—there was no error once we captured about 80% of the variation in marker genotypes. In our simulated dataset from the first replicate, we achieved this with 500 leading principal components. On the other hand, the estimation error decreased slowly for the genic variance, and we never recovered the true estimate, even if we used all the principal components. The estimation error was smallest in the F1 progeny, followed by the elite yield trial, while the largest estimation error was in the parents.

Fig. 7: Approximation error with dimension reduction.
figure 7

Estimation error in genetic and genic variances as a function of the number of principal components in parents in year 16, F1 progeny (F1) in year 17, headrows (HDRW) in year 18, preliminary yield trial (PYT) in year 19, advanced yield trial (AYT) in year 20 and elite yield trial (EYT) in year 21; horizontal dashed line represents no estimation error.

Sensitivity to genetic architecture

Our framework relies on a statistical model that can capture the genetic complexity of an analysed trait. We have tested the effect of using a sub-optimal statistical model by varying the number of QTL and by adding genotype-by-year interaction in our simulation, without accounting for these complexities in the statistical model. Results in Supplementary Figs. S5 and S6 and Supplementary Table S5 show that variance estimates are very sensitive to genotype-by-year interaction and less to the number of QTL.

For scenarios without genotype-by-year interaction, the estimates of genetic and genic variance were very much in line with the true values and largely insensitive to the number of QTL (Supplementary Figs. S5 and S6 and Supplementary Table S5). Furthermore, concordance correlation and its two components (Pearson correlation and bias correction factor—Supplementary Table S5) showed good agreement between the true and estimated values for genetic and genic variances, with high precision and low bias.

For scenarios with genotype-by-year interaction, we can see substantial overestimation of genetic and genic variances (Supplementary Figs. S5 and S6). This overestimation decreased as the number of QTL increased, but even with 21,000 QTL, we still overestimated the variances by as much as 200%. Estimates of genic variance showed systematically consistent overestimation, while estimates of genetic variance were more variable. Concordance correlation and its two components (Pearson correlation and bias correction factor—Supplementary Table S5) showed low to moderate agreement between true and estimated values for genetic and genic variances. The Pearson correlation for genic variance was high, which indicates precise estimation. Although the true and estimated genic variances had a high linear relationship (high Pearson correlation), their estimates were biased (low bias correction factor) (Supplementary Table S5). For genetic estimates, we can see a moderate Pearson correlation and a moderate bias correction factor. Therefore, the addition of genotype-by-year interaction biased the genetic and genic variances estimates but decreased the precision for only genetic variances.

Discussion

The results show that the framework for temporal and genomic analysis of genetic variation is flexible, accurate and enables assessing the sustainability of a breeding programme as well as population processes that change genetic variance. These results highlight four topics for discussion in line with the structure of results: (1) temporal analysis of genetic variance, (2) genomic analysis of genetic variance, (3) computational aspects and (4) assumptions of this study.

Temporal analysis

This study will help breeders assess the amount of genetic variance in their programmes and better manage its utilisation for future genetic gains. Genetic variance (specifically its square root) is a key component of the breeder’s equation for predicting response to selection (Lush 1937; Falconer and Mackay 1996). While breeding programmes routinely estimate genetic variance for traits under selection, most estimates pertain to a group of individuals that is arguably not the most relevant for routine breeding (Piepho et al. 2008). Specifically, with the pedigree-based model, the estimate of genetic variance pertains to pedigree founders, which can be several generations removed from currently interesting individuals. Furthermore, pedigree founders often span multiple generations due to incomplete pedigrees and, as such, the corresponding estimate of genetic variance does not have a clearly defined time point. Estimates of genetic variance from genome-based models pertain to the group of individuals for which the allele frequencies were computed—usually for the genotyped individuals or base population, both of which again do not have a clearly defined time point. In addition, the ‘genomic variance’ is estimating genetic variance only under some conditions (Gianola et al. 2009; de los Campos et al. 2015; Rawlik et al. 2020; Schreck et al. 2019). Therefore, we propose an alternative framework for temporal and genomic analyses of genetic variation.

The framework builds on Sorensen et al. (2001), Lehermeier et al. (2017) and Allier et al. (2019) to enable a straightforward temporal analysis of a breeding programme. The framework uses all the available data spanning multiple years (generations) to estimate model parameters, which are used to infer genetic values and their variances. Such flexibility of using all data but producing estimates for any group of individuals is crucial to inform breeders how much genetic variance they have at hand and to react accordingly. Possible reactions to a temporal analysis by a breeder could be (i) keeping the current breeding programme as it is, (ii) implementing active management of genetic variance using techniques such as optimal contribution selection (e.g., Woolliams et al. 2015; Akdemir and Sanchez 2016; Gorjanc et al. 2018; Akdemir et al. 2019), (iii) germplasm exchange with other programmes or, in the extreme, (iv) introgressing landrace germplasm (e.g., Gorjanc et al. 2016). For example, temporal trends in genetic and genic variance enable straightforward trait-specific estimation of effective population size (Gorjanc et al. 2018). Using this approach in this study, we estimate the effective population size for the parents at 111. This estimate suggests that the simulated breeding programme is sustainable (Falconer and Mackay 1996; Lynch and Walsh 1998; Walsh and Lynch 2018; Hill 2016) as corroborated by small changes in genetic variance between years.

There are also other approaches to the temporal analysis of genetic variance. Tsuruta et al. (2004) used the random regression to model genetic values and their variance over the years, and Hidalgo et al. (2020) used sliding time intervals in the same fashion. Both methods have some drawbacks—random regression can be computationally demanding, while time intervals must be sufficiently large to obtain accurate estimates. These two approaches respectively enrich the model or slice the data to estimate genetic variances over time as model parameters, while our framework treats model variance parameters and genetic variances over time separately. We will return to these differences at the end of the discussion. Hidalgo et al. (2020) used sliding time intervals to investigate changes in genetic (co)variances for a breeding programme that recently implemented genomic selection. They observed rapid changes in genetic (co)variances with the implementation of genomic selection. Their results highlight a need for breeder’s reaction and further analysis. One such analysis should be related to which components of genetic variance changed due to genomic selection.

Genomic analysis

The proposed framework can estimate the size and trends for genomic components of genetic variance. We have followed a standard quantitative genetics decomposition of genetic variance (Lynch and Walsh 1998; Walsh and Lynch 2018; Gianola et al. 2009; Bulmer 1971). This decomposition involves variance of genotypes and their allele substitution effects at every locus (genic variance) and covariance between genotypes and their allele substitution effects between pairs of loci. The covariance can be further partitioned into covariance between loci on one chromosome (within-chromosome linkage-disequilibrium covariance) and covariance between loci on different chromosomes (between-chromosome linkage-disequilibrium covariance). We showed this decomposition for the variance of QTL genotypes (Supplementary Table S1), marker genotypes (Supplementary Table S3), true genetic values (Supplementary Table S2) and estimated values (Supplementary Table S4), all at the whole-genome and chromosome level. These results confirmed the prediction of Bulmer (1971) that directional selection on total genetic values or downstream phenotypes induces negative linkage-disequilibrium between QTL and other loci around them and that this component can cause rapid and major changes in genetic variance (Lynch and Walsh 1998; Walsh and Lynch 2018). We note that this negative linkage-disequilibrium is a function of genotype combinations between loci as well as their allele substitution effects. Therefore, we have to distinguish two types of linkage-disequilibrium: (i) linkage-disequilibrium between locus genotypes \(\left(Cov\left({{{{{\boldsymbol{Q}}}}}}[:,l^{\prime} ],{{{{{\boldsymbol{Q}}}}}}[:,l]\right)\right)\), which is a function of allele frequencies at loci and correlation between loci (this linkage-disequilibrium is trait agnostic); and (ii) linkage-disequilibrium between locus genetic values \(\left({{{{{\boldsymbol{\alpha }}}}}}{[l^{\prime} ]}^{T}Cov\left({{{{{\boldsymbol{Q}}}}}}[:,l^{\prime} ],{{{{{\boldsymbol{Q}}}}}}[:,l]\right){{{{{\boldsymbol{\alpha }}}}}}[l^{\prime} ]\right)\), which is a function of allele frequencies at loci, correlation between loci and allele substitution effects of the loci. Only linkage-disequilibrium between locus genetic values contributes to the genetic variance of a trait. As mentioned in methods, linkage-disequilibrium between locus genetic values is induced by population processes that are driven by QTL effects. For example, selection on a trait will induce linkage-disequilibrium between locus genetic values because the trait is influenced by QTL effects. On the other hand, drift will induce linkage-disequilibrium between locus genotypes, but not linkage-disequilibrium between locus genetic values because drift is not driven by QTL effects. While drift can generate sporadic linkage-disequilibrium between locus genetic values, the sum of such terms will tend to zero for traits with a large number of QTL. To see this behaviour, consider linkage-disequilibrium between locus genetic values \(\left({{{{{\boldsymbol{\alpha }}}}}}{[l^{\prime} ]}^{T}Cov\left({{{{{\boldsymbol{Q}}}}}}[:,l^{\prime} ],{{{{{\boldsymbol{Q}}}}}}[:,l]\right){{{{{\boldsymbol{\alpha }}}}}}[l]\right)\) and summing such terms over all pairs of loci. Even if there is non-zero linkage-disequilibrium between locus genotypes, the multiplication with seemingly randomly allocated QTL effects (since linkage-disequilibrium between genotypes is not driven by QTL effects) will drive the sum to zero. This ‘cancelling out’ will diminish when we analyse smaller genome regions or traits with a smaller number of QTL.

The importance of linkage-disequilibrium in estimating genetic variance with genomic data is growing (de los Campos et al. 2015; Lehermeier et al. 2017; Allier et al. 2019). Our study added to this literature with simulation and demonstrating temporal changes in linkage-disequilibrium under selection both within one breeding cycle (headrows to elite yield trial) (Fig. 3) and between breeding cycles over the years (Fig. 4). We observed more considerable changes within breeding cycles than between, which can be explained by strong repeated selection within cycles and recombination among parent genomes between cycles. Interestingly, we observed large between-chromosome linkage-disequilibrium covariance in comparison to within-chromosome (Fig. 5). This observation is at odds with physical linkage between loci within a chromosome and no physical linkage between loci on separate chromosomes. Our explanation is that there are more combinations between loci on separate chromosomes than within chromosomes. Furthermore, limited recombination constrains selection to induce large changes in linkage-disequilibrium within chromosomes compared to between chromosomes. To put this into perspective, in the analysed example, we observed a 59% change in genetic variance within a breeding cycle (headrows to elite yield trial). Of this change, 13% (22% relative) was due to the change in genic variance, 5% (8% relative) was due to the change in within-chromosome linkage-disequilibrium covariance and 41% (70% relative) was due to the change in between-chromosome linkage-disequilibrium covariance (Fig. 5). Even though we randomly placed QTL and randomly allocated QTL effects from one distribution (in the founding population before selection), the components of genetic variance varied considerably between chromosomes. These assumptions are likely too simple. Indeed, Allier et al. (2019) observed substantial variation between chromosomes in maize. These results indicate that linkage-disequilibrium is an important component of the genetic variance in line with the theoretical work of Bulmer (1971) and Mather and Jinks (2013).

We expected underestimation of genic variance in this breeding study but have not observed this. We have simulated a breeding programme with directional selection, which induces negative linkage-disequilibrium (Bulmer 1971) due to repulsion linkage of QTL (Mather and Jinks 2013). We expected that repulsion linkage would ‘hide’ variation in some genome regions (due to a lack of variation in haplotypes) and, therefore, underestimate genic variance. We likely did not observe underestimation because the effective population was reasonably large (111), the selection was not too strong or there was a sufficient number of markers. However, across multiple replicates, the CRPS was worse for genic than genetic variance, which indicates that estimating genic variance is more challenging than for genetic variance.

The proposed framework for genomic analysis of genetic variance will pave the way for analysing population processes that change the variance. While selection induces linkage-disequilibrium between loci, it also changes allele frequencies (Lynch and Walsh 1998; Walsh and Lynch 2018; Bulmer 1971; Gorjanc et al. 2015). Interestingly, we have not observed major changes in genic variance, suggesting that allele frequencies have not changed much. However, we must emphasise that we have reported genic variance for all loci, i.e., the sum of all locus-specific genic variances. This total genic variance is likely to change far less than genic variance at individual loci because at some loci allele frequency might change away from 0.5 (genic variance at these loci will decrease), but at other loci, allele frequency might change towards 0.5 (genic variance at these loci will increase) (Crow 2010). Therefore, when we sum locus-specific genic variances, the total genic variance will not change much. More research is needed to study trends in allele frequencies for old and recent mutations, their locus-specific genic variances, and total genic variance.

Another important population process is drift, which is always present in breeding programmes due to small effective population sizes. However, distinguishing between selection and drift in such populations is difficult (Lynch and Walsh 1998; Walsh and Lynch 2018; Gorjanc et al. 2015) and drift does not induce significant linkage-disequilibrium covariance at the whole-genome level. Assortative mating is yet another process, which can induce significant linkage-disequilibrium covariance as the whole-genome level (e.g., Lynch and Walsh 1998; Walsh and Lynch 2018; Rawlik et al. 2020, 2019).

Similarly, population structure and admixture between populations can influence genetic variance and should be addressed in the future. One way to treat population structure would be to partition individuals by sub-population and calculate separate genetic variances and covariances between sub-populations. However, this approach breaks down with admixture. Admixture could be approached by using whole population genome trees with recombination (Kelleher et al. 2019) and labelling individuals and genome segments with originating sub-populations, and expanding the framework into a population analysis of genetic variance. Finally, there is also the classic partitioning of genetic variance between and within families. An interesting line of future research is how different types of directional selection (between or within family and their combination) changes genic and linkage-disequilibrium components of genetic variance.

A final note on genomic analysis is that the proposed framework does not depend on the assumption of Hardy–Weinberg and linkage equilibrium. It is common to see expressions for genetic variance at a locus of the form 2p(1 − p)α2, which assumes independent binomial sampling of alleles with probability p (Hardy–Weinberg equilibrium). There is an excess of homozygotes over heterozygotes in some breeding programmes, particularly in plant breeding programmes that use inbreeding. In this case, we have a clear deviation from the Hardy–Weinberg equilibrium, and the expression 2p(1 − p)α2 will underestimate genetic variance at a locus. To see this consider p = 0.5 and α = 1, which gives 2p(1 − p)α2 = 0.5, but if we only have reference and alternative homozygotes (50% each) the actual variance is doubled due to complete inbreeding (Wright 1931). While there are expressions that involve inbreeding 2p(1 − p)(1 + F)α2, where 2p(1 − p)(1 + F) is the variance of genotypes under non-random mating, we suggest a simpler straightforward calculation of the variance of genotypes at a locus and multiplying that variance with α2. Bulmer (1976) was aware of these differences and partitioned genic variance into the value expected under Hardy–Weinberg equilibrium (binomial sampling of alleles) 2p(1 − p)α2 and a deviation due to non-random mating 2p(1 − p)Fα2.

Computational aspects

The framework is based on Sorensen et al. (2001), Lehermeier et al. (2017) and Allier et al. (2019) that used the full Bayesian approach and MCMC sampling. We performed our analyses with the full and empirical Bayesian approach and found a good concordance between the two approaches and true values. However, there was a tendency of the empirical Bayesian approach to underestimate uncertainty of inferred genetic variances due to ignoring uncertainty in estimating model variance parameters (Fig. 6). This underestimation is expected, but it seems that the difference is not large, though this will vary between datasets. There are also frequentist approaches that account for the uncertainty of estimating variance components (e.g. Kenward and Roger 1997). The full Bayesian analysis with the marker-based model is not too computationally demanding if the number of markers is not too large (10–50K markers can be handled easily). The full Bayesian analysis can be quite demanding with the genome-based model on individuals if the number of individuals is large, but equivalence with the marker-based model means we can fit one or another model and back-solve desired effects (Stranden and Garrick 2009), as long as we use normal prior distribution for marker effects.

The observation that leading principal components underestimate genic variance (Fig. 7) requires further studies. We expected that increasing the number of leading principal components would reduce the estimation error, which we observed for genetic variance. In contrast, we observed consistent underestimation for genic variance—even with all principal components. Since we had more markers than individuals, this is likely because ‘null’ components would still have some uncertainty in estimation, which we ignored and, therefore, we underestimated genic variance. Methods presented in the supplementary of Listgarten et al. (2012) could be used to correct for this.

Assumptions

We made two assumptions. First, we assumed a sufficiently dense panel of markers that collectively capture variation at QTL. An insufficient number of markers will deteriorate the ability of the framework to capture genetic variance at and between QTL. Our simulation shows that with a medium marker density, we can estimate genetic variance in a breeding programme accurately. Still, we highlight that we have simulated a simple genetic architecture with randomly located QTL along the genome. Second, we assumed that allele effects are constant over time and across groups of individuals. This assumption is reasonable for estimation in the sense that we used all the available data to estimate average marker effects as accurately as possible and use them across all individuals. However, allele effects can be time-, environment- or population-specific. We tested how our framework fares in such settings by extending the simulation with genotype-by-year interactions. The results have shown that we overestimate genetic and genic variance in those settings, sometimes by as much as 200%. This miss-match is expected because a too simple statistical model can capture too little genetic variation or capture non-genetic effects as genetic and lead to under- or overestimated genetic variances. While model choice is clearly an issue, we note that the proposed framework can be extended to models with effects such as genotype-by-year interactions, for example, by using a model described in Tolhurst et al. (2019). While estimating time-, environment- or population-specific effects could better reflect reality, accurately estimating such effects is challenging. The random regression and time interval approaches (Tsuruta et al. 2004; Hidalgo et al. 2020) have an advantage in this aspect compared to our framework but do not have the flexibility for the genomic analysis of genetic variance. Nuanced estimation of marker effects will likely be more important for breeding programmes that operate in highly variable environments and introgress germplasm from other populations. Still, there will likely be limited data to estimate separate effects accurately. Estimation of background-specific effects is an active research area in genetics with growing datasets across various populations (e.g., Tolhurst et al. 2019; Peterson et al. 2019; van den Berg et al. 2020). Relatedly, we assumed only additive genetic effects. While both theory and data indicate that the average effect of an allele substitution captures most of genetic variance (Hill et al. 2008), recognition of dominance and epistasis is growing (e.g., Hem et al. 2021; Varona et al. 2018; Alves et al. 2019; Legarra et al. 2021). The estimation of non-additive genetic effects and non-additive genetic variances is a very challenging topic. The proposed framework can be expanded to these settings as well and it is a subject of future research.