Introduction

Perennial plant species such as guava trees (Psidium guajava L.) have specific characteristics such as a long reproductive cycle, a high annual variation in some traits as the yield, differences in precocity, and productive longevity1. This reduces the predictive power of the models, which most often means losses on invested resources. From the point of view of genetic improvement and use in commercial orchards, these characteristics have the following consequences: use of the same genetic material selected for an over number of years; the necessity of repeated evaluations in each individual throughout time, and the reduction in the survival rate of experiments during their useful life. The last one tends to generate unbalanced data that demand accuracy in selection methods2. So, using a method for modeling that produces more accurate results can undoubtedly save resources, and in the long time improve the chance of success of experiments with perennials plants.

Perennial plant breeding typically applies the procedure of Restricted Maximum Likelihood/Best Linear Unbiased Prediction (REML/BLUP) for the prediction of genetic values and estimation of variance components2. Mixed model theory has been a reference for assessing breeding programs in perennial plants, plants in general and animals3. Even though the frequentist methodology presents a number of useful properties, there is a limitation as the REML method only provides approximate confidence intervals2.

This can be avoided by Bayesian inference using an informative prior distribution with mixed models. This approach in genetic breeding, is founded on knowledge of a posteriori distribution. In this process, the likelihood function connects the priori (previous information of the experiment) to the posterior distribution, which finally contemplates the previous knowledge and the additional information obtained in the experiment.

Among the various Bayesian methodologies, the Markov Chains Monte Carlo simulation method can be applied for generate a chain of successive iterations updating the estimates by the likelihood starting from an initial parameter (priori). In the subsequent joint distribution the variances can be obtained, enabling the construction of more accurate confidence intervals (defined as probability intervals or credibility intervals), and also estimative of genetic parameters4.

The Bayesian approach have any advantages compared to the frequentist analysis. The main one is the possibility of using informative priors about parameters of the model5. In the frequentist’s approach, if you have previous data, you can even do a joint analysis with your current experiment, which is often hampered by the difference between outlines or even incomplete data. But this usually comes as a source of variation in the model and does not add much information beyond the possibility of identifying if the previous data are different from the current experiment. Another advantage is that the credibility intervals are close than the confidence intervals, if a proper prior has been used. Because the likelihood function, if a poor priori is used with mixed models, the performance of Bayesian with mixed models is at least equal to BLUP6,7,8.

This work aims to compare REML/BLUP and the Bayesian approach using a non-informative and a proper prior. For this, a superior performance of the Bayesian models is expected, observing the deviation of this methods in relation to phenotypic mean, for the selection of superior genotypes in a perennial population of Psidium guajava.

Methods

Genetic material and experimental design

A total of 17 families of full siblings was selected for this study, all of which belong to the Genetic Breeding Program of guava tree from the Universidade Estadual do Norte Fluminense Darcy Ribeiro (UENF), Rio de Janeiro, Brazil. Genotypes are derived from crosses between seven contrasting parents chosen by diversity genetics studies9. This population is in the final stages of the breeding program.

The experiment was performed in a randomized block design with two replicates. Each family was represented by 24 individuals (12 per block) with a total initially of 408 individuals. The experiment was conducted between 2016 and 2018. The spacing was of 3 per 1.5 m between rows and between plants, respectively. All culture treatments were applied according to the culture requirements10. Harvests were carried out at the individual level, where yield (kg.plant−1) was obtained, and generated one observation per individual because it’s a sum of production. For fruit mass (FM g) and pulp mass (PM g) were taken five observations in different fruits. Some genotypes were lost during the period of the experiments, which resulted in unbalanced data.

Statistical model and analyses

First, we use the common methodology in the so-called frequentist breeding, and later we use the same model with the beyesian approach, using the mixed model:

$$y=Xb+Za+Wc+e$$
(1)

in which y is the observation vector; b is the parametric vector of the fixed effects (families), associated with the vector y by the incidence matrix known X; a and c are the parametric vectors of the random effects (block and individual within the family, respectively), also associated with y by the incidence matrices known, Z and W, respectively; and e is the residual vector, assuming that a and c ~ N (0, Gg e Ga) in which G is the genotypic and addictive variance matrix of the random effects and e ~ N (0, R) which R is the residual variance matrix of the random errors.

Was employed the method of restricted maximum likelihood (REML) to obtain the best estimates of variance components associated with non-orthogonal and unbalanced data11. The REML/BLUP method was executed using the PROCMIXED procedure in the SAS software12.

The Bayesian approach was used with the same model, applying the Monte Carlo method based on Markov Chains (MCMC), as described by Hadfield13, employing the MCMCglmm::MCMglmm package in R software14. A total of one million of iterations (nitt) were determined, discarding the first one hundred thousand first (burn-in) and performing a 1:3 (thin) sampling, totaling an chain with three hundred thousand iterations, where was obtained the variance components (a posteriori distribution). The Markov Chain convergence was tested by the Geweke criterion in accordance with the recommendations of Cowles and Carlin15 by using the coda::geweke.diag package16 in R software14.

The a posteriori means, credibility intervals, and standard deviation of the MCMC sample were obtained according to the generalized linear mixed model:

$${Y}_{lik}={\mu }_{i}+{b}_{ik}+{g}_{li}+{e}_{lik}$$
(2)

in which Ylik is the l-th = [1,…,12] phenotypic value in the i-th = [1,…,17] family within the k-th = [1,2] block; μi is the overall mean of the i-th family; bik is the effect of the i-th family within the k-th block; gli is the effect of the l-th individual within the i-th family; and elik is the residual term.

The joint data distribution (probability function) was utilized under the Bayesian approach: \({Y}_{ikl}|\beta ,g,{G}_{0},{R}_{0} \sim N({{x}^{\text{'}}}_{i}\beta +{{z}^{\text{'}}}_{ki}g,{\sigma }_{e}^{2})\), in which β is the vector of an a priori probability of systematic effects (overall mean); \(g=\{{g}_{kl}\} \sim N(0,I\otimes {G}_{0})\) is the vector of an a priori probability of genotypic values, in which I is the identity matrix and G0 is the genotypic variance matrix; \(e=\{{e}_{ikl}\} \sim N(0,I\otimes {R}_{0})\) is the vector of a prior probability of residual values with identical values of independent distribution, in which R0 with \({x}_{l}^{\text{'}}\) and \({z}_{l}^{\text{'}}\) are incidence vector relating systematization of the genotype effects for the corresponding phenotypic value; and \({\sigma }_{e}^{2}\) is the residual variance considered to be homogeneous. The prior information was based on meta-analysis or on the posterior distributions of the parameters from the previous cycle (2011–2015). The priori informative probability distribution for the fixed parameters of interest was obtained from provided by: \({\beta }_{i} \sim N({b}_{0},{V}_{b})\) in which Vb is a diagonal matrix of the a priori variance of β. An inverted Wishart distribution was adopted for each G0 and R0 as a priori for the covariance matrices: \({G}_{0} \sim {W}_{1}^{-1}({\Sigma }_{g},n)\) ande \({R}_{0} \sim {W}_{1}^{-1}({\Sigma }_{e},n)\), in which \({\Sigma }_{g}\) and \({\Sigma }_{e}\) are scale matrices.

The posteriori joint density of all the parameters, which are dependent on the genotypic effects of the respective matrix, but which assume a priori independence, is given by:

$$p(\beta ,g,{G}_{0},{R}_{0}|y)\propto p(y|\beta ,g,{G}_{0},{R}_{0})p(\beta |{b}_{0},{V}_{b})p(g|I\otimes {G}_{0})p({G}_{0}|{\Sigma }_{g},n)p({R}_{0}|{\Sigma }_{e},n)$$
(3)

A non-informative priori also tested in the model, using a standard priori of the function according with Hadfield13. This non-informative priori assumes for fixed effects a variance matrix (\(V=I\times {1}^{10}\), in which I is an identity matrix) and mean equal to zero (mu = 0). Regarding the systematics effects, a variance equal to 1 (V = 1) and a parameter of degree of confidence around zero (nu = 0.002) were adopted. These distributions are equivalent to inverse gamma distributions (inverted Wishart).

A cross-validation scheme was tested in the methodologies. Ten folds were used in the cross-validation, in each fold the dataset was divided into two subsets, the fist was composed by 90% of dataset taken at random, and was used for training the model. The second (10% ~200 individuals) was the phenotypic values predicted by model obtained on the fist. In each fold a different subset was taken, until all the individuals that were evaluated had their predicted phenotypes.

Results and Discussion

First, was applied the three methodologies throughout the data set, simulating one a common user, and we tried to observe some difference between the results obtained. Then, we plot the deviations of families mean and overall mean for the main yield trait (Fig. 1). Was possible to observe that the frequentist methodology presented a greater deviation, since in some cases the deviation reaches extreme values with errors of approximately 2.4 kg. It is worth mentioning that if this value is extrapolated to large areas of orchards, the difference can reach ~6 t.ha−1. In Bayesian approach with informative priori, it is noticed that the errors in relation to the average were constantly smaller.

Figure 1
figure 1

Differences between the mean estimates obtained by the REML/BLUP methodology and Bayesian inference and the phenotypic mean values in the total production variable (yield t.ha−1) in a full-siblings population of guava trees.

As these estimates are part of the process in the mixed models applied to determine the variance components, to allow the addition of prior information improving the inference process. This analysis provides a more accurate description on the reliability of estimates and predictions than the REML method17, with much less simple methods18, even though the Bayesian inference has very similar goals to that of Fisher, in which the subjective element is removed from the choice of the a priori distribution.

After observing the deviations, was used a cross-validation to obtain model fit dispersion measures. It was considered as a good fit, the methodology that provided lower deviance information criterion (DIC) and also high values for a posterior adjustment probability of the model (Wprob) (Table 1). We verify the predictive power of the models through the correlation between the separate phenotypic data for validation and the prediction of the model obtained by training dataset, in each fold.

Table 1 Quality of fit models by cross-validation (10 folds: 90% training and 10% for validation), in the same sample sets of data for three methodologies: frequentist (REML/BLUP) and Bayesian (with prior no informative and prior informative) tested in the variables fruit mass (g), pulp mass (g) and yield (kg.plant−1) in P. guajava.

Bayesian with a prior showed the lowest DIC with 4287.9, 17985.8 and 6145.8 for the fruit mass, pulp mass and yield variables respectively, showing higher values of Wprob and correlation. With the standard deviations and the delta, it is possible to notice that among the folds of the cross validation, there was consistency in the fit of the model, with minor values for Bayesian inference with informative priori. Thus, whenever a random percentage of the data was used to test the model, it obtained very close results, mainly for the Bayesian approach than for the frequentist.

In the yield variable, where the setting with poor priori for Bayesian inference was worse than the frequentist. It was observed that a poor prior impaired the model as it can be observed in the DIC that although smaller than the frequentist had greater deviations between the folds of the cross-validation, result of the inconsistency of the model depending on the data. Since yield data consist of a single observation (total production), we can infer that Bayesian inference circumvents well the small dataset problem as long as an adequate priori is provided19.

This accuracy arises because the MCMC method still exhibited great variations in the mean chains, therefore the lower significance, already justified by the greater consistency of the chain when starting from informative priori (Fig. 2). It is clear the great difference between the chains using an proper distribution for priori and a poor priori. Silva, et al.20 tested three distributions for informative priori, searching for the best model for variables in pigs. These authors also showed the difference in the accuracy that proper priori provides. The observation of the chain behavior is also a quality control criterion of the adjustment of the model to the data, given that the burn-in itself is a preventive measure to discard the inconsistent starting of the chain21. In this work, the importance of the informative priori is further evident when observing the chains of blocks, plants, and error (Fig. 2A,B).

Figure 2
figure 2

Distribution chain of mean estimates of 300k estimates for the variable yield in the sources of variation block, plants, and error (units) of the model using an informative priori (A) and a poor priori (B). On the right the density function of the distribution corresponding to the chain.

It is also important to note that the stop iteration criterion in PROCMIXED is when the difference between the parameters of the distribution between one iteration and another is smaller than 1E-812. In the Bayesian approach the chain of iterations is defined by the user (in this case 1 mi). At the onset of warming the MCMC method still produces estimates of averages with considerable variation, which tend to decrease with the increase in the chain13. When the user inserts a priori that represents the data well, providing good distribution parameters, that variation between one iteration and another is even smaller, and together with the excessive size of the chain, it generates more precise estimates2. We believe that the poor prior caused so much disturbance in the chain that not even the excessive size was able to stabilize the parameters and promote good distributions posteriori but it was still better than frequentist.

If was used a non-informative distribution for the parameters of the mixed model, Bayesian inference and BLUP should be equivalent. Thus a priori changes the posterior distribution, so that the information contained in it does not come only from the data (likelihood function)6. That is, it adds more information in the analysis, which is not based on the data. So, we proceeded with the selection of the individuals using Bayesian approach with proper prior to obtain the estimated means and predicted genotypic values. We believe to get more accurate genotypic values, because the Bayesian MCMC methods consider uncertainties in the parameters throughout the inference process. On the other hand the BLUP are predicted by point estimates of variance components and are used as true values, ignoring uncertainty in the variance parameters22.

The selection of the best families was performed to be recombined and to generate new populations. The objective is to increase the general population mean, and for this purpose the first nine families were selected, whose estimates were higher than the general average of the population (Fig. 3).

Figure 3
figure 3

Estimated means in a population of full-sibs of guava trees obtained by Bayesian approach for yield, fruit mass and pulp mass traits.

The credibility intervals for this means were generally quite accurate, with a high degree of reliability. If we observe the credibility intervals for Bayesian and the confidence intervals for REML/BLUP, we can see better results with Bayesian inference (Fig. 4 and Table 2).

Figure 4
figure 4

Phenotypic mean of the yield trait for the 17 families of Guava trees and the confidence intervals obtained by the REML/BLUP methodology and the credibility intervals obtained by the Bayesian approach with informative priori and poor priori.

Table 2 Estimates of averages obtained through the frequentist methodology by REML/BLUP and by Bayesian inference (with poor priori and a proper priori) for the variables yield (kg.plant−1), fruit mass (g) and pulp mass (g) in P. guajava.

This is because the REML method provides only approximate confidence intervals through the use of approximations and assumptions of asymptotic normality. The distribution and variance of the estimators are not known and, therefore questions regarding the effectiveness of the selection to be practiced cannot be answered with rigor. On the other hand, Bayesian analysis is based on the knowledge of the posterior distribution of the parameters, and allows the construction of exact confidence intervals (Bayesian probability intervals or credibility intervals)17.

Another part of the population was selected for test value of cultivation and use (VCU) (Table 3). These individuals were selected according to predicted genotypic values and gain estimates based on heritability (Table 4). Heritability estimates showed values within the expected range for the traits, considering that these are controlled by a large number of genes and are highly influenced by the environment2. The heritability also showed highs predict accuracy and standard deviation lowers. These measures are fundamental to planning the breeding program, allowing for more realistic forecasts of the next steps. Similar heritability were observed in guava fruit10, and even higher for this traits, but as shown in the standard deviation values were so high that they approached the estimates presented.

Table 3 Genotypic values and estimates of gains obtained through Bayesian inference for the variables yield (kg), fruit mass (g) and pulp mass (g) in P. guajava.
Table 4 Heritability, predict accuracy and standard deviation values for the variables fruit mass (g), pulp mass (g) and yield (kg.plant−1) in P. guajava obtained with Bayesian inference.

Individuals were selected independently of the aim; industrial processes - where we consider the yield variable or in nature consumption - considering of greater interest the variables fruit mass and pulp mass looking for bigger and more vigorous fruits with less seeds and greater pulp mass. Since the components of variance were estimated through stochastic simulation (Gibbs sampling), we believe that the genetic values best represent the real value of the individual. The idea behind this argument is the exact analysis of finite-size samples because the data are fixed in the posterior distribution, instead of assuming multivariate normal distributions. Better statistical discussions on BLUP obtained by Bayesian inference may be found in2,23,24,25.

Perennial plant breeding programs have a particularity compared to annual plants. This difference is that the production period of perennials is very long. Therefore, the amount of resources needed to improve these species is much larger. Thus, to avoid estimation of variance components with less precision and thus make a program even more difficult, the Bayesian approach can be used. Another advantageous point of this approach is the possibility of using a priori information in the model. Thus, the breeder can make better use of the information available in the literature by using them as distribution measures in his model, instead of just comparing his results.

Conclusions

In general, Bayesian inference provided the best fit of the model to this dataset, considering a population of full-siblings of Psidium guajava. This approach has provided a more complete and reliable result, thus allowing the selection of the best families to give continuity to the program and the best individuals to test crop value according to the expectations. The use of a priori information is the main advantage, and although it is subjective when the prior distribution is informative, the credibility intervals are narrower than the confidence intervals, and this is the main contributor to the accuracy of the model and help you bypass problems of small/unbalanced datasets.

Bayesian inference clearly has advantages over frequentist methodology, and with the advancement of computational powers this inference tends to become popular. We emphasize that we do not say that the Bayesian approach will be superior in all cases, but because of the advantages it can provide the investment to be tested it is worth it.