Prediction of heterosis using genome-wide SNP-marker data: application to egg production traits in white Leghorn crosses

Amuzu-Aweh, E N; Bijma, P; Kinghorn, B P; Vereijken, A; Visscher, J; van Arendonk, J AM; Bovenhuis, H

doi:10.1038/hdy.2013.77

Original Article
Published: 09 October 2013

Heredity volume 111, pages 530–538 (2013)

Abstract

Prediction of heterosis has a long history with mixed success, partly due to low numbers of genetic markers and/or small data sets. We investigated the prediction of heterosis for egg number, egg weight and survival days in domestic white Leghorns, using ∼400 000 individuals from 47 crosses and allele frequencies on ∼53 000 genome-wide single nucleotide polymorphisms (SNPs). When heterosis is due to dominance, and dominance effects are independent of allele frequencies, heterosis is proportional to the squared difference in allele frequency (SDAF) between parental pure lines (not necessarily homozygous). Under these assumptions, a linear model including regression on SDAF partitions crossbred phenotypes into pure-line values and heterosis, even without pure-line phenotypes. We therefore used models where phenotypes of crossbreds were regressed on the SDAF between parental lines. Accuracy of prediction was determined using leave-one-out cross-validation. SDAF predicted heterosis for egg number and weight with an accuracy of ∼0.5, but did not predict heterosis for survival days. Heterosis predictions allowed preselection of pure lines before field-testing, saving ∼50% of field-testing cost with only 4% loss in heterosis. Accuracies from cross-validation were lower than from the model-fit, suggesting that accuracies previously reported in literature are overestimated. Cross-validation also indicated that dominance cannot fully explain heterosis. Nevertheless, the dominance model had considerable accuracy, clearly greater than that of a general/specific combining ability model. This work also showed that heterosis can be modelled even when pure-line phenotypes are unavailable. We concluded that SDAF is a useful predictor of heterosis in commercial layer breeding.

Introduction

Heterosis or hybrid vigour is the observed increase in growth, productivity, fertility and vigour of a hybrid organism over that of its parents (Shull, 1914; Dobzhansky, 1950). This genetic phenomenon is an essential element of commercial poultry, pig, sheep and plant breeding schemes. In poultry breeding, heterosis was exploited even as early as 1893 (Warren, 1942). Over the years, poultry breeders have established pure lines (not necessarily homozygous) that when crossed produce F₁ hybrids with superior performance in traits of economic importance like growth, egg production and survival. In plant breeding, hybrid cultivars are produced by crossing inbreds from opposite and complementary heterotic groups (Bernardo, 1994). The wide application of such breeding designs demonstrates that the benefits of heterosis are widely exploited by breeders.

In practice, selecting lines to be used as parents in crossbreeding programmes is a challenge because testing all possible line combinations is expensive and time consuming. Moreover, predicting F1 performance from per se phenotypic records of pure lines has failed (Duvick, 1999; Hallauer et al., 2010), and prediction methods based on microsatellite markers have been inconclusive (Gavora et al., 1996; Minvielle et al., 2000; Atzmon et al., 2002; Jagosz, 2011; Di et al., 2012). Reliable methods for predicting heterosis are therefore needed, because they could substantially increase the efficiency of crossbreeding schemes by identifying optimal parental combinations and reducing the cost of field-testing.

Some hypotheses have been put forward as possible explanations for the genetic mechanisms underlying heterosis: the dominance hypothesis attributes heterosis to the masking of deleterious recessive alleles from one parental line by dominant alleles in the other parental line; the overdominance hypothesis attributes heterosis to advantageous combinations of alleles at heterozygous loci; and the epistasis hypothesis assumes that interactions among loci lead to heterosis (Lynch and Walsh, 1998; Crow, 1999; Goodnight, 1999; Lamkey and Edwards, 1999).

In a single locus model, heterosis is solely due to dominance and is proportional to the squared difference in allele frequency (SDAF) between the parental lines (Falconer and Mackay, 1996). This finding has triggered research into predicting F1 heterosis and overall performance based on microsatellite marker information from parental pure lines. In poultry, evidence to support the theory that heterosis is higher in offspring from more genetically distant parents has been found (Gavora et al., 1996; Haberfeld et al., 1996; Atzmon et al., 2002). Also, many prediction studies have been carried out on commercial crops such as maize, rapeseed, sunflower, chick pea and carrot. Some of these studies reported correlations between genetic distances (GD) and heterosis (Reif et al., 2003; Balestre et al., 2009), but others concluded that GD is not a reliable predictor of heterosis (Dias et al., 2004; Krishnan et al., 2013).

Because of inconsistencies in the results from previous studies, one cannot conclude whether the prediction of heterosis based on molecular marker information has been a success or not, as pointed out in reviews by Dias et al. (2004) and Krishnan et al. (2013). The former authors reviewed several studies in plants and suggested that the number of molecular markers (averages of 160 random amplified polymorphic DNAs, 281 restriction fragment length polymorphisms and 25 simple sequence repeats) should be increased for accurate predictions. Gavora et al. (1996) and Minvielle et al. (2000) reported studies on poultry using ∼85 DNA fingerprint bands. Nowadays, genotyping technologies have advanced, producing large amounts of genome-wide marker information and creating opportunities to reinvestigate the genetic basis of heterosis, and methods for its prediction.

A further difficulty in the study of heterosis, particularly in livestock populations, is that phenotypic values on pure-bred individuals are often recorded only in specific environments that differ systematically from the environments in which crossbred phenotypes are recorded. In those cases, heterosis cannot be observed because it is fully confounded with the environment. Although an analysis of crossbred data using a specific vs general combining ability model is feasible in such cases, this provides estimates of combining ability rather than heterosis. In contrast to heterosis, general and specific combining ability (GCA/SCA) depend on the set of crosses included if the crossing scheme is incomplete, and this is generally the case in animal populations. Dependency of results on the set of crosses included hampers the comparison of results with the literature, and the prediction of future crosses. Hence, animal breeders are interested primarily in heterosis and hybrid performance, rather than combining ability, but are faced with the problem that pure-bred phenotypes are unavailable.

The aim of this study was to determine whether genome-wide difference in allele frequencies between pure lines can be used to predict heterosis for egg number, egg weight and survival days in white Leghorn crosses. For this purpose, we used allele frequencies on 60 K single nucleotide polymorphism (SNP) loci from 11 pure lines of white Leghorns, and phenotypic data on 47 crosses between those lines, representing ∼400 000 individuals. No phenotypic data were available on the pure lines. In animals, this is the largest data set ever used for the prediction of heterosis and the first to utilise genome-wide SNP-marker data. We performed a cross-validation to test how accurately we could predict heterosis in crosses for which phenotypic records were unavailable. Moreover, we investigated the estimation of heterosis in the absence of phenotypic data on pure lines, and compared the predictive ability of heterosis vs combining-ability modelling.

Materials and methods

Population structure

Phenotypic records of crossbred hens originating from 11 pure-bred white Leghorn layer lines (5 sire- and 6 dam-lines) were obtained from the Institut de Sélection Animale B.V. (ISA). Phenotypic records were available on crossbreds only; phenotypic records on pure lines reared under similar conditions were not available. Coding of the pure lines was as follows: S1, S2, S3, S4, S5 represented sire-lines and D1, D2, D3, D4, D5, D6 represented dam-lines. A cross produced by an S1 sire and a D1 dam is referred to as S1 × D1 and its reciprocal as D1 × S1. Within each line there were multiple sires and dams, resulting in full- and half-sibs within a cross. The mating scheme shown in Table 1 produced a total of 47 crosses, some being reciprocal crosses. Phenotypic records were from routine performance tests carried out on test farms in the Netherlands, Canada and France from 2004 through 2010. On the test farms, each henhouse had several rows of cages, and each row had three tiers: bottom, middle and top. Crossbreds were kept in group cages containing a mix of full- and paternal half-sibs. Cages were assigned randomly to a row and tier within the henhouse, while ensuring that the different crosses and families were randomized across all rows and tiers. On average, there were ∼5 hens per cage. All hens had been beak-trimmed.

Table 1 The mean and number of records (given in brackets) per cross for egg number, egg weight and survival days

Phenotypic data

Traits studied were egg number, egg weight and survival days.

Egg number

Hens were kept in cages and all records were taken at the cage level (rather than at the level of the individual hens). Hen-day records of eggs produced from 100 through 504 days of age were used. Hen-day egg number was calculated as the total number of eggs laid in the cage divided by the total number of days that hens were present (summed over all hens placed in the cage), and then multiplied by the maximum number of days the production period lasted. As an example, consider a production period lasting 410 days. If the total number of eggs laid is 1650 in a cage that started with five hens, and all hens survived until the end of the production period, then summed hen days are 5 × 410 days=2050 days. Hen-day egg number is (1650/2050) × 410=330 eggs. If the same number of eggs was laid but one hen died 50 days before the end of the period, the summed hen days would be 2000 days, giving a hen-day egg number of (1650/2000) × 410=338.25 eggs. This cage-based value represents one record, and in this paper we refer to this trait simply as ‘egg number’.

Descriptive statistics of the egg number data showed that three consecutive performance tests conducted by the same farmer had ∼9% of the records above the biological limit of one egg per hen per day. Because we studied hen-day egg number, those unusually high records could be due to mistakes in recording the duration of the production period or mortality. We therefore eliminated all of that farmer’s tests from further analysis. For other performance tests with only a few (<3%) of the records above the biological limit, we excluded only those particular records and kept the other records from that performance test in the analyses. No two tests in this category were from the same farmer. Also, total egg number records of fewer than 150 eggs were considered to be errors (Jeroen Visscher, ISA poultry breeders, personal communication) and were therefore excluded. Excluded records represented 7.6% of the total record count. The final data set had 76 640 records.
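The hen-day calculation and the biological-limit check described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline; the function names and the per-hen list of days present are assumptions made for the example.

```python
def hen_day_egg_number(total_eggs, hen_days_present, period_days):
    """Cage-level hen-day egg number, scaled to the full production period.

    hen_days_present: days each hen placed in the cage was present
    (hypothetical input layout chosen for this sketch).
    """
    summed_hen_days = sum(hen_days_present)
    if summed_hen_days <= 0:
        raise ValueError("summed hen days must be positive")
    return total_eggs / summed_hen_days * period_days


def exceeds_biological_limit(total_eggs, hen_days_present):
    """True when the cage averages more than one egg per hen per day."""
    return total_eggs > sum(hen_days_present)


# Worked examples from the text: five hens, 410-day period, 1650 eggs.
all_survive = hen_day_egg_number(1650, [410] * 5, 410)
one_dies_early = hen_day_egg_number(1650, [410] * 4 + [360], 410)
print(round(all_survive, 2), round(one_dies_early, 2))
```

The two printed values correspond to the 330 and 338.25 eggs of the worked example, up to floating-point rounding.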

Egg weight

Approximately five times throughout the production period (at around 25, 35, 45, 60 and 75 weeks of age), for each cage, the average weight of all eggs laid on a particular day was recorded. At the end of the production period, these five averages were again averaged to give one value for egg weight per cage for the entire production period. The data set used was the same as that for egg number but there were some missing records for egg weight, leaving 57 759 records.

Survival days

The trait survival days is the average number of days that the hens within each cage survived. For example, if a cage started with five hens, three of which survived for 410 days, one for 405 days and the other for 400 days, the record for that cage would be ((3 × 410)+405+400)/5=407 days. Fractions were truncated. There were 76 640 records on survival days.
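As a small sketch of the survival-days record, assuming each cage is represented by a list of whole days survived per hen (a hypothetical layout, not taken from the source data):

```python
def cage_survival_days(days_survived):
    """Mean survival days for a cage, with the fractional part truncated."""
    if not days_survived:
        raise ValueError("cage must contain at least one hen")
    # Floor division equals truncation for non-negative whole-day counts.
    return sum(days_survived) // len(days_survived)


# Example from the text: three hens at 410 days, one at 405, one at 400.
print(cage_survival_days([410, 410, 410, 405, 400]))
```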

Allele frequency data

For each pure line, blood from 75 randomly chosen males was pooled, and DNA was extracted for genotyping. The Illumina chicken 60 K SNP BeadChip was used (Groenen et al., 2011). The same array was used for all genotyping. Quality control criteria were call rate and visual inspection of the clustering of the three genotypes at each SNP. The total number of SNPs used in this study was 53 582, after excluding the sex chromosomes. The sex chromosomes were excluded because females are the heterogametic sex in chickens (ZW), thus the sex chromosomes do not contribute to heterosis by dominance in females. Estimated allele frequencies were corrected for unequal amplification by ‘k-correction’, using the relative allele signal of heterozygous individuals (Hoogendoorn et al., 2000), and then normalised with respect to the two homozygotes (Craig et al., 2005). Correction factors were obtained from 288 individually genotyped animals across all lines. On average, estimation of allele frequencies from the DNA pooling technique has an accuracy of 0.993, with a range of 0.986 to 1 (Hoogendoorn et al., 2000).

Statistical analyses

Allele frequencies

Our statistical analysis rests on two assumptions. The first assumption is that heterosis is due to dominance. Under this assumption, the heterosis due to a single locus, say l, is proportional to the SDAF between the parental lines at that locus,

$$\text{heterosis}_{ij,l} = d_l\,(p_{i,l} - p_{j,l})^2,$$

where d_l is the dominance deviation at locus l, p_i,l is the allele frequency at locus l in parental line i, and p_j,l is the allele frequency at locus l in parental line j (Falconer and Mackay, 1996). Under the assumption that heterosis is due to dominance, total heterosis is simply the sum of heterosis at each locus,

$$\text{heterosis}_{ij} = \sum_{l} d_l\,(p_{i,l} - p_{j,l})^2.$$

The second assumption is that the dominance deviation at a locus is independent of the SDAF between parental lines at that locus, so that $E[d_l\,(p_{i,l} - p_{j,l})^2] = E(d_l)\,E[(p_{i,l} - p_{j,l})^2]$. Under this assumption, expected heterosis is $E(\text{heterosis}_{ij}) = n_{loci}\,E(d_l)\,E[(p_{i,l} - p_{j,l})^2]$, where n_loci is the total number of loci. Thus, under this assumption, heterosis is linear in the SDAF between parental lines, averaged over all loci, with a coefficient of proportionality of n_loci E(d_l), which will be higher with directional than ambidirectional dominance. We therefore used the genome-wide average of SDAF as a predictor of heterosis. For any two parental lines, say i and j, SDAF_ij was calculated as

$$SDAF_{ij} = \frac{1}{N}\sum_{n=1}^{N} (p_{i,n} - p_{j,n})^2,$$

where $(p_{i,n} - p_{j,n})$ is the difference in allele frequency between pure lines i and j at SNP n, and N is the total number of SNPs.
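A minimal sketch of the genome-wide SDAF calculation, assuming each pure line is represented by a list of allele frequencies in the same SNP order. The line names follow the paper's coding, but the frequencies below are invented purely for illustration.

```python
from itertools import combinations


def sdaf(freq_i, freq_j):
    """Average squared difference in allele frequency over all SNPs."""
    if len(freq_i) != len(freq_j) or len(freq_i) == 0:
        raise ValueError("need two equally long, non-empty frequency lists")
    return sum((p_i - p_j) ** 2 for p_i, p_j in zip(freq_i, freq_j)) / len(freq_i)


# Hypothetical allele frequencies at four SNPs (not real data).
line_freqs = {
    "S1": [0.0, 0.5, 1.0, 0.25],
    "D1": [1.0, 0.5, 0.0, 0.25],
    "D2": [0.5, 0.5, 0.5, 0.25],
}

# SDAF is symmetric, so one value per unordered pair of lines is enough.
pairwise = {
    (a, b): sdaf(line_freqs[a], line_freqs[b])
    for a, b in combinations(line_freqs, 2)
}
print(pairwise)
```

Because SDAF_ij equals SDAF_ji, a cross and its reciprocal share the same value.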

We also calculated Nei’s standard GD (Nei, 1972) from the allele frequencies using the PHYLIP software (Department of Genetics, University of Washington, Seattle, WA, USA) (Felsenstein, 1993). Nei’s standard GD is given by

$$D = -\ln\left(\frac{\sum_{l}\sum_{a} p_{1,la}\,p_{2,la}}{\sqrt{\sum_{l}\sum_{a} p_{1,la}^2\;\sum_{l}\sum_{a} p_{2,la}^2}}\right),$$

where $p_{1,la}$ is the allele frequency of the a-th allele at the l-th locus in line 1, and $p_{2,la}$ is the allele frequency of the a-th allele at the l-th locus in line 2. To visualise the genetic differences between the pure lines, we constructed a phylogenetic tree using MEGA (Tamura et al., 2011).
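For biallelic SNPs, Nei's standard GD can be sketched directly from one allele frequency per locus, because the second allele has frequency 1 − p. The function below is a stand-in written for illustration; it is not PHYLIP, and its name and input layout are assumptions.

```python
import math


def nei_standard_distance(freq_1, freq_2):
    """Nei's (1972) standard genetic distance for biallelic loci."""
    if len(freq_1) != len(freq_2) or len(freq_1) == 0:
        raise ValueError("need two equally long, non-empty frequency lists")
    j12 = j1 = j2 = 0.0
    for p1, p2 in zip(freq_1, freq_2):
        q1, q2 = 1.0 - p1, 1.0 - p2      # frequencies of the other allele
        j12 += p1 * p2 + q1 * q2         # between-line identity
        j1 += p1 * p1 + q1 * q1          # identity within line 1
        j2 += p2 * p2 + q2 * q2          # identity within line 2
    if j12 == 0.0:
        return math.inf                  # lines share no alleles at any locus
    return -math.log(j12 / math.sqrt(j1 * j2))


# Hypothetical frequencies at three SNPs for two lines.
print(nei_standard_distance([0.2, 0.5, 0.9], [0.8, 0.5, 0.4]))
```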

Prediction of heterosis

To test the significance of SDAF for predicting heterosis, we fitted a linear mixed model where we regressed the phenotypes of crossbreds on the SDAF between both pure lines producing the cross:

$$y_{ijklm} = \text{sire-line}_i + \text{dam-line}_j + \beta \times SDAF_{ij} + \text{test}_k + \text{hen density}_l + \text{HRT}_m + e_{ijklm}, \qquad (1)$$

where y_ijklm was a phenotypic record, sire-line_i and dam-line_j were the fixed effects of the ith sire-line and jth dam-line of each cross (i, j=1–10), β was the regression coefficient of y on SDAF, and test_k was the fixed effect of each performance test (k=1–50); test is a factor that represents the year in which the test was carried out, and on which farm. Hen density_l was a fixed effect accounting for the initial number of hens within a cage. It had 205 levels, and was nested within the test because the physical size of cages differed across some performance tests. The combined effect of the henhouse, row and tier of the cage was accounted for by including the term ‘HRT_m’ as a random effect (m=1–1088), and e_ijklm was the random residual error term. Data were analysed using the MIXED procedure in SAS version 9.2. This model was used for all three traits. Under the assumptions given above, Model 1 is a heterosis model, where the estimates of sire-line and dam-line are estimates of the pure-line performance, whereas the estimate of β × SDAF_ij is an estimate of heterosis (see Discussion and Supplementary Information).

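Predicted heterosis was calculated by multiplying the estimated regression coefficient of the phenotypes on SDAF (obtained from Model 1) by the SDAF between the lines in each cross,

$$\widehat{\text{heterosis}}_{ij} = \hat{\beta} \times SDAF_{ij}. \qquad (2)$$

For example, predicted heterosis for egg number in an S1 × D1 cross was $\hat{\beta}_{\text{egg number}} \times SDAF_{S1,D1}$.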

Note that as SDAF_ij is the same as SDAF_ji, the predicted heterosis for reciprocal crosses is the same, although their trait values may differ.

Egg number had a markedly skewed distribution, which violates the model assumption of normally distributed residuals and may invalidate the P-values obtained from the statistical analyses. To correct for this, a Box-Cox transformation (Box and Cox, 1964) is commonly applied before the analysis (Ibe and Hill, 1988; Besbes et al., 1993). We therefore applied this transformation to the egg number records. The general form of the Box-Cox transformation equations is:

$$z(t) = \begin{cases} \dfrac{y^{t} - 1}{t\,G_y^{\,t-1}}, & t \neq 0, \\[2ex] G_y \ln(y), & t = 0, \end{cases}$$

where y is the original variable, z(t) is the standardized variable, G_y is the geometric mean of the data and t is the parameter by which data are normalised. We used an empirically selected ‘optimum’ t=4 based on the minimal residual variance of the model used to describe the transformed records. We also considered the minimum test statistic for the Kolmogorov–Smirnov normality test.
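The standardized Box-Cox transformation can be sketched as below, using the geometric mean so that results for different values of t stay on a comparable scale. The function names and the sample egg numbers are hypothetical; the authors' own selection of t was based on the residual variance of their mixed model, which this sketch does not reproduce.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def box_cox_standardized(values, t):
    """Standardized Box-Cox transform z(t) of strictly positive values."""
    g = geometric_mean(values)
    if t == 0:
        return [g * math.log(v) for v in values]
    scale = t * g ** (t - 1)
    return [(v ** t - 1.0) / scale for v in values]


# Hypothetical cage-level egg numbers, transformed with t = 4 as in the text.
egg_numbers = [250.0, 310.0, 330.0, 338.25, 345.0]
print(box_cox_standardized(egg_numbers, 4))
```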

We fitted our models on both the transformed and original scales; however, to facilitate interpretation, the estimated effects are given only on the original scale in the Results.

Accuracy of predicted heterosis

To evaluate the accuracy of predicted heterosis, we used two approaches. First, we calculated Pearson’s correlation coefficient between predicted and observed heterosis; second, we used cross-validation to assess the accuracy of predicted heterosis for crosses not included in the estimation of β.

Correlations between observed and predicted heterosis

We calculated Pearson’s correlation between observed and predicted heterosis. As we did not have phenotypic records of the pure lines, we did not have true observed heterosis. We therefore used the following strategy to obtain values of ‘observed heterosis’.

Observed heterosis, y^#, was obtained by correcting all records for effects of sire-line, dam-line, test, hen density and HRT (henhouse-row-tier) using estimates from Model 1,

$$y^{\#}_{ijklm} = y_{ijklm} - \widehat{\text{sire-line}}_i - \widehat{\text{dam-line}}_j - \widehat{\text{test}}_k - \widehat{\text{hen density}}_l - \widehat{\text{HRT}}_m. \qquad (3)$$

There are two issues in relation to y^#. First, the correction factors in the expression for y^#_ijklm were estimated from Model 1, which includes the SDAF term. Under a dominance hypothesis, therefore, y^# is an estimate of heterosis, rather than of SCA (see Discussion and Supplementary Information for more details). Second, to obtain independent estimates for correction, Model 1 was fitted separately for each of the crosses, and each time the cross for which observed heterosis was to be calculated was omitted from the data set. Thus, correction factors for each cross were obtained without using data on that cross, so that the correction factors were not biased towards the data to be validated. As we had 47 crosses, we obtained 47 different sets of factors for correction, each based on data of 46 crosses.

Finally, accuracy was taken as Pearson’s correlation between observed and predicted heterosis.

Cross validation

The measure of accuracy presented above describes the fit for the current data set, but may not reflect the accuracy of predicted heterosis in an independent data set. To investigate the accuracy with which a cross that was not in the data set could be predicted, we performed a ‘leave-one-cross-out’ cross-validation, in which one cross at a time was left out of the estimation of β. As we had 47 crosses in our data set, this resulted in 47 different estimates of the regression coefficient, $\hat{\beta}_{-ij}$, for each trait. We then predicted heterosis for each i × j cross that had been left out as:

$$\widehat{\text{heterosis}}_{ij} = \hat{\beta}_{-ij} \times SDAF_{ij}, \qquad (4)$$

where $\hat{\beta}_{-ij}$ is the estimated regression coefficient when the i × j cross is omitted from the training data set. Accuracy was taken as Pearson’s correlation between observed (y^#) and predicted heterosis. To quantify the bias of SDAF as a predictor of heterosis, we also calculated the regression coefficient of observed heterosis on both the ‘regular’ (equation 2) and cross-validated predictions (equation 4).
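The leave-one-cross-out procedure can be illustrated with a deliberately simplified sketch. It assumes one value per cross and replaces Model 1 with an ordinary least-squares regression of that value on SDAF, so it ignores the sire-line, dam-line, test, hen density and HRT terms of the actual analysis. All names and numbers are hypothetical.

```python
import math


def ols_slope(x, y):
    """Slope of the least-squares line of y on x (intercept fitted)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def leave_one_cross_out(sdaf_by_cross, observed_by_cross):
    """Predict each cross as beta_hat (estimated without it) times its SDAF."""
    predictions = {}
    for left_out in sdaf_by_cross:
        train = [c for c in sdaf_by_cross if c != left_out]
        beta = ols_slope([sdaf_by_cross[c] for c in train],
                         [observed_by_cross[c] for c in train])
        predictions[left_out] = beta * sdaf_by_cross[left_out]
    return predictions


# Hypothetical cross-level inputs (not real data).
sdaf_by_cross = {"S1xD1": 0.04, "S1xD2": 0.07, "S2xD1": 0.05,
                 "S2xD2": 0.09, "S3xD1": 0.06}
observed = {"S1xD1": 3.1, "S1xD2": 5.0, "S2xD1": 4.2,
            "S2xD2": 6.4, "S3xD1": 3.9}

predicted = leave_one_cross_out(sdaf_by_cross, observed)
crosses = list(sdaf_by_cross)
accuracy = pearson([observed[c] for c in crosses],
                   [predicted[c] for c in crosses])
print(accuracy)
```

Each prediction uses a regression coefficient estimated without the cross being predicted, which is what distinguishes the cross-validated accuracy from the model-fit accuracy.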

Selection of crosses based on predicted heterosis

To quantify the benefits of selecting crosses based on genomically predicted heterosis, we considered a two-step selection process. In the first step, heterosis was predicted for all crosses, and a subset of crosses was selected based on the prediction. In the second step, only crosses selected in the first step were field-tested and a final selection was made based on the observed (that is, true) heterosis and hybrid performance. Compared with a selection based entirely on observed/true heterosis, this two-step selection will yield lower heterosis after the final selection, because the truly best cross may have been discarded based on predicted heterosis in the first step.

The methodological problem is to predict true heterosis after the two-step selection, as a function of the selected proportion in the first step. To enable prediction, we assumed that the predicted and observed heterosis approximately follow a bivariate normal distribution. Then the standardized response in true heterosis after the two-step selection can be obtained from the moment generating function of the truncated bivariate normal distribution (Tallis, 1961), and is given by:

$$R_{2\text{-step}} = \frac{\rho_{12}\,\phi(t_1)\,\Phi(T_{12}) + \phi(t_2)\,\Phi(T_{21})}{p},$$

where t₁ is the standardized truncation point applied in the first step of selection, t₂ is the standardized truncation point used in the second step (relative to the original distribution), p=p₁p₂ is the overall selected proportion (10% in Figure 4), ρ₁₂ is the correlation between both normal variates (that is, the accuracy of predicted heterosis), φ(t₁) is the standard univariate normal density function evaluated at t₁, Φ(T₁₂) is the (cumulative) univariate normal distribution function evaluated at T₁₂, and

$$T_{12} = \frac{\rho_{12}\,t_1 - t_2}{\sqrt{1 - \rho_{12}^2}}, \qquad T_{21} = \frac{\rho_{12}\,t_2 - t_1}{\sqrt{1 - \rho_{12}^2}}.$$

The standardized maximum response in heterosis, that is, heterosis obtained when the selection is based entirely on true heterosis, so that p₁=1 and p₂=p, is given by:

$$R_{\max} = \frac{\phi(t_2)}{p},$$

where t₂ is the standardized truncation point belonging to a selected proportion p in a univariate normal distribution. Finally, the proportion of maximum heterosis obtained in a two-stage selection is given by:

$$\%R_{\max} = 100\% \times \frac{R_{2\text{-step}}}{R_{\max}}.$$

Application of the expressions for R_2-step and R_max requires values for the truncation points t₁ and t₂ corresponding to the selected proportions p₁ and p₂ of a bivariate standard normal distribution with correlation ρ₁₂. Those can be obtained using algorithms for the integration of multivariate normal distributions, such as Dutt’s algorithm (Dutt, 1973; Ducrocq and Colleau, 1986). From the %R_max, we calculated the amount of heterosis lost due to preselecting animals based on genomically predicted heterosis as %loss=100%−%R_max.
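The two-step selection response can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' implementation: Dutt's algorithm is replaced by plain Simpson integration plus bisection to find t₂, the response uses the standard mean of a truncated bivariate normal distribution, and all function names and integration settings are choices made for the example.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal: pdf, cdf and inv_cdf


def joint_upper_tail(t1, t2, rho, steps=1000, upper=9.0):
    """P(X1 > t1, X2 > t2) for a standard bivariate normal (Simpson's rule)."""
    s = math.sqrt(1.0 - rho * rho)

    def integrand(x):
        # density of X1 at x times P(X2 > t2 | X1 = x)
        return STD.pdf(x) * STD.cdf((rho * x - t2) / s)

    h = (upper - t1) / steps
    total = integrand(t1) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(t1 + k * h)
    return total * h / 3.0


def second_step_truncation(t1, p, rho):
    """Bisection for t2 so that the overall selected proportion equals p."""
    lo, hi = -9.0, 9.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if joint_upper_tail(t1, mid, rho) > p:
            lo = mid  # too many crosses pass: raise the threshold
        else:
            hi = mid
    return (lo + hi) / 2.0


def max_response(p):
    """Standardized response when selecting on true heterosis only."""
    return STD.pdf(STD.inv_cdf(1.0 - p)) / p


def two_step_response(p1, p, rho):
    """Standardized response in true heterosis after two-step selection."""
    if not (0.0 < p <= p1 <= 1.0) or abs(rho) >= 1.0:
        raise ValueError("need 0 < p <= p1 <= 1 and |rho| < 1")
    if p1 == 1.0:
        return max_response(p)  # no preselection
    t1 = STD.inv_cdf(1.0 - p1)
    t2 = second_step_truncation(t1, p, rho)
    s = math.sqrt(1.0 - rho * rho)
    t12 = (rho * t1 - t2) / s
    t21 = (rho * t2 - t1) / s
    return (rho * STD.pdf(t1) * STD.cdf(t12) + STD.pdf(t2) * STD.cdf(t21)) / p


def percent_of_max(p1, p, rho):
    """Percentage of the maximum response retained after preselection."""
    return 100.0 * two_step_response(p1, p, rho) / max_response(p)


# Hypothetical scenario: preselect half of the crosses, keep 10% overall.
print(percent_of_max(p1=0.5, p=0.1, rho=0.5))
```

The percentage lost through preselection then follows as 100 minus the returned value.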

Results