Linkage group correction using epistatic distorted markers in F2 and backcross populations

Xie, S-Q; Feng, J-Y; Zhang, Y-M

doi:10.1038/hdy.2013.127

Original Article
Published: 05 March 2014

Heredity volume 112, pages 479–488 (2014)

Abstract

Epistasis has been frequently observed in all types of mapping populations. However, relatively little is known about the effect of epistatic distorted markers on linkage group construction. In this study, a new approach was proposed to correct the recombination fraction between epistatic distorted markers in backcross and F₂ populations under the framework of fitness and liability models. The information for three or four markers flanking an epistatic segregation distortion locus was used to estimate the recombination fraction by the maximum likelihood method, implemented via an expectation–maximisation algorithm. A set of Monte Carlo simulation experiments along with a real data analysis in rice was performed to validate the new method. The results showed that the estimates from the new method are unbiased. In addition, five statistical properties for the new method in a backcross were summarised and confirmed by theoretical, simulated and real data analyses.

Introduction

The non-Mendelian segregation of markers, known as distorted segregation, is a common biological phenomenon and has been reported since the early twentieth century (Mangelsdorf and Jones, 1926; Sandler et al., 1959; Rick, 1966; McCouch et al., 1988; Paterson et al., 1988; Brummer et al., 1993; Xu et al., 1997; Kaló et al., 2000; Lu et al., 2002; Barchi et al., 2010). It may lead to a biased estimate of the recombination fraction and affect the accuracy of linkage groups (Lorieux et al., 1995a, 1995b). For example, slight but significant segregation distortion results in a reduced estimate of the recombination fraction (Cloutier et al., 1997; Kaló et al., 2000), and an overwhelming number of heterozygous individuals in the F₂ population leads to a false genetic linkage of markers (Kaló et al., 2000) and the overestimation of the recombination fraction (Lashermes et al., 2001). These conclusions are not contradictory: two linked segregation distortion loci (SDL) cause the recombination fraction to be underestimated in most cases but overestimated under an additive model with opposite additive effects (Zhu et al., 2007). Accurate genetic linkage groups therefore require an in-depth study of marker segregation distortion.

To date, several approaches have been proposed to construct linkage groups. Lander and Green (1987) developed a multi-point method using a hidden Markov chain model. Jiang and Zeng (1997) extended the multi-point method to handle dominant and missing markers. However, a question remains: how can distorted markers be utilised in the construction of linkage groups? The simplest method is to exclude significantly distorted markers from linkage groups, but this treatment usually reduces the coverage and saturation of the genome (Wang et al., 2005). The most common method is to insert distorted markers into a linkage group. If the new linkage group differs substantially from the old one, the recombination fraction between distorted markers should be re-estimated. However, the traditional approach does not work well because a new variable, the selection coefficient, is involved (Kärkkäinen et al., 1996; Kreike and Stiekema, 1997; Faris et al., 1998). To overcome this issue, Lorieux et al. (1995a, 1995b) regarded the selection coefficient as a parameter and adopted the maximum likelihood method to estimate the recombination fraction and selection coefficient simultaneously under a fitness model. Compared with the traditional method, this approach leads to more precise linkage groups, and new software, named MapDisto, is available (Lorieux, 2012). Recently, Zhu et al. (2007) further extended the multi-point method to handle distorted, dominant and missing markers under the framework of a quantitative genetics model for viability selection (Luo et al., 2005). However, epistatic distorted markers have not been considered in the above methods.

Epistasis, the interaction between loci, has been shown to have a strong association with segregation distortion (Bomblies et al., 2007; Alheit et al., 2011). Epistatic SDL have significant implications for inbreeding depression (Phillips, 2008), which is mainly manifested as hybrid male or female sterility. Törjék et al. (2006) reported that marker segregation distortion is due to reduced fertility caused by epistasis. Kubo et al. (2008) showed that hybrid male sterility is caused by epistasis between two novel genes, S24 and S35, on rice chromosomes 5 and 1. Similar results have also been found in Drosophila (Chang and Noor, 2010), alfalfa (Li et al., 2011), rice (Xie and Chen, 2012; Yang et al., 2012) and Arabidopsis lyrata (Leppälä et al., 2013). Thus, the Dobzhansky–Muller model, in which hybrid inviability is assumed to be caused by epistasis (Dobzhansky, 1936; Muller, 1942), has been widely accepted. In addition, McMullen et al. (2009) investigated genome-wide segregation distortion among nested association mapping populations and indicated that epistasis affected fitness. Therefore, epistatic SDL should be considered in the construction of precise linkage groups.

In this study, we integrated the fitness model for viability selection with the liability model and developed a new method to correct the recombination fraction between epistatic distorted markers in backcross and F₂ populations. A series of simulated data sets along with a real data set was analysed to validate the proposed method, and the statistical properties of the new method were summarised and confirmed.

Materials and methods

Genetic model in a backcross population

The new method in this study was developed on the basis of a backcross population. The extension to F₂ populations is mentioned briefly in a subsequent section. In this study, the recombination fraction between epistatic distorted markers was corrected, and the molecular marker information from all n individuals was used to detect the epistatic SDL under the liability and fitness models. Gametic and zygotic selection are equivalent in a backcross, so the two cases are discussed together.

Liability model

If the selection in a backcross is controlled by two linked SDL with recombination fraction r, the liability z_j of the jth individual may be described by the following model:

where a_k is the main effect of the kth SDL (k=1, 2); i is the epistatic effect between the two SDL; the two genotypes at each locus are assumed to be SS and Ss; x_jk is a dummy variable defined as x_jk=1 for the SDL homozygote SS and as x_jk=−1 for the SDL heterozygote Ss; and ε_j∼N(0, σ²) is a normally distributed residual error. For convenience, σ² is set to 1 (Luo et al., 2005). Model (1) can be expressed more compactly as

We hypothesise that the liability is subject to natural selection. An individual will survive if z_j⩾0 and will be eliminated from the population if z_j<0. As all of the sampled individuals have survived viability selection, the liability of each observed individual will follow a truncated normal distribution with a cumulative probability:

This result may be considered to be the relative fitness for individual j and is denoted by Φ(X_jb). Because two linked SDL have four possible genotypes, the relative fitness of each genotype l (l=1,…,4) can be defined in the same way. Therefore, the expected frequencies of the four genotypes after selection are easily calculated and are listed in Table 1.
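As a minimal sketch of how a relative fitness of the form Φ(X_jb) could be evaluated for the four SDL genotypes, the following assumes a linear predictor a₁x₁+a₂x₂+i·x₁x₂ with σ²=1 and the ±1 dummy coding described above. The optional intercept `mu` and the function names are illustrative assumptions, not taken from the original study.

```python
from math import erf, sqrt

def std_normal_cdf(t: float) -> float:
    """Phi(t), the cumulative distribution function of a standard normal."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

# Dummy coding from the text: +1 for the SDL homozygote SS, -1 for the heterozygote Ss.
# Order: (SS, SS), (SS, Ss), (Ss, SS), (Ss, Ss) at loci 1 and 2.
GENOTYPE_CODES = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def liability_fitness(a1, a2, i, mu=0.0):
    """Survival probability P(z >= 0) for each two-locus SDL genotype.

    Assumption: z ~ N(mu + a1*x1 + a2*x2 + i*x1*x2, 1). The intercept mu is a
    hypothetical addition for illustration and defaults to 0.
    """
    return [
        std_normal_cdf(mu + a1 * x1 + a2 * x2 + i * x1 * x2)
        for x1, x2 in GENOTYPE_CODES
    ]
```

With all effects set to zero, every genotype survives with probability 0.5, so the relative fitnesses are equal and no distortion is expected under this sketch.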

Table 1 Expected frequencies of four genotypes under the liability and fitness models in a backcross population

Fitness model

In the fitness model, the viability coefficients for the S₁s₂, s₁S₂ and s₁s₂ gametes relative to S₁S₂ are defined to be v, u and x, respectively, which means that the fitnesses for S₁S₁S₂S₂, S₁S₁S₂s₂, S₁s₁S₂S₂ and S₁s₁S₂s₂ in the backcross are 1, v, u and x, respectively. The case u=v=x=1 indicates no selection, which corresponds to Mendelian segregation. Therefore, the expected frequencies of the above four genotypes (l=1,…,4) among surviving individuals are also easily calculated and are listed in Table 1.
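A minimal sketch of how such expected frequencies can be computed: the pre-selection backcross frequencies, (1−r)/2 for each parental class and r/2 for each recombinant class, are weighted by the fitnesses 1, v, u and x and then renormalised. The function name `fitness_frequencies` is illustrative, and coupling phase in the F₁ parent is assumed.

```python
def fitness_frequencies(r, v, u, x):
    """Expected survivor frequencies of S1S1S2S2, S1S1S2s2, S1s1S2S2 and S1s1S2s2.

    r is the recombination fraction between the two SDL; v, u and x are the
    fitnesses of the last three genotypes relative to S1S1S2S2.
    """
    # The common factor 1/2 of the pre-selection frequencies cancels on normalising.
    weighted = [(1.0 - r), r * v, r * u, (1.0 - r) * x]
    total = sum(weighted)  # equals (1 - r)(1 + x) + r(u + v)
    return [w / total for w in weighted]
```

For example, `fitness_frequencies(0.2, 1, 1, 1)` gives the Mendelian expectation of 0.4, 0.1, 0.1 and 0.4, whereas lowering x below 1 reduces the share of the S₁s₁S₂s₂ class.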

Relationship between parameters in the above two models

The expected frequency of each genotype l (l=1,…,4) should be the same under the liability and fitness models. Therefore, the relationship between parameters in the two models can be expressed as

Likelihood function and parameter estimation in a backcross

Although the genotypes of the two SDL in the above two models are unobserved, the genotypes of the markers flanking the SDL are observed. Assume that two loci, S₁ and S₂, are located between markers A and B and between markers C and D, respectively, and that the recombination fractions between A and S₁, between S₁ and B, between B and C, between C and S₂ and between S₂ and D are r₁, r₂, r_BC, r₃ and r₄, respectively. The expected frequencies of the 16 observed genotypes of markers A, B, C and D are calculated and listed in Table 2.
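The calculation behind such a table can be sketched as follows, under two assumptions that are not stated explicitly here: crossovers in adjacent intervals occur independently (no interference), and the F₁ parent is in coupling phase. Each of the 64 gametes over the ordered loci A, S₁, B, C, S₂, D is weighted by the fitness of its SDL genotype, the weights are normalised, and the unobserved SDL genotypes are summed out. The function name `marker_frequencies` and the 0/1 allele coding are illustrative.

```python
from itertools import product

def marker_frequencies(r1, r2, r_bc, r3, r4, v, u, x):
    """Return (joint, marginal) frequencies among backcross survivors.

    joint maps ((A, B, C, D), (S1, S2)) to its frequency (64 classes);
    marginal maps (A, B, C, D) to its frequency (16 observable classes).
    Allele 1 denotes the allele giving the homozygous backcross genotype.
    """
    rates = [r1, r2, r_bc, r3, r4]
    fitness = {(1, 1): 1.0, (1, 0): v, (0, 1): u, (0, 0): x}
    joint = {}
    for gamete in product((1, 0), repeat=6):
        a, s1, b, c, s2, d = gamete
        prob = 0.5  # probability of the allele at the first locus
        for left, right, r in zip(gamete, gamete[1:], rates):
            prob *= r if left != right else 1.0 - r
        joint[(a, b, c, d), (s1, s2)] = prob * fitness[(s1, s2)]
    total = sum(joint.values())
    joint = {key: val / total for key, val in joint.items()}
    marginal = {}
    for (markers, _sdl), val in joint.items():
        marginal[markers] = marginal.get(markers, 0.0) + val
    return joint, marginal
```

Without selection (u=v=x=1) the marker frequencies reduce to the usual products over the intervals A–B, B–C and C–D, which is a convenient sanity check for the sketch.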

Table 2 Expected frequencies of the 16 genotypes of markers A, B, C and D under the epistatic SDL genetic model in a backcross population

Let n_k and p_k (k=1,…,16) be the observed number and expected frequency of the kth genotype for the four markers, and let n=n₁+…+n₁₆ be the total number of individuals. The likelihood function in a backcross is
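Assuming multinomial sampling of the 16 marker classes, a likelihood of this kind can be evaluated on the log scale as in the sketch below. The multinomial form and the function name `multinomial_log_likelihood` are assumptions made for illustration; they are inferred from the definitions of n_k and p_k rather than copied from equation (5).

```python
from math import lgamma, log

def multinomial_log_likelihood(counts, freqs):
    """ln L = ln n! - sum_k ln n_k! + sum_k n_k ln p_k.

    counts holds the observed numbers n_k and freqs the expected
    frequencies p_k, in the same class order.
    """
    n = sum(counts)
    # lgamma(m + 1) equals ln(m!) and avoids overflow for large counts.
    constant = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    # Classes with zero count contribute nothing to the sum.
    return constant + sum(c * log(p) for c, p in zip(counts, freqs) if c > 0)
```

The constant term does not depend on the parameters, so it can be dropped when the function is only used to compare parameter values.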

However, maximising the likelihood in equation (5) directly is complicated. Thus, the complete information that includes all 64 genotypes for four markers and two SDL was used to construct the likelihood function, which is expressed as

where p_kl and n_kl (k=1,…,16; l=1,…,4) are the expected frequency and the observed number for the kth marker genotype and the lth SDL genotype, respectively, and . Theoretically, the Newton–Raphson method may be used to obtain the maximum likelihood estimates in equation (6). Here, we adopt the expectation–maximisation (EM) algorithm (Dempster et al., 1977). The log-likelihood function is

where . The maximum likelihood estimate of each parameter is found by setting its partial derivative to zero and solving the equation to obtain

where , , , , and . The estimates for r₁ and r₂ were used to correct the recombination fraction between markers A and B: r_AB=r₁+r₂−2r₁r₂; similarly, r_CD=r₃+r₄−2r₃r₄. When m markers are located in a linkage group, the number of estimates for r_AB is . Some of these estimates may be too high and others too low; in this study, we suggest using the median, a choice validated by Monte Carlo simulation experiments. Although only the selection parameters u, v and x were estimated, these fitness-model parameters can be converted to those of the liability model using equation (4). Therefore, only the estimates of parameters in the fitness model are given in this study.
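The correction step can be sketched as follows: each pair of interval estimates is combined with r_AB=r₁+r₂−2r₁r₂, and the median over the available estimates is reported. The function names and the input layout (a list of (r₁, r₂) pairs, one per choice of the second marker pair) are illustrative assumptions.

```python
from statistics import median

def combine_intervals(r_left, r_right):
    """Recombination fraction across two adjacent intervals: r1 + r2 - 2*r1*r2."""
    return r_left + r_right - 2.0 * r_left * r_right

def corrected_r_ab(interval_estimates):
    """Median of r_AB over several (r1, r2) estimate pairs for the same A-B interval."""
    return median(combine_intervals(r1, r2) for r1, r2 in interval_estimates)
```

The median is less sensitive than the mean to a single badly over- or underestimated value, which fits the motivation given above.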

Variance of recombination fraction

The expected Fisher’s information score of the recombination fraction is given by

where ln L=(n_AB+n_ab)ln(1−r)+(n_Ab+n_aB)ln r+n_Ab ln v+n_aB ln u+n_ab ln x−n ln[(1−r)(x+1)+r(u+v)]. For large samples, the variance of r was estimated by
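In standard large-sample theory, the variance of a maximum likelihood estimate is approximated by the reciprocal of the expected information. A minimal numerical sketch for the two-marker ln L above follows. It treats u, v and x as known, uses the multinomial identity I(r)=n·Σ_k (∂p_k/∂r)²/p_k, and approximates the derivatives by central differences; the function names and the step size `h` are illustrative assumptions rather than the procedure used in the original study.

```python
def class_frequencies(r, v, u, x):
    """Survivor frequencies of the AB, Ab, aB and ab classes implied by ln L."""
    weighted = [(1.0 - r), r * v, r * u, (1.0 - r) * x]
    total = sum(weighted)  # (1 - r)(x + 1) + r(u + v)
    return [w / total for w in weighted]

def expected_information(r, v, u, x, n, h=1e-6):
    """Expected Fisher information for r with v, u and x held fixed."""
    p = class_frequencies(r, v, u, x)
    upper = class_frequencies(r + h, v, u, x)
    lower = class_frequencies(r - h, v, u, x)
    # Central-difference approximation of dp_k/dr for each class.
    slopes = [(a - b) / (2.0 * h) for a, b in zip(upper, lower)]
    return n * sum(s * s / pk for s, pk in zip(slopes, p))

def variance_of_r(r, v, u, x, n):
    """Large-sample variance of the estimate of r, approximated as 1 / I(r)."""
    return 1.0 / expected_information(r, v, u, x, n)
```

Without selection (u=v=x=1) this reduces to the familiar backcross result r(1−r)/n, which gives a quick check of the sketch.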