Estimating the correlation of pairwise relatedness along chromosomes

Hu, X-S

doi:10.1038/sj.hdy.6800586

Download PDF

Original Article
Published: 08 September 2004

Focus Article

Estimating the correlation of pairwise relatedness along chromosomes

X-S Hu¹

Heredity volume 94, pages 338–346 (2005)Cite this article

1304 Accesses
3 Citations
3 Altmetric
Metrics details

Abstract

The ‘spatial’ pattern of the correlation of pairwise relatedness among loci within a chromosome is an important aspect for an insight into genomic evolution in natural populations. In this article, a statistical genetic method is presented for estimating the correlation of pairwise relatedness among linked loci. The probabilities of identity-in-state (IIS) are related to the probabilities of identity-by-descent (IBS) for the two- and three-loci cases. By decomposing the joint probabilities of two- or three-loci IBD, the probability of pairwise relatedness at a single locus and its correlation among linked loci can be simultaneously estimated. To provide effective statistical methods for estimation, weighted least square (LS) and maximum likelihood (ML) methods are evaluated through extensive Monte Carlo simulations. Results show that the ML method gives a better performance than the weighted LS method with haploid genotypic data. However, there are no significant differences between the two methods when two- or three-loci diploid genotypic data are employed. Compared with the optimal size for haploid genotypic data, a smaller optimal sample size is predicted with diploid genotypic data.

Rank-invariant estimation of inbreeding coefficients

Article Open access 25 November 2021

Effect of heterogeneity in recombination rate on variation in realised relationship

Article 20 June 2019

Scalable bias-corrected linkage disequilibrium estimation under genotype uncertainty

Article 09 August 2021

Introduction

Pairwise relatedness at a single locus and its correlation among linked loci along chromosomes give insights into patterns of genomic evolution in natural populations. Pairwise relatedness reveals the genetic similarity between individuals due to recently shared ancestors, and is affected by several evolutionary forces (Wright, 1969). The correlation of pairwise relatedness and its ‘spatial’ pattern along chromosomes may reflect differential processes of co-evolution that have occurred among different regions of chromosomes. Hitchhiking effects and background selection (Maynard Smith and Haigh, 1974; Charlesworth et al, 1993) can enhance a positive regional correlation of pairwise relatedness. Recombination separates linked alleles that have a common ancestor, and hence alters the coalescence process for linked loci (eg Hudson, 1991), resulting in a negative correlation in pairwise relatedness for the two loci. Random drift or founder effects can also reduce the correlation in pairwise relatedness. In general, the joint effects of different forces can bring about a nonuniform distribution of the correlation of relatedness along chromosomes. Thus, a combination of linkage maps and the correlation of pairwise relatedness among loci can help us understand the ‘patchy’ pattern of genomic evolution.

The significance of pairwise relatedness has long been appreciated in understanding population genetic structure and evolution (Wright, 1922, 1969; Cotterman, 1940; Jacquard, 1974). Estimation of the pairwise relatedness is now simplified by the advent of abundant polymorphic markers, such as microsatellite markers and single-nucleotide polymorphisms (SNPs) (eg Brookes, 1999). The populations studied can either be populations with known pedigree or populations with unknown pedigree. There have been numerous studies exploring the marker-based statistical methods for the purpose of estimating pairwise relatedness (eg Thompson, 1975; Pamilo and Crozier, 1982; Lynch, 1988; Queller and Goodnight, 1989; Ritland, 1996; Lynch and Ritland, 1999; Wang, 2002; Milligan, 2003). However, these previous analyses mainly focus on the average pairwise relatedness per locus and have not been coupled with assessment of the genome-wide diversity in relatedness. The heterogeneity of pairwise relatedness along chromosomes cannot be assessed using these methods.

Previous theories on kinship mapping, the relationship between the joint probability of identity-by-descent (IBD) of linked markers and the recombination fraction (Morton et al, 1971; Morton and Simpson, 1983) did not examine the interaction among linked loci in terms of the correlation of relatedness. A recent theoretical study evaluated the role of drift, gene flow, selfing, and mutation in affecting the association of gene identity, and shows that a small identity disequilibrium (ID) within subpopulations is present (Vitalis and Couvet, 2001a). ID is termed as the difference between the joint two-loci probability of identity-in-state (IIS) and the expected product of each component's IIS probability. Such an ID approach is essentially distinct from the correlation of relatedness, which assesses the difference in terms of IBD. Unlike Vitalis and Couvet (2001b), who apply ID to jointly estimate the local effective population size and migration rate, I use the joint probabilities of two- and three-loci IBD to address the correlation of pairwise relatedness among loci.

The purpose of this article is to develop a new method for jointly estimating the pairwise relatedness at a single locus and its correlation among loci. Unlike previous methods where pairwise relatedness is estimated at individual loci and then weighted to gain the average (eg Ritland, 1996; Lynch and Ritland, 1999; Wang, 2002), the present method relies on the estimation of the joint probabilities of IBD at two or three loci.

In my analyses of pairwise relatedness and its correlation among loci, data can be randomly sampled from either the haploids or diploids genotyped with codominant markers, each with an arbitrary number of alleles. The use of haploid and of diploid genotypic data is evaluated through extensive Monte Carlo simulations. The preexisting approaches are based on sampling diploids (eg Ritland, 1996; Lynch and Ritland, 1999). However, the approach of using the haploid genotypic data allows the method to be applicable to specific chromosomes, such as the sex chromosomes, and has an advantage of ignoring the effects of mating system.

Two statistical methods are extensively evaluated through simulation study: the weighted least squares (LS) and the maximum likelihood (ML) methods. Application of the simulation results to practical analysis is discussed.

Two-loci relatedness

Throughout the study, the concept of pairwise relatedness at a single locus refers to the probability that an allele randomly sampled from one individual is IBD with an allele at the same locus randomly sampled from another individual (eg Jacquard, 1974; Ritland, 1996; Lynch and Ritland, 1999). Rousset (2002) reviewed some properties of this relatedness definition.

Haploid data

Consider a pair of two linked co-dominant loci in a population, denoted by A and B, with numbers of n_A and n_B alleles, respectively. Let p_u and q_v be the frequencies of alleles A_u (u=1, 2, …, n_A; ∑_up_u=1) and B_v (v=1, 2, …, n_B; ∑_vq_v=1), respectively. Following the classical definition on the probability of IBD at a single locus (Jacquard, 1974), four parameters are defined in the two-loci case: κ₁₁ (0≤κ₁₁≤1) is the probability that the two alleles at each of the two loci are IBD, κ₁₀ (0≤κ₁₀≤1) is the probability that the two alleles at the A locus are IBD but the two alleles at the B locus are not, κ₀₁ (0≤κ₀₁≤1) is the probability that the two alleles at the B locus are IBD but the two alleles at the A locus are not, and κ₀₀ (=1−κ₁₁₋κ₁₀−κ₀₁) is the probability that the two alleles at each of the two loci are not IBD. Note that there are several definitions of the probability of two-loci descent (Whitlock et al, 1993; Vitalis and Couvet, 2001a 2001b; Laurie and Weir, 2003), but the present definition of κ₁₁ is actually the same as the definition F₁₁ of Whitlock et al (1993).

The basic approach for estimating the relatedness is to use the probabilities of IIS to infer the relatedness parameters, similar to previous studies (eg Ritland, 1996). Denote by P^uv_u′v′ the probability that a pair of two-loci gametes have genotypes A_uB_v and A_u′B_v′ (u, u′=1, 2, …, n_A; v, v′=1, 2, …, n_B). When linkage disequilibrium (LD) is absent, P^uv_u′v′ can be decomposed as

where δ_uu′ is Kronecker delta variable, which is equal to unity when u=u′ and zero otherwise. The factor $2^{- (1 - δ_{u u'}) (1 - δ_{v v'})}$ in the coefficient of κ₀₀ on the right side of equation (1) is introduced so that the coupling and repulsion linkage phases can be separated, which is distinct from the previous four-gene case at a single locus (Lynch and Ritland, 1999). Note that the gamete and allele frequencies in equation (1) are assumed known beforehand with sufficient accuracy, as in previous studies (eg Ritland, 1996, 2000; Lynch and Ritland, 1999).

When LD is present, a more general expression of P^uv_u′v′ can be written as

where g_uv and g_u′v′ are the frequencies of gametes A_uB_v and A_u′B_v′ in the population_, respectively.

In the case of two alleles per locus (say A_i, A_j, B_k, and B_l), there are four categories of haplotype pairs according to the number of shared alleles: IIS for both the two alleles of each locus (A_iB_k−A_iB_k, A_iB_l−A_iB_l, A_jB_k−A_jB_k, A_jB_l−A_jB_l), IIS for the two A alleles but not for the two B alleles (A_iB_k−A_iB_l, A_jB_k−A_jB_l) and the reverse case (A_jB_l−A_iB_l, A_iB_k−A_iB_l), and no shared alleles for both loci (A_iB_k−A_jB_l, A_jB_k−A_iB_l). There are 16 types of haplotype pairs, but only 10 of them are distinguishable with codominant markers (Table 1). For an arbitrary number of alleles at each locus, the number of distinguishable haplotype pairs, denoted by n_AB, is shown to be equal to n_An_B(n_An_B+1)/2.

Table 1 Probabilities for the two-loci haplotype pairs (two alleles per locus: A_i, A_j; B_k, B_l)

Full size table

Diploid data

When diploid genotypic data are used, the preceding method can be applied as long as the gamete frequencies are available. However, estimation of gamete frequencies requires the assumption of random association of gametes (Hill, 1974; Weir, 1996; Kalinowski and Hedrick, 2001); otherwise, the relative proportion of heterozygotes with repulsion versus coupling linkage phases is indistinguishable. When there are nonrandom associations between gametes in forming zygotes (Yang, 2002), expectation–maximization (EM) (Dempster et al, 1977) and other methods are not applicable. The following method is only valid under the assumption of random association of gametes.

Denote by H_AB the probability that both A and B loci are heterozygous, H_A the probability that the A locus is heterozygous but the B locus is not, H_B the probability that the B locus is heterozygous but the A locus is not, and H₀₁ (H₀₁=1−H_A−H_B−H_AB) the probability that both loci are homozygous. I obtain

Note that equation (3d) differs from the previous study where only one parameter is considered (Morton and Simpson, 1983). When only one locus is considered, equations (3a), (3b), (3c) and (3d) reduce to the classical results (Falconer and Mackay, 1996, p 66). In practice, the heterozygote frequencies (H₀₁–H_AB variables) can be estimated directly from the genotypic data. Thus, according to equations (3a), (3b), (3c) and (3d), the three unknown parameters (κ₁₁, κ₁₀, and κ₀₁) can be estimated.

Correlation of pairwise relatedness

Denote by r_A and r_B the probabilities of pairwise relatedness at the A and B loci, respectively, and c_r the covariance of the probabilities of pairwise relatedness between A and B. If the three unknown parameters (κ̂₁₁, κ̂₁₀, and κ̂₀₁) are estimated, the probability of pairwise relatedness at a single locus (r_A and r_B) and its covariance among loci (c_r) can be calculated from the following equations:

Solution to equations (4a), (4b) and (4c) is r̂_A=κ̂₁₁+κ̂₁₀, r̂_B=κ̂₁₁+κ̂₀₁, and ĉ_r=κ̂₁₁κ̂₀₀−κ̂₁₀κ̂₀₁

From equation (4a), c_r can also be viewed as the kinship disequilibrium since it is expressed as the difference between the joint probability of two-loci IBD and the product of single-locus probability of IBD, analogous to the definition of ID (Vitalis and Couvet, 2001a). c_r may be negative when κ₁₁κ₀₀<κ₁₀κ₀₁, or positive when κ₁₁κ₀₀>κ₁₀κ₀₁. Theoretically, c_r is associated with the recombination fraction, or inversely proportional to the physical distance between the two linked loci. A smaller distance between two loci implies a stronger correlation of their relatedness. In order to make the correlation be comparable among different pairs of loci, the correlation coefficient of pairwise relatedness is defined as ĉ_r(r̂_Ar̂_B(1−r̂_A)(1−r̂_B))^−1/2 so that its value ranges from −1 to 1. This formula can also be proven using the general definition of statistical correlation (see also Hartl and Clark, 1989, pp 53–54).

Three-loci relatedness

Compared with the two-loci analysis, the advantage of a three-loci analysis is that it considers the event of double crossovers during meiosis and hence can give more precise estimates.

Haploid data

The preceding two-loci method can be extended to the three-loci case. Suppose that an additional marker C with n_C (≥2) alleles is linked to the A and B markers. The ordering of the three loci is unknown. Denote by κ₁₁₁ (0≤κ₁₁₁≤1) the joint probability that the two alleles at each of the three loci are IBD, κ₁₁₀ (0≤κ₁₁₀≤1) the joint probability that the two alleles at both A and B loci are IBD but the two alleles at the C locus are not IBD. The definitions of other parameters κ₁₀₁–κ₀₀₀ can be given in a similar way. The joint probability for a pair of three-loci gametes (A_uB_vC_w–A_u′B_v′C_w′; w, w′=1, …, n_C), denoted by P^uvw_u′v′w′, can be written in a general formula,

where o_w is the frequency of allele C_w in the population, and g_uvw and g_u′v′w′ are the frequencies of gametes A_uB_vC_w and A_u′B_v′C_w′, respectively. For an arbitrary number of alleles at each locus, the number of distinguishable three-loci gamete pairs, denoted by n_ABC, is shown to be equal to n_An_Bn_C(n_An_Bn_C+1)/2.

Diploid data

As in the two-loci case, denote by H_AC the probability that both A and C loci are heterozygous, H_BC the probability that both B and C loci are heterozygous, H_ABC the probability that the three loci are heterozygous, and H_O2 the probability that three loci are homozygous. With the assumption of random association of gametes, I obtain

The seven unknown parameters (κ₁₁₁–κ₀₀₁) can be solved using equations (6a), (6b), (6c), (6d), (6e), (6f), (6g) and (6h), provided that the frequencies of alleles and three-loci gametes are available with sufficient accuracy.

Correlation of pairwise relatedness

Denote by c_r1, c_r2, and c_r3 the covariances of pairwise relatedness between the A and B loci, the B and C loci, and the A and C loci, respectively. Let $e_{θ_{A} θ_{B} θ_{C}}$ (θ_A, θ_B, θ_C=0, 1) be the residual part of $κ_{θ_{A} θ_{B} θ_{C}} (\sum_{θ_{A}} \sum_{θ_{B}} \sum_{θ_{C}} κ_{θ_{A} θ_{B} θ_{C}} = 1)$ after the deduction of individual components of the covariances of two-loci relatedness. The residual part comes from the effects of double crossover among the three loci. Thus, $k_{θ_{A} θ_{B} θ_{C}}$ can be written in a general form,

There are eight configurations of the sequence of θ_Aθ_Bθ_C, that is, (111), (110), (101), (011), (100), (010), (001), and (000). The probabilities of relatedness at individual loci can be estimated by ${\hat{r}}_{A} = \sum_{θ_{B}} \sum_{θ_{C}} {\hat{k}}_{1 θ_{B} θ_{C}}$ , ${\hat{r}}_{B} = \sum_{θ_{A}} \sum_{θ_{C}} {\hat{k}}_{θ_{A} 1 θ_{C}}$ , and ${\hat{r}}_{C} = \sum_{θ_{A}} \sum_{θ_{B}} {\hat{k}}_{θ_{A} θ_{B} 1}$ . According to equation (7), the coefficients for the three covariances ( ${(- 1)}^{θ_{A} + θ_{B}}$ , ${(- 1)}^{θ_{B} + θ_{C}}$ , and ${(- 1)}^{θ_{A} + θ_{C}}$ ) are the same between the partitions of κ₁₁₁ and κ₀₀₀, κ₁₁₀ and κ₀₀₁, κ₁₀₁ and κ₀₁₀, and κ₀₁₁ and κ₁₀₀. Thus, I can only use the partitions of four three-loci relatedness values (κ₁₁₁, κ₁₁₀, κ_{1 01}, and κ₀₁₁) to estimate c_r1, c_r2, and c_r3. The analytic solution from the least-square method is given by

Denote by R₁, R₂, and R₃ the correlation coefficients of relatedness between the A and B loci, the B and C loci, and the A and C loci, respectively. Estimates of these three correlation coefficients are respectively given by

Monte Carlo simulation

Like previous studies of the pairwise relatedness at a single locus (eg Ritland, 1996; Lynch and Ritland, 1999; Wang, 2002), the aims of the simulations are to examine the effects of (i) sample size, (ii) allele frequency distribution, (iii) the type of data sets (haploid or diploid), and (iv) LD.

Statistical methods

The weighted LS and ML methods are used to estimate the correlation coefficient of pairwise relatedness. With the two-loci haploid genotypic data, estimates of pairwise relatedness with the weighted LS method can be written as

where κ̂ is the vector of (κ̂₁₁,κ̂₁₀,κ̂₀₁)′, 1 is the vector of $(1, 1, ..., 1)'_{n_{A B} \times 1}$ , X is the known coefficient matrix with n_AB × 3 elements calculated from equation (2), Y is the known vector ${(y_{u' v'}^{u v})}_{n_{A B}} \times 1$ in which

W is the known diagonal matrix with the diagonal element being w^uv_u′v′=1/P^uv_u′v′(1− P^uv_u′v′), and is the estimate of the mean of y^uv_u′v′.

For the ML method, the likelihood function is set as , where M^uv_u′v′ is the observed number of gamete pairs (A_uB_v−A_u′B_v′) in the random sample with N haploids . ML estimates are obtained through Newton–Raphson iteration. The estimate of κ̂ at the (t+1) step is iteratively calculated by

where s(κ) is the score vector, equal to , and is the Fisher information matrix. The above iterative calculation is continued until is sufficiently small (convergence).

When the estimates of pairwise relatedness are obtained, the correlation coefficients of pairwise relatedness are calculated according to ĉ_r(r̂_Ar̂_B(1−r̂_A) (1−r̂_B))^−1/2 for the two-loci case and equations (9a), (9b) and (9c) for the three-loci case. The correlation coefficients of pairwise relatedness in the other cases (two-loci diploid, three-loci haploid and diploid) can be estimated in a way similar to the above two approaches.

Data generation

The simulated samples with the haploid or diploid data are generated in the following steps. Given a set of parameters, including LD, the number of alleles, allele frequencies and the distribution type, and pairwise relatedness (κ₁₁–κ₀₀ in the two-loci case and κ₁₁₁–κ₀₀₀ in the three-loci case) calculate the probabilities for each two-gamete pair according to equation (2) for the haploid case, and the probabilities of each type of heterozygote according to equations (3a), (3b), (3c), (3d) and equations (6a), (6b), (6c), (6d), (6e), (6f), (6g) and (6h) for the diploid case. Then, use this probability distribution (multinomial distribution) to create random samples. It can be shown that a sample of N/2 diploids (or N haploids) can generate a total of N(N−1) gamete pairs for either the two- or the three-loci case. Simulation programs in C are available upon request.

In all, 5000 independent data sets are created, and each is used for estimating the correlation coefficients of pairwise relatedness according to the theories described in the preceding two sections. Means and standard deviations of estimates are calculated from these replicated data sets.

Results

With the two-loci haploid data and the weighted LS method, average estimates of the correlation coefficients of pairwise relatedness gradually become consistent with their actual values as the sample size increases (Figure 1a). The standard deviations decrease with the number of haploids (Figure 1b). However, there are large differences in the sample sizes required for obtaining appropriate estimates: 50 haploids for the two-allele case, 120 haploids for the four-allele case, and more than 180 haploids for the eight-allele case (Figure 1a).

A difference between the weighted LS and ML methods is that the ML method can give better estimates when the sample size is small. For example, when the number of haploids is greater than 40 for the four- or eight-allele case, an appropriate estimate of the correlation coefficient of pairwise relatedness can be obtained (Figure 2a, b).

The effects of the distribution of allele frequency (uniform versus triangular distribution; see Lynch and Ritland, 1999) on the LS or ML methods are very small. Although the triangular distribution produces a slightly greater standard deviation than does the uniform distribution (Figure 3), there are no differences in obtaining the unbiased average of the correlation coefficients of pairwise relatedness.

Unbiased estimates of the correlation coefficient of pairwise relatedness can be obtained when LD is present and the sample size is appropriate. For example, there are no significant differences when LD is changed from 0 to 0.2 in the two-allele case (Figure 4a, b). Also, no significant differences are observed between the weighted LS and ML methods.

With the two-loci diploid data, an unbiased estimate of the correlation coefficient of pairwise relatedness can be obtained with each of the two methods when an appropriate sample size is provided. Compared with the haploid case, the optimal sample size is smaller. For example, a good estimate can be obtained with sampling 40 diploids in the four-allele case (Figure 5). Both the weighted LS and ML methods have the same performance (Figure 5).

With the three-loci diploid data, the unbiased average estimates of the three correlation coefficients of pairwise relatedness can be simultaneously obtained when the sample size is appropriate. For example, when the sample size is more than 80 individuals, the three unknown parameters (R₁, R₂, and R₃) can be estimated with a good accuracy and precision in the case of four alleles per locus (Figure 6a–f). Both the weighted LS and the ML methods have a very similar performance.

Discussion

In this paper, I have shown that the correlation of pairwise relatedness among loci along chromosomes can be estimated from the approach of partitioning the joint probabilities of IIS into the probabilities of IBD at two or three loci. Such an approach of using two- or three-loci probabilities of IBD enables the estimation of individual pairwise relatedness and its correlation among loci simultaneously, and allows us to study the picture of ‘landscape’ relatedness along chromosomes and to infer the naturally occurring pattern of co-evolution. Unlike traditional kinship mapping, which pictures the ‘static’ relationships between the physical position of markers and IBD (eg Morton and Simpson, 1983), the map of the correlation of relatedness implies ‘dynamic’ relationships among linked loci.

When the linkage maps of all the markers assayed are available beforehand, the relatedness at each locus and the correlation of relatedness among linked loci can be readily mapped. Since the present method involves only one-generation data randomly sampled from the population with unknown pedigree, the recombination fraction between linked loci cannot be estimated. The physical linkage map of relatedness cannot be directly constructed. However, the following properties are likely applicable to constructing the map of pairwise relatedness and its correlation among loci. First, a significant difference of κ₁₁ from zero indicates that the two loci are likely to be linked. Second, the joint probability of two-loci IBD (κ₁₁) is negatively correlated with the mapping distance between the two loci, while the joint probabilities of one-locus IBD and one-locus non-IBD (κ₁₀ and κ₀₁) are positively correlated with mapping distance. Thus, a larger κ₁₁ and a smaller κ₁₀ or κ₀₁ indicate a shorter distance between the two loci. Third, a larger positive correlation of relatedness indicates a shorter physical distance, while a larger negative correlation of relatedness indicates a larger physical distance. These properties can be combined for ordering markers. Among all possible linkage maps for a given set of markers, the optimal one should have the largest sum of all κ₁₁'s among adjacent markers.

The analytical formulae presented here are only suitable for the case of two- and three-loci relatedness, where each locus has an arbitrary number of alleles. When many loci on a chromosome (more than three loci) are analyzed, these loci can be analyzed in terms of two- or three-loci as a unit, similar to the procedure of classical linkage mapping analysis. The individual two-or three-loci results are then jointly analyzed to map the correlation of pairwise relatedness.

Statistically, the critical problem for the weighted LS method with haploid data is that the number of distinguishable pairs is substantially increased with the number of alleles. Only when the number of sampled haplotype pairs is much greater than the number of distinguishable haplotype pairs can an appropriate estimate be obtained. Thus, the optimal sample size varies with the number of alleles. The present simulations suggest sample sizes of 100–200 haploids for moderate numbers of alleles, the lower bound for the two-allele case, the upper bound for the eight-allele case. However, such situations are not often met in the single locus case (Ritland, 1996; Lynch and Ritland, 1999; Wang, 2002), where the sample size is much larger than the expected number of distinguishable two-gene pairs.

Compared with the haploid case, the weighted LS method with the diploid data has a better performance, and the optimal sample size is also smaller (eg 40 diploids for two loci, each with four to eight alleles). The reason for the better performance of the weighted LS method with the diploid than with the haploid data is a small number of ‘units of observations’ – the heterozygous types, that is, four in the two-loci case and eight in the three-loci case. These ‘condensed’ variables contain all possible sampled haplotype pairs, and display a ‘robust’ property even when the sample size is small. Clearly, the ML method for the haploid data is suggested when highly polymorphic markers (eg, ≥4 alleles per locus) and a small sample size are used. However, either the weighted LS or ML method can be applied with diploid data.

The advantage of the ML over the weighted LS method with the haploid data is that all information is utilized, including different sampling variances for individual pairs and the correlation between different pairs. This can be seen from the Fisher information matrix (F(κ)). One of the assumptions underlying the weighted LS method is the independence among different observations of distinguishable gamete-pairs, which is actually violated.

Another striking result is the same performance for the ML method with either the haploid or diploid data. The condensed H variable does not affect the accuracy and precision of estimation, compared with the analysis with the haploid data. The reason for such robust behavior is that the score function (s(κ)) and the Fisher information matrix (F(κ)) are essentially the same with either approach under the assumption of random association of gametes, and this can be shown algebraically. However, the advantage of using diploid over haploid data is significant in practice.

There are two distinctions between the present two-loci haplotype approach and the previous ‘four-gene’ pairs at a single locus in pairwise relatedness analysis. There are three parameters in the former (κ₁₁, κ₁₀, and κ₀₁), but only two parameters in the latter (eg Ritland, 1996; Lynch and Ritland, 1999). There are 10 informative pairs in the former for a four-gene case (two alleles per locus), but six in the latter (four alleles per locus). Such distinctions imply that a larger sample size is required in the two-loci analysis, compared with the case of a single locus.

Finally, one must acknowledge the assumptions underlying the present method of estimating the correlation of pairwise relatedness. First, the method is based on the assumption of the availability of accurate and precise estimates of gamete and allele frequencies (eg Ritland, 1996; Lynch and Ritland, 1999; Ritland, 2000). In practice, gamete and allele frequencies will probably be estimated from the same data sets used for relatedness analysis. Clearly, biased estimates of these frequencies can bring about biased estimates of the correlation coefficients of pairwise relatedness. Second, the present diploid approach is based on the assumption of random association of gametes. Extension to partially selfing populations is clearly needed in future study, as selfing populations show greater linkage disequilibria (Wright, 1969) and likely enhance the correlation of pairwise relatedness along chromosomes.

References

Brookes AJ (1999). The essence of SNPs. Gene 234: 177–186.
Article CAS PubMed Google Scholar
Charlesworth B, Morgan MT, Charlesworth D (1993). The effect of deleterious mutations on neutral molecular variation. Genetics 134: 1289–1303.
CAS PubMed PubMed Central Google Scholar
Cotterman CW (1940). A calculus for statistico-genetics. Unpublished thesis, Ohio State University, Columbus, OH.
Google Scholar
Dempster AP, Laird NM, Bubin DB (1977). Maximum likelihood from incomplete data via EM algorithm. J R Stat Soc Ser B 39: 1–38.
Google Scholar
Falconer DS, Mackay TFC (1996). Introduction to Quantitative Genetics, 3rd edn. Longman Sci and Tech: Harlow, UK.
Google Scholar
Hartl DL, Clark AG (1989). Principles of Population Genetics, 2nd edn. Sinauer Associates Inc.: Sunderland.
Google Scholar
Hill WG (1974). Estimation of linkage disequilibrium in randomly mating populations. Heredity 33: 229–239.
Article CAS PubMed Google Scholar
Hudson RR (1991). Gene genealogies and the coalescent process. In: Futuyma D, Antonovics J (eds) Oxford Surveys in Evolutionary Biology, Vol. 7. Oxford Unversity Press: Oxford, pp 1–44.
Google Scholar
Jacquard A (1974). The Genetic Structure of Populations. In: Krickeberg K, Lewontin RC, Neyman J, Schreiber M (eds) Biomathematics. Springer-Verlag: Berlin. Vol. 5.
Google Scholar
Kalinowski ST, Hedrick PW (2001). Estimation of linkage disequilibrium for loci with multiple alleles: basic approach and application using data from bighorn sheep. Heredity 87: 698–708.
Article CAS PubMed Google Scholar
Laurie C, Weir BS (2003). Dependency effects in multi-locus match probabilities. Theor Pop Biol 63: 207–219.
Article Google Scholar
Lynch M (1988). Estimation of relatedness by DNA fingerprinting. Mol Biol Evol 5: 584–599.
CAS PubMed Google Scholar
Lynch M, Ritland K (1999). Estimation of pairwise relatedness with molecular markers. Genetics 152: 1753–1766.
CAS PubMed PubMed Central Google Scholar
Maynard Smith J, Haigh J (1974). The hitch-hiking effect of a favorable gene. Genet Res 23: 23–35.
Article Google Scholar
Milligan BG (2003). Maximum-likelihood estimation of relatedness. Genetics 163: 1153–1167.
PubMed PubMed Central Google Scholar
Morton NE, Simpson SP (1983). Kinship mapping of multilocus systems. Hum Genet 64: 103–104.
Article CAS PubMed Google Scholar
Morton NE, Yee S, Harris DE, Lew R (1971). Bioassay of kinship. Theor Pop Biol 2: 507–524.
Article CAS Google Scholar
Pamilo P, Crozier RH (1982). Measuring genetic relatedness in natural populations: methodology. Theor Pop Biol 21: 171–193.
Article Google Scholar
Queller DC, Goodnight KF (1989). Estimating relatedness using genetic markers. Evolution 43: 258–275.
Article PubMed Google Scholar
Ritland K (1996). Estimators for pairwise relatedness and individual inbreeding coefficients. Genet Res 67: 175–185.
Article Google Scholar
Ritland K (2000). Marker-inferred relatedness as a tool for detecting heritability in nature. Mol Ecol 9: 1195–1204.
Article CAS PubMed Google Scholar
Rousset F (2002). Inbreeding and relatedness coefficients: what do they measure? Heredity 88: 371–380.
Article CAS PubMed Google Scholar
Thompson EA (1975). The estimation of pairwise relationships. Ann Hum Genet 39: 173–188.
Article CAS PubMed Google Scholar
Vitalis R, Couvet D (2001a). Two-locus identity probabilities and identity disequilibrium in a partially selfing subdivided population. Genet Res 77: 67–81.
Article CAS PubMed Google Scholar
Vitalis R, Couvet D (2001b). Estimation of effective population size and migration rate from one- and two-locus identity measures. Genetics 157: 911–925.
CAS PubMed PubMed Central Google Scholar
Wang J (2002). An estimator for pairwise relatedness using molecular markers. Genetics 160: 1203–1215.
CAS PubMed PubMed Central Google Scholar
Weir BS (1996). Genetic Data Analysis II. Sinauer Associates Inc.: Sunderland.
Google Scholar
Whitlock MC, Phillips PC, Wade MJ (1993). Gene interaction affects the additive genetic variance in subdivided populations with migration and extinction. Evolution 47: 1758–1769.
Article PubMed Google Scholar
Wright S (1922). Coefficients of inbreeding and relationship. Am Nat 56: 330–338.
Article Google Scholar
Wright S (1969). Evolution and the Genetics of Populations, Vol. 2. The Theory of Gene Frequencies. University of Chicago Press: Chicago.
Google Scholar
Yang RC (2002). Analysis of multilocus zygotic associations. Genetics 161: 435–445.
CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

I sincerely appreciate John Brookfield and three anonymous referees for very constructive comments and editorial assistance, and Kermit Ritland for useful comments and discussions during the preparation of this article.

Author information

Authors and Affiliations

Department of Forest Sciences, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z4
X-S Hu

Authors

X-S Hu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to X-S Hu.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Hu, XS. Estimating the correlation of pairwise relatedness along chromosomes. Heredity 94, 338–346 (2005). https://doi.org/10.1038/sj.hdy.6800586

Download citation

Published: 08 September 2004
Issue Date: 01 March 2005
DOI: https://doi.org/10.1038/sj.hdy.6800586

Estimating the correlation of pairwise relatedness along chromosomes

Abstract

Similar content being viewed by others

Rank-invariant estimation of inbreeding coefficients

Effect of heterogeneity in recombination rate on variation in realised relationship

Scalable bias-corrected linkage disequilibrium estimation under genotype uncertainty