Abstract
The two alleles an individual carries at a locus are identical by descent (ibd) if they have descended from a single ancestral allele in a reference population, and the probability of such identity is the inbreeding coefficient of the individual. Inbreeding coefficients can be predicted from pedigrees with founders constituting the reference population, but estimation from genetic data is not possible without data from the reference population. Most inbreeding estimators that make explicit use of sample allele frequencies as estimates of allele probabilities in the reference population are confounded by average kinships with other individuals. This means that the ranking of those estimates depends on the scope of the study sample, and we show the variation in rankings for common estimators applied to different subdivisions of 1000 Genomes data. Allele-sharing estimators of within-population inbreeding relative to average kinship in a study sample, however, do have invariant rankings across all studies including those individuals. They are unbiased with a large number of SNPs. We discuss how allele-sharing estimates are the relevant quantities for a range of empirical applications.
Introduction
Allelic dependence at a locus is usually quantified by inbreeding coefficients for individuals or populations, with these measures referring either to correlations of allelic state indicators (Wright, 1922) or to probabilities of identity by descent, ibd (Malécot, 1948). Here we use ibd, and we have advocated allele-sharing estimators (Weir & Goudet, 2017, WG17 henceforth; Goudet et al., 2018) that are unbiased for individual and population inbreeding coefficients relative to average kinships among specified pairs of individuals. Estimators such as those in PLINK (Purcell et al., 2007) and GCTA (Yang et al., 2011), which use sample allele frequencies, confound inbreeding estimates with the averages of individual kinships. Our work recognizes the need to estimate inbreeding coefficients from many millions of SNP genotypes, where likelihood methods may not be feasible, and we employ moment-based methods.
There have been many published accounts of inbreeding estimation, including the recent evaluation of several methods by Alemu et al. (2021). Among those that refer to allele sharing, Li & Horvitz (1953) discussed an inbreeding estimator based on observed homozygosity, i.e., within-individual sharing of maternal and paternal alleles. They compared observed sharing to the value expected without inbreeding. They also constructed an estimator from the proportions of each allele type that were homozygous in a sample and gave an expression that was investigated further by Ritland (1996). Ritland used allele sharing within and between individuals and his inbreeding estimates assumed “independence or near-independence” of individuals. If individuals are not independent, the rankings of his inbreeding coefficient estimates change with the sample. In WG17 we estimated inbreeding coefficients by comparing within-individual allele sharing to average sharing between pairs of individuals in a sample. By not making explicit use of sample allele frequencies, we preserved the ranking of estimates across different samples, and this is our central theme here.
Ritland’s individual-level inbreeding coefficients were also derived by Yang et al. (2011) as the correlation between uniting gametes and were expressed in terms of allele dosages for an individual and sample allele frequencies. This estimator was written as \(\hat{F}_{\rm{UNI}}\) in Yengo et al. (2017), and is less biased than the estimator in Yang et al. (2011) obtained from the diagonal elements of a genomic relationship matrix (GRM) of VanRaden (2008). We compare these two estimates below with allele-sharing and other methods: pedigree-based path-counting (Wright, 1922), maximum-likelihood estimation, MLE (e.g., Hall et al., 2012), and runs of homozygosity, ROH (e.g., Ceballos et al., 2018).
Methods
Statistical sampling
We can describe the dependence between pairs of uniting alleles in a single population without invoking an evolutionary model for the history of the population. In this “statistical sampling” framework (Weir, 1996) we do not consider the variation associated with evolutionary processes but we do consider the variation among samples from the same population. Although extensive sets of genetic data allow individual-level inbreeding coefficients to be estimated with high precision, we start with population-level estimation.
Allelic dependencies can be quantified with the within-population inbreeding coefficient, written here as f_{W} to emphasize it is a within-population quantity, defined by
\[{H}_{l}=2{p}_{l}(1-{p}_{l})(1-{f}_{W}) \tag{1}\]
where H_{l} is the population proportion of heterozygotes for the reference allele at SNP l and p_{l} is the population proportion of that allele. The same value of f_{W} is assumed to apply for all SNPs. An immediate consequence of this definition is that the population proportions of homozygotes for the reference and alternative alleles are \({p}_{l}^{2}+{p}_{l}(1-{p}_{l}){f}_{W}\) and \({(1-{p}_{l})}^{2}+{p}_{l}(1-{p}_{l}){f}_{W}\) respectively. This formulation allows f_{W} to be negative, with the maximum of −p_{l}/(1 − p_{l}) and −(1 − p_{l})/p_{l} as lower bound. It is bounded above by 1. Hardy–Weinberg equilibrium, HWE, corresponds to f_{W} = 0, and textbooks (e.g., Hedrick, 2000) point out that negative values of f_{W} indicate more heterozygotes than expected under HWE.
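As a small numerical illustration of these genotype proportions and of the lower bound on f_{W}, the following Python sketch may help. The function name `genotype_proportions` is hypothetical and is introduced only for this example; the heterozygote proportion is obtained as the remainder after the two homozygote proportions.

```python
def genotype_proportions(p, f):
    """Genotype proportions (ref/ref, het, alt/alt) for allele proportion p and f_W = f.

    Illustrative helper, not taken from the paper. Requires 0 < p < 1.
    """
    # Lower bound on f_W: the larger of -p/(1-p) and -(1-p)/p.
    lower = max(-p / (1 - p), -(1 - p) / p)
    if not lower <= f <= 1:
        raise ValueError(f"f must lie in [{lower:.4f}, 1] when p = {p}")
    pq = p * (1 - p)
    hom_ref = p * p + pq * f
    hom_alt = (1 - p) ** 2 + pq * f
    het = 2 * pq * (1 - f)  # the remainder, so the three proportions sum to one
    return hom_ref, het, hom_alt


print(genotype_proportions(0.5, 0.0))   # Hardy-Weinberg proportions
print(genotype_proportions(0.5, -1.0))  # lower bound at p = 0.5: all heterozygotes
```

At p_{l} = 0.2 the bound is −0.25, so a call such as `genotype_proportions(0.2, -0.5)` is rejected.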
Observed heterozygote proportions \({\tilde{H}}_{l}\) have H_{l} as within-population expectation \(\mathcal{E}_{W}\) over samples from the study population, \(\mathcal{E}_{W}({\tilde{H}}_{l})={H}_{l}\), and this would provide a simple estimator of f_{W} if the population allele proportions were known. In practice, however, these proportions are unknown. Steele et al. (2014) suggested use of data external to the study sample to provide reference allele proportions in forensic applications where a reference database is used for making inferences about the population relevant for a particular crime. The more usual approach is to use study sample proportions \({\tilde{p}}_{l}\) in place of the true proportions p_{l}, as in equation 1 of Li & Horvitz (1953):
\[\hat{f}_{\rm{HOM}}=1-\frac{{\tilde{H}}_{l}}{2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})} \tag{2}\]
The moment estimator in Eq. (2) is also an MLE of f_{W} when only one locus is considered, but it is biased (Robertson & Hill, 1984) since not only is it a ratio of statistics but also the expected value \(\mathcal{E}_{W}[2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})]\) over repeated samples of n from the population is 2p_{l}(1 − p_{l})[1 − (1 + f_{W})/(2n)] (e.g., Weir, 1996, p39).
This approach can be used to estimate the within-population inbreeding coefficient f_{j} for each individual j in a sample from one population. These are the “simple” estimators of Hall et al. (2012) and the \(\hat{f}_{\rm{HOM}_{j}}\) of Yengo et al. (2017):
\[\hat{f}_{\rm{HOM}_{j}}=1-\frac{{\tilde{H}}_{jl}}{2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})} \tag{3}\]
The sample heterozygosity indicator \({\tilde{H}}_{jl}\) is one if individual j is heterozygous at SNP l and is zero otherwise. Averaging Eq. (3) over individuals gives the estimator based on SNP l in Eq. (2).
A single SNP provides estimates that are either 1 or a negative value depending on \({\tilde{p}}_{l}\), so many SNPs are used in practice. In both Hall et al. (2012) and Yengo et al. (2017) data were combined over loci as weighted or “ratio of averages” estimators:
\[\hat{f}_{\rm{HOM}_{j}}=1-\frac{\sum_{l=1}^{L}{\tilde{H}}_{jl}}{\sum_{l=1}^{L}2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})} \tag{4}\]
Gazal et al. (2014) referred to this estimator as f_{PLINK} as it is an option in PLINK. We show below the good performance of this weighted estimator for large sample sizes and large numbers of loci. We will consider throughout that a large number L of SNPs are used so that ratios of sums of statistics over loci, such as in Eq. (4), have expected values equal to the ratio of expected values of their numerators and denominators. Ochoa & Storey (2021) showed that statistics of the form \({\tilde{A}}_{L}/{\tilde{B}}_{L}\), where \({\tilde{A}}_{L}=\sum_{l = 1}^{L}{a}_{l}/L\) and \({\tilde{B}}_{L}=\sum_{l = 1}^{L}{b}_{l}/L\), have expected values that converge almost surely to the ratio A/B when \(\mathcal{E}_{W}({\tilde{A}}_{L})=A{c}_{L}\) and \(\mathcal{E}_{W}({\tilde{B}}_{L})=B{c}_{L}\). This result rests on the expectations \(\mathcal{E}_{W}({a}_{l})=A{c}_{l}\) and \(\mathcal{E}_{W}({b}_{l})=B{c}_{l}\) with \({c}_{L}=\sum_{l = 1}^{L}{c}_{l}/L\). It requires ∣a_{l}∣, ∣b_{l}∣ to both be no greater than some finite quantity C, c_{L} to converge to a finite value c as L increases, and Bc not to be zero. For the ratio in Eq. (4), \({a}_{l}={\tilde{H}}_{jl}\), \({b}_{l}=2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})\) so A = (1 − f_{j}), B = 1 for large sample sizes n, and c_{L} = ∑_{l}2p_{l}(1 − p_{l})/L ≤ 1/2. The conditions are satisfied providing at least one SNP is polymorphic. For an “average of ratios” estimator of the form \(\sum_{l = 1}^{L}({a}_{l}/{b}_{l})/L\), the denominators b_{l} can be very small and convergence of its expected value is not assured.
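The weighted estimator can be sketched in a few lines of Python, taking Eq. (4) to be one minus the ratio of an individual's heterozygous SNP count to the sum over SNPs of \(2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})\). The function name `f_hom_weighted` and the toy dosage matrix are assumptions made for illustration; this is not the PLINK implementation.

```python
def f_hom_weighted(X):
    """Ratio-of-averages estimate of f_j for each row of a dosage matrix X.

    X[j][l] is the reference-allele dosage (0, 1 or 2) of individual j at SNP l.
    Illustrative sketch only.
    """
    n, L = len(X), len(X[0])
    # Sample allele proportions, one per SNP.
    p = [sum(row[l] for row in X) / (2 * n) for l in range(L)]
    # Shared denominator: sum over SNPs of 2 p (1 - p).
    denom = sum(2 * pl * (1 - pl) for pl in p)
    # Numerator: number of heterozygous SNPs for each individual.
    return [1 - sum(1 for x in row if x == 1) / denom for row in X]


X = [[0, 1, 2], [1, 1, 0], [2, 1, 1], [1, 0, 1]]  # toy data: 4 individuals, 3 SNPs
print(f_hom_weighted(X))
```

A monomorphic SNP adds zero to both the numerator and the denominator, so it leaves the weighted estimate unchanged, whereas an average of per-SNP ratios would require dividing by zero at that SNP.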
As an alternative to using sample allele frequencies, Hall et al. (2012) used maximum likelihood to estimate population allele proportions for multiple loci whereas Ayres & Balding (1998) used Markov chain Monte Carlo methods in a Bayesian approach that integrated out the allele proportion parameters. Neither of those papers considered data of the size we now face in sequence-based studies of many organisms, and we doubt the computational effort to estimate, or integrate over, hundreds of millions of allele proportions in Eqs. (2) or (4) adds much value to inferences about f. The allele-sharing estimators we describe below regard allele probabilities as unknown nuisance parameters and we show how to avoid estimating them or assigning them values.
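Hall et al. (2012) used an EM algorithm to find MLEs for f_{j} when population allele proportions were regarded as being known and equal to sample proportions. Alternatively, a grid search can be conducted over the range of validity for the single parameter f_{j} that maximizes the log-likelihood
\[\ln L({f}_{j})=\sum_{l=1}^{L}\ln {P}_{jl}({f}_{j}),\qquad {P}_{jl}({f}_{j})=\begin{cases}{\tilde{p}}_{l}^{2}+{\tilde{p}}_{l}(1-{\tilde{p}}_{l}){f}_{j} & \text{reference homozygote}\\ 2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})(1-{f}_{j}) & \text{heterozygote}\\ {(1-{\tilde{p}}_{l})}^{2}+{\tilde{p}}_{l}(1-{\tilde{p}}_{l}){f}_{j} & \text{alternative homozygote}\end{cases}\]
where P_{jl} is the probability of the genotype observed for individual j at SNP l.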
Estimation of the within-population inbreeding coefficients f_{W} (F_{IS} of Wright, 1922) and f_{j} does not require any information beyond genotype proportions in samples from a study population, nor does it make any assumptions about that population or the evolutionary forces that shaped the population. The coefficients are simply measures of dependence of pairs of alleles within individuals.
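The grid search over f_{j} mentioned above can be sketched as follows. This is a minimal illustration, assuming the genotype probabilities are the homozygote and heterozygote proportions given earlier with sample allele proportions substituted for p_{l}; the function names and the default grid resolution are hypothetical choices, not those of Hall et al. (2012).

```python
import math


def log_likelihood(f, dosages, p):
    """Log-likelihood of f for one individual, given per-SNP allele proportions p."""
    ll = 0.0
    for x, pl in zip(dosages, p):
        pq = pl * (1 - pl)
        if pq == 0:            # monomorphic SNP carries no information
            continue
        if x == 2:
            prob = pl * pl + pq * f
        elif x == 1:
            prob = 2 * pq * (1 - f)
        else:
            prob = (1 - pl) ** 2 + pq * f
        if prob <= 0:          # f is outside, or on the edge of, its valid range
            return -math.inf
        ll += math.log(prob)
    return ll


def grid_mle(dosages, p, steps=2000):
    """Grid search for the f maximizing the log-likelihood over its valid range."""
    lower = max(max(-pl / (1 - pl), -(1 - pl) / pl) for pl in p if 0 < pl < 1)
    grid = [lower + (1 - lower) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda f: log_likelihood(f, dosages, p))


# Toy example: 10 SNPs with allele proportion 0.5, three of them heterozygous.
print(grid_mle([1, 1, 1, 2, 2, 0, 0, 2, 0, 2], [0.5] * 10))
```

For this toy case the likelihood is proportional to (1 + f)^7 (1 − f)^3, whose maximum is at f = 0.4, and the grid search returns the grid point nearest that value.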
Genetic sampling
Inbreeding parameters of most interest in genetic studies are those that recognize the contribution of previous generations to inbreeding in the present study population. This requires accounting for “genetic sampling” (Weir, 1996) between generations, thereby leading to an ibd interpretation of inbreeding: ibd alleles descend from a single allele in a reference population. It also allows the prediction of inbreeding coefficients by path counting when pedigrees are known (Wright, 1922). If individual J is ancestral to both individuals \(j^{\prime}\) and j″, and if there are n individuals in the pedigree path joining \(j^{\prime}\) to j″ through J, then F_{j} = ∑(0.5)^{n}(1 + F_{J}) where F_{J} is the inbreeding coefficient of ancestor J and F_{j} is the inbreeding coefficient of offspring j of parents \(j^{\prime}\) and j″. The sum is over all ancestors J and all paths joining \(j^{\prime}\) to j″ through J. The expression is also the coancestry \({\theta }_{j^{\prime} j^{\prime\prime} }\) of \(j^{\prime}\) and j″: the probability an allele drawn randomly from \(j^{\prime}\) is ibd to an allele drawn randomly from j″.
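Pedigree-based coancestries of this kind can be computed in a few lines. The sketch below uses the standard recursive (tabular) computation rather than explicit enumeration of paths; the two give the same values when founders are taken to be non-inbred and unrelated. The pedigree `PED`, its individual labels and the function names are hypothetical, and the dictionary is assumed to list parents before their offspring.

```python
from functools import lru_cache

# Hypothetical pedigree: individual -> (parent 1, parent 2); founders have no parents.
# E is the offspring of full sibs C and D.
PED = {
    "A": (None, None),
    "B": (None, None),
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "D"),
}
ORDER = {name: k for k, name in enumerate(PED)}  # parents precede offspring


@lru_cache(maxsize=None)
def coancestry(a, b):
    """Probability that alleles drawn at random from a and from b are ibd."""
    if a is None or b is None:      # unknown parent of a founder
        return 0.0
    if a == b:                      # self-coancestry is (1 + F_a) / 2
        return 0.5 * (1.0 + coancestry(*PED[a]))
    if ORDER[a] < ORDER[b]:         # recurse through the younger individual
        a, b = b, a
    pa, pb = PED[a]
    return 0.5 * (coancestry(pa, b) + coancestry(pb, b))


def inbreeding(j):
    """F_j is the coancestry of the parents of j."""
    return coancestry(*PED[j])


print(inbreeding("E"))  # two paths of three individuals: 2 * 0.5**3
```

For E, path counting gives the same answer: the paths C-A-D and C-B-D each contain three individuals and the common ancestors are not inbred, so F_{E} = 2(0.5)^{3} = 0.25.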
The allele proportion p_{l} in a study population has expectation π_{l} over evolutionary replicates of the population from an ancestral reference population to the present time. Sample allele proportions \({\tilde{p}}_{l}\) provide information about the population proportions p_{l}, and their statistical sampling properties follow from the binomial distribution. We do not invoke a specific genetic sampling distribution for the p_{l} about their expectations π_{l} although we do assume the second moments of that distribution depend on probabilities of ibd for pairs of alleles. One consequence of the assumed moments is that the probability of individual j in the study sample being heterozygous, i.e., the total expected value \(\mathcal{E}_{T}\) of the heterozygosity indicator over replicates of the history of that individual, is
\[\mathcal{E}_{T}({\tilde{H}}_{jl})=2{\pi }_{l}(1-{\pi }_{l})(1-{F}_{j}) \tag{5}\]
The quantity F_{j} is the individual-specific version of F_{IT} of Wright (1922) and we can regard it as the probability the two alleles at any locus for individual j are ibd. There is an implicit assumption in Eq. (5) that the reference population needed to define ibd is infinite and in HWE: there is probability F_{j} that j has homologous alleles with a single ancestral allele in that population and probability (1 − F_{j}) of j having homologous alleles with distinct ancestral alleles there. In the first case, the single ancestral allele has probability π of being the reference allele for that locus; in the second, the implicit assumption is that two ancestral alleles are both the reference type with probability π^{2}. This does not mean there is an actual ancestral population with those properties, any more than use of \(\mathcal{E}_{T}\) means there are actual replicates of the history of any population or individual, and we note that Eq. (5) does not allow higher heterozygosity than predicted by HWE. Nonetheless, the concept of ibd allows theoretical constructions of great utility and we now present a framework for approaching empirical situations.
Inbreeding, or ibd, implies a common ancestral origin for uniting alleles and statements about sample allele proportions \({\tilde{p}}_{l}\) require consideration of possible ibd for other pairs of alleles in the sample. The total expectation of \(2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})\) over samples from the population and over evolutionary replicates of the study population is (Weir, 1996, p176)
\[\mathcal{E}_{T}[2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})]=2{\pi }_{l}(1-{\pi }_{l})\left[(1-{\theta }_{S})-\frac{1}{2n}(1+{F}_{W}-2{\theta }_{S})\right] \tag{6}\]
where F_{W} is the parametric inbreeding coefficient averaged over sample members, \({F}_{W}=\sum_{j = 1}^{n}{F}_{j}/n\), and θ_{S} is the average parametric coancestry in the sample, \({\theta }_{S}=\sum_{j = 1}^{n}\sum_{j^{\prime} \ne j}{\theta }_{jj^{\prime} }/[n(n-1)]\). Equivalent expressions were given by McPeek et al. (2004) and DeGiorgio and Rosenberg (2009). We note the relationship f_{W} = (F_{W} − θ_{S})/(1 − θ_{S}) given by Wright (1922) and we showed in WG17 the equivalent expression f_{j} = (F_{j} − θ_{S})/(1 − θ_{S}) for individual-specific values (θ_{S} is Wright’s F_{ST}).
For a large number of SNPs, the expectation of a ratio estimator of the type considered here is the ratio of expectations (Ochoa & Storey, 2021). Therefore, the total expectations of the \(\hat{f}_{\rm{HOM}_{j}}\), taking into account both statistical and genetic sampling, are
\[\mathcal{E}_{T}(\hat{f}_{\rm{HOM}_{j}})=1-\frac{1-{F}_{j}}{(1-{\theta }_{S})-\frac{1}{2n}(1+{F}_{W}-2{\theta }_{S})}=\frac{{f}_{j}-\frac{1}{2n}(1+{f}_{W})}{1-\frac{1}{2n}(1+{f}_{W})} \tag{7}\]
For all sample sizes, \(\hat{f}_{\rm{HOM}_{j}}\) has an expected value less than the true value f_{j}, with the bias being of the order of 1/n. The ranking of \(\mathcal{E}_{T}(\hat{f}_{\rm{HOM}_{j}})\) values, however, is the same as the ranking of the f_{j} and, therefore, of the F_{j}. For large sample sizes, Eq. (7) reduces to \(\mathcal{E}_{T}(\hat{f}_{\rm{HOM}_{j}})={f}_{j}\). Averaging over individuals shows that \(\mathcal{E}_{T}(\hat{f}_{\rm{HOM}})={f}_{W}\): the population-level estimator in Eq. (2) has total expectation of f_{W}, not F_{W}.
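The size of this bias and the preservation of rankings can be illustrated numerically. The sketch below assumes the expectation takes the form 1 − (1 − f_{j})/[1 − (1 + f_{W})/(2n)], which is what results from dividing the expected heterozygosity indicator by the expected value of \(2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})\) quoted earlier; the function name `expected_f_hom` is hypothetical.

```python
def expected_f_hom(f_j, f_w, n):
    """Large-L expectation of the weighted f_HOM estimate for individual j.

    f_j: within-population inbreeding coefficient of the individual
    f_w: average within-population inbreeding coefficient in the sample
    n:   number of individuals in the sample
    """
    c = (1 + f_w) / (2 * n)   # sample-size correction, of order 1/n
    return 1 - (1 - f_j) / (1 - c)


for n in (10, 100, 10_000):
    # Two individuals with f_j = 0.10 and 0.20 in a sample with f_W = 0.
    print(n, expected_f_hom(0.10, 0.0, n), expected_f_hom(0.20, 0.0, n))
```

Both expectations sit below the true values and approach them as n grows, and the individual with the larger f_{j} keeps the larger expectation at every sample size.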
A different outcome is found for the \(\hat{f}_{\rm{UNI}_{j}}\) estimator of Yengo et al. (2017) (i.e., \({\hat{f}}^{III}\) of Yang et al. (2011); \(\hat{f}_{\rm{GCTA}3}\) of Gazal et al. (2014)). This estimator, with the weighted (w) ratio of averages over loci we recommend, as opposed to the unweighted (u) average of ratios over loci used in their papers, is
\[\hat{f}_{\rm{UNI}_{j}}^{w}=\frac{\sum_{l=1}^{L}\left[{X}_{jl}^{2}-(1+2{\tilde{p}}_{l}){X}_{jl}+2{\tilde{p}}_{l}^{2}\right]}{\sum_{l=1}^{L}2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})} \tag{8}\]
In this equation X_{jl} is the reference allele dosage, the number of copies of the reference allele, at SNP l for individual j. It is equivalent to the estimator given by Ritland (1996, eq. 5) and attributed by him to Li & Horvitz (1953).
Ochoa & Storey (2021) showed that \(\hat{f}_{\rm{UNI}_{j}}^{w}\) has expectation, for a large number of SNPs and a large sample size, of
\[\mathcal{E}_{T}(\hat{f}_{\rm{UNI}_{j}}^{w})=\frac{{F}_{j}-2{\Psi }_{j}+{\theta }_{S}}{1-{\theta }_{S}}={f}_{j}-2{\psi }_{j} \tag{9}\]
where Ψ_{j} is the average coancestry of individual j with other members of the study sample: \({\Psi }_{j}=\sum_{j^{\prime} = 1,j^{\prime} \ne j}^{n}{\theta }_{jj^{\prime} }/(n-1)\). We term ψ_{j} = (Ψ_{j} − θ_{S})/(1 − θ_{S}) the within-population individual-specific average kinship coefficient. The Ψ_{j} have an average of θ_{S} over members of the sample, so the average of the ψ_{j}’s is zero and the expected value of the average of the \(\hat{f}_{\rm{UNI}_{j}}^{w}\) is f_{W}, as is the case for \(\hat{f}_{\rm{AS}_{j}}\) below.
Equation (9) shows that the \({\hat{f}}_{{{{{{{\rm{UNI}}}}}}}_{j}}^{w}\) have expected values with the same ranking as the F_{j} values only if every individual j in the sample has the same average kinship ψ_{j} with other sample members.
Finally, we mention another common estimator described by VanRaden (2008), termed f_{GCTA1} by Gazal et al. (2014) and available from the GCTA software (Yang et al., 2011) with option --ibc. We referred to this as the “standard” estimator in WG17. The weighted version for multiple loci is
\[\hat{f}_{\rm{STD}_{j}}^{w}=\frac{\sum_{l=1}^{L}{({X}_{jl}-2{\tilde{p}}_{l})}^{2}}{\sum_{l=1}^{L}2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})}-1 \tag{10}\]
and it has the large-sample expectation of (f_{j} − 4ψ_{j}) as is implied by WG17 (Eq. 13) and as was given by Ochoa & Storey (2021). We summarize the various measures of inbreeding and coancestry in Table 1, and we include sample sizes in the expectations shown in Table 2.
The \(\hat{f}_{\rm{HOM}}\), \(\hat{f}_{\rm{UNI}}\), \(\hat{f}_{\rm{STD}}\) and \(\hat{f}_{\rm{MLE}}\) estimators of individual or population inbreeding coefficients make explicit use of sample allele proportions. This means that all four have small-sample biases, and none of the four provide estimates of the ibd quantities F or F_{j}. We showed that \(\hat{f}_{\rm{HOM}}\) is actually estimating the within-population inbreeding coefficients: the total inbreeding coefficients relative to the average coancestry of pairs of individuals in the sample, but \(\hat{f}_{\rm{UNI}}\) and \(\hat{f}_{\rm{STD}}\) are estimating expressions that also involve average kinships ψ.
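A compact sketch of the two weighted estimators may make the comparison concrete. It assumes the per-locus numerators are X² − (1 + 2p̃)X + 2p̃² for the UNI estimator and (X − 2p̃)² for the standard estimator, each summed over SNPs and divided by the sum of 2p̃(1 − p̃), with one subtracted from the standard ratio. The function names and toy data are hypothetical, and this is not the GCTA code.

```python
def sample_freqs(X):
    """Sample reference-allele proportions, one per SNP, from dosages X[j][l]."""
    n = len(X)
    return [sum(row[l] for row in X) / (2 * n) for l in range(len(X[0]))]


def f_uni_w(X):
    """Weighted (ratio-of-sums) UNI-type estimate for each individual."""
    p = sample_freqs(X)
    denom = sum(2 * q * (1 - q) for q in p)
    return [sum(x * x - (1 + 2 * q) * x + 2 * q * q for x, q in zip(row, p)) / denom
            for row in X]


def f_std_w(X):
    """Weighted (ratio-of-sums) standard, GRM-diagonal-type estimate."""
    p = sample_freqs(X)
    denom = sum(2 * q * (1 - q) for q in p)
    return [sum((x - 2 * q) ** 2 for x, q in zip(row, p)) / denom - 1 for row in X]


X = [[0, 1, 2], [1, 1, 0], [2, 1, 1], [1, 0, 1]]  # toy data: 4 individuals, 3 SNPs
print(f_uni_w(X))
print(f_std_w(X))
```

The individual values differ between the two estimators, but their sample averages coincide: the per-locus terms that distinguish them are proportional to X_{jl} − 2p̃_{l}, which sums to zero over the individuals used to compute p̃_{l}.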
Allele sharing
In a genetic sampling framework, and with the ibd viewpoint, we consider within-individual allele-sharing proportions A_{jl} for SNP l in individual j (we wrote M rather than A in WG17 and in Goudet et al., 2018). These equal one for homozygotes and zero for heterozygotes, and sample values can be expressed in terms of allele dosages, \({\tilde{A}}_{jl}={({X}_{jl}-1)}^{2}\). We also consider between-individual sharing proportions \({A}_{jj^{\prime} l}\) for SNP l and individuals j and \(j^{\prime}\). These are equal to one for both individuals being the same homozygote, zero for different homozygotes, and 0.5 otherwise. Observed values can be written as \({\tilde{A}}_{jj^{\prime} l}=[1+({X}_{jl}-1)({X}_{j^{\prime} l}-1)]/2\), with an average over all pairs of distinct individuals in a sample of \({\tilde{A}}_{Sl}\). Astle & Balding (2009) introduced \({\tilde{A}}_{jj^{\prime} l}\) as a measure of identity in state of alleles chosen randomly from individuals j and \(j^{\prime}\), and Ochoa & Storey (2021) used a simple transformation of this quantity. The allele sharing for an individual with itself is A_{jjl} = (1 + A_{jl})/2.
The same logic that led to Eq. (5) provides total expectations for allele-sharing proportions for all \(j,j^{\prime}\):
\[\mathcal{E}_{T}({\tilde{A}}_{jj^{\prime} l})=1-2{\pi }_{l}(1-{\pi }_{l})(1-{\theta }_{jj^{\prime} }) \tag{11}\]
Note that θ_{jj} = (1 + F_{j})/2. The nuisance parameter 2π_{l}(1 − π_{l}) cancels out of the ratio \(\mathcal{E}_{T}({\tilde{A}}_{jj^{\prime} l}-{\tilde{A}}_{Sl})/\mathcal{E}_{T}(1-{\tilde{A}}_{Sl})\) and this motivates definitions of allele-sharing estimators of the inbreeding coefficient for individual j and the kinship coefficient for individuals \(j,j^{\prime}\) as
\[\hat{f}_{\rm{AS}_{j}}=\frac{\sum_{l=1}^{L}({\tilde{A}}_{jl}-{\tilde{A}}_{Sl})}{\sum_{l=1}^{L}(1-{\tilde{A}}_{Sl})},\qquad \hat{\psi }_{\rm{AS}_{jj^{\prime} }}=\frac{\sum_{l=1}^{L}({\tilde{A}}_{jj^{\prime} l}-{\tilde{A}}_{Sl})}{\sum_{l=1}^{L}(1-{\tilde{A}}_{Sl})} \tag{12}\]
For a large number of SNPs, these are unbiased for f_{j} and \({\psi }_{jj^{\prime} }\) for all sample sizes. We showed in WG17 there is no need to filter on minor allele frequency to preserve the lack of bias. Note that \({\hat{f}}_{{{{{{{\rm{AS}}}}}}}_{j}}\) is a linear function of the form \({a}_{S}+{b}_{S}{\tilde{A}}_{j}\) with \({\tilde{A}}_{j}\) being the total homozygosity for j and constants a_{S}, b_{S} being the same for all individuals j. Changing the scope of the study, from population to world for example, preserves linearity (with different values of a_{S}, b_{S}). The changed estimates are linear functions of the old estimates: old and new estimates are completely correlated and are rank invariant over all samples that include particular individuals, i.e., over all reference populations. Unlike the case for \({\hat{f}}_{{{{{{\rm{UNI}}}}}}}\) or \({\hat{f}}_{{{{{{\rm{STD}}}}}}}\), rank invariance is guaranteed for \({\hat{f}}_{{{{{{{\rm{AS}}}}}}}_{j}}\) for any two individuals even if only one more individual is added to the study.
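The rank invariance described here can be checked directly on toy data. The sketch below computes the allele-sharing inbreeding estimate as the sum over SNPs of (Ã_{jl} − Ã_{Sl}) divided by the sum of (1 − Ã_{Sl}), using the dosage expressions for within- and between-individual sharing. The function name `f_as` and the dosage matrices are hypothetical.

```python
def f_as(X):
    """Allele-sharing inbreeding estimate for each row of a dosage matrix X.

    X[j][l] is the reference-allele dosage (0, 1 or 2) of individual j at SNP l.
    """
    n, L = len(X), len(X[0])
    # Within-individual sharing summed over SNPs: (X - 1)^2 is 1 for homozygotes.
    within = [sum((x - 1) ** 2 for x in row) for row in X]
    # Between-individual sharing summed over SNPs, averaged over distinct pairs.
    between = 0.0
    for j in range(n):
        for k in range(j + 1, n):
            between += sum((1 + (xj - 1) * (xk - 1)) / 2 for xj, xk in zip(X[j], X[k]))
    a_s = between / (n * (n - 1) / 2)
    return [(a - a_s) / (L - a_s) for a in within]


study = [[1, 1, 1, 1, 0], [1, 1, 0, 2, 1], [2, 0, 1, 1, 2], [0, 2, 2, 1, 0]]
wider = study + [[2, 2, 2, 2, 2], [0, 0, 1, 0, 0]]  # same four plus two more individuals

print(f_as(study))
print(f_as(wider)[:4])  # different values, same ordering of the first four
```

Enlarging the sample changes only the shared constant Ã_{S}, so each individual's estimate is an increasing linear function of its own homozygosity in both analyses and the ordering of the original four individuals is unchanged.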
For large sample sizes, \((1-{\tilde{A}}_{Sl})\approx 2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})\). Under that approximation, \(\hat{f}_{\rm{AS}_{j}}\) is the same as \(\hat{f}_{\rm{HOM}_{j}}\) but the approximation is not necessary in computer-based analyses. Averaging the large-sample estimates \(\hat{\psi }_{\rm{AS}_{jj^{\prime} }}\) over individuals \(j^{\prime}\) not equal to j gives an estimator for the average individual kinship ψ_{j}:
\[\hat{\psi }_{\rm{AS}_{j}}=\frac{1}{n-1}\sum_{j^{\prime} \ne j}\hat{\psi }_{\rm{AS}_{jj^{\prime} }}\approx \frac{\sum_{l=1}^{L}(2{\tilde{p}}_{l}-1)({X}_{jl}-2{\tilde{p}}_{l})}{\sum_{l=1}^{L}4{\tilde{p}}_{l}(1-{\tilde{p}}_{l})} \tag{13}\]
Adding \(2\hat{\psi }_{\rm{AS}_{j}}\) to \(\hat{f}_{\rm{UNI}_{j}}^{w}\) gives \(\hat{f}_{\rm{AS}_{j}}\), as expected, as does adding \(4\hat{\psi }_{\rm{AS}_{j}}\) to \(\hat{f}_{\rm{STD}_{j}}^{w}\). Similarly, \(\hat{\psi }_{\rm{AS}_{jj^{\prime} }}\) is obtained by adding \(\hat{\psi }_{\rm{AS}_{j}}\) and \(\hat{\psi }_{\rm{AS}_{j^{\prime} }}\) to \(\hat{\psi }_{\rm{STD}_{jj^{\prime} }}\), where (Yang et al., 2011)
\[\hat{\psi }_{\rm{STD}_{jj^{\prime} }}=\frac{\sum_{l=1}^{L}({X}_{jl}-2{\tilde{p}}_{l})({X}_{j^{\prime} l}-2{\tilde{p}}_{l})}{\sum_{l=1}^{L}4{\tilde{p}}_{l}(1-{\tilde{p}}_{l})} \tag{14}\]
These are the elements of the first method for constructing the GRM given by VanRaden (2008).
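The first of these identities can be checked locus by locus. As a sketch, assume that the per-locus numerator of \(\hat{f}_{\rm{UNI}_{j}}^{w}\) is \({X}_{jl}^{2}-(1+2{\tilde{p}}_{l}){X}_{jl}+2{\tilde{p}}_{l}^{2}\), that the per-locus numerator of \(2\hat{\psi }_{\rm{AS}_{j}}\) on the same denominator \(2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})\) is \((2{\tilde{p}}_{l}-1)({X}_{jl}-2{\tilde{p}}_{l})\), and that the large-sample approximation \(1-{\tilde{A}}_{Sl}\approx 2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})\) holds:

```latex
\begin{aligned}
& X_{jl}^{2}-(1+2\tilde{p}_{l})X_{jl}+2\tilde{p}_{l}^{2}
  +(2\tilde{p}_{l}-1)(X_{jl}-2\tilde{p}_{l}) \\
&\quad = X_{jl}^{2}-2X_{jl}+2\tilde{p}_{l}-2\tilde{p}_{l}^{2} \\
&\quad = (X_{jl}-1)^{2}-\left[1-2\tilde{p}_{l}(1-\tilde{p}_{l})\right] \\
&\quad \approx \tilde{A}_{jl}-\tilde{A}_{Sl}
\end{aligned}
```

Summing over SNPs and dividing by \(\sum_{l}2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})\) then gives the allele-sharing estimate. The same steps applied to \({({X}_{jl}-2{\tilde{p}}_{l})}^{2}-2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})+2(2{\tilde{p}}_{l}-1)({X}_{jl}-2{\tilde{p}}_{l})\) give the corresponding result for the standard estimator.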
When inbreeding and coancestry coefficients are defined as ibd probabilities they are non-negative, but the within-population values f and ψ will be negative for individuals, or pairs of individuals, having smaller ibd allele probabilities than do pairs of individuals in the sample, on average. Individual-specific values of f always have the same ranking as the individual-specific F values, and they are estimable. Negative estimates can be avoided by the transformation to \((\hat{f}_{\rm{AS}_{j}}-\hat{f}_{\rm{AS}_{j}}^{\min })/(1-\hat{f}_{\rm{AS}_{j}}^{\min })\) where \(\hat{f}_{\rm{AS}_{j}}^{\min }\) is the smallest value over individuals of the \(\hat{f}_{\rm{AS}_{j}}\)’s. We don’t see the need for this transformation, and we noted above the recognition of the utility of negative values. Ochoa & Storey (2021) wished to estimate F_{j} rather than f_{j} and, to overcome the lack of information about the ancestral population serving as a reference point for ibd, they assumed the least related pair of individuals in a sample have a coancestry of zero. We showed in WG17 that this brings estimates in line with path-counting predicted values when founders are assumed to be not inbred and unrelated, but we prefer to avoid the assumption. We stress that, absent external information or assumptions, F is not estimable. Instead, linear functions of F that describe ibd of target pairs of alleles relative to ibd in a specified set of alleles are estimable and have utility in empirical studies.
Runs of homozygosity
Each of the inbreeding estimators considered so far has been constructed for individual SNPs and then combined over SNPs. Observed values of allelic state are used to make inferences about the unobserved state of identity by descent. Estimators based on ROH, however, suppose that ibd for a region of the genome can be observed. Although F is the probability an individual has ibd alleles at any single SNP, in fact ibd occurs in blocks within which there has been no recombination in the paths of descent from common ancestor to the individual’s parents. Whereas a single SNP can be homozygous without the two alleles being ibd, if many adjacent SNPs are homozygous the most likely explanation is that they are in a block of ibd (Gibson et al., 2006). There can be exceptions, from mutation for example, and several publications give strategies for identifying runs of homozygotes for which ibd may be assumed (e.g., Gazal et al., 2014; Joshi et al., 2015). These strategies include adjusting the size of the blocks, the numbers of heterozygotes or missing values allowed per block, the minor allele frequency, and so on. These software parameters affect the size of the estimates (Meyermans et al., 2020). Some methods (e.g., Gazal et al., 2014; Narasimhan et al., 2016) use hidden Markov models where ibd is the hidden status of an observed homozygote. Model-based approaches necessarily have assumptions, such as HWE in the sampled population.
We provide more details elsewhere, but we note here that ROH methods offer a useful alternative to SNP-by-SNP methods even though they cannot completely compensate for lack of information on the ibd reference population. We note also that shorter runs of ibd result from more distant relatedness of an individual’s parents, and ROH procedures can be set to distinguish recent (familial) ibd from distant (evolutionary) ibd. SNP-by-SNP estimators do not make a distinction between these two time scales.
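To make the idea of a block-based estimate concrete, the following is a deliberately simplified sliding-window sketch, not the procedure of any of the cited software. It assumes SNPs are indexed in map order, ignores physical distance and missing data, and reports the proportion of SNPs covered by at least one window containing no more than `max_het` heterozygotes. The function name and all parameter defaults are hypothetical.

```python
def f_roh_windows(dosages, window=100, step=50, max_het=0):
    """Proportion of SNPs lying in at least one (nearly) homozygous window.

    dosages: allele dosages (0, 1 or 2) for one individual, in map order
    window:  number of consecutive SNPs per window
    step:    shift between successive windows (window - step SNPs overlap)
    max_het: heterozygous SNPs tolerated within a window
    """
    L = len(dosages)
    covered = [False] * L
    for start in range(0, L - window + 1, step):
        block = dosages[start:start + window]
        if sum(1 for x in block if x == 1) <= max_het:
            for i in range(start, start + window):
                covered[i] = True
    return sum(covered) / L


# Toy genome: 10 homozygous SNPs, one heterozygote, then 9 homozygous SNPs.
toy = [2] * 10 + [1] + [0] * 9
print(f_roh_windows(toy, window=5, step=1))
```

Changing `window`, `step` or `max_het` changes the estimate for the same genotypes, which is the sensitivity to software parameters noted above.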
Results
Simulation study
We used the quantiNemo software (Neuenschwander et al., 2019) to simulate a five-generation pedigree of hermaphroditic individuals mating randomly, excluding selfing, with each mating producing a number of offspring drawn from a Poisson distribution with mean two. The zeroth generation was made of 50 founders, the first generation had 47 individuals and the second, third, fourth and fifth generations had 58, 56, 57, and 65 individuals respectively. This pedigree was then fed to a custom R script to draw gametes from each parent at each reproductive event, allowing for recombination based on a 20 Morgan recombination map with a genetic marker every 0.1 cM, for a total of 20,000 markers.
Each of the 100 alleles per marker among the 50 founders was given a unique identifier so that alleles in subsequent generations with the same identifier had actual identity by descent relative to the founders. The average actual ibd proportions over loci, within individuals and between each pair of individuals, provided “gold standard” inbreeding and coancestry coefficients, as opposed to the pedigree-based values we calculated by path counting. The gold values for inbreeding coefficients F_{j} and coancestry coefficients \({\theta }_{jj^{\prime} }\) then allow calculation of gold values for f_{j}, ψ_{j} and, therefore, \(f_{\rm{STD}_{j}}\) and \(f_{\rm{UNI}_{j}}\).
Finally, the two unique identifiers for each marker of the 50 founders were mapped to the SNP genotypes of the 50 founders generated with the msprime program (Kelleher et al., 2016) as follows: we assume the founders originated from a population with effective size N_{e} = 10^{4}, mutation rate μ = 10^{−9}, and recombination rate between neighboring base pairs r = 10^{−7}. We assumed 20 chromosomes, each 10 megabases (10^{7} base pairs) long. The necessary arguments are mspms 100 20 -t 400 -r 40000 10000000 -p 9. This generated a dataset of 100 gametes and over 40,000 SNPs, with the first 20,000 used for the mapping of unique identifiers to SNP alleles. This mapping was applied to the genotypes of the non-founder individuals of the pedigree to generate their SNP genotypes.
The pedigree was constructed to provide fairly high levels of predicted coancestry among pairs of the 283 non-founder individuals, ranging from 0 to 0.464, with a mean of θ_{S} = 0.053, assuming the 50 founders were unrelated and not inbred. The pedigree inbreeding coefficients ranged from 0 to 0.367, with a mean of F_{W} = 0.050. The within-population inbreeding coefficient for the set of 283 non-founder individuals is f = (F_{W} − θ_{S})/(1 − θ_{S}) = −0.003. Note, however, that the 50 individuals regarded as founders for the subsequent 283 had their own joint histories from the msprime simulation. These 50 had an average within-individual allele sharing of \({\tilde{A}}_{W}=0.80385\) and an average between-individual allele sharing of \({\tilde{A}}_{S}=0.80355\). The difference of these two proportions, which would be zero for a reference set of non-inbred and unrelated individuals, divided by \(1-{\tilde{A}}_{S}\), provides a within-founder allele-sharing inbreeding coefficient \(\hat{f}_{\rm{W}}\) of 0.0015.
The various estimators of inbreeding examined with these data are shown in Table 2, and the correlation coefficients for each pair of estimates over the whole set of 283 non-founder individuals are shown in Table 3. There are very high correlations between pedigree and gold-standard values and also very high correlations between \(\hat{f}_{\rm{HOM}}\) and \(\hat{f}_{\rm{AS}}\) values, both as expected. There are lower correlations of \(\hat{f}_{\rm{UNI}}\) and \(\hat{f}_{\rm{STD}}\) with pedigree-based or gold-standard inbreeding coefficients since those estimates reflect both f and ψ.
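The two relative coefficients quoted in this paragraph follow from the same ratio, as the short arithmetic check below shows. The function name `relative_coefficient` is hypothetical.

```python
def relative_coefficient(within, between):
    """(within - between) / (1 - between): inbreeding relative to average coancestry or sharing."""
    return (within - between) / (1 - between)


# Pedigree values for the 283 non-founders: F_W = 0.050, theta_S = 0.053.
print(round(relative_coefficient(0.050, 0.053), 3))

# Allele sharing among the 50 founders: within 0.80385, between 0.80355.
print(round(relative_coefficient(0.80385, 0.80355), 4))
```

The first call reproduces f = −0.003 and the second the within-founder value of 0.0015.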
We see in Table 3 that \({\hat{F}}_{{{{{{\rm{ROH}}}}}}}\) values are the most highly correlated with F_{Gold}: this high correlation was obtained by adjusting the block size (100 SNPs) and the block overlap amount (50 SNPs) to bring estimates close to the known F_{Gold} values. In practice the F_{Gold} values are not known and the other estimators are all evaluated without external information. The high correlation of \({\hat{f}}_{{{{{{\rm{AS}}}}}}}\) and maximum likelihood values suggests that \({\hat{f}}_{{{{{{\rm{MLE}}}}}}}\) is estimating f rather than F because it uses the sample allele frequencies in place of the unknown allele probabilities. The weighted and unweighted versions of \({\hat{f}}_{{{{{{\rm{UNI}}}}}}}\) are highly correlated with each other and with their gold values, but not with f_{Gold}. There are generally low correlations for weighted and unweighted \({\hat{f}}_{{{{{{\rm{STD}}}}}}}\) values.
Figure 1 (left) illustrates the linear relationship between \(f_{\rm{Ped}_{j}}\) and \(F_{\rm{Ped}_{j}}\): \(f_{\rm{Ped}_{j}}=(F_{\rm{Ped}_{j}}-{\theta }_{\rm{Ped}_{S}})/(1-{\theta }_{\rm{Ped}_{S}})\) where \({\theta }_{\rm{Ped}_{S}}=0.053\) is the average coancestry of pairs of non-founders, calculated from the pedigree. The \(F_{\rm{Gold}_{j}}\) and \(f_{\rm{Gold}_{j}}\) values are also correlated with the corresponding pedigree values, as is shown for \(f_{\rm{Gold}_{j}}\) in Fig. 1 (center). The variation we see in Fig. 1 (center) for \(f_{\rm{Gold}_{j}}\) around \(F_{\rm{Ped}_{j}}\) reflects the variation of actual inbreeding about expected values, even for whole genomes, pointed out by Hill & Weir (2011). Wang (2016) showed that the number of SNPs also has an effect. The lack of relationship between pedigree-based values of individual average coancestry ψ_{j} and individual inbreeding f_{j}, leading to variable rankings for some estimators based on sample allele frequencies, is shown in Fig. 1 (right).
Figure 2 (left) illustrates the similarity of \(\hat{F}_{\rm{ROH}}\) and F_{Gold} and Fig. 2 (center) shows general agreement between \(\hat{F}_{\rm{ROH}}\) and \(\hat{f}_{\rm{AS}}\), bearing in mind that \(\hat{f}_{\rm{AS}}\) estimates (F − θ_{S})/(1 − θ_{S}). Figure 2 (right) shows general agreement of the allele-sharing estimators \(\hat{f}_{\rm{AS}_{j}}\) with the gold-standard within-population inbreeding coefficients \(f_{\rm{Gold}_{j}}\). Figure 3 shows \(\hat{f}_{\rm{UNI}_{j}}\) to be a better estimator of \(f_{\rm{Gold}_{j}}\) than is \(\hat{f}_{\rm{STD}_{j}}\), as noted by Yang et al. (2011), and better performance for the weighted than unweighted averages over SNPs but still not as good as \(\hat{f}_{\rm{AS}_{j}}\).
1000 Genomes data
We used 77 million SNPs from the 22 autosomes for the 26 populations of the 1000 Genomes whole-genome data to estimate inbreeding coefficients for all 2504 individuals in the project. Our focus was on the algebraic invariance of estimate rankings as the reference set of individuals changed from the population from which each individual was sampled, to the continental group for that population, to the whole world. We calculated the estimates \({\hat{f}}_{{\rm{AS}}_{j}}\) and \({\hat{f}}_{{\rm{UNI}}_{j}}^{u}\) for each individual and each reference set, and ranked estimates within each population. The two sets of estimates for all individuals are shown separately in Fig. 4. Figures S1 and S2 show \({\hat{f}}_{{\rm{UNI}}_{j}}^{u}\) vs \({\hat{f}}_{{\rm{AS}}_{j}}\) for estimates and ranks, respectively.
Figure 4 shows that within-population inbreeding coefficients \({\hat{f}}_{{\rm{AS}}}\) for all 1000 Genomes populations outside the AMR group are essentially the same, and generally close to zero, when they are estimated relative to average coancestry within each population or continental group, but change when the complete set of 26 populations is used as a reference. These latter values compare the allele sharing for each individual to the same reference value, the average sharing over all pairs of individuals in the whole dataset. The world reference gives markedly lower \({\hat{f}}_{{\rm{AS}}}\) values for the African populations (AFR), reflecting their higher levels of genetic diversity. The rankings for \({\hat{f}}_{{\rm{AS}}}\) within a population, by construction, do not change with reference set. High \({\hat{f}}_{{\rm{AS}}}\) values reflect admixture, consanguineous matings and high evolutionary coancestry. In contrast, the \({\hat{f}}_{{\rm{UNI}}}\) values are higher for African individuals than for any other individuals when the allele frequencies are from all 26 populations: this reflects an African-specific pattern of negative average individual kinships ψ, shown in the bottom row of Fig. 5.
The critical role that average kinship plays in inbreeding estimation is illustrated in Fig. 5. With each reference set, the allele-sharing inbreeding estimates \({\hat{f}}_{{\rm{AS}}}\) are clustered for European (EUR) individuals, a little more diverse for East Asian (EAS) individuals, much more diverse for South Asian (SAS) and African (AFR) individuals, and extremely diverse for American (AMR) individuals. These values are consistent with those reported for the numbers of variant sites per genome (The 1000 Genomes Project Consortium, 2015). The variation among African and American average kinships \({\hat{\psi }}_{{\rm{AS}}}\) is substantial: as these quantities determine how the expected values of \({\hat{f}}_{{\rm{UNI}}}\) and \({\hat{f}}_{{\rm{STD}}}\) differ from the f target parameters, it is clear that these estimates cannot be used to rank individuals by their inbreeding levels.
For the African population ASW, individual NA20294 has \({\hat{f}}_{{\rm{AS}}}\) values of −0.009, 0.001 and −0.130 using ASW, AFR or World as a reference set, and each estimate is ranked as number 16 among the 61 ASW estimates. The same individual has \({\hat{f}}_{{\rm{UNI}}}^{u}\) values of −0.007 (rank 36), 0.001 (rank 16) and 0.028 (rank 60) using ASW, AFR or World allele frequencies. Estimator \({\hat{f}}_{{\rm{UNI}}}^{u}\) indicates NA20294 to be among the least inbred of the ASW individuals when AFR sample allele frequencies are used, but among the most inbred when worldwide sample allele frequencies are used, even though the individual’s own genotype is the same for each analysis. Other examples of rankings changing with reference population for \({\hat{f}}_{{\rm{UNI}}}\) are shown in Fig. S3; for the admixed ACB and ASW populations, for example, the individuals appearing the most inbred with continental reference appear the least inbred with world reference and vice versa. This can have implications for studies of inbreeding depression, where trait values are regressed on estimated inbreeding coefficients.
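The invariance of \({\hat{f}}_{{\rm{AS}}}\) rankings can be sketched with a toy example. The Python code below assumes the allele-sharing form (Ã_j − Ã_S)/(1 − Ã_S), where Ã_j is the observed homozygosity of individual j and Ã_S is the average allele sharing over pairs of distinct individuals in the reference set; this form is our reading of the estimator and is not restated in this section. All function names, individual labels and genotypes are hypothetical.

```python
from itertools import combinations


def homozygosity(dosages):
    """Proportion of SNPs at which an individual is homozygous (dosage 0 or 2)."""
    return sum(1 for x in dosages if x != 1) / len(dosages)


def pair_sharing(dos_a, dos_b):
    """Mean allele sharing for a pair: 1 for matching homozygotes,
    0 for opposite homozygotes, 0.5 when either individual is heterozygous."""
    per_snp = [(1 + (a - 1) * (b - 1)) / 2 for a, b in zip(dos_a, dos_b)]
    return sum(per_snp) / len(per_snp)


def f_as(genotypes, reference_ids):
    """Allele-sharing estimates for all individuals, relative to the
    average sharing among pairs of individuals in reference_ids."""
    pairs = list(combinations(reference_ids, 2))
    a_s = sum(pair_sharing(genotypes[i], genotypes[j]) for i, j in pairs) / len(pairs)
    return {k: (homozygosity(g) - a_s) / (1 - a_s) for k, g in genotypes.items()}


def ranks(estimates, ids):
    """Rank (1 = smallest estimate) of each individual in ids."""
    ordered = sorted(ids, key=lambda k: estimates[k])
    return {k: r + 1 for r, k in enumerate(ordered)}


# Hypothetical allele dosages (0, 1 or 2) at eight SNPs for two toy populations
genotypes = {
    "a1": [0, 2, 1, 1, 2, 0, 1, 1],
    "a2": [2, 2, 0, 1, 2, 0, 2, 1],
    "a3": [1, 1, 1, 2, 1, 0, 1, 1],
    "b1": [0, 0, 2, 2, 0, 2, 1, 0],
    "b2": [1, 0, 2, 1, 0, 2, 0, 1],
    "b3": [0, 1, 2, 2, 1, 2, 0, 0],
}
population = ["a1", "a2", "a3"]   # reference set: own population
world = list(genotypes)           # reference set: everyone

f_pop = f_as(genotypes, population)
f_world = f_as(genotypes, world)
ranks_pop = ranks(f_pop, population)
ranks_world = ranks(f_world, population)

print(f_pop["a1"], f_world["a1"])   # values differ between reference sets
print(ranks_pop == ranks_world)     # ranks within the population agree
```

In this sketch the reference set enters only through Ã_S, which is common to every individual, so each estimate is an increasing function of the individual's own homozygosity. Changing the reference set shifts and rescales the estimates without reordering them.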
A comparison of runs-of-homozygosity estimates \({\hat{F}}_{{\rm{ROH}}_{j}}\) with SNP-by-SNP estimates is shown in Fig. 6. The ROH estimates were produced with the --homozyg --homozyg-snp 2 --homozyg-kb 100 options in PLINK (Meyermans et al., 2020). The values of \({\hat{F}}_{{\rm{ROH}}_{j}}\) depend on the PLINK settings for minor allele frequency pruning and linkage disequilibrium pruning, as well as on SNP density, so their expected values may differ from the true F_{j} values. The left panel shows \({\hat{f}}_{{\rm{AS}}_{j}}\) values and these have a correlation of 0.998 with \({\hat{F}}_{{\rm{ROH}}_{j}}\). The right panel shows \({\hat{f}}_{{\rm{UNI}}_{j}}^{u}\) estimates and these have a correlation of −0.337 with \({\hat{F}}_{{\rm{ROH}}_{j}}\) estimates.
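The dependence of ROH-based estimates on software settings can be illustrated with a minimal sketch. It assumes the common definition of \({\hat{F}}_{{\rm{ROH}}}\) as the fraction of the genome covered by detected runs of homozygosity; the function name, segment lengths and genome length are hypothetical, and the default threshold of 100 kb simply echoes the minimum length setting quoted above.

```python
def f_roh(segment_lengths_kb, genome_length_kb, min_kb=100.0):
    """Fraction of the genome covered by ROH segments of at least min_kb."""
    kept = [length for length in segment_lengths_kb if length >= min_kb]
    return sum(kept) / genome_length_kb


# Hypothetical ROH segment lengths (kb) for one individual
segments = [50.0, 150.0, 850.0]
genome_kb = 10_000.0  # hypothetical genome length, for illustration only

print(f_roh(segments, genome_kb))               # 100 kb threshold
print(f_roh(segments, genome_kb, min_kb=500))   # stricter threshold
```

With the same segments, raising the minimum length discards the 150 kb run and lowers the estimate, which is why the expected value of such estimators depends on the settings used.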
Gazal et al. (2015) reported inbreeding estimates \({\hat{F}}_{{\rm{Fsuite}}_{j}}\) from ROH, although their method requires sample allele frequencies and so may have estimates of F confounded by individual-specific average kinships. They also assumed Hardy–Weinberg equilibrium. However, there is good agreement of \({\hat{f}}_{{\rm{AS}}_{j}}\) values with \({\hat{F}}_{{\rm{Fsuite}}_{j}}\) values (Fig. S4). The agreement between \({\hat{F}}_{{\rm{Fsuite}}_{j}}\) and \({\hat{f}}_{{\rm{UNI}}_{j}}^{u}\), also shown in Fig. S4, is weaker.
Discussion
Discussions on the estimation of individual inbreeding coefficients generally refer to F, the probability an individual has pairs of homologous alleles that are identical by descent. Among the estimators we have considered here, \({\hat{F}}_{{\rm{ROH}}}\) addresses F by assuming that long runs of homozygous SNPs represent ibd regions. The ROH estimates, however, are conditional on the settings used to calculate the estimates, and actual ibd in short runs of homozygotes may be ignored, so the expected values of these estimators are not known. The Bayesian approach of Vogl et al. (2002) also addresses F but at the computational cost of estimating allele proportions in a reference population assumed to have zero inbreeding or relatedness. All the other estimators considered here are, instead, addressing the within-population inbreeding coefficient f that compares F values to ibd probabilities for pairs of individuals. There is no need to specify the reference population implicit in the definition of identity by descent. There is also no need to assume the particular individuals in a sample have an inbreeding coefficient of zero. For large numbers of SNPs, allele-sharing estimators \({\hat{f}}_{{\rm{AS}}}\) are unbiased for f for all sample sizes and have values for a set of individuals that have invariant ranks over studies that include that set. We show that most estimators using sample allele frequencies are estimating some combination of f and of individual-specific average kinships ψ with individuals in the study. Estimators with expectations depending on ψ do not have invariant rankings, as we showed with data from the 1000 Genomes project as the study scope varied from the population to the continent to the world.
Our ibd-based model rests on expectations of allele-sharing proportions satisfying expressions such as Eq. (5). There is no requirement for non-overlapping generations, or homogeneous populations, for example. This generality is a consequence of not needing allele frequencies, whether these refer to a population or to an individual.
The role of ibd probabilities in theoretical population and quantitative genetic contexts is well known, but we suggest it is rank-invariant estimators for the within-population parameters f_{j} that are of relevance for empirical studies, and we offer the examples in the following sections.
Genotype probabilities
There is often a need to estimate genotype probabilities from observed allele proportions using formulations with allele probabilities and ibd probabilities F (e.g., National Research Council (1996) for forensic science). Following Eq. (6) we see that it is \(2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})(1-{f}_{j})\) rather than \(2{\tilde{p}}_{l}(1-{\tilde{p}}_{l})(1-{F}_{j})\) that is unbiased for 2π_{l}(1 − π_{l})(1 − F_{j}) if F_{j} and f_{j} are known.
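The reasoning behind this statement can be sketched in two lines. The derivation below uses the relationship f_{j} = (F_{j} − θ_{S})/(1 − θ_{S}) stated earlier, and it assumes that, for a large sample, Eq. (6) gives the expectation of the sample heterozygosity term as reduced by the factor (1 − θ_{S}); that reading of Eq. (6) is an assumption of this sketch.

```latex
% Assumption: for a large sample, Eq. (6) gives
%   E[2 \tilde{p}_l (1 - \tilde{p}_l)] \approx 2 \pi_l (1 - \pi_l)(1 - \theta_S).
\begin{align*}
1 - f_j &= 1 - \frac{F_j - \theta_S}{1 - \theta_S}
         = \frac{1 - F_j}{1 - \theta_S}, \\
E\left[ 2\tilde{p}_l (1 - \tilde{p}_l) \right] (1 - f_j)
  &\approx 2\pi_l (1 - \pi_l)(1 - \theta_S) \cdot \frac{1 - F_j}{1 - \theta_S}
   = 2\pi_l (1 - \pi_l)(1 - F_j).
\end{align*}
```

Under the same assumption, multiplying by (1 − F_{j}) instead would leave an extra factor (1 − θ_{S}) and so understate the heterozygote probability whenever θ_{S} is positive.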
Inbreeding depression
Inbreeding is known to affect, linearly, the expected value of quantitative traits, and studies of inbreeding depression often proceed by regressing trait means on inbreeding levels. In Yengo et al. (2017), we used \({\hat{F}}_{{\rm{ROH}}}\), \({\hat{f}}_{{\rm{HOM}}}\) and \({\hat{f}}_{{\rm{UNI}}}\) as inbreeding estimates, and Kardos et al. (2018) pointed out that we did not discuss the distinction between F and f. We responded (Yengo et al., 2018) with reasons for not wishing to use \({\hat{F}}_{{\rm{ROH}}}\). We could also have pointed out that the linear relationship between f_{j} and F_{j}, together with the high correlation we showed above between \({\hat{f}}_{{\rm{AS}}_{j}}\) and \({\hat{F}}_{{\rm{ROH}}_{j}}\), means that regressing on either \({\hat{F}}_{{\rm{ROH}}}\) or \({\hat{f}}_{{\rm{AS}}}\) should lead to similar results. In less-homogeneous populations than represented by the UK Biobank data (Allen et al., 2012) we used in Yengo et al. (2017), it would appear to be better to use \({\hat{f}}_{{\rm{AS}}_{j}}\) than \({\hat{f}}_{{\rm{UNI}}_{j}}\) to avoid any effects of individual-specific average kinships on inbreeding estimates. The correlation of trait and \({\hat{f}}_{{\rm{AS}}_{j}}\) values is invariant over reference populations. Alemu et al. (2021) pointed out that \({\hat{f}}_{{\rm{HOM}}}\) (and \({\hat{f}}_{{\rm{AS}}}\)) gives equal weight to all SNPs, whereas \({\hat{f}}_{{\rm{UNI}}}^{u}\) gives greater weight to SNPs with rare alleles. Alemu et al. did not consider the role of individual average kinships in the bias of \({\hat{f}}_{{\rm{UNI}}}\).
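The invariance of the trait correlation can be checked with a small sketch. It assumes that allele-sharing estimates under two reference sets take the form (h − a)/(1 − a) with the same individual homozygosities h and different reference values a, so that one set of estimates is an increasing linear function of the other. The homozygosities, trait values and reference values below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical individual homozygosities and trait values
h = [0.50, 0.75, 0.25, 0.60, 0.40]
trait = [10.2, 8.9, 11.0, 9.8, 10.1]

# Hypothetical average allele sharing in two reference sets
a_pop, a_world = 0.58, 0.48

f_pop = [(x - a_pop) / (1 - a_pop) for x in h]
f_world = [(x - a_world) / (1 - a_world) for x in h]

r_pop, r_world = pearson(f_pop, trait), pearson(f_world, trait)
b_pop, b_world = slope(f_pop, trait), slope(f_world, trait)

print(r_pop, r_world)     # equal up to rounding
print(b_pop / b_world)    # ratio (1 - a_pop) / (1 - a_world)
```

In this sketch the correlation is unchanged, while the regression slope is rescaled by the ratio of the (1 − a) terms, so slopes estimated with different reference sets are comparable only after that rescaling.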
Genetic relatedness matrix
Inbreeding is also known to affect, linearly, the additive component of genetic variance. For additive traits, the genetic variance for individual j is \((1+{F}_{j}){\sigma }_{A}^{2}\) where \({\sigma }_{A}^{2}\) is the additive variance for populations in Hardy–Weinberg equilibrium. Consequently, the expected value of the sample variance \({\tilde{V}}_{T}\) of trait values over a sample of n individuals is (Speed et al., 2012)
Here the trait is additive and the errors, with variance \({\sigma }_{e}^{2}\), are independent of genetic effects. The GRM G has trace \({\rm{tr}}({\boldsymbol{G}})\) and sum of off-diagonal elements Σ_{G}. If the GRM elements are (1 + F_{j}) on the diagonal and \(2{\theta }_{jj^{\prime} }\) off the diagonal then the trace is n(1 + F_{W}) and the sum of off-diagonal elements is n(n − 1)θ_{S} so the genetic component of V_{T} is \((1+{F}_{W}-2{\theta }_{S}){\sigma }_{A}^{2}\). If the GRM is replaced by a matrix with allele-sharing inbreeding and kinship estimates, this becomes \((1+{f}_{W}){\sigma }_{A}^{2}\), reflecting that it is the within-population estimated GRM that is used in practice. We show elsewhere that the same expected variance holds with GRMs constructed with \({\hat{f}}_{{\rm{STD}}}\) or \({\hat{f}}_{{\rm{UNI}}}\).
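The genetic component of the expected sample variance can be checked numerically. The sketch below uses the standard identity for the expected sample variance of observations with a common mean and covariance matrix V, namely tr(V)/n minus the sum of off-diagonal elements divided by n(n − 1). As a convention chosen for this sketch, the off-diagonal sum runs over ordered pairs, so each pair of individuals is counted twice. The function names and the F and θ values are hypothetical.

```python
def build_grm(F, theta):
    """GRM with 1 + F_j on the diagonal and 2 * theta_jk off the diagonal."""
    n = len(F)
    return [
        [1 + F[j] if j == k else 2 * theta[j][k] for k in range(n)]
        for j in range(n)
    ]


def genetic_variance_coefficient(G):
    """Multiplier of the additive variance in the expected sample variance:
    tr(G)/n minus (sum of off-diagonal elements, ordered pairs)/(n(n-1))."""
    n = len(G)
    trace = sum(G[j][j] for j in range(n))
    off_diagonal = sum(G[j][k] for j in range(n) for k in range(n) if j != k)
    return trace / n - off_diagonal / (n * (n - 1))


# Hypothetical inbreeding coefficients and symmetric coancestries
F = [0.0, 0.1, 0.2]
theta = [
    [0.00, 0.05, 0.02],
    [0.05, 0.00, 0.08],
    [0.02, 0.08, 0.00],
]

G = build_grm(F, theta)
coefficient = genetic_variance_coefficient(G)

F_W = sum(F) / len(F)                    # average inbreeding
theta_S = (0.05 + 0.02 + 0.08) / 3       # average coancestry over pairs

print(coefficient, 1 + F_W - 2 * theta_S)
```

For these values the two printed quantities agree, matching the form (1 + F_{W} − 2θ_{S}) given in the text for the multiplier of \({\sigma }_{A}^{2}\).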
In summary, we have shown that inbreeding measures of utility in empirical studies are “within-population” with the choice of population being at the discretion of the investigator. With allele-sharing inbreeding estimators, the population specifies the set of individuals whose pairwise coancestry is the reference against which inbreeding is measured. For estimators making explicit use of sample allele frequencies, it is the population that furnishes those frequencies, although then inbreeding estimates are confounded by individual-specific average kinships. We showed algebraically and empirically that allele-sharing estimators have invariant rankings across choice of population.
Software
Estimation of inbreeding coefficients can be performed with the following software.
\({\hat{F}}_{{\rm{HOM}}}\): PLINK.
\({\hat{F}}_{{{{{{\rm{Uni}}}}}}}\): PLINK2, GCTA.
\({\hat{F}}_{{{{{{\rm{Std}}}}}}}\): PLINK1, GCTA.
\({\hat{F}}_{{{{{{\rm{ROH}}}}}}}\): PLINK1, BCFtools/ROH, FSuite.
\({\hat{F}}_{{{{{{\rm{AS}}}}}}}\): SNPRelate, hierFstat.
\({\hat{F}}_{{{{{{\rm{MLE}}}}}}}\): SNPRelate.
Software is available at:
BCFtools/ROH: https://samtools.github.io/bcftools/howtos/roh-calling.html
FSuite: http://genestat.cephb.fr/software/index.php/FSuite
GCTA: http://gump.qimr.edu.au/gcta
hierFstat: https://cran.r-project.org/web/packages/hierfstat/index.html
PLINK: http://pngu.mgh.harvard.edu/purcell/plink/
PLINK2: https://www.cog-genomics.org/plink/2.0/
SNPRelate: http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html
Data availability
The simulated data are available in the online supplement. The 1000 Genomes data are available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/.
References
Allen N et al. (2012) UK Biobank: current status and what it means for epidemiology. Health Policy Technol 1:123–126
Alemu AW et al. (2021) An evaluation of inbreeding measures using a whole-genome sequenced cattle pedigree. Heredity 126:410–423
Astle W, Balding DJ (2009) Population structure and cryptic relatedness in genetic association studies. Stat Sci 24:451–471
Ayres KL, Balding DJ (1998) Measuring departures from Hardy-Weinberg: a Markov chain Monte Carlo method for estimating the inbreeding coefficient. Heredity 80:769–777
Ceballos FC, Joshi PK, Clark DW, Ramsay M, Wilson JF (2018) Runs of homozygosity: windows into population history and trait architecture. Nat Rev Genet 19:220–234
Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4:7
DeGiorgio M, Rosenberg NA (2009) An unbiased estimator of gene diversity in samples containing related individuals. Mol Biol Evol 26:501–512
Gazal S, Sahbatou M, Perdry H, Letort S, Génin E, Leutenegger A (2014) Inbreeding coefficient estimation with dense SNP data: comparison of strategies and application to HapMap III. Hum Hered 77:49–62
Gazal S, Sahbatou M, Barbron MC, Génin E, Leutenegger A (2015) High level of inbreeding in final phase of 1000 Genomes Project. Sci Rep 5:17453
Gibson J, Morton NE, Collins A (2006) Extended tracts of homozygosity in outbred human populations. Hum Mol Genet 15:789–795
Goudet J (2005) HIERFSTAT, a package for R to compute and test hierarchical F-statistics. Mol Ecol Notes 5:184–186
Goudet J, Kay T, Weir BS (2018) How to estimate kinship. Mol Ecol 27:4121–4135
Hall N, Mercer L, Phillips D, Shaw J, Anderson AD (2012) Maximum likelihood estimation of individual inbreeding coefficients and null allele frequencies. Genet Res 94:151–161
Hill WG, Weir BS (2011) Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet Res 93:47–74
Hedrick PW (2000) Genetics of Populations, 2nd edn. Jones and Bartlett, Sudbury, MA
Joshi PK et al. (2015) Directional dominance on stature and cognition in diverse populations. Nature 523:459–462
Kardos M, Nietlisbach P, Hedrick PW (2018) How should we compare different genomic estimates of the strength of inbreeding depression. Proc Natl Acad Sci USA 115:E2492–E2493
Kelleher J, Etheridge AM, McVean G (2016) Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comp Biol 12:e1004842
Li CC, Horvitz DG (1953) Some methods of estimating the inbreeding coefficient. Am J Hum Genet 5:107–117
Malécot G (1948) The Mathematics of Heredity. Translated by Yermanos DM (1960). Freeman, San Francisco
McPeek MS, Wu X, Ober C (2004) Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics 60:359–367
Meyermans R, Gorssen W, Buys N, Janssens S (2020) How to study runs of homozygosity using PLINK? A guide for analyzing medium density SNP data in livestock and pet species. BMC Genom 21:94
Narasimhan V, Danecek P, Scally A, Xue Y, Tyler-Smith C, Durbin R (2016) BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics 32:1749–1751
National Research Council (1996) The Evaluation of Forensic DNA Evidence. National Academies Press, Washington DC
Neuenschwander S, Michaud F, Goudet J (2019) quantiNemo 2: a Swiss knife to simulate complex demographic and genetic scenarios, forward and backward in time. Bioinformatics 35:886–888
Ochoa A, Storey JD (2021) Estimating F_{ST} and kinship for arbitrary population structures. PLoS Genet 17:e1009241
Purcell S et al. (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Hum Genet 81:559–575
Ritland K (1996) Estimators for pairwise relatedness and individual inbreeding coefficients. Genet Res 67:175–185
Robertson A, Hill WG (1984) Deviations from Hardy-Weinberg proportions: sampling variances and use in estimation of inbreeding coefficients. Genetics 107:703–718
Speed D, Hemani G, Johnson MR, Balding DJ (2012) Improved heritability estimation from genome-wide SNPs. Am J Hum Genet 91:1011–1021
Steele CD, Syndercombe-Court D, Balding DJ (2014) Worldwide F_{ST} estimates relative to five continental-scale populations. Ann Hum Genet 78:468–477
The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature 526:68–87
VanRaden PM (2008) Efficient methods to compute genomic predictions. J Dairy Sci 91:4414–4423
Vogl C, Karhu A, Moran G, Savolainen O (2002) High resolution analysis of mating systems: inbreeding in natural populations of Pinus radiata. J Evol Biol 15:433–439
Wang J (2016) Pedigrees or markers: which are better in estimating relatedness and inbreeding coefficient. Theoret Pop Biol 107:4–13
Weir BS (1996) Genetic Data Analysis II. Sinauer, Sunderland, MA
Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358–1370
Weir BS, Goudet J (2017) A unified characterization for population structure and relatedness. Genetics 206:2085–2103
Weir BS, Hill WG (2002) Estimating F-statistics. Ann Rev Genet 36:721–750
Wright S (1922) Coefficients of inbreeding and relationship. Am Nat 56:330–338
Yang J, Lee SH, Goddard ME, Visscher PM (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet 88:76–82
Yengo L et al. (2017) Detection and quantification of inbreeding depression for complex traits from SNP data. Proc Natl Acad Sci USA 114:8602–8607
Yengo L et al. (2018) Estimation of inbreeding depression from SNP data REPLY. Proc Natl Acad Sci USA 115:E2494–E2495
Zheng X, Levine D, Shen J, Gogarten S, Laurie C, Weir B (2012) A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28:3326–3328
Acknowledgements
This work was supported in part by Grants 31003A138180 and IZKOZ3157867 from the Swiss National Science Foundation, by Grants GM081062 and GM075091 from the US National Institutes of Health, and by the Fondation Herbette UNIL. We are grateful for the comments of reviewers and editors on earlier versions of this paper.
Author information
Contributions
QSZ, JG, and BSW all contributed to the design of this study, data analysis, and paper preparation.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Associate editor: Olivier Hardy.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, Q.S., Goudet, J. & Weir, B.S. Rank-invariant estimation of inbreeding coefficients. Heredity 128, 1–10 (2022). https://doi.org/10.1038/s41437-021-00471-4