Introduction

The size of ancestral haplotypes around a mutation is inversely correlated with the time, in generations, to the common ancestor. Once the mutation has been introduced into the population, recombination and mutation in subsequent generations break down the ancestral haplotype at nearby markers. The rationale is apparently simple: small haplotypes indicate ancient mutations. However, the apparent simplicity of this rationale conceals real difficulties in the modeling of haplotype-sharing decay for deciphering past events.1, 2 There are two main problems: determining the time to the most recent common ancestor (MRCA) when the genetic locations of the variant and of the markers are known, and estimating the location of a variant.1 Specific methods are required to address these two problems, which represent very different challenges. For example, genetic heterogeneity – a well-known obstacle to fine-scale genetic mapping by linkage disequilibrium (LD) – is not generally a problem when trying to date a specific mutation.3 As LD is used to map complex diseases, large samples are analyzed and efficient algorithms do not necessarily have to be expressed algebraically.4, 5

Calculating the age of a mutation is greatly simplified by assuming that all present-day haplotypes descended from a single common ancestor, with all branches of equal length and evolving independently. Such ‘star-shaped’ or ‘star-like’ genealogies are representative of samples taken from exponentially growing populations and are by far the easiest to analyze, as this model bypasses all the difficulties of translating a set of mosaic haplotypes into a specific genealogy for a genetic variant. Branch length is the only parameter required to describe a star-like genealogy. The assumption that the entire genealogy arose at a single time point in the past leads to a simple formula relating the proportion of identical-by-descent (IBD) alleles and the age of the mutation.6, 7, 8, 9 Star-shaped genealogies underestimate the number of generations from the common ancestor, and correction factors have been proposed to give more realistic figures for actual age.10, 11 As the assumption of a single ancestor is an unsatisfactory approximation, tests of the fit to the single-ancestor model have been proposed.12 The question of the genealogy of a mutation is of particular importance if the mutation concerned is thought to have been subjected to recent positive selection, because unexpectedly long haplotypes for a frequent mutation are thought to be a genetic signature of recent positive selection.13, 14, 15 The resolution of time intervals for possible population expansions requires the demonstration of star genealogies rather than their prior assumption.

We need to describe the genealogy of a mutation appropriately, according to the level of accuracy desired. Backwards inference from present haplotypes to past processes is difficult owing to the large number of possible genealogies and the stochastic nature of haplotype decay. As an illustration of the complexity of the problem, 10 haplotypes would give rise to 104 topologically different trees, and for each shape of tree, nine branch lengths must be evaluated (n−1 nodes representing root and intermediate founders). The number of parameters needed to describe all the possible genealogies of a mutation is generally much larger than the number of independent parameters obtained from the analysis of genetic markers.

We consider here the simplest bifurcating model – a two-ancestor model. The probability density of haplotype block size is shown to be defined in a closed form, facilitating its implementation in a multipoint Bayesian algorithm. In populations with a long tradition of marriages between relatives, secondary founder effects appear as large regions of IBD alleles. We used GUCY2D – the gene most frequently implicated in our series of patients with Leber congenital amaurosis (LCA, MIM 204000) – to illustrate our methods, focusing on two recessive mutations present in Mediterranean populations. This scenario may correspond to an approach midway between oversimplistic and overcomplicated genealogical models.

Methods

Families

Thirty-six unrelated families originating from North Africa and one family from Portugal were screened for mutations in all known LCA genes. All patients fulfilled the minimal criteria for the diagnosis of LCA described elsewhere.16 Twenty-two of the 36 North African families (16 consanguineous) were from Algeria, seven (all consanguineous) were from Tunisia, five (all consanguineous) were from Morocco, and two (both consanguineous) were from Egypt.

Genotyping

DNA samples available from individuals belonging to the seven families harboring the 387delC mutation were subjected to PCR amplification, using primers specific for exon 2 of the GUCY2D gene, as described elsewhere.17 Intronic primers were used to amplify the 8th exon of the GUCY2D gene of all members of the four unrelated families carrying the Phe565Ser mutation for whom DNA samples were available. The purified PCR fragments were directly sequenced, using the Big Dye Terminator Cycle Sequencing Kit. Single nucleotide polymorphism (SNP) analysis was carried out and the segregation of two GUCY2D SNPs (SNP; 531G>T, 12717T>C; Genbank accession no. AJ222657) was studied in families harboring either the 387delC microdeletion (n=7) or the Phe565Ser missense mutation (n=4). PCR amplifications were carried out with the following primers.

ForwardG531T: (5′–3′) CATGGGTTACTCGGGCTTGGAGAAA; reverse531G>T: (5′–3′) GAGAGAAGATGGGGTCGCAAGCCCA and forward12717T>C: (5′–3′): TGCTCCCTGTCCCATCTG; reverseT12717C: (5′–3′): AGACAGTATGCCTTTATTTCAC.

An annealing temperature of 60°C was used for both 531G>T and 12717T>C. The 315 bp (for 531G>T) and 200 bp (for 12717T>C) fragments amplified were directly sequenced.

Genetic markers flanking the LCA1 locus were studied for LD in families carrying one of these two mutations (Hanein et al, 2004).17 The positions of the markers (with physical distances in megabases, shown in parentheses) were estimated from human genome working draft data available from the University of California, Santa Cruz (UCSC). We also studied GUCY2D SNPs, selected based on their position within the gene. The (c.531G>T) SNP was located 229 bp upstream from the 387delC mutation and 11957 bp downstream from the Phe565Ser mutation, whereas the (c.12717T>C) SNP was located 1163 bp upstream from the 387delC mutation and 11023 bp downstream from the Phe565Ser mutation (Table 1).

Table 1 Haplotype analysis of the markers flanking the GUCY2D gene

Statistical analysis

Two-ancestor model problem

Comparing a trio of haplotypes side by side, the sizes of the complete LD region and the partial disequilibrium region can be determined on both sides of the mutation (Figure 1a and b). We first assume that a high density of markers clarifies the question of the intervals, and that allele sharing implies IBD. Identity-by-state (IBS) will later be taken into consideration for the calculation of haplotype density. If the same pair of haplotypes shares the longest track of identical alleles on both sides of the mutation, the LD pattern is said to be concordant (Figure 1c).

Figure 1

Schematic representation of the two-ancestor problem. (a) The genealogy of the mutation is defined by two coalescence times (n1, n2) and the position of the node. The left to right arrangement of haplotypes is taken into account. (b) On each side of the mutation, the parameters studied are the size of the linkage disequilibrium regions, either complete (θ1a and θ1b) or partial (θ2a and θ2b). Dark gray bars: alleles inherited from ancestor n1. Light gray bars: alleles inherited from ancestor n2. One of the several possible combinations is shown. (c) The likelihood that the distal LD boundaries concern the same pair of haplotypes on both sides of the mutation is a function of the ratio n2/n1. Concordance of distal haplotype blocks suggests a low n2/n1 ratio. Left, concordant pattern; right, discordant pattern.

Haplotype core boundary

The existence of a region of identical alleles surrounding the mutations indicates a common origin of present-day mutation-carrying haplotypes. The size of this region in complete LD is represented by the distribution of the most proximal recombination event. The probability that no recombination occurred in an interval θ after m meioses is $(1-\theta)^{m}$, whereas the probability that a single recombination occurred exactly at a distance θ is proportional to $(1-\theta)^{m-1}$. The distribution normalized to unity over the interval [0,1] is

$$f(\theta) = m\,(1-\theta)^{m-1} \qquad (1)$$

The number of meioses m is 2n1+n2 in our two-ancestor model problem. As any given value of m may result from several combinations of n1 and n2, the size of the haplotype core (region in complete LD) cannot resolve the two branch parameters unambiguously. Distal and proximal boundaries must be considered together to obtain insight into the history of the mutation. The power function, $(1-\theta)^{n}$, describing the probability of there being no recombination at a genetic distance θ from a mutation is usually approximated by the exponential $e^{-n\theta}$, with no notable consequences, when θ is small and n is large. As normalization constants, which are required to ensure that the probabilities sum to 1, are easier to calculate for power functions than for exponential functions, the expression of the base case has not been simplified unnecessarily. The relationship between the number of meioses and distance from the mutation is best visualized as a 3-dimensional plot (Figure 2).
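The normalization of the core-size density and the quality of the exponential approximation can be checked numerically. The following is a minimal sketch, assuming the normalized density m(1−θ)^(m−1) implied by the proportionality stated above; the function names and the example branch lengths are illustrative choices and are not taken from the original analysis.

```python
import math

def core_density(theta, m):
    """Assumed normalized density of the core size: m * (1 - theta)**(m - 1)."""
    return m * (1.0 - theta) ** (m - 1)

def core_density_exp(theta, m):
    """Exponential approximation, m * exp(-m * theta), for small theta and large m."""
    return m * math.exp(-m * theta)

def integrate(f, a=0.0, b=1.0, steps=100_000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# Two hypothetical (n1, n2) pairs that give the same number of meioses m = 2*n1 + n2,
# illustrating why the core size alone cannot separate the two branch parameters.
m_a = 2 * 6 + 4   # n1 = 6, n2 = 4
m_b = 2 * 5 + 6   # n1 = 5, n2 = 6

power_total = integrate(lambda t: core_density(t, m_a))
exp_total = integrate(lambda t: core_density_exp(t, m_a))

print(m_a == m_b)
print(round(power_total, 6), round(exp_total, 6))
```

Over [0,1] the power function integrates to exactly 1, whereas the exponential form integrates to 1 − e^(−m), which is why the power form keeps the normalizing constants simple.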

Figure 2

Representation of the relationship between the number of meioses and the size of the region in complete linkage disequilibrium. Each orthogonal point of view represents either the probability of haplotype block edge expressed for the interval (1, 5 cM) or the probability of ancestry expressed for the interval (0, 100 meioses). In a Bayesian framework, the recombination interval is known and the number of generations is calculated.

Distal haplotype boundary

We will first consider the construction of a three-branch tree from a simple two-branch model (Figure 3). If we start from a pair of haplotypes and add a third chromosome, two possibilities must be considered: the third sequence may be smaller or larger than the homologous region of the initial pair. As these probabilities are alternative events, they are complementary. Since two of the three probability density functions are described by equation (1), the last function corresponds to the difference between the other two. When the three branches are of identical length n, the distribution of the resulting tree is

$$f_{2}(\theta) = 6n\left[(1-\theta)^{2n-1} - (1-\theta)^{3n-1}\right] \qquad (2)$$

The first factor (6n) is the normalizing constant making the integral of the density function equal to 1.
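The equal-branch case can be checked by simulation. The sketch below assumes that the distal-boundary density has the form 6n[(1−θ)^(2n−1) − (1−θ)^(3n−1)], that is, the difference between two power functions with the normalizing constant 6n mentioned above, and that the distal edge is the second-closest of three independent first-recombination distances. Both assumptions, together with all function names, are illustrative and not taken from the original analysis.

```python
import random

def distal_density(theta, n):
    """Assumed distal-edge density for three branches of n meioses each."""
    return 6 * n * ((1 - theta) ** (2 * n - 1) - (1 - theta) ** (3 * n - 1))

def distal_mean(n):
    """Mean of the assumed density, obtained by integrating theta * density over [0, 1]."""
    return 3.0 / (2 * n + 1) - 2.0 / (3 * n + 1)

def integrate(f, a=0.0, b=1.0, steps=100_000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def first_recombination(n, rng):
    """Inverse-transform sample of the nearest recombination on a branch of
    n meioses, using the survival function (1 - theta)**n."""
    return 1.0 - (1.0 - rng.random()) ** (1.0 / n)

def simulate_distal_mean(n, trios, seed=0):
    """Average second-closest recombination distance over simulated trios."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trios):
        edges = sorted(first_recombination(n, rng) for _ in range(3))
        total += edges[1]  # middle value: where the last pair stops sharing
    return total / trios

n = 10
area = integrate(lambda t: distal_density(t, n))
simulated = simulate_distal_mean(n, trios=100_000)
print(round(area, 4), round(distal_mean(n), 4), round(simulated, 4))
```

Under these assumptions, the simulated mean should agree with the analytical mean to within sampling error, and the density should integrate to 1.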

Figure 3

A graphical approach to the construction of probability density functions. (a) A branch is added to a pair of haplotypes, corresponding to the simplest possible tree. (b) The dotted line represents the power (approximately exponential) function for the common region of two haplotypes. The solid line represents the common region of three haplotypes. The dashed line represents the distal boundary function, expressed as the difference between two power functions. The X-axis is the distance from the mutation, and the Y-axis is a dimensionless quantity.

In the general situation in which the third branch is not identical to the first two, three haplotype patterns can be recognized, according to the position of the critical recombination events in the mutation tree (Figure 4a and b). The density probabilities for these three possibilities are

Patterns X1 and X2 are distinguished here only for the calculation of the proportion n1/n2 in the mosaic haplotype. By simplifying Equations (3) and (4), we obtain the density function (normalized to unity) of the distal edge:

The complete LD region and distal edges result from shared stochastic events and cannot be considered to be independent variables. The probability of observing both quantities θ1 (complete LD region size) and θ2 (partial LD region size) must be calculated from transitional probability expressions. The probability generating functions as a function of relative branch length are and for the X and YZ conditions, respectively (Figure 4c). When all branches have the same length, the density probability of n is

The base case functions being expressed in a closed form, the modal figure (maximum likelihood of number of generations) is simply calculated from the derivative, as the expression

gives rise to the derivative

The value of n at which this derivative passes through zero, that is, at the peak of the distribution, is

This is Risch's formula, which relates the proportion of identical alleles and the number of generations from the MRCA.6 This derivation also holds for determination of the proximal edge in larger samples, as for n-1 out of n haplotypes.
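For readers who want to reproduce a two-point estimate, the sketch below uses the widely cited form of Risch's estimator, g = ln(δ)/ln(1−θ), with δ = (p_d − p_n)/(1 − p_n), where p_d and p_n are the frequencies of the associated allele on mutation-bearing and normal chromosomes. This form is an assumption drawn from the general literature, as the expression is not reproduced in the text above, and the marker values are hypothetical.

```python
import math

def risch_generations(p_disease, p_normal, theta):
    """Two-point age estimate, assuming g = ln(delta) / ln(1 - theta).

    p_disease: frequency of the associated allele on mutation-bearing chromosomes
    p_normal:  frequency of the same allele on normal chromosomes
    theta:     recombination fraction between marker and mutation
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    delta = (p_disease - p_normal) / (1.0 - p_normal)
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]; no excess sharing to date")
    return math.log(delta) / math.log(1.0 - theta)

# Hypothetical markers: (p_disease, p_normal, theta)
markers = {
    "marker_A": (0.80, 0.20, 0.01),
    "marker_B": (0.60, 0.30, 0.02),
}
estimates = {name: risch_generations(*values) for name, values in markers.items()}
for name, g in estimates.items():
    print(name, round(g, 1))
```

Because each marker yields its own estimate, a two-point calculation of this kind is sensitive to the marker chosen.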

Figure 4

Haplotype patterns and their probabilities. (a) On each side of the mutation, three possible patterns are recognized. X1 and X2 patterns are the ‘natural’ patterns, where the shortest haplotype corresponds to the longest branch (X1: ancestor 2 determines the common region; X2: ancestor 1 determines the common region). YZ is the pattern and probability when the shortest segment is either one of the two haplotypes at the tip of the forked branch. (b) Density probability of the three possible patterns (X-axis: recombination fraction, Y-axis: arbitrary units). (c) Graphical representation of the expressions of the conditional probabilities. Solid lines with no symbols indicate the theoretical expectations for the proximal edge (θ1), given the distal edge (θ2). Circles and squares indicate simulations for θ2=0.1 and θ2=0.15, respectively (n1=6; n2=4). Open and filled symbols indicate the position of the recombinations for the X and YZ patterns, respectively.

Allele sharing: IBS or IBD?

Coalescence times may be severely underestimated if allele sharing is assumed to be accounted for by IBD alone.10 For each marker, the probability of observing, by chance, x identical alleles of a total of z alleles if the allele frequency is f, follows a truncated binomial distribution with parameter f. These probabilities, calculated from present-day allele frequencies, are taken as prior probabilities of IBS. A synthetic picture of coalescence time density functions is obtained as a compound probability for every proportion of identical alleles actually corresponding to IBD.
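The chance-sharing prior can be sketched as follows. The truncation point is not specified above, so this example assumes a zero-truncated binomial, that is, one conditioned on at least one copy of the allele being observed. That choice, the function names, and the example values are assumptions made for illustration only.

```python
import math

def binom_pmf(x, z, f):
    """Probability of x copies of an allele of frequency f among z alleles."""
    return math.comb(z, x) * f ** x * (1.0 - f) ** (z - x)

def truncated_binom_pmf(x, z, f, lower=1):
    """Binomial probability conditioned on x >= lower (assumed truncation point)."""
    if x < lower or x > z:
        return 0.0
    norm = sum(binom_pmf(k, z, f) for k in range(lower, z + 1))
    return binom_pmf(x, z, f) / norm

# Hypothetical marker: allele frequency 0.3, four haplotypes compared.
z, f = 4, 0.3
ibs_prior = {x: truncated_binom_pmf(x, z, f) for x in range(0, z + 1)}
for x, p in ibs_prior.items():
    print(x, round(p, 4))
```

A common allele gives a high prior probability of identity by state, so sharing at such a marker carries little information about identity by descent.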

Results

The results of the haplotype analysis of the markers flanking the GUCY2D gene are shown in Table 1. The families are grouped according to mutation, delC or F565S. All patients carry two copies of the same mutation, and none is a compound heterozygote. As would be predicted for patients born to first or second cousins, F565S carriers presented long blocks (up to 20 cM) of identical alleles. Although all F565S-bearing families originate from the same region, we found no genealogical evidence for a common ancestor of two patients. Haplotype analysis of the markers flanking the GUCY2D gene suggested a common founder for each mutation.

Pairwise analysis of allele-sharing decay, using Risch's formula, provided different coalescence numbers, according to the marker considered. For delC, the age of the common ancestor with two-point methods varied from 25 to 146 generations. For F565S, estimation of the age of the common ancestor with a two-point formula, using the proportion of identical alleles as a parameter, gave even more variable results for the number of generations elapsed. The long region of apparent allele identity observed for each patient and the familial context of marriage between cousins suggest that a grandparent or a great-grandparent was probably an intermediate ancestor. Based on this hypothesis, the number of haplotypes to be analyzed can be reduced from eight to four, and the calculation with the pairwise method gives variable coalescence times that are extremely sensitive to the choice of the marker analyzed (20–110 generations).

Analyzing every possible haplotype trio confirmed the hypothesis of intermediate ancestors (Figure 5a). Averaging over all possible trios is not appropriate, because not all trios of a sample can have the same weight for calculation of the root of the tree. A branch-and-bound approach, using the density functions described in the methods section as a skeleton, reduces the number of haplotypes to be considered as a first step and then calculates the time, in generations, to the MRCA, using haplotype sizes as parameters (Figure 5b).

Figure 5

Ancestry of the GUCY2D mutations. (a) Calculation of the n2/n1 ratio for a trio of haplotypes (delC, haplotypes 1, 3, and 14, Table 1). (b) Compound probability of time to common ancestor. The dashed line indicates the number of generations for the mutation F565S and the solid line the number of generations for the mutation delC.

Discussion

This paper presents a new approach for constructing mutation genealogies from present-day haplotypes. Using the simplest branched model as a starting point, exact calculations can be made, using the size of the preserved regions around the mutation.

When the ancestral haplotype is known, the expected genetic distance from either edge of the ancestral haplotype is simply the reciprocal of the age of the variant.1 In practice, the ancestral haplotype is not known, and determining whether any block of alleles actually represents the ancestral sequence of the root ancestor or an intermediate ancestor is subject to the same problems as the direct calculation of coalescence times. The core region around the mutation therefore seems to be the best region to start from, as this region in complete LD almost certainly corresponds to the sequence of the root ancestor. The size of the region in complete LD reflects only the sum of the number of meioses in each branch, and is an imperfect estimator of time to the common ancestor. The number of meioses accounting for the distribution of the size of the complete LD block is sensitive to sample size (number of haplotypes studied): large samples may have very small core haplotypes that nonetheless correspond to a recent mutation; the mean ancestor haplotype length is, however, insensitive to sample size.18 Furthermore, the uncertainty associated with the size of the region in complete LD is always large. This uncertainty is explained by the dimension of the density function, which is proportional to , whereas the distribution of haplotypes approximates to a gamma function in star-like genealogies,1 and is therefore proportional to and represented by a much narrower density function in large samples.

Risch's formula describes the relationship between the proportion of shared alleles for a single marker of known distance from the mutation and the number of generations from the common ancestor. As Risch's formula is a moment estimator method, we would expect to find a difference between the mode and the mean if the distribution has an asymmetric shape. Analysis of the relationship between the modal and mean values of the unnormalized distribution of the distal edge provided an unexpected link between our Bayesian framework and Risch's approach.

In a mutation history tree, branch length represents time of successive meioses. In our model with only a small number of nodes, a star-like or star-shaped genealogy is not assumed, but may result from a calculation in which the two coalescence times are found to be very similar. As our two-ancestor equations could be transposed to the broader situation of the two most proximal edges of haplotype-sharing regions for larger samples, it is tempting to extrapolate the equations to the case of a proportion of shared alleles of any size with more than two ancestors. However, such model-free formulae are unfortunately unattainable because the expected distribution of haplotypes always depends on an underlying model of genealogy and model-free formulae must therefore represent every possible tree for a given number of chromosomes. An absolute calculation is at odds with the inevitable uncertainty of experimental parameters (recombination rates, allele and mutation frequencies), is mostly not required, and is difficult to represent.

The probability of a mutation genealogy is easier to understand in terms of a limited number of parameters. Ad hoc algebraic equations based on simplified genealogies could be constructed for any specific shape of tree, because we can derive an algebraic correspondence between tree shapes and IBD blocks. Inference about mutation history is likely to concern only a few ancestors near the root of the tree, suggesting that the ‘small number of ancestors’ model may provide the framework of a heuristic trade-off between accuracy and simplicity. Analyses based on haplotype size naturally lead to a multipoint method, the distribution of the proximal and distal edges being sufficient for the calculation of two branch lengths at a time. Considering the entire haplotype leads to more robust estimates than pairwise disequilibrium.19

Bayesian procedures require specification of prior distributions for all the parameters.20 We implicitly assume in our analysis that the prior age is drawn from a uniform distribution. This assumption of a uniform prior distribution is appropriate for a rare mutation because the large frequency variations associated with the small number of mutation carriers usually generate an approximately flat distribution. A well-known relationship links the frequency of a DNA variant and effective population size.21 For rare variants, the expected figure is often blurred by stochastic variations of variant frequencies. Known distributions from the same population could be used to define the prior distribution required for a fully Bayesian approach.
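The role of the uniform prior can be illustrated with a short calculation. The sketch below assumes the core-size likelihood m(1−θ)^(m−1), a uniform prior on the number of meioses m over a finite grid, and a hypothetical observed core size of θ = 0.015. The grid limit and function names are illustrative and do not describe the implementation used in this study.

```python
import math

def posterior_over_meioses(theta, max_m=1000):
    """Posterior P(m | theta) for m = 1..max_m under a uniform prior on m,
    with the assumed likelihood m * (1 - theta)**(m - 1)."""
    weights = [m * (1.0 - theta) ** (m - 1) for m in range(1, max_m + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def posterior_mode(posterior):
    """Value of m (1-based) with the highest posterior probability."""
    return max(range(len(posterior)), key=posterior.__getitem__) + 1

def credible_interval(posterior, mass=0.95):
    """Equal-tailed credible interval read from the cumulative posterior."""
    tail = (1.0 - mass) / 2.0
    cumulative, lower, upper = 0.0, None, None
    for m, p in enumerate(posterior, start=1):
        cumulative += p
        if lower is None and cumulative >= tail:
            lower = m
        if upper is None and cumulative >= 1.0 - tail:
            upper = m
    return lower, upper

theta_observed = 0.015  # hypothetical core size, as a recombination fraction
posterior = posterior_over_meioses(theta_observed)
mode = posterior_mode(posterior)
low, high = credible_interval(posterior)
print(mode, round(-1.0 / math.log(1.0 - theta_observed), 1), (low, high))
```

With a flat prior, the posterior mode coincides with the maximum of the likelihood, close to −1/ln(1−θ), and the wide credible interval reflects the large uncertainty attached to the size of the region in complete LD.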

As most recessive disease-causing mutations are recent, such mutations are a potentially valuable tool for probing the recent molecular history of populations. Branching points are more likely to occur when the population is expanding, and are surrogate markers for such periods. When the frequency of a DNA variant is large and accurately known, the estimated age of this variant may be used to calculate the population expansion coefficients because we have both the size of the population and the time needed to reach the observed frequency.

In this paper, we have analyzed genotypic data from a series of families with GUCY2D mutations. The 387delC and F565S mutations were observed in North African and Southern European (Portuguese) patients recruited from five different countries. Founder effects may account for this high prevalence of the two mutations analyzed here in North African patients. No compound heterozygotes were observed. This observation suggests that mutation frequency was variable in the regions considered, possibly due to secondary founder effects. All carriers of the 387delC mutation had a conserved haplotype over a 1-cM interval containing the GUCY2D locus. As mean ancestor haplotype length is the inverse of time to the common ancestor, the apparent midpoint of haplotypes upon visual inspection provides a first approximation of ancestry (1 cM=100 generations). Unfortunately, haplotype midpoint is only a proxy for the mean residual ancestral haplotype, one of the most remarkable figures in the molecular history of a mutation because this mean haplotype is sensitive neither to sample size nor to a specific genealogical tree.18 The key challenge is not obtaining a crude estimate of ancestry based on visual inspection, but estimating the precision of the calculation of branching points as proxies for periods of population expansion. The branch-and-bound algorithm groups together similar haplotypes and reduces the number of branches. The final step calculates the expected number of generations for these independent branches. This final heuristic step is justified by the simplicity of the algebraic description of unconnected branches. The precision of the estimation is limited by the uncertainty about the local variations of the sex-averaged genetic map: more recombinations are expected from female ancestors than from male ancestors, but these differences between the male-specific and female-specific maps are expected to be limited in the long run. As recombination hotspots occur at intervals of 200 kb or less on average,22 fine-scale variations in recombination rates limit the time span of methods based on haplotype-decay models.

Selection against homozygotes has a limited impact on the calculation, even in populations with a high inbreeding coefficient, because the fraction of mutations removed at each generation is approximately proportional to a usually low gene frequency. After 100 generations, a mutation of frequency 0.01 in a population of inbreeding coefficient 0.02 is affected by less than 2%.

In this study, we have described the density function of haplotype block edges in a simple bifurcating model. Because it is difficult to translate haplotype sequences into molecular history, genealogies of intermediate complexity, with a small number of ancestors, represent a heuristic middle ground between simple but unrealistic star-like trees and general models of intractable complexity.