Introduction

Parametric linkage analysis searching for a unique founder mutation is a powerful tool to identify rare genetic variants with large effects. Genetically isolated populations provide extended pedigrees for linkage analysis as evidenced by numerous successes for complex traits, including type 2 diabetes1 and Alzheimer's disease.2 In such populations, pedigrees can be reconstructed based on genealogical records resulting in deep pedigrees that include a large number of multiple lines of descent. However, utilizing such large pedigrees in linkage analysis is computationally challenging.

For exact multipoint linkage analysis, several software packages implementing Lander–Green–Kruglyak algorithm,3, 4 such as Genehunter,3 Merlin,5 and Allegro,6 are frequently used. The computational complexity of this algorithm increases linearly with the number of markers, but exponentially with the bit-size of the pedigree. The bit-size is defined as two times the number of individuals with parents presented in the pedigree minus the number of pedigree founders.3 As long as the pedigree bit-size is small, the Lander–Green–Kruglyak algorithm can analyse a large number of markers. In modern implementations,5 the time to compute multipoint LOD score using the Lander–Green–Kruglyak algorithm also depends on the fraction of pedigree members with missing genotypic information, which may be large for deep pedigrees because phenotypes and genotypes are usually unknown from the upper generations. Therefore, exact multipoint calculations are limited to pedigrees of several dozens of bits. Programs using Markov-chain Monte Carlo (MCMC) algorithms, such as Simwalk2,7 Loki,8 and Morgan9 can analyse larger pedigrees but still fall short in the context of large pedigrees with hundreds of loops, especially for genome screens with large number of densely spaced markers.10, 11

A common solution to reduce the computational burden is to split a large pedigree into smaller, and thus computable, subunits. Analyses of Hutterite pedigrees have revealed that a substantial amount of linkage information may be lost when truncating the pedigree for linkage analysis based on recent generations.12 Manual splitting by an expert is only possible for relatively small pedigrees. For cutting large pedigrees, a semi-automatic method has been proposed based on factor analysis.13 This method relies partly on the expert decisions and often yields subpedigrees that are still too complex for the Lander–Green–Kruglyak algorithm-based linkage analysis. Falchi et al14 suggested a pedigree-splitting method based on graph theory maximum-cliques partitioning algorithm. This algorithm, although automatic, requires a number of parameters to be prespecified, for example a maximum number of generations, range for the measure of relatedness used to group individuals, and the range of the number of subjects of interest in a subpedigree. The optimization of these parameters largely depends on examining different sets of resultant subpedigrees. Furthermore, the available software implementing this algorithm does not guarantee that all of the resultant subpedigrees fall within specific bit-size limit and thus can be efficiently analysed by parametric linkage analysis using the Lander–Green–Kruglyak algorithm.

In this work, we develop a fast automatic algorithm for splitting large pedigrees based on user-specified maximum bit-size restriction. The algorithm specifically aims to split deep pedigrees where patients are relatively remotely related through genealogic connections and to produce an optimal set of subpedigrees for parametric linkage analysis of binary traits under rare dominant mutation model.

Pedigree splitting algorithm

In pedigree splitting we focus on the family members who have known genotypes and/or phenotypes. These people are denoted as ‘subjects of interest’ (SOI). We assume that some SOI share the genetic variant explaining their phenotype identical by descent from their common ancestor(s). The aim of our heuristic algorithm is to split a large pedigree into subpedigrees containing a maximal number of SOI who are related to a common ancestor and the bit sizes of the resultant pedigrees should be smaller than or equal to that specified by the user. This aim can be theoretically achieved by evaluating all SOI subgroups in terms of bit-sizes of pedigrees relating them to a common ancestor. The number of subgroups to evaluate, however, grows exponentially with the number of SOI studied, and becomes prohibitively large with more than 20–30 SOI. The number of subgroups to evaluate, therefore need to be reduced.

In our algorithm the kinship coefficient, φij, is used to measure the degree of relatedness between SOI. The kinship coefficient is defined as the expected probability that two alleles randomly sampled from a pair of relatives i and j are copies of the same ancestral allele (identical by descent). For example the kinship coefficient is 1/4 if i and j are first-degree relatives (eg siblings) and 1/8 if they are second-degree relatives (eg uncle–niece). In our work, the coefficients were calculated using a modified version of the PEDIG software developed by Didier Boichard.15

In the first step, the algorithm constructs a matrix of subgroups that are sorted based on kinship. Given a group of SOI of size N, consider individual m (1, N). The set of relatives of m is sorted in decreasing order according to their kinship to m, so as m is the first element of this set. Let us call this set as Rm. Let Rm(f) be a set of the first f elements of Rm, f (2, N). When f=2, Rm(2) includes m and his closest relative. When f=3, the next closest relative of m is included into Rm(2), so that Rm(1) Rm(2) Rm (N).

Iterating the central individual m over all SOI gives an N by N-1 matrix,

where each row represents the subgroups of SOI derived from the same central individual but having different sizes (2 to N). Each column represents the subgroups of the same size, which are derived from each individual, who is considered as the center of the corresponding subgroup. Identify unique groups of relatives by iterating over matrix R and give Rm(f) a binary index of 1 when the content of Rm(f) is seen for the first time, otherwise assign zero.

At the second step of our algorithm, the genealogies connecting the members of identified unique subgroups (as identified by index of 1) are constructed using the PedHunter program developed by Agarwala et al.16 For each subgroup, a subpedigree linking the maximal number of subgroup members to the most recent common ancestor is reconstructed and the bit-size of every subpedigree is computed. For each row Rm, the construction of subpedigrees starts at Rm(2) and stops when the bit-size of Rm(f) violates the maximal bit-size limit. There is no need to construct subpedigrees for Rm(f+1) to Rm(N) because the bit-size of Rm(f+1) is always greater than that of Rm(f). At this stage, all constructed subpedigrees satisfy the bit-size limit and contain a unique configuration of SOI. The subpedigree connecting the largest number of SOI, as heuristically the most interesting pedigree, is selected as the first subpedigree. When several non-overlapping subpedigrees are eligible all of them are selected. When several partly overlapping subpedigrees are eligible, the one with the smallest bit-size is selected. If still several subpedigrees are eligible, the one with the highest average kinship among SOI is selected. When, additionally, the average kinships are the same, a random subpedigree is selected and the alternative selections are saved in a log file. From our experience, the latter scenario is very rare for reasonably large bit-sizes and complex pedigrees with multiple lines of descent. Within the search space, this exhaustive algorithm guarantees that the selected subpedigree has the maximal number of SOI in respect to the bit-size restriction B.

At the third step, the algorithm removes the SOI belonging to the identified pedigree(s) from further consideration and is recursively applied, starting with the Step 1, to the remaining SOI, until no further SOI can be assigned to a subpedigree.

The algorithm is presented in Figure 1. The described algorithm is implemented in a software package, PedCut, which is available at http://mga.bionet.nsc.ru/soft/index.html.

Figure 1
figure 1

Flowchart of the pedigree splitting algorithm.

Splitting a large pedigree with PedCut and Greffa

We tested properties of our program PedCut using a large and complex pedigree and compared it with the Greffa program developed by Falchi et al.14 This program implements the maximum clique-partitioning algorithm. The pedigree comprises a part of the genealogy used in a genome-wide screen for late onset Alzheimer's disease (AD) in genetically isolated Dutch population.2 The original pedigree contains 103 AD patients and 4645 family members. For demonstration purpose, we used a fraction of the original pedigree that contains 50 randomly selected AD patients and their 2460 ancestors spanning over 18 generations (Table 1 and Figure 2). These 50 AD patients, who are considered as SOI, are scattered in the most recent generations. Except five sibling–pairs and two aunt–niece pairs, all SOI are related distantly via multiple genealogic connections (Table 1). Here a genealogic connection is defined as a unique genealogic pathway via which an allele identical by descent can be transmitted. For example a pair of siblings have two unique lines connecting them, one from the mother and another from the father whereas a pair of half sibs have only one line connecting them via the shared parent.

Table 1 Genealogic characteristics of the pedigree including 50 Alzheimer's disease patients
Figure 2
figure 2

The initial pedigree consisting of 50 Alzheimer patients (blue: patients). This pedigree was drawn using software package Pedfiddler version 0.5 (http://www.medicine.mcgill.ca/statgene/software.html).

We used maximum bit-size restriction of 18, 24, and 30 bits to test PedCut and Greffa. To make the results from Greffa comparable with our results, we tried a number of combinations of the parameters required by Greffa and reported the results from the optimal set of parameters, which give subpedigrees with the maximum number of SOI within 18, 24 or 30 bits.

Table 2 shows that PedCut can assign more patients into subpedigrees compared to Greffa in splitting this pedigree. Under any bit-size limit investigated, PedCut could successfully assign most SOI to subpedigrees whereas Greffa left a considerable number of SOI unassigned. For example, Greffa left 12 (24%) to 28 (56%) SOI unassigned, and thus completely lost for linkage analysis. On the other hand, PedCut left at maximum one (2%) SOI unassigned. The only person who could not be assigned to a pedigree by PedCut at bit-size of 18 and 24 is connected to a closest SOI through 10 meioses; this person was assigned when bit-size limit was set to 30.

Table 2 Result of a pedigree of 50 Alzheimer's disease patients using PedCut and Greffa under various bit-size restrictions

Also, the average number of SOI per subpedigree from PedCut was larger (2.9, 3.5, and 4.2 for 18, 24, and 30 bits) than that from Greffa (2.9, 2.4, and 2.5). Furthermore, the maximum number of SOI from Pedcut (6, 6, and 7 for 18, 24, and 30 bits) is larger than that from Greffa (5, 4, and 3). Finally, compared to Greffa, PedCut produced more uniformly sized subpedigrees, as evidenced by a narrower range of bit and pedigree-sizes (Table 2).

With an Intel Pentium 4 2.4 GHz CPU, PedCut could split the pedigree in about 3 min. The resultant subpedigrees from PedCut using 24 bits as the pedigree limit are depicted in Supplementary Figure 1, where the 14 subpedigrees are ordered in the same sequence as PedCut identified them. All five sib-pairs were captured by the first two subpedigrees and the two aunt–niece pairs were captured by subpedigrees 3 and 4. The pedigrees 5 and 6 also contain multiple SOI who are relatively closely related to each other. The subpedigree seven contains two SOI who are double-first cousins and the subpedigree 10 contains two SOI who are second cousins. The subpedigrees 8, 9, 11 and 12 contain multiple distantly related SOI. The pedigree 13 and 14 contains only two distantly related SOI each.

Power comparison

We compared the power to detect linkage to a rare fully penetrant variant using subpedigrees derived in the previous section. Using these pedigrees, we simulated genetic data conditional on the phenotypes observed, using method of Boehnke,17 implemented in software package SIMLINK version 4.12. We assumed a genetically homogenous binary trait, determined by two underlying alleles (D being causal and d being normal allele). The penetrance of DD was fixed at 1.0 and penetrance of dd was fixed at 0.0. The penetrance for Dd was set at 1.0 (dominant model), 0.75, 0.5, 0.25, or 0.0 (recessive model). The frequency of D allele was set in such manner, that locus-specific population attributable fraction was 0.0001. One marker locus having 5 alleles with equal frequency of 0.2 was modelled. One thousand replicas were simulated and analysed under each scenario concerning Dd penetrance. The distance between the trait and marker loci was set to zero. Consequent linkage analysis was performed using correct model.

For 18-bits pedigrees, results of power calculation are shown in Figure 3. The subpedigrees derived from PedCut consistently showed higher power compared to the ones from Greffa, across all models analysed. This is most likely due to the fact that Greffa left a considerable amount of patients unassigned to any subpedigree. Of interest, the subpedigrees derived from PedCut showed a clear trend of increase in power when the underlying genetic model was becoming more dominant. The same trend exists for the pedigrees from Greffa but in much less degree. Similar results were obtained for 24- and 30-bit pedigrees as well (not shown).

Figure 3
figure 3

Expected LOD score for pedigrees derived using PedCut and Greffa with a bit-size limit of 18.

Discussion

In this work we have developed an algorithm that recursively groups the subjects of interest (SOI, eg genotyped patients) into subgroups that fall within a certain bit-size limit and include the maximum number of SOI who share a common ancestor. With the algorithm we exploit the basic rationale of linkage analysis, that is that affected relatives are likely to share the disease-causal allele identical by descent from a common ancestor. Fast grouping is achieved by prioritizing relatives using kinship. Our algorithm guarantees that the derived subpedigrees can be directly and efficiently analysed by software implementing the Lander–Green–Kruglyak algorithm. Further, it is fully automated.

Any method for subpedigree identification involves breaking relationships between SOI. Breaking relationships may increase false-positive linkage signals18, 19 or reduce the power of linkage analysis.12 Bias is especially pronounced when close relationships are broken. In our program we include a default option to preserve close relationships between SOI, keeping all second cousin or closer relationships between SOI. In general, type 1 error for split pedigree should be worked out by gene-dropping using complete pedigree, and re-analysis using split pedigrees.2

Exhaustive search algorithm guarantees (within the search space) that the selected subpedigree has the maximal number of SOI in respect to the bit-size restriction, if no equally eligible subpedigrees are available. The latter situation, though possible, is in our experience unlikely in deep pedigrees coming from genetically isolated populations, where mating is mostly random, and multiples genealogic connections are observed between SOI.

We have previously applied PedCut to split a pedigree including 4645 people of which 112 were Alzheimer's disease patients.2 In that study, we could assign 103 patients to 35 pedigrees using bit-size limit of 35 in about 11 min. Empirical threshold for 5% genome-wide significance, established using simulations in complete pedigree, was estimated to be 3.64. Regions of significant linkage were genotyped using dense single-nucleotide polymorphisms (SNPs) marker map in an independent cohort. Significant associations were observed between some of these regions and cognitive function. This proves applicability of our approach in real studies.

In summary, we developed a heuristic algorithm that is suitable to split large and complex pedigrees coming from genetically isolated populations. Our algorithm and associated software (PedCut, http://mga.bionet.nsc.ru/soft/index.html) can facilitate fast genome-wide linkage search for rare mutations.