Introduction

In isolated populations, most individuals share relatively recent common ancestors. If more than one individual inherited the same ancient haplotype in a region, we call them haplotype sharers. Segments of their chromosomes are identical by descent – their haplotypes ‘descend from a common ancestor without either of them experiencing a recombination’.1 Although SNP arrays and next-generation sequencing do not reveal gametic phase, haplotype sharers can be identified using computational methods. The applications of inferred regions of identity by descent (IBD) include optimization of resequencing studies and mapping genetic effects on complex traits.2, 3

In the first application, when the SNP genotypes are available and next-generation sequencing is planned, haplotype sharing between the individuals can save resources. With sharing inferred from SNP genotypes, it is possible to choose a minimally redundant subset of individuals to be sequenced, and then to impute sequence data into other subjects with SNP genotype data. Imputations that rely on IBD are now recognized to increase the power of sequencing studies in population isolates.4

In the second application, shared haplotypes may make it possible to detect the effects of genes in which functional variants that are rare in the general population have drifted to high frequency in the isolated population. Furthermore, the reduced allelic heterogeneity in an isolate provides an opportunity to detect associations with these otherwise rare variants. However, conventional GWAS studies may fail to detect associations with rare variants, as these may not be in linkage disequilibrium (LD) with SNPs on genotyping arrays that have been optimized to tag common variants.5, 6 The uncovered ancestral haplotypes can be in stronger association with rare functional variants and hence improve the power of association tests. The most ambitious attempts to map effects of shared haplotypes reconstruct descent trees, but this approach has been found computationally infeasible.7

Genomic phase can be revealed by long-range phasing methods that exploit regions of IBD between related individuals, or by local phasing methods that rely on patterns of LD.8 The resulting haplotypes from either of the models make it possible to revise the IBD regions in the former case, or find them in the latter case. The first rule-based algorithm for long-range phasing was described by Kong et al,9, 10 and is similar to the method presented later by Hickey et al.11 Both of these methods detect IBD sharing only in pre-specified genetic regions, identical for all pairs of compared diplotypes, whereas in reality boundaries of IBD regions can occur anywhere across the genome. Systematic long-range phasing (SLRP)12 is a fully probabilistic model for phasing and IBD detection in isolated populations, yet the inference algorithm used by SLRP requires significant computational resources. FastIBD is an example of a local phasing method for cohorts of unrelated individuals that was adapted for inference of IBD sharing.13 The method constructs a model of haplotypes, and even though it was designed for outbred populations and may struggle to capture long-range haplotypes present in an isolated populations, multiple resampling of haplotypes may make fastIBD also suitable for more closely related individuals. Long-range methods explicitly capture all shared haplotypes, whereas the local methods build more parsimonious models, which, however, may carry enough information for the detection of IBD sharing also in population isolates.

We describe a new long-range algorithm for the detection of identical by descent haplotypes in isolated populations named ANCHAP. Our method is designed to detect borders of regions of IBD precisely, at minimal computation time and with state-of-art sensitivity and false discovery rate. We compare ANCHAP with other long-range methods and a local method, and demonstrate an application of the identified IBD regions for optimization of sequencing studies.

Materials and methods

Cohorts under study

In our study of ancestral haplotypes, we analyzed four European cohorts, three of which (Orkney Complex Disease Study (ORCADES), CROATIA-VIS, CROATIA-KORCULA) are from isolated island populations and one from a mainland population (Study of Colorectal Cancer in Scotland (SOCCS)). The ORCADES is a family-based, cross-sectional study in the isolated Scottish archipelago of Orkney.14 Genetic diversity in this population is decreased compared with mainland Scotland, consistent with the high levels of endogamy throughout history. Orkney has been inhabited for over 5000 years, but the original population was almost completely replaced by Norse Vikings about 800–900 CE. From about 1300 to 1600 CE, there was an influx of mainland Scots.15 For this analysis, we used data from 749 participants aged 18–100 years from 10 islands; however, for the purposes of evaluation of methods we removed parents from the genotyped parent–offspring pairs, which reduced the cohort size to 597 individuals. Genotyping in the study was done using the Illumina HumanHap300 array (San Diego, CA, USA). The CROATIA-VIS study is a family-based, cross-sectional study in the villages of Komiza and Vis on the isolated island of Vis, which included 1056 examinees aged 18–93 years.16 The CROATIA-VIS study genotyping used the Illumina Hap300v1 SNP chip (San Diego, CA, USA). The CROATIA-KORCULA study is a family-based, cross-sectional study in the villages of Lumbarda, Zrnovo and Racisce on the isolated island of Korcula in Croatia.17 The study included 965 examinees aged 18–95 years. The CROATIA-KORCULA study genotyping used the Illumina Hap370CNV SNP chip (San Diego, CA, USA). The SOCCS study is a case–control study of prospectively collected colorectal cancer cases from all Scottish hospitals, and matched controls. One thousand participants in each group in the first phase of the study were genotyped with the Illumina HumanHap300 array. The participants for the control group were matched by age, sex and region to cases according to a nearly complete population-based register, and then selected at random. We analyzed the genotypes from the control group so as to obtain a sample representative of the Scottish population as a whole. Details of data pre-processing are given in the Supplementary Materials.

Recent IBD

Haplotypes that are identical by descent originate ‘from a common ancestor without either of them experiencing a recombination’.1 Unless a recent mutation occurred, the alleles on IBD haplotypes should be identical, also at untyped loci. For two individuals sharing a haplotype inherited from a common ancestor, the lengths of the shared regions are exponentially distributed with a mean equal to (2n)−1 morgans, where n is number of generations back to most recent common ancestor (MRCA).18 However, the distribution of segment lengths has a variance of (2n)−2 morgans; hence, the correspondence of segment length with time to common ancestor is only approximate. In an isolated population such as the Orkney population, founded by the Viking settlement about 50 generations ago, and where population size has been constrained over many generations, the time to MRCA is either of the order of 1000 generations ago, during the early settlement of Europe, or less than 50 generations ago. To be able to use IBD sharing to infer sharing of rare variants, taking into account mutation rates,19, 20 we restrict the definition of IBD sharing to sharing via a recent common ancestor. In practice, we can only do this by setting a minimum length for the shared region. For this study we set the cut-off value at 2 cM, equal to the expected length of sharing given a time to MRCA of 25 generations. In addition, the cut-off value at 2 cM has been suggested in literature as a threshold above which accurate detection from contemporary genotyping arrays can be obtained.18

Algorithm of ANCHAP

The objective of ANCHAP is to infer recent IBD from the SNP data with maximum sensitivity and specificity. The algorithm should declare IBD only where the haplotype was co-inherited from a recent common ancestor, and find all of such regions (see Figure 1).

Figure 1
figure 1

Haplotype sharing structure within a population isolate. Genotyped individuals are identified by numbers 1 to 5. Each individual has two haplotypes, represented by thick bars. Red, blue and green dotted lines represent IBD of two haplotypes in a genomic region. The dark gray-shaded haplotypes are unique in the sample, and they are not shared between the sampled subjects.

The algorithm consists of three stages:

  1. 1

    Stage I. First scan for IBD sharing from comparisons of multilocus genotypes of all pairs of individuals (Algorithm 1 in Supplementary Materials).

  2. 2

    Stage II. Splitting haplotype sharers by alignment and phasing. Individuals carrying parts of the individual’s maternal haplotype are distinguished from ones that carry the paternal haplotype (Algorithm 2 in Supplementary Materials).

  3. 3

    Stage III. Second scan for haplotype sharing: a more sensitive and specific scan for IBD sharing, by pairwise comparisons of partially uncovered haplotypes (Algorithm 3 in Supplementary Materials).

At Stage I, ANCHAP detects IBD sharing between pairs of unphased diplotypes, using a variable-length sliding window to screen for long regions with no opposing homozygotes between each pair of individuals. Where there are no opposing homozygotes over a long region, a haplotype is likely to be shared IBD. To account for uncertain sharing near the boundaries of a region with no opposing homozygotes, a number of markers at the margins are trimmed and are not included into the shared region. The parameters of the method are as follows: the IBD threshold – the minimum genetic length in centimorgans of a region without opposing homozygotes between a pair of diplotypes, and the number of markers to be trimmed.

In a given region, to reconstruct a phase, the proband’s haplotype sharers can be split into two groups by alignment at Stage II. If sharers of the proband’s haplotypes on both the gametes are present, they will form two groups. If sharers of only one gamete are present, or if the proband is homozygous by descent, they will form one group. When sharers of each proband’s haplotypes are identified, phasing becomes possible. We can recover the proband’s haplotypes at each locus where at least one of the haplotype sharers is homozygous, but information about all of the homozygotes among the sharers eliminates phasing errors. As a haplotype is shared, the haplotype sharer’s allele at a homozygous locus must be the same as the allele on the proband’s haplotype.9 As the method distinguishes groups of individuals sharing each of the proband’s haplotypes in a region without recourse to pedigree, the actual paternal or maternal origin of these proband’s haplotypes is not known, and in any case is irrelevant for phasing.

At Stage III of ANCHAP, we make use of the phase information obtained from the IBD regions that were detected earlier. When partially complete haplotypes have been inferred, a second scan for IBD sharing is undertaken, exploiting the additional phase information gained. A pair of completely known haplotypes can mismatch at any phased locus, whereas in a pair of unphased genotypes only loci homozygous for both individuals are indicative of sharing or lack thereof. Therefore, partially complete haplotypes carry more information for IBD detection, and thus the second scan can be more accurate. The idea for this second scan for haplotype sharing was inspired by the hidden Markov model described by Genovese et al.21 To detect IBD from comparisons of nearly complete haplotypes, we can use a threshold shorter than the one that delineates IBD sharing from IBS in diplotype–diplotype comparisons, and we no longer need to trim the borders of the shared regions. Recent IBD is declared in a region that spans a sufficient genetic distance, and when number of phased markers in the region exceeds the threshold of minimum phase information.

Comparison of methods

We compared ANCHAP against SLRP – a fully probabilistic method for long-range phasing, and against fastIBD – a local phasing method designed for populations of unrelated individuals. We evaluated their results genome-wide against recent IBD that can be reliably detected by comparison of haplotypes phased using parental genotypes. Among the individuals genotyped in ORCADES, there were 58 individuals with both parents genotyped, and on average 80% of heterozygous loci of such reference individuals were phased. We identified the regions of true recent IBD sharing between pairs of reference individuals where their haplotypes are identical for at least 2 cM. Each of the compared methods was used on genotype data from the 597 individuals in ORCADES, free from parent–offspring pairs. The results between the reference individuals were evaluated against the regions of true recent IBD. The total number of markers in true regions and in resulting regions is TP, in true regions but not in the resulting regions is FN, and not in true regions but in the resulting regions is FP. For each method, we quote sensitivity defined as the ratio TP/(TP+FN) and false discovery rate FP/(FP+TP).

Parameter tuning

All of the compared methods require setting different parameters. The methods were tuned according to their sensitivity and false discovery rate on a subset of the ORCADES data set from chromosome 2, using the reference individuals phased in parent–offspring trios. Other metrics, like inconsistencies between genotypes in supposed IBD regions are further described in Supplementary Materials.

We attempted to set the IBD threshold at Stage I of ANCHAP, such that the length of falsely assumed IBD regions is reduced while recovering as much of the true IBD regions as possible, and thus the phase recovery that uses the IBD segments is most accurate and maximally spread. The margin sizes were set by comparison of borders of IBD regions deduced from genotypes and the reference haplotypes. The setting of the alignment parameters at Stage II aimed at increasing the ratio of the IBD segments aligned into haplotypes and minimizing the inconsistencies between them, which indicate alignment errors. At Stage III, the minimum number of markers phased for both individuals in a putative IBD region was set using the reference haplotypes and the sensitivity and specificity values. The details of the experiments are shown in Supplementary Materials.

In addition, using the values of sensitivity and false discovery rate in data from chromosome 2, we adjusted the parameters of SLRP and fastIBD. SLRP required setting the expected length of IBD regions and expected regions of IBS, but not the IBD regions. The scale parameter in fastIBD controlled the parsimony of the haplotype model.

Optimization of resequencing studies in population isolates

The uncovered IBD sharing within a cohort can be used for efficient selection of individuals to resequence, with a view to using them as a reference for imputation. Selection of individuals for resequencing is based on maximizing representation of haplotypes and minimizing multiple resequencing of the same haplotypes. As the first individual for resequencing, we choose the one whose haplotypes have the most copies in the rest of the cohort. After excluding regions that have been covered by sharing with individuals already chosen, we repeat the procedure of selecting the individual with the most copies until a target level of coverage has been achieved (Algorithm 4 in Supplementary Materials).

Results

Phase propagation in ANCHAP

Table 1 shows the gain in sensitivity and the reduction in false discovery rate in the detection of recent IBD regions that are obtained at Stage III of our algorithm, as compared with Stage I. On chromosome 2, sensitivity of IBD detection between the 58 reference individuals per pair of individuals per marker grew from 0.75 to 0.81 in the second round. Detection of IBD for partially phased haplotypes in the second round helped to reduce the false discovery rate from 0.16 to 0.01.

Table 1 Part III of ANCHAP offers better accuracy in detecting regions of IBD than the first one

Comparison of ANCHAP against other methods

Table 2 compares different tuning settings of ANCHAP, SLRP and fastIBD. Using data from chromosome 2, we manipulated the parameters of SLRP and fastIBD, to match sensitivity and false discovery rate of ANCHAP. Notably, as the sensitivity of fastIBD grows to exceed ANCHAP’s 0.81, false discovery rate of fastIBD reaches 0.024.

Table 2 Parameter tuning of ANCHAP, SLRP and fastIBD

Table 3 shows the accuracy of IBD detection of ANCHAP against the other methods and their running times. Genome-wide, the methods achieved similar sensitivity of IBD: from 0.75 for SLRP, through 0.78 for ANCHAP and to 0.82 for fastIBD. Long-range methods, ANCHAP and SLRP resulted in similar false discovery rates of 0.009 and 0.007, respectively, whereas for fastIBD it is 0.025. Genome-wide inference of IBD with the SLRP model took much longer than for the other methods: the analysis with SLRP took 207 h, whereas ANCHAP handled the same task in 20 h and fastIBD in 12 h.

Table 3 Comparison of accuracies of methods for IBD detection

Sharing in different cohorts and across the genome

The average number of haplotype sharers per locus varied from 9.4 in CROATIA-KORCULA, through 12.3 in ORCADES and to 12.6 in CROATIA-VIS. In SOCCS, which consists of genotypes of nominally unrelated individuals, there were only 0.9 sharers per locus on average.

The frequency of haplotype sharing varies not only between the cohorts but also across the genome. Figure 2 shows average counts of the haplotype sharers in different locations across the genome. Drops at the telomeres can be consistently observed, as well as the peaks on chromosomes 2, 6, 8 and 9. In SOCCS, particularly notable are the peaks on chromosomes 2 and 6, which also occur in ORCADES and CROATIA-VIS but not in CROATIA-KORCULA.

Figure 2
figure 2

Density of surrogate parents across the genome in the four cohorts (3 cM threshold, 2 cM in the second round). The horizontal axis shows index of a SNP, and not its physical or genetic location. Supplementary Materials give the locations of the peaks highlighted on x axes.

Optimization of resequencing studies

The uncovered sharing in ORCADES was used to choose an optimal subset of 200 individuals to be sequenced. Figure 3 shows that if such an optimal 20% of ORCADES individuals are sequenced, 65% of haplotypes in the cohort would be genotyped either directly or through an IBD copy. For CROATIA-VIS and CROATIA-KORCULA, the corresponding coverage would be 57% and 55%, respectively, whereas for SOCCS it would be 25%.

Figure 3
figure 3

Coverage of surrogate parents for different resequencing scenarios in the four cohorts.

Discussion

Comparison with other methods for IBD detection

Design of the algorithms affects the performance of the methods for IBD inference. SLRP is a model-based probabilistic approach for simultaneous IBD detection and phasing. It can simultaneously handle genotyping errors and phase uncertainty, yet this comes at the price of high computational demand. The loopy belief propagation that SLRP uses for inference may not find the optimal solution and is not guaranteed converge. ANCHAP does not explicitly model genotyping errors; phasing and IBD detection are separate steps, yet our tests detects recent IBD similarly well. When a genotyping error gives rise to a pair of opposing homozygotes in a region of IBD sharing, the region of sharing detected by ANCHAP may be shorter or missed altogether. However, in ORCADES we encountered on average only 1 opposing homozygote per 10 000 markers in genotypes of parent–offspring pairs; hence, genotyping errors will not prevent most of the sharing regions from being detected. FastIBD is a method for IBD detection designed for general, not necessarily isolated populations.13 It builds a model of haplotypes, which can capture only short-range allele correlations. This deficiency is then ameliorated by sampling multiple haplotypes for each individual, and checking overlap of such samples between pairs of individuals. FastIBD turned out to be more sensitive than ANCHAP or SLRP, but also returned more false discoveries. A possible explanation for why fastIBD yields more false discoveries is that haplotype resampling of short blocks may occasionally yield matches between individuals by chance.

It seems unlikely that the differences in sensitivity and false discovery rate between the methods would seriously affect the uses of detected IBD region in mapping complex traits or IBD-based imputations. Sensitivity approaching 100% would be desirable for downstream applications,8 but none of the methods achieves sensitivity of IBD detection of more than 81% for the ORCADES data. In case of ANCHAP, this probably results from inability to handle the incorrect assignments in Stages I and II, which trigger phasing errors, and IBD is no longer detected in Stage III. Ability to recover from sporadic phasing errors would certainly improve the sensitivity of IBD detection. For SLRP, incomplete IBD detection could be because of limitations of the inference algorithm, conservative approach to declaring IBD or low tolerance to inconsistencies between the IBD sharing relationship and, possibly, noisy data. In case of fastIBD, if for a pair of individuals sharing IBD there are few haplotypes that would explain the genotypes, the program may not sample the matching pair. In accordance with this observation, the highest sensitivity we could achieve was by runs of fastIBD with scale 4.0 repeated 10 times. Together with the sensitivity going up to 89%, the false discovery rate also grew to 7%.

The comparison of methods as well as parameter tuning are based on the presence of parent–offspring trios among the ORCADES samples. The borders of reference sharing regions as determined by the parent–offspring phasing are only as accurate as the SNP density allows. The reference regions may still have false endpoints, because a recombination may not be detectable from SNP alleles. However, the endpoints should not affect the results of the comparison, as they will be small compared with the regions themselves; long matching haplotypes that do not descent from a common ancestor are unlikely. In absence of the parent–offspring trios, several types of inconsistencies between the genotype data and IBD relationships recovered could indicate errors. For example, the multipoint genotypes of haplotype sharers, which are recognized to carry one of the proband’s haplotypes, cannot have opposing homozygotes with respect to not just the proband but also each other.

Genetic maps and peaks of IBD

With a genetic map, we not only estimate minimal physical sizes of regions that could be IBD, but also try to account for extensive LD that existed in genotypes of isolate founders to avoid false detections of IBD. If haplotypes are very similar to each other, and we observe only unphased SNP genotypes, our method may declare IBD incorrectly even when two individuals do not share a recent common ancestor and their full sequences are not identical. To account for such regions of extended LD between isolate founders, inference of IBD requires longer sequences of SNPs without opposing homozygotes. ANCHAP's threshold for the length of segments identical by state required to declare THEM identical by descent, is expressed in centimorgans, referring to the HapMap genetic map. This map is based on modeling haplotype structure in outbred populations.22, 23 When the true recombination rate is low, or when there has been recent drift or selection, this will be reflected in the ‘recombination map’, and using it will correct for extended LD. Such a correction is evident in the SOCCS data, where using the HapMap markedly reduces the size of the peaks for apparent IBD sharing on chromosomes 6 and 11 (Supplementary Figure 4).

The remaining peaks of IBD sharing observed in Figure 2 could result from either increased sensitivity to IBD sharing, increased IBD false discovery rate, or selection pressure. A possible explanation for these peaks may be that the HapMap genetic map too does not appropriately adjust for extended LD in the European and possibly the Scottish populations. Excess sharing in SOCCS, a cohort composed mainly of unrelated individuals from across Scotland, concentrates around the two peaks on chromosome 2 and 6 (exact locations in Supplementary Materials). Furthermore, these peaks also occur in ORCADES and partially in the Croatian cohorts. The peaks are less visible in CROATIA-KORCULA, where a different SNP array was used. In Supplementary Figure 5c, we show there is no marked variation of marker density with respect to physical or genetic maps at the peak regions. Alternatively, if the peaks are not the result of an inappropriate adjustment for the background LD, they could indicate a drift or selection in the Scottish populations and the isolates.24 A list of genes present within these peaks, including the HLA genes, is presented in Supplementary Materials.

Resequencing optimization

The inferred shared haplotypes in an isolated population can be exploited to increase the efficiency of a sequencing study given a fixed budget. One possible strategy is to identify an optimal subset of individuals for resequencing at high coverage so as to obtain an accurate sequence data, then to impute these sequences into the other cohort members with whom they share the IBD. For the selection of individuals, our algorithm favors individuals who share the largest identical by descent regions with individuals who were not chosen for resequencing. The imputation of sequences of not chosen individuals could be most effective when for each haplotype throughout the genome there were few sequenced sharers. We have examined the strategy based on resequencing an optimal 20% of individuals from ORCADES, which would reduce the cost fivefold, with 65% of the unsequenced diploid genomes sharing with the sequenced individuals. Had we chosen the individuals randomly or based on kinship coefficients, the IBD coverage of unsequenced haplotypes would have been 61% or 62%, respectively (details of comparison in Supplementary Materials). A recent study in another population isolate also confirmed the merit of IBD-based optimization procedures.25

A number of factors determine the accuracy of imputations based on IBD. First, only correctly detected IBD would result in correct imputations. Second, a variant could only be correctly imputed if it is older than the most common ancestor from whom the haplotype was co-inherited. As long-range methods detect IBD from recent common ancestors, they should allow for more accurate imputation of more recent variants. Finally, the ease of sequence imputations with the IBD sharing information depends on whether the sequences can be phased, and whether it is possible to overlay the array and sequence haplotypes. There is work underway both on algorithmic and laboratory methods for phasing of sequences.8 If the sequences are not phased, the IBD sharing can still be exploited for imputation, but this would result in more uncertainty about imputed alleles.