There is an ongoing discussion in the literature regarding the value of phase data for detecting association to disease genes.1 For some analyses,2, 3 haplotype information is the key for more powerful studies. Unfortunately, experimental methods for haplotype determination in unrelated individuals often require manipulation of separated single chromosomes or of large fragments thereof,4 a technically complicated and cost-prohibitive procedure. Meanwhile, current technologies offer an array of practical molecular techniques for determination of genotype data, without assignment of distinct alleles to the chromosomes that carry them.5 As a consequence, the study design of choice is to infer haplotype architecture from genotype data, rather than measure it directly. Haplotype resolution, whether in genetic disease studies or in preceding efforts to map haplotypes, is currently approached by statistical or computational methods,6, 7, 8, 9, 10, 11, 12 in a process called phasing or haplotyping. This strategy uses computation to make up for less informative, yet more practical and cost-effective experimentation (see Figure 1).

Figure 1

Levels of genetic information. The spectrum of levels of genetic information of an individual, depicted along a 6-SNP region: full haplotype sequences provide full information (including previously unknown SNPs); SNP haplotypes separate the alleles at known SNP sites into the two chromosomes; genotypes provide only the confluence of the allele identities at each SNP site, and xor-genotypes (introduced here) only tell which sites are heterozygous. Arrows indicate that haplotypes are computationally inferred from genotypes, and can also be computationally inferred from xor-genotypes by a process called xor-haplotyping (as shown in this work). Information decreases from full haplotype sequencing to xor-genotype determination (top). Costs diminish from full haplotypes to genotypes, and may potentially decrease further if only xor-genotypes are measured (bottom).

In this work, we show how additional reduction in the complexity of experimentation can be achieved with only minor effect on the informativeness of measurements: we explore the conditions under which a still less detailed and so far unexploited type of data may become attractive owing to its potentially reduced costs. This level of information, here called the xor-genotype, differentiates only between homozygous and heterozygous states at biallelic loci such as single nucleotide polymorphisms (SNPs), and provides no data as to the allele identity of homozygotes. Several techniques are already available for xor-genotype determination. One example is DHPLC,13 which can determine whether an individual is homozygous or heterozygous for each SNP, but without additional information cannot readily distinguish between the two alternative homozygous states. (The name xor-genotype originates from the logical ‘exclusive-or’, or xor, operation, which, like our typing, can only tell whether a pair of binary elements, the alleles, are identical or not.) Moreover, since xor-genotypes contain less information than genotypes, development of cost-competitive methods for xor-genotyping may be possible.
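To make the difference between the two levels of information concrete, the following minimal sketch derives both the genotype and the xor-genotype from a pair of haplotypes. It assumes biallelic SNPs coded as 0/1; the function names and the example haplotypes are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def genotype(h1: Sequence[int], h2: Sequence[int]) -> List[Tuple[int, ...]]:
    """Unordered allele pair at each site (phase is lost, allele identity is kept)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    return [tuple(sorted((a, b))) for a, b in zip(h1, h2)]


def xor_genotype(h1: Sequence[int], h2: Sequence[int]) -> List[int]:
    """1 where the two alleles differ (heterozygous), 0 where they agree (homozygous)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    return [a ^ b for a, b in zip(h1, h2)]


# Two hypothetical individuals over a 6-SNP region (alleles coded 0/1).
# They differ only at site index 2, where A is homozygous 1/1 and B is homozygous 0/0.
a1, a2 = [0, 1, 1, 0, 0, 1], [0, 0, 1, 1, 0, 1]
b1, b2 = [0, 1, 0, 0, 0, 1], [0, 0, 0, 1, 0, 1]

xor_a = xor_genotype(a1, a2)
xor_b = xor_genotype(b1, b2)
gen_a = genotype(a1, a2)
gen_b = genotype(b1, b2)
```

In this example the two individuals share the same xor-genotype but have different genotypes, which is exactly the information that xor-genotyping gives up.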

We recently demonstrated14 that computation can, under defined circumstances and with minimal genotype data, resolve xor-genotypes into individual genotypes or haplotypes (see Figure 1). Therein we described algorithms for reconstruction of the complete haplotype information from the partial, xor-genotype data, a computation we termed Xor Perfect Phylogeny Haplotyping (XPPH). We show here that computation can render xor-genotypes nearly as informative as full genotypes, at possibly a fraction of the cost, and with only a minor increase in the number of individuals that need to be examined. We base our theoretical analysis on the observation that large blocks of the genome have not undergone any significant recombination4, 15, 16 and therefore have essentially evolved according to the no-recombination infinite-site coalescence model.17 This is powered by the forces of selection and demography, which are likely to have reduced diversity within the human genome, leading to a lack of directly observable recombination events or branches of the genealogy tree. This allows use of the perfect phylogeny assumption8 for haplotype structure, which guarantees that the haplotypes can fit into a tree that describes their genealogical ancestry. In such a tree, the path between two haplotypes is labeled by the SNPs at which they differ, that is, the sites at which an individual carrying both haplotypes is heterozygous (see Figure 2b). This tree can be inferred from data while creating a catalog of variation (a haplotype map), and can be utilized in subsequent genetic studies.
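For binary SNP data, the perfect phylogeny assumption is equivalent to the four-gamete condition: no pair of sites shows all four allele combinations among the haplotypes. The sketch below checks this condition on a set of already known haplotypes. It is only an illustration of the condition itself and not the XPPH computation, which works from xor-genotypes; the function names and example data are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def violates_four_gamete(haplotypes: Sequence[Sequence[int]], i: int, j: int) -> bool:
    """True if sites i and j show all four gametes 00, 01, 10 and 11."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4


def admits_perfect_phylogeny(haplotypes: Sequence[Sequence[int]]) -> bool:
    """True if no pair of sites violates the four-gamete condition."""
    if not haplotypes:
        return True
    n_sites = len(haplotypes[0])
    return not any(
        violates_four_gamete(haplotypes, i, j)
        for i, j in combinations(range(n_sites), 2)
    )


# Hypothetical 3-SNP block: these four haplotypes fit a tree.
tree_like = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
# Adding (0, 1, 0) creates all four gametes at sites 0 and 1.
with_conflict = tree_like + [(0, 1, 0)]

ok = admits_perfect_phylogeny(tree_like)
conflict = admits_perfect_phylogeny(with_conflict)
```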

Figure 2

Haplotype resolution from xor-genotypes. Schematic example of the proposed paradigm for haplotype resolution from xor-genotype data, during the creation of a haplotype map (a and b) and its application in genetic disease studies (c and d). (a) The xor-genotypes (lists of heterozygous sites) of a five-individual sample are measured. XPPH computation reveals the perfect phylogeny structure that gave rise to the (yet unresolved) sample haplotypes (circles) and their evolutionary intermediates (ovals). Each xor-genotype corresponds to a path in the perfect phylogeny tree. (b) In order to resolve haplotype sequences the genotypes of two individuals are measured. Their homozygous alleles resolve all unknown haplotypes. (c) Xor-genotypes of the disease cohort are measured, and heterozygous samples are readily translated to haplotypes using the map (perfect phylogeny) information. (d) Completely homozygous individuals need to be genotyped.

Figure 2 illustrates the conceptual process of haplotype resolution from xor-genotype data, including alternating stages of data collection and computation in each 4-gamete block.

i) Collect DNAs for a haplotype mapping cohort and determine the xor-genotype of each individual in this sample. Apply XPPH computation to resolve the structure of the perfect phylogeny tree.

ii) Identify at most three sample individuals that need to be genotyped (see14 for details). Genotype the selected individuals. Embedding these genotypes in the tree structure allows their homozygous alleles to resolve all haplotypes for the corresponding SNPs based on the perfect phylogeny structure. This gives rise to a structured haplotype map of the block at hand.

iii) Collect xor-genotype data for individuals in a disease study. Resolve the haplotypes of heterozygous individuals by the perfect phylogeny.

iv) Genotype individuals who are completely homozygous in the current block.
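Steps iii and iv can be illustrated with a minimal sketch, assuming the haplotype map of the block is already available as a list of distinct haplotypes that fit a perfect phylogeny. The sketch searches all pairs of catalog haplotypes by brute force rather than walking the tree, so it is a stand-in for illustration and not the authors' implementation; `resolve_xor_genotype` and `catalog` are hypothetical names.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple

Haplotype = Tuple[int, ...]


def resolve_xor_genotype(
    xor_gt: Sequence[int], catalog: Sequence[Haplotype]
) -> Optional[Tuple[Haplotype, Haplotype]]:
    """Find the pair of catalog haplotypes that differ exactly at the heterozygous sites.

    Returns None for a completely homozygous individual, who carries two copies
    of some haplotype that the xor-genotype cannot identify (step iv).
    """
    target = tuple(xor_gt)
    if not any(target):
        return None  # no heterozygous site: regular genotyping is needed
    matches = [
        (h1, h2)
        for h1, h2 in combinations(catalog, 2)
        if tuple(a ^ b for a, b in zip(h1, h2)) == target
    ]
    if len(matches) != 1:
        # Zero matches: the catalog is incomplete. Several matches: the catalog
        # does not satisfy the perfect phylogeny assumption.
        raise ValueError(f"expected exactly one matching pair, found {len(matches)}")
    return matches[0]


# Hypothetical haplotype map of a 3-SNP block.
catalog = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 0, 1)]

heterozygous = resolve_xor_genotype([0, 1, 0], catalog)
homozygous = resolve_xor_genotype([0, 0, 0], catalog)
```

The first individual, heterozygous only at the second SNP, resolves to the pair `(1, 0, 0)` and `(1, 1, 0)`; the second is completely homozygous and is flagged for regular genotyping.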

Our implementation of the XPPH computation (step i) is available for public use at http://www.cs.tau.ac.il/~rshamir/greal/. We note that our method can detect 4-gamete blocks from xor-genotypes by sliding window analysis, and in principle can be modified to produce a chain of block-wise haplotypes along the genome, as done for genotyping data (eg Eskin et al18).

The economic advantage of our approach over regular genotyping will depend on the relative cost of regular vs xor-genotyping. Suppose the cost of regular genotyping of one SNP in an individual is 1 unit, and the cost of its xor-genotyping is β. If we type t SNPs in N individuals, regular genotyping will cost Nt. If we xor-genotype them instead, we pay βNt+3t (steps i and ii), but the overall cost also has to take into account the chances of XPPH failure (step iii) and the homozygosity rate (step iv). Chances of failure are <10% for N>50 (see Figure 3), and decrease as N grows; and a conservative estimate for the homozygosity rate of a complete haplotype is <30% (as expected with 4–6 common haplotypes).16 In total, the expected cost of the whole strategy is at most 0.9(βNt+0.3Nt+3t)+0.1(βNt+Nt). Hence, whenever β is below roughly 0.6 (more precisely, below 0.63−2.7/N, irrespective of t), the cost of the xor-genotyping strategy will be preferable to the cost of the standard approach. Note that this estimate is conservative, as it does not exploit allele correlation that crosses block boundaries. Such correlation may resolve homozygous haplotypes by information from adjacent blocks at no additional cost.
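The cost bound above can be evaluated numerically. The following sketch encodes the expression 0.9(βNt+0.3Nt+3t)+0.1(βNt+Nt) with the failure rate, homozygosity rate and number of genotyped individuals as parameters defaulting to the values quoted in the text, and solves for the break-even β. The function and parameter names are hypothetical.

```python
def expected_cost_xor(
    beta: float,
    n: int,
    t: int,
    p_fail: float = 0.1,        # upper bound on XPPH failure rate (N > 50)
    homozygosity: float = 0.3,  # upper bound on fully homozygous individuals
    n_genotyped: int = 3,       # individuals genotyped in step ii
) -> float:
    """Upper bound on the expected cost of the xor-genotyping strategy."""
    on_success = beta * n * t + homozygosity * n * t + n_genotyped * t
    on_failure = beta * n * t + n * t  # fall back to regular genotyping of everyone
    return (1 - p_fail) * on_success + p_fail * on_failure


def break_even_beta(
    n: int, p_fail: float = 0.1, homozygosity: float = 0.3, n_genotyped: int = 3
) -> float:
    """Value of beta at which the bound equals the regular cost n*t (t cancels out)."""
    return 1 - p_fail - (1 - p_fail) * (homozygosity + n_genotyped / n)


# Example: 100 individuals, 50 SNPs, xor-genotyping at half the regular cost.
regular_cost = 100 * 50
xor_cost = expected_cost_xor(beta=0.5, n=100, t=50)
threshold = break_even_beta(n=100)
```

With the default values the break-even point is 0.63−2.7/N, which gives about 0.58 for N=50 and about 0.60 for N=100, and approaches 0.63 as the cohort grows.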

Figure 3

The power of XPPH. The rate of successful haplotype resolution (y-axis) vs the number of individuals sampled (x-axis). Each point (pink) represents the average success rate of XPPH computation on 5000 simulated samples of 50 common SNP haplotypes (minor allele frequency >0.05) for a cohort of the prescribed size, simulated by an infinite-site, no-recombination model21 on a typical, 25 kb-long haplotype block of 50 SNPs. Statistics of haplotype recovery from regular genotypes, similarly obtained in reference 22, are plotted for comparison (blue). For a cohort of size 60, XPPH achieves a 90% chance of a complete successful solution. Similar results were also obtained in simulations that assume exponential population expansion (data not shown).

Large-scale efforts like the HAPMAP projects19, 20 are genotyping enough individuals to detect the regions where the perfect phylogeny description is most accurate and to describe the tree in these regions. If the full perfect phylogeny tree of a region is already in hand, the resolution of a single xor-genotype into haplotypes is unique (due to the 4-gamete rule) and easily obtained. Therefore, the xor-haplotyping approach offers further potential advantages for future studies that exploit the high-resolution haplotype maps that will soon be available.

In summary, we introduced here a novel type of genetic data called xor-genotypes, and described computational tools to resolve haplotypes based primarily on xor-genotype data. We argued for the potential economic advantage of xor-genotypes over the full genotypes common today. Using simulated genetic data in blocks, we showed that xor-genotypes are nearly as informative as full genotypes, potentially at a fraction of the cost, both for haplotype mapping and for genetic disease studies, and determined the boundaries within which these conditions hold. Additionally, our computational work14 revealed that xor-genotypes and their perfect phylogeny provide important insights even if full genotypes are obtained and, for instance, allow selection of tag SNPs without phased data. Hence, genotyping methods that distinguish only between heterozygotes and homozygotes, together with the appropriate computational solutions, may offer a cost-effective alternative to genotyping in genetic studies, and may play a role in whole genome mapping approaches.