Box 2 | Genotype data and haplotype phase

From the following article:

Linkage disequilibrium — understanding the evolutionary past and mapping the medical future

Montgomery Slatkin

Nature Reviews Genetics 9, 477-485 (June 2008)


When the genotype of a diploid individual is determined, the result is a list of genotypes, one for each locus surveyed. If three diallelic loci are surveyed, the genotypes of four individuals might be AA bb CC, Aa BB cc, aa Bb Cc and Aa Bb Cc. The haplotypes of the first two individuals are immediately apparent. Individual 1 has two copies of AbC and individual 2 has ABc and aBc. There is no uncertainty if no more than one locus is heterozygous. Otherwise, haplotypes cannot be determined without further information. Individual 3 could have haplotypes aBC/abc or aBc/abC. The number of possible resolutions increases exponentially with the number of heterozygous loci: with k heterozygous loci there are 2^(k-1) possible haplotype pairs. Individual 4 could have haplotypes ABC/abc, ABc/abC, aBc/AbC or aBC/Abc.
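
The enumeration above can be sketched in a few lines of Python. This is only an illustration: the function name phase_resolutions and the representation of a genotype as a list of two-letter strings (one per locus) are assumptions made for the example, not part of the original article.

```python
from itertools import product


def phase_resolutions(genotype):
    """Return every unordered haplotype pair consistent with a genotype.

    genotype is a list of two-character strings, one per locus,
    for example ["Aa", "Bb", "Cc"] (a representation assumed here).
    """
    resolutions = set()
    # At each locus the two alleles can be assigned to the two
    # chromosomes in either order; try every combination of orders.
    for orientation in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(g[o] for g, o in zip(genotype, orientation))
        hap2 = "".join(g[1 - o] for g, o in zip(genotype, orientation))
        # Sort the pair so that X/Y and Y/X count as the same resolution.
        resolutions.add(tuple(sorted((hap1, hap2))))
    return sorted(resolutions)


if __name__ == "__main__":
    individuals = [
        ["AA", "bb", "CC"],
        ["Aa", "BB", "cc"],
        ["aa", "Bb", "Cc"],
        ["Aa", "Bb", "Cc"],
    ]
    for genotype in individuals:
        pairs = phase_resolutions(genotype)
        print(" ".join(genotype), "->", len(pairs), "resolution(s):", pairs)
```

For the four individuals in the text this gives 1, 1, 2 and 4 possible haplotype pairs, respectively. Brute-force enumeration is adequate for a handful of loci, but it visits 2^n orientations for n loci and so is not practical for long genotypes.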

There are several ways to determine haplotypes from genotypes; this is commonly referred to as resolving haplotype phase. If the parental genotypes are known, the haplotype phase of the offspring can usually, but not always, be determined. If the parents of individual 3 have genotypes Aa BB Cc and Aa Bb cc, then the individual's haplotype phase has to be aBC/abc. However, if instead the parents' genotypes are Aa Bb Cc and Aa Bb cc, then the haplotype phase still cannot be resolved.
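
The parental reasoning can be checked mechanically. The sketch below keeps only those resolutions of the offspring genotype in which one haplotype could have been transmitted by each parent. It assumes that parental phase is unknown, so any combination of one allele per locus from a parent counts as a possible gamete. The function names and the list-of-strings genotype format are hypothetical, chosen for the example.

```python
from itertools import product


def phase_resolutions(genotype):
    """All unordered haplotype pairs consistent with a genotype."""
    resolutions = set()
    for orientation in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(g[o] for g, o in zip(genotype, orientation))
        hap2 = "".join(g[1 - o] for g, o in zip(genotype, orientation))
        resolutions.add(tuple(sorted((hap1, hap2))))
    return sorted(resolutions)


def can_transmit(parent, haplotype):
    """True if the parent carries the required allele at every locus."""
    return all(allele in locus for locus, allele in zip(parent, haplotype))


def phase_given_parents(child, parent1, parent2):
    """Resolutions of the child's genotype compatible with both parents."""
    consistent = []
    for hap1, hap2 in phase_resolutions(child):
        # One haplotype must come from each parent, in either assignment.
        if (can_transmit(parent1, hap1) and can_transmit(parent2, hap2)) or (
            can_transmit(parent1, hap2) and can_transmit(parent2, hap1)
        ):
            consistent.append((hap1, hap2))
    return consistent


if __name__ == "__main__":
    child = ["aa", "Bb", "Cc"]  # individual 3 in the text
    print(phase_given_parents(child, ["Aa", "BB", "Cc"], ["Aa", "Bb", "cc"]))
    print(phase_given_parents(child, ["Aa", "Bb", "Cc"], ["Aa", "Bb", "cc"]))
```

With the first pair of parents only aBC/abc survives; with the second pair both aBC/abc and aBc/abC remain possible, which matches the two cases described in the text.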

Another way to resolve haplotype phase is to use a biochemical method that separately amplifies each chromosome, allowing direct determination of haplotype phase [121]. Although such methods exist, they are currently too slow and costly to be used in large genomic surveys.

It is much more common to use a statistical method based on the assumption that haplotypes are randomly joined into genotypes. The basic idea is that individuals that are homozygous at all loci, or at all but one locus, provide information about haplotype frequencies that can then be used to infer the haplotype phase of the other individuals. Various methods have been developed, including those based on maximum likelihood [122], parsimony [123], combinatorial theory [124] and a prior distribution derived from coalescent theory [125]. The last of these is the basis for the program PHASE, which has performed best in extensive simulation studies [126]. The emerging view is that inferring haplotype phase is similar to other problems in which missing data (in this case, the haplotype phase of a diploid genotype) have to be imputed [127].
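
To make the statistical idea concrete, here is a minimal sketch of one standard way to obtain maximum-likelihood haplotype frequencies under random joining: an expectation-maximization (EM) iteration. It is a generic illustration, not the PHASE program or the specific algorithm of any cited reference. The function names, the fixed iteration count and the toy data set are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def phase_resolutions(genotype):
    """All unordered haplotype pairs consistent with a genotype."""
    resolutions = set()
    for orientation in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(g[o] for g, o in zip(genotype, orientation))
        hap2 = "".join(g[1 - o] for g, o in zip(genotype, orientation))
        resolutions.add(tuple(sorted((hap1, hap2))))
    return sorted(resolutions)


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies assuming haplotypes pair at random."""
    candidates = [phase_resolutions(g) for g in genotypes]
    haplotypes = sorted({h for pairs in candidates for pair in pairs for h in pair})
    # Start from equal frequencies for every haplotype that could occur.
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in candidates:
            # E-step: weight each resolution by the product of its haplotype
            # frequencies. The factor of 2 for heterozygous pairs is shared
            # by all resolutions of one genotype, so it cancels.
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by chromosomes sampled.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haplotypes}
    return freq


def most_likely_phase(genotype, freq):
    """Resolution with the highest probability under the fitted frequencies."""
    return max(
        phase_resolutions(genotype),
        key=lambda pair: freq.get(pair[0], 0.0) * freq.get(pair[1], 0.0),
    )


if __name__ == "__main__":
    # Toy sample (invented): six unambiguous homozygotes and one double
    # heterozygote whose phase is unknown.
    sample = [["AA", "BB"]] * 3 + [["aa", "bb"]] * 3 + [["Aa", "Bb"]]
    freq = em_haplotype_frequencies(sample)
    print(freq)
    print(most_likely_phase(["Aa", "Bb"], freq))
```

In this toy sample the unambiguous individuals carry only AB and ab, so the iteration shifts nearly all weight for the double heterozygote onto AB/ab. Real data sets call for a convergence check, multiple starting points and handling of missing genotypes, all of which are omitted here.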