Abstract
DNA sequence variants in specific genes or regions of the human genome are responsible for a variety of phenotypes such as disease risk or variable drug response1. These variants can be investigated directly, or through their non-random associations with neighbouring markers (called linkage disequilibrium (LD))2, 3, 4, 5, 6, 7, 8. Here we report measurement of LD along the complete sequence of human chromosome 22. Duplicate genotyping and analysis of 1,504 markers in Centre d'Etude du Polymorphisme Humain (CEPH) reference families at a median spacing of 15 kilobases (kb) reveals a highly variable pattern of LD along the chromosome, in which extensive regions of nearly complete LD up to 804 kb in length are interspersed with regions of little or no detectable LD. The LD patterns are replicated in a panel of unrelated UK Caucasians. There is a strong correlation between high LD and low recombination frequency in the extant genetic map, suggesting that historical and contemporary recombination rates are similar. This study demonstrates the feasibility of developing genome-wide maps of LD.
Present-day chromosomes are mosaics of ancestral chromosomes that have
arisen through multiple recombination events in the past. Each copy of a
chromosome within a population can be uniquely characterized on the basis of a
specific pattern of sequence variants, which together comprise an individual
'haplotype'. When these haplotypes do not occur in the population
at the frequencies expected from the component variants, the variants are said
to be associated or in linkage disequilibrium (LD). Characterizing the
empirical patterns of LD across the genome will help to reconstruct the genetic
history of human populations, enhance our understanding of the biological
processes of recombination and natural selection, and facilitate association
mapping studies that seek to localize genetic variants influencing complex
traits and diseases. Previous studies of LD in humans have shown a high degree
of variability, indicating that LD may extend between a few and several hundred
kilobases1, 2, 3, 4, 5. For practical applications, however, the
local patterns of haplotype conservation are of primary interest6, 7, 8, as the variability in LD overwhelms the average level.
Therefore we characterized 59 independent haplotypes of human chromosome 22,
derived from 77 members of three-generation pedigrees of the CEPH reference
data set9. Using family samples helped to construct long
haplotypes and detect genotyping error. All markers were selected from publicly
available single-nucleotide polymorphisms (SNPs) and small insertions/deletions
(indels)10, 11, 12 at regularly spaced intervals along the
chromosome. A total of 951 of the 1,504 (63%) polymorphisms genotyped in the
CEPH panels were common (minor allele frequency of
0.2; see Methods), which
is similar to the proportion of common variants expected from comparing two
chromosomes drawn from a constant size neutral population (60%). LD between
pairs of markers was calculated using the measure D', following its usage
in previous empirical studies, and the r2 measure, which
is preferred by population geneticists13.
LD decays with increasing distance, but also shows extensive variability
(Fig. 1a, b). Maximal D' values extend up to
distances over 400 kb, contrasting with occurrences of no detectable LD
(D' < 0.20) between markers less than 5 kb apart. The
distribution of r2 values shows a similar degree of
variability, differing from D' mainly in measurement scale (Spearman rank
correlation
(r2, D') = 0.95).
Similar results were obtained in analysis of two separate populations of
unrelated individuals, also of Caucasian origin, one from the UK (90
individuals) and one from Estonia (51 individuals; Fig. 1c,
d). In the combined CEPH and UK data sets, average D' declines
from 0.70 for adjacent markers to 0.11 for unlinked markers, whereas average
r2 declines from 0.38 to 0.01. Although the two measures
differ in scale, their decay profiles are similar.
Figure 1: Distribution of linkage disequilibrium on chromosome 22.*

a, b, Variability of D' (a) and
r2 (b) for the CEPH samples, using all pairwise
values for markers with minor allele frequency > 0.20 separated by
1 Mb. c, d, Sliding window results of average D'
and r2 in successive bins of 200 markers (100 marker
overlap). Insets provide enhanced views of the observed LD decay from
0–300 kb. CEPH, red; unrelated UK, green; combined CEPH/UK, dark
blue; Estonia, violet
We assessed the pattern of LD along the chromosome by calculating average D' and r2 for markers within contiguous 1.7-Mb stretches of DNA (sliding windows). In addition, statistical models of LD decay were fitted to summarize the patterns and to account for marker density. The results highlight areas with very high levels of LD, notably at positions 11–16 Mb and 21–27 Mb of the reference sequence (Fig. 2a, b). The degree to which these exceed background LD levels was confirmed by significance testing using extreme value distribution theory applied to 'runs' of LD (see Methods). Figure 2d shows various regions where LD is slightly above background levels (light bands), as well as shorter runs of extremely high disequilibrium relative to the rest of the chromosome (dark bands). These regional differences are not due to unequal marker spacing, as they align well with the model-fitting results (Fig. 2b) that account for such variability. The estimates were consistent between the CEPH and UK samples (Fig. 2b), indicating the usefulness of both family-based and unrelated samples for initial detection of long LD runs. The general patterns of LD also appear similar in the Estonian samples (Fig. 1a), but the marker density in the smaller Estonian data set (median 34.72 kb) is too coarse for formal delineation of specific regions. In the CEPH and UK samples, average D' levels in the regions of high LD are 2–5 times greater than the background levels, presenting obvious distinctions of high and low LD tracts, whereas in the Estonian data, the highest LD region is less than twice the background level. Extrapolating these results to the genomic scale suggests that a median marker density greater than one marker per 35 kb is required for any first-generation map, and that, in the absence of family data, LD estimates in 100 chromosomes or fewer may be too variable for reproducible patterns at a broad scale.
Figure 2: Linkage disequilibrium across chromosome 22.

a, Average D' and r2
coefficients (top and bottom groups, respectively) plotted in sliding windows
containing all common polymorphisms separated by 50 and 500 kb in
successive 1.7-Mb segments (1.6-Mb overlap). Sequence position 1 refers to the
centromeric q-arm origin
(ftp://ftp.sanger.ac.uk/pub/human/chr22/sequences/Chr_22/complete_sequence/Chr_22_19-05-2000.fa).
The colour scheme is as in Fig. 1. b, Expected half length estimated
from application of the model
E(r2) = 1/(1
+ 4Nc) to each sliding window. The model could not be
fitted to the Estonian data because of the sparser marker density. c,
Relationship between genetic and physical distance on chromosome 22 (refs 15,
16). d, Significant regions of excess LD (see Methods). The longest runs
of LD (using f = 0.50
D') are shown
in light grey; shorter runs of high LD are shown in dark grey (f
= 1.00
D') and black (f =
1.50
D').
There is considerable evidence that sites of recombination in humans are not randomly distributed, but are often localized into specific hotspots7, 14. Current, low-resolution genetic maps can be used to model local recombination rates and provide additional predictors of LD beyond physical distance. Chromosome 22 has an elevated degree of recombination15, 16, 17, averaging 2.46 cM Mb-1 in our interpolated sex-averaged genetic map (see Supplementary Information), compared with the genome average of approximately 1.3 cM Mb-1. All components of the high LD tracts at 11–16 Mb and 21–27 Mb are situated in regions of exceptionally low recombination (<1 cM Mb-1; Fig. 2c) relative to the chromosome average. Indeed, nearly all of the high LD runs on chromosome 22 are located in regions of low recombination (Fig. 2d), a pattern previously noted in a localized region of this chromosome18, but which differs from a previous assessment of chromosome 22 microsatellite markers19. Collectively, the most exceptionally high-LD/low-recombination tracts cover about 9% of the chromosome, and 40% of the total variation in D' along chromosome 22 can be explained by the (interpolated) genetic distance between markers. These results indicate that the extant genetic maps may be used in practice as guides to genomic regions having high or low LD, and are of immediate use in position-based association studies.
To search for other predictors of LD, we examined correlations between
pairwise LD measures and various sequence features (Table
1). LD is positively correlated with gene density and short
interspersed nuclear elements (SINEs) such as Alu repeats (which are features
of (G + C)-rich sequence), and negatively correlated with long
interspersed nuclear elements (LINEs), long terminal repeats (LTRs) and other
DNA repeats ((G + C)-poor sequence). There is also a negative correlation
between the (G + C)-rich sequence features and genetic distance. When
genetic distance is factored out, by assessing sequence correlations with
residual LD measures that are independent of genetic distance by multiple
regression or using model-based regression residuals, nearly all of the
relationships are reduced or effectively eliminated. The single exception to
this collinearity occurs with LTRs, which maintain a significant negative
correlation with LD independently of genetic distance (
= -
0.15). It is conceivable that LTRs predict recombination at a relatively fine
resolution, whereas the currently available coarse genetic map predicts
recombination at a broader resolution.
In addition to pairwise disequilibrium assessments, our data provide an opportunity to identify common conserved haplotypes along the chromosome. We conducted a systematic search for regions of limited haplotype diversity and strong disequilibrium in the combined CEPH and UK samples, in which we detected 97 'haplotype networks' (see Methods), each including three or more markers (Fig. 3), including 329 out of 787 (41.8%) common polymorphisms and covering approximately 9.1 Mb of the long arm of chromosome 22 (22q). Interestingly, the two most common haplotypes are complementary at all sites in 55 of these networks. In some regions the networks overlap, reflecting the difficulty in precisely locating ancestral crossover events and possibly other phenomena such as gene conversion.
Figure 3: Haplotype networks on chromosome 22.

The figure is oriented with the leftmost position as the centromeric q-arm origin. The top panel shows the complete set of CEPH founder haplotypes (common alleles in blue; ambiguously phased or ungenotyped regions in white). Marker locations are given in the comb diagram underneath. The middle panels show 1-Mb-wide diagonal sections of colour-coded pairwise disequilibrium matrices30 for the CEPH families (top), UK unrelated (middle) and combined samples (bottom), respectively. Each sample is followed by a comb diagram for marker location and a pictorial representation of haplotype networks. Networks of three or four markers are in blue; longer networks are in red. The bottom panels show a detailed view of common haplotypes (>5%) for the combined data set. Within each block, haplotypes are listed in descending frequency. Gene composition and integrated genetic and physical maps are indicated. Transcribed sequences are grouped according to orientation15.
To view a large pdf of this image please click here: PDF (7.5 M)
High resolution image and legend (65K)All of the common variants in any one of these networks can be surveyed with minimal genotyping effort. For example, in the CEPH families, the longest haplotype network extends 804 kb (at 11.83 Mb, including 16 markers) and a single network in the 21–27 Mb region includes 25 markers and extends 758 kb. These 25 markers could define up to 32 million chromosomes and in the absence of LD every chromosome in our sample would be unique. Instead, five haplotypes account for 76% of the CEPH founder chromosomes. Genotyping three SNP markers is sufficient to distinguish these five common haplotypes and retain 94.7% of haplotype heterozygosity. Similar long conserved haplotypes are present in the UK and combined samples.
Our ability to derive common haplotypes was significantly enhanced by the use of family-based samples (the CEPH panel). This is an important conclusion for a haplotype map project, because the relative usefulness of families will depend on empirical haplotype patterns20. Comparison of the haplotype frequency estimates from the unrelated UK compared with partially-phased CEPH chromosomes indicated that, for an average set of five consecutive markers, each CEPH founder contributed 1.91 times more information than an unrelated individual (Methods). This suggests that although genotyping extended pedigrees such as the CEPH families requires twice as much effort as that for unrelated individuals, the information for LD mapping is approximately equal. Given this similarity and the advantages of families for detection of genotyping errors, integration with meiotic maps and quality comparison across laboratories, the use of families for LD mapping is clearly preferable to unrelated individuals.
The primary motivation for construction of any LD map in the human
genome is to facilitate identification and characterization of genetic variants
for common complex diseases. The present data indicate that considerable
information is available for fine mapping of disease loci even in
first-generation maps such as our 15-kb chromosome-wide resolution. For
example, linkages to schizophrenia and other psychiatric disorders have been
reported around 22q12, near marker D22S278 (ref. 21) (at approximately 19.8 Mb); schizophrenia has
also been associated with microdeletions in 22q11, near the velo-cardiofacial
syndrome locus (
5.6 Mb). Neither of these regions shows high LD in
our data, suggesting that fine mapping may require a high density of markers.
Conversely, linkage to type 2 diabetes has been reported around marker D22S423
(ref. 22) (approximately 23.8 Mb), which is
on the edge of some of the longest tracts of high LD on the chromosome. Initial
allelic association in this region may be facilitated by this extensive
conservation.
Methods
Selection of markers, DNA samples and genotyping
Markers for genotyping were selected by walking along chromosome 22 in 15-kb steps through all available SNPs and small indels and choosing the nearest variant that was suitable for a unique polymerase chain reaction-based genotyping assay. Markers were genotyped in duplicate on 77 CEPH family DNAs and 90 unrelated UK Caucasian DNAs using the Third Wave Technologies Invader assay23, and on 51 unrelated Estonian DNAs using allele specific primer extension in microarray format24.
A final set of 1,504 markers were polymorphic in the CEPH DNA panel
and were not rejected because of mendelian segregation errors,
Hardy–Weinberg equilibrium deviations or other quality issues (the CEPH
SNP set). There are 27 gaps of greater than 100 kb in this set (maximal
gap of 293 kb), yielding a mean spacing of 22.95 kb and a median
spacing of 15.07 kb. A total of 1,262 markers from the final CEPH set
and 23 additional markers were successfully genotyped on the UK sample of
unrelated individuals, for 1,286 markers on the UK Caucasians (the UK SNP set,
median spacing of 17.53 kb, mean = 26.86 kb). We refer to
the overlapping collection of CEPH and UK SNPs as the 'combined'
marker set. The final Estonian SNP set had 908 SNPs, 661 of which had minor
allele frequencies
0.20, and a median spacing of 34.72 kb (mean
= 61.42 kb). The Estonian SNP set included 594 SNPs in common
with the initial CEPH SNP set. The final data used in these analyses is
available at http://www.sanger.ac.uk/HGP/Chr22.
Error checking
We tested all markers for Hardy–Weinberg equilibrium, ignoring family structure, and excluded from the analysis those where equilibrium was rejected at the 10-4 level. We verified familial relationships within the CEPH samples and checked that presumed unrelated UK individuals were truly unrelated using the GRR program25. In addition, we excluded all genotypes that produced mendelian errors or unlikely recombination patterns (P < 0.001) in the CEPH sample26. The duplicate genotyping resulted in a very low error rate, as indicated by mendelian segregation errors (480 out of 98,095 genotypes = 0.5%) and unlikely double recombinants (315 out of 98,095 genotypes = 0.3%) in the CEPH data set.
Haplotyping
For the CEPH pedigrees, we used MERLIN26 to list all alternate sets of non-recombinant founder haplotypes including small sets of neighbouring markers. Haplotype frequencies in families were then estimated using an expectation–maximization (E–M) algorithm27 (software available on request from G.R.A.). For unrelated individuals and the combined data set, we estimated haplotype frequencies using the E–M algorithm.
Pairwise disequilibrium and distance modelling
For pairwise comparisons we calculated D' and
r2 following standard procedures13. We also
fitted decay models to all pairwise coefficients within successive 1-Mb sliding
windows:
E(r2ij)
= 1/(1 + 4Ncij),
where N is the effective population size and
cij is the recombination fraction between markers
i and j estimated from the physical distance between markers
using the chromosome 22 average 1 Mb
2 cM. We refer to the
half-length of disequilibrium as the distance at which
E(r2ij)
= 0.5. Estimates from this model are largely independent of the
underlying marker density.
Regions of excess disequilibrium
To
define boundaries for regions of unusual disequilibrium, we used a method based
on the Smith–Waterman algorithm28. For the ith
ordered marker pair within 500 kb of each other, we define the score
Si =
D'i - k and then
identify and compare high scoring segments using the Smith–Waterman
accumulation approach and related statistical theory29. We used a
penalty
k =
'
f
D' (with scale
f = 0.5, 1.0, 1.5) to detect
increasingly extreme runs of LD along the chromosome, where
D' refers to the standard deviation of all
pairwise LD coefficients.
Haplotype networks
We defined regions of limited haplotype diversity as those in which five haplotypes accounted for 75% or more of all haplotypes observed in the population and in which disequilibrium between each marker and haplotypes of surrounding markers exceed 0.75. We searched for sets of markers (networks) that met these conditions using a single marker as a seed and adding as many neighbouring markers as possible. We used MERLIN26 to identify all non-recombinant haplotypes in a growing network and the E–M algorithm to estimate haplotype frequencies. We did not require markers in a network to be consecutive, but instead allowed up to six intervening markers to be excluded.


