Main

The finished sequence of chromosome 8 contains 145,556,489 bases and is interrupted by only four euchromatic gaps, one gap at the 8p telomere and one gap containing the centromeric heterochromatin (Fig. 1 and Supplementary Table S1). These gaps are refractory to current cloning and mapping technology. The estimated total size of the euchromatic gaps is 427 kilobases (kb), based on direct sizing of three gaps and estimation of the remaining two gaps at the genome-wide average of 100 kb each. This corresponds to 0.3% of the euchromatic length of the chromosome, similar to the genome average1,7,8,9,10,11. In all, 182.3 megabases (Mb) of finished sequence were generated by the Broad Institute of MIT and Harvard (formerly Whitehead Institute/MIT Center for Genome Research (WICGR)), 27.9 Mb by Keio University School of Medicine, 8.4 Mb by the Institute of Molecular Biotechnology in Jena, and 5.8 Mb by 10 other groups (Supplementary Tables S2 and S3). These sequences (which include overlap) were combined to yield the finished path (see Methods).

Figure 1: Overview of human chromosome 8.

Features are described from top to bottom. In the cartoon, blue shading indicates gene deserts (≥ 500 kb with no transcript, Supplementary Table S7); telomeres (pTEL and qTEL), the centromere (CEN) and euchromatic sequence gaps (red lines) are indicated. The following features are represented in discrete windows of 100 kb: G + C content (on a scale from 30–70%); densities of LINEs (long interspersed nucleotide elements; red) and SINEs (short interspersed nucleotide elements; blue); and densities of transcripts (all are counts of elements). The box at the bottom shows blocks of conserved synteny (100-kb resolution) with dog, mouse and rat as determined for this work. Chromosomes are numbered, and are coloured arbitrarily for ease of distinction.

We assessed the local accuracy of the clone path by aligning paired-end sequences from a human Fosmid library (WIBR2, representing ×10 physical coverage) to the finished sequence7. Errors in the clone path were detected by identifying discrepancies between the predicted and observed distances between Fosmid ends7. This revealed two deleted clones, which were replaced. Finally, an independent quality assessment exercise commissioned by NHGRI estimated the accuracy of the finished sequence at less than 1 error in 100,000 bases12 (J. Schmutz, personal communication).
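The end-pair check described above can be illustrated with a minimal sketch. The expected insert size (40 kb), the tolerance (10 kb), the function name and the tuple layout are all illustrative assumptions and are not taken from this work; the actual thresholds used are described in ref. 7.

```python
from typing import List, Tuple

# One placement per Fosmid: (fosmid_id, left_end_position, right_end_position),
# both positions given in finished-sequence coordinates (hypothetical layout).
Placement = Tuple[str, int, int]

def flag_discrepant_fosmids(
    placements: List[Placement],
    expected: int = 40_000,    # assumed typical Fosmid insert size
    tolerance: int = 10_000,   # assumed allowed deviation from that size
) -> List[Tuple[str, int, str]]:
    """Return (fosmid_id, observed_span, kind) for spans outside the tolerance."""
    flagged = []
    for fosmid_id, left, right in placements:
        span = right - left
        deviation = span - expected
        if abs(deviation) > tolerance:
            # "compressed": ends map closer together than the insert size allows,
            # the pattern expected where sequence is missing from the clone path.
            kind = "stretched" if deviation > 0 else "compressed"
            flagged.append((fosmid_id, span, kind))
    return flagged
```

In practice a single discrepant Fosmid would not be taken as evidence of an error; one would look for several independent Fosmids that agree on the same interval.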

Several analyses support the idea that nearly the entire euchromatic region of chromosome 8 is present and accurately represented. From the well-curated RefSeq13 data set 681 transcripts (from 573 unique genes) mapped to chromosome 8. All but one of these are present and complete in the finished sequence. The finished sequence shows excellent co-linearity with the genetic map14 (Supplementary Fig. S1). Among 247 sequence-based genetic markers (Supplementary Table S4) there are six discrepancies. One discrepancy consists of eight markers and spans a region in 8p23 known to be the site of a polymorphic inversion in the human population15,16 (see below). Five discrepancies each consist of single markers out of order by one position; all occur in small regions where the genetic map shows no recombination in one of the two sexes (Supplementary Table S4). The sequence also shows good agreement with the radiation hybrid (RH) map17 (Supplementary Table S5).
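One way to enumerate order discrepancies between a genetic map and a sequence is sketched below. It is not necessarily the procedure used here: it assumes markers are supplied in genetic-map order with their sequence coordinates, and it reports the markers that fall outside a longest increasing run of coordinates. The function name and input layout are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple

def out_of_order_markers(markers: List[Tuple[str, int]]) -> List[str]:
    """markers: (name, sequence_position) pairs listed in genetic-map order.

    Returns the names of markers that are not part of a longest strictly
    increasing subsequence of sequence positions.
    """
    tail_values: List[int] = []   # smallest tail position for each run length
    tail_index: List[int] = []    # marker index holding that tail
    previous = [-1] * len(markers)
    for i, (_, position) in enumerate(markers):
        k = bisect_left(tail_values, position)
        if k == len(tail_values):
            tail_values.append(position)
            tail_index.append(i)
        else:
            tail_values[k] = position
            tail_index[k] = i
        previous[i] = tail_index[k - 1] if k > 0 else -1
    # Walk back from the end of the longest run to collect its members.
    on_path = set()
    i = tail_index[-1] if tail_index else -1
    while i != -1:
        on_path.add(i)
        i = previous[i]
    return [name for idx, (name, _) in enumerate(markers) if idx not in on_path]
```

With this approach a single swapped marker is reported as one name, whereas an inverted block, such as the 8p23 inversion, appears as a run of consecutive flagged markers.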

We produced a manually curated gene catalogue, containing 793 gene loci and 301 pseudogene loci (see Methods). The catalogue includes all previously known genes on chromosome 8 (Table 1). According to the Hawk2 categorization scheme18, there are 614 ‘known’ genes, 109 ‘novel CDS’, 43 ‘novel transcripts’, 14 ‘putatives’ and 13 ‘gene fragments’. The small set of novel and putative categories were annotated by spliced expressed sequence tag (EST) evidence only; some ‘putative novel’ loci may prove to be pseudogenes. Comparison of manual annotation performed at the Broad Institute of MIT and Harvard to manual annotation for specific regions done at Jena and Keio indicated that they were largely the same, and that virtually all differences could be attributed to edge effects (see Supplementary Information).

Table 1 Chromosome 8 gene content

Full-length transcripts of known genes contain an average of 9.9 exons, comparable to recently published reports8,9,10,11,19, and have an average length of 3,056 base pairs (bp); internal exons have an average length of 155 bp. There is evidence of extensive alternative splicing. Gene loci have an average of 4.1 distinct transcripts, with 63% having at least two transcripts, values that are similar to recent reports8,9,11,20. Of the 301 pseudogenes on chromosome 8, 84% are processed pseudogenes arising from retrotransposition; the remaining 16% are unprocessed. We also identified 13 tRNA genes (Supplementary Table S6). Examples of genes that represent extremes from these averages are described in Supplementary Information.

Several aspects of the genome landscape are notable. The overall gene density is 5.6 genes Mb⁻¹, below the genome average of 10 genes Mb⁻¹. Gene distribution is highly heterogeneous, with 44 gene deserts (≥ 500 kb without a coding gene, Supplementary Table S7) that together comprise 41.9 Mb or 29% of the total length. The overall G + C content is 39.2%, but varies substantially across the chromosome (Fig. 1). Nearly half of the chromosome is composed of repeat sequences, with transposable element fossils comprising 44.5%, low complexity sequence (including simple sequence repeats and satellite sequences) comprising 1.8%, and segmental duplications comprising 2.1% (with interchromosomal and intrachromosomal duplications at 1.5% each, with some sequence included in both categories) (E. Eichler and X. She, personal communication).
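The gene desert definition lends itself to a short sketch. The code below assumes gene loci are given as 0-based, half-open (start, end) intervals and counts the stretches between the chromosome ends and the first and last genes as candidate deserts; both choices, and the function names, are illustrative assumptions rather than the procedure behind Supplementary Table S7.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open (start, end)

def find_gene_deserts(
    genes: List[Interval],
    chrom_length: int,
    min_size: int = 500_000,   # desert threshold taken from the text
) -> List[Interval]:
    """Return gene-free intervals of at least min_size bases."""
    deserts = []
    cursor = 0  # rightmost gene end seen so far
    for start, end in sorted(genes):
        if start - cursor >= min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)  # overlapping genes extend the covered region
    if chrom_length - cursor >= min_size:
        deserts.append((cursor, chrom_length))
    return deserts

def desert_fraction(deserts: List[Interval], chrom_length: int) -> float:
    """Fraction of the chromosome covered by the given deserts."""
    return sum(end - start for start, end in deserts) / chrom_length
```

As a check on the figures quoted above, 41.9 Mb out of a total length of 145.6 Mb is approximately 29%.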

Chromosome 8 is the first human autosome and one of only two chromosomes (the other being chromosome X20) for which sequences span the entire pericentromeric region. The regions on both arms stretch from unique euchromatin through pericentromeric satellites and into the higher-order alpha-satellite array (Fig. 2). Three variant higher-order repeat units populate the chromosome 8 higher-order array, D8Z2 (ref. 21 and Supplementary Information). The proximal termini of both the 8p and 8q sequence contigs consist of nine copies of the 1.9-kb unit. The p and q arm higher-order units are highly identical to each other (96–98%) and occur in the same head-to-tail orientation, indicating that these sequences sample the edges of the chromosome 8-specific array. Analysis of the finished pericentromeric sequence of chromosome 8 is essential to test and further develop primate centromere evolution hypotheses using an autosomal model.

Figure 2: 8p and 8q pericentromeric contigs extend into chromosome 8-specific higher-order alpha satellite, D8Z2.

The pericentromeric region of chromosome 8 is shown as a truncated ideogram with the extent of sequence coverage shown below by black bars. Dotter plots show self–self alignments of the most proximal 100 kb from each arm including 36 kb of the chromosome-specific alpha satellite array (D8Z2). Junctions between the arm-specific satellite region and D8Z2 are marked with blue arrows. Dark blocks indicate the highly repetitive nature of the satellite region and mark similarity between monomers within each satellite family. Gaps in the dark blocks occur where interspersed elements (LINEs, SINEs and long terminal repeats) interrupt the satellite sequences. In the alpha satellite array dotter plot (bottom), D8Z2 from 8p (18 kb) is joined with that of 8q (18 kb). The plot reveals the periodic nature of the centromeric, higher-order alpha satellite array with black horizontal lines indicating near identity of sequences spaced at 1.9-kb intervals. The regions outlined in blue are self–self alignments (‘8p’ and ‘8q’), whereas the remaining rectangular region of the plot is an alignment of 8p versus 8q D8Z2.

The most striking feature on chromosome 8 emerges from evolutionary and population genetic comparisons (Fig. 3). The most distal 15 Mb on chromosome 8p show an extremely high divergence between human and chimpanzee (0.021 substitutions per site, 4.0 s.d. above the mean of 0.012). The region also shows a strikingly high polymorphism rate in the human population (0.0018, 3.2 s.d. above the mean of 0.0010). The peak divergence reaches 0.032 (8.6 s.d.), and diversity 0.0028 (7.1 s.d.), across a 1-Mb region (3.3–4.3 Mb) overlapping the CSMD1 gene. This is the highest divergence level seen across all autosomes and chromosome X. Only regions of chromosome Y may be more rapidly diverging, driven by the high mutation rate in the male germ line. We excluded trivial explanations for this observation, such as unresolved segmental duplications (Supplementary Information). Diversity is also locally high in the chimpanzee, although the data are more limited.
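Expressing a window's divergence or diversity in standard deviations from the mean is a simple standardization, sketched below. The sketch assumes one rate per fixed-size window and takes the mean and standard deviation over all supplied windows; the 2 s.d. default threshold and the function name are illustrative assumptions, and the input values in any real analysis would come from genome-wide alignments.

```python
from statistics import mean, pstdev
from typing import List, Tuple

def divergence_outliers(
    rates: List[float],
    threshold: float = 2.0,   # assumed cut-off, in standard deviations
) -> List[Tuple[int, float]]:
    """Return (window_index, z_score) for windows above the threshold.

    rates holds one value per window, for example substitutions per site.
    """
    mu = mean(rates)
    sd = pstdev(rates)  # population s.d. over all windows
    if sd == 0:
        return []       # no variation, so nothing can be an outlier
    outliers = []
    for index, rate in enumerate(rates):
        z = (rate - mu) / sd
        if z > threshold:
            outliers.append((index, z))
    return outliers
```

Because the standard deviation of a windowed rate depends on the window size, z-scores computed for 1-Mb windows are not directly comparable with those computed for a 15-Mb region.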

Figure 3: Diversity and divergence on 8p.

Coloured lines indicate the distribution of human diversity (blue) and human–chimpanzee divergence (red). Values of genome averages and of 2 standard deviations from the means are indicated (dark and light dashed lines, respectively). Features mentioned in the text are indicated in the bottom panel, including genes, two low copy repeats (LCRs) and the common 8p23 inversion. Vertical ticks in the LCR boxes indicate olfactory receptor genes or pseudogenes, and vertical ticks in the DEF cluster boxes represent individual defensin (DEF) genes. There is a discontinuity in the divergence plot from 6.98 to 8.13 Mb. This region, corresponding to the REPD repeat, is also highly duplicated in the chimpanzee, making it impossible to align sequence with high enough confidence to call divergence.

The high rate of divergence and diversity at distal 8p might reflect either an extraordinary mutation rate or population genetic history. The latter alternative would require an unusually long coalescence time to the most recent common ancestor over a very large region; this would be remarkable inasmuch as local coalescence times tend to be correlated only over short distances, with the correlation falling below 0.5 within 20 kb (ref. 22). We sought to resolve the issue by examining the divergence rates with more distant mammalian species, where the impact of population genetic history should be negligible.

Comparison of ancestral interspersed repeats in the human, dog23 and mouse24 genomes reveals that the region exhibits above-average lineage-specific divergence rates on all three lineages across 100 million years of evolution, but that the rate is the most elevated relative to the genome-wide mean in the lineage leading to humans. The greatest elevation is seen in the most distal 6 Mb of 8p, where the ancestral interspersed repeat divergence rates in the orthologous sequences have been 0.19 (3.3 s.d. above the mean of 0.14) on the human lineage and 0.41 (1.0 s.d. above the mean of 0.38) in the mouse lineage since the primate–rodent split, and 0.24 (1.9 s.d. above the mean of 0.20) in the dog lineage since the divergence from the common boreo-eutherian ancestor.

The biological basis for the apparently high mutation rate is unclear. Three major factors have been associated with high mutation rates in the human genome: proximity to telomeres, high recombination rate and high A + T content25,26. The region on chromosome 8p has all three factors. The mean sex-averaged recombination rate across the first 6 Mb is 2.7 cM Mb⁻¹, with a 1-Mb window peak of 3.5, as compared to the genome-wide average of 1.2. The region from 2.5–6 Mb is 62% A + T, as compared to a genome-wide average of 59%. It is unusual in this regard, because subtelomeric regions with high recombination rates are typically (A + T)-poor. Notably, the region is not subtelomeric in the mouse, where the lowest rate elevation is observed.

The distal region on chromosome 8p also contains at least two loci that appear to be undergoing positive selection (Fig. 3). The first locus is the major cluster of defensin genes, which lies within the region of high mutation (5.5–7.5 Mb), although 2.5 Mb from the peak. The defensin genes express small cationic antimicrobial peptides crucial to the innate immune response27. Studies2,3 have suggested that defensins have been under positive selection, with a high ratio of non-synonymous to synonymous changes detected in the mature peptide coding exon. Moreover, gene and segmental duplication within the cluster have led to extensive copy number28,29 and haplotype30 polymorphism within and across populations, which are thought to influence variation in disease susceptibility and contribute to ongoing adaptive evolution in both the human and chimpanzee species. The second locus showing positive selection is MCPH1, mutations in which cause microcephaly (Online Mendelian Inheritance in Man (OMIM): 251200); there is clear evidence of accelerated non-synonymous divergence correlating with the expansion of brain size throughout the lineage from simian ancestors to the human and chimpanzee4,5.

To investigate the diversity of copy number in the defensin clusters, we resequenced several dozen polymerase chain reaction (PCR) products from representative intervals from DEFB105A (beta-defensin cluster) and DEFA1 (alpha-defensin cluster) in 16 chimpanzees, 1 gibbon, 1 macaque and 4 breeds of dog (see Methods and Supplementary Information). In all species studied, the gene family has multiple members, and the members are more similar within a species than across species. Thus, the defensin clusters have either independently duplicated in each species or have undergone gene conversion events within species.

Finally, we note that the majority of the genes in the region of high divergence in distal 8p play important roles in development or signalling in the nervous system. Notably, the extremely large CSMD1 gene, which lies at the peak of divergence and diversity, is widely expressed in brain tissues. High regional mutation rates and positive selection are generally assumed to be distinct, but it is possible that the former may facilitate the latter by increasing the rate of appearance of potentially advantageous single, or interacting, alleles (see also ref. 31). It is intriguing to speculate whether the accelerated divergence rate of this region has contributed to the rapid expansion and evolution of the primate brain.

Methods

See Supplementary Information for details on clone path building, generation of sequence map, sizing of gaps and gene annotation. The final version of the clone path is available in AGP format (see http://www.ncbi.nlm.nih.gov/genome/guide/glossary.htm) at http://www.broad.mit.edu/tools/data/data-human.html.

Gene amplification and sequencing

TBLASTN (http://www.ncbi.nlm.nih.gov/BLAST) was used to identify DEFB105 and DEFA1 orthologues in 16 chimpanzees, 1 gibbon, 1 macaque and 4 dog breeds (akita, golden retriever, greyhound and mastiff). PCR primers for gene amplification were designed using Primer3 (http://frodo.wi.mit.edu/primer3) based on the species reference sequence. Human and macaque primers were used for gibbon. Amplified products were cloned, and for each individual/gene combination, 48 or 96 clones were sequenced.

Haplotype analysis

Neighbourhood Quality Standard32 (NQS) scores were computed for all sequenced products using the published constraints32. Reads were trimmed to the first and last three consecutive NQS bases, and aligned to the reference sequence using PatternHunter (http://www.bioinformaticssolutions.com). Multiple sequence alignments were built from the pairwise alignments and inspected to find SNPs that were: at NQS bases, supported by at least two reads, and in a ten base window where not more than two other variations were observed. To minimize false positives due to errors during PCR amplification, we restricted our analysis to haplotypes that differed in >3 bases.
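The three SNP criteria can be expressed as a simple filter, sketched below. The sketch assumes that each candidate already carries a flag recording whether it lies at an NQS base and a count of supporting reads, and it interprets the ten-base window as five bases either side of the candidate; that interpretation, the class and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    position: int          # column in the multiple sequence alignment
    is_nqs: bool           # True if the variant lies at an NQS base
    supporting_reads: int  # number of reads carrying the variant

def filter_snps(
    candidates: List[Candidate],
    min_reads: int = 2,       # "supported by at least two reads"
    half_window: int = 5,     # assumed reading of the ten-base window
    max_neighbours: int = 2,  # "not more than two other variations"
) -> List[Candidate]:
    """Keep candidates that satisfy all three criteria from the text."""
    positions = [c.position for c in candidates]
    kept = []
    for c in candidates:
        # Count other variant positions falling inside the window around c.
        neighbours = sum(
            1 for p in positions
            if p != c.position and abs(p - c.position) <= half_window
        )
        if c.is_nqs and c.supporting_reads >= min_reads and neighbours <= max_neighbours:
            kept.append(c)
    return kept
```

Counting all candidate positions as neighbours, including those that fail the other criteria, is the conservative choice: a dense cluster of variation is treated as suspect regardless of the quality of its individual members.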