Malaria remains a major burden to human health in tropical and subtropical areas. In Africa alone, more than one million children under five die from the disease each year1. Although four members of the Plasmodium genus normally infect humans, nearly all deaths are attributable to a single parasite species, P. falciparum. The severity of disease caused by this species results primarily from its ability to modify the surface of infected red blood cells by inserting parasite proteins. Parasitized erythrocytes can bind to host endothelial cells, a process called cytoadherence, leading in some cases to their accumulation in specific organs such as the brain and to the development of cerebral malaria2. Current approaches to malaria control and treatment rely on measures such as insecticide-impregnated bed nets and chemotherapy. Although considerable resources have been devoted to the development of a vaccine, no effective immunization regime so far exists. Moreover, the rapid spread of resistance to existing and new antimalarial drugs means that in some areas of the world, particularly southeast Asia, reliable prophylaxis is not possible, making treatment difficult3.
In response to these problems, the Malaria Genome Sequencing Consortium4 was established to sequence the entire genome of P. falciparum as a collaborative venture. In less than a year, the consortium aims to generate almost all of the P. falciparum genome sequence in unfinished form, making nearly its entire gene complement accessible for malaria researchers. Focus is already shifting to the development and implementation of whole genome approaches to drug development and vaccine target identification.
Sequencing strategy
P. falciparum has a nuclear genome of around 30 megabases (Mb) divided between 14 chromosomes, which range in size from 0.7 to 3.5 Mb. Sequencing this genome presents significant technical difficulties, mainly because of the biased nucleotide composition of its DNA, with an overall (A + T) content estimated at 82% (ref. 5). In general, fragments of DNA over 5 kilobases (kb) are unstable in Escherichia coli, so the large-insert bacterial clones commonly used as templates for sequencing are not available. However, most P. falciparum chromosomes can be resolved by pulsed-field gel electrophoresis (PFGE), and some mapped yeast artificial chromosomes (YACs) exist. Chromosome 3 was sequenced using a whole chromosome 'shotgun' (WCS), an approach that was pioneered during the sequencing of Saccharomyces cerevisiae chromosome IX (ref. 6), and which was also used for P. falciparum chromosome 2 (ref. 7). Management of the WCS was facilitated by generating several hundred sequence reads from each of the YACs8,9 comprising the minimal tiling path for this chromosome, and using these to place the WCS into a series of discrete bins. A strategy of sequencing YAC clones exclusively was discarded because of the high frequency of chimaerism, deletions and rearrangements inherent to YAC libraries10. In addition, although the YACs were purified through two pulsed-field gels11, the level of contaminating S. cerevisiae sequence remained high owing to preferential cloning of yeast DNA.
Assembly validation
Because of the complexity associated with the assembly of an entire chromosome containing extensive repetitive regions, we confirmed the chromosome 3 sequence by several independent methods. As double-stranded clones were used, with reads produced using both forward and reverse primers, the consistency of the read pairs generated was used as an initial assembly check. Reads derived from mapped YAC clones allowed confirmation of the colinearity of the chromosome assembly and the chromosome 3 YAC map8. In addition, the restriction enzyme map generated for this chromosome during the mapping project was used as further confirmation that the sequence had assembled correctly8. The order of mapped sequence-tagged site markers and simple sequence-length polymorphism microsatellite markers generated from the HB3xDd2 linkage segregation genetic map12 was also confirmed in the final assembly. Finally, the DNA sequence correlated well with the restriction enzyme pattern for chromosome 3 generated by optical mapping13, with an average error of 5%.
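The read-pair check described above can be illustrated with a minimal Python sketch. It is not the software used in this project: the `ReadPlacement` record, the function names and the insert-size limits (1,000 to 3,000 bp) are hypothetical choices made only for illustration. The idea is that the two reads from one clone should lie in the same contig, on opposite strands, pointing towards each other, and a plausible insert length apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPlacement:
    clone: str      # clone (template) name shared by both reads of a pair
    contig: str     # contig in which the read was placed
    start: int      # leftmost assembled coordinate of the read
    end: int        # rightmost assembled coordinate of the read
    forward: bool   # True if the read aligns to the contig's forward strand


def pair_is_consistent(a, b, min_insert=1000, max_insert=3000):
    """Return True if two reads from one clone are placed plausibly.

    The insert-size limits are illustrative assumptions, not project values.
    """
    if a.contig != b.contig:
        return False                  # mates in different contigs: flag for inspection
    if a.forward == b.forward:
        return False                  # mates must lie on opposite strands
    fwd, rev = (a, b) if a.forward else (b, a)
    if fwd.start > rev.start:
        return False                  # reads point away from each other
    span = rev.end - fwd.start        # insert size implied by the assembly
    return min_insert <= span <= max_insert


def inconsistent_clones(placements, **limits):
    """Group placements by clone and list clones whose read pair fails the check."""
    by_clone = {}
    for p in placements:
        by_clone.setdefault(p.clone, []).append(p)
    return sorted(
        clone for clone, reads in by_clone.items()
        if len(reads) == 2 and not pair_is_consistent(*reads, **limits)
    )
```

Clones flagged by such a check would point to regions of the assembly that deserve inspection, for example collapsed or misjoined repeats.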
Analysis
The 1,060,106-bp chromosome 3 sequence encodes 215 predicted proteins and two tRNAs (Fig. 1), yielding a mean gene density of one predicted gene every 4.8 kb, which is similar to that in Caenorhabditis elegans (one gene per 5 kb)14. The overall (A + T) composition of this chromosome is 80%, with the base composition of exons being 76.8% (A + T) and introns being 84.6% (A + T). Six of the genes on chromosome 3 had been characterized prior to this work. They encode circumsporozoite protein (CSP)15,16, CS protein-TRAP-related protein (CTRP)17, RNA polymerase II largest subunit18, elongation factor-TS, CDC2-related protein kinase (EMBL accession no. EM:M86715; TREMBL accession no. TR:Q25028) and cyclophilin19, of which only three had been mapped to this chromosome15,16,17,18. Ninety-four of the predicted proteins (43.7%) have significant similarity to existing database entries. Eighty-five (39.5%) have matches to eukaryotic proteins, many providing potential functional information. Nineteen of these (8.8%) are currently unique to P. falciparum or other apicomplexan parasites, but few proteins in this group have been functionally characterized. Five predicted proteins (2.3%) have significant similarity solely to bacterial proteins and are likely to be localized to the organelles of the parasite. In total, 27.4% of the predicted proteins on chromosome 3 are members of Pfam protein families20. A further four predicted proteins contain discrete protein domains, but similarity does not extend further than the domain identified (Fig. 1). A preliminary functional classification of the proteins encoded on chromosome 3 is shown in Table 1.
Figure 1: P. falciparum chromosome 3.

Order and orientation of genes and predicted genes are shown. Exons are shown as coloured boxes with introns as linking lines. The two tRNA genes are shown as purple boxes with gene names shown in purple. Genes encoding proteins that have been characterized previously in P. falciparum have names shown in red. Predicted genes encoding proteins that have similarities to proteins currently unique to apicomplexan species are shown in fuchsia. Predicted genes encoding proteins with similarity to proteins in other eukaryotes for which some functional information is available are shown in yellow. Predicted genes encoding proteins with similarities to proteins of unknown function in other organisms are shown in orange. Predicted genes encoding proteins with similarity solely to bacterial proteins are shown in light blue. Genes encoding predicted proteins having similarity only within a defined protein domain are shown in grey. Predicted genes having no significant similarities are shown in dark green with those that have been confirmed by RT–PCR shown in light green. Pseudogenes are shown in dark blue. N-terminal signal sequences are represented by wavy lines. Repetitive telomeric sequences are shown as hatched boxes, and the location of the predicted centromere (put.CEN) is shown as a red box.
Most of the predicted proteins (202 proteins; 94.0%) contain low-complexity, non-globular regions, as defined by the seg program21. However, only 21.6% of residues are defined as low-complexity, indicating that many of these regions are small. Such regions have been previously reported7 and are often polymorphic between parasite isolates in both housekeeping genes18 and genes identified by antibody screening15,16. Regions of low complexity can be divided into two distinct classes: tandem arrays of repeated peptide motifs (such as CSP and CTRP) and homopolymer runs of a single amino-acid residue (such as RNA polymerase II largest subunit and asparagine synthetase). The homopolymer runs represent expansions of amino acids with A/T-rich codons, encoding asparagine, lysine and glutamic acid, and are the predominant type of peptide polymorphism between parasite isolates.
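As a rough illustration of the second class of low-complexity region, the following Python sketch locates homopolymer runs of asparagine (N), lysine (K) or glutamic acid (E) in a protein sequence. It is not the seg program and does not reproduce its complexity measure; the function names and the minimum run length of five residues are assumptions made for this example.

```python
import re


def homopolymer_runs(protein, residues="NKE", min_len=5):
    """Return (residue, start, length) for each run of one repeated residue.

    `residues` and `min_len` are illustrative defaults, not published parameters.
    """
    # One residue from the set, followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([%s])\1{%d,}" % (re.escape(residues), min_len - 1))
    return [(m.group(1), m.start(), len(m.group(0)))
            for m in pattern.finditer(protein.upper())]


def run_fraction(protein, **kwargs):
    """Fraction of residues that fall inside the detected homopolymer runs."""
    if not protein:
        return 0.0
    covered = sum(length for _, _, length in homopolymer_runs(protein, **kwargs))
    return covered / len(protein)
```

For a toy sequence such as `MKNNNNNNSTKKKKKEEL`, this reports one asparagine run and one lysine run; the two glutamic acid residues fall below the assumed threshold.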
Before the genome project our understanding of gene organization in P. falciparum was limited, as few transcriptional units had been characterized22,23. Cloning and sequencing methodologies tended to encourage identification of genes that were highly immunogenic. Most of these genes had a single exon, whereas the remainder had small additional 5' or 3' exons. Analysis of the chromosome 2 and 3 sequences indicates that splicing is a much more common phenomenon in P. falciparum than originally thought. Nearly half of the genes on chromosome 3 are predicted to contain at least one intron (102; 47.4%). For 31.9% of these, intron splicing has been confirmed by similarity data to expressed sequence tag (EST) clones, comparative analysis with orthologous genes or by directly using polymerase chain reaction with reverse transcription (RT–PCR). For the majority of spliced genes on chromosome 3 (63 genes), splicing is predicted to be confined to the 2-exon model. Half of the 2-exon gene predictions have a 5' exon length of <100 bp, making the accurate prediction of their initiation ATG codons particularly difficult.
Several predicted genes consist of multiple small exons (up to 15), exhibiting a gene structure more similar to that of higher eukaryotes. We were able to identify five genes of this type on chromosome 3 owing to their similarity to genes in P. falciparum or other organisms. PFC0495W is similar to Eimeria tenella aspartyl protease; PFC0935C, N-acetylglucosamine-1-phosphate transferase, is most similar to its mouse homologue; and PFC0410W is most similar to the YT51 protein expressed in rat brain. The remaining two genes in this category, PFC0110W and PFC0120W, are members of a P. falciparum multigene family that encodes proteins implicated in cytoadherence24. These cytoadherence-linked asexual gene (clag) paralogues (clag 3.2 and clag 3.1, respectively) appear to have arisen from gene duplication; other single copies of clag have been localized to chromosomes 2, 4 and 9 (ref. 24). Deletion of the clag gene on chromosome 9 abolishes cytoadherence, indicating either that the clag genes on chromosome 3 are not functionally equivalent to it, or that their transcription is subject to higher order regulation24. Expression of PFC0410W, PFC0495W, PFC0935C and PFC0120W in the asexual blood stages has been confirmed by RT–PCR, and their predicted splice sites corrected by sequencing the cloned RT–PCR products24 (Fig. 2, and data not shown). The complex nature of splicing in this type of gene, and their short exons, makes them difficult to predict in the absence of similarity data; they are predicted inefficiently using the hexamer program (R. Durbin, unpublished software; documentation, code and data are available from anonymous ftp servers at ftp.sanger.ac.uk/pub/hexamer; Fig. 2). It is likely that this class of P. falciparum genes remains significantly underpredicted from genomic sequence. Sequence data from the RT–PCR products are being used to re-train the current gene-prediction software, but generation of more P. falciparum EST sequences or full-length complementary DNA libraries would also facilitate identification of these genes.
Figure 2: Splicing in the gene PFC0935C.

a, The original gene prediction is shown, generated using hexamer and GENEFINDER, with exons drawn as coloured boxes and introns as thin lines, together with the supporting hexamer prediction of protein-coding potential and BLASTX similarity data. This is compared with the in vivo splicing of this gene, confirmed by RT–PCR and DNA sequencing. Exon 1 is not supported by either protein similarity or hexamer prediction, and exon 2 is incorrectly predicted by the hexamer program. Primers used for exon confirmation are shown as numbered arrows. b, A gel showing comparison of PCR products generated from genomic (G) and cDNA (C) template, with primer combinations shown above. Combinations 1 + 7 and 2 + 7 gave slightly smaller products than predicted, an observation confirmed by sequence data. DNA size markers, the 1-kb ladder (lane M1) and the 123-bp ladder (lane M2) are shown in kb.
We compared the gene-prediction statistics for chromosomes 2 (ref. 7) and 3 (Table 2). As expected, the chromosomes are similar in many respects, including their overall (A + T) composition and coding density. However, the number of genes predicted as having introns is higher for chromosome 3. RT–PCR experiments for some chromosome 3 genes (results not shown) indicate that splicing in P. falciparum is actually underpredicted by our current gene-finding methods. Because the initial analysis of chromosome 2 predicts even fewer splicing events7, it is likely that splicing has also been underpredicted from the chromosome 2 sequence. It is essential that gene predictions are experimentally confirmed by RT–PCR to generate a larger training set with which to improve gene-finding algorithms.
Table 2: Summary of predicted features on P. falciparum chromosome 3 and comparison with chromosome 2 (ref. 7)
Direct comparison of the proteins predicted on chromosomes 2 and 3 is hampered because of the different methods used in generating those predictions and the different programs and parameters used in their analysis. For example, the differences observed between these two chromosomes, when comparing predictions of low-complexity sequence, are due at least in part to a different definition of low complexity being used to define the parameters for analysis. Improvement and standardization of gene-finding and protein-analysis methodologies, with subsequent re-analysis of the data from chromosomes 2 and 3, will allow a more accurate comparison of protein features.
Protein targeting
In addition to its nuclear genome, P. falciparum contains two organellar genomes, thought to have been acquired as a result of multiple endosymbiotic events. The mitochondrial genome is a 5.9-kb linear molecule present as multiple tandem repeats. A second organellar genome, a 35-kb circular DNA molecule, is located within the apicoplast25. This is an organelle of plastid origin, thought to be unique to apicomplexan parasites, which apparently provides essential metabolic functions26. Gradual loss of genes from the organellar genomes has occurred, such that maintenance of both organelles requires many nuclear-encoded proteins targeted to that organelle using amino-terminal sequences27. Three predicted proteins on chromosome 3 have N-terminal sequences and implied biological function indicating mitochondrial import (PFC0170C, PFC0225C and PFC0275W). Two predicted proteins have presequences that indicate targeting to the apicoplast (PFC0050C and PFC0310C). An additional three proteins are likely to be localized to an organelle, a conclusion based on their N-terminal sequences and predicted function; however, their precise location cannot be determined from their signal peptides (PFC0470W, PFC495W and PFC725C).
P. falciparum also directs several proteins to the cytoplasm, cytoskeleton and plasma membrane of the infected red blood cell. The proteins implicated as receptors involved in cytoadherence fall into this category; however, the sequences responsible for targeting these molecules remain unidentified.
Telomere structure
Non-coding sequences. The chromosome sequence contains 39 copies of the telomeric repeat sequence (T(G/A)ACCC) at the left telomere and 85 copies at the right telomere, but no attempt has been made to estimate the exact number of copies of this repeat in the intact chromosome. Other repeat sequences (R-CG7, rep11, rep20) occur between the telomere and the gene most proximal to the telomere (var). The rep11 sequence is a new repeat family consisting of an 11-bp tandem repeat located immediately telomeric to the rep20 sequences. The R-FA3 repeat sequence28 maps between the var gene and the adjacent rif gene.
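Counting copies of the telomeric repeat in a finished sequence can be sketched with a regular expression. The Python example below is only an illustration: it takes the repeat unit exactly as written above, T(G/A)ACCC, and the function names are hypothetical. It reports both the total number of units and the number of copies in the longest uninterrupted tandem array.

```python
import re

# Repeat unit as given in the text: T, then G or A, then ACCC (6 bp).
TELOMERE_UNIT = re.compile(r"T[GA]ACCC")
TELOMERE_ARRAY = re.compile(r"(?:T[GA]ACCC)+")
UNIT_LENGTH = 6


def count_repeat_units(seq):
    """Count non-overlapping copies of the telomeric repeat unit."""
    return sum(1 for _ in TELOMERE_UNIT.finditer(seq.upper()))


def longest_tandem_array(seq):
    """Number of copies in the longest uninterrupted tandem array."""
    arrays = TELOMERE_ARRAY.finditer(seq.upper())
    return max((len(m.group(0)) // UNIT_LENGTH for m in arrays), default=0)
```

Because the counts come from cloned and assembled sequence, they describe the assembly only and not the copy number in the intact chromosome.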
Coding sequences. Of the known multigene families, there are two var genes29,30, one at each telomere; four members of the rif gene family on the left chromosome arm and three on the right; and one copy of the stevor family at each telomere. Two members of the clag family24 occur in the left subtelomeric region, separated by a region with similarity to var, possibly representing the site of a previous recombination event. In addition, several pseudogenes occur at both telomeres that have homology to the var 3' exon (the varC genes)31. The preservation of these pseudogene sequences in the apparently rapidly evolving subtelomeric regions indicates that they may be biologically significant.
The most telomere-proximal sections of all four P. falciparum telomeres sequenced to date show a conserved order of repetitive DNA sequences and multigene family members (Fig. 3). In addition, the right telomere of chromosome 3 shows an extended region of similarity with the right telomere of chromosome 2 (ref. 7; Fig. 3). As well as members of known multigene families, there are several predicted genes that share similarity located on each of these chromosomes, which may represent new telomere-associated multigene families (PFC1075W–PFC1090W and PFB0980W–PFB0995W). The cellular location of proteins encoded by the members of these new multigene families is unknown, although we predict that all eight proteins have N-terminal signal peptides. Both rif and var are expressed on the surface of infected red blood cells32,33 and undergo antigenic variation, probably as a means of immune evasion34. The products of the var locus mediate cytoadherence to a variety of cellular receptors, and thus are important virulence factors. All three of the previously identified multigene families show extreme sequence polymorphism. The clustering of genes sharing similar cellular location and control is likely to be of mechanistic significance and indicates that the newly identified members of the cluster should be studied further. In addition, the regions of shared similarity between the telomeres of chromosomes 2 and 3 indicate that recombination within these regions may be frequent, and could partly explain the extensive polymorphism seen in members of the stevor, rif and var families.
Figure 3: Telomere organization in P. falciparum.

a, The telomeres of chromosomes 2 and 3, showing the higher-order organization of repetitive DNA sequences and telomere-associated multigene families. Four proteins of unknown function encoded on chromosome 3 are similar to proteins encoded on chromosome 2 and have conserved order and orientation (conserved telomere-encoded proteins: CTPs). These are shown in light blue (PFC1090W/PFB0995W), beige (PFC1085C/PFB990C), grey (PFC1080C/PFB0985C) and fuchsia (PFC1075W/PFB0980W). b, A consensus for the arrangement of repetitive DNA sequences and multigene families on P. falciparum telomeres based on the telomeres of chromosomes 2 and 3.
Centromere
No P. falciparum centromere has been identified. Centromeres in other eukaryotic species35,36 can be divided into two distinct classes: the point centromere seen in budding yeasts, such as S. cerevisiae, and the regional centromeres present in Schizosaccharomyces pombe and higher eukaryotes. The minimal functional unit of the S. cerevisiae centromere consists of an approximately 125-bp region comprising three parts: the outer CDEI and CDEIII domains, which flank a central (A + T)-rich region. S. cerevisiae centromeres are located in chromosomal regions of relatively low gene density and high (A + T) composition. The regional class of centromere has been most comprehensively characterized in the fission yeast S. pombe. Regional centromeres have characteristic complex arrays of repetitive sequences occurring over an extended region. The three centromeres of S. pombe span regions of 40–100 kb, each having a central (A + T)-rich core flanked by chromosome-specific repeat sequences37,38. Sequence in regional centromeres is very variable and this has hampered attempts to find functional centromere sequences in higher eukaryotes35.
We have identified a region on chromosome 3 that is currently the best candidate for the chromosome centromere. This region is extremely (A + T)-rich, 97.3% over 2.6 kb, with a central region having a slightly higher (G + C) content (Fig. 4a). Analysis of P. falciparum chromosome 2 reveals a region with similar structure (Fig. 4b). Both regions have very low coding potential; the nearest predicted genes are 12 kb apart on chromosome 3 (PFC0610c and PFC0615w) and 9 kb apart on chromosome 2 (ref. 7) (PFB0490c and PFB0495w). If these regions are chromosome centromeres, they have a structure more characteristic of regional rather than point centromeres. However, they would represent very compact regional centromeres, with an extremely short core sequence. Analysis of this region on both chromosomes highlights several chromosome-specific tandem repeats (Fig. 4).
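A region of this kind can be located by scanning the sequence with a sliding window and recording (A + T) content. The Python sketch below is a minimal illustration under stated assumptions and is not the analysis software used here: the function names are hypothetical, the 2.6-kb window and 97% threshold simply echo the figures quoted above, and the 100-bp step is arbitrary.

```python
def at_fraction(seq):
    """Fraction of bases in seq that are A or T (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0


def at_rich_windows(seq, window=2600, step=100, threshold=0.97):
    """Return (start, at_fraction) for each window at or above the threshold.

    Window size, step and threshold are illustrative assumptions.
    """
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        frac = at_fraction(seq[start:start + window])
        if frac >= threshold:
            hits.append((start, frac))
    return hits
```

Overlapping hits would normally be merged into a single candidate interval before comparison with the gene predictions on either side.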
Figure 4: Putative P. falciparum centromeres.

Plots of (A + T) composition for chromosomes 3 and 2. The Y axes are graduated in units of percentage (G + C) content, ranging from 0% (G + C) to 25% (G + C); the X axis represents DNA sequence in bp. a, (A + T) composition of chromosome 3 highlights a region that is extremely (A + T)-rich. Families of repetitive sequences in the predicted centromere of chromosome 3 are shown as coloured boxes. b, A similar region is present on chromosome 2, containing different families of tandem repeat sequences. Alignments of the repeat families can be found at ftp://www.sanger.ac.uk/Projects/P_falciparum/Centromeres/.
Early release of sequence data
Even before this chromosome was completed, several groups had demonstrated the utility of timely release of sequence data in unfinished form. During the course of the project, chromosome 3 sequences were used to confirm chromosome location and sequence data generated independently19, to identify new members of P. falciparum gene families39 and to identify members of families of genes conserved in other genera40.
Methods
Sequencing. P. falciparum DNA is denatured at relatively low temperatures because of its extreme (A + T) content5, so we had to adapt many standard protocols for this project. DNA used in library preparation was not exposed to either ethidium bromide or ultraviolet (UV) transillumination and we minimized the temperatures to which the DNA was exposed. Gel slices containing chromosome 3 were excised from pulsed-field gels and DNA was extracted from low-melting-point agarose using a modified agarase protocol. After equilibrating with TE buffer, pH 7.4, for several hours, gel slices were loaded into a 2-ml syringe and forced through a 26G needle. Agarase buffer was added to a final concentration of 1×, and 8 U β-agarase was added per ml of sample. After incubating at 37 °C for 3 h, the sample was extracted with TE-buffered phenol and the DNA recovered by ethanol precipitation. Libraries were prepared by fragmenting the DNA by sonication and cloning into pUC18. Sequences were generated using pUC clones with both forward and reverse primers using dye-terminator chemistry.
Assembly. Because of the size and complexity of the project, we modified the standard strategy used for sequence assembly. Initially, sequences generated by the WCS and the YAC skims were assembled using the Phrap assembly program (P. Green, unpublished software), which had been adjusted to handle the large number of reads generated. We divided the chromosome into eight sections based on the YAC map, each section having a separate working database. Contigs that did not contain YAC reads were directed to a repository database, from which they could be recovered once their location had been identified. We expected to have a large number of single reads in the assembly that originated from other P. falciparum chromosomes, as the library generated for chromosome 3 was estimated as being 87% pure, based on the hybridization of shotgun reads to chromosome blots. Thus, single reads were not incorporated into any of the databases, but again they could be recovered if necessary. In collaboration with the other laboratories contributing to the Malaria Genome Sequencing Project, the reads generated during the chromosome 3 project that originate from other P. falciparum chromosomes will be incorporated into their respective chromosomes as the sequencing project progresses.
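The binning logic described above can be summarized in a short Python sketch. It is a simplified illustration rather than the project's actual pipeline: the function name, the dictionary-based inputs and the section numbering are assumptions. Contigs containing YAC-derived reads are assigned to the chromosome section of the corresponding YAC, contigs without YAC reads go to a repository, and single reads are set aside.

```python
def bin_contigs(contig_reads, read_to_yac, yac_to_section):
    """Sort contigs into section bins, a repository, and a list of singles.

    contig_reads:   contig name -> list of read names in that contig
    read_to_yac:    YAC-derived read name -> YAC clone name
    yac_to_section: YAC clone name -> chromosome section (from the YAC map)
    All three inputs are hypothetical structures used for illustration.
    """
    sections, repository, singles = {}, [], []
    for contig, reads in contig_reads.items():
        if len(reads) == 1:
            singles.append(contig)        # single reads are kept out of all databases
            continue
        hits = {yac_to_section[read_to_yac[r]] for r in reads if r in read_to_yac}
        if not hits:
            repository.append(contig)     # no YAC read: location not yet known
        for section in sorted(hits):
            sections.setdefault(section, []).append(contig)
    return sections, repository, singles
```

A contig whose YAC reads map to two sections would appear in both bins, which is one way to flag a contig that spans a section boundary.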
Closure. Gaps in the initial assembly were filled by several methods. Initially, oligonucleotide walking from pUC clones bridging contigs was used to fill many of the gaps. A second approach was to perform combinatorial PCR between contigs mapped to the same chromosome region by YAC-derived reads. Products generated by combinatorial PCR were sequenced by oligonucleotide walking. For gaps that could not be filled by PCR, further pUC clones proximal to the gaps were selected by hybridization of a gridded 15× chromosome 3 pUC library to radiolabelled oligonucleotides selected at contig ends41. Regions of extreme (A + T) composition were resolved by either generating transposon libraries or cloning as very small fragments (50–500 bp) into m13mp18.
Analysis. Completed sections of chromosome 3 were subjected to a series of automatic analyses to reveal possible protein-coding (R. Durbin, unpublished software; P. Green & L. Hillier, unpublished software) and tRNA42 genes, similarities to ESTs43, other proteins44,45 and repeat/multigene families. The results were collated in a genome database (ACeDB) that merges overlapping sequences to provide a single contiguous view of the entire chromosome. Documentation, code and data for ACeDB are available from anonymous ftp servers at ftp.sanger.ac.uk/pub/acedb and ncbi.nlm.nih.gov/repository/acedb. Data from the various analyses were viewed interactively through the ACeDB annotator's graphical workbench. GENEFINDER (P. Green & L. Hillier, unpublished software) predictions were confirmed or adjusted to incorporate protein, cDNA and EST matches, hexamer coding potential and repetitive sequences (see http://www.sanger.ac.uk/Projects/P_falciparum/Methods_Analysis for full details of analysis protocols and parameters). We used the protein family database Pfam20 to classify common protein domains in the malaria genome. A number of web-based tools were used to identify transmembrane regions, coiled-coil domains and signal peptides, and to suggest overall protein localization (http://www.sanger.ac.uk/Projects/P_falciparum/Toolkit).
Data release. Sequence data generated by the chromosome 3 project were released continuously and were available for searching using the on-site BLAST server and downloading by ftp without restriction. The fully annotated sequence is available for browsing and downloading from http://www.sanger.ac.uk/Projects/P_falciparum. Unfinished sequence data from the remaining eight P. falciparum chromosomes in progress at the Sanger Centre can also be accessed through this site.
