Abstract
Since the sequencing of the first two chromosomes of the malaria parasite, Plasmodium falciparum1, 2, there has been a concerted effort to sequence and assemble the entire genome of this organism. Here we report the sequence of chromosomes 1, 3–9 and 13 of P. falciparum clone 3D7—these chromosomes account for approximately 55% of the total genome. We describe the methods used to map, sequence and annotate these chromosomes. By comparing our assemblies with the optical map, we indicate the completeness of the resulting sequence. During annotation, we assign Gene Ontology terms to the predicted gene products, and observe clustering of some malaria-specific terms to specific chromosomes. We identify a highly conserved sequence element found in the intergenic region of internal var genes that is not associated with their telomeric counterparts.
Contiguous DNA sequences (contigs) have been obtained for chromosomes 1, 3, 4, 5 and 9, whereas chromosomes 6, 7, 8 and 13 contain a few gaps; most contigs have been ordered and oriented. Table 1 shows the status and content of the chromosomes at the time of writing. As we were unable to produce unbroken sequence from telomere to telomere for all nine chromosomes, contiguous 'pseudo-chromosomes' were constructed by artificially joining all contigs that could be mapped to an individual chromosome. In most cases, the order and orientation of the contigs could be inferred using mapping data3, 4, 5 or read-pair information. Small contigs (of less than 5 kilobases, kb) that could not be mapped onto a chromosome have not been included in the analysis, and thus a small number of genes on the unmapped contigs will be missing from the genome sequence. The construction of pseudo-chromosomes does, however, have the advantage of allowing a global analysis of chromosome structure, and also removes redundancy from the analysis that would otherwise occur owing to contamination between chromosomes during purification and aberrant contigs formed during assembly.
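The joining step can be illustrated with a minimal sketch. It assumes that each mapped contig comes with an orientation flag and that joins are marked by a run of N characters. The spacer length and the function names are hypothetical and are not taken from the assembly pipeline used here.

```python
from typing import List, Tuple

# Complement table for reverse-complementing contigs on the minus strand.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_pseudo_chromosome(contigs: List[Tuple[str, str]],
                            spacer_len: int = 100) -> str:
    """Join contigs, given in map order as (sequence, strand) pairs.

    strand is '+' or '-'; minus-strand contigs are reverse-complemented.
    spacer_len is an illustrative assumption, not a value from the paper.
    """
    spacer = "N" * spacer_len
    oriented = [seq if strand == "+" else reverse_complement(seq)
                for seq, strand in contigs]
    return spacer.join(oriented)
```

Marking each artificial join with a spacer keeps the boundaries visible, so that downstream analyses can tell them apart from true contiguous sequence.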
A comparison of the optical maps for the finished chromosomes with virtual restriction digests of the assembled sequences, using two enzymes, shows good agreement (Fig. 1). A misassembly in chromosome 4 is apparent from both comparisons, which we have localized to a region in an internal var gene repeat. The depth of coverage in this area suggests that there is a 50-kb perfect repeat. Chromosome 9 has a deletion of 100 kb in comparison with the BamHI optical map, but it compares well with the NheI map, and with the sequence tagged site (STS) markers and the yeast artificial chromosome (YAC) map. The data strongly suggest that this anomaly is due to an optical mapping error, rather than a problem with the chromosome sequence.
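A virtual restriction digest of this kind can be sketched as follows. The recognition sites and cut positions for BamHI (G^GATCC) and NheI (G^CTAGC) are standard. The function name and the treatment of the sequence as a linear molecule are assumptions made for illustration, and this is not the software used for Fig. 1.

```python
import re

# Recognition site and cut offset within the site (both enzymes cut after the first G).
SITES = {"BamHI": ("GGATCC", 1), "NheI": ("GCTAGC", 1)}


def virtual_digest(seq: str, enzyme: str) -> list:
    """Return fragment lengths (bp), in order, for a linear sequence."""
    site, cut_offset = SITES[enzyme]
    seq = seq.upper()
    # A lookahead pattern reports every occurrence of the site.
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]
```

The ordered fragment lengths returned here are the quantities that would be paired with optical map fragments in a scatter plot.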
Figure 1: Scatter graphs of virtual restriction digests of completed chromosomes and pseudo-chromosomes against optical map fragment sizes.

Top row: completed chromosomes (left) and unfinished chromosomes (right) compared with NheI optical map. Bottom row: as top row but compared with BamHI optical map. Each point on the graph represents a restriction fragment compared to its corresponding optical map fragment. The lines show the regression for each chromosome.
The sizes of the pseudo-chromosomes 6, 7 and 8 also compare well with the predictions from the optical map. Chromosome 13 is 400 kb smaller than the predicted size in the NheI map, but only 10 kb smaller than the predicted size from the BamHI map. Thus size comparisons between optical maps and digests reveal that very few data are missing from the chromosome assemblies (Fig. 1). When comparing contig order and orientation with the optical map of unfinished chromosomes, many more outliers are visible on the scatter plots (Fig. 1 and Table 1). Only chromosomes 13 and 6 have r² values of less than 0.8 in correlation analysis, both against the BamHI maps. Thus for the most part, the contigs are ordered and oriented correctly.
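For reference, the r² statistic for paired fragment sizes can be computed as in the sketch below. It assumes that each virtual fragment has already been matched to its optical map fragment, and the function name is hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists of sizes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)
```

A single badly placed fragment can lower r² considerably. For example, the toy pairs (1, 1), (2, 2), (3, 3), (4, 10) give 0.784, which would fall below the 0.8 level mentioned above.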
Chromosomes 6, 7 and 8 do not resolve on pulsed field gel electrophoresis, and therefore they were sequenced as a group. Because of this we were unable to group contigs sufficiently to initiate gap closure. In order to overcome this problem, a HAPPY map6, 7, 8 was created, using data from the genome sequence to design primers. (HAPPY mapping allows the order and spacing of STS markers to be determined accurately, by following their segregation among roughly haploid samples of randomly fragmented DNA, using the polymerase chain reaction.) In the first round of mapping, 496 probes were generated which could be arranged on 61 linkage groups with 343 singletons at a lod (log of odds) threshold of 4. A further 30 probes were incorporated to increase the number of linkage groups to 62 at a lod threshold of 5 with 361 singletons. The large number of singletons produced was due to the high level of extra-chromosomal contamination of the purified chromosomes, which we estimated to be around 40%. Despite this, generation of a HAPPY map for chromosomes 6, 7 and 8 has been an invaluable step in grouping contigs to direct the finishing process.
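The effect of the lod threshold on the number of linkage groups and singletons can be illustrated with a simple single-linkage grouping. This is a sketch under the assumption that markers are joined whenever their pairwise lod score meets the threshold. The data structure and names are hypothetical, and the published HAPPY mapping software may group markers differently.

```python
from collections import defaultdict


def linkage_groups(markers, lod, threshold):
    """Group markers whose pairwise lod score is at least the threshold.

    lod maps (marker_a, marker_b) pairs to scores. Returns a tuple of
    (groups with two or more markers, singleton markers).
    """
    parent = {m: m for m in markers}

    def find(m):
        # Walk to the root of the set, compressing the path on the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), score in lod.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for m in markers:
        groups[find(m)].append(m)
    linked = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return linked, singletons
```

Raising the threshold can only split groups or release markers as singletons, which is consistent with the larger singleton count reported at the higher threshold.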
Although gene predictions and annotations were performed by three different groups as part of the sequencing consortium, the predicted overall protein-coding content of each chromosome was very similar (Table 1). Small differences in coding percentage arise in part from differences in chromosome size, and thus in the relative contribution of the telomeric sequences. The gene structures predicted by each group, assessed by comparing gene size, exon size and intron size, were also largely the same (Table 1). As the sequence for some chromosomes is incomplete, it is possible that exons that overlap gaps may be missed. In some cases where frame-shifts occur within exons, particular effort has been made to check that these are pseudogenes and not caused by sequencing errors. The consistency of annotations across all chromosomes suggests that the quality of sequence has not seriously affected gene identification. We expect the accuracy of sequence of all chromosomes to be very high owing to the depth of read coverage (Table 1). Chromosome maps showing the location and structure of genes along each chromosome are available (Supplementary Information).
Gene Ontology (GO) was used to classify genes across the entire genome, and as GO had not been previously applied for annotating an intracellular parasite, new parasite-specific GO terms were created9. The proportion of genes associated with parasite-specific processes or localized in parasite-specific compartments varies between chromosomes (Fig. 2). Whereas most 'housekeeping' genes appear to be evenly distributed across the chromosomes (Fig. 2a), chromosome 5 appears to have the highest proportion of genes annotated with apicoplast localization (Fig. 2b). Conversely, and unlike chromosome 4, it has a very low proportion of genes associated with host cell invasion or adhesion (Fig. 2b, c). The uneven distribution of apicoplast targeted genes on chromosome 5 involves non-orthologous genes, whereas the clustering of genes involved in host cell invasion or adhesion results from duplications of gene families such as variant antigen (var) and repetitive interspersed family (rif) genes.
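The per-chromosome proportions discussed here amount to a simple tally, sketched below. The input layout (one set of GO term names per gene, grouped by chromosome) and the function name are assumptions made for illustration, not the format used by the annotation groups.

```python
def term_percentage(annotations, term):
    """Percentage of genes on each chromosome annotated with a given term.

    annotations maps a chromosome name to a list of sets, one set of
    GO term names per gene. Chromosomes with no genes are skipped.
    """
    return {
        chrom: 100.0 * sum(term in gene_terms for gene_terms in genes) / len(genes)
        for chrom, genes in annotations.items()
        if genes
    }
```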
Figure 2: Comparison of the percentage of annotations with specific Gene Ontology terms on each chromosome.

a, Annotations to 'cell growth and/or maintenance'; b, annotations to 'plastid'; c, annotations to 'invasion' and/or adhesion.
We have identified two previously undescribed clustered gene families; one on chromosome 9 and one on chromosome 13. On chromosome 9, there are 7 copies of a putative protein kinase which show 25–46% amino-acid identity to each other; four of these genes have a predicted signal peptide. Proteomic analysis has shown expression of two of these genes (PFI0105c and PFI0135c)10. Chromosome 13 contains a tandem array of 5 paralogous genes including msp7 (ref. 11) with 15–30% identity to each other. Expression of one of these MSP7-like proteins (MAL13P1.174) has been detected, by proteomic studies, during the asexual stage12. The significance of the physical localization and function of these different genes is unknown, so further studies of their expression pattern and cellular localization are required. Protein alignments of these families are available (Supplementary Information).
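Percentage identity values such as those quoted for these families depend on how the denominator is defined. The sketch below uses one common convention, counting only alignment columns in which both sequences have a residue. Both this convention and the function name are assumptions, since the text does not state how identity was calculated.

```python
def percent_identity(aln_a: str, aln_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither row has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        identical += (x == y)
    return 100.0 * identical / compared if compared else 0.0
```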
Bowman et al.2 deduced a consensus pattern of repeats and coding regions for the subtelomeric regions of chromosomes 2 and 3. The overall arrangement of var, rif and subtelomeric variable open reading frame (stevor) genes is conserved in nearly all telomeres, but the number and orientation of gene families vary. For example, many subtelomeres contain multiple var genes, and some have inverted var genes. The right-hand telomere of chromosome 5 has a truncated telomere with a partial inverted var gene adjacent to the telomeric repeat, with no rep11 or rep20 repeat units. The telomere-associated repeat elements are involved in co-localization of telomeres within the nucleus13, 14. This may aid chromosome segregation and increased recombination between subtelomeric genes. Telomere repeats extending from truncated genes are frequently observed in other clones of P. falciparum, often leading to transcription of the telomere13. This observation suggests that telomere transcription may be involved in telomere maintenance at truncated chromosome ends. As the var gene on the right-hand end of chromosome 5 is inverted, there could be transcription of the telomeric repeat.
A putative centromere structure has been predicted in chromosomes 2 and 3 (ref. 2) which is characterized by a 2.6-kb region of 97.3% (A + T) content residing in a gap between coding sequences of at least 9 kb. On inspection of all of the completed chromosomes, we have identified similar structures representing the putative centromeres. There is only ever one per chromosome. All have a region of very high (A + T) content, and a core region of slightly higher (G + C) content, all lying in a gap between coding regions of between 8 and 11 kb. A similar structure has now been identified in the intracellular parasite Encephalitozoon cuniculi15. The discovery of these elements in all contiguous chromosomes, and now in another organism, suggests they have an important role in chromosome maintenance.
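A search for candidate regions of this kind can be sketched as a scan for intergenic gaps of the stated size, with the (A + T) content of each gap reported. The coordinate convention (0-based, half-open coding intervals) and the function names are assumptions, and the sketch does not attempt to detect the core region of slightly higher (G + C) content.

```python
def at_content(seq: str) -> float:
    """Fraction of A and T bases in a sequence."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0


def candidate_gaps(seq, genes, min_gap=8000, max_gap=11000):
    """Return (start, end, at_fraction) for gaps between coding intervals.

    genes is a list of (start, end) coding intervals, 0-based and half-open.
    Only gaps whose length lies within [min_gap, max_gap] are reported.
    """
    results = []
    ordered = sorted(genes)
    if not ordered:
        return results
    prev_end = ordered[0][1]
    for start, end in ordered[1:]:
        if start > prev_end and min_gap <= start - prev_end <= max_gap:
            results.append((prev_end, start, at_content(seq[prev_end:start])))
        prev_end = max(prev_end, end)  # tolerate overlapping intervals
    return results
```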
Three of the nine chromosomes that were sequenced by us (namely 4, 7 and 8) contain internal arrays of var genes. In the intergenic regions of the internal var arrays, we have identified a highly conserved, (G + C)-rich (40% (G + C) content) sequence element of length 202 bp (Fig. 3). We have also identified three such (G + C)-rich conserved elements on chromosome 12, sequenced in ref. 16 (not shown in Fig. 3). There are in total 15 of these (G + C)-rich elements in the entire P. falciparum genome, with not more than one element present in every internal var intergenic region. These (G + C)-rich elements are strictly associated with internal var arrays, and were not found in subtelomeric var genes, nor near the single internal var genes on chromosomes 6 and 12. There is no obvious systematic order of the location of these (G + C)-rich sequence elements with respect to adjacent var genes in terms of proximity or direction of transcription of the var genes. The specific positioning of these conserved sequence elements between internal var genes suggests a possible regulatory function, although a standard BLASTN query in public databases showed no significant similarity to previously identified RNA genes or gene regulatory elements. The (G + C)-rich element does have the potential to form secondary structures when analysed using the MFOLD program (http://bioweb.pasteur.fr/seqanal/interfaces/mfold-simple.html) (data not shown). This could indicate that the (G + C)-rich element is a hitherto unknown transcribed RNA species. Cis-acting (G + C)-rich gene regulatory elements have been shown to function as important transcriptional regulators present in the promoter, enhancer and locus control regions of many eukaryotic genes from several species (see ref. 17 for a review). The interaction between specific sites along a DNA molecule has been shown to have a crucial role in the regulation of genetic processes such as DNA replication, site-specific recombination and transposition in other organisms18. Control of gene expression through DNA loop formation has also been shown in other organisms18, while in P. falciparum regulation of var gene expression by cooperative gene silencing elements in var gene introns19, or by a 5' flanking var gene region regulatory element, has also been described20. The potential of the (G + C)-rich sequences to form DNA secondary structures supports a possible function as regulatory elements in var-related genetic processes in P. falciparum.
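Peaks in a (G + C) plot, such as those that mark these elements, can be located with a sliding-window scan. The sketch below is illustrative only. The window size, step and threshold are hypothetical defaults chosen to echo the element length and (G + C) content quoted above, and they are not the parameters used to draw the plot in Fig. 3.

```python
def gc_windows(seq: str, window: int, step: int):
    """Return (start, gc_fraction) for each full window along the sequence."""
    seq = seq.upper()
    out = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        out.append((start, (chunk.count("G") + chunk.count("C")) / window))
    return out


def gc_rich_windows(seq: str, window: int = 200, step: int = 50,
                    threshold: float = 0.40):
    """Return the windows whose (G + C) fraction is at least the threshold."""
    return [(start, gc) for start, gc in gc_windows(seq, window, step)
            if gc >= threshold]
```

In a genome with very high (A + T) content, even a modest threshold separates such elements from the intergenic background.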
Figure 3: Position and structure of var-related (G + C)-rich elements.
a, Multiple alignment of the (G + C)-rich conserved sequence elements on chromosomes 4, 7 and 8 of P. falciparum, using CLUSTAL. Only the non-identical nucleotides across all 12 (G + C)-rich conserved sequence elements are indicated in the alignment, with the consensus sequence indicated at the bottom. The upper-case letters in the consensus sequence denote complete identity across all the (G + C)-rich elements presented in the alignment. Each of these sequence elements is represented with a unique identifier, representing its specific origin. b, Location of the (G + C)-rich conserved sequence elements in the intergenic region of internal var gene clusters on chromosomes 4, 7 and 8 of P. falciparum. Top panel, four (G + C)-rich sequence elements in the intergenic regions of the internal var gene cluster on chromosome 7. The arrowheads indicate the peaks in the (G + C) plot, corresponding to the location of the (G + C)-rich conserved sequence elements. The exact locations of the neighbouring var and pseudo-rif genes are marked with red and yellow boxes, respectively. Bottom panel, a schematic diagram representing the relative positions of the internal var and rif genes and the conserved (G + C)-rich sequence elements on chromosomes 4, 7 and 8 (not to scale). The var or rif genes are placed either on top or bottom of the grey bars, depending on the direction of transcription.
Methods
Sequencing
The DNA was cloned and sequenced according to methods described elsewhere2, 21. Derived contigs were ordered according to previously derived genetic, optical and physical maps3, 4, 5. For all unfinished chromosomes, assemblies were screened against mapped contigs to remove extra-chromosomal contamination. For chromosomes 6, 7 and 8 a HAPPY map was generated to assist ordering; briefly, agarose-embedded genomic DNA was released by melting at 65 °C, sheared gently into fragments with a mean size of 50 kb, and 88 samples, each containing 0.7 genome-equivalents of fragments, were taken (a further 8 samples were DNA-free controls). These samples (the mapping panel) were preamplified by PEP (primer extension preamplification), diluted and dispensed into 30 replica panels. Each replica was screened for between 50 and 100 markers using a two-phase polymerase chain reaction (multiplexed forward and reverse primers in phase 1, followed by dilution and a second phase for one marker at a time, using an internal forward primer and the reverse primer). Pairwise lod scores between markers were calculated, linkage groups identified, and maps of each group of three or more markers computed, essentially as described previously7, 8.
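As an illustration of what a pairwise lod score measures, the sketch below computes a generic log10 likelihood ratio for two markers from their presence/absence calls across the mapping panel, comparing the observed joint frequencies with those expected if the markers segregated independently. This is a simplified stand-in and an assumption, not the estimator described in refs 7 and 8, which models breakage between markers explicitly. The function name is hypothetical.

```python
import math


def pairwise_lod(a, b):
    """Log10 likelihood ratio of association versus independent segregation.

    a and b are equal-length lists of 0/1 calls (marker absent/present),
    one call per mapping-panel sample.
    """
    n = len(a)
    counts = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for x, y in zip(a, b):
        counts[(x, y)] += 1
    pa = sum(a) / n  # marginal frequency of marker a
    pb = sum(b) / n  # marginal frequency of marker b
    lod = 0.0
    for (x, y), c in counts.items():
        if c == 0:
            continue  # empty cells contribute nothing to the likelihood
        expected = (pa if x else 1 - pa) * (pb if y else 1 - pb)
        lod += c * math.log10((c / n) / expected)
    return lod
```

This statistic rises for any departure from independence, including markers that avoid each other, so a practical implementation would also check that co-occurrence exceeds expectation before treating a high score as evidence of linkage.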
Annotation
Genome annotation was carried out using Artemis22. Genes were identified by manual curation of the output of the software packages Genefinder (P. Green, unpublished work), GlimmerM23 and phat24. Functional assignments were based on assessment of BLAST and FASTA searches against public databases and domain predictions using InterproScan25, TMHMM26 and SignalP27.
Gene Ontology (GO) terms28 were manually assigned to gene products for all 14 chromosomes. First, candidate GO terms were selected by sequence-similarity searching a database of peptide sequences and their previously assigned GO terms, drawn from the following databases: Flybase, Mouse Genome Informatics, Saccharomyces Genome Database, Swissprot and The Arabidopsis Information Resource. After visual inspection of sequence alignments, suitable terms were either assigned directly from the candidate list, or alternatively, higher or lower granularity terms were selected directly from the ontology. When previously characterized genes were identified, terms were selected as above, but alternative experimental evidence codes were used to reflect the fact that the inferences were no longer based on sequence similarity. Some GO terms were also assigned automatically. In particular, 'membrane' was assigned using the transmembrane helix prediction tool TMHMM 2.0 (ref. 26).


