Main

For decades, the laboratory mouse has provided an alternative platform for infectious disease research where the pathogen under study is intractable to routine laboratory manipulation. Experimental study of the human malaria parasite Plasmodium falciparum is particularly problematic as the complete life cycle cannot be maintained in vitro. Four species of rodent malaria (Plasmodium yoelii, Plasmodium berghei, Plasmodium chabaudi and Plasmodium vinckei) isolated from wild thicket rats in Africa have been adapted to grow in laboratory rodents1. These species reproduce many of the biological characteristics of the human malaria parasite. Many of the experimental procedures refined for use with P. falciparum were initially developed for rodent malaria species, a prime example being stable genetic transformation2. Thus rodent models of malaria have been used widely and successfully to complement research on P. falciparum.

With the advent of the P. falciparum Genome Sequencing Project, undertaken by an international consortium of genome sequencing centres and malaria researchers, a series of initiatives has begun to generate substantial genome information from additional Plasmodium species3. We describe here the genome sequence of the rodent malaria parasite P. y. yoelii to fivefold genome coverage. We show that this partial genome sequencing approach, although limited in its application to the study of genome structure, has proved to be an effective means of gene discovery and of jump-starting experimental studies in a model Plasmodium species. Furthermore, we show that despite the considerable divergence between the P. y. yoelii and P. falciparum genomes, sequencing and annotation of the former can substantially improve the accuracy and efficiency of annotation of the latter.

Plasmodium yoelii yoelii genome sequencing and annotation

We applied the whole-genome shotgun (WGS) sequencing approach, used successfully to sequence and assemble the first large eukaryotic genome4, to achieve fivefold sequence coverage of the genome of a clone of the 17XNL line of P. y. yoelii (Table 1). At this level of coverage, the sequence is expected to comprise about 99% of the genome5, assuming random library representation. As with P. falciparum, the genomes of rodent malaria parasites are highly (A + T)-rich6, which adversely affects DNA stability in plasmid libraries. Consequently, all of the approximately 220,000 reads were produced from clones originating from small (2–3 kilobases (kb)) insert libraries. Contigs were assembled using TIGR Assembler7. Contaminating mouse sequences, identified through similarity searches and found to comprise 10% of the total sequence data, were excluded from the analyses. Approximately three-quarters of the contigs could be placed into 2,906 ‘groups’, each group consisting of two or more contigs known to be linked through paired reads as determined by Grouper software7. This produced an average group size of 7.4 kb, approximately 4 kb more than the average contig size. This group size is small compared with the group data produced by other partial eukaryotic genome projects, where extensive use of large insert (linking) libraries has enabled the construction of ordered and orientated ‘scaffolds’8, and it underscores the value of such linking libraries in partial genome projects. The genome size of P. y. yoelii is estimated to be 23 megabases (Mb), in agreement with karyotype data9.
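The expected 99% figure can be reproduced with the standard Poisson model of random shotgun sampling, under which a genome sequenced to c-fold coverage has an expected covered fraction of 1 − e^(−c). The following is a minimal sketch of that calculation only; the function name is hypothetical and the model ignores cloning bias, read-length variation and repeats.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of a genome covered by at least one read.

    Assumes reads are placed uniformly at random (Poisson model), so the
    chance that a given base is missed at c-fold coverage is exp(-c).
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    # Illustrative coverage depths; 5 corresponds to the fivefold coverage
    # described in the text.
    for depth in (1, 3, 5, 8):
        fraction = expected_fraction_covered(depth)
        print(f"{depth}x coverage: {fraction:.4f} of genome expected covered")
```

Under this model fivefold coverage gives an expected covered fraction slightly above 0.99, which is why the assumption of random library representation matters for the estimate.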

Table 1 Plasmodium yoelii yoelii genome coverage statistics

Expression data from the P. y. yoelii transcriptome and proteome were generated to aid in gene identification and annotation of the contigs (Table 1). A total of 13,080 expressed sequence tag (EST) sequences generated from clones of an asexual blood-stage P. y. yoelii complementary DNA library10, in combination with other P. yoelii ESTs and transcript sequences available from public databases, were assembled and used to compile a gene index11 of expressed P. y. yoelii sequences (http://www.tigr.org/tdb/tgi/pygi/). For protein expression data, multidimensional protein identification technology (MudPIT), which combines high-resolution liquid chromatography with tandem mass spectrometry and database searching, was applied to the gametocyte and salivary gland sporozoite proteomes of P. y. yoelii. A total of 1,413 gametocyte and 677 sporozoite peptides were recorded and used for the purposes of gene annotation.

We used two gene-finding programs, GlimmerMExon and Phat12, to predict coding regions in P. y. yoelii. GlimmerMExon is based on the eukaryotic gene finder GlimmerM13, with modifications developed for analysing the short fragments of DNA that result from partial shotgun sequencing. Gene models based on GlimmerMExon and Phat predictions were refined using Combiner. Annotation of predicted gene models used TIGR's fully automated Eukaryotic Genome Control suite of programs. Gene finding and subsequent annotation were limited to 2,960 contigs (each of which is over 2 kb in size), a subset of sequences that contains more than 20 Mb of the genome. A total of 5,878 complete genes and 1,982 partial genes (defined as genes lacking either an annotated start or stop codon) were predicted from the nuclear genome data.

Comparative genome analysis

A comparison of several genome features of P. falciparum and P. y. yoelii is shown in Table 2, demonstrating that many similarities exist between the genomes. Besides the similarly extreme (G + C) compositions, both genomes contain a comparable number of predicted full-length genes, with the higher figure in P. y. yoelii due to an extremely high copy number of variant antigen genes (see below). Where differences between the genomes do exist, such as the (G + C) content of the coding portion of the genomes, incompleteness of the P. y. yoelii genome data, with the associated problems of accurate gene finding in both species, is likely to be a confounding factor. As an indication of this problem, analysis of P. y. yoelii proteomic data identified 83 regions of the genome apparently expressed during sporozoite and/or gametocyte stages but not assigned to a P. y. yoelii gene model (data not shown). Many of these peptide hits lie sufficiently close to a model to indicate a fault with gene boundary prediction rather than a lack of gene prediction per se. Consequently, as with the gene model prediction in P. falciparum, the gene models of P. y. yoelii should be considered preliminary and under revision.

Table 2 Comparison of genome features of P. falciparum and P. y. yoelii

Identifying orthologues of P. falciparum vaccine candidate proteins and proteins that are either targets of antimalarial drugs or involved in antimalarial drug resistance mechanisms is a primary goal of model malaria parasite genomics. Using BLASTP14 with a cutoff E value of 10^-15 and no low-complexity filtering, 3,310 bi-directional orthologues (defined as genes related to each other through vertical evolutionary descent) can be identified in the full protein complement of P. falciparum (5,268 proteins) and the protein complement of P. y. yoelii translated from complete gene models (5,878 proteins). A list of vaccine candidate orthologues and orthologues of genes involved in antimalarial drug interactions identified from among the 3,310 orthologues and from additional BLAST analyses is shown in Table 3. Those genes that are not identifiable may either be absent from the partial genome data, or represent genes that have been lost or diverged sufficiently that they are undetectable through similarity searching.
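As an illustration of how bi-directional matches can be derived from two sets of BLASTP results, the sketch below keeps a pair only when each protein is the other's best-scoring hit below the E-value cutoff. It assumes that "bi-directional" here means reciprocal best hits; the function names, the tuple layout and the protein identifiers in the demonstration are hypothetical and are not taken from the study's pipeline.

```python
from typing import Dict, Iterable, Set, Tuple

# One similarity-search hit: (query_id, subject_id, e_value, bit_score).
Hit = Tuple[str, str, float, float]


def best_hits(hits: Iterable[Hit], max_evalue: float = 1e-15) -> Dict[str, str]:
    """Map each query to its highest-scoring subject at or below the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue, score in hits:
        if evalue > max_evalue:
            continue  # hit is weaker than the E-value cutoff
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit], max_evalue: float = 1e-15
) -> Set[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


if __name__ == "__main__":
    # Toy data with made-up identifiers, for illustration only.
    pf_vs_py = [
        ("PfA", "Py1", 1e-50, 200.0),
        ("PfA", "Py2", 1e-20, 90.0),
        ("PfB", "Py2", 1e-30, 120.0),
        ("PfC", "Py3", 1e-5, 40.0),   # fails the E-value cutoff
    ]
    py_vs_pf = [
        ("Py1", "PfA", 1e-48, 198.0),
        ("Py2", "PfA", 1e-20, 90.0),
        ("Py3", "PfC", 1e-5, 40.0),
    ]
    print(reciprocal_best_hits(pf_vs_py, py_vs_pf))
```

Members of large paralogous families rarely form reciprocal pairs under such a scheme, which is one reason a strict bi-directional criterion tends to under-count expanded gene families.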

Table 3 P. y. yoelii orthologues of P. falciparum candidate vaccine and drug interaction genes

Many of the candidate vaccine antigens under study in P. falciparum can be identified in P. y. yoelii, including orthologues of several asexual blood-stage antigens known to elicit immune responses in individuals exposed to natural infection (MSP1, AMA1, RAP1, RAP2). As immunity to P. falciparum blood-stage infection can be transferred by immune sera, identification of the targets of potentially protective antibody responses after natural infection can provide information beneficial to the selection of candidate antigens for malaria vaccines. We found several orthologues of known P. falciparum transmission-blocking candidates; in particular, members of the P48/45 gene family identified previously15 were confirmed.

We identified several P. y. yoelii orthologues of P. falciparum biochemical pathway components under study as targets for drug design (Table 3), most notably: (1) the 1-deoxy-d-xylulose 5-phosphate reductoisomerase (DOXPR) gene whose product is inhibited by fosmidomycin in P. falciparum in vitro cultures and mice infected with P. vinckei16; (2) enoyl-acyl carrier protein (ACP) reductase (FABI) whose product is inhibited by triclosan in P. falciparum in vitro cultures and mice infected with P. berghei17; and (3) a gene encoding farnesyl transferase (FTASE), which is inhibited in cultures of P. falciparum treated with custom-designed peptidomimetics18. The rodent models of malaria have proved invaluable both for the study of potency of new antimalarial compounds in vivo, and for the elucidation of mechanisms of antimalarial drug resistance.

We applied the Gene Ontology (GO) gene classification system19, which uses a controlled vocabulary to describe genes and their function, to indicate which classes of gene among the 3,310 orthologues might differ in number between P. falciparum and P. y. yoelii (Fig. 1). A similar proportion of proteins were identified for most of the GO classes between the two species, with the caveat that fewer total numbers of proteins were identified in P. y. yoelii owing to the partial nature of the genome data for this species. However, proteins allocated to the physiological processes, cell invasion and adhesion, and cell communication categories were significantly reduced in P. y. yoelii. These classes contain members of three multigene families whose genes are found predominantly in the subtelomeric regions of P. falciparum chromosomes: PfEMP1, the protein product of the var gene family known to be involved in antigenic variation, cyto-adherence and rosetting, and rifins and stevors, which are clonally variant proteins possibly involved in antigenic variation and evasion of immune responses (reviewed in ref. 20). Apparently, P. falciparum has generated species-specific, subtelomeric genes involved in host cell invasion, adhesion and antigenic variation, homologues of which are not found in the P. y. yoelii genome.

Figure 1: Functional classification comparison between P. falciparum and P. y. yoelii proteins.

We compared the GO terms of proteins assigned to ‘biological process’ for the orthologous genes identified between the two species. The process group contains 3,041 P. falciparum annotations (filled bars), and 2,161 reciprocal annotations are shown for P. y. yoelii (open bars). Ten GO classes with similar numbers of P. falciparum and P. y. yoelii proteins in each are assigned as ‘miscellaneous’; that is, cell cycle, external stimulus response, stress response, signal transduction, homeostasis, developmental processes, cell proliferation, membrane fusion, death, cell motility.

Gene families of unique interest in the P. y. yoelii genome

The largest family of genes identified in the P. y. yoelii genome is the yir gene family, homologues of the vir multigene family recently described in the human malaria parasite Plasmodium vivax21 and in other species of rodent malaria22. In P. vivax, an estimated 600–1,000 copies of the subtelomerically located vir gene encode proteins that are immunovariant in natural infections, indicating a possible functional role in antigenic variation and immune evasion. Within the P. y. yoelii genome data, 838 yir genes (693 full genes and 145 partial genes) are present (Table 4; see also Supplementary Figs A and B). Almost 75% of the annotated contigs identified as containing subtelomeric sequences (see below) contain yir genes, many arranged in a head-to-tail fashion. Expression data indicate that yir genes are expressed during sporozoite, gametocyte and erythrocytic stages of the parasite, similar to the expression pattern seen with P. falciparum var and rif genes23. Preliminary results using antibodies developed against the conserved regions of the protein have confirmed protein localization at the surface of the infected red blood cell (D.A.C. et al., manuscript in preparation). The number of gene copies in the P. y. yoelii genome, the localization and stage-specific expression of gene members, as well as the existence of homologues in other Plasmodium species, make this gene family a prime target for the study of mechanisms of immune evasion.

Table 4 Paralogous gene families in P. y. yoelii

A maximum of 14 members of the Py235 multigene family can be identified among the P. y. yoelii protein data (Table 4). This family expresses proteins that localize to rhoptries (organelles that contain proteins involved in parasite recognition and invasion of host red blood cells). Py235 genes exhibit a newly discovered form of clonal antigenic variation, whereby each individual merozoite derived from a single parent schizont has the propensity to express a different Py235 protein24. Closely related homologues of the Py235 gene family have been found in other rodent malaria species, and more distantly related homologues have been found in P. vivax25 and P. falciparum26. The gene copy number identified in the current data set is lower than that predicted in other P. y. yoelii lines (30–50 per genome). This could reflect real differences in copy number between lines, an error in the original estimate, or misassembly of extremely closely related sequences. Almost all of the Py235 genes are found on contigs identified as subtelomeric in the P. y. yoelii genome (see Supplementary Fig. C).

Four further paralogous gene families, pyst-a to -d, are specific to P. y. yoelii (Table 4). The pyst-a family deserves mention, as it is homologous to a P. chabaudi glutamate-rich protein27 and to a single hypothetical gene on P. falciparum chromosome 14, suggesting expansion of this family in the rodent malaria species from a common ancestral Plasmodium gene. Two paralogous gene families containing multiple members are homologous to multigene families identified in P. falciparum. Gene members of one family, etramp (early transcribed membrane protein), have previously been identified in P. falciparum28 and in P. chabaudi where a single member has been identified and localized to the parasitophorous vacuole membrane29.

Telomeres and chromosomal exchange in subtelomeric regions

The telomeric repeat in P. y. yoelii is AACCCTG, which differs from the P. falciparum telomeric repeat AACCCTA by one nucleotide. A total of 71 contigs were found to contain telomeric repeat sequences arranged in tandem, with the largest array consisting of 186 copies. The P. y. yoelii subtelomeric chromosomal regions show little repeat structure compared with those of P. falciparum. A survey of tandem repeats in the entire genome found only a few in the telomeric or subtelomeric regions, specifically a 15-base-pair (bp) repeat (45 copies) and a 31-bp repeat (up to 10 copies), both of which were found on multiple contigs, and a 36-bp repeat that occurred on one contig. No repeat element that corresponds to Rep20, a highly variable 21-bp unit that spans up to 22 kb in P. falciparum telomeres, was found.
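A simple way to locate tandem arrays of the AACCCTG unit in a contig is a regular-expression scan, sketched below. This is an illustrative assumption rather than the method used in the study: the function name is hypothetical, only perfect copies on the given strand are counted, and the reverse-complement unit (CAGGGTT) would need a second pass.

```python
import re
from typing import List, Tuple


def tandem_arrays(
    seq: str, unit: str = "AACCCTG", min_copies: int = 2
) -> List[Tuple[int, int]]:
    """Return (start, copies) for each perfect tandem run of `unit` in `seq`.

    `start` is the 0-based offset of the run; `copies` is the number of
    consecutive exact copies of the repeat unit.
    """
    # Builds a pattern such as (?:AACCCTG){2,} for the default arguments.
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    return [
        (match.start(), len(match.group(0)) // len(unit))
        for match in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    # Synthetic sequence for illustration: a 5-copy array, an isolated
    # single copy (ignored because min_copies is 2) and a 3-copy array.
    demo = "GATTACA" + "AACCCTG" * 5 + "TTTT" + "AACCCTG" + "CC" + "AACCCTG" * 3
    for start, copies in tandem_arrays(demo):
        print(f"array at offset {start}: {copies} copies")
```

Real telomeric arrays usually contain degenerate copies, so a production analysis would rely on a dedicated tool that tolerates mismatches and indels.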

The telomeric and subtelomeric regions of P. y. yoelii contigs show extensive large-scale similarity, indicating that these regions undergo chromosomal exchange similar to that reported for P. falciparum (see ref. 30). The longest subtelomeric contig is approximately 27 kb (see Supplementary Fig. C) and is homologous to other subtelomeric contigs across its entire length, indicating that the region of chromosomal exchange extends at least this distance into the subtelomeres. Recent data have shown that clustering of telomeres at the nuclear periphery in asexual and sexual stage P. falciparum parasites may promote sequence exchange between members of subtelomeric virulence genes on heterologous chromosomes, resulting in diversification of antigenic and adhesive phenotypes (see ref. 31 for review). The suggestion of extensive chromosome exchange in P. y. yoelii indicates that a similar system for generating antigenic diversity of the yir, Py235 and other gene families located within subtelomeric regions may exist.

A genome-wide synteny map

The Plasmodium lineage is estimated to have arisen some 100–180 million years ago32, and species of the parasite are known to infect birds, mammals and reptiles33. On the basis of the analysis of small subunit (SSU) ribosomal RNA sequences, the closest relative to P. falciparum is Plasmodium reichenowi, a parasite of chimpanzees, with the rodent malaria species forming a distinct clade34,35. Early gene mapping studies have shown that regions of gene synteny exist between species of rodent malaria9 and between human malaria species36,37, despite extensive chromosome size polymorphisms between homologous chromosomes38. This level of gene synteny seems to decrease as the phylogenetic distance between Plasmodium species increases39. Before the Plasmodium genome sequencing projects, the degree to which conservation of synteny extended across Plasmodium genomes was not fully apparent.

Using the P. falciparum and P. y. yoelii genome data, we have constructed a genome-wide syntenic map between the species. To avoid confounding factors inherent in DNA-based analyses of (A + T)-rich genomes, we first calculated the protein similarity between all possible protein-coding regions in both data sets using MUMmer40. Sensitivity was ensured through the use of a minimum word match length of five amino acids chosen to identify seed maximal unique matches (MUMs). By comparison, the recent human–mouse synteny analysis used a match length of 11 (ref. 8). Using this method, which is independent of gene prediction data, 2,212 sequences could be aligned (tiled) to P. falciparum chromosomes, representing a cumulative length of 16.4 Mb of sequence, or over 70% of the P. y. yoelii genome (see Supplementary Table C). The per cent of each P. falciparum chromosome covered with P. y. yoelii matches varies from 12% (chromosome 4) to 22% (chromosomes 1 and 14), with an average of about 18%. The spatial arrangement of the tiling paths (see Fig. 1 in ref. 30) confirms previous suggestions9 that most of the conserved matches are found within the body of Plasmodium chromosomes, and confirms the absence of var, rif and stevor homologues in the P. y. yoelii genome.

Although the tiling paths indicate the degree of conservation of gene order between P. falciparum and P. y. yoelii, longer stretches of contiguous P. y. yoelii sequence are necessary to examine this feature in depth. Accordingly, we carried out linkage of many P. y. yoelii assemblies adjacent to each other along the tiling paths. First, 1,050 adjacent contigs were linked on the basis of paired reads as determined by Grouper software. Second, P. y. yoelii ESTs were aligned to the tiling paths, and those found to overlap sequences adjacent in the tiling path were used as evidence to link a further 236 P. y. yoelii sequences. Third, amplification of the sequence between adjacent contigs in the tiling paths linked a further 817 assemblies. Linkage of P. y. yoelii sequences by these methods resulted in the formation of 457 syntenic groups from 2,212 original contigs, ranging in length from a few kilobases to more than 800 kb. Syntenic groups were assigned to a P. y. yoelii chromosome where possible through the use of a partial physical map9. Thus, long contiguous sections of the P. y. yoelii genome with accompanying P. y. yoelii chromosomal location can be assigned to each P. falciparum chromosome (see Fig. 1 in ref. 30). The degree of conservation of gene order between the species was examined using ordered and orientated syntenic groups and Position Effect software. Of 4,300 P. y. yoelii genes within the syntenic groups, 3,145 (73%) were found to match a region of P. falciparum in conserved order.
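Each of the three lines of evidence described above (paired reads, overlapping ESTs and amplification between adjacent contigs) yields pairwise links between contigs, and merging those links into groups amounts to finding connected components. The sketch below shows one way to do this with a union-find structure; it is not the Grouper implementation, the names are hypothetical, and it ignores the order and orientation of contigs within a group.

```python
from typing import Dict, Iterable, List, Tuple


def link_contigs(
    contigs: Iterable[str], links: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Group contigs into connected components given pairwise links."""
    parent: Dict[str, str] = {contig: contig for contig in contigs}

    def find(node: str) -> str:
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in links:
        # Contigs that appear only in a link are registered on the fly.
        parent.setdefault(left, left)
        parent.setdefault(right, right)
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups: Dict[str, List[str]] = {}
    for contig in parent:
        groups.setdefault(find(contig), []).append(contig)
    # Sort for a deterministic, readable result.
    return sorted((sorted(group) for group in groups.values()), key=lambda g: g[0])


if __name__ == "__main__":
    # Made-up contig names and links, for illustration only.
    names = ["c1", "c2", "c3", "c4", "c5"]
    evidence = [("c1", "c2"), ("c3", "c2")]
    print(link_contigs(names, evidence))
```

Because links from different evidence types are simply pooled, a single spurious link can merge two unrelated groups, so in practice each link type would be filtered before merging.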

One section of the syntenic map between P. falciparum and P. y. yoelii in particular (associated with P. falciparum chromosomes 4 and 10 and P. y. yoelii chromosome 5) provides a detailed snapshot of synteny between the species. Chromosome 5 of P. y. yoelii has received particular attention owing to the localization of a number of sexual-stage-specific genes to it41, and because truncated versions of the chromosome are found in lines of the rodent malaria parasite P. berghei that are defective in gametocytogenesis42. Genomic resources available for P. berghei chromosome 5 include chromosome markers and long-range restriction maps41. Exploiting the high level of synteny of rodent malaria parasite chromosomes9, these tools were applied in combination with further mapping studies to close the syntenic map of chromosome 5 of P. y. yoelii (Fig. 2).

Figure 2: Conservation of gene synteny between P. y. yoelii chromosome 5 and P. falciparum chromosomes 4 and 10.

Physical marker data used to confirm contig order in the tiling path of P. y. yoelii chromosome 5 are shown above the contigs (open boxes). Each coloured line represents a pair of orthologous genes present in the two species shown anchored to its respective location in the two genomes. Contigs containing the P. y. yoelii rRNA unit are shown as filled boxes.

Approximately 0.8 Mb of P. y. yoelii chromosome 5 (estimated total length of 1.5 Mb) could be linked into one group that is syntenic to P. falciparum chromosome 10 and P. falciparum chromosome 4. From a total of 243 genes predicted in the syntenic region of P. falciparum chromosome 10, and 34 genes predicted in the syntenic region of chromosome 4, 171 (70%) and 22 (65%) of these, respectively, have homologues along P. y. yoelii chromosome 5 that appear in the same order. Pairs of homologous genes that map to regions of conserved synteny between P. y. yoelii and P. falciparum are probably orthologues, an inference supported by the finding that most of these homologous pairs are also reciprocal best matches between the P. falciparum and P. y. yoelii proteins. Genes in the synteny gap on chromosome 10 (Fig. 2) include a glutamate-rich protein, S antigen, MSP3, MSP6 and liver stage antigen 1, several of which are prime vaccine antigen candidates in P. falciparum. Genes in the synteny gap on chromosome 4 include four var and two rif genes, which make up one of the four internal clusters of var/rif genes found in P. falciparum (see ref. 30). A series of uncharacterized hypothetical genes occur on the contigs that overlap these regions in P. y. yoelii.

An intriguing finding from the study of chromosome 5 has been the analysis of the syntenic break point between P. falciparum chromosomes 4 and 10. The final P. y. yoelii contig in the tiling path with significant synteny to P. falciparum chromosome 10 also contains the external transcribed sequence (ETS) of the SSU rRNA C unit. The synteny resumes on P. falciparum chromosome 4 in a P. y. yoelii contig that also contains the ETS of the large subunit (LSU) of the same rRNA unit. (No rRNA unit sequences are located on P. falciparum chromosomes 4 and 10; matches to contigs containing these genes occur in coding regions of other genes.) Both P. y. yoelii contigs are linked to each other through a third contig that contains the remaining elements (SSU, 5.8S, LSU, and internal transcribed sequences 1 and 2) of the complete rRNA unit (Fig. 2). Thus it seems that the break in synteny between Plasmodium chromosomes has occurred within a single rRNA unit, a phenomenon first reported in prokaryotes43. Six rRNA units reside as individual operons on P. falciparum chromosomes 1, 5, 7, 8, 11 and 13 respectively (ref. 30), in contrast to rodent malaria species that have four44. Intriguingly, breaks in the synteny between P. y. yoelii and P. falciparum can be mapped to almost all rRNA unit loci on the P. falciparum chromosomes (see Fig. 1 in ref. 30). A full analysis of this potential phenomenon is outside the scope of this study, but these results provide preliminary evidence for one possible mechanism underlying synteny breakage that may have occurred during evolution of the Plasmodium genus—that of chromosome breakage and recombination at sites of rRNA units.

Comparative alignment of syntenic regions

Recent comparative studies have revealed that the fine detail of short stretches of the rodent and human malaria parasite genomes is remarkably conserved45, and that such comparisons are useful for gene prediction and evolutionary studies. Accordingly, we used a comparison of the longest assembly of P. y. yoelii (MALPY00395, 51.3 kb) and its syntenic region in P. falciparum (chromosome 7, at coordinates 1,131–1,183 kb) as a case study for a preliminary evolutionary analysis of the two genomes. Gene prediction programs run against these two regions identified 11 genes in the syntenic region of both species (Fig. 3), eight of which are orthologous gene pairs (genes 1, 3–8 and 10). The structures of two additional gene pairs (genes 2a/b and 9) were refined through manual curation of erroneous gene boundaries. Three hypothetical genes, two in P. falciparum and one in P. y. yoelii, had no discernible orthologue in the other species; the presence of multiple stop codons in these areas suggests that the genes may have become pseudogenes. A global alignment at the DNA level of the syntenic region (Fig. 3) reveals the similarity between species in intergenic regions to be almost negligible, in contrast to similar syntenic comparisons of mouse and human46,47. Moreover, the mutation saturation observed in intergenic regions suggests that ‘phylogenetic footprinting’ can be used to identify conserved motifs between species that may be involved in gene regulation.

Figure 3: Global alignment scheme of a syntenic region between P. falciparum and P. y. yoelii encompassing ten orthologous gene pairs and nine intergenic regions.

White boxes represent genes that have no orthologue and were excluded from analysis; green boxes represent gene models that were refined; red boxes represent unaltered gene models; arrowheads represent gene orientation on the DNA molecule. Clusters of MUMmer matches between the two species are represented as thick blue lines. For the ten orthologous gene pairs, synonymous mutations per synonymous site (dS, open bars) and non-synonymous mutations per non-synonymous site (dN, filled bars) were estimated and plotted.

In contrast to intergenic regions, the similarity between species in coding regions is relatively high. The average number of non-synonymous substitutions per non-synonymous site, dN, between the two species is 26% (± 12%). Synonymous sites, dS, are saturated (average dS > 1), which supports the lack of similarity observed within intergenic regions. These values are considerably higher than those reported for human–rodent comparisons, which are approximately 7.5% and 45% for non-synonymous and synonymous substitutions, respectively48. The cause of such apparent disparities remains unknown, but may be a consequence of extreme genome composition or the short generation time of the parasite.
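For reference, the Nei and Gojobori method named in the Methods converts observed proportions of differences into distances with a Jukes–Cantor correction. The notation below is introduced here for illustration and is not taken from the paper: S_d and N_d are the numbers of synonymous and non-synonymous differences between two aligned coding sequences, and S and N are the corresponding numbers of sites.

```latex
p_S = \frac{S_d}{S}, \qquad
p_N = \frac{N_d}{N}, \qquad
d_S = -\frac{3}{4}\,\ln\!\left(1 - \frac{4}{3}\,p_S\right), \qquad
d_N = -\frac{3}{4}\,\ln\!\left(1 - \frac{4}{3}\,p_N\right)
```

The correction diverges as the observed proportion approaches 3/4, and d_S exceeds 1 once p_S is above roughly 0.55, so estimates in the saturated range are better read as lower bounds than as precise distances.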

Rodent malaria species as models for P. falciparum biology

The usefulness of rodent malaria species as models for the study of P. falciparum is controversial. It is apparent that rodent models are the first port of call when preliminary in vivo evidence of antimalarial drug efficacy, immune response to vaccine candidates, and life-cycle adaptations in the face of drug or vaccine challenge are required. Different species of malaria parasite have developed different mechanisms of resistance to the antimalarial drug chloroquine, despite a similar mode of action of the drug (reviewed in ref. 49). It seems that mechanisms developed by the parasite to evade an inhospitable environment, whether caused by antimalarial drugs or the host immune system, may differ widely from species to species. A model involving evolution of different genes in Plasmodium species as a response to different host environments is consistent with the comparison of the P. falciparum and P. y. yoelii genomes presented here; conservation of synteny between the two species is high in regions of housekeeping genes, but not in regions where genes involved in antigenic variation and evasion of the host immune system are located. On the one hand, this can be interpreted as a blow to the systematic identification of all orthologues of antigen genes between P. falciparum and P. y. yoelii that could be used in the design of a malaria vaccine. On the other hand, a picture is emerging of selecting a model malaria species based on the complement of genes that best fit the phenotypic trait under study. Thus the presence of homologues of the yir family may make P. y. yoelii an attractive model for studying antigenic variation in P. vivax. Furthermore, identification of orthologues in the genomes of relatively distant rodent and human malaria parasites will facilitate finding orthologues in other model malaria species, for example monkey models of malaria such as Plasmodium knowlesi.

Methods

Genome and EST sequencing

Plasmodium yoelii yoelii 17XNL line50, selected from an isolate taken from the blood of a wild-caught thicket rat in the Central African Republic51, is a non-lethal strain with a preference for development in reticulocytes. Clone 1.1 was obtained through serial dilution of sporozoites. Parasites were grown in laboratory mice no more than three blood passages from mosquito passage to limit chromosome instability, collected by exsanguination into heparin, and host mouse leukocytes were removed by filtration. Small insert libraries (average insert size 1.6 kb) were constructed in pUC-derived vectors after nebulization of genomic DNA. DNA sequencing of plasmid ends used ABI Big Dye terminator chemistry on ABI3700 sequencing machines. A total of 222,716 sequences (82% success rate), averaging 662 nucleotides in length, were assembled using TIGR Assembler7. BLASTN of the P. y. yoelii contigs and singletons against the complete set of Celera mouse contigs8, using a cutoff of 90% identity over 100 nucleotides, identified contaminating mouse sequences that were subsequently removed. Contigs were assigned to groups using Grouper52. Each contig was assigned an identifier in the format ‘MALPY00001’.
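The contamination screen described above can be expressed as a simple filter over similarity-search results. The sketch below assumes tab-separated BLASTN output in the modern 12-column tabular layout (query, subject, per cent identity, alignment length, and so on), which is an assumption about file format rather than a description of the original pipeline; it also interprets the cutoff as at least 90% identity over an alignment of at least 100 nucleotides. The function name and sequence identifiers are hypothetical.

```python
from typing import Iterable, Set


def contaminated_queries(
    blast_tab_lines: Iterable[str],
    min_identity: float = 90.0,
    min_length: int = 100,
) -> Set[str]:
    """Return query IDs with at least one hit passing both thresholds.

    Expects tab-separated lines whose first four columns are:
    query ID, subject ID, per cent identity, alignment length.
    """
    flagged: Set[str] = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query = fields[0]
        identity = float(fields[2])
        length = int(fields[3])
        if identity >= min_identity and length >= min_length:
            flagged.add(query)
    return flagged


if __name__ == "__main__":
    # Made-up hits for illustration; only the first four columns are used.
    demo = [
        "contigA\tmouse_1\t98.5\t450\t7\t0\t1\t450\t1000\t1449\t0.0\t800",
        "contigB\tmouse_2\t92.0\t60\t5\t0\t1\t60\t200\t259\t1e-15\t90",
        "contigC\tmouse_3\t85.0\t300\t45\t0\t1\t300\t10\t309\t1e-60\t250",
    ]
    print(sorted(contaminated_queries(demo)))
```

Only the first sequence in the demonstration passes both thresholds; the second alignment is too short and the third falls below the identity cutoff.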

Proteomic analysis

MudPIT technology and methods were as described in ref. 23. Sporozoites of P. y. yoelii were dissected from infected Anopheles stephensi mosquito salivary glands, and P. y. yoelii gametocytes were prepared as described53. Cellular debris from uninfected mosquitoes and mouse erythrocytes were analysed as controls. Tandem mass spectrometry (MS/MS) data sets were searched against several databases: the complete set of P. y. yoelii full and partial proteins (7,860 total); 791,324 P. y. yoelii open reading frames (stop-to-stop ORFs over 15 amino acids and start-to-stop ORFs over 100 amino acids); 57,885 ORFs from NCBI's RefSeq for human, mouse and rat; 15,570 Anopheles, Aedes and Drosophila melanogaster proteins from GenBank; and 165 common protein contaminants (for example, trypsin, bovine serum albumin).

Gene finding and annotation

The splice site recognition module of GlimmerMExon was trained specifically for P. yoelii genome data, using DNA sequences extracted from a set of 1,166 donor and 1,166 acceptor sites confirmed by P. y. yoelii ESTs. Phat and the exon recognition module of GlimmerMExon were trained on P. falciparum data as described (see ref. 54). Combiner was used to generate a final ranked list of P. y. yoelii gene models, and TIGR's Eukaryotic Genome Control suite of programs was used for automated annotation of these (both described in ref. 54). Automated gene names were assigned to proteins by taking the ‘equivalogue’ name of the hidden Markov model (HMM) associated with the protein where possible, or where no HMM was assigned, on the basis of the best-paired alignment. Each protein was assigned an identifier in the format ‘PY00001’.

Paralogous gene families

Proteins encoded by multigene families were identified by a domain-based clustering algorithm developed at TIGR. Families were regarded as potentially Plasmodium- or yoelii-specific if they were not described by any Pfam55 or TIGRFAM56 domains and if the automatic annotation process had not ascribed names corresponding to widely distributed proteins. HMMs for these families were built using the HMMER package version 2.1.1 (ref. 57). Newly constructed models were then used to search the P. yoelii, P. falciparum and GenBank databases to define the scope of the families.

Telomeric/subtelomeric repeat analysis

Subtelomeric contigs were identified through alignment using MUMmer2 (ref. 40) with a minimum exact match length of 30 to 40 bases. Tandem Repeat Finder58 was run with the following settings: match = 2, mismatch = 7, PM (match probability) = 75, PI (indel probability) = 10, minscore = 400, max period = 700.

Comparative analyses

Gene model predictions in the syntenic region of P. falciparum chromosome 7 were inspected manually, and bi-directional best hits between gene models that respected conserved syntenies were selected. A global alignment of the two sequences was calculated using Owen59, and nucleotide sequences of predicted gene models were aligned using CLUSTALW60 with default parameters, and refined manually. The numbers of substitutions per synonymous site (dS) and per non-synonymous site (dN) were estimated using the Nei and Gojobori method61. Conservation of gene order was established using Position Effect (http://www.tigr.org/software), where matches between P. falciparum and P. y. yoelii genes were calculated using BLASTP with a cutoff E value of 10^-15. The query and hit gene from each match were defined as anchor points in gene sets composed of adjacent genes. Up to ten genes upstream and downstream from each anchor gene were used in creating the gene set. An optimal alignment was calculated between the ordered gene sets using BLASTP per cent similarity scores and a linear gap penalty. Low-scoring alignments with a cumulative per cent similarity less than 100 were not used. Each optimal alignment provided a list of matching genes in conserved order between P. falciparum and P. y. yoelii.
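The gene-order step can be pictured as a global alignment in which the "letters" are genes rather than residues. The sketch below is a generic dynamic-programming alignment written for illustration, not the Position Effect implementation: the function name, the gap penalty of 10 and the gene identifiers are hypothetical, and the cutoff is interpreted as the sum of per cent similarity over the matched pairs.

```python
from typing import Dict, List, Sequence, Tuple


def align_gene_order(
    genes_a: Sequence[str],
    genes_b: Sequence[str],
    similarity: Dict[Tuple[str, str], float],
    gap_penalty: float = 10.0,   # assumed value; not given in the text
    min_total: float = 100.0,    # cumulative per cent similarity cutoff
) -> List[Tuple[str, str]]:
    """Globally align two ordered gene lists; return matched gene pairs."""
    n, m = len(genes_a), len(genes_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap_penalty * i
    for j in range(1, m + 1):
        score[0][j] = -gap_penalty * j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = similarity.get((genes_a[i - 1], genes_b[j - 1]))
            # Genes without a recorded match can never be paired.
            diagonal = score[i - 1][j - 1] + sim if sim is not None else float("-inf")
            score[i][j] = max(
                diagonal,
                score[i - 1][j] - gap_penalty,
                score[i][j - 1] - gap_penalty,
            )

    # Trace back from the corner to recover the matched pairs.
    pairs: List[Tuple[str, str]] = []
    i, j = n, m
    while i > 0 and j > 0:
        sim = similarity.get((genes_a[i - 1], genes_b[j - 1]))
        if sim is not None and score[i][j] == score[i - 1][j - 1] + sim:
            pairs.append((genes_a[i - 1], genes_b[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap_penalty:
            i -= 1
        else:
            j -= 1
    pairs.reverse()

    total = sum(similarity[pair] for pair in pairs)
    return pairs if total >= min_total else []


if __name__ == "__main__":
    # Made-up gene sets and similarity scores, for illustration only.
    set_a = ["a1", "a2", "a3", "a4"]
    set_b = ["b1", "b3", "bX", "b4"]
    sims = {("a1", "b1"): 90.0, ("a3", "b3"): 80.0,
            ("a4", "b4"): 70.0, ("a2", "bX"): 60.0}
    print(align_gene_order(set_a, set_b, sims))
```

In the demonstration the a2/bX match is dropped because pairing it would break the order of the stronger a3/b3 match, which is the behaviour a conserved-order criterion is meant to enforce.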