Plasmodium malariae and P. ovale genomes provide insights into malaria parasite evolution

Elucidation of the evolutionary history and interrelatedness of Plasmodium species that infect humans has been hampered by a lack of genetic information for three human-infective species: P. malariae and two P. ovale species (P. o. curtisi and P. o. wallikeri)1. These species are prevalent across most regions in which malaria is endemic2,3 and are often undetectable by light microscopy4, rendering their study in human populations difficult5. The exact evolutionary relationship of these species to the other human-infective species has been contested6,7. Using a new reference genome for P. malariae and a manually curated draft P. o. curtisi genome, we are now able to accurately place these species within the Plasmodium phylogeny. Sequencing of a P. malariae relative that infects chimpanzees reveals similar signatures of selection in the P. malariae lineage to another Plasmodium lineage shown to be capable of colonization of both human and chimpanzee hosts. Molecular dating suggests that these host adaptations occurred over similar evolutionary timescales. In addition to the core genome that is conserved between species, differences in gene content can be linked to their specific biology. The genome suggests that P. malariae expresses a family of heterodimeric proteins on its surface that have structural similarities to a protein crucial for invasion of red blood cells. The data presented here provide insight into the evolution of the Plasmodium genus as a whole.

2). This compares to about 4% of P. falciparum infections being co-infections with P. vivax using the same method, with an average of about 10% in Southeast Asia, which is in line with known estimates 14 . We also found a number of infections containing three species. We found P. malariae and P. ovale solely in African countries, with Asian samples only containing P. vivax co-infections. This could in part be due to the known differences in prevalence of P. malariae between West Africa and other regions 9 and may be more evidence of antagonistic interactions between P. vivax and the other species 8 . Another explanation could be geographical differences in mitochondrial SNPs between the sampling locations in Southeast Asia and our available samples. The presence of the two species in about 2% of infections is in line with similar estimates using other detection methods 9 . However, due to the sampling bias of only looking within clinical P. falciparum infections and due to possible differences in mitochondrial SNPs across geographical regions, the 2% prevalence found for the two species is likely to be a significant underestimate.

II. Genome Assemblies
A 33.6 megabase (Mb) reference genome of P. malariae was produced from clinically isolated parasites and sequenced using Pacific BioSciences long-read sequencing technology (Methods). The assembled sequence comprised 14 supercontigs representing the 14 chromosomes, with 6 chromosome ends extending into telomeres, and a further 47 unassigned subtelomeric contigs containing an additional 11 telomeric sequences (Table 1) wallikeri, each assembly comprising fewer than 800 scaffolds and both being more contiguous than concurrently available alternatives 15 Table 1).
The genomes are significantly larger than previously sequenced Plasmodium species and, like P. vivax, have isochore structures with a higher AT content in the subtelomeres, which comprise ~40% of the genome. In addition, a P. malariae-like genome was produced using Illumina sequencing from parasites isolated from a chimpanzee co-infected with P. reichenowi. The P. malariae-like genome was more fragmented than the other assemblies and its 23.7 Mb sequence misses most subtelomeric regions due to whole genome amplification bias prior to sequencing. In comparison to the draft genomes produced by Ansari et al 15 (Extended Data   Table 1), three of the genomes assembled in the present study (PmUG01, PocGH01, PowCR01) are similar in size but significantly more contiguous. The genome of a chimpanzee-infecting species known as P. malariae-like (PmlGA01) is unique to the present study but lacks good coverage of the subtelomeric regions due to biased template representation introduced by the whole genome amplification process.
The assembly of P. malariae PmUG01 is based on long reads and comprises just 63 pieces. It has no gaps and surpasses the other assemblies according to all metrics reported (Extended Data Table 1). The assemblies of the present study are more contiguous, especially in the subtelomeric regions of the genomes.
The manual curation of gene models in the present study made a clear difference to the annotation. In addition to the annotation of pseudogenes, we were able to identify approximately 10% more genes as clear 1:1 orthologues of genes in both P. falciparum and P. vivax. Indeed, in terms of this metric, other available assemblies 15 are similar to the draft assembly of P. malariae-like. The highly conserved genes are especially important for cross-species comparisons and analyses. Using 1:1 orthologues as indicators of genes within the conserved core regions of chromosomes, we see that our assemblies have about 20% more short genes (less than 100 codons) annotated (averages: 100 versus 82), being thereby more similar in number to those seen in P. falciparum (101). Multi-exon genes are notoriously difficult to annotate; looking at the number of 1:1 orthologues in the different assemblies to P. vivax and P. falciparum genes with over 7 exons (300), the genomes of this study (excluding PmlGA01) have both 10% more 1:1 orthologues annotated and less variable median lengths between assemblies (range: 462-478 versus 368-500 15 ). The large number of partial genes observed in some of the genomes assembled by Ansari et al 15 is due to the higher amount of genes truncated by contig boundaries.
The subtelomeres in P. malariae and P. ovale required significant manual curation due to the high number of pseudogenes present and the ease in which exons can mistakenly be missed during annotation. Excluding PmlGA01 that lacks most subtelomeric regions, all assemblies in the present study have significantly more genes annotated as pseudogenes than those by Ansari et al 15 (averages: 869 versus 7). The latter study also reports more short genes than our assemblies (averages: 640 versus 81), suggesting potential problems with gene models. Finally, the high gene numbers reported for the assemblies in Ansari et al 15 can largely be attributed to putative subtelomeric genes, most of which are short with no assigned function and therefore have an increased likelihood of being spurious.

III. Hypnozoite Genes
Both P. ovale species are able to form hypnozoites, similar to P. vivax 16  Looking at genes previously suggested to be involved in hypnozoite formation 17 , we did not find P. o. curtisi orthologues of the three genes shared exclusively by P. vivax and P. cynomolgi that contain sporozoite-specific ApiAP2 motifs. However, we do find such motifs in six out of nine dormancy related genes 17 , including Ran (PocGH01_09023900) previously identified in a P. vivax screen for potential hypnozoite genes 16 (Supplementary Table 3).

IV. Phylogenetic Analyses
A number of conflicting phylogenetic trees of the Plasmodium genus have been published that differ in their placement of P. ovale and P. malariae. The two most commonly reported topologies either place P. ovale as a sister taxon to the rodent malaria parasites 19 (Tree A) or as an outgroup to P. malariae and P. vivax 15,20,21 (Tree B). The same studies also place P. malariae as either a distant outgroup to both the rodent malaria parasites and the P. vivax clade 19 , or as being more closely related to P. vivax than P. ovale, with P. malariae thereby being a close outgroup to the primate infective clade 15,21 . A recent study using draft genome sequences supported Tree B 15 , while the phylogenetic tree presented in this study supports the Tree A topology ( Figure 2).
Following orthologue assignment using BLASTP 22 and OrthoMCL 23 , amino acid sequences of 1000 core genes from 12 Plasmodium species (P. gallinaceum 24 , P. falciparum 25 , P. reichenowi 26  We determined the sensitivity of the tree topology to different parameters and treebuilding algorithms, we analysed the 1000 core gene alignment used for the tree in Figure 2 using a number of different algorithms for phylogenetic inference. We utilized different tree-building softwares, including RAxML 33 (see below), PhyloBayes 34 (using ratecat, cat, and uni models), and PhyML 35 (LG model with optimized site rates), all resulting in Tree A. We also used a number of different amino acid substitution models implemented within RAxML 33 version 8.2.4., including JTT, LG, LG4M, LG4X, GTR_unlinked, GTR, and DAYHOFF. All these models were tested with both CAT and GAMMA site distribution rates. All substitution models resulted in Tree A. In addition to this, we calculated the optimal substitution model for each gene partition by running RAxML 33 for each gene separately using all implemented substitution models (minimum AIC). Using this substitution model optimized partitioned alignment we still generated Tree A. In order to determine whether the conflicting topology would be a local optimum and was therefore not found using our other approaches, we used Tree B as the starting tree using RAxML 33 with a PROTGAMMAJTT substitution model. This approach still converged on Tree A, indicating that our alignment does not support the Tree B topology.
We performed bootstrapping as implemented within RAxML 36 , obtaining very good bootstrap support for all nodes (Figure 2). We also tested the RAxML '-f S' parameter with a window size of 10 and Tree A as the reference tree. This computes phylogenetic signal strengths for each site in the alignment using a leave-one-out approach 37 . We used this output to filter away the top 5% of sites that either strongly supported or strongly did not support Tree A. Using this trimmed alignment, RAxML still produced Tree A, indicating that the phylogenetic signal is not driven by a small subset of sites. Finally, we generated maximum likelihood trees using RAxML 33 with an LG4X model for additional GBlocks 32 trimmed alignments consisting of 200, 500, and 3298 orthologous genes (the latter being all genes with 1-1 orthologs across all 12 species). All alignments resulted in Tree A with good bootstrap support 36 .
We generated separate phylogenetic trees using RAxML for each gene in the 1000 orthologous gene alignment using their optimal substitution models (see above).
The consensus tree, as calculated using RAxML 33 , of these gene-specific trees was Tree A. Most nodes were supported by the majority of gene trees (>50%). The two nodes that differ between Tree A and Tree B are less well supported. Only 25% of genes support the placement of P. malariae as the distant outgroup, while 38% support P. ovale branching off with the rodent malaria parasites. While these percentages may seem low, a lower proportion of genes support Tree B, namely only 19% of genes support the placement of P. ovale as an outgroup to P. malariae and P. vivax, while 24% place P. malariae between P. ovale and P. vivax. This shows that there is significant heterogeneity in the phylogenetic signal present and that these particular nodes are difficult to resolve. We however show that a larger proportion of genes-and the strongest signals when alignments of all genes are concatenated-support Tree A than support Tree B. In order to determine the effect of alignment filtering on the resulting tree, we constructed phylogenetic trees using RAxML, with the JTT amino acid substitution model as in Ansari, et al. 15 , using a number of different alignment trimming strategies. This included no trimming, trimming using Gblocks 32 default parameters (as above), loosening the Gblocks 32 parameters by allowing some gapped sites (if in less than 50% of sequences) and allowing smaller synteny blocks (down to 2 sites), as well as all three TrimAl 38 preset options (nogap, strict, strictplus), as used in Ansari, et al. 15 (Supplementary Table 4). Due to the impact that filtering has on determining the topology, we investigated whether the filtering performed using GBlocks 32 with default parameters is appropriate. We correlated (Pearson's correlation coefficient, r) the absolute number of sites removed (Absolute) and the proportion of the gene sequence retained (Retained) with a number of molecular evolution selection coefficients (HKAr, Ka/Ks, MK p-value) calculated for three species-species comparisons (see below) for each gene. Note that these statistics were calculated using variant calls directly from mapped reads, and so should be relatively robust to alignment quality itself. Supplementary Table 5 shows the Pearson's correlation coefficient (r) and

Supplementary Table 4. Effects of Different Filtering Methods on Resulting Phylogenetic Tree
Bonferroni adjusted p-values for those correlations:

Supplementary Table 5. Correlations of molecular evolution coefficients with number of sites discarded (absolute) and proportion retained (Retained)
* p < 0.1, ** p < 0.01, *** p < 0.001 The indicates that the p-values tend to be lower (i.e. more likely to be significant) in genes where more of the gene sequence is removed. Hence, genes that seem to be under significant selective pressures tend to be filtered more heavily using GBlocks 32 . This is as expected, as Ka/Ks measures are increased by bad alignments.
This means that the filtered alignment consists of a larger proportion of neutrally evolving sites, which are more informative for phylogenetic inference.
We performed a GO term enrichment analysis 39 by looking at the top 10% of genes that were filtered either the most or the least by GBlocks 32 . We find a very strong enrichment for 'GO:0006412: translation' (p < 6e-07) in the highly filtered genes, in addition to a number of ribosomal GO terms: 'GO:0022625: cytosolic large ribosomal subunit' (p <0.001) and 'GO:0022627: cytosolic small ribosomal subunit' (p <0.001). We do not see enriched GO terms in the genes that were not filtered much. Ribosomal genes are often either extremely conserved or highly variable, making them difficult to align and they were therefore filtered away by GBlocks 32 .
Many of the genes that we filtered away using GBlocks 32 , including ribosomal genes and surface antigens such as ama1, were previously included in a manually selected genelist that generated Tree B 15 .
The genes chosen by Ansari, et al. 15  case, we note that even without filtering, a comprehensive analysis of one-to-one orthologs between Plasmodium species does not support Tree B, but agrees with our preferred tree except in the placement of P. ovale.

V. Lineage Dating
We estimated the time of divergence for the four species using a Bayesian inference tool, G-PhoCS 40 . Absolute divergence time estimates are inherently uncertain due to mutation rate and generation time assumptions, and we therefore scaled these parameters to date the P. falciparum and P. reichenowi split using G-PhoCS to 4 million years ago (MYA), as previously published (3.0 -5.5MYA) 41 . Testing a number of different generation time and mutation rate estimates in order to optimize the P. falciparum/P. reichenowi split to 4 million years ago as estimated previously 41 , we found a mutation rate of 3.8 x 10 -10 SNPs/site/lifecycle 42 and a generation time of 65 days 43 to generate this previously published date 41 . Assuming that the mutation rates and generation times are similar for P. ovale and P. falciparum, we find that the relative split of the two P. ovale species is about 5 times earlier than the split of P. falciparum and P. reichenowi. Using the same parameters as for the Laverania split, we thereby date the divergence of the two P. ovale subspecies to approximately 20.3MYA. Using the same mutation rate and a 50% longer generation time to account for the longer intra-erythrocytic cycle (100 days), we date the split of P.

VI. Population Genetics
Using four additional P. malariae samples, two additional P. o. curtisi samples and two P. malariae-like and P. o. wallikeri samples each, we investigated differences in selection pressures between two species that diverged based on host differences (P. malariae and P. malariae-like), and two species that supposedly diverged within the same host (P. o. curtisi and P. o. wallikeri). Supplementary Table 6 shows all samples sequenced for this study and their uses throughout. of unknown function (hypergeometric test, p < 0.001), suggesting important novel Plasmodium biology. Of the four genes with significant MK skews in P. o. curtisi, one is a kelch protein while the others are involved in 'DNA replication' and 'Telomere maintenance'. These results hint at a number of possible divergences between the two P. ovale species, including possible differences in drug susceptibility, changes in gametocyte genes that may have enabled speciation, while differences in DNA replication may possibly be linked to the different relapse times.

VII. RBP1a Receptor
One of the genes with the highest Ka/Ks in the P. malariae/P. malariae-like comparison is RBP1a, which has 37 nonsynonymous fixed differences between the two species and only 6 synonymous fixed differences. The other two intact RBPs are much more conserved. Knowing that P. malariae also infects new world monkeys (where it is known as P. brasilianum) 49 , we might suppose that the receptor for RBP1a may be conserved between humans and New World monkeys, but not with chimpanzees. We identified 19 human genes encoding transmembrane proteins that form 1-1 orthologous OrthoMCL 23 clusters between humans and the common marmoset but not with chimpanzees (Extended Data