Zoonotic origin of the human malaria parasite Plasmodium malariae from African apes

The human parasite Plasmodium malariae has relatives infecting African apes (Plasmodium rodhaini) and New World monkeys (Plasmodium brasilianum), but its origins remain unknown. Using a novel approach to characterise P. malariae-related sequences in wild and captive African apes, we found that this group comprises three distinct lineages, one of which represents a previously unknown, highly divergent species infecting chimpanzees, bonobos and gorillas across central Africa. A second ape-derived lineage is much more closely related to the third, human-infective lineage P. malariae, but exhibits little evidence of genetic exchange with it, and so likely represents a separate species. Moreover, the levels and nature of genetic polymorphisms in P. malariae indicate that it resulted from the zoonotic transmission of an African ape parasite, reminiscent of the origin of P. falciparum. In contrast, P. brasilianum falls within the radiation of human P. malariae, and thus reflects a recent anthroponosis.

For sample PGABG03 (accession number ERS333073), which had a distribution of contig identities that suggested the presence of both M1-like and M2 coinfections, contigs with M1-like alignments of less 60% of their length were aligned to the P. malariae reference genome PmUG01 1 (M1) using the same settings. Contigs that had less than 60% of their length included in an alignment to one or other reference were discarded. Contigs that had at least 97% nucleotide identity to either reference were assumed to be M1-like, and those that had between 83 and 92% identity were assumed to be M2. Contigs with between 92 and 97% or less than 83% identity were inspected individually, and added to the M2 contig set if they had no apparent misassemblies, had a similar level of divergence from both M1 and M1-like references in blastn alignments 12 (version 2.9.0), and this divergence was greater than the divergence between M1 and M1-like in the same region. The M2 and M1like contig sets were ordered against PmUG01 using ABACAS 13 to produce initial assemblies. To identify misassemblies and other errors, including M1-like/M2 chimeric contigs, the P. malariae-related reads were mapped back to a reference consisting of the initial M2 and M1-like assemblies and contigs that had not been included in either set, and reads mapping to M2 were visually inspected in Geneious Prime (version 20202.1.2., https://www.geneious.com) to identify contigs for which multiple different haplotypes were represented in the mapped reads. In addition, the initial M2 assembly was compared with the M1 and M1-like assemblies using blastn, and visually inspected using ACT 14 (version 18.0.2) to identify contigs that had segments aligned to PmUG01 with nucleotide identity more consistent with M1-like than M2 (>97%), and contigs that had segments that aligned to another region of the genome. Because manual inspection suggested that SPades had introduced some assembly errors at contig ends, 100 bp was trimmed off the end of each contig. Contigs were then re-extended where possible using IMAGE 15 (version 2.4), performing two iterations with a k-mer of 41 followed by two iterations with a k-mer of 31, and using sequencing reads that had aligned to the presumed M2 contigs or unassigned contigs. The misassembly checks were then repeated. In total, 18 contigs out of a original total of 494 were deleted following these checks, and four longer contigs were edited to remove short regions of incorrect sequence, resulting in a final assembly of 497,491 bp.
Annotations were transferred from the P. malariae reference genome to the final M2 assembly using RATT 16 , followed by manual inspection and correction, which included adjusting annotations to exclude possible erroneous sequences at the ends of contigs, and resulted in the annotation of orthologues of 290 P. malariae genes. Most annotations were of incomplete fragments, with 26 representing full-length genes.
To address the possibility that the "M2" genes actually represented extremely divergent alleles from an M1-like parasite, from loci under strong diversifying selection, we used the "Gene Ontology Enrichment Analysis" tool in PlasmoDB 17 (release 55) to analyse the P. malariae orthologues of annotated M2 genes. This survey confirmed that there was no enrichment for genes with GO terms indicating that they are likely to be under diversifying selection.

M2 comparison
For individual gene alignments, genes annotated in M2 and their orthologues in PmUG01 (identified during gene annotation) were screened to exclude pseudogenes and lowcomplexity regions masked using segmasker 12 , and alignments were generated for 285 genes using TranslatorX/MUSCLE 18,19 (versions 1.1 and 3.8.31, respectively). Sequences derived by variant calling from M1-like strain GA01 (as in the "Variant calling" section, except the Dustmasker step was skipped because the gene alignments had already been masked for low-complexity sequence) were added to these alignments, following the PmUG01 sequence alignment. Sites with missing data or alignment gaps were removed and corrected genetic distances were calculated for individual genes and for a concatenated alignment of all genes using the ape R package 20 (version 5.4.1, dist.dna, model "TN93").
For comparison with other Plasmodium species, amino acid alignments were generated for orthologous genes from M2, P. malariae 1 , M1-like (as above), P. knowlesi 21 , human and ape P. vivax 9,22 , P. ovale-curtisi, P. ovale-wallikeri 1 , P. berghei, P. chabaudi 23 , P. falciparum 24 , P. praefalciparum, P. gaboni 5 and P. reichenowi 9 . Orthologue groups for the published genomes were obtained from PlasmoDB 17 , except for ape P. vivax and P. ovale-wallikeri, which were added to the groups based on their orthology to human P. vivax (taken from reference 22) and P. ovale-curtisi (using OrthoMCL 25 , version 2), respectively. Orthologue groups with exactly one representative in each species assembly (1:1 orthologues) were used for analysis, 186 genes in total. For each gene, nucleotide sequences from each genome assembly were translated and the resulting amino acid sequences were aligned with MUSCLE 19 . Gene alignments were cleaned with Gblocks (version 0.91b) with default settings 26 and concatenated to produce a single alignment of 63,621 amino acids. Distances between pairs of species were calculated by counting the number of differences and dividing by the length of the sequence.

M1 and M1-like sequences
Host filtering of newly-derived (Ptv_Leo and MOpte51017) and published (GA01 and GA02 1 ) sequencing libraries was performed by mapping with bwa to a combined reference containing both chimpanzee and Plasmodium reference genomes (chromosomes only, P. malariae, P. ovale-curtisi, P. falciparum, P. vivax and P. gaboni). Read pairs mapping to P. malariae (either both mapped to P. malariae or one read unmapped) were extracted and used for SNP calling. The sequencing libraries from samples GA01 and GA02 were kindly provided by the study authors 1 to allow the use of the complete, unfiltered read sets. For M1, we included genome data from patients sampled in three recent studies 1,27,28 and assembly contigs for P. brasilianum 29 . Subsequent analyses revealed two pairs of genomes from one study 28 that were near-identical within the pair; only the genome from each pair with greater coverage of the reference (KEN01 and THA02) was included in the final analysis.
To obtain mitochondrial DNA sequence from GA01 and GA02 read libraries, reads were mapped to the chimpanzee genome with bwa, then non-chimpanzee reads were mapped with smalt to a combined reference of mtDNA genomes from P. falciparum (AY282930 30 ), P. malariae (LT594637 1 ), P. ovale-curtisi (HQ712052 31 ) and P. vivax (LT635627 9 ). All read pairs where either mate mapped to any genome in this reference were extracted and used for de novo contig assembly with SPades. Blastn to the NCBI Genbank database was used to identify P. malariae-related sequence from the resulting contigs.

Variant calling
Variants in M1 and M1-like genomes were called using the Genome Analysis Toolkit 32 (GATK, version 4.1.6.0 except where noted). Sequencing reads were mapped to a combined reference of P. malariae (PmUG01), P. falciparum, P. ovale-curtisi and P. vivax (chromosomes only) with bwa, and PCR duplicate reads removed. De-duplicated bam files were used to generate genomic variant call format files for P. malariae chromosomes using HaplotypeCaller (with OverClippedReadFilter option). Variants were called jointly on all samples, and single nucleotide polymorphisms (SNPs) were filtered using a set of annotations and values that were optimised to exclude subtelomeric SNPs (assumed to be mainly artefactual) while retaining as many SNPs as possible at fourfold degenerate sites (assumed to be mainly correct) (Quality <30.0, QualByDepth <2.0, RMSMappingQuality <42.0, FisherStrand >60.0, StrandOddsRatio >3.0, MappingQualityRankSumTest <-2.0 or >4.0, ReadPosRankSumTest <-8.0 or >8.0). The ability to call a SNP was assessed for each site in the reference for each sample using CallableLoci (GATK version 3.8.1.0, minimum read depth 5, OverClippedReadFilter applied), which output a bed format file annotating callable sites. To simplify downstream analyses, chromosome sequences were then generated for each sample by changing the reference sequence to the alternative allele at variant sites using bcftools 33 . Low-complexity regions in the reference were identified using Dustmasker 12 (version 1.0.0), and bases were changed to N in a given sample if they were in low-complexity regions or subtelomeres, or in that sample were uncallable, had more than one alternative allele or had allele depth <5. For biallelic sites with heterozygous SNP calls, the allele supported by the most reads was used (or one allele was selected at random if read support was equal for both). At the start and end of each chromosome, the boundary of the core genome defined by Ibrahim et al. 28 was compared with the extent of the region in which the PmUG01 (M1) and PmlGA01 (M1-like) genomes were syntenic, and whichever of these was closer to the centre of the chromosome was used as the core/subtelomere boundary (Supplementary Table 7).
Raw sequencing reads from P. brasilianum 29 were not available, so instead contigs from the published assembly were aligned with PmUG01 using MUMmer 11 (contigs aligned with nucmer with default options, filtered with delta-filter, then final alignments produced using show-tiling with -1, -i 90, -u 50 and -l 1000 options), followed by identification of singlenucleotide differences using show-snps (-C option). This information was then used to generate new chromosome sequences by changing the reference sequence to the alternative allele at variant sites, a step which allowed this sample to be included in the analysis pipeline used for the other samples. Regions of the reference that had been covered by contig alignments were identified with show-coords. Regions that were not covered in the alignments, low-complexity regions and subtelomeres were changed to N.
Inspection of the P. brasilianum gene sequences aligned with other M1 strains identified clusters of differences from the reference genome, frequently close to the ends of P. brasilianum contigs. These differences may represent genuine sequence changes but could also have arisen due to sequencing or assembly errors, and without access to the sequencing reads it is difficult to validate or dismiss them. Because the differences were usually unique to P. brasilianum, we reasoned that if artefactual they would be unlikely to influence the qualitative relationship between P. brasilianum and other M1 strains, but could incorrectly inflate estimates of M1 diversity. We therefore included P. brasilianum in the M1 network but excluded it from diversity and polymorphism analyses.

Genome comparisons
When chromosome sequences had been generated for all strains, these were concatenated with each other and the reference sequence to produce a sequence alignment for each chromosome, then converted to a haploid vcf file using snp-sites 34 . For each analysis, appropriate sample subsets were extracted from the vcf files using bcftools 33 (version 1.9), removing sites that were missing data, invariant or non-biallelic in the sample subset.
Principal components were calculated from haploid vcf files of high-coverage M1 and M1like samples (Supplementary Table 4) using plink 35 (version 1.9).
To search for regions of unusually high diversity in M1 or low divergence between M1 and M1-like, sliding windows of 20 kb with a 100 bp step size were examined for each sample. For each window, SNPs/base was calculated by determining the number of differences between that sample and the reference using the haploid vcf files and determining the number of non-N bases from the chromosome alignments, excluding windows with fewer than 8,000 non-N bases. Only windows completely contained in the chromosome core were analysed. For increased resolution around the high-diversity region reported in the text, a plot was also generated for chromosome 10 using windows of 5 kb with a 100 bp step size and a minimum of 2,000 non-N bases.
Site frequency spectra were generated from a haploid vcf file for all M1 strains except P. brasilianum (N=23). Since many SNPs had data missing from at least one genome, leading to variation in the number of calls, the number of calls at each SNP position were down-sampled to 15, a value chosen as a compromise between increasing genome count and increasing SNP count. Fourfold and zerofold degenerate sites were identified in PmUG01. Outgroup alignments were generated from PmUG01 and SNP-derived GA01 sequences and used with the est-sfs unfolder 36 (version 2.03) to identify the derived allele at each position and calculate the unfolded site frequency spectrum.

Polymorphism analysis
Nucleotide alignments of orthologues were generated for each non-subtelomeric, nonpseudogene from PmUG01, using reference genes that had been masked for lowcomplexity sequence as above. Additional M1 strains were excluded if less than 80% of the reference length was unmasked. All M1-like strains with at least some sequence unmasked were included for every gene, unless all three strains had some sequence, in which case only strains with at least 50% of the reference length unmasked were included. For both M1 and M1-like, strains were excluded if they contained an internal stop codon. Synonymous and non-synonymous polymorphisms were counted by determining the effect of a polymorphism on the M1 reference PmUG01 or the M1-like strain GA01, considering only genes and sites with at least two strains for both M1 and M1-like. Synonymous (S) and nonsynonymous (NS) fixed differences were counted by comparing codons between PmUG01 and GA01, excluding sites that were polymorphic in either M1 or M1-like, and assuming the order of changes that gave the lowest number of NS changes. The neutrality index 37 (NI) was arrived at by calculating the total NS and S over all genes for polymorphisms and for fixed differences, and dividing NS/S for polymorphisms by NS/S for fixed differences. The Direction of Selection statistic 38 (DoS) was calculated for each gene by subtracting the fraction of polymorphisms that were NS from the fraction of fixed differences that were NS, and density plots were generated using the "density" function in R with a fixed bandwidth of 0.075 for both M1 and M1-like. For calculation of pairwise diversities, positions that were masked in any sequence were excluded from the alignments, after which genes with less than 100 bp remaining in the alignment were discarded. Mean pairwise nucleotide diversities were calculated at all remaining sites and at sites that were fourfold and zerofold degenerate in PmUG01 using the ape R package 20 (dist.dna, model "raw").

Characterisation of a putative region of introgression
Looking for evidence of introgression between M1 and M1-like parasites, we identified one candidate region, extending over about 45 kb on chromosome 10, where some M1 strains were as divergent from M1-like parasites as they were from each other (Fig. 5a,   Supplementary Fig. 6). This region is located towards one end of chromosome 10, approximately 495 kb before the telomere, and contains 12 genes. To investigate whether the extraordinarily high diversity among M1 strains could be explained by causes other than introgression from M1-like, we examined the identity of these genes. The first six genes have clear syntenic orthologues in both P. falciparum and P. vivax, including two with unidentified functions (PmUG01_10045700 and PmUG01_10046000). However, none of the P. falciparum or P. vivax orthologues show the same striking elevation of diversity that we observed in P. malariae (data from Malariagen 39,40 ). We queried PlasmoDB's OrthoMCL annotations 17,25 for related genes, and found that with one exception, these genes have no paralogues within P. malariae, but numerous homologues in other species. The exception is PmUG01_10046100, which is a methionine-tRNA ligase gene that is the result of an ancient Plasmodium duplication that gave rise to one gene encoding a protein targeted to the cytoplasm, and another encoding a protein targeted to the apicoplast. The second part of the high-diversity region has no discernible synteny with the corresponding regions in P. falciparum and P. vivax and we were unable to identify orthologues in other Plasmodium species of the six genes it contains. Using ExportPred 41 (as implemented in PlasmoDB 17 ), we established that none are predicted to be exported. Of the six genes, five encode products described as "hypothetical proteins". One is annotated to encode a "conserved Plasmodium protein of unknown function", but we were unable to identify any homologues of this gene. Three genes have no paralogues in the P. malariae genome. The other three are apparently related to one another and to six other P. malariae genes, but also form part of a group of homologous genes (OG6_100908) with 344 members across numerous Plasmodium species, so seem unlikely to represent a novel P. malariae multigene family. In conclusion, we could find no reason why the genes in this region should be subject to unusual diversifying selection, nor other explanations for the exceptional pattern of diversity among M1 strains. Thus, the region has most likely resulted from introgression from an M1like strain, but we can identify no reason why this is the only chromosomal region where diversity persists.      GN01  ID01  KEN01  LBR02  MA01  MOpte51017  MY01  SDN01  SLE02  SLE04  THA01  THA02  THA05  THA06  THA09  THA10  THA11  THA12  UAN  UGA01  UGA03  UGA07  GA01 Table 3. CytB sequences from ape P. malariae-related parasites that were not included in the tree in Fig. 1b a . a Sequences in this table were not included in Fig. 1b because they did not cover the full length of the 2,038 bp alignment used for the tree.   Pm_set37-1 AAACGAAAT*A*A Pm_set37-2 AACGAAAAA*T*A Pm_set37-3 AACGTAAAA*A*A Pm_set37-4 ACGAAAAAA*T*A Pm_set37-5 ATACGTAAA*T*A Pm_set37-6 CGAAAAAAA*A*T Pm_set37-7 CGAAAAAAA*T*A Pm_set37-8 TAACGAAAA*A*A Pm_set37-9 TACGAA*C*G Pm_set37-10 TACGCATAT*A*T Pm_set37-11 TACGTAAAA*A*A Pm_set37-12 TATCGAAA*A*A Pm_set37-13 TTTACGAA*T*A Pm_set37-14 TTTCGTAA*T*A Pm_set31