We sequenced and annotated the genomes of four P. vivax strains collected from disparate geographic locations, tripling the number of genome sequences available for this understudied parasite and providing the first genome-wide perspective of global variability in this species. We observe approximately twice as much SNP diversity among these isolates as we do among a comparable collection of isolates of P. falciparum, a malaria-causing parasite that results in higher mortality. This indicates a distinct history of global colonization and/or a more stable demographic history for P. vivax relative to P. falciparum, which is thought to have undergone a recent population bottleneck. The SNP diversity, as well as additional microsatellite and gene family variability, suggests a capacity for greater functional variation in the global population of P. vivax. These findings warrant a deeper survey of variation in P. vivax to equip disease interventions targeting the distinctive biology of this neglected but major pathogen.
Half the world's population is estimated to be at risk for malaria caused by P. vivax1, owing to this parasite's unique potential for lengthy remission and tolerance of cooler climates than those preferred by strictly tropical Plasmodium species. Although the P. falciparum parasite is responsible for the majority of contemporary malaria-related mortality, there is evidence that P. vivax may have been a more virulent parasite before the advent of modern medicine. As far north as England, death records indicate that P. vivax likely reduced the average lifespan from 58 to 33 years during the nineteenth century2. More recently, studies have shown that P. vivax is capable of causing the severe malaria syndromes that have long been attributed only to P. falciparum3.
In spite of the past and present impact of P. vivax on human health, it remains chronically understudied relative to P. falciparum. The ability to continuously culture P. falciparum but not P. vivax in the laboratory, compounded with the differential mortality imposed by the two species, has led to vast discrepancies in the state of knowledge regarding almost all aspects of the biology of these species. The recently renewed push for malaria eradication may remain incomplete unless this disparity in knowledge is addressed and the findings of new studies applied to disease control programs4.
Genetic diversity is important to characterize in order to understand the history of human association with a disease, to evaluate the direct effects of diversity on clinical disease and also because it may directly or indirectly reduce the efficacy of therapeutics such as drugs and vaccines. In contrast to P. falciparum, for which the genomes of many hundreds of isolates have now been sequenced or genotyped5, only the P. vivax genomic reference strain (Salvador I)6 and unassembled shotgun sequencing of a Peruvian isolate (IQ07)7 have been completed. We sequenced, assembled and annotated the genomes of four geographically disparate isolates of P. vivax to remedy the lack of genetic diversity data available for this species relative to P. falciparum. The designations and geographic origins of each P. vivax strain are given in Table 1, which also lists the names and origins of a concurrently sequenced comparator panel of P. falciparum isolates that came from similarly disparate geographic locales. Templates of both species were sequenced using the same next-generation sequencing platform (Illumina Genome Analyzer IIx and HiSeq 2000) and evaluated for diversity using the same bioinformatic tools. All four of the newly sequenced P. vivax reference strains (North Korean, India VII, Mauritania I and Brazil I) are clonal infections adapted for growth in monkeys, are publically available via the Malaria Research and Reference Reagent Resource Center (MR4; see URLs) and were sequenced from genomic DNA derived from leukocyte-depleted monkey blood (Online Methods). The P. falciparum isolates (from Honduras, India, Indochina (Thailand and Laos) and Senegal) were also clonal and were sequenced using template derived from in vitro cultures. We generated de novo assemblies for each of the four P. vivax isolates (Table 2) using the ALLPATHS LG assembly algorithm8. Assembly quality was substantially higher for P. vivax than for a P. falciparum assembly generated using the same approach (Supplementary Table 1), likely due to the more moderate AT nucleotide composition of the P. vivax genome (P. vivax: 57.7% AT versus P. falciparum: 80.6% AT). Synteny between the new P. vivax assemblies and the Salvador I reference assembly was found to be highly conserved (≥97.9%; Supplementary Fig. 1).
We next used the sequencing data to evaluate genetic diversity within each species. The pairwise SNP rate for each sequenced isolate relative to the reference assembly for each species (P. vivax, Salvador I; P. falciparum, 3D7) is given in Figure 1a as a function of inferred SNP quality. Pairwise SNP rate relative to a reference assembly is expected to be a function of evolutionary or geographic distance, but, despite the varying geographic origin of isolates from both species, we found that the P. vivax SNP rates were uniformly higher than the P. falciparum SNP rates, regardless of the SNP quality threshold used. This finding suggests globally higher genetic diversity in P. vivax relative to P. falciparum, a genome-wide result confirming previous pilot surveys9,10. We explored this finding by comparing mean SNP rates according to sequence class, using SNPs with a minimum PHRED-style quality score of 30 (estimated accuracy of at least 99.9%)11. We found that P. vivax exhibited a significantly higher mean SNP rate than P. falciparum at intergenic and intronic fourfold-degenerate synonymous coding sites, within coding sequences overall and across all sequence classes (t test, P = 0.0087; Fig. 1b), suggesting that the difference in diversity is pervasive and occurs across the genome. To control for variation in the degree of functional constraint among genes within each genome, we next evaluated mean pairwise SNP diversity (π) in a collection of 3,401 genes for which we could confidently identify 1:1 orthologs in the two species using the reciprocal best BLAST hits (RBH) criterion. In this comparison, we observed approximately twice as much SNP diversity in P. vivax compared to P. falciparum (paired t test, P = 2.2 × 10−16; Fig. 1c), confirming the ubiquity and magnitude of the SNP diversity disparity between the species (Supplementary Fig. 2 and Supplementary Table 2).
To test whether a differing SNP mutation rate, rather than a different effective population size and/or demographic history, can account for the differences in SNP diversity observed, we compared the genome-wide diversity of microsatellites, a different class of mutation. Because microsatellite length variants are caused by replication slippage rather than point substitution12, we would expect their relative diversity profile to be different from that of SNPs under the null hypothesis that a different point mutation rate explains the disparity in SNP diversity. We applied a new method of evaluating microsatellite length variation from Illumina data and observed, as we had with the SNPs, significantly higher diversity in P. vivax than in P. falciparum (bootstrapping, 279 and 22,713 microsatellites with at least 8 repeats in P. vivax and P. falciparum, respectively; P = 0.021; Fig. 1d and Supplementary Fig. 3). Given the population genetic evidence that P. falciparum underwent multiple drug-induced selective sweeps and at least one significant bottleneck13, these results indicate that P. vivax may have a comparatively large effective population size due to an absence of such demographic events in recent history. Even in the face of common drug pressure, P. vivax may have disproportionate demographic stability due to its unique capacity for dormancy within infected hosts.
Our sample is smaller than is ideal for evaluating the time to the most recent common ancestor (TMRCA), but by comparing the deepest pairwise nucleotide divergence observed for each species at fourfold-degenerate synonymous sites (P. vivax: Mauritania versus Brazil I, 1.628 × 10−3 substitutions per site; P. falciparum: 3D7 versus Dd2, 9.59 × 10−4 substitutions per site) we can predict that the lower bound for TMRCA in P. vivax is approximately 70% greater than that for P. falciparum. The calculation of absolute TMRCA dates is dependent on the accuracy of the inferred mutation rate, and, therefore, results must be interpreted with caution. Nevertheless, if we assume a commonly accepted eukaryotic genome-wide mutation rate for fourfold-degenerate sites (2.2 × 10−9 substitutions per site per year14, similar to a Plasmodium rate estimated with less precision)15, we can estimate the TMRCA as 768,000 and 452,000 years ago for P. vivax and P. falciparum, respectively. This P. vivax estimate is deeper than previous TMRCA estimates generated using a small number of loci in a larger population sample16, but reconciliation of the absolute estimates is difficult without knowledge of the true mutation rate. Assuming mutation rates are similar in both parasite lineages, however, the result stands that P. vivax has a much deeper TMRCA.
Other departures in the global population history of these two species are indicated by the topology and branch lengths of their respective phylograms (Fig. 2). The relatively large degree of divergence between the IQ07 Peruvian isolate and the Brazil I and Salvador I strains of P. vivax suggests a distinct history in the New World relative to P. falciparum, which exhibits low diversity in the New World and is thought to have been introduced within the last 500 years via the African slave trade17. The high New World diversity of P. vivax, combined with the closer phylogenetic affinity of the three New World P. vivax isolates with the east Asian (North Korean) rather than the African (Mauritania I) or south Asian (India VII) strains, could suggest the precolonial arrival of P. vivax in the New World accompanying human dispersal from Asia by sea or, less likely, by the Bering land bridge during the last glacial maximum. Alternatively, this profile could be explained by recent but very large-scale (relative to P. falciparum) postcolonial introductions to the New World via transoceanic trade from east Asia, the Pacific or Europe, the last of which presumably harbored a genetically distinct parasite clade before disease elimination18. Deeper genomic sampling of P. vivax populations will be required to explain these patterns in the diversity data.
We next explored the profile of variation in individual genes and gene families to evaluate the potential functional consequences of the extremely high genomic diversity we observe in P. vivax. Mean pairwise divergence among the sequenced P. vivax isolates is highest in gene families associated with red blood cell invasion and immune evasion (Fig. 3a). Functional enrichment analysis of diversity in individual genes also found nonsynonymous SNPs to be concentrated in invasion-related motility genes (Supplementary Table 3). This extremely high sequence diversity suggests that vaccines targeting polymorphic antigens may encounter an even greater hurdle in eliciting cross-protective immune responses than they do in P. falciparum, where strain-specific immunity has been recently observed to limit vaccine efficacy19.
Differences in the distribution of nonsynonymous SNPs among genes with orthologs in both species are potentially reflective of differences in disease biology in P. vivax and P. falciparum (Supplementary Table 4). Genes expressed in the pre-erythrocytic stages are the most enriched group among those with higher ratios of nonsynonymous-to-synonymous diversity (πNS/πS) in P. vivax relative to P. falciparum (Mann-Whitney U test; Z score = 2.2), whereas genes associated with host-parasite interactions are the most enriched group with higher πNS/πS in P. falciparum relative to P. vivax (Mann-Whitney U test; Z score = 2.4) (Supplementary Table 5). Although neither of these enrichments was statistically significant after correction for multiple testing, this pattern bears further exploration when sequence data from more P. vivax genomes become available.
As expected, we observed enormous diversity in the vir gene family, the members of which are variably expressed and encode proteins that are exported to the host cell surface for the purpose of evading the host adaptive immune response20. Of the 313 vir genes included in the Salvador I reference assembly, less than a third are also observed in the four new assemblies (Fig. 3b). Unexpectedly, we encountered 15 'ultra-conserved' vir genes that were present in all assemblies and showed very low SNP diversity; in particular, one locus (PVX_113230) was invariant and showed the highest similarity (70% protein sequence identity) of any vir to the homologous kir gene family in Plasmodium knowlesi, a zoonotic parasite (Fig. 3c). Unlike most vir genes, this locus also exhibits conserved synteny in more distantly related rodent malaria parasites. These attributes suggest that PVX_113230 is likely the founder of the vir family in the P. vivax lineage, and the lack of polymorphism suggests that the protein it encodes performs an ancestral role rather than host immune modulation. The molecular function of PVX_113230 could be related to erythrocyte invasion, as suggested by its distinct expression profile in erythrocytes relative to most vir genes21 (Supplementary Fig. 4).
This global census of genomic diversity in P. vivax has uncovered an unexpected degree of genetic polymorphism, much of which may translate into important functional variation. Our data stop short of suggesting the existence of distinct subspecies of P. vivax, similar to the subspecies P. vivax vivax and P. vivax hibernans proposed on the basis of relapse phenotype22. However, the extreme diversity we observe among these new reference strains suggests a more stable and older association of this parasite with humans than for P. falciparum, serving as a warning that the eradication of P. vivax could represent a qualitatively different task.
Malaria Research and Reference Reagent Resource Center (MR4): http://www.mr4.org/; TransposonPSI: http://transposonpsi.sourceforge.net/; P. vivax nuclear and apicoplast genome assemblies and gene calls and Sanger sequencing–based Dd2 assembly, http://www.broadinstitute.org/annotation/genome/plasmodium_vivax/MultiDownloads.html; PlasmoDB, http://plasmodb.org/.
We chose four strains of P. vivax for whole-genome sequencing on the basis of geographic origin and phenotype to provide a resource of high-quality assembled and annotated sequences for the malaria research community. The North Korean strain has a long relapse phenotype that enables it to survive in the primate host through periods of drought and long winters when mosquito vectors for transmission are unavailable26. The Brazil I strain is highly resistant to the anti-relapse drug primaquine23. The India VII strain is the first Plasmodium species to be sequenced from India. The Mauritania I strain is a rare example of an African P. vivax strain that occurs among west Africans with Berber and Arab genetic backgrounds28. Genomic DNA for P. vivax sequencing was obtained from the leukocyte-depleted blood of infected, splenectomized Saimiri monkeys as described previously6. Genome-wide fragment analysis of 15 microsatellites before sequencing confirmed their independent origins. DNA and frozen stabilates are available upon request at MR4 (see URLs).
Genome sequencing, assembly and annotation.
Genomic DNA was used to construct two Illumina sequencing libraries for each P. vivax isolate, with library insert sizes of 180 bp and 3 kb. Each library was sequenced to a depth of at least 150-fold coverage (to account for contaminating monkey host DNA) using 76-bp paired-end Illumina reads on an Illumina Genome Analyzer IIx sequencer. After filtering out reads with sequence similarity to monkey or primate sequence, each genome was assembled using the ALLPATHS-LG algorithm8. Assembly quality was quantified using the N50 statistic for contigs and scaffolds, which describes the minimum contig or scaffold size such that the sum of the lengths of all contigs or scaffolds of equal or greater size accounts for at least half of the total assembly length. Synteny with the P. vivax Salvador I reference assembly was quantified for each de novo assembly by evaluating the concordance of approximately 60,000 randomly chosen pairs of 100-bp sequences, each with 100 kb of intervening sequence in the new assemblies. This method identifies breaks in synteny when the sequence pairs do not map to locations separated by 100 kb in the reference assembly or when only one member of a sequence pair is successfully mapped to the reference. Finally, for three of the four P. vivax isolates (North Korean, Mauritania I and Brazil I), we were able to recover a scaffold representing the complete apicoplast genome.
P. falciparum analyses were based on comparable Illumina read data generated from a single library for each isolate (insert size of 180 bp). Coverage for each of the isolates was as follows: Indian isolate 87_239, 42×; Indochina isolate Dd2, 196×; Honduran isolate HB3, 30×; Indian isolate ML-14, 41×; Senegalese isolate Th231.08, 59×.
The protein-coding genes in the P. vivax nuclear genomes were annotated using a combination of reference gene mapping, homology-based gene models (GeneWise)31, EST-based gene models and ab initio gene predictions. Ribosomal RNAs (rRNAs) were identified with RNAmmer32. The tRNA features were identified using tRNAScanSE33. Other common RNA features were identified with RFAM34.
The MUMmer algorithm35 was used to align draft assemblies to the P. vivax Salvador I reference genome assembly. Neighboring syntenic alignment blocks were joined to form longer alignments, which were then used to map the gene coordinates from the reference genome to the draft assemblies. Homology-based gene models were created using tblastn to search against the draft genome assemblies with the UniRef90 protein database, a Plasmodium protein database created from the annotated proteins of P. falciparum 3D7 and P. vivax Salvador I. The resultant BLAST hits were used to create GeneWise gene models. Gene models were also built using 31,777 P. vivax ESTs available on GenBank, with an ORF cutoff length of 300 bp (EST ORFs). Ab initio gene models were predicted using self-training GeneMark-ES36 and GeneId37 with parameters trained on genes from the P. vivax Salvador I genome.
The final merged gene set for each of the four sequenced P. vivax strains was created using the following workflow. If a mapped reference gene in a given assembly had intact start and stop codons, no frameshift or in-frame stop and no exons in the contig gaps, then the mapped reference gene was used directly as the gene model. If a mapped gene had a frameshift or in-frame stop, then the corresponding GeneMarkES gene model was selected. If a GeneWise gene model had no overlap with gene models from the previous two sources but had a non-generic gene product name or overlap with non-repeat PFAM domains, then the GeneWise feature was added as a gene model. Finally, if an EST ORF was at least 600 bp in length and exhibited no overlap with models identified from the previous three sources, then the EST ORF was added as a gene model. The initial gene set was compared against tRNA and rRNA features, and genes were filtered out where appropriate. Additional gene filtering was performed by removing genes with 30% or more coding sequence overlap with hits in TransposonPSI (see URLs) (e < 1 × 10−10).
Sequencing reads from each isolate were aligned using the Burrows-Wheeler Aligner (BWA)38 to the reference assemblies of P. falciparum 3D7 (build v7.1) and P. vivax Salvador I (build v7.0), both downloaded from PlasmoDB (see URLs). SNPs were called using the Unified Genotyper11 in the GATK package39. SNPs with an estimated PHRED-style quality score of Q30 or greater were used for diversity analyses.
Microsatellite variants and validation.
The mreps program40 was used to identify microsatellites in the reference assemblies of P. falciparum and P. vivax. For the microsatellite search, the mreps parameters were set as minp = 1 and maxp = 9. Searching under these conditions identified 538,794 and 95,990 microsatellite loci in P. falciparum and P. vivax, respectively. For P. falciparum, the allow-small parameter was employed to allow identification of small microsatellites that the mreps algorithm might otherwise flag as biologically insignificant due to the high AT content (∼80%) of that genome.
Illumina sequencing reads were mapped to the P. falciparum 3D7 or P. vivax reference assemblies using BWA38 with the -q 5 and -l 32 options. Base-quality score recalibration and local realignment around microsatellites11 were applied using GATK39. Custom Python scripts were used to filter out reads that did not span the entirety of a microsatellite interval in the genome. A final local realignment around microsatellite sequences, and insertion and/or deletion (indel) calling with standard hard filtering parameters was applied using GATK; accepted indels were converted into microsatellite-length genotypes. Illumina-based microsatellite calls were validated by comparison with calls made by aligning the Sanger sequence–based Dd2 assembly (see URLs) to the 3D7 reference assembly to generate a truth set of indels for genotyped microsatellites. Comparison of the Illumina-based calls to the Sanger sequencing–based calls indicated that our Illumina-based calling method exhibits a specificity of 100% and a sensitivity of 97%, on the basis of 81,569 AT dinucleotide microsatellites callable in both data sets. P values for evaluating the significance of the difference in microsatellite diversity between species were generated by resampling π values for each species 10,000 times and noting the frequency with which a difference in mean π occurred that was equal to or greater than the observed difference in mean π. The overall comparison of microsatellite diversity across motif unit sizes was performed using only microsatellites with eight or more repeat units, given that, as previously observed for P. vivax41, these longer microsatellites were observed to be more polymorphic in both species (Supplementary Fig. 5).
Evolutionary analyses and TMRCA calculation.
Mean pairwise diversity at all nucleotide sites (π), synonymous sites (πS) and nonsynonymous sites (πN) was calculated for each gene in the reference annotations for P. falciparum and P. vivax using the method described in Table 1 of ref. 42. P. knowlesi orthologs were employed to calculate interspecies dN/dS ratios using PAML v4.5 (ref. 43) with sequences from the Salvador I isolate of P. vivax. A matrix of pairwise nucleotide distances between isolates (Supplementary Table 6) was constructed for each species using SNPs identified in fourfold-degenerate synonymous coding sites of genes with orthologs in each species. To control for differences in the nucleotide substitution profile between species and enable direct comparison of branch lengths, pairwise distances were normalized by empirical nucleotide transition matrices (Supplementary Table 7) constructed for each species by rooting fourfold-degenerate polymorphisms using an outgroup species (P. knowlesi for P. vivax and P. reichenowi for P. falciparum). Phylograms were constructed from pairwise distance matrices using the Neighbor-Joining method. The lower bound of the TMRCA was estimated for each species as the deepest pairwise divergence. SNP functional enrichment analyses were carried out using Mann-Whitney U tests, and Z scores were interpreted using a Bonferroni correction for multiple testing.
Illumina sequencing reads have been submitted to the NCBI Short Read Archive (SRA; accessions SRP007883, SRP007923, SRP000493 and SRP000316). SNP calls have been submitted to dbSNP and can also be downloaded from the Broad Institute website (see URLs). P. vivax nuclear and apicoplast genome assemblies and gene calls may be downloaded from GenBank (accessions AFMK01000000, AFBK01000000, AFNI01000000, AFNJ01000000, JQ437257, JQ437258 and JQ437259) and the Broad Institute website (see URLs).
NCBI Reference Sequence
Sequence Read Archive
This work has been funded, in whole or in part, by federal funds from the US National Institute of Allergy and Infectious Diseases (NIAID), the US National Institutes of Health (NIH) and the US Department of Health and Human Services, under contracts HHSN266200400001C and HHSN2722009000018C. We gratefully acknowledge the Indian Council of Medical Research for financial support of the Malaria Parasite Bank at the National Institute of Malaria Research, New Delhi, and the NIMR Director for providing all facilities. We thank the NIAID/National Human Genome Research Institute (NHGRI) Eukaryotic Pathogens and Disease Vectors Working Group and the Broad Institute Genome Sequencing Platform for significant contributions to this project. K.G. and L.Y. are supported by a Global Health Program grant from the Bill and Melinda Gates Foundation. A.A.E. is supported by a grant from the NIH (RO1GM084320), and J.M.C. and P.L.S. are supported by grant U19AI089676, an NIAID International Center of Excellence for Malaria Research. The content is soley the responsibility of the authors and does not necessarily represent the official views of the NIH.