Population genomics of virulence genes of Plasmodium falciparum in clinical isolates from Uganda

Plasmodium falciparum causes a spectrum of malarial disease from asymptomatic to uncomplicated through to severe. Investigations of parasite virulence have associated it with the expression of distinct variants of PfEMP1, the major surface antigen of the blood stages, which is encoded by up to 60 var genes per genome. Looking at the population genomics of var genes in cases of uncomplicated malaria, we set out to determine whether there was any evidence of a selective sweep of specific var genes or of a clonal epidemic structure related to the incidence of uncomplicated disease in children. By sequencing the conserved DBLα domain of var genes from six sentinel sites in Uganda, we found that the parasites causing uncomplicated P. falciparum disease in children were highly diverse and that every child had a unique var DBLα repertoire. Despite extensive var DBLα diversity and minimal overlap between repertoires, specific DBLα types and groups were conserved at the population level across Uganda. This pattern was the same regardless of geographic distance or malaria transmission intensity. These data lead us to propose that any parasite can cause uncomplicated malarial disease and that these diverse parasite repertoires are composed of both upsA and non-upsA var gene groups.

Here we show that the hidden Markov model approach using HMMER accurately classifies the DBLa reads into either UPS type A or a combined category of UPS types B and C.
First we determine the average read length in order to simulate reads to align.
We used cross-validation to determine the accuracy of the HMMER hidden Markov models in classifying DBLa reads. Simulated translated protein reads were generated from each domain in the HMMER model database (Rask et al., 2010). A new HMMER model database was then created excluding that domain.
Finally, the simulated protein reads were aligned to the new HMMER model database.
In the manuscript we translate the DNA reads into all 6 frames and take the best hit. As it was deemed very unlikely that an incorrect reading frame would give a higher classification score, we did not simulate from DNA.
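The six-frame translation described above can be sketched as follows. This is a minimal illustration rather than the code used in the manuscript: the function names `translate` and `six_frame_translate` are hypothetical, the standard genetic code is assumed, and any codon containing an ambiguous base is rendered as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation table is needed for these reads).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return the three forward frames followed by the three reverse frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    return [
        translate(strand[offset:])
        for strand in (forward, reverse)
        for offset in range(3)
    ]
```

Each of the six peptides would then be searched against the model database and only the highest-scoring hit kept for the read.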

mkdir temp_output
The Python code below implements the cross-validation procedure.
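As an illustration of the leave-one-out structure described in the text, here is a minimal sketch; it is not the original listing. The names `simulate_reads` and `leave_one_out`, the default read length and the number of reads per domain are all hypothetical. The `classify` argument is a caller-supplied function that is assumed to build a model database from the training domains and return one predicted class per read; in practice it might wrap the HMMER programs (for example `hmmbuild`, `hmmpress` and `hmmscan`), which are not invoked here.

```python
import random


def simulate_reads(protein, read_len, n_reads, rng):
    """Draw n_reads random substrings of length read_len from a protein.

    If the protein is no longer than read_len, the whole sequence is used.
    """
    if len(protein) <= read_len:
        return [protein] * n_reads
    starts = [rng.randrange(len(protein) - read_len + 1) for _ in range(n_reads)]
    return [protein[s:s + read_len] for s in starts]


def leave_one_out(domains, classify, read_len=40, n_reads=10, seed=0):
    """Leave-one-domain-out cross-validation.

    domains:  {domain_name: (class_label, protein_sequence)}
    classify: callable(training_domains, reads) -> list of predicted labels,
              one per read (hypothetical interface, supplied by the caller)
    Returns a list of (held_out_domain, true_label, predicted_label).
    """
    rng = random.Random(seed)
    results = []
    for held_out, (true_label, protein) in domains.items():
        # Rebuild the training set without the domain being tested.
        training = {k: v for k, v in domains.items() if k != held_out}
        reads = simulate_reads(protein, read_len, n_reads, rng)
        for predicted in classify(training, reads):
            results.append((held_out, true_label, predicted))
    return results
```

Keeping the classifier behind a single callable means the held-out logic can be checked with a toy classifier before any models are built.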

Summarise results
We now collect the results and take the best hit of each read to the HMMER model database as its classification. We can then compare the accuracy of the classification at the sub-domain, domain and Type level.
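The best-hit selection and per-level accuracy can be sketched as below. This is a hedged illustration, not the original analysis code: the function names are hypothetical, hits are assumed to be available as `(read_id, model_name, score)` tuples, and sub-domain names are assumed to follow a `DBLa0.1`-style pattern in which the text before the dot names the domain. A mapping from domain to Type would have to be supplied by the caller in the same way as `domain_level`.

```python
from collections import defaultdict


def best_hits(hits):
    """Keep the highest-scoring model for each read.

    hits: iterable of (read_id, model_name, score) tuples.
    Returns {read_id: model_name}.
    """
    best = {}
    for read_id, model, score in hits:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (model, score)
    return {read_id: model for read_id, (model, _) in best.items()}


def domain_level(name):
    """Collapse a sub-domain name to its domain (assumed 'DBLa0.1' format)."""
    return name.split(".")[0]


def accuracy_by_class(truth, predicted, level=lambda name: name):
    """Fraction of reads classified correctly, per true class, at one level.

    truth, predicted: {read_id: sub-domain name}; reads with no hit count
    as misclassified. `level` maps a sub-domain name to the label compared.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for read_id, true_name in truth.items():
        true_label = level(true_name)
        total[true_label] += 1
        pred = predicted.get(read_id)
        if pred is not None and level(pred) == true_label:
            correct[true_label] += 1
    return {label: correct[label] / total[label] for label in total}
```

Calling `accuracy_by_class` once with the default `level` and once with `level=domain_level` gives the sub-domain and domain comparisons from the same set of best hits.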

Sub-domain level
Although we do not make use of the sub-domain classification, it is included here for interest. Because the sub-domains are often very similar, classification accuracy with short reads is lower, at 60.6% on average. This is low in part because some domain classes have very few domains from which to build the HMMER model. Consequently, removing a domain in the cross-validation procedure can have a large impact on its respective HMMER model.

Domain level
At the higher domain level (DBLa0, DBLa1 and DBLa2) the accuracy is much better, owing to the greater differentiation between the domain classes, giving an accuracy of 99.3%, 98.6% and 75.8% respectively.

Figure S2. Cumulative diversity curves within and among all six sentinel study sites in Uganda, plotting the number of var DBLα sequences sampled against the number of observed var DBLα types. A DBLα type was defined by a 96% sequence identity threshold and corresponds to a unique DBLα sequence. Line colors are as follows: Black = All Sites; Green = Kanungu; Red = Jinja; Orange = Kyenjojo; Taupe = Arua; Blue = Tororo; Purple = Apac.

Figure S3. Cumulative diversity curves within and among all six sentinel study sites in Uganda, plotting the number of var DBLα sequences sampled against the number of observed var DBLα types for A. upsA DBLα types and B. upsB/upsC DBLα types. Line colors are as follows: Black = All Sites; Green = Kanungu; Red = Jinja; Orange = Kyenjojo; Taupe = Arua; Blue = Tororo; Purple = Apac.

Figure S4. Frequency distribution of the proportion of pairwise comparisons within particular PTS score ranges within A. Kanungu, B. Jinja, C. Kyenjojo, D. Arua, E. Tororo, F. Apac and G. among all sites. The upsA proportions are shown in lighter color tones and the upsB/upsC proportions in darker color tones.

Figure S5. Proportion of type sharing (PTS) of upsA and upsB/upsC DBLα types between each of the six Ugandan sentinel sites, compared with the distance between the sites using A. bird-flight distance and B. road distance.

Table S1. Var DBLα richness estimates by sentinel study site.

Table S2. Distances between sentinel study sites.