In silico characterisation of putative Plasmodium falciparum vaccine candidates in African malaria populations

Genetic diversity of surface exposed and stage specific Plasmodium falciparum immunogenic proteins pose a major roadblock to developing an effective malaria vaccine with broad and long-lasting immunity. We conducted a prospective genetic analysis of candidate antigens (msp1, ama1, rh5, eba175, glurp, celtos, csp, lsa3, Pfsea, trap, conserved chrom3, hyp9, hyp10, phistb, surfin8.2, and surfin14.1) for malaria vaccine development on 2375 P. falciparum sequences from 16 African countries. We described signatures of balancing selection inferred from positive values of Tajima’s D for all antigens across all populations except for glurp. This could be as a result of immune selection on these antigens as positive Tajima’s D values mapped to regions with putative immune epitopes. A less diverse phistb antigen was characterised with a transmembrane domain, glycophosphatidyl anchors between the N and C- terminals, and surface epitopes that could be targets of immune recognition. This study demonstrates the value of population genetic and immunoinformatic analysis for identifying and characterising new putative vaccine candidates towards improving strain transcending immunity, and vaccine efficacy across all endemic populations.

www.nature.com/scientificreports/ clonal laboratory strains, mostly Plasmodium falciparum strain NF54 (clone 3D7), without considering parasite genetic variability in natural populations [8][9][10][11][12] . RTS,S/AS01, the most advanced of these malaria vaccines, is based on a subunit of 3D7 circumsporozoite surface protein (CSP) antigen. It has shown moderate (< 50%) and time-limited protection in African children 13 , partly due to the low prevalence (5.2%) of 3D7-type CSP alleles 9 . Several other recombinant malaria vaccines are currently being evaluated in clinical trials, such as: EBA region II-non-glycosylated (EBA-175 RII NG), apical membrane antigen 1 diversity covering (AMA-1 DiCo), merozoite surface protein-3 (MSP3) long synthetic peptide (LSP), and serine repeat antigen-5. Other recombinant malaria vaccines in clinical trials are described in a recent review by Salamanca et al 10 . Additionally, the Cell-Traversal protein for Ookinetes and Sporozoites (CelTOS), Thrombospondin Related Adhesion Protein (TRAP), and Liver Stage Antigen 1 (LSA1), which target liver stage parasites; merozoite surface proteins (MSP2) which target blood stage parasites; and transmission blocking stage vaccines targeting Pfs230, Pfs45/48, Pfs25 and Pfs48/45 11,14,15 are also in advanced development. Nevertheless, given our incomplete understanding of the complexity of immune responses to infection, the parasite's mechanisms of invasion, and the parasite genome [16][17][18][19] , most of these vaccine candidates would face most of the challenges observed during the RTS,S/ASO1 development. To overcome some of these difficulties and improve vaccine efficacy by generating cross-protective multi-strain immunity across diverse natural populations, next generation vaccines must use innovative design approaches that account for genetic variables such as differences in antigen allele haplotypes and their frequencies across P. falciparum populations 1,20 .
Exploring the allele frequency spectrum and haplotype diversity of vaccine candidates can inform how natural selection and demographic events shape them across multiple populations. For example, characterising genetic diversity and signatures of selection in the parasite genome or for specific antigens has been useful in identifying genes under balancing selection and genic regions that are immunogenic for further development [20][21][22] . With the increase of next generation sequencing and openly accessible genome data from multiple endemic populations, future multi-allelic vaccines can benefit from genome scans for signatures of balancing selection. This will allow assessment of allele and haplotype frequencies across a broad range of endemic populations, particularly in sSA where parasite genetic diversity is high 22 . Such in silico diversity and prediction approaches should include sequence data obtained across multiple temporal and spatial populations. This could be explored to identify new candidates, refine components of current candidate vaccine antigens, with optimal components of relevant alleles from both pre-erythrocytic and erythrocytic targets prior to functional characterisation and clinical trials 15 .
Here, we conducted population genetic and immunoinformatic analysis of selected malaria candidate vaccine antigens using single nucleotide polymorphisms (SNP) from field isolates from 16 African countries. We estimated the extent of genetic diversity in these antigens and evaluated the evidence of balancing selection that could be shaping their evolution. From this large population dataset, we described a novel putative candidate vaccine antigen-phistb-that could be explored for monovalent vaccine development or as part of a multivalent vaccine. Together, this is the largest exploration of candidate vaccines' genetic diversity for natural populations of malaria parasites from sSA, providing a framework for designing future functional studies and effective malaria vaccines.

Methods
Ethical considerations. This study utilized data previously published by the Plasmodium Diversity Network Africa (PDNA) 23,24 and the open-source, MalariaGEN (https:// www. malar iagen. net/ proje cts/ pf3k). For the PDNA data, field studies were approved by the respective local ethical review committees. Venous blood (2-5 ml) was collected from study participants aged > 6 months enrolled under approved protocol(s) in each country. All participants and/or their legal guardians provided written informed consent before any study procedures. All procedures performed in studies involving human participants were in accordance with the Declaration of Helsinki. Samples were leukocyte depleted and extracted DNA sequenced by MalariaGEN at the Wellcome Sanger Institute.
To utilize PDNA and the open-source data in the current analysis, this study was elaborated as part of a malaria genomic surveillance proposal reviewed and approved by the Gambian Government/MRC Joint Ethics Committee. The methods employed in this study were in agreement with the MRCG-LSHTM research governance policy.
Vaccine candidates. A total of 16 Plasmodium falciparum vaccine antigens were identified and analysed; among them, 10 (msp1, ama1, rh5, eba175, glurp, celtos, csp, lsa3, Pfsea and trap) were included as potential vaccine antigens. The remaining 6 antigens (conserved chrom3, hyp9, hyp10, phist, and surfin8.2, surfin14.1) were identified using Rsb (a metric for comparing integrated extended haplotype homozygosity between populations), the cross population test for signatures of selection 23    www.nature.com/scientificreports/ Data analysis. Nucleotide diversity analysis and neutrality test. Population genetic analyses were carried out on each candidate vaccine dataset to investigate the genetic diversity as well as the frequency of known haplotypes being incorporated into vaccine candidates. The range and distribution of genetic diversity within and among the natural P. falciparum populations from individual countries were determined. Population genetic parameters were determined using PopGenome R package 26 . Nucleotide diversity (π) (average number of nucleotide differences per site in pairwise comparisons among DNA sequences) and theta (Ɵ) (population mutation rate) was measured for each candidate vaccine antigen by country. Nucleotide diversity was calculated both for the entire coding sequence of each antigen and in sliding windows to determine regions of increased diversity. Haplotype diversity (Hd) which represents the probability that two randomly sampled alleles are different was estimated per antigen and population. Neutrality tests designed to distinguish between neutrally evolving sequences under mutation-drift equilibrium-population expansion and contraction, and sequences evolving under non-neutral processes including directional or balancing selection, were derived. We determined Tajima's D, Fu & Li's F* and D* statistics. Tajima's D uses the frequency of segregating nucleotide sites, while Fu's F* uses the distribution of alleles or haplotypes. Both tests are based on the principle that a rapid population expansion associated with a non-neutral process will show a shift in the allele frequency spectrum compared to a neutral Wright-Fisher model consistent with population expansion under neutral evolution. For all the candidate vaccine antigens, Tajima's D was estimated by country using vcftools 27 for the entire antigen region or in consecutive sliding windows of 100 bases. Positive Tajima's D values generally suggest balancing selection, or a population contraction is acting to maintain alleles at intermediate frequencies; negative values suggest purifying or positive selection, or population expansion that results in an excess of rare alleles.

Linkage disequilibrium (LD).
To determine correlation between antigen alleles, pair-wise LD between different polymorphic sites was computed based on the genotype correlation coefficient (r 2 ) index between alleles at physically separate loci with a minor allele frequency > 1% across all populations using PLINK v1.90b6.4 28 . Within each population, r 2 measures for each antigen were calculated between all pairs of single nucleotide polymorphisms (SNP) and observed pair-wise LD (r 2 ) was averaged for each inter-SNP distance. Decay of LD with physical distance between SNP loci was fitted in R version 3.6.
Population differentiation among sampling populations. To assess geneflow between populations, we first estimated genetic differentiation (i.e. the difference in the average diversity within populations compared to that among populations) expressed as Wright's fixation index using whole SNP dataset for each vaccine candidate 29 . The fixation index (F ST ) was estimated for pairs of populations using the hierFstat package 30  Haplotype cluster network analysis. Genetic relationships between antigen haplotypes of P. falciparum from the 16 African countries were constructed using the median joining algorithm in the R packages Pegas and Ape 31 . The haplotype network was computed using haplotype pairwise differences as a distance measure, estimated from the number of SNP variants between haplotypes.
B-cell epitope prediction for vaccine candidates. The random forest algorithm trained on epitopes annotated from antibody-antigen protein structures was employed in predicting linear B-cell epitopes for 11 candidate vaccine antigens with at least 3 SNPs, by employing BepiPred-2.0 at a threshold of 0.5 32 . Sequences of these antigens were uploaded on to the server in FASTA format and B-cell epitopes returned as outputs. We further compared the B cell epitopes predicted with segments of the antigens' sequences with elevated positive Tajima's D values. This allowed us to map contiguous regions in the proteins which are likely under immune selection and could be included in future designs of multivalent malaria vaccines.
three-dimensional structure and membrane anchor prediction for Phistb. The amino acid sequence of the candidate vaccine antigen was submitted to I-TASSER for in silico prediction analysis using the default settings 33 . The predicted model with the highest C-score (range: 0-1), used for indicating confidence in the predicted models, was selected and epitopes and regions under balancing selection highlighted. Transmembrane helices and glycophosphatidylinositol (GPI) anchors were predicted by uploading phistb sequence into the TMHMM prediction servers and GPI-SOM respectively 34 .
Statistical analyses. Spearman's rank nonparametric correlation coefficient was used to measure the correlation between the three neutrality tests applied. Mann-Whitney U non-parametric test was used to assess statistically significant differences between groups assuming a non-Gaussian distribution. Statistical analysis was performed using GraphPad Prism version 8.0. Additionally, statistical comparison of genetic indices for antigens based on region of origin following stratification of populations into two geographical areas: West and Central Africa (Cameroon, Congo, Cote d'Ivoire, Gabon, Gambia, Ghana, Guinea, Mali, Mauritania, Nigeria, and Senegal) versus East Africa (Ethiopia, Kenya, Madagascar, Malawi and Tanzania). This stratification was based on previous reports by the PDNA describing the clustering of African malaria parasites into subpopulations 23  Polymorphism and haplotype diversity. The lowest median number of haplotypes per country was 3 for pfsea, and msp1, while the highest was 110 for surfin8.2 (Table S1). All candidate vaccine antigens were initially subjected to nucleotide diversity and F ST analyses. The average pairwise nucleotide difference per site estimated as π and Ɵ for the pre-erythrocytic (celtos, csp, lsa3 and trap) and erythrocytic (msp1, ama1, rh5, eba175, glurp, chrom3, hyp9, hyp10, phistb, Pfsea, surfin8.2 and surfin14.1) antigens showed large nucleotide diversity when aggregated across all countries. For the pre-erythrocytic antigens, pfsea had consistently lower π and Ɵ values of 0.00002890 and 0.00009980 respectively ( Table 1). The erythrocytic antigens hyp9 had the lowest diversity values (π = 2.538 × 10 −4 ; θ = 8.066 × 10 −4 ) whilst hyp10 had the highest (π = 0.016099; θ = 0.011532) ( Table 1). In general, pre-erythrocytic antigens were less diverse than erythrocytic antigens. Nevertheless, when disaggregated by country, the π for pre-erythrocytic antigens ranged from as high 0.0175342 for csp (Gabon) to as low as 0.0000075 for pfsea (Mauritania). For the erythrocytic antigens, the π value was as high as 0.0189404 for hyp10 (Gambia) while the lowest values was 0.0000952 for hyp9 (Cote d'Ivoire), (Table S1). Across all countries, we observed relatively higher genetic diversity in csp (except for Ethiopia) and hyp10. Nevertheless, pre-erythrocytic antigens lsa3 and pfsea had low (0.058-0.096) nucleotide diversity across all countries. Such low levels of diversity were also observed across geographical sites for erythrocytic antigens rh5, hyp9, and glurp (majority of sites). Interestingly, we observed a relatively higher nucleotide diversity at rh5 in east African sites (Kenya, Malawi and Tanzania) and the island of Madagascar (Table S1). The haplotype diversity index (Hd) values across all antigens and populations were mostly > 0.4 with the exception of pre-erythrocytic antigen pfsea, (Hd ranging from 0.05 to 0.29) and erythrocytic antigen hyp9 (0.017-0.153) which were low across all populations (Table S1). Candidate vaccine antigens ama1, eba175 and glurp from Ethiopia had lower Hd values than in the other African countries. A similar pattern was observed for pre-erythrocytic antigens. However, it is important to note that a small number of samples were analysed from Ethiopia. To identify any differences within Africa, we compared Hd values from the two subregions, namely West and Central Africa versus East Africa, and found significant differences for lsa3 (P = 0.01), hyp10 (P = 0.0005), phistb (P = 0.0005) and rh5 (P = 0.0032). In our subsequent analysis, antigens with limited SNP information were excluded, which led to removal of five antigens (pfsea, hyp9, hyp10, conserved chrom3 and rh5) from Tajima's D and LD analysis. The remaining 11 antigens were subjected to Tajima's D, LD and linear B-cell epitope and 3D-structure prediction.   www.nature.com/scientificreports/ vaccine antigens across malaria endemic populations in sSA ( Table 2). Comparison of Tajima's D values for the candidate vaccine antigens analysed by subregions revealed that for most of the antigens there were no statistically significant differences (P > 0.05) between East versus West and Central Africa. This is further evidence of standing variation in these antigens across malaria endemic communities in sSA. As expected, there was strong correlation between Tajima's D and Fu & Li's F* values for all the antigens across the populations studied (r = 0.882). However, correlation was much lower between Tajima's D and Fu & Li's D* (r = 0.03), with some antigens, notably glurp, celtos and phistb, showing a negative correlation, while Fu & Li's D* and F* had a weak correlation (r = 0.2) for the antigens except for glurp where Tajima's D was negative but Fu and Li indices were positive (Fig. 2). Tajima's D becomes negative when there is an excess of rare alleles resulting in more pairwise diversity than the number of segregating sites.
Haplotype frequency and cluster network. The frequency of haplotypes for all antigens and details of haplotypes of each antigen per country are given in Table S3. To further understand the relatedness of the www.nature.com/scientificreports/ haplotypes across our populations, haplotype networks were constructed on selected vaccine candidates already under investigation in clinical trials; 2 pre-erythrocytic (celtos and csp) and 3 erythrocytic (ama1, msp1 and phistb) antigens. The pre-erythrocytic antigens had a high number of haplotypes with no geographic clusters, suggesting lack of structure. Haplotype network of the erythrocytic antigens followed a similar pattern to the pre-erythrocytic antigens, except for phistb where geographic clustering of some haplotypes was evident (Fig. 3).

Linkage disequilibrium. Linkage Disequilibrium (LD) declined rapidly with physical separating pairs of
SNPs for all antigens, falling below an r 2 of 0.05 within 1000 base pairs (Fig. 4). Exceptionally, LD values were higher for Ethiopia, with r 2 > 0.1 across most pairs of SNPs not significantly decaying with physical distance for most antigens. This observation may be underscored by variance in malaria transmission dynamics, population migration, demography and recombination potential within the SNPs in the genes analysed.
Antigenic potential and 3D structure of phistb. Epitope scanning for linear B cell epitopes on the antigens revealed several epitopes. Some of the regions with immune epitopes overlap with regions of high Tajima's D, indicative of immune selection. The list of the B cell epitopes for the vaccine candidates and regions mapping with high Tajima's D are provided in Table S4. We further explored the immunogenicity of phistb, which had only 2 predominant haplotypes (Table S3), and evidence of genetic structure between geographical parasite populations. The TMHMM server predicted one transmembrane helix between amino acid positions 54 and 73, and the likelihood of a signal peptide (Fig. S3). Phistb also has a post-translational modification-glycophosphatidyl inositol (GPI) site in its C-terminal region. GPI is used by proteins for anchoring to the plasma membrane (Fig. S4). Following 3D structure modelling of Phistb the linear B-cell epitopes predicted overlapped with regions with high positive Tajima's D values (Fig. 5).

Discussion
We carried out a large-scale population genetic analyses of candidate vaccine antigens and hitherto uncharacterised potential antigens using SNP data of P. falciparum isolates (n = 2375) from 16 African countries. Most advanced malaria vaccine candidates were initially designed without considering genetic diversity, resulting in poor protective efficacy as shown by FMP1/AS02A MSP1 or the AMA1 candidates that had limited vaccine allele representation in naturally circulating parasites, especially in Africa 11,35,36 . Therefore this focused analysis on African parasite populations is relevant, given intense but heterogenous transmission, and high parasite genetic diversity across sub-populations 3,4 . By screening for antigens with signatures of directional selection, we expected to identify those with dominant haplotypes and exposed immune epitopes, that could be included into the panel of candidates currently under development. We observed significant diversity amongst characterised vaccine candidates and new putative antigens across all populations, and this may result in poor efficacy both within clinical trials and in future deployment 37 . As immunity could be driving balancing selection signatures against specific domains of the antigens, our scan for signatures of selection localised these segments, with sequences of known immune epitopes that could be targeted for functional evaluation. However, antigens under high immune selection pressures and balancing selection are often highly polymorphic with moderate minor allele frequencies, likely to result in variance in allele specific immune response between populations 38 . This applies also to well-known erythrocytic vaccine candidates based on ama1, msp1 and eba175, in which variation and selection has been a major hurdle in achieving broad efficacy across malaria endemic populations 35,37,39 . Like msp1 and ama1, other erythrocytic antigens such as surfin 8.2 and surfin 14.1 showed signatures of both directional and balancing selection, the latter, also due to an excess of intermediate frequency variants that may result in vaccine escape. The surfin family of proteins are expressed by genes close to the sub-telomeres where other surface variant antigens (pfemp1, rifin, stevor) are encoded 5 . These are expressed on the surface of infected erythrocytes and are known targets of protective immunity and generating heterologous antibodies associated with reduction in febrile illness due to malaria infection 6 . Surfin8.2 had the highest number of segregating sites and has been shown to be preferentially expressed in gametocytes 40,41 . Being a PEXEL (Plasmodium export element) negatively exported protein with a transmembrane domain 42 , it will be exposed to the surface and can therefore be a target for transmission blocking vaccine development. Another transmission blocking vaccine, glurp, is in advanced development alone or as a multivalent component with other erythrocytic candidates such as msp3 (GMZ2) 43 . Its lower Tajima's D value and smaller number of haplotypes could provide an advantage for combination with other candidates such as surfin8.2 to provide a broad range of erythrocytic protection and transmission blocking activity.
While erythrocytic antigens dominate the panel of malaria vaccine candidates in development, the only successful vaccine to date is the CSP-based RTS,S/AS-01. As shown already, the CSP gene is highly diverse, one of the factors contributing to reduced efficacy across populations. We assessed diversity at four other pre-erythrocytic candidates, celtos, lsa3, pfsea and trap. The least polymorphic one was the schizont egress antigen pfsea, whose antibodies have been associated with protection against high parasitemia and severe disease 44,45 . It is under preclinical exploration in combination with invariant carboxyl of PfGARP and PfMSP1 in a tri-valent vaccine formulation. As previously described, lsa3 also has a low number of haplotypes and is expressed on both sporozoites and the erythrocytic stages [46][47][48] . Thus it could also be considered as a blood stage vaccine candidate given its antibodies inhibit parasite growth in the erythrocyte 48 . Its further development against both erythrocytic and pre-erythrocytic stages will have to consider the African haplotypes described here. This also applies to one of www.nature.com/scientificreports/ the recent prominent candidates, rh5, which, despite a low number of haplotypes, had 12 non-synonymous SNPs and high between population differentiation determined by the fixation index F ST . Strong differentiation between eastern versus western African parasite populations would imply that population specific variation in haplotypes need to be considered for rh5 vaccine development. Our findings also support previous studies which demonstrated that the number of haplotypes were consistently higher in countries with higher malaria transmission such as Cameroon, Mali and Malawi [49][50][51] . Therefore, continuous genetic screening for low diversity candidates to add to the pool of current antigens that can be combined in a multivalent malaria vaccine remains important. We demonstrated the power of this genetic analysis by exemplifying one of the low diversity candidate antigens, phistb. PHIST proteins belong to the Plasmodium helical interspersed subtelomeric (PHIST) family made up of 65 gene members (PlasmoDB database, www. plasm odb. org) and are unified by possessing a single domain termed PHIST that is predicted to be composed solely of alpha helices 52,53 . This family of exported proteins are conserved across the human Plasmodium species; P. falciparum, P. vivax and P. knowlesi 52 . As we identified a transmembrane domain and GPI anchors, we expect phistb to be surface anchored 54 . Previous studies have described phistb to localise to the host erythrocyte periphery through a PRESAN domain and an N-terminal sequence, and therefore exposed to immune interaction. Unsurprisingly, segments of the antigen had high Tajima's D values, indicating balancing selection probably due to immune selection. These segments mapped to predicted B-cell epitopes which could be included in future candidate vaccine designs, preclinical testing and for selection of an optimal vaccine cocktail. The low genetic diversity, limited haplotypes in the population and linear B-cell epitopes of phistb are desirable features for designing an antimalarial vaccine that is less likely to produce allele-specific immune responses 2,50 . Recent studies with in vitro expressed phistb protein demonstrated www.nature.com/scientificreports/ the presence of significant specific anti-phistb antibodies in children, with malaria specific immune responses 55 . This further supports predictions from genetic analysis and epitope mapping in line with other studies that have identified potential immune targets by modelling protein structures and predicting the functional relevance and implications of different domains as vaccine candidates 56,57 . PHIST proteins are involved in remodelling infected red blood cells 54 . However, data on host antimalarial immune responses to these proteins, which could be crucial for vaccine development, is scarce. Despite the relevant finding on top candidates like rh5 and new potential candidates like phistb, our study has several limitations. Samples were collected mostly in one field site per country, and thus they may not provide an accurate representation of the circulating haplotypes across each country. In addition, the number of samples available from some countries such as Madagascar, Ethiopia and Nigeria, were small. Moreover, protocols for sample collection differed by country, including the time of sampling, which may influence the parasite haplotypes due to seasonality. The analysis was limited to high quality SNPs, which could have missed important genetic variations; and there is also the possibility of ascertainment bias in the dataset that have been utilised in this study, and more broader sampling for suspected vaccine candidate genes would be useful. However, despite these limitations, the analysis was comprehensive since we used a large dataset and provide a framework and basis for future studies and considerations. To verify our findings, functional studies and expanded genetic analysis incorporating targeted sequencing data are needed to further support our findings.
In conclusion, despite the broad genetic diversity of antigens of African P. falciparum isolates, it may be possible to identify the most prevalent antigen haplotypes within the continent for inclusion in a multivalent vaccine. Our findings demonstrate population genetic analysis can be used to identify antigens that consider African parasite genetic diversity for design of new vaccines.  www.nature.com/scientificreports/