Proteome-wide mapping of immune features onto Plasmodium protein three-dimensional structures

Humoral immune responses against the malaria parasite are an important component of a protective immune response. Antibodies are often directed towards conformational epitopes, and the native structure of the antigenic region is usually critical for antibody recognition. We examined the structural features of various Plasmodium antigens that may impact on epitope location, by performing a comprehensive analysis of known and modelled structures from P. falciparum. Examining the location of known polymorphisms over all available structures, we observed a strong propensity for polymorphic residues to be exposed on the surface and to occur in particular secondary structure segments such as hydrogen-bonded turns. We also utilised established prediction algorithms for B-cell epitopes and MHC class II binding peptides, examining predicted epitopes in relation to known polymorphic sites within structured regions. Finally, we used the available structures to examine polymorphic hotspots and Tajima’s D values using a spatial averaging approach. We identified a region of PfAMA1 involving both domains II and III under a high degree of balancing selection relative to the rest of the protein. In summary, we developed general methods for examining how sequence-based features relate to one another in three-dimensional space and applied these methods to key P. falciparum antigens.

Immunity against clinical malaria develops naturally following repeated exposure, with antibodies known to play a key role in this process 7,8 . Within a naturally exposed population, immune selection pressure on the malaria parasite helps drive the occurrence of high-frequency polymorphisms on key malaria antigens. The development of a humoral immune response requires recognition of antigen in its native state. As a result, antigen structure plays a large role in the determination of epitopes for a humoral immune response. In other words, immune selection pressure driven by antibody-antigen interactions also occurs at the level of three-dimensional (3D) protein structure. Thus, examination of polymorphic regions in the context of protein 3D structure may help illuminate particular structural regions that are important targets of natural immunity. A number of studies have explored the relationship between protein structure and immune responses within Plasmodium species, including work on AMA1 from various species [9][10][11][12][13] , CSP [14][15][16] , EBA-175 17 , MSPDBL2 18 and MSP2 19 . The majority of these studies have examined the location of polymorphic residues on a protein structure for single antigens, which likely arise as the result of immune selection pressure on particular epitopes. Polymorphisms can also arise as a result of T-cell driven selection pressure, as has been described for key T-cell epitopes within the C-terminal domain of CSP 20,21 . Other tests of immune selection pressure include Tajima's D, which can help identify departure from a neutral model of selection 22 . A number of studies have examined Plasmodium proteins under immune selection pressure (balancing selection) using a sliding window approach 9,10,[23][24][25][26] , although all of these studies examine Tajima's D in the context of the linear sequence and do not consider the spatial proximity of residues (i.e., residues that are distant in the linear sequence may be proximal in the 3D structure). Here, we incorporate residue spatial information into measures of immune pressure, using both known and modelled protein structures. We demonstrate that the consideration of protein structural information can give extra insights into the regions of a protein under immune selection pressure.
In summary, we show that polymorphic residues within P. falciparum are usually surface exposed and are enriched within secondary structure turn elements. Predicted B-cell epitopes are also typically located on highly surface exposed regions. In contrast, predicted MHC class II binding peptides are generally buried within the core of a protein, and do not seem to overlap with polymorphic residues to a significant extent, which suggests that high frequency polymorphisms are more likely driven by humoral immune responses rather than cellular immunity. Antibodies often recognise discontinuous epitopes, therefore it is important to consider the spatial arrangement of residues when examining antigenicity. Accordingly, we incorporate structural information into a modified Tajima's D test, and assessed two polymorphic vaccine candidates, EBA-175 and AMA1. We identified strong signatures of balancing selection for a discontinuous region of PfAMA1 bordering domains II & III.

Identification of matching PDB structures.
For each Plasmodium species examined, matching PDB structures were identified using a BLAST search against the PDB database, with an e-value cut-off of 10.0. A sequence identity threshold >90% was used, normalized to the length of the shorter sequence in the comparison. The NCBI-blast+ 2.3.0 package was used for all BLAST searches 29 . Redundancy in PDB structures was removed using a sequence identity cut-off of 90% to group similar structures using precomputed sequence similarity clusters available on the RCSB PDB database (http://www.rcsb.org/pdb/). region for each gene was selected based on the 3D7 reference strain, and the corresponding multiple sequence alignment used for a sliding window calculation of Tajima's D. A window size of 102 base pairs (bp) and step size of 3 bp was used unless otherwise specified. For calculation of Tajima's D with incorporation of protein structural information (referred to as spatially derived Tajima's D), the relevant PDB file sequence was aligned to the 3D7 reference strain, and a radius of 15 Å used to extract surrounding residues for each central residue, with Tajima's D calculated using codons for these surrounding residues. For a detailed description, see the BioStructMap documentation at https://github.com/andrewguy/biostructmap. Sequences for calculation of Tajima's D were generated using polymorphism data obtained from PlasmoDB. Polymorphic sites were included on the proviso that at least 50 of the 65 sequences had a reliable base call at that position (percentage isolates with a base call ≥ 76%), in line with the original study 24 . A read frequency threshold of 70% and a minimum read depth of 5 was used to identify reliable base calls.
Calculation of relative solvent accessibility and secondary structure. Relative solvent accessibility (RSA) was calculated using the DSSP program (accessed via BioPython) 35 , using the maximum accessible surface area (ASA) values from Rost & Sander 36 . When considering a two-state definition of solvent accessibility, an RSA threshold of 20% was used, similar to approaches taken in previous studies 37 . Note that this threshold is very close to the median RSA value for the proteins considered in this study, effectively splitting the dataset in two. Secondary structure assignment was also performed using the DSSP program using the eight DSSP secondary structure classes 35  Prediction of membrane-proximal regions. To assess the potential impact of membrane proximity, transmembrane regions were predicted using TMHMM v2.0 46 . Known and predicted GPI-anchored proteins were obtained from Gilson et al. 47 , and the position of the GPI omega site within these proteins was predicted using PredGPI 48 . Structured regions were considered to be proximal to a transmembrane region or GPI anchor if they were within 10 residues of either.
Visualisation of protein structures. Protein structures and features mapped on them were visualised using PyMol 49 ; feature values were saved into the B-factor column of a PDB file, and visualised using the spectrum command.
Data analysis and statistics. The majority of data analysis was performed using the Anaconda distribution of Python 3.5. Data was stored in an SQLite database, and accessed using Python SQLAlchemy. Plotting was performed with the Python Matplotlib package, version 1.5.1 50 . Statistical analysis was performed using SciPy 51 .

Results
A Python package for spatial mapping of features onto PDB structures. Plasmodium structures identified in the PDB were used for a series of spatial mapping analyses using our BioStructMap Python package. When dealing with protein sequence-level data, such as sequence polymorphisms or measures of evolutionary selection pressure, it is common practice to apply a function over a sliding window on the linear sequence. However, spatial proximity of residues within a 3D folded protein structure is often important, motivating an application of a "3D sliding window" across a protein structure as implemented in BioStructMap (Fig. 1). We were particularly interested in assessing immune-mediated selection pressure that occurs as a result of antibody recognition of a dominant epitope. Such selection pressure often applies to multiple residues within the epitope. Importantly, these residues are often not continuous within a linear sequence; therefore, consideration of protein structural information may enhance identification of regions of immunological importance. In this setting, the selection of a radius for calculation of adjacent residues should be reflective of the potential binding surface for an antibody. While the interaction dimensions for antibody-antigen interaction surfaces vary depending on the sequence of both antigen and antibody, the mean maximum paratope dimension were estimated as 28 Å 52 or 30 Å 53 . We have used the upper of these two estimates when choosing a radius of 15 Å for all spatial averaging presented in this paper. Using data on sequence polymorphism from a study in The Gambia 24 , we mapped known polymorphisms onto P. falciparum crystal structures, correlating the occurrence of polymorphisms with residue surface exposure 36 , average hydrophilicity 38 and hydrophobicity 39 , and predicted B-cell epitopes 41 and MHC class II binding peptides 42,43 . We used the BioStructMap package to identify polymorphic hotspots within regions with known or modelled structure. Finally, we used this tool to include spatial information into a calculation of Tajima's D and applied this to key antigens. Overview of the structural mapping with spatial averaging approach as used in the Python BioStructMap package. For any given residue within a PDB structure, all residues within a specified radius of the given residues are identified. The location of these residues within a given reference sequence is then found, with the assumption that user-provided data will be aligned to this reference sequence. Using selected residues and the corresponding subset of user-provided data, a function is called, returning (usually) a numerical value. For example, this function may return the mean of the respective data. Note that the provided data and the mapping function may take a diverse number of forms. This includes functions which apply some statistical test over a multiple-sequence alignment of genetic sequences (e.g. Tajima's D). In this case, the function would apply the statistical test over the subset of codons which code for the selected residues. The returned value is then assigned to the original residue in the PDB structure. This process is repeated for all residues within the PDB structure. Results can be viewed as a heatmap displayed over the PDB structure.

Overview of experimental protein structures within Plasmodium species.
To determine the number of proteins from various Plasmodium species that have structures in the PDB, we performed a BLAST search of proteins from major human, primate and murine Plasmodium spp. against a database of all PDB sequences. BLAST matches were grouped according to experimental technique (X-ray diffraction, NMR, other) and redundancy across PDB files was accounted for by clustering redundant PDB sequences using a 90% sequence identity threshold. For Plasmodium falciparum 3D7 sequences, there were a total of 368 non-redundant structures (total of 1,111 PDB files), equating to a total of 275 proteins (5.1% of the proteome) with at least one known structure in the PDB ( Table 1). The majority of these structures were determined using X-ray crystallography. Similarly, for P. vivax Sal-1 sequences, 234 non-redundant structures (810 PDB files) were found, covering a total of 157 proteins (2.8% of the proteome).
We examined the set of matching PDB structures for cross-species homology (Fig. 2). When comparing three human malaria species (P. falciparum, P. vivax, P. knowlesi), there was a large core of 146 non-redundant structures (596 PDB files) common to all 3 species, with another 206 non-redundant structures (490 PDB files) that matched only proteins in P. falciparum, while only 60 non-redundant structures (136 PDB files) matched proteins in P. vivax alone. For the 3 rodent malaria species examined (P. yoelii, P. chabaudi, P. berghei), 133 of the 155 non-redundant structures (602 of the 635 PDB files) matched proteins in all 3 species. Similarly, there was a large overlap between P. falciparum and P. reichenowi, with only 23 of the 372 non-redundant structures (58 of the 1127 total PDB files) matching proteins in only one of these two species. When considering all 7 species examined here, 81.2% (388 of 478) of non-redundant structures that matched proteins in any Plasmodium sequence also matched proteins in two or more Plasmodium species.
To determine the number of structured regions that were proximal to either transmembrane regions or GPI anchors, we predicted transmembrane regions using TMHMM v2.0 46 and extracted known and predicted GPI-anchored proteins from Gilson et al. 47 . Only 5% (13 of 275) of proteins with known structures had structured regions within 10 residues of predicted transmembrane domains or GPI omega sites. Given the minimal number of membrane proximal structures identified, we have not treated membrane proximal regions differently in the following analysis.
Polymorphic residues are predominantly surface-exposed. We first correlated the occurrence of non-synonymous single nucleotide polymorphisms with the relative solvent accessibility (RSA) of the corresponding amino acid residues; RSA is a proportional measure of exposure of a particular amino acid residue to bulk solvent. Based on a few known examples 10,54 , it was expected that polymorphic residues that evolved to evade humoral immune responses would be located predominantly on the surface. We indeed observed that polymorphic residues generally had some level of solvent exposure, with higher overall RSA values than residues that contained no underlying nucleotide polymorphisms (Fig. 3). Polymorphic residues had a significantly higher median RSA value of 0.47 than the background median RSA level of 0.20 (p < 0.0001, Mann-Whitney U test). Residues with underlying synonymous SNPs did not have significantly different RSA values to the background distribution (p = 0.79, Mann-Whitney U test). Immunologically relevant polymorphisms are expected to be maintained at a high frequency within a population. A minor allele frequency (MAF) threshold of 5% is commonly used to distinguish between high-and low-frequency polymorphisms [55][56][57][58] . Correspondingly, polymorphic residues with a MAF ≥ 5% had significantly higher RSA than all polymorphic residues (p = 0.04, Mann-Whitney test), with a median RSA of 0.52.
Average hydrophilicity and hydrophobicity in relation to polymorphic residues. A number of epitope prediction methods have utilised amino acid residue propensity scales based on residue biophysical properties such as hydrophobicity and hydrophilicity in an attempt to predict targets of adaptive immune We considered here the effect of average residue hydrophobicity and hydrophilicity within a 15 Å radius on polymorphism location, but observed only a small increase in hydrophilicity and corresponding small decrease in hydrophobicity ( Figure S1a,c). Given that average hydrophilicity and hydrophobicity are correlated weakly with RSA, it is likely that the small shift in average hydrophilicity and hydrophobicity is reflective of the much larger shift observed for solvent accessibility between polymorphic and non-polymorphic residues.
To explore this hypothesis, we re-ran the spatial averaging algorithm and restricted it to surface exposed residues (RSA ≥ 0.2), showing only a small, albeit significant, difference between polymorphic and non-polymorphic residues (Figure S1b,d).
Polymorphic residues occur preferentially in turns. We also investigated whether or not polymorphic residues are more likely to occur in any particular secondary structure motifs. When considering all polymorphic residues, we observed a significant reduction in polymorphic residues within β-strand elements (E), from 19% to 11% (p = 0.006) ( Figure S2). When considering only those polymorphisms with MAF ≥ 5%, we observed a large reduction in the proportion of residues within β-strand elements (E) (p = 0.0009), from 19% to 6%, alongside a large increase in the proportion of residues within turns (T), from 9% to 22% (p < 0.0001).  Figure S3). Although a number of tools exist for predicting B-cell epitopes based on structural data 63,64 , we wished to assess the utility of a state-of-the-art method that only uses the linear protein sequence as input, as this will be applicable to the many Plasmodium proteins with unknown structures. As we have previously utilised BepiPred 1.0 in a study examining disordered proteins in P. falciparum 6 , we provide a comparison to the older BepiPred 1.0 40 algorithm here ( Figure S4). Our subsequent analysis was  only performed on proteins that contained at least one high frequency polymorphic residue within a region of known structure, as it was impractical to run the full proteome through the BepiPred 2.0 online interface. Epitopes predicted with BepiPred 2.0 were predominantly surface exposed, with increasing BepiPred thresholds predicting increasingly surface exposed residues ( Figure S3a). BepiPred 2.0 epitope scores were significantly higher for polymorphic residues (p < 0.0001; Mann-Whitney U test), even when restricting analysis to surface exposed residues (RSA ≥ 0.2) (p < 0.0001; Mann-Whitney U test) ( Figure S3b). At the default threshold, 50% of non-polymorphic surface-exposed residues fell within predicted epitopes, compared to 75% of polymorphic surface-exposed residues. We also examined the location of predicted MHC class II binding peptides, specifically looking at MHC class II haplotypes, again using those that have been shown to be present at high frequency within the Gambian population: HLA-DPA1*02:01-DPB1*01:01 and HLA-DQA1*05:01-DQB1*03:01 44 . MHC class II binding peptides were observed to be predominantly buried within the core region of proteins, with lower overall surface exposure than the set of non-binders ( Figure S3c, d). Additionally, polymorphic residues were found at lower frequency within MHC class II binding peptides compared to non-binding regions. For the HLA-DQA1*05:01-DQB1*03:01 haplotype, 0.20% of residues involved in high-binding peptides were polymorphic (MAF ≥ 5%), compared to 0.74% of residues not involved in a predicted MHC binding peptide. Similarly for the HLA-DPA1*02:01-DPB1*01:01 haplotype, 0.21% of residues involved in high-binding peptides were polymorphic (MAF ≥ 5%), compared to 0.71% of residues not involved in a predicted MHC binding peptide. Thus, MHC class II binding peptides are typically buried within the hydrophobic core of proteins, and do not appear to be significant targets of immune selection pressure.

Prediction of B-cell epitopes and
Comparative structural modelling of Plasmodium falciparum proteins. To extend the structural coverage, the entire proteome of the P. falciparum (3D7 strain) was subjected to comparative structural modelling using ModPipe 32 (https://salilab.org/modpipe/) based on structures in the PDB. A MPQS threshold of 1.1 was used to filter out low-quality models before further analysis. A total of 1575 reliable models were created, covering 923 proteins or 17% of the proteome. The majority of the models in the filtered dataset covered the entire length of the corresponding P. falciparum 3D7 protein sequence (median coverage = 95%, mean coverage = 87%) ( Figure S5). We then used both known and modelled structures to identify polymorphic hotspots within P. falciparum.
Identification of polymorphic hotspots. Polymorphic regions have often been thought to relate to potential antigenicity of malaria proteins, with antigenic diversity a contributing factor to parasite evasion of host immune responses 65,66 . With this in mind, spatial averaging of polymorphisms was performed to identify regions of proteins that have clusters of high-frequency polymorphisms and are hence likely to be under some level of immune selection pressure. Given that polymorphic residues tend to be surface exposed, we have restricted this Figure 3. Polymorphic residues within known P. falciparum structures are predominantly surface exposed. Relative solvent accessibility (RSA) is shown for residues with and without identified polymorphisms. RSA represents the proportional surface area of a residue that is exposed to solvent, relative to the maximum possible exposure for that amino acid. RSA was calculated using the maximum accessible surface area (ASA) values from Rost & Sanders 36 . Box-and-whisker plots show median (red line) and interquartile range (box) of residue RSA values for each group. Violin plots show the smoothed distribution of RSA values for each group (violin plots employ a Kernel Density Estimation to compute an empirical probability distribution for each group). Polymorphic residues shown are both those with underlying non-synonymous SNPs regardless of allele frequency (n = 204), and those with underlying non-synonymous SNP with a minor allele frequency (MAF) ≥ 5% (n = 105). The majority of residues in the dataset did not have underlying polymorphisms (n = 28,869). Sequence polymorphisms were obtained from 65 Gambian isolates from Amambua-Ngwa et al. 24 , accessed via PlasmoDB. Polymorphic residues had significantly higher RSA values than the background RSA levels (p < 0.0001, Mann-Whitney U test), and polymorphic residues with a MAF ≥ 5% had significantly higher RSA than all polymorphic residues (p = 0.04, Mann-Whitney test). analysis to surface exposed residues with RSA ≥ 0.2. Polymorphic hotspots on P. falciparum structures were identified using the set of all PDB crystal structures that matched proteins in P. falciparum (Table S1), and the set of all modelled structures with MPQS > 1.1 (Table S2). Protein polymorphisms from the Gambian population were used, and a MAF threshold of 5% employed to restrict polymorphisms to those with some immunological relevance. A threshold of either 10% or 20% of surrounding residues being polymorphic was used to identify regions of protein structure that are particularly polymorphic. For both experimental and modelled sets of structures, most proteins identified are known antigens, and include AMA1, CSP, TRAP, PfEMP1, DBL-MSP2, MSP1 and EBA-175. Mapping of polymorphisms and predicted B-cell epitopes and MHC class II binding peptides on these structures is shown in Figures S6-S10 & Figs 4 and 5. We have used known structures where there are no missing residues, and modelled structures when the relevant experimental structures have unresolved residues. Although the density of polymorphic residues differs between antigens, all proteins examined here have regions of particularly dense polymorphisms that in most cases overlap with B-cell epitopes predicted using BepiPred 2.0. In many cases, the most polymorphic regions are surface exposed protrusions.
Key parameters mapped to PfAMA1 & EBA-175. From the proteins identified above, we focus here on two major vaccine candidates for P. falciparum malaria: Apical Membrane Antigen 1 (AMA1) and Erythrocyte Binding Antigen 175 (EBA-175) Region II (RII) (Figs 4 and 5). A number of crystal structures exist for PfAMA1, but they only encompass domains I and II. Previous studies suggest that domain III may also be a significant target of protective antibody responses 10,67 , and hence we used a modelled structure of all three domains to calculate polymorphic hotspots and spatially derived Tajima's D values. There are two models for PfAMA1 generated by ModPipe with MPQS > 1.1; one of these models uses a PfAMA1 structure as a template and hence does not include domain III, whereas the other model uses a PvAMA1 structure as a template and includes all three domains. However, this second model fails to accurately place a loop (S345-Y397; 3D7 sequence) that is unresolved within the PvAMA1 template but resolved within known PfAMA1 structures. Previously published work has manually modelled the full domains I-III of PfAMA1 using a combination of P. falciparum and P. vivax templates, and we have used this model for examining PfAMA1 10 . While the crystal structure for EBA-175 RII has been solved 68 , we have used a ModPipe homology model for analysis here as this model includes a number of residues that are unresolved in the experimental structure. These unresolved residues occur in loops that are no more than 4 residues long, and as such the model is likely not to be inaccurate.
For PfAMA1, the majority of polymorphic residues are surface exposed, with most polymorphisms falling within Domain I (Fig. 4a). The most polymorphic region (Fig. 4b) is a highly surface-exposed loop formed by residues T194 -D212, which has been termed the C1-L cluster 69,70 . Potential B-cell epitopes were predicted using BepiPred 2.0 (Fig. 4c). The predicted epitopes (especially at higher thresholds) predominantly fell on highly surface-exposed regions. In contrast, predicted MHC class II binding peptides were predominantly buried within the core of the protein, especially for predicted high-affinity peptides (Fig. 4d,e). For MHC class II haplotypes with high frequency within the Gambian population (e.g., the HLA-DPA1*02:01-DPB1*01:01 haplotype), 50% of residues involved in predicted high-binding peptides were located in the core of the protein (RSA < 0.2), while 57% of residues involved in low-binding peptides were also buried. For HLA-DQA1*05:01-DQB1*03:01, 83% and 43% of residues from predicted high-and low-binding peptides, respectively, were buried in the protein core (47% of all residues are buried).
When considering EBA-175 RII polymorphisms (Fig. 5a), two main hotspots were identified, with both being surface-exposed loops (Fig. 5b). These two loops are located on opposing faces of the structure, with one loop within the F1 domain (residues N252-V266) and the other within the F2 domain (residues S432 -N442). The F2 surface loop is involved in the formation of a two-strand antiparallel β-sheet between identical residues (N433 -H436) upon EBA-175 RII dimer formation 68 . This region is in the center of the RII dimer, with residues K439 and K442 also likely involved in binding to glycans in the glycophorin A receptor. The F2 β-finger (residues C476-C488) that is the target of inhibitory monoclonal antibodies R217 and R215 17,71 was also identified as a polymorphic hotspot, although to a lesser extent. Similar to PfAMA1, predicted B-cell epitopes were located on highly surface-exposed regions of EBA-175 RII (Fig. 5c) and include the known F2 β-finger epitope and an epitope (H303 -Q315) that is the target of the R218 monoclonal antibody 72 . Conversely, MHC class II binding peptides were mostly buried (Fig. 5d,e). For the HLA-DPA1*02:01-DPB1*01:01 haplotype, 70% of residues involved in predicted high-binding peptides were located in the core of the protein (RSA < 0.2), while 58% of residues involved in low-binding peptides were also buried. For HLA-DQA1*05:01-DQB1*03:01, 50% and 65% of residues from predicted high-and low-binding peptides respectively were buried in the protein core (44% of all residues are buried). In summary, a number of highly polymorphic surface exposed regions were identified for both PfAMA1 and EBA-175 RII, which in most cases overlapped with predicted B-cell epitopes. In contrast, predicted MHC II binding peptides were predominantly buried within the core of the protein.

Incorporating protein spatial information into a genetic test for immune selection pressure.
To assist in the identification of regions of protein under immune selection pressure, we developed a modified calculation of Tajima's D that includes protein structural information using the spatial averaging approach introduced earlier in this study. We compared this approach to application of a standard sliding-window over the linear sequence to assess whether our spatially derived Tajima's D calculation improved the ability to detect sites under immune selection pressure. These two methods were applied to PfAMA1 and EBA-175 (Fig. 6). For AMA1, there is evidence for balancing selection within DI when calculating Tajima's D using a traditional sliding window approach (Fig. 6c), as has been observed previously in other populations 10 . In contrast to the linear sliding-window approach, the new spatial averaging approach reveals a surface exposed region on the border of DII and DIII as the area with the highest Tajima's D values; parts of DI also appear to be under balancing selection (Fig. 6a). As expected, the so-called 'silent face' of PfAMA1 had Tajima's D values that were negative or close to   zero. For EBA-175, a large portion of the F1 domain appears to be under balancing selection, as is a surface loop in the F2 domain (residues S432 -N442) (Fig. 6b,d). The region with highest calculated spatially derived Tajima's D values is contained within the F1 domain of RII, comprised predominantly of residues E266-D289, P314-Q322 and L382-L400. This site is also part of the dimerization interface formed between two molecules of EBA-175 RII as it binds to its glycophorin A receptor; during dimerization this site makes contact with the F2 domain of the other dimer pair 68 . It has previously been suggested that antibodies that block dimerization of EBA175 may negatively impact on glycophorin A engagement 71 , and antibodies that inhibit binding of EBA-175 to glycophorin A have also been shown to be associated with protection from clinical malaria 72 . Results from spatially derived Tajima's D and conventional sliding window Tajima's D analysis are similar to each other for EBA-175. This structure is predominantly alpha-helical, and sites with high Tajima's D values are mostly continuous stretches of protein sequence.
Previous analyses of the cross-reactivity of human antibodies, and vaccine-induced antibodies in rabbits, to different AMA1 alleles 73,74 and mutation studies of C1 residues 75 suggested that polymorphisms in the C1 cluster did not explain a large component of antigenic differences between alleles. Furthermore standard analysis of polymorphisms in linear sequences did not correlate highly with antigenic differences 74,75 . Therefore, additional approaches that consider structure may be required to yield further insights. Interestingly, our analysis of AMA1 incorporating structural considerations identified a site within DII/DIII of AMA1 that stands out as having a high spatially derived Tajima's D score. This site is composed of four distinct regions of continuous sequence (P303 -G313; L419 -I426; V437 -I454; D483 -F505), which together make up a surface exposed face of DII/DIII (Fig. 7). The spatial proximity of these regions is not accounted for when performing a sliding window analysis, and it is only when using spatial information that we observe the highest Tajima's D score of 2.39, compared to 1.84 for a linear sliding window calculation.

Discussion
This study has examined structured protein domains from P. falciparum and the importance of structural features in assessing potential targets of humoral immunity. Antigenic polymorphism data from a Gambian population was used to identify regions under immune selection pressure, with polymorphic residues found to be primarily surface exposed and enriched within turns as compared to other secondary structure elements. A number of studies have observed that polymorphism variation in buried residues is generally more detrimental to protein function than polymorphic variation in surface exposed regions [76][77][78] . This observation, coupled with antibody-driven selection pressure on surface-exposed residues, likely explains our finding that high-frequency polymorphic residues were highly surface exposed. Similarly, our observation that polymorphic residues were proportionally enriched within turns is supported by previous work examining the structural features of antigen-antibody interfaces [79][80][81] . These studies used a three-state definition of secondary structure (helices, sheets, loops), and observed an increased proportion of loop elements (turns, bends or coils) within epitopes; we only observed an increased number of polymorphisms within turns.
Prediction of B-cell epitopes is a challenging problem, with early attempts based on amino acid residue propensity scales only a little better than chance 82 . Despite many advances in recent years, the performance of current B-cell epitope prediction algorithms still falls behind that of predictors of other immunological features, such as MHC and T-cell epitope predictors 83 . With this in mind, the results obtained in our study are encouraging. The sequence-based BepiPred 2.0 algorithm 41 predominantly identifies surface exposed regions as potential epitopes, despite only using the linear protein sequence as an input. Additionally, when restricting our analysis to only surface exposed residues, ~75% of polymorphic residues were placed within predicted epitopes compared to ~50% of non-polymorphic residues (at the default threshold). Thus, we suggest that BepiPred 2.0 is an informative tool for initial selection of potential epitopes, particularly in the absence of a known protein structure.
Another approach to identifying potential epitopes explored in this study was the identification of polymorphisms clustered within a potential antibody binding radius. All of the proteins identified using this approach are known highly polymorphic antigens, including AMA1, PfEMP1, EBA-175, TRAP, DBLMSP2, CSP and MSP1 [84][85][86][87][88][89] . Additionally, our relatively conservative homology modelling approach did not yield additional proteins of interest when examining polymorphic hotspots, despite a 3-fold increase in the coverage of the proteome. Future work could extend the approach used in this study using structural models derived from other methods such as threading and fold recognition.
Other methods for determining likely immune targets within a pathogen include measures of balancing selection such as Tajima's D. Typically, Tajima's D is applied as either a single metric over a whole gene, or as a sliding window over the genome. A number of studies of malaria antigens have also applied Tajima's D as a sliding window over particular genes to identify regions of the protein under immune-mediated selection pressure 9,10,23-26 . However, for antibody-mediated selection pressure, a conformational epitope may contain residues that are distant in the linear protein sequence. Thus, we hypothesized that incorporation of structural information into a sliding window calculation of Tajima's D may improve detection of regions under immune selection pressure. We developed a Python tool (BioStructMap) for spatial averaging, and applied it to both PfAMA1 and EBA-175. While application of a spatially derived Tajima's D to EBA-175 did not yield any major differences compared to a standard linear sliding window method, we revealed a region in PfAMA1 bordering DII and DIII with a high spatially derived Tajima's D value that was not observed using a traditional sliding window (Fig. 6). Interestingly, this region is composed of four distinct segments of protein that combine to form a surface-exposed face, which explains why a linear sliding window method failed to identify this discontinuous epitope (Fig. 7).
Previous studies also suggest that DIII may be an additional target of humoral immune responses. In a comparison of Tajima's D between PfAMA1 and PvAMA1, the highest Tajima's D values for PfAMA1 were observed in DIII, in contrast to DI for PvAMA1 10 , suggesting that DIII may play a significant role as a target for protective immune responses against P. falciparum. It has also been observed that a monoclonal antibody (1E10) against PfAMA1 DIII acts synergistically with antibodies against other distant parts of the protein to inhibit merozoite growth, despite not having potent inhibitory capabilities on its own 67 . This suggests a potential role for antibodies targeting DIII in the context of a broader response against PfAMA1, despite an anti-DIII response not being inhibitory in isolation. Population studies examining antibody levels against PfAMA1 domains suggest that antibody responses against both DII and DIII are relatively rare 90 , however the recombinant protein constructs used in these studies fail to account for the considerable interaction between domains. Indeed, the region of PfAMA1 with the strongest signature of balancing selection in our study was composed of residues from both DII and DIII, and antibodies targeting this epitope may not be identified in assays that use recombinant DII or DIII constructs. Taken together, these studies support our observation that a region bordering DII and DIII of PfAMA1 may be an important target of humoral responses in the context of natural infection, as well as supporting the validity of our spatially derived Tajima's D approach.
In this study, we have developed an approach to identify structured regions of P. falciparum proteins that are likely under some level of immune selection pressure, showing that polymorphic sites are predominantly surface exposed and enriched within turns. We applied a spatially derived Tajima's D calculation to key antigens, identifying a region of PfAMA1 between DII and DIII that was under a high degree of balancing selection. These methods and accompanying results have utility in the identification of proteins under balancing selection, furthering our understanding of functional immune targets during malaria infection. These approaches are also broadly applicable to other pathogens.