Long-distance transmission patterns modelled from SNP barcodes of Plasmodium falciparum infections in The Gambia

Malaria has declined significantly in The Gambia and determining transmission dynamics of Plasmodium falciparum can help targeting control interventions towards elimination. This can be inferred from genetic similarity between parasite isolates from different sites and timepoints. Here, we imposed a P. falciparum life cycle time on a genetic distance likelihood model to determine transmission paths from a 54 SNP barcode of 355 isolates. Samples were collected monthly during the 2013 malaria season from six pairs of villages spanning 300 km from western to eastern Gambia. There was spatial and temporal hierarchy in pairwise genetic relatedness, with the most similar barcodes from isolates within the same households and village. Constrained by travel data, the model detected 60 directional transmission events, with 27% paths linking persons from different regions. We identified 13 infected individuals (4.2% of those genotyped) responsible for 2 to 8 subsequent infections within their communities. These super-infectors were mostly from high transmission villages. When considering paths between isolates from the most distant regions (west vs east) and travel history, there were 3 transmission paths from eastern to western Gambia, all at the peak (October) of the malaria transmission season. No paths with known travel originated from the extreme west to east. Although more than half of all paths were within-village, parasite flow from east to west may contribute to maintain transmission in western Gambia, where malaria transmission is already low. Therefore, interrupting malaria transmission in western Gambia would require targeting eastern Gambia, where malaria prevalence is substantially higher, with intensified malaria interventions.

Spatiotemporal genetic similarity of P. falciparum isolates. We identified 178 pairs of isolates with at least 95% SNP barcode similarity across the transmission season and between villages (Fig. 2, Supplementary  Fig. 3; heatmap of all pairwise distances). High proportions of dissimilar barcodes were evident within eastern villages (K, L and M), where malaria prevalence is higher ( Fig. 2A). In contrast, isolates from the A and B village pair in western Gambia and from villages G and F in central Gambia shared similar barcodes, suggesting similar clones transmitted between these sites. Genetic similarity (pairwise allelic similarity >95%) was also spatially hierarchical, strongest within households, then within village, and least within non-paired villages ( Fig. 2A.1). Infections collected at different months throughout the season were highly similar (Fig. 2B). However, temporal structure was weaker than geospatial structure as only 0.4% of isolates had pairwise barcode similarity beyond 95% within 30 days (Fig. 2B.1). This proportion dropped to 0.2% for samples with a sampling time difference beyond 60 days. Overall, pairwise genetic distance showed a weak increasing trend with days separating samples, only detectable for samples separated by more than 120 days ( Supplementary Fig. 4). The relationship between pairwise genetic distance, time and place of sampling was heterogeneous ( Supplementary Fig. 5), reflecting the hierarchical experimental design. Hudson's identity-by-descent (IBD) index detected stronger local pairwise IBD, which linearly correlated with genetic similarity, once the latter was below 55% across the barcode ( Supplementary Fig. 6). Given the low density of SNPs per chromosome, genetic distance was used as a measure of relatedness and analysis of the likelihood of transmission pathways for samples was restricted to a genetic distance of 30% or less. Beyond a genetic distance of 30%, pairs of samples can differ by a wide range of evolutionary distances (0.5-3.5 substitutions per sites), (Supplementary Fig. 7), probably due to recombination and other evolutionary events that may exaggerate genetic distances between isolates with a common ancestry.    The six processes indicated for host G A infecting host G B in a transmission event; together leading to a difference between the sampled genotypes accumulated over time. It is assumed individuals were bitten b days before they were sampled and genotyped. The periods b, T i and T B (blue lines) occur in humans, while T V and T d (green lines) represent the time for the vector to become infectious and to infect an uninfected individual, respectively. (B) Processes involved in a common source of infection in hosts G A and G B observed at two different times after original transmission event. T A and T B represent the human infectious time since the previous sampling. T i A and T i B are the times from a mosquito bite to establishment of peripheral parasites.
www.nature.com/scientificreports www.nature.com/scientificreports/ same households or within villages. The higher transmission in eastern Gambia resulted in more than half of all the pathways identified being within the eastern village clusters of J, K, L and M. Village K in the east had the highest number of within village paths. A primary objective of the study was to identify transmission pathways from sources arising in the extreme eastern and western village clusters. Figure 5a

Multiple infections from a single source-'super-infectors'.
To determine the presence of 'super-infector' , i.e. barcodes related to those of at least two subsequent infections in travel/contact constrained paths, we relaxed the bi-directional optimization used to identify maximum likelihood pathways. There were 13 super-infectors who could have been the source of at least 2 subsequent infections (Fig. 6). The highest 'super-infector' was linked to 8 possible infections. The higher transmission sites in the east, especially village K in the south bank, had 46% of individuals likely to be the source of at least 2 or more infections.

Discussion
Genotyping malaria parasites to determine relatedness between clones and assessing the temporal and spatial relationship among them has become a standard tool in malaria epidemiological research 15 . Such analysis can guide malaria control efforts by identifying genetically isolated Plasmodium falciparum populations, clusters of genetically similar clones maintaining local malaria transmission, those originating from outside a local transmission area or responsible for new local outbreaks in non-malaria regions 14,15 . We analysed variation across a panel of 54 SNPs in 355 P. falciparum infections, to characterise local populations and detect transmission paths across The Gambia. As previously shown in neighbouring Senegal, infection complexity was heterogeneous and lowest in the low transmission western regions close to the Atlantic coast 1,12 . Urbanisation in the western regions and probably the dominance of a less effective vector, Anopheles melas, contribute to the overall low transmission at these sites around salty river water or the Atlantic Ocean 16 . These low transmission regions included those with isolates having unique barcodes around the central regions and those with high levels of shared barcodes, like the more urban coastal towns. None or very few clinical malaria cases were reported in sites with unique P. falciparum barcodes 1 , where the isolates genotyped may include long-standing residual infections, which are not highly transmissible and not leading to clinical disease (at least some of them) 17,18 . Identifying and determining the presence of such infections is vital for malaria elimination and requires improved diagnostic approaches for targeted intervention. On the other hand, sites with multiple shared barcodes and highly similar parasite pairs in paired villages suggest clonal expansion similar to outbreaks after the introduction of infection(s). Such situations would require approaches able to detect and target infections "imported" into The Gambia, which is surrounded by Senegal, where there is still relatively high transmission in the southern-eastern regions 19 . These findings are interesting given that the overall village-scale analysis showed clustering of highly similar barcodes at household www.nature.com/scientificreports www.nature.com/scientificreports/ level, where the risk of clinical malaria is higher in households with at least one infected individual at the beginning of the transmission season 1 . Therefore, approaches such as reactive household interventions with drugs or other tools may target this infection reservoir and facilitate local elimination of malaria. Such interventions are already being tried and results may help further clarify our findings 20 . Further establishing the chain of transmission of such imported or local outbreaks could facilitate approaches towards elimination.
Transmission analysis can make use of genetic polymorphisms to link infectious pathogen isolates using genetic distance and phylogenetic approaches as applied for viruses and bacteria 21 . Analyses can be further constrained by known case-contact information and temporal-spatial distances to refine the likelihood of a transmission event between infection pairs. As previously shown in Kenya and The Gambia, variation between parasite genotypes increases with both time and geographical distance 4 . Intuitively, the likelihood of a transmission event should increase with decrease genetic distance between samples. However, direct application of genetic distances to detect malaria transmission events can be affected by several factors including; human migration, mutation and recombination of P. falciparum isolates from different geographical areas 22 . Transmission paths can be inferred by eliminating these evolutionary confounders of relatedness using probabilistic models of genetic distance within the linear range of the relationship with evolutionary distance 22 . Here we employed genetic distances within 30% (the linear range of correlation with evolutionary distance), reducing uncertainties in ancestry due to recombination. Moreover, Hudson's measure of identity-by-descent 23 correlated linearly within these genetic distances, increasing confidence in connectivity and recent ancestry between pairs in detected transmission paths 5 . Our novel model also combined two unobserved processes; within-host and within-vector parasite dynamics and duration of infection, with chronology of sampling and travel history of sampled individuals. Transmission paths were only restricted to sample pairs (source and infected) whose chronology satisfied the parasite's life cycle. This enabled the estimation of the direction of P. falciparum (a recombining eukaryotic infectious pathogen) flow across The Gambia. More than half of the identified person-to-person transmission pathways occurred within www.nature.com/scientificreports www.nature.com/scientificreports/ household and village, indicating substantial local transmission. Although some infections in the east may have originated from western Gambia, there was a higher flux of parasites from east to west, particularly at peak transmission. Therefore, interrupting transmission in western Gambia would require intensifying control efforts in eastern Gambia and probably southern and eastern Senegal, where malaria transmission is higher 19 .
The analysis here was limited by the relatively low proportion of samples successfully genotyped, especially during the early months of the transmission season and from the low transmission peri-urban regions near the Atlantic coast of The Gambia. Thus, latent transmission links could have been missed from low parasite density samples and missing genotypes within transmission chains. Future genotyping studies can be improved by higher resolution sampling and inclusion of genome amplification steps as demonstrated for whole genome sequencing 24 . These can further apply recent de novo genomic variants approaches to reconstruct transmission trees 25 . This new whole genome approach remains largely inapplicable to most African malaria settings because of the higher complexity of infections and the poor availability of genome sequences from dense temporal and spatial samples. These difficulties explain the small number of studies investigating the relatedness and transmission pathways between malaria infections, either at small or large geographical scale in sub-Saharan Africa. Thus, SNP barcodes will continue to provide cheaper and low-density data which can be combined with our approach for transmission studies.

conclusions
We have shown a strong spatial hierarchy between genetic relatedness of pairs of P. falciparum isolates within households and villages, and between pairs of villages in close proximity. While this represents the local transmission hierarchy, long range transmission from high to low malaria prevalence areas were also detected. Hence, genotyping P. falciparum isolates can be used to track pathways of infection, including discriminating between indigenous and imported infections. It would be helpful to start considering a more programmatic use of these and other transmission tracking approaches in support of national malaria control programs to better target control efforts.

Methods
Study sites and samples. Data were collected as part of a study carried out between June 2013 and April 2014 in six pairs of villages across The Gambia 1 . Six sites, each comprising a pair of villages, were selected. Each village pair was within 5 km and differed by prevalence of infection according to a survey carried out in November 2012 18 . Each village was coded with a letter, from A to M (Fig. 1). Monthly blood samples (Whatman filter paper blood spots) were systematically collected from all consenting residents between June and December 2013. In addition, a blood sample was collected from suspected clinical malaria cases. www.nature.com/scientificreports www.nature.com/scientificreports/ Genotyping summary population genetic statistics. DNA was extracted from 3 discs of dried blood spots on filter paper using automated DNA extraction system (Qiagen). Plasmodium spp infection was tested by diagnostic and speciation PCR 26,27 . P. falciparum positive samples were genotyped for 86 SNPs (Supplementary  Table 1) on the Sequenom MassARRAY iPLEX platform 28 at the Broad Institute, Boston, USA 29 . Genotyping assays followed standard amplification and MALDI-TOF mass spectrometry and allele calling pipeline. Following assay quality control, SNP calls were made based on assay intensity values (r) and allelic signal intensity ratios (theta). Alleles were either classified as homozygous with the reference (0), alternative (1), or heterozygous (intermediate values). The multilocus SNP genotype calls were aggregated for all samples and filtered to remove loci with less than 25% call rate and per sample missingness of more than 12 loci (22%). The number of retained SNPs from chromosomes 1-14 were 2, 4, 3, 6, 1, 2, 5, 4, 3, 2, 4, 2, 7, 9, respectively. The filtered dataset with maximum loci missingness of 22% included 355 subjects (32% of samples attempted) and 54 SNPs (62.7% of 86 SNPs), (Supplementary Table 2). The most densely genotyped villages occurred in the east, in October and November ( Supplementary Figs 1 and 2).
In case of mixed barcodes, the complexity of infection was determined following the model in McCOIL 30 . A genotype file representing the major alleles at mixed positions was generated and used to derive summary population genetic statistics (allele frequencies, observed and expected heterozygosity, multilocus linkage allelic association) in R statistical software (version 3.3.2). Allele frequency differences represented by average Fst over all pairs of villages was calculated with the hierFstat package. All further analyses for transmission paths were based on the "shared variant". We retained the rarer alleles (minor variant) for isolates where loci had mixed calls. This approach considers that the presence of less common variants in same physical position between 2 barcodes maximise the likelihood of link between the infection pairs and offer additional resolution in detecting true transmission paths 31 . Compared to alternative approaches in which the major variants or random choice of allele at mixed positions are used, this did not significantly bias allele frequency spectra and genetic distances between pairs of isolates.
Genetic barcode relatedness between P. falciparum infections. Genetic distance between isolate pairs was measured as p-distance, i.e. the number of substitutions (differences in identity by state) between two genotypes, expressed as the percentage of the number of SNP positions with alternative alleles between pairs of barcodes for all positions without missingness in the sample pair. Additionally, relatedness was inferred by calculating barcode identity by descent (IBD) using the hmmIBD algorithm 23 . Spatial distances were calculated as the geographic distance between sample sources (Km), based on their recorded latitude and longitude position. As The Gambia is uniformly low-lying, altitude was not considered in calculation of geographic distances. Temporal distance was expressed as the difference in days between sample collection dates. For all pairwise genetic distances, spatial and temporal differences were visualized as a 3D plot ( Supplementary Fig. 5). The spatial and temporal hierarchical structure of the study design was evident.
inferring transmission events. We applied a model that quantifies the likelihood of a genetic distance between pairs of infections, while accounting for the time between the pairs. As the true date of infection was unknown for sampled populations, the sampling date was used as a proxy. The primary objective was to identify the most likely person-to-vector-to-person transmission (referred thereafter as person-to-person) paths and the direction from source to infected. These paths were constrained by the unobserved parasite life cycle time within the vector and length of infection between human sampling times (Fig. 3). In the pathway between samples in a transmission link, T A represents the human infectious time since the previous sampling (Fig. 3A). It is assumed that individuals were bitten b days before they were sampled and genotyped. T V and T d represent the time for the vector to become infectious and to infect an uninfected individual, respectively. Their combination is constrained by the lifespan of the vector. After the infective bite, there are T i days for the infection to appear in the peripheral blood and T B days until a blood sample is collected to be genotyped. The observation-time difference between genotypes was therefore determined as; T V + T d + T i + T B − b, assumed to represent the evolutionary time between pairs of genotypes in a transmission path. Pairs of individuals may be linked to a common parent infection from the same vector or different vectors, with the observed genotypes only evolving from infection to observation-times. In this case, the duration between observation-times is T d + T i B + T B − T i A + T A, assumed to be the evolutionary time between genotypes (Fig. 3B). Considering the monthly sampling, T A and T B are assumed to vary between 0 and 30 days and b to vary between 0 and T A . T A and T B were simulated from uniform distributions with a range of 0-30 days. T V and T i were simulated from triangular distributions, with ranges of 8-15 days and a mode of 11.5 days, and of 5-15 days with a mode of 10 days, respectively, representing the time range for an infected mosquito to develop sporozoites and transmit to an uninfected individual. The vector lifespan was simulated as a triangular distribution with range of 18-24 days and a mode of 21 days, with T d simulated as lifespan -T v . Each model was simulated through 100,000 random sampling from the possible between-event time distributions. The distributions showed a clear overlap of the smoothed density of pairwise events for the same source and for the person-to-person transmission paths models. To increase the chance of identifying 'who infected who' , the observation times between related genotypes were restricted to lie to the right of the 97.5% of the times for same source distributions ( Supplementary Fig. 8). Thus, a range of 32 to 67 days are required between samples to observe a 'directional' transmission event.
SNP genotypes of 355 isolates were fitted with 88 possible evolutionary models using a maximum likelihood approach in IQ-TREE. The TVM (transversion model) + I (invariant sites) + G4 (4 regions of variable substitution rate) was the most appropriate model by the Bayesian Information Criterion. The evolutionary distances between all pairs of isolates for the optimal tree were calculated. The evolutionary distance saturated at a genetic distance of approximately 0.3 (p-distance adjusted for missingness), (Supplementary Fig. 7). For pairs of isolates with a genetic distance less than 0.3, the substitution rates were estimated by dividing the evolutionary distance with their observational time difference, assuming the time between source and recipient infection genotypes satisfied the person-to-person time constraint of at least 32 days. Ten thousand random barcodes of 54 SNPs were then evolved over 32 to 67 days via an MCMC model with TVM + I + G4 parameters from the optimal model with a bootstrapped sample of the estimated substitution rates. This resulted in a joint probability density (smoothed via 2-dimensional kernel density estimation) of genetic distance and the sampling time difference between isolates ( Supplementary Fig. 9). For a given time between observed infections, the bivariate distribution was used to estimate the probability of the corresponding genetic distance between sample pairs. If the genetic distance exceeded 0.3, the transmission probability was assigned a zero-value due to the non-uniqueness of relationship between evolutionary and genetic distance. This limits the effect of unexplained genetic distances on the transmission chain inferred. Additionally, only sample pairs with a pairwise (and/or) missingness of 12 or less SNPs were considered in transmission paths. Under these conditions, there was no evidence of missingness biasing the p-distance estimates of genetic distance.
To improve the accuracy of the modified evolutionary model, it was additionally constrained by travel information within 30 days before sampling, thus precluding inter-village pathways for individuals who did not travel. With a sampling interval of 30 days or greater, the potential source subjects were restricted to travel within the previous 30 days. As the time from biting to detectable infection is greater than 30 days, the recipient subjects were constrained to have travelled within the last 60 days. For subject pairs with timing and travel constraint (i.e. at least one of the pairs must have travelled) satisfied, the bivariate density was used to ascribe a probability to each possible transmission pair based on the observed genetic and temporal difference. The probability of a transmission event was zero for pairs with a predicted genetic distance outside the 95% confidence limits of the bivariate distribution. Pairwise probabilities were weighted by the probability of an observed temporal difference (based on the life cycle simulation), given that the genotype pair arose from a person-to-person transmission. A transmission probability matrix between each pair of isolates was built in this manner. To avoid ambiguity in source versus recipient pathways and vice-versa, the most likely transmission paths were chosen as the pairs with the highest probability from a maximization over columns in each row and over rows in each column of the transmission probability matrix. This resulted in a set of pathways where each source had a unique maximum likelihood for an infected subject and vice-versa. Sources with more than 2 paths to subsequent infections were classified as 'super-infectors' (Supplementary Fig. 10), obtained by optimising over the row space (sources) only.
Statistical analyses. Packages in R (version 3.3.2) were used to draw maps and calculate summary population genetic statistics. Evolutionary models and maximum likelihood phylogenetic trees were fitted using IQTREE (version 1.5.3). MCMC simulations, graphics and modelling of evolution and transmission likelihoods were performed in MATLAB (version R2016b). ethical approval. Written informed consent was obtained from all participants ≥18 years old and parents/ caregivers/guardians provided written informed assent for children aged 12-17 years following standardised good clinical research practice SOPs at MRCG-LSHTM. The study and protocols were approved by the Gambia Government/MRC Joint Ethics Committee (SCC 1318). All methods in this study were performed in accordance with the relevant guidelines and regulations enforce by the local EC and the MRCG-LSHTM research governance.

Data Availability
Multilocus SNP genotypes for Plasmodium falciparum isolates are available in Supplementary Table 2.