Introduction

Malaria, caused by Plasmodium parasites and transmitted by Anopheles mosquitoes, remains a pressing global health problem, with a mortality and morbidity burden heavily concentrated among children under five years old. The burden of Plasmodium falciparum malaria is predominantly concentrated in Sub-Saharan Africa, whereas that of Plasmodium vivax is most heavily felt in Asia and South America1. The complex co-evolutionary history between Plasmodium parasites, humans, and Anopheles mosquitoes is recorded within the genome of each organism, and genomic tools and data are of key importance for understanding the fundamental genetic underpinning of malaria, its geo-spatial distribution, and the control strategies needed to eliminate it. A rapidly growing number of P. falciparum and P. vivax isolates have undergone whole genome sequencing (WGS), and continued advances in genomic technologies are likely to accelerate the timely generation of datasets from clinical and surveillance blood samples to inform disease epidemiology and control.

The rich information contained in WGS data can be used to infer transmission patterns, detect drug resistance, and support wider malaria control initiatives and elimination strategies2,3. WGS data in combination with population genomic methods can detect selective sweeps associated with drug resistance and infer the geographic origin of infections, indicating whether an infection is imported or drug resistant and whether treatment should be adapted accordingly. It is known that malaria parasites have a population structure primarily based on geography4,5. Several informative molecular barcodes for speciation and geography have been developed2,3, but these barcodes have typically not used the whole genome because of the high dimensionality of the data and the associated computational cost3. However, machine learning (a subfield of artificial intelligence), with its ability to incorporate and analyse very large and high-dimensional datasets efficiently, appears well suited to geographic prediction from WGS data. Machine learning can be applied for classification, which concerns predicting a label (e.g., country, continental region), and regression, which involves predicting a quantity (e.g., longitude or latitude).

Machine learning has been applied effectively across a variety of problems in malaria research, including the detection of evolutionary selection associated with drug resistance6,7, the classification and detection of parasites in red blood cells8,9,10,11, and antimalarial drug discovery12. Deep learning is a subset of machine learning in which algorithms aim to extract and learn a series of hierarchical representations, often leveraging large amounts of data. The application of deep learning, and especially neural networks, has been explored within population genetics13,14, including for other pathogens15,16. Pioneering work has also shown that machine learning, including deep learning convolutional neural networks (CNNs), can be used to predict geographic locations from human, mosquito and P. falciparum genetic variation17, building on methods and the use of large genotyping chips or WGS for population structure assessment18,19. Here, we aim to expand the application of geo-prediction for malaria parasites by using a very large dataset of isolates sourced globally (P. falciparum, n = 5957, 27 countries; P. vivax, n = 659, 13 countries) across 11 regions (South East Asia (SEA), Southern SEA (SSEA), South Asia, South America, West Africa, Central Africa, South Central Africa, East Africa, Horn of Africa, Southern Africa, Oceania). We explore the potential of both regular machine learning approaches, which aim to learn representations from sequence and geographical data, and deep learning approaches, which aim to learn and extract layers of hierarchical representations of SNP combinations linked to geography. We compare four commonly applied approaches: two classification models, which predict locations and subsequently interpolate to specific coordinates, and two regression models, which predict coordinates directly. We also compare performance across geographies (countries), both when observations from a geography are included in the training set used to develop the models and when they are excluded from it.

Materials and methods

Processing of raw sequencing data

Publicly available raw Illumina (> 150 bp paired end) sequence data from previously published studies of P. falciparum and P. vivax were downloaded from the ENA repository (see S1 Table and S2 Table for accession numbers), accompanied by meta-data including locations of sampling (see S1 Table and S2 Table for latitude and longitude coordinates). The data included public raw sequence and GPS data from MalariaGEN projects (www.malariagen.net). Raw WGS data for P. falciparum (n = 5957) and P. vivax (n = 659) were aligned to the Pf3D7 (v3) and PvP01 (v1) reference genomes, respectively, using bwa-mem software (v0.7.12) with default parameter settings (e.g., concerning mismatch and sequence read clipping penalties; see http://bio-bwa.sourceforge.net/bwa.shtml). The samtools (v1.9) functions fixmate and markdup were applied to the resulting BAM files to call a set of potential variants20. For variant quality control, calibration assessments were performed using GATK’s BaseRecalibrator and ApplyBQSR functions, benchmarking against known high quality variants from genetic crosses for P. falciparum5,21 and previously curated datasets for P. vivax20. A revised set of SNPs and insertions/deletions (indels) was called with GATK’s HaplotypeCaller (version 4.1.4.1) using the option -ERC GVCF5,22. Variants were then assigned a quality score using GATK’s Variant Quality Score Recalibration (VQSR), and those with a VQSLOD score < 0, representing variants more likely to be false than true, were filtered out7,22. Additionally, SNPs were removed if they had more than 10% missing alleles7,22. The resulting dataset comprised parasite genomes of P. falciparum (5,957 isolates, 750 k SNPs) and of P. vivax (659 isolates, 588 k SNPs). The population structure was assessed using a principal component analysis (PCA) of between-isolate SNP differences. In parallel, ADMIXTURE analysis23 was performed to understand the composition of ancestral groups across geography, where the optimal number of groups (K) was established using cross-validation with values ranging between 1 and 20. This cross-validation analysis led to 10 ancestral groups for both P. falciparum and P. vivax (K = 10).
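The two site-level filters described above (VQSLOD score and allele missingness) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study, which relied on GATK; the function name, the record layout, and the use of None to mark a missing call are assumptions made for illustration only.

```python
def passes_site_filters(vqslod, genotypes, max_missing=0.10):
    """Return True if a SNP survives both site-level filters.

    vqslod:      VQSLOD score assigned to the site by VQSR.
    genotypes:   per-isolate allele calls; None marks a missing call
                 (hypothetical encoding, chosen for this sketch).
    max_missing: largest tolerated fraction of missing calls (10% in the text).
    """
    # Sites with VQSLOD < 0 are more likely false than true, so drop them.
    if vqslod < 0:
        return False
    if not genotypes:
        return False
    # Drop sites where more than max_missing of the calls are missing.
    missing_fraction = sum(g is None for g in genotypes) / len(genotypes)
    return missing_fraction <= max_missing


# Toy records (hypothetical identifiers and values).
sites = [
    {"id": "snp_a", "vqslod": 2.3, "genotypes": [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]},
    {"id": "snp_b", "vqslod": -1.2, "genotypes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]},
    {"id": "snp_c", "vqslod": 1.0, "genotypes": [0, None, 0, None, 1, 0, 0, 0, 0, 0]},
]
kept = [s["id"] for s in sites if passes_site_filters(s["vqslod"], s["genotypes"])]
```

In this toy example only the first site is retained: the second fails the VQSLOD threshold and the third has 20% missing calls.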

Statistical models and performance

Using machine learning (ML) and deep learning (DL) statistical models, the goal was to use SNPs to predict geographical source at a location (GPS), country, and regional resolution. We applied two standard models for classification at a country and region level: (1) penalized multinomial logistic regression classifier (LOG-C; ML); (2) CNN (CNN-C; DL). Subsequently, we used the predictive probabilities placed on different locations to perform a weighted interpolation between these locations and make predictions at the GPS coordinate level.

In particular, the final prediction location (longitude and latitude) was determined by a weighted average of classifier predictions, where weights are the probabilities placed by the model on each location.
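The weighted interpolation can be sketched as follows. This is an illustrative Python example rather than the study's code: the function name, the dictionary layout and the toy coordinates are assumptions, and the sketch takes a plain arithmetic average of latitude and longitude, which would need adjusting for candidate locations lying on either side of the 180° meridian.

```python
def interpolate_location(probabilities, coordinates):
    """Probability-weighted average of candidate coordinates.

    probabilities: dict mapping a location label to the classifier's
                   predictive probability for that label.
    coordinates:   dict mapping the same labels to (latitude, longitude).
    Returns the interpolated (latitude, longitude) pair.
    """
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Each candidate location contributes in proportion to its probability.
    lat = sum(p * coordinates[k][0] for k, p in probabilities.items()) / total
    lon = sum(p * coordinates[k][1] for k, p in probabilities.items()) / total
    return lat, lon


# Toy example with two hypothetical sampling locations.
probs = {"site_A": 0.75, "site_B": 0.25}
coords = {"site_A": (10.0, 100.0), "site_B": (20.0, 104.0)}
predicted = interpolate_location(probs, coords)
```

With these toy values the prediction falls three quarters of the way towards site_A, reflecting the higher probability placed on it by the classifier.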

We also applied two regression models for GPS coordinate prediction: (3) penalised linear regression model (LIN-R; ML); (4) CNN (CNN-R; DL). The LOG-C and LIN-R models were implemented in the sklearn Python package (https://scikit-learn.org) and tuned on the regularization strength C for the L1 penalty (LASSO) using stratified four-fold cross-validation (see below, S3 Table). The deep learning CNN architecture was implemented using the Keras library (version 2.2.4)24 in Python. Our CNN models had an architecture with a soft-max prediction layer and regularization through dropout25 to prevent overfitting and support transferability. The main model had one convolutional layer with 4 filters of size (40, 9), followed by two dropout and dense layers with ReLU activation (similar to ref. 17), and applied the Stochastic Gradient Descent algorithm for optimisation. We trained and validated the models for 1000 epochs. The parameterisation of the models is summarised in S3 Table. We created a stratified three-way split of the dataset (80% training, 10% validation, 10% test) for all models, and used the validation dataset to cross-validate parameters (S3 Table). The reported scores (accuracy, mean weighted distance error) were calculated by making predictions on the hold-out test set (see S3 Table for the final parameter set). In addition, we conducted a “leave-one-geography-out” analysis, where each single geography in the training dataset was omitted in turn, with the model trained on the remaining geographies, to understand generalizability towards previously unseen locations26.
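The “leave-one-geography-out” procedure can be sketched as a simple index generator. The Python example below is illustrative and is not the study's code: the function name is hypothetical, and it assumes that the omitted geography serves as the evaluation set for the model trained on the remainder.

```python
from collections import defaultdict


def leave_one_geography_out(geographies):
    """Yield (held_out, train_idx, test_idx) once per distinct geography.

    geographies: one geography label (e.g., country) per isolate.
    train_idx:   indices of isolates from every other geography.
    test_idx:    indices of isolates from the held-out geography.
    """
    by_geography = defaultdict(list)
    for idx, geo in enumerate(geographies):
        by_geography[geo].append(idx)
    # Iterate in sorted order so that the splits are reproducible.
    for held_out in sorted(by_geography):
        test_idx = by_geography[held_out]
        train_idx = [i for i, g in enumerate(geographies) if g != held_out]
        yield held_out, train_idx, test_idx


# Toy labels for four isolates.
labels = ["Ghana", "Peru", "Ghana", "Cambodia"]
splits = list(leave_one_geography_out(labels))
```

Each split trains on isolates from all but one geography, so the held-out isolates come from a location the model has never seen; by construction, a classifier cannot output the true label in this setting, and performance is therefore best judged through the distance error.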

Classification accuracy was determined after assigning predicted latitude and longitude pairs to individual countries. For the classification models, a mean (weighted) distance error was calculated using the Haversine method, which allows for (angular) distance calculations along a sphere, based on the difference between the actual and estimated location. The latter was determined by a weighted average of classifier predictions, where weights are the probabilities placed by the model on each location. The accuracy was calculated by comparing the predicted labels with those of the test data. For reference, the baseline accuracy of a naive prediction based on the most common country would be 18.8% for P. falciparum (Cambodia) and 24.3% for P. vivax (Thailand). For the regression models, the error was calculated using the Haversine method based on the difference between the predicted and actual latitude and longitude using angular distance.
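The distance error and the naive baseline described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's implementation: the function names are hypothetical, and the mean Earth radius of 6371.0088 km is an assumed constant that is not specified in the text.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def mean_distance_error_km(actual, predicted):
    """Mean Haversine distance between paired (lat, lon) tuples."""
    distances = [haversine_km(a[0], a[1], p[0], p[1])
                 for a, p in zip(actual, predicted)]
    return sum(distances) / len(distances)


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)
```

The baseline function returns the share of the most frequent label, which corresponds to the naive accuracies quoted above when applied to the country labels of each species.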

Results

Malaria isolate sequence data and population structure

Raw WGS data with accompanying geographic origin information were available in the public domain for P. falciparum (n = 5957, 27 countries) and P. vivax (n = 659, 13 countries) (Table 1), which represent the global distributions for each parasite. Most P. falciparum isolates were sourced from SEA (2,648, 44.5%), followed by West Africa (2,042, 34.3%) and East Africa (451, 7.6%). For P. vivax, most isolates were sourced from SEA (282, 42.9%), followed by South America (220, 33.4%) and SSEA (48) (Table 1). By analysing each species separately, high quality genome-wide SNPs were identified across the isolates (P. falciparum 750 k SNPs, P. vivax 588 k SNPs). Most SNPs had low minor allele frequencies (SNPs with MAF < 1%: P. falciparum 94.6%, P. vivax 77.6%) (S1 Figure). Most SNPs were in genic regions (P. falciparum 76.5%, P. vivax 54.3%), with a high proportion of non-synonymous (NS) amino acid changes (P. falciparum 63.0%, P. vivax 42.5%). The genetic diversity amongst P. falciparum isolates was relatively homogeneous across the 27 countries (SNP π: median 0.037, range 0.027–0.053), and lower in magnitude than that of P. vivax, for which data were sourced from 13 countries (SNP π: median 0.056, range 0.037–0.066) (Table 1).
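For readers unfamiliar with the MAF summary reported above, the calculation can be sketched in Python as follows. The sketch is illustrative only and makes several assumptions not stated in the text: SNPs are biallelic, each isolate carries a single call coded 0 (reference) or 1 (alternate), missing calls are marked None, and the function names are hypothetical.

```python
def minor_allele_frequency(calls):
    """MAF of one biallelic SNP from 0/1 calls; None marks a missing call.

    Returns None when no isolate has an observed call.
    """
    observed = [c for c in calls if c is not None]
    if not observed:
        return None
    alt_frequency = sum(observed) / len(observed)
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_frequency, 1 - alt_frequency)


def fraction_rare(snp_calls, threshold=0.01):
    """Fraction of SNPs whose MAF lies below the threshold (1% by default)."""
    mafs = [minor_allele_frequency(calls) for calls in snp_calls]
    mafs = [m for m in mafs if m is not None]
    return sum(m < threshold for m in mafs) / len(mafs)


# Toy data: one rare SNP (1 alternate call in 200) and one common SNP.
toy_snps = [[0] * 199 + [1], [0, 1, 0, 1]]
rare_share = fraction_rare(toy_snps)
```

In the toy data, the first SNP has a MAF of 0.5% and the second a MAF of 50%, so half of the SNPs fall below the 1% threshold.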

Table 1 Sample origin and SNP diversity by geographic location.

Unsupervised clustering methods were applied to the genome-wide SNPs of each species to reveal the extent of their population structure and linked (pseudo-)ancestral patterns. Principal component analysis (PCA) of P. falciparum and P. vivax isolates revealed the expected separation by continent, and clear evidence of population structure at both the regional and country level (Fig. 1). An analysis of population structure and ancestry using ADMIXTURE software23 determined the number of ancestral groups (P. falciparum K = 10, P. vivax K = 10), and their relative abundance for each isolate was estimated (Fig. 2). For P. falciparum, there were dominant ancestral groups across region and continent (Africa 4, SEA 4, Oceania 1, South America 1), with some evidence of mixture of ancestries (e.g., SEA isolates with 3 ancestral populations), but a general consistency within country. For P. vivax, the numbers of dominant ancestral groups by region differed from those of P. falciparum (South America 4, SEA 2, SSEA 2, East Africa 1, South Asia 1), owing to differences in sampling and Plasmodium species endemicity, such as the near absence of P. vivax in Africa. Overall, there was more homogeneity of ancestral groups within P. vivax isolates, with some groups broadly linked to neighbouring countries (compare with Fig. 1). These analyses confirmed that spatial-genomic clustering and classification are possible using WGS data.

Figure 1 Population structure using principal component analysis based on all high-quality SNPs. Axes show percentage of variation explained by each principal component (PC).

Figure 2 ADMIXTURE analysis involving 10 inferred ancestral populations (denoted as K1 to K10).

Application of geo-classification models

For P. falciparum, the predictive performance of the classification methods (LOG-C, CNN-C) was stronger than that of the regression models (LIN-R, CNN-R) in regional (Table 2) and country-wide (Table 3) analyses (mean distance error (km): LIN-R 470, LOG-C 93, CNN-R 245, CNN-C 77). For locations included in the training dataset, the accuracy of the classification models was close to 100% at the regional level, and close to 90% at the country level (S4 Table, S5 Table). The poorest performance of the models was for African populations; for example, the mean distance error for CNN-C was high in West African (267 km) and East African countries (117 km, especially Kenya and Uganda), as well as Malawi (530 km) (Table 3), compared to other regions. This observation is consistent with the complex ancestries in African populations (Fig. 2), as well as with another deep learning analysis17. As expected, when predictions were made for countries absent from the training data, the distance errors (km) were at least ~ five-fold larger (LIN-R 2246, LOG-C 1848, CNN-R 1983, CNN-C 1540), with the poorest predictions for Peru (Table 4). The best performing model in this setting was the CNN-C classifier (Fig. 3).

Table 2 Mean distance error (km) per model by region using geographies included in the training data.
Table 3 Mean distance error (km) per model on test data using those countries included in the training data.
Table 4 Mean distance error (km) per model on test data for unseen geographies.
Figure 3 Maps with predicted vs. actual locations for the best predictive models. Blue points are the actual locations in the dataset, red points are the predicted locations (where different to actual), and red lines link the actual and the predicted locations. CNN-C deep learning Convolutional Neural Network classifier. LOG-C penalised multinomial logistic regression classifier.

For P. vivax, the predictive performance of the classification methods (LOG-C, CNN-C) was also superior to that of the regression models (LIN-R, CNN-R) across regional (Table 2) and country-wide (Table 3) analyses (mean distance error (km): LIN-R 890, LOG-C 33, CNN-R 819, CNN-C 36). For locations included in the training dataset, the accuracy of the classification models was close to 100% at both the regional and country level, with the poorest performance in neighbouring China and Myanmar (S4 Table, S5 Table). The mean distance error for the countries not used in the development of the model was distinctly larger (km: LIN-R 1481, LOG-C 2508, CNN-R 2512, CNN-C 2405), with the poorest predictions for Ethiopia and Peru (Table 4). The best performing model in this setting was the LIN-R regression (Fig. 3).

Discussion

WGS data of Plasmodium parasites can be used to detect imported infections, drug resistance, and transmission patterns, thereby assisting decision making in clinical and malaria control settings. With the implementation of WGS gaining traction across health systems, there is an opportunity to implement statistical learning methodologies to assist surveillance activities. A clear use-case is the determination of the geographical origin of isolates, building on insights from previous work which shows that genomic data can be used to cluster parasites by geography2,3,4,5. Our work reveals that machine learning approaches, particularly those focusing on classification (e.g., deep learning CNNs), have the potential to accurately predict geographic locations at a GPS and country-level resolution. As expected, the performance was much stronger for isolates whose geographic origin was already represented at the country level in the dataset, demonstrating the need for WGS to be implemented more widely to fill country gaps in genetic diversity. The weakest predictions were for P. falciparum in West and East Africa, where common ancestries, mixed infections, movement of people, drug resistance and malaria endemicities can complicate genetic diversity analysis. The distance errors are similar to those of a previous machine learning analysis of P. falciparum (median < 20 km), which implemented a single deep learning approach on a smaller dataset17. Our CNN classification approach appeared to perform well across parasite species, was implemented with measures to minimise the effects of over-fitting, and is likely to improve in performance with greater isolate sampling and WGS data.

Whilst we have implemented a limited set of machine learning methods, there is scope to test alternative approaches (e.g., gradient boosted trees, support vector machines)16 or to further optimise our model parametrisations (beyond the default settings) to improve performance. For example, while L1-penalised regression approaches are generally quite competitive, stability selection on top of the LASSO generally leads to improvements27. Moreover, the resulting model is “white box” and yields a set of interpretable SNPs. CNNs are among the most widely used deep learning network types, and have been reported to outperform alternative approaches28. However, one limitation of CNN models is their “black box” nature: a complex architecture consisting of several layers makes it difficult, in our context and others17, to establish which (combinations of) SNPs are informative for geographical profiling. Other studies have used population genomic approaches to determine informative SNPs, with a focus on applying genotyping assays or amplicon sequencing for resource poor settings2,3. We provide computer code to implement the models, to assist future assessments in simulation or empirical studies. Future work should focus on the development of an online “geo-locator” tool that returns a predicted location, which can be assessed for its plausibility against the actual position, if known, and fed back into the model building and learning process. Such a framework could also be extended to integrate explicit drug resistance markers29, as well as genomic data for malaria vectors17, and to use sequences generated on portable and field deployable sequencing platforms (e.g., Oxford Nanopore Technologies MinION). Such tools would be of immediate value to malaria control programs in endemic countries, including those implementing elimination activities that wish to differentiate between locally acquired and imported infections. They would also assist countries with low malaria burden, including through the detection of imported parasites that could threaten malaria elimination targets.

In summary, our study has demonstrated that machine learning methods can play an informative role in determining the geographic origin of WGS isolates, thereby providing important insights for both control and surveillance activities. Further, such approaches will be scalable when WGS becomes routine and cost effective, with increasingly “big data” becoming available for decision making. The utility of this “learning” system will improve with time, as underlying methodologies and model performances improve with more data becoming available, and as they are implemented within informatic tools to assist surveillance and clinical decision making. This utility underscores the benefit of making sequencing data and linked geographical information publicly available to global databases in a more timely fashion to understand infection dynamics, the advantages of which have also been demonstrated by the COVID-19 crisis.

Conclusion

Advances in sequencing technologies are making real time genomics-informed surveillance and clinical management a reality. With the resulting big genomic datasets, our study has shown that machine learning methods, a subset of Artificial Intelligence, can accurately predict the geographical source of malaria parasites from sequence data. With greater geographical coverage and informatics infrastructure, such approaches will improve in performance and assist malaria control and elimination activities.