Selective whole genome amplification of Plasmodium malariae DNA from clinical samples reveals insights into population structure

The genomic diversity of Plasmodium malariae malaria parasites is understudied, partly because infected individuals tend to present with low parasite densities, leading to difficulties in obtaining sufficient parasite DNA for genome analysis. Selective whole genome amplification (SWGA) increases the relative levels of pathogen DNA in a clinical sample, but has not been adapted for P. malariae parasites. Here we design customized SWGA primers which successfully amplify P. malariae DNA extracted directly from unprocessed clinical blood samples obtained from patients with P. malariae-mono-infections from six countries, and further test the efficacy of SWGA on mixed infections with other Plasmodium spp. SWGA enables the successful whole genome sequencing of samples with low parasite density (i.e. one sample with a parasitaemia of 0.0064% resulted in 44% of the genome covered by ≥ 5 reads), leading to an average 14-fold increase in genome coverage when compared to unamplified samples. We identify a total of 868,476 genome-wide SNPs, of which 194,709 are unique across 18 high-quality isolates. After exclusion of the hypervariable subtelomeric regions, a high-quality core subset of 29,899 unique SNPs is defined. Population genetic analysis suggests that P. malariae parasites display clear geographical separation by continent. Further, SWGA successfully amplifies genetic regions of interest such as orthologs of P. falciparum drug resistance-associated loci (Pfdhfr, Pfdhps, Pfcrt, Pfk13 and Pfmdr1), and several non-synonymous SNPs were detected in these genes. In conclusion, we have established a robust SWGA approach that can assist whole genome sequencing of P. malariae, and thereby facilitate the implementation of much-needed large-scale multi-population genomic studies of this neglected malaria parasite. As demonstrated in other Plasmodia, such genetic diversity studies can provide insights into the biology underlying the disease and inform malaria surveillance and control measures.


SWGA enriches P. malariae DNA and increases WGS data coverage.
We performed SWGA using a designed primer set (denoted as Pmset1) consisting of five primers (see S1 Table) that preferentially bind the P. malariae genome (average binding sites located once every 2.9 kb within the P. malariae genome, compared to once every 45.1 kb in the human genome). For successful selective amplification it is essential that the binding sites are in close proximity in the parasite genome and spaced further apart in the human genome 35 . Using two test samples (PM_THA_001 and PM_THA_002), we demonstrate that Pmset1 successfully amplifies the P. malariae genome, allowing for higher quality WGS data in comparison to non-amplification (S1 Fig.). Whilst all four samples were sequenced at a similar depth, we observed that amplified samples have a significant increase in coverage, with a mean 18.6-fold increase in the percentage of the genome covered with ≥ 5 reads when compared to non-amplification (S2 Table). The increase in genome coverage seen with SWGA allows for greater detection of SNPs which can be used for downstream population genetics analysis. As a result, there was an 800-to 13,000fold increase in the number of callable SNPs detected in samples amplified using Pmset1 (S2 Table).
After validation of Pmset1, 17 additional clinical samples were amplified using Pmset1 and underwent WGS. One sample (PM_THA_009), with a low parasitaemia of 0.0016% presented with low coverage after the first sequencing run (27% genome covered ≥ 5 reads), this sample was re-sequenced, and the second run had Scientific RepoRtS | (2020) 10:10832 | https://doi.org/10.1038/s41598-020-67568-4 www.nature.com/scientificreports/ better results (44% genome covered ≥ 5 reads) (S2 Table). The two sequencing runs were combined to generate PM_THA_009com (52% genome covered ≥ 5 reads). Excluding the separate runs for PM_THA_009, and one sample with low genome coverage (PM_LBR_003), the remaining samples had an average of 67.4% (± 15%) of the genome covered by ≥ 5 reads (S2 Table). The coverage profile after amplification was uneven, as reported for other Plasmodia 32 , but generally, across all chromosomes, reaching coverage above the recommended cut off point for SNP calling (five reads or above) (Fig. 1). Coverage of the mitochondria was variable but consistently high in comparison to other chromosomes (mean: 26-fold coverage). The average chromosomal coverage of the two unamplified samples was much lower, with only 0.82% of the genome with a coverage ≥ 5 reads (S2 Fig). SWGA is dependent on the initial parasitaemia of a sample. To determine a potential limit of parasitaemia for WGS, a measure of genomic coverage was assessed in nine Thailand samples for which parasitaemia data was available (range of parasitaemia: 0.0004% to 0.2024%). We determine a parasitaemia limit using both microscopy estimates and cycle threshold (CT) values calculated using the qPCR method 36 . We plotted the CT values of each sample alongside the percentage of the genome that was covered by ≥ 5 reads. We determined that a CT value of 30 will lead to an estimate of 50% of the genome covered by ≥ 5 reads (Fig. 2a). Coverage results are unpredictable below this limit, however, as with PM_THA_001, sequence data may be usable below this limit. When using percentage parasitaemia, we verified that all sequence data from parasite densities higher than 0.01% (400 parasites/ul) led to > 50% of the genome covered by five or more reads; this is a lower limit than previously defined for P. falciparum, and P. knowlesi 32,37 (Fig. 2b). For difficult samples with lower parasitaemia it is possible to improve genome coverage by performing independent SWGA reactions and by increasing sequence data, as observed previously for P. vivax 31 , and also demonstrated here for PM_THA_009, for which merging data lead to > 50% of genome covered with at least 5 reads (S2 Table).
Determining and excluding hypervariable regions. Many Plasmodium species are known to contain large regions of repetitive sequences within the subtelomeres, which is exaggerated in the case of P. malariae, leading to an enlarged genome in comparison to other species 27 . We defined the core genome by both excluding regions with > 2.25 SNPs on average per 5 kb window (S3 Fig.) or containing Pm-fam genes (Fig. 3, S4 Fig., core genome coordinates are listed in S3 Table), to leave a total core genome size of 23,960,057 bases (81% of the total PmUG01 reference genome).
Genetic diversity and population structure. We investigated the multiplicity of infection (MOI) in all samples using the core genome, initially through determining the proportion of SNPs that were heterozygous, alongside running estMOI 38 for each sample which calculates the percentage of the genome that supports a MOI of 1 (S2 Table, S5 Fig.). The samples were P. malariae mono-infections, that is, where no other Plasmodium species were detected by qPCR. However, it is possible that > 1 clone of P malariae is present in a sample i.e. polyclonal. Using this sample set, three isolates displayed evidence of polyclonal infections (PM_LBR_002, PM_ UGA_007 and PM_THA_012). This observation was confirmed by assessing the minor allele frequency (MAF) distribution of these isolates, where they presented with a higher proportion of SNPs with a non-reference MAF www.nature.com/scientificreports/ in the range 0.2 to 0.8 (S6 Fig.). For these three isolates only the major allele strain in each isolate was used in further population genetics analysis. A total of 868,476 genome-wide SNPs were found within the 18 high quality samples (average of 48,249 SNPs per sample), of which 194,709 were unique. However, as with other Plasmodium spp., the subtelomeric region of the P. malariae genome contains large sections of repetitive DNA sequence 27 . These regions are problematic when interpreting WGS data from short-read technologies such as Illumina as short reads are likely to be aligned to incorrect regions along the reference genome, leading to deceptively high coverage and number of SNPs.
After removing hypervariable regions, we analysed the core genome (see S3 Table for coordinates) of 18 samples (≥ 40% of the genome covered by ≥ 5 reads) and identified 29,899 unique SNPs (mean: 5,810 ± 2,229 SNPs per sample) for downstream population genetic analysis. We found that geographically proximal samples displayed less pairwise diversity than geographically separated samples, with parasites from Thailand appearing more closely related to each other than to parasites obtained from Africa. Nucleotide diversities (π) > 3 × 10 −4 nucleotide differences per site are only seen when comparing samples between Thailand and Africa, and π < 2 × 10 −4 was only seen when comparing samples within Thailand or Africa (S4 Table).
A maximum-likelihood tree was constructed using core genome SNP data and demonstrates clear regional separation of P. malariae parasites, with samples from the African continent clustering together, and independently from samples originating in Thailand (Fig. 4).
Genetic variation in in orthologs of known P. falciparum genes associated with drug resistance. P. malariae parasites are commonly subject to antimalarial treatments, therefore we investigated the coverage and prevalence of mutations in orthologs of known P. falciparum genes associated with drug resistance (Pfcrt, Pfdhfr, Pfdhps, Pfk13 and Pfmdr1; gene IDs are in S5 Table). SNPs were only found in Pmdhfr (n = 3; 2 non-synonymous), Pmdhps (n = 5; 1 non-synonymous) and Pmmdr1 (n = 4, 2 non-synonymous) (Fig. 5, Table 1). SNPs within Pmdhfr at positions 1,292,026 and 1,292,193 in chromosome 5 appear to be more common globally than other SNPs, whereas SNPs within Pmdhps and Pmmdr1 appear to be more prevalent in Thailand than Africa ( Table 2). All of the non-synonymous mutations found within Pmdhfr led to amino acid alterations (F57L, R58S and N114S) at positions that align with known drug-resistance associated positions within the Pfdhfr ortholog (C59R and S108N respectively) upon amino acid allignment (Table 1, S7 Fig.) 43 . In addition,  www.nature.com/scientificreports/ the mutation at position 527,528 within Pmmdr1 (chromosome 10), which leads to the amino acid substitution L1063F, aligns in close proximity to N1042D in the Pfmdr1 ortholog that is associated with quinine resistance, and increased mefloquine and artemisinin susceptibility ( Mixed infections. P. malariae parasites are commonly found in mixed infections with other Plasmodium spp. 911,12 . This provides a further obstacle for WGS, as not only is the human genome a potential contaminant, but also the other Plasmodium species present. We used four further unprocessed clinical blood samples from Thailand which were found to be mixed infections after qPCR 36 and underwent SWGA to determine whether   Table). However, when DNA from other species is present at high concentrations, SWGA may not be effective for amplification of P. malariae (S6 Table).

Discussion
P. malariae is a neglected malaria parasite with unique features, such as a longer quartan cycle and the ability to persist in the human host for years or decades 13 . Genetic investigation of this parasite may allow us to understand how P. malariae is able to cause chronic infections, why there are accounts of P. malariae parasites persisting after treatment with ACT, and why some P. malariae infections lead to severe outcomes whilst others remain asymptomatic. Malaria parasite genomics can provide important biological insights to understand this disease, but the difficulty of obtaining sufficient parasite DNA for WGS has been a challenge for genomic studies of P. malariae.
Here we present the first application of SWGA for this species. We have customized the SWGA approach to successfully amplify P. malariae DNA extracted directly from unprocessed blood from clinical samples which were obtained from six different countries. In agreement with others 31,32 , we have demonstrated that the parasitaemia affects the efficiency of SWGA, and recommend using samples with a percentage parasitaemia > 0.01%, which is a lower threshold than reported for other species 31,32,37 . The WGS data generated from SWGA-treated samples is of high quality with good overall coverage, leading to an average of 67.4% (± 15%) of the genome covered by ≥ 5 reads between the 18 samples assessed in this study. Using these samples, we were able to identify 868,476 total SNPs (average 48,249 SNPs per sample), filtered to 104,583 total SNPs after exclusion of hypervariable regions (average of 5,810 SNPs per sample). This is lower than SNP prevalence documented in P. knowlesi (115,995 SNPs per sample including hypervariable regions) 37 , yet higher than SNPs found in P. vivax (14,463 SNPs per sample before filtering for core genome) after SWGA 31 .
It is important to note that differences in the number of SNPs per sample reported could also be due to differences in the method used for variant calling.
A maximum likelihood tree based on SNP data revealed geographic clusters, with clear separation of African and Asian samples. This geographical clustering is consistent with data for P. falciparum 45 and P. vivax parasites 24,45,46 . Similar geographic clustering was observed in the phylogenetic analysis of SNPs in the circumsporozoite gene from P. malariae isolates from Africa and Asia 47 . To improve geographical clustering resolution (i.e. by country), the number of samples investigated needs to be increased. Our data suggests that parasites display isolation by distance, therefore country or multi-country regional analysis of P. malariae populations could be used in future studies to identify regions under selection in different populations.
We further demonstrate that SWGA successfully amplifies genes orthologous to those associated with drug resistance in P. falciparum, and identify SNPs in Pmdhfr, Pmdhps and Pmmdr1. The effects of these SNPs are unknown, and to date, there are no characterised molecular markers of drug resistance in P. malariae parasites, even though treatment failures have been reported 19,48 . Despite this, potential mutations of interest were  www.nature.com/scientificreports/ found, particularly at positions 1,292,023, 1,292,026 and 1,292,193 in chromosome 5 in the Pmdhfr gene. These mutations lead to amino acid substitutions F57L, R58S and N114S respectively, and align almost perfectly with P. falciparum amino acid substitutions C59R and S108N which are associated with reduced susceptibility to sulfadoxine/pyrimethamine 49 . The nonsynonymous mutation N114S has been previously reported in two P. malariae samples from Thailand and the F57L and R58L mutations have been reported in P. vivax samples from several geographical regions 50,51 . In addition, one mutation within Pmmdr1 at position 525,728 in chromosome 10 leads to amino acid substitution L1063F, which aligns with close proximity to N1042 in the Pfmdr1 ortholog, associated with reduced susceptibility to quinine and increased susceptibility to mefloquine, halofantrine and artemisinin 44 . It is important to note that whilst treatment failures are seen with P. malariae infections, it is not clear whether this is due to mutations within the parasite genome leading to reduced drug efficacy, or perhaps a specific phenotype of this species due to the longer parasite life cycle which may reduce drug absorption 48 ; therefore further functional studies are required to determine the effect, if any, of these substitutions. The subtelomeres, containing the fam-l and fam-m gene families are of great interest when studying P. malariae, as they are unique to this species and are thought to be involved in host-parasite interactions 26 . Unfortunately, sequence analysis of these regions is notoriously difficult using short-read technologies, therefore longer-read sequencing will be needed to further investigate these regions.
In conclusion, the SWGA approach offers a fast, cost effective way to explore the genome diversity of P. malariae from unprocessed blood of infected individuals. Further studies should consider the analysis of a larger number of samples from a greater geographical range and different clinical outcomes, in addition to studies investigating the subtelomeric regions with long read technologies. Such studies are necessary to characterize the epidemiology and genetic diversity of P. malariae populations, with the potential to provide biological insights for disease control. Sample collection and processing. This project used nine P. malariae DNA samples extracted from unprocessed venous blood from infected individuals in Thailand. Parasite density (parasites/µl) determined by microscopy was available for these isolates. Genomic DNA was extracted from frozen unprocessed blood using the QIAamp DNA Blood Mini Kit (Qiagen) or the QIAsymphony DSP DNA Kit in combination with a QIAsymphony SP instrument (Qiagen), according to manufacturer's instructions. As microscopy is prone to human errors, all extracted DNA samples were subject to qPCR as outlined by Shokoples et al. 36 to ensure that only P. malariae single species infections were used.

Methods
A further ten DNA samples were provided by the Public Health England-Malaria Reference Laboratory (PHE-MRL) at the London School of Hygiene and Tropical Medicine (LSHTM). These samples were sourced from individuals who had reported recent travel to only one country with malaria transmission, including: Kenya (n = 2), Liberia (n = 2), Sierra Leone (n = 2), Sudan (n = 1) and Uganda (n = 3) between 2010 and 2017. PHE-MRL samples are commonly sourced from individuals returning to visit relatives in their original native country. For species identification, PHE-MRL samples perform both a nested PCR 52 and qPCR 36 and are archived according to the species present.
Total DNA concentration for all samples was quantified using a Qubit v2.0 fluorometer (Thermo Fisher Scientific).
Selective whole genome amplification. The swga program (www.githu b.com/eclar ke/swga) was used to identify primers that preferentially amplify the P. malariae genome 35 , using its reference genome (PmUG01, https ://plasm odb.org) as the target (foreground), and the human genome (GRCh37; https ://grch3 7.ensem bl.org/) as the background. The swga program ranks primers dependant on the ratio of foreground genome binding to the background genome binding, combined with the evenness of primer binding along the target genome and generates multiple potential primer sets. The five highest-ranked sets consist of combinations of 4 to 6 oligonucleotides each, with overlapping primers. The set that ranked highest (Pmset1) consisted of five primers: TAT GTA TA*T*T, TTA TTC *G*T, TTC GTT *A*T, TTT TTA *C*G, TAT TTC *G*T, that were ordered with a phosphorothioate bond (represented by *) modifications to prevent primer degradation by the exonuclease activity of the Phi29 polymerase. To evaluate the efficacy of Pmset1 for SWGA of the P. malariae genome, we tested two samples (PM_THA_001 and PM_THA_002) and sequenced both before and after SWGA.
DNA samples were subject to SWGA following previously published protocols 31,32,37 . All SWGA reactions were carried out in a UV Cabinet for PCR Operations (UV-B-AR, Grant-Bio) to eliminate potential contamination. Briefly, a maximum of 60 ng of gDNA (minimum of 5 ng) was added to a total 50 µl reaction alongside 5 µl of 10 × Phi29 DNA Polymerase Reaction Buffer (New England BioLabs), 0.5 µl of Purified 100 × BSA (New England BioLabs), 0.5 µl of 250 µM Primer mix, 5 µl 10 mM dNTP (Roche), 30 units Phi29 DNA Polymerase (New England BioLabs) and Nuclease-Free Water (Ambion, The RNA Company) to reach a final reaction volume of 50 µl. The reaction was carried out on a thermocycler with the following step-down program: 5 min at 35  Sequence data analysis. Raw fastq files were trimmed using trimmomatic set to default parameters 53 , and aligned to the P. malariae UG01 reference genome (PlasmoDB) using bwa-mem software 54 . SNPs were identified using the samtools software suite (samtools.sourceforge.net) 55 and filtered for quality based on previously described methods 56 . The coverage of each nucleotide position was analysed using sambamba 57 , which was set to include only SNPs with coverage levels of at least fivefold. Poor quality samples were removed (< 40% of the genome covered by 5 reads) to leave 18 high quality samples. We used estMOI 38 to determine MOI for samples, and the major allele was used when heterozygous SNP calls were found.
Determining and excluding subtelomeric regions. To exclude hypervariable subtelomeric regions the P. malariae genome was split into 5 kb segments and the average number of SNPs was calculated. We defined an upper limit for the number of SNPs within each window in order to identify highly polymorphic windows. This SNP limit was used in conjunction with the positions of the Pm-fam gene families to define the subtelomeric regions of each chromosome and exclude these from downstream analysis.

Population genetics.
To investigate the population structure of P. malariae parasites, a distance matrix was created which was based on a matrix of pairwise identity calculated from the SNPs present in each sample. Using the distance matrix, a maximum likelihood tree was produced using Iqtree 39 with Modelfinder 40 to select the best model of substitution and ultrafast bootstrap analysis 41 . The resulting Newick tree was visualised in iTOL 42 . The nucleotide diversity (π) metric was used to investigate the genetic variability between samples, and was calculated using the pegas (v0.10) package 58 , which defines nucleotide diversity as the average number of SNPs per position between two sequences.
Drug resistance orthologs. Orthologs of known genes involved in drug resistance in P. falciparum were analysed. The SNPs were described using the snpEff software 59 which annotates the genes affected, the type of mutation, and if non-synonymous, the amino-acid change that has occurred. The coverage of genes of interest was also analysed using the output file from applying sambamba software 57 . The genes investigated and their respective IDs are summarised in S5 Table.

Data availability
All raw sequence data is listed in the European Nucleotide Archive (study accession number PRJEB33837).