Improved 18S and 28S rDNA primer sets for NGS-based parasite detection

The development and application of next-generation sequencing (NGS) have enabled comprehensive analyses of the microbial community through extensive parallel sequencing. Current analyses of the eukaryotic microbial community are primarily based on polymerase chain reaction amplification of 18S rRNA gene (rDNA) fragments. We found that widely-used 18S rDNA primers can amplify numerous stretches of the bacterial 16S rRNA gene, preventing the high-throughput detection of rare eukaryotic species, particularly in bacteria-rich samples such as faecal material. In this study, we employed in silico and NGS-based analyses of faecal samples to evaluated the existing primers targeting eukaryotic 18S and 28S rDNA in terms of avoiding bacterial read contamination and improving taxonomic coverage for eukaryotes, with a particular emphasis on parasite taxa. Our findings revealed that newly selected primer sets could achieve these objectives, representing an alternative strategy for NGS.


Results
In silico screening. For NGS-based analyses of eukaryote diversity, previous studies have mostly used 1391F/ EukBr, as recommended by EMP 10,[15][16][17] , or 563F/1132R [18][19][20][21][22] , which targets the V9 or V4-5 regions of 18S rDNA, respectively. In this article, these two primer sets are referred to as 'conventional primer sets' and used as comparators. To identify primer pairs that can efficiently detect a wide variety of parasites while avoiding bacterial DNA amplification for use in NGS-based parasite detection, we screened all possible 18S and 28S rDNA primers. Some previous studies have extensively tested 18S rDNA primers in silico to design universal eukaryotic primers to be used as standards for NGS-based analyses of eukaryote diversity 5,6,13 . Therefore, we used these recommended 18S rDNA primer sets to re-evaluate for detection of parasitic taxa groups (Table 1; Supplementary Table S1). For 28S rDNA, we retrieved the possible universal primers (n = 52) (Supplementary Table S1) from previous reports 13,23,24 and screened them based on their melting temperatures (Tm) and amplicon sizes (Materials and Methods), yielding 13 primer pairs ( Table 2).
We then evaluated those primer pairs based on their sequence identity with eukaryotic parasites, fungi and bacteria using the SILVA non-redundant sequence dataset (Tables 1 and 2). The EMP primer set (1391F/EukBr) can be used to detect a wide variety of parasitic taxa, exhibiting 43.4-66.7% coverage in majority of the tested taxa, excluding Nematoda, Platyhelminthes, Longamoebia and Fungi. The other conventional primer set 563F/1132R exhibited higher coverage than the EMP primer set (≥87.4% excluding Haemosporodia). However, these two primer sets showed similarity with bacterial 16S rDNA sequences (13.0% and 89.9%, respectively). Although the other 18S primer sets demonstrated lower taxonomic coverage for eukaryotes than 563F/1132R, they appeared to amplify less bacterial rDNA. For instance, 574F/952R showed low coverage for Nematoda, Fornicata and Parabasalia. Moreover, 574F/952R, 574*F/952R and 1183F/1631R showed low coverage for Acanthocephala, while 1183F/1631R showed low coverage for Parabasalia, Haemosporidia and Entamoebida, despite their low similarities to bacterial sequences (coverage <0.1%). These tendencies were also observed when we used strict or mild parameters for taxonomic coverage evaluations (Tables S2 and S3). Based on these results, 616*F/1132R and 1183F/1631R were selected for further evaluation as the best primer sets for the 18S V4-5 and V7-8 regions, respectively. . TestPrime computes coverages for each taxonomic group by running in silico PCR on the SILVA database via sorting database sequences into "match", "mismatch" and "nodata (sequences not covering the primer match position)". The frequencies of "match" sequences among "match" and "mismatch" sequences are shown as percentages with the sequence numbers in parentheses. a Numbers in parentheses show the numbers of sequences available in the SILVA database. Please note these numbers are not always the denominators because of the presence of "nodata" sequences. b Only 17 bases from 3′ was used for the primer EukBr because many sequences in the SILVA database lacks the corresponding 5′ region. c Target variable regions and amplicon sizes based on the S. cerevisiae rRNA gene (NC_001144) are shown below the primer names. d Primer match to humans and mice is indicated by superscripts h and m on the values, respectively.
Data for the 28S rDNA primers are summarised in Tables 2, S2 and S3. All 28S primer sets showed low similarity with bacterial sequences. The taxonomic coverage for some eukaryotic parasites was variable, especially for some protozoan groups including Haemosporidia, Fornicata, Parabasalia and Entamoebida. On the other hand, the coverage for the other parasitic taxa were not very different, although the primer sets designed for the D3-D5 regions showed low coverage for Nematoda, Platyhelminthes and Discicristata. Among the five primer sets designed based on the D8-D9 regions, GA20F/RM8R, RM7F/RM9R and GA20F/RM9R exhibited wide taxonomic coverage except for Entamoebida. Based on these results, we selected seven 28S primer sets, namely DM568F/RM2R, RM2F/RM3R, RM3F/RM4R, GA12F/RM4R, GA20F/RM7R, GA20F/RM8R and GA20F/ RM9R, for further evaluation.     Table 2. List of primer sets targeting the 28S rRNA gene and their coverages in 14 taxonomic groups. Primer sets were tested for matches to sequences in the SILVA database (v.132) using TestPrime under the following parameters: maximum number of mismatches of four bases and length of 0-mismatch at the 3′ end of three bases). TestPrime computes coverages for each taxonomic group by running in silico PCR on the SILVA database via sorting database sequences into "match", "mismatch" and "nodata (sequences not covering the primer match position)". The frequencies of "match" sequences among "match" and "mismatch" sequences are shown as percentages with the sequence numbers in parentheses. a Numbers in parentheses show the numbers of sequences available in the SILVA database. Please note these numbers are not always the denominators because of the presence of "nodata" sequences. b Target variable regions and amplicon sizes based on the S. cerevisiae rRNA gene (NC_001144) are shown below the primer names. c Primer match to humans and mice is indicated by superscripts h and m on the values, respectively. (2019) 9:15789 | https://doi.org/10.1038/s41598-019-52422-z www.nature.com/scientificreports www.nature.com/scientificreports/ All the 18S and 28S primers tested showed high coverages (>60%) for Euteleostomi which includes their possible hosts. In particular, human and mouse DNA are likely to be amplified by all primers tested (Tables 1 and 2). qPCR. To confirm whether the selected primer sets could efficiently amplify eukaryotic rDNA without dimer or hairpin structure generation, we performed qPCR using C. elegans DNA as representative eukaryote DNA because all selected primers displayed 100% sequence similarity with C. elegans rRNA. 18S rDNA from 0.1 ng of C. elegans genomic DNA (final concentration, 0.01 ng/µl), which corresponds to ~200,000 copies of rRNA, was amplified at ~21 cycles (mean Ct ± SD = 21.28 ± 0.71) using the EMP primer set (1391F/EukBr), corresponding to the amplification efficiency of 80-87%. 1183F/1631R and 563F/1132R exhibited similar PCR efficiencies, whereas 616*F/1132R showed lower efficiency ( Supplementary Fig. S1A). To assess the avoidance of bacterial DNA amplification, we used a bacterial DNA mixture. Amplification from 0.1 ng of bacterial DNA (final concentration of 0.01 ng/µl), which corresponds to ~20000 copies of rRNA, required ~27 cycles (mean Ct ± SD = 27.10 ± 0.81) using the EMP primer set. Ct difference between eukaryotic and bacterial DNA was the largest for 1183F/1631R, followed by 616*F/1132R, whereas this difference was the smallest for the EMP primer set. The results for 28S rDNA primer sets are shown in Supplementary Fig. S1B. Amplification efficiencies of the 28S primer sets for C. elegans DNA exceeded 70% except for that of RM3F/RM4R, which required 12-16 additional cycles for amplification. All 28S primers demonstrated lower sensitivity to bacterial DNA and required more than 10 additional PCR cycles compared with the EMP primer set.

Representative species
The detection limits of C. elegans DNA using the primer sets were 0.2-2 pg, corresponding to 1-10 C. elegans cells. Products were detected for no-template negative controls at approximately 30 cycles using RM2F/RM3R compared with approximately 35 cycles using the EMP primer sets. No non-specific amplification was detected with 40 cycles using the other primer sets.
Deep sequencing. Next, we performed MiSeq analysis of 18S or 28S rDNA amplicons using the two conventional and six newly selected primer sets. We used DNA extracted from the faeces of wild rats and a domesticated bovid as templates, which were anticipated to be highly rich in bacteria. Our previous morphological observations have revealed that five rats (i.e., WR4-8) were heavily infected with parasitic nematodes, while one rat (ZR4) was infected with tapeworms 4 . In contrast, the bovine sample (MB1) was rich in protozoan parasites. In total, 1,311,788 high-quality reads, with a mean of 23,425 reads per test (samples × primers), were obtained via Illumina MiSeq (Table S4).
Numbers of operational taxonomic units (OTUs) detected using the EMP primer set ranged from 150 to 400 (Table S4). However, approximately half of the OTUs were assigned to either Bacteria or Archaea. Although the number of OTUs detected using 563F/1132R was the highest for each DNA sample (>1200 OTUs), after removing the bacteria and archaea reads, this number became the lowest among all primer sets. Other primer sets detected few or no bacterial OTUs, and the eukaryotic OTUs ranged from 40 to 500. Among these, RM2F/RM3R detected the lowest number of OTUs in the six rat samples, whereas GA12/RM4R detected the lowest numbers in the bovine sample. Overall, the six newly selected primer sets more readily avoided bacterial DNA amplification than the conventional primer sets. Finer classifications after removing the bacteria and archaea reads are shown in Figs 2-4.  ited similar taxon distribution patterns in WR4, although small proportional differences were noted (Fig. 2). Many reads (46-91%) were assigned to the phylum Nematoda and some were assigned to the phylum Chordata and sub-phylum Saccharomycotina using all primer sets. At SILVA level 10, Nematoda reads were further classified to the family or genus level. Using the three 18S primer sets (i.e. EMP, 563F/1132R and 616*F/1132R), many (>85%) Nematoda reads was assigned to 'Rhabditida; Ambiguous' . Conversely, using 1183F/1631R, the proportion of ambiguous taxa became smaller and more reads were assigned to genera such as Strongyloides and Ancylostoma. Using the 28S primer sets, no 'Rhabditida; Ambiguous' reads were detected and all Nematoda reads were subdivided into genera, including Heligmosomoides, Nippostrongylus and Strongyloides; this trend was similar to the nematode taxon distribution observed in our previous morphological identification 4 . Similar results were noted in other WR samples (i.e. WR5, WR7 and WR8), although differences in minor taxon distributions were observed (Supplementary Tables S7 and S8). In WR6, Eimeriorina was detected in addition to Nematoda, Chordata and Saccharomycotina using all primer sets (Fig. 2). At SILVA level 10, the Eimeriorina reads were further classified into Eimeria or 'Eimeriorina; Ambiguous' taxa.
Tapeworm-infected samples. ZR4 harboured Hymenolepis tapeworms in its intestine. At SILVA level 7, all 18S primer sets detected high proportions of Platyhelminthes as well as Saccharomycotina and Trichomonas (or 'Trichomonas; Ambiguous') ( Fig. 3; Supplementary Table S7). The three 28S primer sets did not detect Trichomonas reads, while GA20F/RM9R detected few Trichomonas reads (approximately 0.1%); therefore, Saccharomycotina and Platyhelminthes occupied higher proportions of the total reads using the 28S primer sets than using the 18S primer sets. Chordata reads were detected by all 18S primers and two 28S primer sets (i.e., DM568F/RM2R and RM2F/RM3R). Mastotermes was detected only by the GA20F/RM9R primer set. At SILVA level 10, Platyhelminthes reads were further classified to the order Cyclophyllidea using the 18S primer sets and to the genus Hymenolepis using the 28S primer sets. Saccharomycotina was further classified to the order Saccharomycetales using the 18S primer sets and to the genera Saccharomyces and Kazachstania using the 28S primer sets.
Protozoa-rich samples. Various protozoa occur in the bovine gastrointestinal tract; thus, protozoal cysts are frequently detected in faecal samples. Most of these protozoa form a part of the normal ruminal microflora called ciliated protozoa 25,26 ; however, some of these, such as Eimeria (Coccidia), Cryptosporidium, Giardia,  Figure 3. Taxonomic classification of eukaryotic reads in the faecal samples of a tapeworm-infected rat (ZR4) at SILVA levels 7 and 10. Circles from the inside show taxonomic distributions using the EMP (1391F/EukBr), 563F/1132R, 616*F/1132R, 1183F/1631R, DM568F/RM2F, RM2F/RM3R, GA12F/RM4R and GA20F/RM9R primer sets, respectively. Only taxa with ≥5% of the total non-bacterial reads are shown in the plots, and taxa with <5% are summarised as 'Others' . Sequence reads without any taxonomic assignments because of low similarity with the database are shown as 'Unassigned' .

Beta diversity analyses. A technical replicate experiment was performed from PCR amplification to
MiSeq independently from the first experiment using the newly selected primer sets (Dataset 2; Supplementary Tables S9-S12). The dendrograms of cluster analysis based on the Bray-Curtis dissimilarity of taxon abundance  www.nature.com/scientificreports www.nature.com/scientificreports/ from the two replicate experiments are shown in Fig. 5A. All replicates (Datasets 1-2) in MB1 and ZR4 were clustered together in the dendrogram, suggesting high reproducibility of the methods using these primer sets (Fig. 5A). For WR samples, although the replicates were largely clustered together, some technical replicates were nested within the other DNA samples (e.g. WR4 with WR7 and WR5 with WR8), perhaps because those samples showed very similar taxonomic compositions. Principal coordinates analysis (PCoA) plots were generated for www.nature.com/scientificreports www.nature.com/scientificreports/ ZR4, MB1 and WR samples (Fig. 5B-D, respectively). In the three plots, PCoA1 separated the samples based on the PCR target regions (18S or 28S), although the separation in the WR plot, which contained five DNA samples, was not as obvious as that in the other plots. Among the 28S primer sets, RM2F/RM3R and GA12F/RM4R were clustered together in all the plots, whereas DM568F/RM2R and GA20F/RM4R were clustered together in the ZR and WR plots but not in the MB plot. Among the 18S primer sets, 563F/1132R and 616*/F1132R were clustered together in all the plots. These results correspond to the target regions of 18S or 28S (Tables 1 and 2).

Discussion
Bacterial read contamination of PCR amplicons often poses a critical problem in NGS-based analyses of eukaryotic diversity or diagnoses. In extreme cases, as with MiSeq of a bovine faecal sample in the present study, over 95% of the total sequence reads can be derived from bacterial DNA, making it difficult to detect rare eukaryotes in the samples. Increasing data acquisition may resolve this issue; however, presence of raw data with only one or two orders of magnitude than non-contaminated cases is inefficient and therefore prevents high-throughput analyses. The primer sets newly screened in this study, which can efficiently amplify rDNA from a wide range of eukaryotes without bacterial DNA amplification, are anticipated to be suitable tools for diversity analyses of eukaryotic microbes, including parasites.
At the same time, we noted that each primer set could not detect a specific taxonomy groups. According to our deep sequencing analysis, two of the 28S primer sets could not to detect Trichomonas species, which were detected by all other 18S primer sets. Spironucleus reads were detected from rat faeces using only one 28S primer set (GA20F/RM9R). Entamoeba could not be detected using the EMP and GA20F/RM9R primer sets. In addition, the results of in silico analysis suggested that one primer set is unlikely to cover all taxonomic groups of parasites. For instances, Plasmodium spp., one of the most medically important parasites, was difficult to detect using any of the tested 18S rDNA primer sets, although it may be detected using the two 28S rDNA primer sets. Trypanosoma and Leishmania, two other important parasitic genera, could be detected only using 563F/1132R and 1183F/1631R among the tested primers. Collectively, these results suggest the importance of selecting primer sets according to the study objective.
To achieve fine taxonomic resolution, long sequences containing sufficient diversity to distinguish closely related species are essential. Although sequencing technologies capable of producing long sequences, such as PacBio and NanoPore, are available 28,29 , these remain impractical for rDNA-based microbiome analyses because of their higher error rates and lower throughputs than those of Illumina sequencing. Therefore, many studies have used Illumina sequencing, for which the maximum length is 600 bp (300-bp paired-end). Although we used variable regions of 18S rDNA with fragment lengths ranging from 150 to 570 for taxonomic classification, we were unable to further assign the reads to the genus or species level in most cases. On the contrary, reads of 28S rDNA, which has higher sequence diversity than 18S rDNA 13 , could sometimes be further assigned to the genus level, suggesting that 28S rDNA represents a good option for studies in which finer classification is necessary. One of the challenges in 28S rDNA-based population analyses is the enlargement of the database because database sizes affect fine taxonomic classification. The current database (SILVA r132) contains 198,843 28S rDNA sequences compared with 695,171 18S rDNA sequences (https://www.arb-silva.de/). In addition, we discarded primer sets with amplicon sizes that were out of range even though they demonstrated good taxonomic coverages (Supplementary Table S13). These primers can be used as alternates if they are capable of amplifying sequences to meet the length requirement.
Host DNA contamination did not hamper analyses in this study. Small proportions of mammalian (Chordata) reads were detected with any combination of samples and primers. This is probably because the faecal samples used in this study were collected from wild animal and contained high number of eukaryotic microbes. However, our in-silico analysis revealed that all the tested primer sets theoretically cannot avoid amplification of host DNA. Therefore, when samples are expected to have small amounts of eukaryotic microbes, such as clinical samples from human or samples from well-kept pets, PCR blockers may be required, which prevent host DNA amplification [30][31][32] . Applying taxon-specific primers is an alternative option to avoid amplification of host DNA. Recently, Cannon et al. 33 proposed a high-throughput method to detect a wide range of parasites by a combination of multiple taxon-specific primers. We tested those primers using our evaluation criteria and confirmed that those primer sets amplify each targeted taxa and can avoid host and bacterial DNA amplification (Table S14). Although this strategy requires optimisation for multiplex PCR (amplification of multiple targets in a single PCR) for high throughput studies and may require a reasonable normalisation method for amplification bias by each primer set for a reliable estimation of taxa distribution in a sample, the assay still has an advantages in customizability to easily include additional targeted taxa 33 . Therefore, the primer sets selected in this study can be added to the multiplex assay, which could achieve more comprehensive "parasitome" analyses.
The benefits and drawbacks of the newly selected primer sets and conventional primers are summarised in Table 3. First, the newly selected primer sets could avoid bacterial DNA amplification. However, taxonomic coverage differed with each primer set. Ultimately, the primer sets should be selected according to the study objectives, taking the parasites that need to be covered and the required resolution into account. However, we recommend the use of 616*/F1132R for 18S rDNA or DM568F/RM2R for 28S rDNA, or a combination of those, as new standard primer sets for parasite detection because these provide wide taxonomic coverage of parasitic eukaryotes with minimal bacterial DNA contamination. www.nature.com/scientificreports www.nature.com/scientificreports/ with eukaryote and bacterial rDNA sequences using TestPrime 1.0 and the SILVA 132 database under the following parameters: maximum number of mismatch = 4 bp and the length of 0-mismatch zone at the 3′ end = 3 bp). We used the non-redundant reference dataset (Ref NR) build by a dereplication of the full reference set using a 99% identity criterion and were suggested by SILVA to be used as a representative dataset for classification, phylogenetic analysis and probe design. DNA samples. A bacterial DNA mixture was prepared by combining 70 ng DNA extracted from pure cultures of seven bacterial species (Escherichia coli, Enterobacter sp., Serratia sp., Bacillus subtilis, Klebsiella pneumoniae, Group A Streptococcus and Staphylococcus epidermidis) using a QIAmp DNA Mini Kit (Qiagen). C. elegans DNA was extracted from approximately 10,000 worms using the same kit.

SSU
For MiSeq analyses, DNA extracted from the faeces of rats caught in the Miyazaki City Phoenix Zoo (ZR, Rattus rattus) or in Miyazaki downtown (WR, Rattus norvegicus) in our previous study 4 were used. Faecal samples from a domesticated bovid (MB, Bos taurus) were provided by the veterinary parasitology lab of the University of Miyazaki, and DNA was extracted using a Maxwell RSC Purefood GMO Kit (Promega), as described previously 34 . qPCR. qPCR was performed to test the amplification efficiency of each primer set using C. elegans DNA or the bacterial DNA mixture as a template. Reactions were performed in triplicates using a StepOnePlus Real-Time PCR System (Applied Biosystems) under the following conditions: 95 °C for 10 min, followed by 40 cycles of 95 °C for 15 s, 50 °C for 30 s and 60 °C for 1 min (for 18S rDNA amplification), or 95 °C for 10 min, followed by 40 cycles of 95 °C for 15 s and 60 °C for 1 min (for 28S rDNA amplification). The reaction volume was 10 μl, including 5 μl of the Power SYBR Green PCR Master Mix (2x), 0.9 μM of each primer and 1 μl of DNA solution. To calculate the PCR efficiencies and detection limits, serial 10-fold dilutions of C. elegans DNA (1 ng to 0.01 pg) were used as templates.
MiSeq sequencing. PCR was performed using Tks Gflex DNA Polymerase (Takara), and a 30-µl reaction mixture containing 1 µl of template DNA (1-3 ng of DNA), 15 µl of 2 × Gflex buffer, 0.5 µl each of the forward/ reverse primers with the Illumina MiSeq Adapter (10 µM final concentration), 0.5 µl (100 U) of DNA polymerase and 13 µl of nuclease-free H 2 O. Reactions were performed using Veriti Thermal Cycler (Applied Biosystems) under the following conditions: 95 °C for 1 min, followed by 35 cycles of 95 °C for 15 s, 60 °C (for 28S rDNA amplification) or 50 °C (18S rDNA amplification) for 1 min and 68 °C for 1 min. Duplicate PCRs were performed independently, and the produced materials were then mixed. The PCR products were confirmed via agarose gel electrophoresis and purified using AMpure XP beads (Beckman Coulter). Index PCR was performed to attach dual indices and Illumina sequencing adapters to the first PCR products using the Nextera XT Index Kit (Illumina) and KAPA HiFi HotStart Ready Mix (Kapa Biosystems) under the following conditions: 95 °C for 3 min, followed by 8 cycles of 95 °C for 30 s, 55 °C for 30 s and 72 °C for 30 s, and the final extension at 72 °C for 5 min. The PCR product was cleaned using AMpure XP beads, pooled at equal concentrations and then sequenced using the MiSeq Reagent Nano Kit v3 (600 cycles) according to the manufacturer's protocol (http:// icom.illumina.com/) to produce 300-bp paired-end reads.
Similarity in taxa composition and the relative abundance were analysed via PCoA and hierarchical cluster analyses using the Bray-Curtis similarity index with R vegan package 37 .

Data availability
The sequencing data have been deposited to the DNA Data Bank of Japan Sequence Read Archive under the BioProject PRJDB3050.