Introduction

Naegleria are free living protists belonging to the family of Vahlkampfiidae and occur in fresh warm water sources and soil worldwide. Members of the genus Naegleria show a flexible phenotype. Under favourable conditions they exhibit an amoeboid form while formation of fast-moving flagellates or resting cysts is observed in nutrient sparse conditions or during dry periods. Since the first description of N. gruberi in 1899, there have more than 40 species been isolated and characterized so far. Despite the huge diversity, only Naegleria fowleri is known for its ability to infect humans1,2. When contaminated water enters the nose, for example during swimming or ritual nose rinsing, N. fowleri overcomes the host’s extracellular matrix (ECM) and follows the olfactory nerves to the brain by crossing the cribriform plate. There the amoeba multiplies, destroys nerve cells and causes primary amoebic meningoencephalitis (PAM) with a mostly fatal outcome3,4,5. Due to the fast progression of the disease, diagnosis is usually made post-mortem by microscopic examination of the cerebral spinal fluid (CSF) or by quantitative PCR6. Treatment options are still limited, where the most promising therapy is medication with antibiotics and antifungals such as Amphotericin B, Fluconazole, Rifampicin and Miltefosine7. Pathogenicity factors have been investigated for several years but the mechanisms of infection and onset of the disease are still poorly understood. Secretion of proteases and hydrolysing enzymes is known as an important factor during pathogenesis in various eukaryotic parasites. In general, they are involved in the degradation of extracellular matrix proteins, lysis of host cells or invasion of the host cells by intracellular parasites8,9,10. Several studies highlight the importance of proteins with hydrolysing activity in N. fowleri. By analysing N. fowleri conditioned medium, different cysteine proteases with similarity to cathepsin B have been identified which are most likely involved in the degradation of the ECM and the invasion of the blood-brain barrier11,12,13,14,15. In addition, the function of metalloproteinases16, phospholipases17, elastases18 and glucosidases19 during invasion of the central nervous system (CNS) have been discussed. Furthermore, secreted proteins are potential drug targets and a recent study examined cysteine protease inhibitors as potential drug against PAM20. Although numerous studies describe proteolytic enzymes based on their biochemical properties by using enzyme and proteinase inhibitor assays, the diversity of N. fowleri secreted proteins is still sparsely described and the actual genes encoding the factors responsible for the cytotoxic effect remains often unknown. Although the genome of N. fowleri (Isolate ATCC 30863) has been published in 2014, the assembly is highly fragmented including over 1,000 contigs and gene annotation on the genomic sequence is missing21. Therefore, improving of N. fowleri assembly provides the basis for further experiments on a molecular and computational level to unravel the pathogenicity of N. fowleri. In the last few years sequencing technologies markedly improved and long-read sequencing using Oxford Nanopore Technology (ONT) offers a new possibility to resolve highly repetitive regions in genome assemblies, leading to highly contiguous reference sequences22,23. Nevertheless, one of the major drawbacks of long read sequencing methods is their high error rate. Therefore, hybrid assembly approaches incorporating high quality short-reads are often used to improve the accuracy23,24. Recent improvement in sequencing chemistry and base-calling algorithms of ONT lead to an increase in read quality and thus facilitating genome de novo assembly and leading to a consensus accuracy of over 99.8%25. In this study, we successfully applied ONT sequencing to the genome of the human pathogenic amoeba N. fowleri ATCC 30894. The higher contiguity of the N. fowleri genome assembly enables the prediction of genes on the nucleotide level using ab initio and RNA sequencing (RNAseq) based methods and provides a high-quality reference for further downstream experiments. The secretion of proteins with hydrolysing activity has been discussed as important factor in the pathogenesis of N. fowleri. Using deep neural networks implemented in SignalP26 and DeepLoc27, we identified 208 potentially secreted proteins of which 18% are annotated with a hydrolysing function. In free-living N. fowleri they are most likely involved in the lysis of bacterial and eukaryotic microorganism. Given their proteolytic function, these proteins may additionally play a major role in the degradation of ECM proteins and nerve cells and contribute to the pathogenesis of PAM.

Results

Genome assembly

To gain a complete overview of the N. fowleri genome, total DNA of the isolate ATCC 30894 was extracted and sequenced using ONT. Sequencing of the DNA on the GridION system using one flow cell resulted in a total of 1,352,535 base called reads (9 Gb) with a mean read length of 6,658.6 bp and a N50 of 11,677 bp. The nuclear genome of N. fowleri was assembled using the string graph assembler Canu v1.728. To improve consensus accuracy, the initial assembly was polished in two steps using ONT raw signal data by applying Nanopolish v0.11.022 and with high quality Illumina reads by using Pilon v1.2229. The final, curated assembly of the nuclear genome consists of 81 contigs with a total length of 29,549,925 bp, the N50 has a size of 717,491 bp while the L50 is 18. Further, the assembly has a GC content of 36.9% which is similar to the previously published isolate N. fowleri ATCC 30863 assembly and the N. lovaniensis genome, while for N. gruberi a slightly lower GC content of 35% is reported. With a size of 717,491 bp, the N50 is 18 times higher than the N50 of the previously published N. fowleri ATCC 30863 genome and is comparable to N50 of the N. lovaniensis assembly (Table 1).

Table 1 Comparison of sequenced Naegleria genomes.

Assembly quality

As measurement for completeness and quality, the presence of 303 Benchmarking Universal Single-Copy Orthologs (BUSCOs)30 was analysed at the individual assembly steps. BUSCOs only include genes that have evolved as single copy orthologs over a long time period. Duplication or loss of such genes are considered as rare events, therefore presence or absence of BUSCOs provides an overview of assembly accuracy and completeness30. The initial Canu assembly contains 192 complete, 50 fragmented and 61 missing BUSCOs. Polishing using ONT signal level data (Nanopolish) increased the number of complete BUSCOs to 251 while less fragmented (16) and missing (39) BUSCOs are observed. Polishing using high quality Illumina data (Pilon) and manual curation of the assembly further increased the accuracy resulting in 262 complete, 7 fragmented and 34 missing BUSCOs (Fig. 1). Additional rounds of Pilon polishing did not improve the BUSCOs statistics. To benchmark the quality of the ONT assembly, the number of BUSCOs was compared to other previously sequenced Naegleria species. Similar numbers of complete BUSCOs are found for the N. fowleri ATCC 30863 assembly (264), the N. lovaniensis assembly (259) and the N. gruberi genome (257). The number of fragmented BUSCOs of the N. fowleri ATCC 30894 ONT is slightly higher (7) than observed in the previous Illumina assembly of N. fowleri ATCC 30863 (Table 2).

Figure 1
figure 1

Analysis of Benchmarking Universal Single-copy Orthologs (BUSCOs) at different assembly steps. The genome completeness was evaluated by analysing 303 conserved BUSCOs of the Eukaryota odb9 dataset. The initial Canu assembly contains high numbers of fragmented and missing BUSCOs. Polishing using Nanopolish in combination with ONT raw data and highly accurate Illlumina reads by applying Pilon could reduce the number of fragmented and missing BUSCOs. Further, manual curation using MUMmer alignments and duplicated BUSCOs reduced the occurrence of duplicated BUSCOs.

Table 2 Analysis of BUSCOs across different sequenced Naegleria species.

Mitochondrial genome and extrachromosomal sequences

Beside its nuclear genome, N. fowleri possesses a mitochondrial genome and encodes its ribosomal subunits on an extrachromosomal plasmid (rDNA plasmid). Due to the high coverage of those elements relative to the chromosomal sequences, when combined, overall assembly performance is poor. Therefore, mitochondrial reads were quality filtered and downsampled using filtlong v0.2.031 and assembled separately using Canu v1.7 followed by polishing using ONT and Illumina reads. The manually curated and circularized mitochondrial assembly has a size of 49,483 bp and sequence comparison to public available reference sequences of strain V419 (Genbank accession KX580903.1) and V212 (Genbank accession NC_021104.1) revealed a similarity of 98.6%. Additionally, reads of the rDNA plasmid were filtered and assembled in the same manner, resulting in a circular contig with the size of 15,936 bp encoding for the small and large ribosomal subunit.

Genome annotation

First, we annotated the repeats of N. fowleri. De novo repeat annotation and masking using RepeatModeler v1.0.1132 and RepeatMasker v4.0.833 revealed a repeat content of 6%. Simple repeats represent 1.98% of the repetitive sequences, while 1.85% are LINES and 1.12% are classified as DNA elements. A complete list of repetitive elements is shown in Table 3. Second, genes on the nuclear genome of N. fowleri were predicted by an ab initio and RNAseq based approach using BRAKER134 while non-coding and ribosomal RNA sequences were annotated by a similarity approach using Infernal 1.1.235 in combination with the Rfam 12.1 database36. In total 13,925 genes were predicted and 12,009 (86%) encode at least one PFAM protein domain while to 8,434 (61%) could be annotated with a GO term using BLAST2GO37 and to 5,604 (40%) KEGG number could be assigned using eggNOG38. Compared to other Naegleria species, slightly less proteins were identified: for N. gruberi 16,62039 genes are reported and in the N., lovaniensis assembly 15,19540 genes were predicted. Nevertheless, the set of predicted proteins did not show a reduced number complete BUSCOs compared to other Naegleria species (Table 4). In addition to the gene coding regions, Infernal 1.1.2 identified tRNAs, and 5 S rRNAs as well as the spliceosomal RNAs U1-U6 on the nuclear genome.

Tablee 3 Repetitive elements of the N. fowleri genome.
Table 4 BUSCO Analysis of predicted proteins across different Naegleria species.

Genome similarity

To gain an overview of the gene repertoire of the genus Naegleria, proteins of N. fowleri, N. lovaniensis, and N. gruberi were clustered using orthoVenn41. The clustering revealed 8,315 protein clusters shared by all Naegleria species. In total, 2,479 clusters are shared between N. fowleri and N. lovaniensis, while N. fowleri shares only 231 protein clusters with N. gruberi (Fig. 2). To assess the phylogenetic position of the newly sequenced N. fowleri isolate, a phylogenetic tree based on bootstrapping and maximum likelihood was estimated using RAxML v8.2.1142 by considering 978 single copy orthologs of the Naegleria species with public available genomes. As outgroup we choose the more distantly related Trypanosoma brucei. The N. fowleri isolate ATCC 30894 is closely related to the isolate ATCC 30863 and N. lovaniensis, while N. gruberi is more distantly related (Fig. 3).

Figure 2
figure 2

Clustering of predicted Naegleria proteins. To gain an overview of the diversity of Naegleria species, predicted proteins of N. fowleri ATCC 30894, N. lovaniensis and N. gruberi were clustered using OrthoVenn.

Figure 3
figure 3

Maximum likelihood tree of sequenced Naegleria species. To achieve a comprehensive overview of all Naegleria species with available reference genome, a phylogenetic tree was constructed based on maximum likelihood and bootstrapping using RAxML. Evolutionary distances were estimated based on 978 single copy orthologs of all Naegleria species and T. brucei as taxonomic outgroup.

Secreted proteins

The secretion of proteases and other degradative enzymes are discussed as an important factor during pathogenesis of PAM in order to disrupt the extracellular matrix and nerve cells. To identify proteins secreted by the classical secretory pathway we combined SignalP v5.026 to identify signal peptides and Deeploc v1.027 to verify the cellular localization. In total, we identified  208 potentially secreted proteins of which 163 have BLAST similarities to a protein in the Uniref90. Using the BLAST2GO pipeline v5.237 75 proteins could be annotated with a GO term. To gain an overview of the molecular function of the secreted proteins, GO annotation were visualized using WEGO v2.043. Almost 20% of the secreted proteins are annotated with the term hydrolase activity (GO:0016787) while 10% are associated with an ion-, protein-, or lipid binding function (GO:0005488). Other terms are catalytic activity (GO:0140096, 10%), enzyme regulator activity (GO:0030234, 3.8%) or isomerase activity (GO:0016853, 1%) (Fig. 4). BLAST similarities and GO annotations were used to further classify the secreted proteins. Regarding proteins with hydrolysing function, 27 proteases including cysteine and serine protases as well as 21 proteins involved in the degradation of lipids and peptidoglycan such as lipases, phospholipases, endolysin like proteins and beta-xylosidases. Additionally, three proteins with similarities to the autocrine proliferation repressor (aprA, Uniprot accession number Q5XM24) and counting complex (countin-1, Uniprot accession number Q86IV5) of Dictyostelium discoideum were identified, which suggests the ability of sensing and regulating the population density in Naegleria. Additionally, the secretome contains proteins belonging to the Ependymin and Tenascin family as well as different proteinase inhibitors, DnaJ homolog superfamily proteins or ribonucleases. The function of 107 proteins is still unknown. For 41 proteins no BLAST similarity was found and 66 have similarities to predicted or uncharacterized proteins (Fig. 5). A full list of the predicted secreted proteins including their annotation is shown in Supplementary Table S1.

Figure 4
figure 4

GO annotation of secreted proteins. The abundance of the GO terms within the category Molecular Function was visualized using WEGO. Hydrolase activity is the most abundant category followed by catalytic activity, protein binding and enzyme regulator activity.

Figure 5
figure 5

Overview of the function of the secreted proteins. Proteins were classified manually based on BLAST similarity and GO annotation to gain an overview of their biological function. Proteins with a hydrolysing activity are the largest group; in total 27 proteases, 8 carbohydrate/peptidoglycan degrading as well as 13 lipid degrading/binding proteins were identified. 66 proteins have similarities to predicted or uncharacterized proteins, while 41 proteins do not have any similarities to known proteins.

Discussion

The flexible life stages of N. fowleri, including resting cysts, fast moving flagellates and a crawling amoeboid form and its ability to proliferate in fresh water sources as well as a facultative parasite within the host’s CNS makes N. fowleri an ideal organism to study fundamental eukaryotic processes and pathogenesis. Although pathogenesis of N. fowleri has been studied extensively for more than 50 years, the mechanisms of invasion of the CNS and the ability to survive within the human brain are still poorly understood. Furthermore, the diversity of N. fowleri isolates on the genomic level is largely unknown. In this study, we successfully applied ONT sequencing to the genome of the human pathogenic amoeba N. fowleri resulting in a highly contiguous reference of the genome comprising of 83 contigs. In comparison to the ONT assembly, the previously sequenced N. fowleri ATCC 30863 based on Illumina and 454 sequencing technology is assembled in 1,729 contigs with a N50 of 38,128 bp and L50 of 212 while the total assembly length is 27,791,290 bp. De novo assembly using long reads decreased the total number of contigs drastically (90 vs 1729) and lead to an 18 fold increased N50 (717,491 bp vs 38,128 bp). Furthermore, polishing using signal level raw data and high-quality Illumina reads improved the quality of the initial assembly. The final assembly shows similar numbers of complete BUSCOs as other sequenced Naegleria species, suggesting that the sequencing and assembly of ONT data resulted in a similar quality and completeness. In addition, the circular sequences of the mitochondrion and the extrachromosomal rDNA plasmid, for which currently no reference is available in public databases, were successfully assembled into single circular contigs. Although less proteins were identified using an ab initio approach in the current assembly when compared to other published Naegleria genomes, analysis of conserved eukaryotic proteins (BUSCOs) across species did not show a reduced completeness of the protein set. In summary, sequencing and de novo assembly of the N. fowleri genome resulted in a high-quality draft genome providing the basis for further studies including transcriptomics or proteomics. To gain insight in the biology of the human pathogenic amoeba and to identify factors involved in the pathogenesis of PAM, we predicted 208 potentially secreted proteins and analysed their function based on BLAST similarities and GO annotations. A large proportion of those proteins is annotated with a hydrolysing function but also proteins such as proteinase inhibitors, ribonucleases or enzymes playing a role in sensing of the environment are identified. Hydrolysing enzymes are important factors for accessing nutrients including prokaryotic and eukaryotic microorganism found in soil and fresh water sources. The ability of Naegleria species to feed on bacteria as well as on mammalian cells has been described in different studies examining growth conditions in laboratory environments. The addition of heat inactivated bacteria, for example, increases the proliferation rate of Naegleria and cultivation on a mammalian cell monolayer has been linked to an increased pathogenicity of N. fowleri44,45. So far, the genes involved in the assimilation of nutrients are not well studied. By analysing secreted proteins, we identified proteins involved in the degradation of bacterial and eukaryotic cell components. The N. fowleri genome encodes for proteins such as endolysin, xylosidases and lipopolysaccharide-binding protein which play a major role in the degradation of bacterial cell membranes that consists of proteoglycan and outer membrane lipopolysaccharides. Further, the protist also secrets phospholipases and ceramidases which enable the hydrolysis of eukaryotic cell membranes. Additionally, 27 secreted proteases were identified which are most likely involved in the degradation of extracellular material. Proteins with a hydrolysing function not only play a role in the nutrition of Naegleria but could also serve as potential pathogenicity factors. Among others, the predicted secretome contains proteins which have been previously linked to pathogenicity such as Naegleriapore A (Uniprot accession number Q9BKM2) and virulence-related protein Nf314 (Uniprot accession number P42661). Naegleriapore A was characterized by Herbst et al. (2002)46 and shows a cytotoxic activity against mammalian and bacterial cells. The virulence-related protein Nf314 is a serine carboxypeptidase and was discovered in 1992 by analysing gene expression patterns in highly and weakly pathogenic N. fowleri trophozoites47. Other studies highlight the importance of cysteine, serine, and metalloproteases13,16,48 as well as of phospholipases17,49 during the pathogenesis of N. fowleri. However, the actual protein sequences often remain unknown. The here identified proteases and phospholipases are potentially involved in the pathogenesis of PAM and could serve as the missing link between the described proteolytic and lipolytic function and the actual gene sequence. Additionally, proteins belonging to the Ependymin and Tenascin family were identified. Their GO annotations indicate a function in cell adhesion and in binding proteins of the extracellular matrix such as fibronectin or collagen. However, their actual role and how their binding function can be linked to the pathogenicity still has to be examined. Analysis of secreted proteins further shows, that N. fowleri secrets DNAJ (HSP40) domain containing proteins which are linked to increased virulence in bacteria and the parasite Plasmodium falciparum50,51,52. HSP40 is known as regulator of HSP7053, a heat-shock protein which has been linked to pathogenicity of N. fowleri previously54. Given the regulatory function, it is possible, that HSP40 plays an important role in the pathogenicity of N. fowleri. Other proteins of the secretome are protease inhibitors or act as ribonucleases but still little is known about their biological function. The secretion of proteases inhibitors has been described in various parasites including Apicomplexa, oomycetes or helminths and a protective function from host proteases is reported for protease inhibitors55,56,57. Further investigations are needed to clarify the role of secreted protease inhibitors in Naegleria. Beside proteins with potential involvement in pathogenicity, we found evidence for cell-cell communication and autoregulation of the population density similar as observed in D. discoideum by the identification of proteins similar to the autocrine proliferation repressor (aprA) and the counting complex (countin-1).

To conclude, sequencing of the N. fowleri genome using long reads resulted in a high-quality draft genome. Furthermore, we identified 208 potentially secreted proteins of which 20% have a hydrolase activity. Given their proteolytic function, they are involved in the lysis of microorganism and eukaryotic tissue cells as nutrients and are therefore considered as potential pathogenicity factors.

Methods

Cultivation of Naegleria and DNA isolation

To extract high molecular DNA, N. fowleri (ATCC 30894) was cultivated in Nelson’s Medium (pH 6.5)58 using NunclonTM Δ Surface cell culture flasks (Thermo Fisher Scientific, Allschwil, Switzerland). 1 × 107 trophozoites were used for DNA extraction using the DNeasy Blood and Tissue Kit (Qiagen, Basel, Switzerland) according to the manufacturer’s protocol including an RNA digestion step. DNA was finally eluted in 100 μl low TE buffer (10 mM Tris-HCl, 0.1 mM EDTA, pH 8, Thermo Fisher Scientific) and quantified using the Qubit 3.0 Fluorometer (Thermo Fisher Scientific).

Library preparation and sequencing

1 μg DNA was used for library preparation using the ONT 1D ligation sequencing kit (SQK-LSK109) according to the manufacturer’s protocol without fragmentation step. The resulting library was spotted on a SpotON Flow Cell (FLO-MIN106, R9.4) and sequenced on the GridION X5 sequencer for 48 h. Live base-calling was carried out on the GridION X5 (software release 18.12.1) using ONT MinKNOW 3.1.8 and Guppy v2.0.5 with default options. In addition to the long-read sequencing, 1 μg of the isolated DNA was sent to the Functional Genomics Center Zurich (FGCZ) for sequencing on Illumina NovaSeq. 6000 resulting in 55 Million paired-end reads. For further analysis, Illumina reads were quality trimmed using Trimmomatic 0.3659 (options: LEADING:3, TRAILING:3, SLIDINGWINDOW:4:8, MINLEN:36).

Genome de novo assembly and polishing

To facilitate assembly of the genomic sequence, rDNA plasmid reads were excluded from the assembly, by mapping raw reads to the N. gruberi rDNA plasmid reference sequence (GenBank accession no. AB298288.1). Unmapped sequences were then de novo assembled using Canu v1.7 with default settings. To increase consensus accuracy, raw reads were mapped to the draft assembly using minimap2 v2.860 and Nanopolish v0.11.022 was applied as polishing tool. In total, 5 rounds of Nanoplish were performed to minimize the number of changes. To further improve consensus accuracy, quality trimmed Illumina reads were mapped to the assembly using bwa v0.7.16a61 and the sequence was polished using Pilon v1.2229. After polishing, the assembly was manually curated to reduce the number redundant contigs by considering MUMmer v4.0.062 all-against-all alignments of the contigs and duplicated BUSCOs. An overview of the sequencing and assembly workflow is shown in Fig. 6.

Figure 6
figure 6

Genome Assembly Workflow. Total DNA was isolated followed by library preparation using the ONT LSK109 kit and sequencing. The other part of the DNA sample was used for Illumina sequencing. Reads of the ONT sequencing were base called using Guppy. Before assembly of the nuclear genome, reads of the mitochondrial genome and the extrachromosomal rDNA plasmid were removed by aligning them to N. gruberi reference sequence. The nuclear genome was then assembled using Canu 1.7 followed by 5 rounds of Nanopolish v0.11.0 and one round of Pilon v1.22 in combination with trimmed high-quality Illumina data. The polished assembly was then manually curated to remove redundant sequences based on MUMmer v4.0.0 alignments and duplicated BUSCOs. The mitochondrial genome and the rDNA plasmid were assembled separately. Therefore, reads were quality filtered using filtlong followed by assembly using Canu and polishing using Nanopolish and Pilon as described above.

Assembly of the mitochondrial DNA and the ribosomal DNA plasmid

Beside its genomic DNA, N. fowleri possesses a mitochondrial genome and a circular plasmid containing the ribosomal sequences, both were assembled separately. To retain 100x coverage of the mitochondrial genome, raw reads were filtered using filtlong v0.2.031 with the parameters–min_length 1000–keep_percent 90–target_bases 5000000–trim–split 500 and the previously published mitochondrial sequence of N. fowleri (GenBank accession no: NC_021104.1) as reference (option -a). Filtered reads were assembled using Canu 1.7 followed by polishing using Nanopolish v0.11.0 and Pilon v1.22. Finally, the consensus sequence was manually curated and circularized by aligning to the reference (NC_021104.1). The extrachromosomal rDNA plasmid was assembled in the same manner using the N. gruberi rDNA plasmid (GenBank accession no: AB298288.1) as reference. During read filtering using filtlong and options were set to–min_length 1000–keep_percent 90–target_bases 3200000–trim–split 1000–length_weight 10 to maximize the number of long sequences.

Repeat annotation

Repeats were de novo predicted using RepeatModeler v1.0.1132, including RECON33 v1.05 and RepeatScout v1.0.533. Sequences with known protein domain were identify by Hmmer3.1b63 and the PFAM-A 29.0 database64 database and excluded from the library. For further classification into main functional repeat categories, remaining sequences were submitted to TEclass65. The curated repeat library was finally used for repeat annotation using RepeatMasker v4.0.833.

Gene annotation

Gene models were predicted using BRAKER134 in combination with unpublished RNA sequencing data which are part of an on-going project (N. Liechti et al., unpublished). Infernal 1.1.235 and the Rfam 12.1 database36 were used for annotation of non-coding and ribosomal RNA sequences. PFAM protein domains of de gene coding sequences were identified using HMMER v3.1b263 in combination with the PFAM-A 29.0 database64. Sequences were annotated with BLAST66 similarity search against UniRef90 with an e-value of 1e-05. Predicted proteins were then annotated using the BLAST2GO v5.2 pipeline37 with default mapping parameters including BLAST similarity search against Uniref90 and protein domain search using InterproScan67. Additionally, EggNOG mapper68 in combination with eggNOG 4.5 orthology data38 was used to retrieve KEGG categories. Secreted proteins were predicted by SignalP v5.026 and the resulting set of proteins was additionally analysed for their cellular localization using Deeploc v1.027.

Genome and proteome completeness

Completeness of the assembly at the individual steps and of the predicted gene models was accessed using BUSCO v3.0.130 in combination with the Eukaryote dataset odb969 comprising of 303 single-copy orthologs using the default species parameters. In addition, the number of BUSCOs was compared to previously sequenced Naegleria genomes (N. fowleri ATCC 30863 (AWXF00000000), N. lovaniensis PYSW00000000), N. gruberi (ACER00000000)). To assess the completeness of the gene prediction, BUSCO analysis was performed on the BRAKER1 predicted proteins and compared across different Naegleria species (N. fowleri ATCC 30863 (transcriptome de novo assembly21), N. lovaniensis (predicted proteins40), N. gruberi (Uniprot reference proteome UP00000667139).

Phylogenetic analysis

To gain an overview of shared orthologs between Naegleria species, protein sequences of N. fowleri ATCC 30894, N. gruberi (Uniprot reference proteome UP00000667139), and N. lovaniensis (predicted proteins40) were clustered using orthoVenn v2.041 with the default parameters (e-value = 1e-5, inflation value = 1.5). Further, phylogenetic relationships were inferred by constructing a maximum likelihood tree. Therefore, the core genome of four Naegleria species (N. fowleri ATCC 30894 (this study), N. gruberi, N. lovaniensis, and N. fowleri ATCC 30863 (transcriptome based ORFs21) and the more distantly related T. brucei brucei (Uniprot reference proteome UP000008524) as outgroup was computed using orthoVenn v1.0 (default parameters). The sequences of the resulting 978 single-copy orthologs were aligned using MUSCLE v3.8.3140 followed by trimming using trimA v1.4l70 and concatenating to a supermatrix using FASconCAT v1.0471. The best fitting amino acid replacement model was estimated using Prottest v3.4.272 and the phylogenetic tree was estimated based on maximum likelihood and 1,000 bootstrap iteration using RAxML v8.2.1142.