Nanopore sequencing improves the draft genome of the human pathogenic amoeba Naegleria fowleri

Naegleria fowleri is an environmental protist found in soil and warm freshwater sources worldwide and is known for its ability to infect humans and cause a rapid and mostly fatal primary amoebic meningoencephalitis. When contaminated water enters the nose, the facultative parasite follows the olfactory nerve and enters the brain by crossing the cribriform plate, where it causes tissue damage and haemorrhagic necrosis. Although N. fowleri has been studied for several years, the mechanisms of pathogenicity are still poorly understood. Furthermore, there is a lack of knowledge on the genomic level and the current reference assembly is limited in contiguity. To improve the draft genome and to investigate pathogenicity factors, we sequenced the genome of N. fowleri using Oxford Nanopore Technology (ONT). Assembly and polishing of the long reads resulted in a high-quality draft genome whose N50 is 18 times higher than that of the previously published genome. The prediction of potentially secreted proteins revealed a large proportion of enzymes with a hydrolysing function, which could play an important role during pathogenesis and account for the destructive nature of primary amoebic meningoencephalitis. The improved genome provides the basis for further investigations into the biology and the pathogenic potential of N. fowleri.

www.nature.com/scientificreports

[...] based on their biochemical properties by using enzyme and proteinase inhibitor assays, the diversity of N. fowleri secreted proteins is still sparsely described, and the actual genes encoding the factors responsible for the cytotoxic effect often remain unknown. Although the genome of N. fowleri (isolate ATCC 30863) was published in 2014, the assembly is highly fragmented, comprising over 1,000 contigs, and gene annotation on the genomic sequence is missing 21. Therefore, improving the N. fowleri assembly provides the basis for further experiments on a molecular and computational level to unravel the pathogenicity of N. fowleri. In the last few years, sequencing technologies have markedly improved, and long-read sequencing using Oxford Nanopore Technology (ONT) offers a new possibility to resolve highly repetitive regions in genome assemblies, leading to highly contiguous reference sequences 22,23. Nevertheless, one of the major drawbacks of long-read sequencing methods is their high error rate. Therefore, hybrid assembly approaches incorporating high-quality short reads are often used to improve the accuracy 23,24. Recent improvements in the sequencing chemistry and base-calling algorithms of ONT have led to an increase in read quality, which facilitates de novo genome assembly and yields a consensus accuracy of over 99.8% 25. In this study, we successfully applied ONT sequencing to the genome of the human pathogenic amoeba N. fowleri ATCC 30894. The higher contiguity of the N. fowleri genome assembly enables the prediction of genes on the nucleotide level using ab initio and RNA sequencing (RNAseq) based methods and provides a high-quality reference for further downstream experiments. The secretion of proteins with hydrolysing activity has been discussed as an important factor in the pathogenesis of N. fowleri.
Using deep neural networks implemented in SignalP 26 and DeepLoc 27, we identified 208 potentially secreted proteins, of which 18% are annotated with a hydrolysing function. In free-living N. fowleri, they are most likely involved in the lysis of bacterial and eukaryotic microorganisms. Given their proteolytic function, these proteins may additionally play a major role in the degradation of extracellular matrix (ECM) proteins and nerve cells and contribute to the pathogenesis of primary amoebic meningoencephalitis (PAM).

Results
Genome assembly. To gain a complete overview of the N. fowleri genome, total DNA of the isolate ATCC 30894 was extracted and sequenced using ONT. Sequencing of the DNA on the GridION system using one flow cell resulted in a total of 1,352,535 base-called reads (9 Gb) with a mean read length of 6,658.6 bp and an N50 of 11,677 bp. The nuclear genome of N. fowleri was assembled using the string graph assembler Canu v1.7 28. To improve consensus accuracy, the initial assembly was polished in two steps: first with ONT raw signal data using Nanopolish v0.11.0 22, and then with high-quality Illumina reads using Pilon v1.22 29. The final, curated assembly of the nuclear genome consists of 90 contigs with a total length of 29,549,925 bp, an N50 of 717,491 bp and an L50 of 18. Further, the assembly has a GC content of 36.9%, which is similar to that of the previously published N. fowleri ATCC 30863 assembly and the N. lovaniensis genome, while a slightly lower GC content of 35% is reported for N. gruberi. With a size of 717,491 bp, the N50 is 18 times higher than the N50 of the previously published N. fowleri ATCC 30863 genome and is comparable to the N50 of the N. lovaniensis assembly (Table 1).
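The contiguity metrics reported above can be illustrated with a minimal Python sketch. N50 is the length of the contig (or read) at which the cumulative length, summed from the longest sequence downwards, first reaches half of the total, and L50 is the number of sequences needed to get there. The function name and the toy lengths are illustrative assumptions and are not taken from the assembly.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig or read lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence downwards until half of the
    # total length is covered.
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable: the loop always returns")


# Toy example (hypothetical lengths, total = 300).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy_contigs))  # (70, 2)
```

A higher N50 together with a lower L50 indicates that fewer and longer contigs cover half of the assembly.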
Assembly quality. As a measure of completeness and quality, the presence of 303 Benchmarking Universal Single-Copy Orthologs (BUSCOs) 30 was analysed at the individual assembly steps. BUSCOs only include genes that have evolved as single-copy orthologs over a long time period. Duplication or loss of such genes is considered a rare event; therefore, the presence or absence of BUSCOs provides an overview of assembly accuracy and completeness 30. The initial Canu assembly contains 192 complete, 50 fragmented and 61 missing BUSCOs. Polishing using ONT signal-level data (Nanopolish) increased the number of complete BUSCOs to 251, while fewer fragmented (16) and missing (39) BUSCOs are observed. Polishing using high-quality Illumina data (Pilon) and manual curation of the assembly further increased the accuracy, resulting in 262 complete, 7 fragmented and 34 missing BUSCOs (Fig. 1). Additional rounds of Pilon polishing did not improve the BUSCO statistics. To benchmark the quality of the ONT assembly, the number of BUSCOs was compared to that of other previously sequenced Naegleria species. Similar numbers of complete BUSCOs are found for the N. fowleri ATCC 30863 assembly (264), the N. lovaniensis assembly (259) and the N. gruberi genome (257). The number of fragmented BUSCOs in the N. fowleri ATCC 30894 ONT assembly (7) is slightly higher than that observed in the previous Illumina assembly of N. fowleri ATCC 30863 (Table 2).
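To make the BUSCO counts easier to compare, they can be expressed as percentages of the 303 orthologs in the dataset. The following Python sketch is only an illustration: the function name is hypothetical, and the check that the three categories add up to the dataset size is our own sanity test, not part of the BUSCO software.

```python
from typing import Dict


def busco_percentages(complete: int, fragmented: int, missing: int,
                      total: int = 303) -> Dict[str, float]:
    """Convert BUSCO counts into percentages of the ortholog set."""
    # Every BUSCO falls into exactly one category, so the counts
    # have to add up to the size of the dataset.
    if complete + fragmented + missing != total:
        raise ValueError("BUSCO categories do not sum to the dataset size")
    return {
        "complete": round(100 * complete / total, 1),
        "fragmented": round(100 * fragmented / total, 1),
        "missing": round(100 * missing / total, 1),
    }


# Counts reported for the initial Canu assembly and the final assembly.
print(busco_percentages(192, 50, 61))
print(busco_percentages(262, 7, 34))
```

With the counts given in the text, the share of complete BUSCOs rises from about 63% in the initial Canu assembly to about 86% in the final, curated assembly.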
Mitochondrial genome and extrachromosomal sequences. Besides its nuclear genome, N. fowleri possesses a mitochondrial genome and encodes its ribosomal subunits on an extrachromosomal plasmid (rDNA plasmid). Because the coverage of these elements is high relative to that of the chromosomal sequences, overall assembly performance is poor when all reads are assembled together. Therefore, mitochondrial reads were quality filtered and downsampled [...] Table 3. Second, genes on the nuclear genome of N. fowleri were predicted by an ab initio and RNAseq based approach using BRAKER1 34, while non-coding and ribosomal RNA sequences were annotated by a similarity approach using Infernal 1.1.2 35 in combination with the Rfam 12.1 database 36. In total, 13,925 genes were predicted, of which 12,009 (86%) encode at least one PFAM protein domain, 8,434 (61%) could be annotated with a GO term using BLAST2GO 37, and 5,604 (40%) could be assigned a KEGG number using eggNOG 38. Compared to other Naegleria species, slightly fewer genes were identified: for N. gruberi, 16,620 genes are reported 39, and in the N. lovaniensis assembly, 15,195 genes were predicted 40. Nevertheless, the set of predicted proteins did not show a reduced number of complete BUSCOs compared to other Naegleria species (Table 4). In addition to the protein-coding regions, Infernal 1.1.2 identified tRNAs and 5S rRNAs as well as the spliceosomal RNAs U1-U6 on the nuclear genome.

Genome similarity.
To gain an overview of the gene repertoire of the genus Naegleria, proteins of N. fowleri, N. lovaniensis, and N. gruberi were clustered using OrthoVenn 41. The clustering revealed 8,315 protein clusters shared by all Naegleria species. In total, 2,479 clusters are shared between N. fowleri and N. lovaniensis, while N. fowleri shares only 231 protein clusters with N. gruberi (Fig. 2). To assess the phylogenetic position of the newly sequenced N. fowleri isolate, a maximum likelihood tree with bootstrap support was estimated using RAxML v8.2.11 42, considering 978 single-copy orthologs of the Naegleria species with publicly available genomes. As outgroup, we chose the more distantly related Trypanosoma brucei. The N. fowleri isolate ATCC 30894 is closely related to the isolate ATCC 30863 and to N. lovaniensis, while N. gruberi is more distantly related (Fig. 3).
Table 4. BUSCO analysis of predicted proteins across different Naegleria species.

Secreted proteins. The secretion of proteases and other degradative enzymes is discussed as an important factor during the pathogenesis of PAM, as these enzymes may disrupt the extracellular matrix and nerve cells. To identify proteins secreted by the classical secretory pathway, we combined SignalP v5.0 26 to identify signal peptides and DeepLoc v1.0 27 to verify the cellular localization. In total, we identified 208 potentially secreted proteins, of which 163 have BLAST similarities to a protein in the Uniref90 database. Using the BLAST2GO pipeline v5.2 37, 75 proteins could be annotated with a GO term. To gain an overview of the molecular function of the secreted proteins, GO annotations were visualized using WEGO v2.0 43. Almost 20% of the secreted proteins are annotated with the term hydrolase activity (GO:0016787), while 10% are associated with an ion-, protein-, or lipid-binding function (GO:0005488). Other terms are catalytic activity (GO:0140096, 10%), enzyme regulator activity (GO:0030234, 3.8%) or isomerase activity (GO:0016853, 1%) (Fig. 4). BLAST similarities and GO annotations were used to further classify the secreted proteins. Among the proteins with a hydrolysing function, we identified 27 proteases, including cysteine and serine proteases, as well as 21 proteins involved in the degradation of lipids and peptidoglycan, such as lipases, phospholipases, endolysin-like proteins and beta-xylosidases. Additionally, three proteins with similarities to the autocrine proliferation repressor (aprA, Uniprot accession number Q5XM24) and the counting complex (countin-1, Uniprot accession number Q86IV5) of Dictyostelium discoideum were identified, which suggests an ability to sense and regulate the population density in Naegleria.
Additionally, the secretome contains proteins belonging to the Ependymin and Tenascin family as well as different proteinase inhibitors, DnaJ homolog superfamily proteins or ribonucleases. The function of 107 proteins is still unknown. For 41 proteins no BLAST similarity was found and 66 have similarities to predicted or uncharacterized proteins (Fig. 5). A full list of the predicted secreted proteins including their annotation is shown in Supplementary Table S1.
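The logic of combining the two predictors can be sketched in a few lines of Python: a protein is kept only if a signal peptide is predicted and the predicted localization is extracellular. This is a minimal sketch under stated assumptions and not the pipeline used in this study. The input dictionaries, the protein identifiers and the function name are hypothetical, the tool outputs are assumed to have been parsed beforehand, and the label "Extracellular" is assumed to be the localization class reported for secreted proteins.

```python
from typing import Dict, Set


def classically_secreted(has_signal_peptide: Dict[str, bool],
                         localization: Dict[str, str],
                         secreted_label: str = "Extracellular") -> Set[str]:
    """Return IDs with a predicted signal peptide AND a secreted localization."""
    return {
        protein_id
        for protein_id, signal_peptide in has_signal_peptide.items()
        # Proteins lacking a localization prediction are dropped.
        if signal_peptide and localization.get(protein_id) == secreted_label
    }


# Hypothetical, pre-parsed predictions for four proteins.
signal_peptides = {"p1": True, "p2": True, "p3": False, "p4": True}
localizations = {"p1": "Extracellular", "p2": "Cell membrane",
                 "p3": "Extracellular"}
print(sorted(classically_secreted(signal_peptides, localizations)))  # ['p1']
```

Requiring agreement between both predictions is conservative: proteins supported by only one of the two tools are excluded from the candidate set.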

Discussion
The flexible life stages of N. fowleri, including resting cysts, fast-moving flagellates and a crawling amoeboid form, and its ability to proliferate in freshwater sources as well as a facultative parasite within the host's central nervous system (CNS) make N. fowleri an ideal organism to study fundamental eukaryotic processes and pathogenesis. Although the pathogenesis of N. fowleri has been studied extensively for more than 50 years, the mechanisms of invasion of the CNS and the ability to survive within the human brain are still poorly understood. Furthermore, the diversity of N. fowleri isolates on the genomic level is largely unknown. In this study, we successfully applied ONT sequencing to the genome of the human pathogenic amoeba N. fowleri, resulting in a highly contiguous reference of the genome comprising 90 contigs. In comparison to the ONT assembly, the previously sequenced N. fowleri ATCC 30863 genome, which is based on Illumina and 454 sequencing technology, is assembled in 1,729 contigs with an N50 of 38,128 bp and an L50 of 212, while the total assembly length is 27,791,290 bp. De novo assembly using long reads drastically decreased the total number of contigs (90 vs 1,729) and led to an 18-fold increase in N50 (717,491 bp vs 38,128 bp). Furthermore, polishing using signal-level raw data and high-quality Illumina reads improved the quality of the initial assembly. The final assembly shows similar numbers of complete BUSCOs as other sequenced Naegleria species, suggesting that the sequencing and assembly of ONT data resulted in a similar quality and completeness. In addition, the circular sequences of the mitochondrion and the extrachromosomal rDNA plasmid, for which currently no reference is available in public databases, were successfully assembled into single circular contigs.
Although fewer proteins were identified using an ab initio approach in the current assembly when compared to other published Naegleria genomes, analysis of conserved eukaryotic proteins (BUSCOs) across species did not show a reduced completeness of the protein set. In summary, sequencing and de novo assembly of the N. fowleri genome resulted in a high-quality draft genome providing the basis for further studies including transcriptomics or proteomics. To gain insight into the biology of the human pathogenic amoeba and to identify factors involved in the pathogenesis of PAM, we predicted 208 potentially secreted proteins and analysed their function based on BLAST similarities and GO annotations. A large proportion of those proteins is annotated with a hydrolysing function, but proteins such as proteinase inhibitors, ribonucleases or enzymes playing a role in sensing of the environment were also identified. Hydrolysing enzymes are important for accessing nutrients, including prokaryotic and eukaryotic microorganisms found in soil and freshwater sources. The ability of Naegleria species to feed on bacteria as well as on mammalian cells has been described in different studies examining growth conditions in laboratory environments. The addition of heat-inactivated bacteria, for example, increases the proliferation rate of Naegleria, and cultivation on a mammalian cell monolayer has been linked to an increased pathogenicity of N. fowleri 44,45. So far, the genes involved in the assimilation of nutrients are not well studied. By analysing secreted proteins, we identified proteins involved in the degradation of bacterial and eukaryotic cell components. The N. fowleri genome encodes proteins such as endolysin, xylosidases and lipopolysaccharide-binding protein, which play a major role in the degradation of bacterial cell envelopes, which consist of peptidoglycan and outer-membrane lipopolysaccharides.
Further, the protist also secretes phospholipases and ceramidases, which enable the hydrolysis of eukaryotic cell membranes. Additionally, 27 secreted proteases were identified which are most likely involved in the degradation of extracellular material. Proteins with a hydrolysing function not only play a role in the nutrition of Naegleria but could also serve as potential pathogenicity factors. Among others, the predicted secretome contains proteins which have been previously linked to pathogenicity, such as Naegleriapore A (Uniprot accession number Q9BKM2) and virulence-related protein Nf314 (Uniprot accession number P42661). Naegleriapore A was characterized by Herbst et al. (2002) 46 and shows cytotoxic activity against mammalian and bacterial cells. The virulence-related protein Nf314 is a serine carboxypeptidase and was discovered in 1992 by analysing gene expression patterns in highly and weakly pathogenic N. fowleri trophozoites 47. Other studies highlight the importance of cysteine, serine, and metalloproteases 13,16,48 as well as of phospholipases 17,49 during the pathogenesis of N. fowleri. However, the actual protein sequences often remain unknown. The proteases and phospholipases identified here are potentially involved in the pathogenesis of PAM and could serve as the missing link between the described proteolytic and lipolytic function and the actual gene sequence. Additionally, proteins belonging to the Ependymin and Tenascin family were identified. Their GO annotations indicate a function in cell adhesion and in binding proteins of the extracellular matrix such as fibronectin or collagen. However, their actual role and how their binding function can be linked to pathogenicity still have to be examined.
Analysis of secreted proteins further shows that N. fowleri secretes DNAJ (HSP40) domain-containing proteins, which are linked to increased virulence in bacteria and the parasite Plasmodium falciparum 50–52. HSP40 is known as a regulator of HSP70 53, a heat-shock protein which has previously been linked to the pathogenicity of N. fowleri 54. Given this regulatory function, it is possible that HSP40 plays an important role in the pathogenicity of N. fowleri. Other proteins of the secretome are protease inhibitors or act as ribonucleases, but little is known about their biological function. The secretion of protease inhibitors has been described in various parasites including Apicomplexa, oomycetes or helminths, and a function in protecting the parasite from host proteases is reported for protease inhibitors 55–57. Further investigations are needed to clarify the role of secreted protease inhibitors in Naegleria. Besides proteins with a potential involvement in pathogenicity, we found evidence for cell-cell communication and autoregulation of the population density similar to that observed in D. discoideum, through the identification of proteins similar to the autocrine proliferation repressor (aprA) and the counting complex (countin-1).
To conclude, sequencing of the N. fowleri genome using long reads resulted in a high-quality draft genome. Furthermore, we identified 208 potentially secreted proteins, of which almost 20% are annotated with hydrolase activity. Given their proteolytic function, they are likely involved in the lysis of microorganisms and eukaryotic tissue cells for nutrient acquisition and are therefore considered potential pathogenicity factors.

Methods
Cultivation of Naegleria and DNA isolation.
To extract high-molecular-weight DNA, N. fowleri (ATCC 30894) was cultivated in Nelson's Medium (pH 6.5) 58 using Nunclon™ Δ Surface cell culture flasks (Thermo Fisher Scientific, Allschwil, Switzerland). For DNA extraction, 1 × 10⁷ trophozoites were processed using the DNeasy Blood and Tissue Kit (Qiagen, Basel, Switzerland) according to the manufacturer's protocol, including an RNA digestion step. DNA was finally eluted in 100 μl low TE buffer (10 mM Tris-HCl, 0.1 mM EDTA, pH 8, Thermo Fisher Scientific) and quantified using the Qubit 3.0 Fluorometer (Thermo Fisher Scientific).
Genome de novo assembly and polishing. To facilitate assembly of the genomic sequence, rDNA plasmid reads were excluded from the assembly by mapping raw reads to the N. gruberi rDNA plasmid reference sequence (GenBank accession no. AB298288.1). Unmapped sequences were then de novo assembled using Canu v1.7 with default settings. To increase consensus accuracy, raw reads were mapped to the draft assembly using minimap2 v2.8 60, and Nanopolish v0.11.0 22 was applied as polishing tool. In total, 5 rounds of Nanopolish were performed to minimize the number of changes. To further improve consensus accuracy, quality-trimmed Illumina reads were mapped to the assembly using bwa v0.7.16a 61 and the sequence was polished using Pilon v1.22 29. After polishing, the assembly was manually curated to reduce the number of redundant contigs by considering MUMmer v4.0.0 62 all-against-all alignments of the contigs and duplicated BUSCOs. An overview of the sequencing and assembly workflow is shown in Fig. 6.
Assembly of the mitochondrial DNA and the ribosomal DNA plasmid. Besides its genomic DNA, N. fowleri possesses a mitochondrial genome and a circular plasmid containing the ribosomal sequences; both were assembled separately. To retain 100x coverage of the mitochondrial genome, raw reads were filtered using filtlong v0.2.0 31 with the parameters --min_length 1000 --keep_percent 90 --target_bases 5000000 --trim --split 500 and the previously published mitochondrial sequence of N. fowleri (GenBank accession no: NC_021104.1) as reference (option -a). Filtered reads were assembled using Canu v1.7 followed by polishing using Nanopolish v0.11.0 and Pilon v1.22. Finally, the consensus sequence was manually curated and circularized by aligning it to the reference (NC_021104.1). The extrachromosomal rDNA plasmid was assembled in the same manner using the N. gruberi rDNA plasmid (GenBank accession no: AB298288.1) as reference. During read filtering with filtlong, the options were set to --min_length 1000 --keep_percent 90 --target_bases 3200000 --trim --split 1000 --length_weight 10 to maximize the number of long sequences.
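The relation between the target number of bases and the retained coverage is simple arithmetic: the expected coverage is the number of retained bases divided by the length of the target sequence. The Python sketch below illustrates this for the mitochondrial settings. The function names are hypothetical, and the sequence length of 50,000 bp is only the value implied by 100x coverage at 5,000,000 target bases; it is an assumption for illustration and not a length stated in the text.

```python
def target_bases(sequence_length_bp: int, coverage: float) -> int:
    """Number of bases to retain for a desired mean coverage."""
    return round(sequence_length_bp * coverage)


def expected_coverage(retained_bases: int, sequence_length_bp: int) -> float:
    """Mean coverage obtained from a given number of retained bases."""
    return retained_bases / sequence_length_bp


# Assumed length of about 50 kb (implied by the settings, not stated).
assumed_mito_length = 50_000
print(target_bases(assumed_mito_length, 100))             # 5000000
print(expected_coverage(5_000_000, assumed_mito_length))  # 100.0
```

Downsampling to a fixed number of bases in this way keeps the coverage of small, high-copy elements in a range comparable to that of the chromosomal sequences.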
Repeat annotation. Repeats were de novo predicted using RepeatModeler v1.0.11 32, including RECON v1.05 33 and RepeatScout v1.0.5 33. Sequences with a known protein domain were identified using Hmmer3.1b 63 and the PFAM-A 29.0 database 64 and excluded from the library. For further classification into main functional repeat categories, the remaining sequences were submitted to TEclass 65. The curated repeat library was finally used for repeat annotation using RepeatMasker v4.0.8 33.
Gene annotation. Gene models were predicted using BRAKER1 34 [...]
Genome and proteome completeness. Completeness of the assembly at the individual steps and of the predicted gene models was assessed using BUSCO v3.0.1 30 in combination with the Eukaryote dataset odb9 69, comprising 303 single-copy orthologs, using the default species parameters. In addition, the number of BUSCOs was compared to previously sequenced Naegleria genomes (N. fowleri ATCC 30863 (AWXF00000000), N. lovaniensis (PYSW00000000), N. gruberi (ACER00000000)). To assess the completeness of the gene prediction, BUSCO analysis was performed on the BRAKER1 predicted proteins and compared across different Naegleria species (N. fowleri ATCC 30863 (transcriptome de novo assembly 21), N. lovaniensis (predicted proteins 40), N. gruberi (Uniprot reference proteome UP000006671 39)).

Data availability
ONT and Illumina raw reads have been deposited at the National Center for Biotechnology Information (NCBI) BioProject repository PRJNA541227 with the accession numbers SRR9047098 (ONT) and SRR9047076 (Illumina). The genome assembly is available under the GenBank accession number VFQX00000000. The sequences of the predicted proteins are available on figshare (https://doi.org/10.6084/m9.figshare.8313656).

Figure 6. Genome assembly workflow. Total DNA was isolated, followed by library preparation using the ONT LSK109 kit and sequencing. The other part of the DNA sample was used for Illumina sequencing. Reads of the ONT sequencing were base called using Guppy. Before assembly of the nuclear genome, reads of the mitochondrial genome and the extrachromosomal rDNA plasmid were removed by aligning them to the N. gruberi reference sequence. The nuclear genome was then assembled using Canu v1.7 followed by 5 rounds of Nanopolish v0.11.0 and one round of Pilon v1.22 in combination with trimmed high-quality Illumina data. The polished assembly was then manually curated to remove redundant sequences based on MUMmer v4.0.0 alignments and duplicated BUSCOs. The mitochondrial genome and the rDNA plasmid were assembled separately. For this purpose, reads were quality filtered using filtlong, followed by assembly using Canu and polishing using Nanopolish and Pilon as described above.