Background & Summary

Leishmaniasis is a neglected tropical disease. It is considered a disease of poverty, primarily affecting low- and middle-income countries (LMICs). Leishmaniasis is caused by parasites of the genus Leishmania, and 18 different species are known to infect humans1. Ninety-eight sandfly species are suspected or confirmed vectors of Leishmania2. There are three major types of leishmaniasis: visceral, also known as kala-azar, which is fatal in over 95% of cases if left untreated; cutaneous, the most common form, which causes skin lesions that leave life-long scars and serious disability or stigma; and mucocutaneous, which leads to partial or total destruction of the mucous membranes of the nose, mouth and throat3. Over one billion people live in endemic areas and are at risk of leishmaniasis. Each year, an estimated 700,000 to 1.2 million or more new cases of cutaneous leishmaniasis occur globally in over 100 countries4. Additionally, up to 300,000 visceral leishmaniasis cases cause more than 200,000 deaths annually5.

The genus Leishmania is divided into four subgenera: L. Leishmania, L. Viannia, L. Sauroleishmania and the newest subgenus L. Mundinia, the latter now accommodating several species from the L. enriettii complex and others, from five continents6,7,8,9,10,11,12. In 1994, the Leishmania Genome Network was initiated13 and, ten years later, announced the assembly of the Leishmania major Friedlin strain as the first Leishmania reference genome14. Since then, a total of 58 genomes have become publicly available, assembled at levels of completeness ranging from contigs to chromosomes. Prior to our project, only two genomes from the subgenus L. Mundinia had been sequenced and assembled: Leishmania enriettii, strain LEM3045 (GCA_000410755) and Leishmania sp. MAR, strain LEM2494 (GCA_000410755). The genus Porcisia is a sister genus of Leishmania within the sub-family Leishmaniinae. Prior to the release of our genome, there were no genome sequences for the genus Porcisia. Subsequently, the partial genome of P. deanei was released and published15.

We assembled and annotated the genomes of five L. Mundinia species (L. martiniquensis, L. orientalis, L. enriettii, L. sp. Ghana and L. sp. Namibia) and one genome in the genus Porcisia (P. hertigi, formerly known as L. hertigi16) using Illumina and Nanopore sequencing. The two isolates from Ghana and Namibia are from new species that have not yet been formally named. The World Health Organization (WHO) codes for the six isolates are: L. martiniquensis MHOM/TH/2012/LSCM1;LV760; L. orientalis MHOM/TH/2014/LSCM4;LV768; L. enriettii MCAV/BR/2001/CUR178;LV673; L. sp. Ghana MHOM/GH/2012/GH5;LV757; L. sp. Namibia MPRO/NA/1975/252;LV425; and P. hertigi MCOE/PA/1965/C119;LV43. Nanopore long reads were used for the initial scaffolding assemblies, followed by mapping of the Illumina short reads onto these scaffolds, thus increasing the quality of the assembled sequence while preserving whole-chromosome integrity. Final polishing, reordering and reorienting of chromosomes, along with masking and classifying of repeat regions, were guided by the most closely related reference genome for each species. Finished genome annotation was both evidence-based and ab initio.

Figure 1 summarises data sizes and total yield per sample. The total sequencing data file size for all samples was 139.33 Gigabytes, yielding 58.70 GigaBases of sequence data from 23.71 GigaReads. Figure 2 summarises our analysis workflow. This workflow generated four main outputs for each assembly: genome, proteome, and transcriptome files in FASTA format, and a General Feature Format file (GFF) that contains the coordinates for all proteins and transcripts in the assembly.

Fig. 1

Stacked column chart showing number of sequenced reads in GigaReads (blue), number of yielded bases in GigaBases (red), and the file sizes in Gigabytes (yellow) for each genome assembly.

Fig. 2

Flowchart showing the analysis workflow strategy.

Methods

Sample collection, sequencing and software

From the parasite cryobank at Lancaster University, we selected six samples of the species listed above, none of which had a publicly available reference genome. Table 1 gives details for strains, isolates, BioSample and BioProject accessions17,18,19,20,21,22,23,24,25,26,27,28. Illumina HiSeq 4000 and MiSeq sequencing was contracted to BGI Genomics and Aberystwyth University. Nanopore sequencing was performed in-house using MinION FLO-MIN106 flow cells with the SQK-LSK109 ligation sequencing protocol. Throughout the text we provide literature citations to software where available. Links to both published and unpublished software used are provided in Table 2. We created public GitHub and Zenodo repositories for the analysis pipeline29,30.

Table 1 Sample descriptions for all assemblies.
Table 2 Tools used in analysis workflow with conda or docker link.

Genome assembly

De novo assemblies were performed with Nanopore MinION long reads using Flye31. Because Nanopore long reads have low quality scores, we mapped high-quality Illumina short reads onto the assemblies and created corrected consensus sequences using minimap232 and SAMtools33. The consensus sequence was scanned for contamination and for any sequence of vector origin using BLAST+34 against the UniVec database35. Finally, a polishing step was performed with Pilon36 to minimise gaps.
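The idea behind short-read correction of a long-read draft can be illustrated with a toy majority vote. This is a minimal sketch only: it handles substitutions but not insertions or deletions, it is not how SAMtools derives a consensus, and the `pileup` structure and `min_depth` threshold are hypothetical names introduced for illustration.

```python
from collections import Counter


def correct_draft(draft, pileup, min_depth=3):
    """Replace draft bases with the majority short-read base where coverage allows.

    draft: draft sequence from the long-read assembly.
    pileup: hypothetical dict mapping 0-based positions to a string of the
            bases observed in short reads aligned at that position.
    min_depth: assumed minimum number of observations needed to trust the vote.
    """
    corrected = []
    for pos, base in enumerate(draft):
        observed = pileup.get(pos, "")
        if len(observed) >= min_depth:
            # Take the most frequently observed base at this position.
            majority, _ = Counter(observed).most_common(1)[0]
            corrected.append(majority)
        else:
            # Too little short-read evidence: keep the long-read base.
            corrected.append(base)
    return "".join(corrected)


draft = "ACGTTGCA"
pileup = {3: "AAAA", 5: "GG", 6: "CCCT"}
corrected = correct_draft(draft, pileup)
print(corrected)
```

In this toy input only position 3 changes: position 5 has too few observations to override the draft, and the majority at position 6 agrees with it.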

Chromosome verification

For all chromosomes of each polished genome, we then ran BLAST+ (parameters: -max_target_seqs 1 -max_hsps 1) against all TriTrypDB37 release-47 genomes. The output for each genome was then visualised using wordcloud to suggest the closest relative among TriTrypDB genomes38. Then, synteny was plotted for each genome by aligning each of its chromosomes with the corresponding chromosomes of its wordcloud-predicted closest relative, using MUMmer39 (Fig. 3). This confirmed that the order and orientation of the chromosomes of each genome were equivalent to those of its closest TriTrypDB genome. Completion was then achieved by sorting the scaffolds or contigs and removing any duplicates using funannotate40, followed by a final quality check using Genome Assembly Annotation Service (GAAS).
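The closest-relative suggestion described above amounts to a frequency count of best hits per reference genome. The sketch below assumes BLAST+ tabular output (`-outfmt 6`, which the text does not state) and a hypothetical `subject_to_genome` mapping from subject sequence IDs to genome names; the IDs and values in the toy input are invented for illustration and are not results.

```python
from collections import Counter


def closest_relative(blast_lines, subject_to_genome):
    """Count best hits per reference genome and return the most frequent one.

    blast_lines: iterable of tab-separated lines (assumed -outfmt 6, where
                 the second column is the subject sequence ID).
    subject_to_genome: hypothetical dict mapping subject IDs to genome names.
    """
    counts = Counter()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject_id = line.split("\t")[1]
        genome = subject_to_genome.get(subject_id)
        if genome is not None:
            counts[genome] += 1
    if not counts:
        return None, counts
    best, _ = counts.most_common(1)[0]
    return best, counts


# Toy input: one best hit per query chromosome, as expected with
# -max_target_seqs 1 -max_hsps 1.
hits = [
    "chr1\tLenr_01\t98.5\t1000\t15\t0\t1\t1000\t1\t1000\t0.0\t1800",
    "chr2\tLenr_02\t97.9\t1200\t25\t0\t1\t1200\t1\t1200\t0.0\t2100",
    "chr3\tLmaj_03\t91.0\t900\t81\t0\t1\t900\t1\t900\t0.0\t1300",
]
mapping = {
    "Lenr_01": "L. enriettii",
    "Lenr_02": "L. enriettii",
    "Lmaj_03": "L. major",
}
best, counts = closest_relative(hits, mapping)
print(best, dict(counts))
```

A wordcloud renders the same counts visually, with the most frequent genome name drawn largest.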

Fig. 3

Dotplot representing synteny between each of our genomes and its wordcloud-predicted closest related reference genome, produced using MUMmer.

Repetitive element annotation

We identified and classified repeat regions in the polished assemblies using RepeatModeler and TEclass41. Then, we generated a stratified genome-wide repeat plot for each assembly38 (see the L. martiniquensis example in Fig. 4) to help decide which repeats to mask with RepeatMasker.

Fig. 4

Example genome-wide repeat plot for L. martiniquensis, stratified by repeat class: simple (micro-satellites), low complexity, DNA, long terminal repeats (LTRs), long interspersed nuclear elements (LINEs), RNA, rolling circle (RC), satellites, short interspersed nuclear elements (SINEs) and retroposons. The middle pie chart represents the proportion of each repeat class in the genome: none (94.4%), simple (micro-satellites) (4.11%), low complexity (0.655%), DNA (0.419%), unknown (0.161%), LTRs (0.110%), LINEs (0.052%), RNA (0.027%), RC (0.019%), satellites (0.010%), retroposons (0.005%), SINEs (0.004%).

Gene prediction and functional annotation

After repeat masking, we annotated the assemblies using the MAKER242 annotation pipeline over two rounds: 1) an evidence-based annotation round using EST, mRNA-seq and protein homology evidence from TriTrypDB release-47 along with our repeat-masking output, and 2) an ab initio round using AUGUSTUS43, with the pre-trained L. tarentolae as the model organism. After each round, Annotation Edit Distance (AED) scores were calculated and plotted (Fig. 5). We calculated brief statistics for each round, e.g. the number of genes and other features, using GenomeTools44 and AGAT45. After completion of all annotation rounds, we assigned functional annotations from the UniProt46 and Pfam47 databases using BLAST+ and InterProScan48.

Fig. 5

Annotation Edit Distance (AED) score (x-axis) line plot for all assembly annotation rounds: evidence-based (solid line) and ab initio (dotted line). Y-axis represents the genome cumulative percentages.
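A cumulative AED curve of this kind can be computed as the percentage of annotations whose AED is at or below each threshold, where an AED of 0 means full agreement with the evidence and 1 means no evidence support. The sketch below is illustrative only: the function name, thresholds and scores are assumptions, not values from the assemblies.

```python
def cumulative_aed(aed_scores, thresholds):
    """Return (threshold, percentage of annotations with AED <= threshold)."""
    if not aed_scores:
        raise ValueError("no AED scores supplied")
    total = len(aed_scores)
    return [
        (t, 100.0 * sum(1 for s in aed_scores if s <= t) / total)
        for t in thresholds
    ]


# Invented AED scores for eight hypothetical gene models.
scores = [0.0, 0.1, 0.1, 0.25, 0.5, 0.75, 1.0, 0.2]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
curve = cumulative_aed(scores, thresholds)
for t, pct in curve:
    print(f"AED <= {t:.2f}: {pct:.1f}%")
```

A curve that rises steeply at low AED values indicates that most gene models are well supported by the evidence.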

Analysis pipeline

To make sure that all assemblies and annotations are reproducible by future investigators, the entire process from obtaining the SRAs49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91 to the annotation assignments92,93,94,95,96,97 has been made available29 using Snakemake98. This Snakemake pipeline ought to be easily adaptable to the sequencing of further similar parasite genomes, throughout the parasitology community30.
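Snakemake derives the execution order of a workflow from the input and output files that each rule declares. The sketch below mimics that ordering with an explicit dependency map and Python's standard `graphlib` module (Python 3.9 or later). The step names are hypothetical labels for the stages described in Methods, not rule names from the published pipeline.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
# Step names are illustrative labels, not the pipeline's actual rule names.
dependencies = {
    "read_qc": {"fetch_reads"},
    "assemble_long_reads": {"fetch_reads"},
    "correct_with_short_reads": {"assemble_long_reads"},
    "screen_contamination": {"correct_with_short_reads"},
    "polish": {"screen_contamination"},
    "verify_chromosomes": {"polish"},
    "annotate_repeats": {"verify_chromosomes"},
    "predict_genes": {"annotate_repeats"},
    "assign_functions": {"predict_genes"},
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because the order follows from the declared dependencies rather than from a hand-written script, adding a step for a new genome only requires stating what it depends on.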

Data Records

Table 3 details the sequencing output. Short and long reads were deposited in the NCBI Sequence Read Archive (SRA)49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91. Six BioProjects23,24,25,26,27,28 and six BioSamples17,18,19,20,21,22 were also created at NCBI. The assembled genomes were deposited at NCBI Assembly99,100,101,102,103,104. Additional files containing raw read quality reports105,106,107,108,109,110, mapped reads111,112,113,114,115,116, classified repeat sequences117,118,119,120,121,122 and functional annotations92,93,94,95,96,97 were deposited at the Lancaster University electronic data archive.

Table 3 Details of reads, bases and file sizes.

Technical Validation

Genomic DNA integrity

Genomic DNA was extracted using Trizol (Invitrogen) and quantified using Qubit® dsDNA HS Assay Kits (ThermoFisher Scientific) prior to sequencing. Concentrations ranged between 68.2 and 120 ng/µL. For consistency, we used the same extracted DNA for all three sequencing platforms (Nanopore MinION, Illumina HiSeq 4000 and MiSeq). Furthermore, we assessed whether the gDNA was of high molecular weight using N50 estimates of the MinION long reads, which ranged between 12.07 and 22.92 kilobases.
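For reference, the read N50 is the length L such that reads of length L or longer together contain at least half of all sequenced bases. A minimal sketch follows; the read lengths are invented for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths."""
    if not lengths:
        raise ValueError("no lengths supplied")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half of all
        # bases is the N50.
        if running >= half:
            return length


# Invented read lengths in bases.
read_lengths = [2_000, 5_000, 8_000, 12_000, 20_000, 25_000]
read_n50 = n50(read_lengths)
print(read_n50)
```

Because N50 is weighted by bases rather than by read count, a few very long reads raise it more than many short reads lower it, which is why it serves as a proxy for DNA integrity.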

Contamination screening

We scanned all assemblies for contamination and for any sequence of vector origin by first building a UniVec database and then searching it with BLAST+. All contaminants were found either at the beginning or at the end of contigs and were deleted. No contaminants affected assembly integrity.
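Removing contaminants located at contig ends can be sketched as trimming any hit interval that touches either terminus. This is an illustrative sketch rather than the procedure the pipeline uses: the function name is hypothetical, and hit coordinates are assumed to be 1-based and inclusive, as in BLAST output.

```python
def trim_terminal_hits(sequence, hits):
    """Trim vector hits that touch either end of a contig.

    sequence: contig sequence as a string.
    hits: list of (start, end) tuples, assumed 1-based and inclusive.
    Returns (trimmed_sequence, internal_hits); internal hits are reported
    but left in place, since removing them would split the contig.
    """
    normalised = sorted((min(s, e), max(s, e)) for s, e in hits)
    left = 0                # number of bases to drop from the start
    right = len(sequence)   # exclusive end index of the region to keep

    # Extend the left trim while hits touch the current left boundary.
    for start, end in normalised:
        if start - 1 <= left:
            left = max(left, end)

    # Extend the right trim while hits touch the current right boundary.
    for start, end in sorted(normalised, key=lambda h: h[1], reverse=True):
        if end >= right:
            right = min(right, start - 1)

    internal = [(s, e) for s, e in normalised if s > left and e <= right]
    return sequence[left:right], internal


# Toy contig: 5 bases of vector, 10 bases of genome, 3 bases of vector.
contig = "TTTTT" + "ACGTACGTAC" + "GGG"
trimmed, internal = trim_terminal_hits(contig, [(1, 5), (16, 18)])
print(trimmed, internal)
```

An empty `internal` list corresponds to the situation reported here, where every contaminant lay at a contig end and trimming left the assembly intact.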

Quality of short and long raw sequence reads

We used FastQC to check the sequence quality of the Illumina short reads and pycoQC to check the sequence quality of the Nanopore long reads. We used MultiQC123 to combine all sequence quality scores into one interactive report105,106,107,108,109,110.

Assembly validation

Because the analysis involved many steps, quality checks were introduced between each step. Some checks focused on completeness, for instance using BUSCO124 as a benchmark for the presence of expected universal single-copy orthologues. Other checks focused on the correct order and orientation of the chromosomes, for instance using MUMmer alignment to find synteny between assemblies and other Leishmania genomes. Further checks focused on the accuracy and precision of annotation, for instance using the Annotation Edit Distance (AED) score in MAKER2 (Fig. 5). We checked reproducibility of the assemblies and annotations using Snakemake.