Chromosome-scale genome sequencing, assembly and annotation of six genomes from subfamily Leishmaniinae

We provide the raw and processed data produced during the genome sequencing of isolates from six species of parasites from the subfamily Leishmaniinae: Leishmania martiniquensis (Thailand), Leishmania orientalis (Thailand), Leishmania enriettii (Brazil), Leishmania sp. Ghana, Leishmania sp. Namibia and Porcisia hertigi (Panama). De novo assembly was performed using Nanopore long reads to construct chromosome backbone scaffolds. We then corrected erroneous base calls by mapping short Illumina paired-end reads onto the initial assembly. Data have been deposited at NCBI as follows: raw sequencing output in the Sequence Read Archive, finished genomes in GenBank, and ancillary data in BioSample and BioProject. Derived data such as quality scoring, SAM files, genome annotations and repeat sequence lists have been deposited in Lancaster University’s electronic data archive with DOIs provided for each item. Our coding workflow has been deposited in GitHub and Zenodo repositories. These data constitute a resource for the comparative genomics of parasites and for further applications in general and clinical parasitology.


Background & Summary
Leishmaniasis is a neglected tropical disease. It is considered to be a disease of poverty, primarily affecting low- and middle-income countries (LMICs). Leishmaniasis is caused by parasites of the genus Leishmania, and 18 different species are known to infect humans 1 . A total of 98 sandfly species are suspected or confirmed vectors of Leishmania 2 . There are three major types of leishmaniasis: visceral, also known as kala-azar, which is fatal in over 95% of cases if left untreated; cutaneous, the most common form, which causes skin lesions that leave life-long scars and serious disability or stigma; and mucocutaneous, which leads to partial or total destruction of the mucous membranes of the nose, mouth and throat 3 . Over one billion people live in endemic areas and are at risk of leishmaniasis. It is estimated that 700,000 to 1.2 million or more new cases of cutaneous leishmaniasis occur globally each year in over 100 countries 4 . Additionally, up to 300,000 visceral leishmaniasis cases cause more than 200,000 deaths annually 5 .
The genus Leishmania is divided into four subgenera: L. Leishmania, L. Viannia, L. Sauroleishmania and the newest subgenus, L. Mundinia, which now accommodates several species from the L. enriettii complex and others, from five continents 6-12 . In 1994, the Leishmania Genome Network was initiated 13 and, ten years later, announced the assembly of the Leishmania major Friedlin strain as the first Leishmania reference genome 14 . Since then, a total of 58 genomes have become publicly available, assembled at levels of completeness ranging from contigs to chromosomes. Prior to our project, only two L. Mundinia subgenus genomes had been sequenced and assembled: Leishmania enriettii, strain LEM3045 (GCA_000410755) and Leishmania sp. MAR, strain LEM2494 (GCA_000410755). The genus Porcisia is a sister genus of Leishmania within the [...] used for the initial scaffolding assemblies, followed by mapping of the Illumina short reads onto these scaffolds, thus increasing the quality of the assembled sequence while preserving whole-chromosome integrity. Final polishing, reordering and reorienting of chromosomes, along with masking and classifying of repeat regions, were guided by the most closely related reference genome for each species. Finished genome annotation was both evidence-based and ab initio. Figure 1 summarises data sizes and total yield per sample. The total sequencing data file size for all samples was 139.33 Gigabytes, yielding 58.70 GigaBases of sequence data from 23.71 GigaReads. Figure 2 summarises our analysis workflow. This workflow generated four main outputs for each assembly: genome, proteome and transcriptome files in FASTA format, and a General Feature Format (GFF) file that contains the coordinates of all proteins and transcripts in the assembly.
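As an illustration of how the GFF output described above might be consumed, the following minimal Python sketch reads feature coordinates from GFF lines. It assumes the standard nine tab-separated GFF columns; the names `Feature` and `read_gff_features`, the feature types selected and the example records are illustrative assumptions and are not taken from the deposited files.

```python
from typing import Iterable, Iterator, NamedTuple, Sequence


class Feature(NamedTuple):
    seqid: str
    feature_type: str
    start: int  # 1-based, inclusive (GFF convention)
    end: int    # 1-based, inclusive
    strand: str


def read_gff_features(
    lines: Iterable[str],
    wanted_types: Sequence[str] = ("gene", "mRNA", "CDS"),
) -> Iterator[Feature]:
    """Yield coordinates for the selected feature types from GFF lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives and comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, _attrs = cols
        if ftype in wanted_types:
            yield Feature(seqid, ftype, int(start), int(end), strand)


# Made-up records, for illustration only.
example = [
    "##gff-version 3",
    "chr01\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr01\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr01\tmaker\texon\t100\t900\t.\t+\t.\tID=exon1;Parent=mrna1",
]
features = list(read_gff_features(example))
```

Passing an open file handle instead of the example list would work in the same way, since the function only iterates over lines.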

Methods
Sample collection, sequencing and software.
From the parasite cryobank at Lancaster University, we selected six samples of the species listed above without publicly available reference genomes. Table 1 gives details of strains, isolates, BioSample and BioProject accessions 17-28 . Illumina HiSeq 4000 and MiSeq sequencing was contracted to BGI Genomics and Aberystwyth University. Nanopore sequencing was performed in-house using MinION FLO-MIN106 flow cells with the SQK-LSK109 ligation sequencing protocol. Throughout the text we provide literature citations to software where available. Links to both published and unpublished software used are provided in Table 2. We created public GitHub and Zenodo repositories for the analysis pipeline 29,30 .

Genome assembly.
De novo assemblies were performed with Nanopore MinION long reads using Flye 31 . Because of the low quality scores of Nanopore long reads, we mapped high-quality Illumina short reads onto the assemblies and created corrected consensus sequences using minimap2 32 and SAMtools 33 . The consensus sequence was scanned for contamination or sequence of vector origin using BLAST+ 34 against the UniVec database 35 . Finally, a polishing step was performed with Pilon 36 to minimise gaps.

Chromosome verification.
For all chromosomes of each polished genome, we then ran BLAST+ (parameters: -max_target_seqs 1 -max_hsps 1) against all TriTrypDB 37 release-47 genomes. The output for each genome was then visualised using wordcloud to suggest the closest relative among TriTrypDB genomes 38 . Then, synteny was plotted for each genome by aligning each of its chromosomes with the corresponding chromosomes of its wordcloud-predicted closest relative, using MUMmer 39 (Fig. 3). This confirmed that the order and orientation of the chromosomes of each genome were equivalent to those of its closest TriTrypDB genome. Completion was then achieved by sorting and removing any duplicate scaffolds or contigs using funannotate 40 , followed by a final quality check using Genome Assembly Annotation Service (GAAS).
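The idea behind the wordcloud step, namely counting how often each reference genome appears as the single best BLAST+ hit, can be sketched in Python as follows. This is not the pipeline's own code. It assumes BLAST+ tabular output (the default 12-column `-outfmt 6` layout) and a hypothetical convention in which each subject identifier begins with a genome label followed by `|`; the function name `tally_top_hits`, the labels `genomeA` and `genomeB` and the example rows are all made up for illustration.

```python
from collections import Counter
from typing import Callable, Iterable


def tally_top_hits(
    blast_lines: Iterable[str],
    genome_of: Callable[[str], str] = lambda sseqid: sseqid.split("|")[0],
) -> Counter:
    """Count best hits per reference genome from BLAST+ tabular output.

    With -max_target_seqs 1 -max_hsps 1, each query contributes at most
    one row, so each row is treated as one vote for a genome.
    `genome_of` maps a subject sequence ID to a genome label; the default
    relies on an assumed "label|sequence" naming scheme.
    """
    counts: Counter = Counter()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            continue
        counts[genome_of(cols[1])] += 1  # column 2 is the subject ID
    return counts


# Made-up rows: qseqid, sseqid, pident, length, mismatch, gapopen,
# qstart, qend, sstart, send, evalue, bitscore.
example_hits = [
    "chr01\tgenomeA|chr1\t95.2\t1000\t48\t0\t1\t1000\t1\t1000\t0.0\t1500",
    "chr02\tgenomeA|chr2\t94.8\t1200\t62\t0\t1\t1200\t1\t1200\t0.0\t1700",
    "chr03\tgenomeB|chr3\t90.1\t900\t89\t0\t1\t900\t1\t900\t0.0\t1100",
]
counts = tally_top_hits(example_hits)
closest, votes = counts.most_common(1)[0]
```

The most frequent label is only a suggestion of the closest relative; in the workflow described here, that suggestion was then checked by synteny alignment.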

Repetitive element annotation.
We identified and classified repeat regions in the polished assemblies using RepeatModeler and TEclass 41 . We then generated a stratified genome-wide repeat plot for each assembly 38 (see the L. martiniquensis example in Fig. 4) to help decide which repeats to mask with RepeatMasker.
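One plausible input to a stratified repeat summary is the number of bases annotated per repeat class. The Python sketch below tallies these from a RepeatMasker-style `.out` table. It assumes the usual whitespace-delimited layout in which the query begin, query end and class/family are the 6th, 7th and 11th fields; the function name `bases_per_repeat_class` and the example rows are illustrative assumptions, and overlapping annotations are not merged.

```python
from collections import defaultdict
from typing import Dict, Iterable


def bases_per_repeat_class(out_lines: Iterable[str]) -> Dict[str, int]:
    """Sum annotated query bases per repeat class/family.

    Header and blank lines are skipped by requiring the first field
    (the Smith-Waterman score) to be an integer.
    """
    totals: Dict[str, int] = defaultdict(int)
    for line in out_lines:
        fields = line.split()
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        q_begin, q_end = int(fields[5]), int(fields[6])
        repeat_class = fields[10]
        totals[repeat_class] += q_end - q_begin + 1  # inclusive coordinates
    return dict(totals)


# Made-up rows in the assumed layout, for illustration only.
example_out = [
    "   SW   perc perc perc  query     position in query    matching  repeat       position in repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin  end   (left)  ID",
    "",
    "  500 10.0  0.0  0.0  chr01  101 200 (9800) +  rnd-1_family-1  LINE/L1  1 100 (0) 1",
    "  300 12.0  1.0  0.0  chr01  301 350 (9650) C  (TA)n  Simple_repeat  (0) 50 1 2",
    "  450  8.0  0.0  0.0  chr02  1 150 (5000) +  rnd-1_family-1  LINE/L1  1 150 (0) 3",
]
totals = bases_per_repeat_class(example_out)
```

Grouping the same totals by chromosome as well as by class would give a per-chromosome stratification, at the cost of a nested dictionary.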

Contamination screening.
We scanned all assemblies for contamination or sequence of vector origin by first building a UniVec database and then searching it with BLAST+. All contaminants were found at either the beginning or the end of contigs and were deleted. No contaminants affected assembly integrity.
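Removing contaminants that sit at contig ends can be sketched in Python as follows. This is a minimal illustration rather than the procedure actually used: it assumes each hit is given as a 1-based, inclusive (start, end) interval on the contig, and it raises an error for internal hits instead of guessing how to handle them. The function name `trim_terminal_hits` and the `slack` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def trim_terminal_hits(
    seq: str, hits: Sequence[Tuple[int, int]], slack: int = 0
) -> str:
    """Remove contaminant hits that touch the start or end of a contig.

    `slack` is the number of bases a hit may sit away from a contig end
    (or from an already trimmed region) and still count as terminal.
    """
    left, right = 0, len(seq)  # kept region is seq[left:right]
    remaining: List[Tuple[int, int]] = []
    # Extend the trimmed prefix while hits touch it.
    for start, end in sorted(hits):
        if start - 1 <= left + slack:
            left = max(left, end)
        else:
            remaining.append((start, end))
    internal: List[Tuple[int, int]] = []
    # Extend the trimmed suffix, working inwards from the contig end.
    for start, end in sorted(remaining, key=lambda h: h[1], reverse=True):
        if end >= right - slack:
            right = min(right, start - 1)
        else:
            internal.append((start, end))
    if internal:
        raise ValueError(f"internal contaminant hits need manual review: {internal}")
    return seq[left:right]


# Made-up contig: 4 contaminant bases at the start and 2 at the end.
contig = "ttttACGTACGTgg"
cleaned = trim_terminal_hits(contig, [(1, 4), (13, 14)])
```

Raising on internal hits is a deliberately conservative choice, because cutting the middle of a contig would change assembly structure.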
Quality of short and long raw sequence reads.
We used FastQC to check the sequence quality of the Illumina short reads and pycoQC to check that of the Nanopore long reads. We used MultiQC 123 to output all sequence quality scores in one interactive report 105-110 .

Assembly validation.
Because the analysis involved many steps, quality checks were introduced between steps. Some checks focussed on completeness, for instance using BUSCO 124 as a benchmark for the presence of expected universal single-copy orthologues. Other checks focussed on the correct order and orientation of the chromosomes, for instance using MUMmer alignment to find synteny between our assemblies and other Leishmania genomes. Yet further checks focussed on the accuracy and precision of annotation, for instance using the Annotation Edit Distance (AED) score in MAKER2 (Fig. 5). We checked the reproducibility of the assemblies and annotations using Snakemake.

Table 3. Details of reads, bases and file sizes.
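For readers unfamiliar with the Annotation Edit Distance used in the validation checks, the Python sketch below computes it at the nucleotide level under the commonly cited definition AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases supported by evidence. MAKER2's own calculation may differ in detail; the function names and the interval representation (1-based, inclusive) are illustrative assumptions.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def positions(intervals: Iterable[Interval]) -> Set[int]:
    """Expand intervals into the set of covered nucleotide positions."""
    covered: Set[int] = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def annotation_edit_distance(
    annotation: Iterable[Interval], evidence: Iterable[Interval]
) -> float:
    """Return an AED in [0, 1]; 0 means perfect agreement with evidence."""
    a, e = positions(annotation), positions(evidence)
    if not a or not e:
        return 1.0  # nothing to compare: treat as no support
    overlap = len(a & e)
    sensitivity = overlap / len(e)  # evidence bases that are annotated
    specificity = overlap / len(a)  # annotated bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Made-up coordinates: the annotation and evidence overlap by half.
aed_half = annotation_edit_distance([(1, 100)], [(51, 150)])
aed_same = annotation_edit_distance([(1, 100)], [(1, 100)])
```

Expanding intervals into position sets keeps the sketch short but is memory-hungry for long features; an interval-arithmetic version would scale better.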