The draft genome sequence of the Brahminy blindsnake Indotyphlops braminus

Blindsnakes of infraoder Scolecophidia (order Squamata) are the most basal group of extant snakes, comprising of more than 450 species with ecological and morphological features highly specialized to underground living. The Brahminy blindsnake, Indotyphlops braminus, is the only known obligate parthenogenetic species of snakes. Although the origin of I. braminus is thought to be South Asia, this snake has attracted worldwide attention as an alien species, as it has been introduced to all continents except Antarctica. In this study, we present the first draft genome assembly and annotation of I. braminus. We generated approximately 480 Gbp of sequencing data and produced a draft genome with a total length of 1.86 Gbp and N50 scaffold size of 1.25 Mbp containing 89.3% of orthologs conserved in Sauropsida. We also identified 0.98 Gbp (52.82%) of repetitive genome sequences and a total of 23,560 protein-coding genes. The first draft genome of I. braminus will facilitate further study of snake evolution as well as help to understand the emergence mechanism of parthenogenetic vertebrates.


Background & Summary
The Infraorder Scolecophidia (blindsnakes) is the most basal lineage of extant snakes 1 . All constituent species are subterranean and are found mainly in the southern hemisphere and on tropical islands. They can range from 10 cm to nearly 1 m in length 2 , and they have highly specialized morphologies, including a vestigial organ form of eyes that can only perceive light. Although 462 species in five families have been described in Scolecophidia 3 , the true species diversity is thought to be greatly underestimated due to their cryptic ecology 4,5 .
As of April 2022, there are 32 available genome assemblies for snakes. Among the three major groups that comprise Serpentes (Caenophidia, Henophidia, and Scolecophidia), genomic data have been accumulated in Caenophidia, mainly for poisonous snakes belonging to the families Elapidae and Viperidae 6 and in Henophidia, which includes the families Boidae and Pythonidae, for which the genome of Python molurus bivittatus has been reported 7 . However, there are currently no datasets for draft genome assemblies or annotations for snakes in the Scolecophidia group, despite the evolutionarily importance of this group, with the exception of low-quality assembly data (N50 < 2kbp) 8 .
The Brahminy blindsnake, or Indotyphlops braminus, is one of the most well-known species in Scolecophidia (Fig. 1). No male I. braminus have been found, and this species of snake is the only known obligate parthenogenesis snake 9,10 . Further, I. braminus is an allotriploid (triploid) species [11][12][13] and is considered to have emerged via inter-species hybridization, as has occurred with other parthenogenetic reptiles 14,15 . The geographic origin of this species is thought to be in South Asia based on the distribution of congeneric species 16,17 . However, due to their small size and fossorial and parthenogenetic nature, they have been transported around the world, hidden in the rotting woods and soils of ornamental plants. Consequently, I. braminus has now been colonized artificially and unintentionally in all continents except Antarctica 18,19 . Because I. braminus can be found globally, various studies regarding their osteology 20,21 , anatomy 22 , neurology 23 , and ethology 24,25 have been conducted worldwide. For these reasons, I. braminus has the potential to serve as a useful snake model organism and is a suitable species in which to investigate the emergence mechanism of parthenogenesis in vertebrates.
In this study, we present the first draft genome of I. braminus. We extracted genomic DNA from liver and muscle tissues, constructed three pair-end (PE) libraries, and sequenced libraries using the Illumina Hiseq2500 platform. In addition, we conducted long-read sequencing of four libraries using Oxford Nanopore MinION and performed hybrid de novo assembly. The draft genome was assembled into 4,851 scaffolds (N50 = 1.25 Mbp) with a total size of 1.86 Gbp, comparable to the estimated genome size (1.50 Gbp) in k-mer analysis. Our BUSCO assessment indicated that 89.3% of orthologs conserved in Sauropsida were present in the genome assembly. Structural annotation of the genome identified 23,560 protein-coding genes. In the future, this highly-quality scolecophidian genome will be a crucial reference for further understanding of both snake evolution and the emergence mechanism of parthenogenetic species.

Methods
Sample Collection and DNA extraction. We used two I. braminus specimens collected from India (Ooty: 11°24′26″ N, 76°41′27″ E) and Japan (Okinawajima Island: 26°15′09″N, 127°45′55″E), since I. braminus individuals are parthenogenetic clones, and the worldwide colonization of this blindsnake is thought to have occurred recently 26 . Indeed, the partial sequence of the mitochondrial cytochrome b gene of I. braminus from Japan (obtained by methods described previously in Smíd et al. 27 ) matched perfectly with the corresponding region of the India specimen constructed by short-read data using NOVOPlasty v3.2 28 . The specimens used were picked up from under stones, euthanized, and dissected to isolate the liver and muscle tissues for DNA extraction. These experiments were performed under permissions received from the Ethics Committees for Animal Experiments by Dr. Babasaheb Ambedkar Marathwada University (permit No. A01) and Nagahama Institute of Bio-Science (permit No. 085).
For genome sequencing using Illumina, the I. braminus specimen from India was used, and DNA was extracted using the Wizard ® Genomic DNA Purification Kit (Promega Corporation, WH, Madison, WI, USA).
For Oxford Nanopore long-read sequencing, the specimen from Japan was used, and DNA extraction was performed using the Blood & Cell Culture DNA Midi Kit (Qiagen, Hilden, Germany) according to the manufacturer's protocol. Purified precipitates were dissolved in TE buffer (pH 8.0) and stored at −30 °C until further processing.
Library preparation and sequencing. Short www.nature.com/scientificdata www.nature.com/scientificdata/ purification using 0.4 x AMPure XP beads (Beckman Coulter, Brea, CA, USA), library preparation was performed using the SQK-LSK109 Ligation Sequencing kit (Oxford Nanopore Technologies) based on the manufacturer's protocol. Four libraries were prepared and loaded onto R9.4.1 chemistry flowcell (FLO-MIN106) and sequenced using MinKNOW v 19.06.7. After sequencing, Guppy v3.2.2 was used for basecalling. A total of 57.9 Gbp of long-read data were obtained (Table 1), which were 31.1 x coverage of I. braminus genome. The raw reads were checked using LongQC v1.2.0c 29 , and quality filtered using Filtlong v0.2.1 (https://github.com/rrwick/ Filtlong) with a minimum QV of 10 and a minimum read length of 1 Kbp. Genome assembly. We estimated the overall characteristics of the I. braminus genome, including its genome size, heterozygosity, and repeat content, by k-mer frequencies calculated from Illumina short-reads. KMC v3.1.1 30 was used to obtain a 21-mer count histogram (Fig. 2). GenomeScope v2.0 31 estimated a genome size of 1.50 Gbp, which was comparable with that of our draft genome (1.86 Gbp). The genome size of I. braminus fell within the range of other snake species whose genomes have been reported previously (1.13-2.03 Gbp).
We applied a hybrid de novo assembly approach based on Illumina short-reads and Nanopore long-reads. Short-and long-reads were assembled to contigs using MaSuRCA v4.0.5 32 . For gap-closing, assembled contigs were scaffolded into the draft genome using HaploMerger2 v20180603 33 (Table 2). We evaluated the gene completeness of our draft genome using BUSCO v5.2.2 35,36 . BUSCO assessment showed that 89.3% of orthologs conserved in Sauropsida were present in this genome assembly (sum of the percentages of single-copy and duplicate), suggesting that our draft genome possessed a sufficient gene repertoire from I. braminus (Table 2).

Repeat analysis.
Repetitive regions of I. braminus were identified using a combination of de novo and homology-based approaches. For homology-based prediction, known repetitive elements were identified using RepeatMasker v4.1.1 (http://www.repeatmasker.org) to search against published RepBase sequences. For de novo prediction, RepeatModeler v2.0.1 was executed on the I. braminus assembly to build a de novo repeat library for this species. Then, RepeatMasker was used to annotate repetitive elements using this library. The estimated repeat regions of total length 0.98 Gbp accounted for 52.82% of the genome. Long interspersed nuclear elements were the most abundant elements and accounted for 20% of the genome. A summary of the annotation is shown in Table 3.   www.nature.com/scientificdata www.nature.com/scientificdata/ Gene prediction and annotation. A BLAST search with the known mitochondrial DNA sequence of I. braminus (Accession number: NC_010196) identified a contig showing 99.9% homology. This was a mitochondrial DNA excluded from the assembly data. We also masked repeat regions and conducted gene prediction using Augustus v3.4.0 37 trained with the assessment result of BUSCO with respect to the genome assembly. In total, 23,560 protein-coding genes were annotated in the I. braminus genome (Table 4). Next, we investigated the closest protein homolog of each entry in the gene model of I. braminus using diamond v2.0.13 38 , and visualized results by Krona 39 (Fig. 3). Approximately 91% of the closest protein homolog of each gene of the gene model belonged to Sauropsida. Of the proteins detected in Sauropsida, approximately 76% were derived from Serpentes, indicating that the gene model is quite consistent with the systematic position of I. braminus.   www.nature.com/scientificdata www.nature.com/scientificdata/ The BUSCO analysis with Sauropsida conserved genes databases found 72.9% completeness in our annotation dataset (Table 4), which was lower than that estimated in the genome assembly (89.3%: Table 2). Since the completeness of predicted genes was evaluated based on the codon reading frame, it is likely that there were low-quality genes exhibiting premature termination. In this analysis, we applied a hybrid assembly with short-reads (accuracy >99.9%) and long-reads (<85%), which may have resulted in a lower base accuracy for the assembled regions with only long-reads and in low BUSCO value. To improve the assembly of the I. braminus genome, it would be necessary to obtain novel transcriptome data or perform further high accuracy shortand long-read sequencing.

technical Validation
Quality assessment of the genome assembly. The total assembly length is 1.86 Gbp, which is almost comparable with the estimated genome size (1.50 Gbp). The scaffold N50 is 1.25 Mbp (Table 2). BUSCO analysis was performed with Sauropsida conserved genes databases to assess the completeness of the genome assembly, resulting in a BUSCO value of 89.3%.
Gene prediction and annotation validation. Gene models in the assembly were predicted using Augustus trained with the BUSCO assessment result. The final gene set consisted of 23,560 genes ( Table 4). The BUSCO value was 72.9%, which was lower than that in the genome assembly, probably due to the insufficient reliability of the regions assembled using only long-reads data.

Code availability
All analyses were conducted on Linux systems. The version and code and parameters of the main software tools are described below.