Draft genome sequence of the pulse crop blackgram [Vigna mungo (L.) Hepper] reveals potential R-genes

Blackgram [Vigna mungo (L.) Hepper] (2n = 2x = 22), an important Asiatic legume crop, is a major source of dietary protein for the predominantly vegetarian population. Here we construct a draft genome sequence of blackgram, for the first time, by employing hybrid genome assembly with Illumina reads and third generation Oxford Nanopore sequencing technology. The final de novo whole genome of blackgram is ~ 475 Mb (82% of the genome) and has maximum scaffold length of 6.3 Mb with scaffold N50 of 1.42 Mb. Genome analysis identified 42,115 genes with mean coding sequence length of 1131 bp. Around 80.6% of predicted genes were annotated. Nearly half of the assembled sequence is composed of repetitive elements with retrotransposons as major (47.3% of genome) transposable elements, whereas, DNA transposons made up only 2.29% of the genome. A total of 166,014 SSRs, including 65,180 compound SSRs, were identified and primer pairs for 34,816 SSRs were designed. Out of the 33,959 proteins, 1659 proteins showed presence of R-gene related domains. KIN class was found in majority of the proteins (905) followed by RLK (239) and RLP (188). The genome sequence of blackgram will facilitate identification of agronomically important genes and accelerate the genetic improvement of blackgram.

Blackgram [Vigna mungo (L.) Hepper] is an annual leguminous crop belonging to the family Fabaceae and subfamily Papilionaceae. This crop is a major constituent of the genus Vigna Savi (subgenus Ceratotropis) grouped under the key tribe Phaseoleae that is known to accommodate other economically significant grain legumes like soybean (Glycine max (L.) Merr.), common bean (Phaseolus vulgaris L.), pigeonpea (Cajanus cajan (L.) Millsp.), mungbean (Vigna radiata (L.) R. Wilczek), cowpea (Vigna unguiculata (L.) Walp) and adzuki bean (Vigna angularis (Willd.) Ohwi & Ohashi). Blackgram is a self pollinated diploid (2n = 2x = 22) with genome size estimated to be 0.59 pg/1C (574 Mbp) 1 . It is popularly known as 'urd bean' , 'urd' or 'mash' and is an excellent source of easily digestible good quality protein (25-26%), carbohydrates (60%), fat (1.5%), minerals, amino-acids and vitamins. In addition to being an important source of human food and animal feed, it also plays a significant role in sustaining soil fertility by improving soil physical properties and fixing atmospheric nitrogen. As a hardy legume tolerant to drought, blackgram is suitable for dry land farming and is predominantly grown as an intercrop or as a sole crop under residual moisture conditions post rice harvest. Blackgram is extensively grown in south and south-east Asia from ancient times. It originated in India and has been domesticated from its wild ancestral form V. mungo var. silvestris 2 . India is the largest producer of blackgram, where about 5.0 million hectares are cultivated with an annual production of 3.8 million tonnes 3 .
In spite of its economic importance and surging demand for improved blackgram varieties, susceptibility to multiple diseases, including mungbean yellow mosaic, powdery mildew, Cercospora leaf spot and leaf crinkle hinders cultivation and reduces produce yield and quality. In this regard, it is important to study plant disease resistance mechanisms and identify genes to develop varieties with durable resistance. Plant disease resistance genes (R-genes) play a key role in recognizing proteins expressed by specific avirulence (Avr) genes of pathogens 4 . The proteins encoded by the resistance genes share common domains such as coiled-coils (CC), nucleotide binding regions (NB), toll-interleukin regions (TIR), leucine rich regions (LRR) and kinases (K). Hundreds of NBS-LRR, RLK and RLP genes have been reported in plants [5][6][7][8] . Pyramiding of plant resistance genes in new cultivars is the most effective and environment friendly approach for plant disease control and reduction of OPEN 1 Nuclear Agriculture and Biotechnology Division, BARC , Trombay, Mumbai 400085, India. 2  www.nature.com/scientificreports/ yield losses. Such useful information is lacking in blackgram. This could be attributed to the lack of genomic resources coupled with limited understanding of the molecular basis of gene expression and phenotypic variation. Whole genome assemblies support genome wide association studies(GWAS) to identify trait-specific loci and for genomic-based selective breeding 9 . Whole-genome sequencing has been conducted on several commercial Vigna species such as mungbean, adzuki bean, cowpea, beach pea 5,10-13 . Elucidation of the genome sequence of V. mungo var. mungo could reveal the general genome structure, repetitive sequences and R-gene composition of this legume species in comparison to closely related genomes and greatly assist comparative genomics with other well-studied legume genomes. Next-generation sequencing (NGS) reads are too short to resolve abundant repeats particularly in plant genomes, leading to incomplete or ambiguous assemblies 14 . Construction of highly contiguous genomes has been possible in recent years owing to expeditious advances in sequencing technologies and substantial refinements in assembly algorithms. The advent of third generation sequencing technologies capable of delivering long reads over several kilobases for haplotype phasing have significantly enhanced the possibility of de novo assemblies [15][16][17] . In view of the importance of this pulse crop in the Asiatic region and the need for molecular detailing of trait based selection, we assembled a draft genome of Vigna mungo var. mungo using next-generation platform Illumina paired end and mate pair reads combined with third generation Oxford Nanopore sequencing.

Results
Illumina and nanoporesequencing of blackgram. We prepared three libraries for sequencing by Illumina HiSeqX Ten sequencer including 150 bp paired-end library and 5-7 kb and 7-10 kb mate-pair libraries. Whole genome sequencing using Illumina paired-end (PE) long insert generated 154,940,012 reads representing ∼ 98x genome coverage. Sequencing of 2 mate-pairs of 5-7, and 7-10 kb yielded, 33,617,232 and 10,247,813 reads, respectively, with an approximate coverage of 21.2x and 6.5x, respectively, and a grand total of 43 million mate-pair reads representing ∼28x coverage (Table S1). In addition, longread sequencing by Oxford Nanopore sequencing technology (ONT) was used to generate 1,633,898 long reads, having 10,425,220,236 bp and coverage of ∼22x. A total of 11.5 Gb data was generated from whole genome library with an average read length of 6.4 kb and a maximum read length of 128.7 kb using Nanopore sequencer (Table S2). The complete genome was sequenced to a depth of ∼148x, using both Illumina and ONT platforms.
De novo assembly of blackgram genome and gene annotation. The raw reads generated from Illumina paired end, mate-pair and nanopore sequencing were processed and good quality reads were retained. Hybrid assembly was performed using Illumina and Nanopore reads by MaSuRCA v3.3.4 hybrid assembler. Scaffolds were further processed for super-scaffolds using PyScaf producing1085 scaffolds with a N50 of 1.42 Mb (Table 1). Overall, the maximum scaffold assembled length was 6343.0 kb with median scaffold length of 67.9 kb. The total length of the produced scaffolds was 475 Mb (82% of genome) for Vigna mungo cultivar Pant U-31. Read utilization was also performed to ascertain the correctness of the assembly. Illumina reads were mapped against assembly of 280,233,560 total processed reads with 279,112,626 mapped reads (99.60%). Similarly, Nanopore reads were mapped against assembly of 1,633,786 total processed reads with 1,626,270 mapped reads (99.53%).
The gene prediction and annotation of the assembled genome was carried out using repeat masked assembly genome and reference transcriptome data of Vigna mungo using BRAKER tool. In total 42,115 genes were identified with average coding sequence length of 1131 bp. The maximum and minimum sequence lengths  (Table S3). Gene ontology provides a system to categorize description of gene products according to three ontologies: molecular function, biological process and cellular component. Of the 33,959 annotated genes, majority (45.2%) were assigned with cellular components, followed by molecular (41.4%) and biological functions (8.0%). Among the assignment made to the cellular component function, the majority represented integral component of membranes (26.5%) followed by nucleus (11.1%) and cytoplasm (2.6%). Among those with molecular function, a large proportion of the sequences represented ATP binding (11.8%) followed by metal ion binding (6.3%) and DNA binding (4.5%). Under the biological process category, more sequences were assigned to regulation of transcription (2.2%) followed by carbohydrate metabolic process (1.8%) and translation process (1.3%) (Fig. 1). To asses the completeness of Vigna mungo genome assembly and gene annotation, we performed the BUSCO analysis with summarized benchmarking: C: 96.8% (S: 94.6%, D: 2.2%, m: 2.5%, n: 5366) and ~ 97% of genes were observed to be complete which also validates the completeness of draft assembly genome. Pathway assignments were carried out according to the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database. A total number of 16,404 unique KEGG pathways were identified (Table S4), of which the majority of sequences were grouped into protein families (8715) followed by carbohydrate metabolism (1158)  Simple sequence repeats (SSR) prediction. SSRs were detected using Microsatellite Identification Tool (MISA v1.0). A total of 166,014 SSRs were identified from 989 scaffolds (Table 3). More than one SSR were present in 953 scaffolds and 65,180 SSRs were of the compound type. SSR loci with di-and tri-nucleotides constituted 103,955 (62.6%) of the identified loci. The proportions of di-, tri-, tetra-, penta-, and hexa-nucleotide repeats were 38.1%, 24.5%, 36.4%, 0.69%, and 0.24%, respectively (Table S6). The number of repeats varied from 6-61 for di-nucleotides, 5-361 for tri-nucleotides, 3-7 for tetra-nucleotides, 5-19 for penta-nucleotides and 5-14 for hexa-nucleotides. The most prevalent di-, tri-, tetra-, penta-, and hexa-nucleotide repeats were AT (22.6%), AAT (3.9%),TTTA (5.1%), AAAAT (4.6%) and ATG TTG (1.9%), respectively (Table S7). Of the   (Table 4) (Table 4). Seventy-one R-genes were identified based on their homologies with mungbean, cowpea and adzuki bean sequences (Table S9).

Discussion
A better understanding of blackgram genetics is crucial for more efficient breeding in light of an anticipated increase in biotic and abiotic stresses that may accompany climate change. Whole-genome sequences are an important resource for evolutionary geneticists studying plant domestication, as well as breeders aiming to improve crop varieties. We sequenced V. mungo using Illumina PE and Nanopore with a coverage of 148x and assembled genome using MaSuRCA hybrid assembler. The final assembly comprised of 1085 scaffolds (N50 = 1.43 Mb). Hybrid assembly through combinational sequencing is a useful approach in obtaining accurate sequence data. Moreover, the production of long-reads while using third generation sequencing (Nanopore) overcomes the weakness of assembling short-reads by minimizing the generation of gaps or covering the repetitive sequences that appear in the plant genomes. In addition, while only considering the accuracy, short-reads can be used for error-correction by aligning them to long-reads, which enable the increased accuracy of the genome assembly 18 . We constructed 475 Mb (82%) of the total estimated V. mungo var. mungo genome and identified 42,115 protein-coding genes and 1970 Vigna mungo specific gene clusters. The assembly generated will also advance comparative genomics in Vigna species, as whole genome sequences of prominent Vigna species including mung bean, adzuki bean and cowpea are already available 5,11,12 . Of the 42,115 predicted genes, 33,959 could be functionally annotated. In V. radiata genome, 22,427 genes were annotated with high confidence 5 .
Most of the gene annotations were comparable to the annotation of immature seed transcriptome sequence of blackgram 19 . Orthologous gene comparison studies using genes from Vigna mungo (Pant U-31), Vigna radiata, Vigna unguiculata and Vigna angularis revealed that a total of 19,095 gene clusters were shared by all four species. High degree of conservation and collinearity between blackgram and adzuki bean was revealed through comparative mapping 20 . Gene order conservation between closely related legume species (V. angularis var. angularis, V. radiata var. radiata, and P. vulgaris) has been exploited in synteny based scaffolding approach in genome assembly 11 . Similarly, Cowpea chromosomes Vu02, Vu03 and Vu08 also have one-to one relationship with the other two Vigna species (mungbean and common bean) suggesting that these chromosome rearrangements are characteristic of the divergence of Vigna from Phaseolus 12 . TEs are potential reservoirs of phenotypic variation and phenotypic plasticity 28 . Moreover, TEs can directly assist the crop improvement programs through molecular marker approach. The presence of TEs, often close to or within the stress responsive quantitative trait loci (QTLs), especially plant defense genes, along with the traditional attributes of a molecular marker, make them the markers of choice for diversity studies and trait mapping 29,30 . While more studies would be necessary to understand the functional effects of these insertions, long-read sequences have greatly improved the assembly and identification of repeat types.
Simple sequence repeats. The development of genomic resources is critical for crop improvement programmes. NGS has allowed the discovery of a large number of DNA polymorphisms, such as SNP and InDels markers, in a relatively short time at low cost 31 . Among 166,014 SSRs (excluding mono nucleotide repeats) identified, the proportions of dinucleotide repeats were higher (38.1%) compared to other repeats in V. mungo. www.nature.com/scientificreports/ Similarly, dinucleotide repeats were found to be higher (71.3%) compared to other repeats in V. radiata 5 . Proportion of tri-, tetra-, penta-, and hexa-nucleotide SSRs were more or less same in comparison to V. radiata (24.6%, 2.5%, 1.2%, 0.2%) and lower than V. marina (49%, 3%, 7%, 5%) except for tetra-nucleotide repeats. Tetra-nucleotide repeats in V. mungo were found to be higher (36.4%) in comparison to V. radiata (2.5%) and V. marina (3.0%). Likewise, the number of compound SSRs was higher (39.2%) than that in V. radiata (35.9%) and V. marina (10.08%) 5,13 . To date, few efforts have been made to develop sufficient genomic resources in Vigna. This pioneer genome sequencing effort in V. mungo has generated SSRs and functional annotations for a huge set of genes. This information holds great promise for use in trait mapping, genomic selections, and diversity assessment.
Disease resistance genes. Whole genome sequencing has enabled genome-level investigation of the R-gene family in crop plants such as mungbean, chickpea, rice, tomato [5][6][7][8] . In blackgram, 3.9% of the total genes were found to contain R-genes which is higher (1.2%) than that reported for Medicago 32 and lower (5.27%) than that reported for Arabidopsis 33 . Plants possess a sophisticated immune system based on their ability to recognize phytopathogens. The activation of this system is based on the presence of specific receptors encoded by R-genes. Resistance genes are grouped as either nucleotide binding site leucine rich repeat (NBS-LRR) or transmembrane leucine rich repeat (TM-LRR) 34 . NBS-LRR proteins encoded by resistance (R) genes play an important role in pathogen recognition process and the activation of signal transduction in the response to pathogen attack. NBS-LRR can be further classified as toll/interleukin receptor (TIR)-NBS-LRR (TNL) or non-TNL/coiled coil-NBS-LRR (CNL) 34 . Both TNL and CNL specifically target pathogenic effector proteins inside the host cell, and thus mediate effector triggered immunity (ETI) response 35 . In Vigna mungo 8.6% of total identified R-gene related sequence showed NBS domain. In Vigna mungo transmembrane leucine rich repeat (TM-LRR) class such as receptor like kinase (RLK) and receptor like protein (RLP) accounted for 25.7% of the R-genes identified. RLPs and RLKs are pattern recognition receptors (PRRs) that mediate pathogen/microbe associated molecular pattern (PAMP/MAMP) triggered immunity (PTI/MTI) to allow recognition of a broad range of pathogens 35 . Development of diagnostic molecular markers associated with key disease resistance gene would aid in molecular resistance breeding. In this study, the black gram genome was assembled using hybrid approach with the size of 475 Mb. This has potential for developing gold standard reference assembly in future. A total of 42,115 genes were predicted from the assembled genome. Further, the predicted genes were annotated with gene ontology and pathway information. The presence of transposons and SSRs in the assembled genome was also predicted. Blackgram is grown mostly in developing countries and lack of genome sequence has delayed the implementation of molecular breeding in this Vigna species. The whole-genome sequence and SSR discovery will thus boost genomics-assisted selection for blackgram genetic improvement. Extracted genomic DNA was quantified and assessed for quality using Nanodrop2000 (Thermo Scientific, USA), Qubit (Thermo Scientific, USA) and agarose gel electrophoresis.

Methods
Illumina library preparation and sequencing. Whole genome sequencing (WGS) libraries were prepared using Illumina-compatible NEXTFlex Rapid DNA sequencing Bundle (BIOO Scientific, Inc. U.S.A.) at Genotypic Technology Pvt. Ltd., Bangalore, India. Briefly, 300 ng of Qubit quantified DNA was sheared using Covaris S220 sonicator (Covaris, Inc. USA) to generate specific fragments in the size range of 300-400 bp. The fragment size distribution was verified on Agilent 2200 TapeStation and subsequently purified using High prep magnetic beads (Magbio Genomics). Purified fragments were end-repaired, adenylated and ligated to Illumina multiplex barcode adaptors as per NEXTflex Rapid DNA sequencing bundle kit protocol 36 . Matepair illumina library preparation. Mate pair sequencing library was prepared using Illumina-compatible Nextera Mate Pair Sample Preparation Kit (Illumina Inc., Austin, TX, U.S.A.). About 4 μg of genomic DNA was simultaneously fragmented and tagged with mate pair adapters in a transposon based tagmentation step. Tagmented DNA was then purified using AMPure XP magnetic beads (Beckman Coulter, USA) followed by strand displacement to fill gaps in the tagmented DNA. Strand displaced DNA was further purified with AMPure XP beads before size-selecting the fragments on low melting agarose gel. Size selected fragments were circularized in an overnight blunt-end intra-molecular ligation step that resulted in circular DNA with the insert flanked mate pair adapter junction. Circularized DNA was sheared using Covaris S220 sonicator (Covaris, Woburn, Massachusetts, USA) to generate fragment size distribution from 300 to 1000 bp. Sheared DNA was purified to collect the Mate pair junction positive fragments using Dynabeads M-280 streptavidin magnetic beads (Thermo Fisher Scientific, Waltham, MA, USA). Purified fragments were end-repaired, adenylated and ligated to Illumina multiplex barcode adaptors as per Nextera Mate Pair Sample Preparation Kit protocol. Sequencing library, thus constructed, was quantified using Qubit fluorometer (Thermo Fisher Scientific, MA, USA) and its fragment size distribution was analyzed on Agilent 2200 TapeStation. The libraries were sequenced on Illumina HiSeq X Ten sequencer (Illumina, San Diego,USA) using 150 bp paired-end chemistry following manufacturer's instructions. www.nature.com/scientificreports/ Nanopore library preparation and sequencing. A total of 1.5 μg of gDNA was end-repaired (NEBnext ultra II end repair kit, New England Biolabs, MA, USA) and purified using 1 × AmPure beads (Beckmann Coulter, USA). Adapter ligation (AMX) was performed at RT (20 °C) for 20 min using NEB Quick T4 DNA Ligase (New England Biolabs, MA, USA). The reaction mixture was purified using 0.6 × AmPure beads (Beckmann Coulter, USA) and sequencing library was eluted in 15 μl of elution buffer provided in the ligation sequencing kit (SQK-LSK109) from Oxford Nanopore Technology (ONT Primary data analysis. The data obtained from the Illumina sequencing run was demultiplexed using Bcl2fastq softwarev2.20 (https:// sapac. suppo rt. illum ina. com/ seque ncing/ seque ncing_ softw are/ bcl2f astqc onver son-softw are. html) and FastQ files were generated based on the unique dual barcode sequences. The sequencing quality was assessed using FastQC v0.11.8 software 38 . The adapter sequences were trimmed using Trimgalore v0.4.0 39 and bases above Q30 were considered and low quality bases were filtered off during read pre-processing and used for downstream analysis. Similarly, the Nanopore reads were processed with default settings using Porechop tool (https:// github. com/ rrwick/ Porec hop). The pre-processing of Nanopore data retained 99.9% of data.
De novo genome assembly and gene annotation. Hybrid assembly was performed using Illumina and Nanopore processed reads by MaSuRCA v3.3.4 hybrid Assembler 40 with standard parameters. The assembled contigs were utilized to generate larger scaffolds using pyScaf(v1) software (https:// github. com/ lprys zcz/ pyScaf). The generated assembled genome of ~ 475 MB size was used for further analysis. The correctness of the assembly was ascertained by mapping short and long reads to the assembly. For gene prediction and annotation of the assembled genome, we used combination of ab initio prediction and transcriptome data of Vigna mungo using BRAKER 41 version 3.0.2. It helped in the identification of protein-coding genes and their exonic -intronic structure in the genome in order to improve the accuracy and completeness of the annotation. BRAKER predicted proteins were annotated against all Fabaceae protein sequences from Uniprot database 42 using DIA-MOND blastp 43 program with an e-value of 1e-5 for gene ontology and annotation. To asses the completeness of our Vigna mungo genome assembly and annotation, we employed the BUSCO software 44 to check the gene content using a plant specific database. Pathway analysis was performed using KAAS server 45 . KAAS (KEGG Automatic Annotation Server) provides functional annotations of genes in a genome by amino acid sequence comparisons against a manually curated set of ortholog groups in KEGG genes. Comparative analysis of the organization of orthologous gene clusters were carried out using genes of Vigna mungo, Vigna radiata, Vigna unguiculata and Vigna angularis through OrthoVenn 46 with E-value of 0.01and inflation value of 1.5.

Identification of transposable elements (TEs) and simple sequence repeats (SSRs).
Transposable elements analysis was performed against TREP (TRansposable Elements Platform) 47 which is a curated database of TEs (http:// botse rv2. uzh. ch/ kelld ata/ trep-db/ index. html). Each consensus representing a structural variant of a TE family was classified according to its structural and functional features. TEs classifications were based on its ability to replicate in a host genome using various transposition mechanisms and are divided into two classes based on their replication mechanism. Retrotransposons (class I) use an RNA intermediate for transposition, while DNA transposons (class II) use a DNA intermediate for transposition 27 . The genome sequence was checked for homology with TREP database using BLASTn 48 and the genomic positions having homology with known TEs were identified. SSRs were identified from the genome sequence using MicroSAtellite identification tool (MISA) 49 (http:// pgrc. ipk-gater sleben. de/ misa/). This predicted polymorphic loci of 1-6 bp length in nucleotide sequences. Repeats were identified in each scaffold sequences using MISA Perl script. In this study, the SSRs were considered to contain motifs with two to six nucleotides in size and a minimum of 6, 6, 3, 5, 5 contiguous repeat units for di-, tri-, tetra-, penta-and hexa-nucleotides, respectively. Mononucleotide repeats were not included in the SSR search criteria. Based on MISA results, primers were designed for SSR motifs using either WebSat (http:// purl. oclc. org/ NET/ websat/) online software 50 or batch primer3 ver1.0 51 . For designing PCR primers, parameter for optimum primer length was 22 mer (range: 18-27 mer), optimum annealing temperature was 60 °C (range: 57-68 °C), GC content was 40-80%, and other parameter values as default.
Identification of disease resistance genes. Disease Resistance Analysis and Gene Orthology (DRAGO v.2) pipeline was used to predict and annotate the disease resistance genes from the Plant Resistance Genes database (PRGdb 3.0; http:// prgdb. org) with curated reference R-genes 52,53 . DRAGO was executed with peptide sequence file from V. mungo var. mungo as an input to define the normalization value and the minimum score thresholds. Specifically, the previously created 60 HMM (hidden Markov model) modules were used by DRAGO 2 to detect LRR, Kinase, NBS and TIR domains and compute the alignment score of the different hits based on a BLOSUM62 matrix. The normalization value was the absolute smallest similarity score found among the input sequences considering all domains. The minimum score thresholds were calculated from the smallest similarity score reported in a specific domain among the input sequences. DRAGO 2 generated files with numeric matrix that represented the similarity score of every single protein input to each HMM profile, the domain name, start position, end position, resistance class and identification for every putative plant resistance protein. www.nature.com/scientificreports/