A draft genome for Spatholobus suberectus

Spatholobus suberectus Dunn (S. suberectus), which belongs to the Leguminosae, is an important medicinal plant in China. Owing to its long growth cycle and increased use in human medicine, wild resources of S. suberectus have decreased rapidly and may be on the verge of extinction. De novo assembly of the whole S. suberectus genome provides us a critical potential resource towards biosynthesis of the main bioactive components and seed development regulation mechanism of this plant. Utilizing several sequencing technologies such as Illumina HiSeq X Ten, single-molecule real-time sequencing, 10x Genomics, as well as new assembly techniques such as FALCON and chromatin interaction mapping (Hi-C), we assembled a chromosome-scale genome about 798 Mb in size. In total, 748 Mb (93.73%) of the contig sequences were anchored onto nine chromosomes with the longest scaffold being 103.57 Mb. Further annotation analyses predicted 31,634 protein-coding genes, of which 93.9% have been functionally annotated. All data generated in this study is available in public databases.

sizes of 6.9 Mb and 2.1 Mb, respectively. The S. suberectus assembly was further refined using 233. 19 Gb Hi-C data: 748 Mb (93.73%) of the contig sequences were anchored onto nine chromosomes, the scaffold N50 was improved to be 86.99 Mb, and the longest scaffold was 103.57 Mb.
Almost half of the S. suberectus genome (47.82%) was occupied by repetitive elements, the largest amount of which was long terminal repeat retrotransposons (17.32%). Combined with homology-based predictions, de novo predictions and transcriptome-based predictions, 31,634 protein-coding genes with an average transcript size of 1,097.55 bp were predicted in the genome. In total, 93.9% (29,688) of protein-coding genes were successfully functionally annotated. Library construction and sequencing. The DNA was sheared by a Covaris ® M220 focused-ultrasonicator TM (Covaris, Woburn, Massachusetts, USA). The sheared DNA, with fragment sizes of 250 bp and 450 bp, was processed using the TrueSeq DNA PCR-Free LT Library Kit protocol. PCR products were purified (AMPure XP system) and library quality was assessed on an Agilent Bioanalyzer 2100 system. These PCR-Free libraries were sequenced with a HiSeq X Ten instrument as 150 bp paired-end reads. In total, 77.73 Gb of raw sequence data were generated (Table 1).  www.nature.com/scientificdata www.nature.com/scientificdata/ Sheared DNA (40 μg) was purified and concentrated with AMPure PB beads (PacBio) and further used for SMRTbell preparation according to the manufacturer's protocol (Pacific Biosciences; 20-kb template preparation using BluePippin (Sage Science) size selection system with a 15-kb cut-off). The libraries were then sequenced with a PacBio sequel instrument (Pacific Biosciences, Menlo 31 Park, CA, USA). A total of 11 SMRT Cells were used to yield 79.79-fold genome coverage of sequence data ( Table 1), consisting of 63.27 Gb sequence data with an N50 read length of 14,288 bp ( Table 2).
The linked read sequencing libraries were constructed on a 10X Genomics GemCode platform 15 . Sample indexing and partition barcoded libraries were prepared using the Chromium Genome Reagent Kit (10x Genomics) according to the manufacturer's instructions. The barcode sequencing library was first quantified by Qubit2.0, insert size was checked using an Agilent2100, and finally quantified by qPCR. The 123.09 Gb library was sequenced with 150 bp paired-end reads on an Illumina HiSeq X Ten platform (Table 1).
For the Hi-C library, chromatin was fixed in place with formaldehyde in the nucleus. Fixed chromatin was digested with DpnII restriction endonuclease, 5′ overhangs were filled in with biotinylated nucleotides, and free blunt ends were ligated. After ligation, cross-links were reversed, and the DNA was purified from protein. Purified DNA was treated to remove biotin that was not internal to the ligated fragments. The DNA was then sheared to a mean fragment size of 350 bp, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adaptors. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeq platform to produce 233.19 Gb Hi-C sequence data ( Table 1). The quality of Hi-C sequencing was evaluated using HiCUP 16 . The effect rate (%) = Unique di-tigs/Total Reads Processed = 4,356,614/10,000,000 = 43.57% (Table 3). Typically, 35.91% of the alignable read pairs represent interchromosomal interactions. Eleven percent represents intrachromosomal interactions between fragments less than 10 kb apart and 53.09% are intrachromosomal read pairs that are more than 10 kb apart ( Table 3).
Estimation of the S. suberectus genome size. Quality-filtered reads from the Illumina platform were subjected to 17-mer frequency distribution analysis with SOAPdenovo 17 . K-mer 17 was selected to estimate the genome size and heterozygosity of S. suberectus (Fig. 2). We plotted the distribution of k-mer depth against frequency with a main peak occurring at the depth of 40 (Fig. 2). Based on the total number of k-mers (32,476,446,092), the S. suberectus genome size was calculated to be approximately 793.39 Mb, using the following formula: genome size = k-mer_Number/Peak_Depth and Revised Gsize = Genome size × (1-Error Rate). The heterozygosity of the S. suberectus genome is 0.74%. Genome assembly. De novo assembly of the 63.27 Gb PacBio single-molecule long reads from SMRT Sequencing was performed using FALCON (https://github.com/PacificBiosciences/FALCON/) 18 . In order to get enough corrected reads, the longest 60 subreads were first selected as seed reads to do error correction. Then error-corrected reads were aligned to each other and assembled into genomic contigs using FALCON with parameters length_cutoff_pr = 5000, max_diff = 120, max_cov = 130. The draft assembly was polished using   www.nature.com/scientificdata www.nature.com/scientificdata/ the quiver algorithm. Pilon was used to perform error correction of p-contigs with 98.02X coverage of short paired-end reads generated from Illumina HiSeq Platforms 19 . The assembly consisted of 1,954 contigs, with a contig N50 length of 2.06 Mb (total length = 794 Mb) ( Table 4).
We used BWA-MEM 20 to align the 10X Genomics data to the assembly using default settings. Scaffolding was performed by fragScaff (in vitro, long-range sequence information for de novo genome assembly via transposase contiguity) with the barcoded sequencing reads.    www.nature.com/scientificdata www.nature.com/scientificdata/ The assembly consisted of 1,146 scaffolds, with the scaffold N50 length improving to 6.9 Mb (total length = 798 Mb) and contig N50 of 2.1 Mb ( Table 5). The genome assembly size is similar to the estimated genome size by k-mer analysis.
The input de novo assembly, shotgun reads, and Dovetail Hi-C library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies 21 . Shotgun and Dovetail Hi-C library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu). The separations of Dovetail Hi-C read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, score prospective joins, and make joins above a threshold. After scaffolding, shotgun sequences were used to close gaps between contigs.
The S. suberectus assembly was further refined using 233.19 Gb Hi-C data ( Table 1): 748 Mb (93.73%) of the contig sequences were anchored onto nine chromosomes (Fig. 3). The scaffold N50 was finally improved to be 86.99 Mb and the longest scaffold was 103.57 Mb.
Gene annotation. Genes in the S. suberectus genome were annotated using multiple methods, including homology-based predictions, de novo predictions and transcriptome-based predictions (Fig. 3) 32), and Oryza sativa (ftp://ftp.ensemblgenomes.org/pub/plants/ release-32/fasta/oryza_sativa/, version 1.0) were used for homology-based predictions. First, query sequences were subjected to tblastn analysis with an Expect (E)-value cutoff of 1e-5. BLAST hits corresponding to reference proteins were concatenated by Solar software, and low-quality records were removed. The genomic sequence of each reference protein was extended upstream and downstream by 2,000 bp to represent a protein-coding region. Gene structures contained in each protein region were predicted using GeneWise software 34 . For www.nature.com/scientificdata www.nature.com/scientificdata/ transcriptome-based predictions, RNA from five organs (root, petiole, leaves, flowers, and stems) was isolated and RNA-seq data were used for gene annotation, processed by TopHat and Cufflinks 35 . RNA-seq data were also assembled by Trinity 36 . PASA 37 software (http://pasapipeline.github.io/) was then used to generate a full transcriptome-based genome annotation. The homology, de novo, and transcriptomic gene sets were merged to form a comprehensive and non-redundant reference gene set using EVidenceModeler 38 software. Next, PASA 37 was used to generate UTRs as suggested by the RNA-seq data.
Our analysis indicates that 31,634 protein-coding genes with an average transcript size of 1,097.55 bp were predicted in the genome (Fig. 4a).
Functional annotation of the protein-coding genes was carried out using blastp (E-value cut-off 1e-05) against SwissProt 39 and NR databases. Protein domains were annotated by searching against InterPro 40 and Pfam database 41 , using InterProScan and HMMER (http://hmmer.janelia.org), respectively. The GO terms for genes were obtained from the corresponding InterPro or Pfam entry. The pathways in which the genes might be involved were assigned by BLAST against the KEGG database 42 with the E-value cut-off of 1e-05. www.nature.com/scientificdata www.nature.com/scientificdata/ Overall, 79% (24,976), 70.8% (22,394), and 82.5% (26,082) of genes showed enrichment in InterPro, KEGG, and GO respectively. In total, 93.9% (29,688) of protein-coding genes were successfully annotated for conserved functional motifs or functional terms.
Non-coding rNA annotation. Annotation of tRNA was performed using tRNAscan-SE 43 software with default parameters. rRNA annotation was based on homology with rRNAs from several diverse higher plant species (not shown), using blastn with 'E-value = 1e-5' . miRNA and snRNA genes were predicted by INFERNAL software 44 using the Rfam database 45 .

Data records
This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession QUWT00000000 46 . The version described in this paper is version QUWT01000000. Raw read files are available at NCBI Sequence Read Archive 47 . All the annotation tables containing results of an analysis of the draft genome are available at figshare 48 . Technical Validation evaluation of the completeness of the S. suberectus genome assembly. To estimate the quality of genome assembly, short reads were mapped back to the consensus genome using BWA 49 and an overall 97.29% mapping rate was found, suggesting that our assembly results contained comprehensive genomic information. Gene region completeness was evaluated by RNA-Seq data (Table S1): of the 53,538 transcripts assembled by Trinity 36 , 99.62% could be mapped to our genome assembly, and 95.94% were considered as complete (more than 90% of the transcript could be aligned to one continuous scaffold).
The completeness of gene regions was further assessed using CEGMA (conserved core eukaryotic gene mapping approach) 50 : 240 of 248 (96.77%) conserved core eukaryotic genes from CEGMA were captured in our assembly, and 206 (83.06%) of these were complete (Table S2). Furthermore, we performed BUSCO (Benchmarking Universal Single-Copy) 51 analysis based on a benchmark of 956 conserved plant genes, of which 96% had complete gene coverage (including 18% duplicated ones), 1% were fragmented and only 2.6% were missing (Table S3). These data largely support a high quality S. suberectus genome assembly, which can be used for further investigation.

Code Availability
The execution of this work involved many software tools, whose versions, settings and parameters are described below.