Nelumbo genome database, an integrative resource for gene expression and variants of Nelumbo nucifera

Sacred lotus (Nelumbo nucifera, or lotus) is one of the most widely grown aquatic plant species with important uses, such as in water gardening and in vegetable and herbal medicine. A public genomic database of lotus would facilitate studies of lotus and other aquatic plant species. Here, we constructed an integrative database: the Nelumbo Genome Database (NGD, http://nelumbo.biocloud.net). This database is a collection of the most updated lotus genome assembly and contains information on both gene expression in different tissues and coexpression networks. In the NGD, we also integrated genetic variants and key traits from our 62 newly sequenced lotus cultivars and 26 previously reported cultivars, which are valuable for lotus germplasm studies. As applications including BLAST, BLAT, Primer, Annotation Search, Variant and Trait Search are deployed, users can perform sequence analyses and gene searches via the NGD. Overall, the valuable genomic resources provided in the NGD will facilitate future studies on population genetics and molecular breeding of lotus. Measurement(s) reference genome data • whole genome sequencing • transcriptome Technology Type(s) Hi-C • PacBio Sequel System • Illumina sequencing • RNA sequencing • DNA sequencing Sample Characteristic - Organism Nelumbo nucifera Measurement(s) reference genome data • whole genome sequencing • transcriptome Technology Type(s) Hi-C • PacBio Sequel System • Illumina sequencing • RNA sequencing • DNA sequencing Sample Characteristic - Organism Nelumbo nucifera Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.13487271

www.nature.com/scientificdata www.nature.com/scientificdata/ isoforms based on RNA-seq and PacBio SMRT methods from our previous studies 14,17 . It also contains the sequences of 1,517 lotus transcription factors (TFs), which are classified into 56 TF (sub)families. Furthermore, sequence data of lotus gene families, transposable elements (TEs) and other repeats are also present in the NGD, and information concerning gene expression levels and highly coexpressed genes from data from a coexpression (WGCNA) network based on 69 RNA-seq samples from 11 lotus tissues (seed coat, cotyledon, receptacle, carpel, stamen, petal, rhizome, leaf, root, petiole and apical bud tissues) is present in the NGD. Furthermore, information concerning a total of 26,939,834 high-quality SNPs (single nucleotide polymorphisms), 4,177,974 InDels and key horticultural traits from 88 lotus cultivars is present in the NGD.
Uses. Through data collection and downstream processing, our platform provides the most complete lotus genome assembly for browsing via GBrowse or JBrowse 20,21 . Genes, DNA sequences, amino acids, SNPs and InDels can be viewed via GBrowse. The gene information page includes gene-splicing structures, sequences, and functional annotations such as those from the PFAM, KEGG, GO, KOG, SwissProt, TrEMBL and Nr databases. RNA-seq-based expression profiles across different tissues are also retrievable and can be visualized via a heatmap. Searching genes by keywords is also possible in the NGD. Additionally, coexpressed genes of a query gene in the WGCNA-derived network can be retrieved by setting a weight threshold; these coexpressed genes are likely involved in the same biological process as the query gene. Coding sequences and genomic sequences can be searched based on sequence similarity via BLAST or BLAT. Primers for the PCR experiments can also be designed directly in the NGD.

Methods
Data processing. Gene predictions on our chromosome-level genome assembly were performed using transcriptomes, gene homology and ab initio identification. A list of publicly available RNA-seq datasets, which mainly contain samples of the China Antique variety, were downloaded from the NCBI SRA database (Onlineonly Table 1). First, the corrected consensus PacBio full-length transcripts were mapped to the lotus reference genome using GMAP 14 . RNA-seq reads (Illumina) were then mapped to the genome using the HISAT2-StringTie pipeline 22 . All the transcripts were further merged using TACO 23 . Coding DNA sequences (CDSs) of transcripts were predicted using Transdecoder (https://github.com/TransDecoder). Second, homology-based gene prediction was performed using GeMoMa, which used genome sequences and gene coordinates from Arabidopsis thaliana 24 , Carica papaya 25 , Vitis vinifera 26 , Macadamia ternifolia (Proteales) 27 and Brachypodium distachyon 28 as inputs. Third, ab initio prediction was performed using Braker2, which used transcript coordinates of RNA-seq as a guide 29 . Finally, all predictions were merged, and for each gene with more than one gene model (transcripts), the longest one was chosen as the representative gene model. Genes with an ORF less than 30 aa were discarded. Further, high-confidence gene sets were defined as those whose genes that either were homologous to those in other plant species in Plant Plaza 4.0 (https://bioinformatics.psb.ugent.be/plaza/versions/plaza_v4_dicots/)  www.nature.com/scientificdata www.nature.com/scientificdata/ or were supported by RNA-seq (FPKM > 0.1). To quantify the expression of each gene in different lotus tissue samples, FPKM values across different RNA-seq samples were obtained via StringTie 22 . A coexpression network of different genes based on the expression profile was constructed using the WGCNA (v1.0) package in R 30 . Specifically, genes with an average FPKM > 0.1 and a coefficient of variation (CV) of gene expression (FPKM) > 2 were retained for the WGCNA. Genes were clustered hierarchically based on Topological Overlap Matrix 31 and were assigned to nine modules (minimum module size of 600 and minimum module similarity of 0.5). The weight values between genes were used to represent the connectivity between genes.
Gene functions were annotated using the Gene Ontology (GO), KEGG, KOG, Pfam, SwissProt, TrEMBL, and Nr databases via KOBAS 3.0, BlastKOALA, PfamScan and BLAST 32,33 . As protein domains are conserved units shared by related genes, we clustered genes into domain families (gene families) according to the HMM Pfam domain annotations 34 . In addition, all transcription factors (TFs) were predicted and clustered into TF families via PlantTFDB 4.0 35 .
There were 88 recorded lotus cultivars chosen in this study as representing various floral traits (color, shape, flowering time, etc.). Among these cultivars, 62 were sequenced in our current study, while the sequences of 26 with detailed phenotypic records were downloaded from the NCBI database (Online-only Table 2). Genomic DNA was first extracted from young leaves by the CTAB method 36 , and then DNA libraries were constructed by cutting the DNA into 250~280 bp fragments using a NEBNext ® Ultra DNA Library Prep Kit for Illumina (NEB, USA) following the manufacturer's recommendations. Paired-end reads (PE150) were sequenced on an Illumina NovaSeq. 6000 (San Diego, CA, USA), which generated approximately 16 × -depth data for each cultivar sample. Clean reads were obtained by removing the adapters and low-quality reads, including those comprising > 10% N, with < 20% low-quality bases, with low-quality/ambiguous fragments at the read ends within a 5 bp window and with a quality < 20 via FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). The clean reads were mapped to the reference genome by BWA-men 37 . Variants were subsequently called by pipeline via GATK 4.0 (Genome Analysis Toolkit) with further SNP hard-filtering parameters ("QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and InDel hard-filtering parameters of "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0") 38 . The phenotypes and some images of these cultivars were collected from reference books 39,40 ; the phenotypes were further validated across two years of field investigations. Images of floral traits for these cultivars displayed in the NGD were taken mostly during flowering at the Wuhan Institute of Landscape Architecture (Wuhan, China).
Database construction. All genomic sequence, annotation, expression, and genetic variation data were stored via MySQL on a Ubuntu server. A user-friendly website was developed using HTML5 and JavaScript; this website which can be accessed through different browsers, such as Google Chrome and Firefox. Gene models and transcript isoforms are provided via GBrowse and JBrowse. Heatmaps of gene expression are plotted via R, and query searches are achieved via JavaScript and Java. Common utilities for genomic studies such as BLAST, BLAT and Primer Design are also deployed and accessible.

Data Records
The genomic raw PacBio sequencing data are available in the NCBI Sequence Read Archive (SRA) database under accession numbers SRR7549129 41 and SRR7549130 42 , and the Illumina and Hi-C sequencing data are deposited under SRR7615553 43 and SRR7631523 44 , which helped us in our genome assembly (Online-only Table 1). Raw whole-genome resequencing reads for 62 strains can be downloaded from the NCBI database under Bioproject accession SRP173547 45 , and the resequencing data of the other 22 cultivars are also accessible via the NCBI SRA 46,47 (Online-only Table 2). The latest assembly and annotations of the 'China Antique' lotus variety is deposited in the Nelumbo Genome Database (Download links: http://nelumbo.biocloud.net/downloadData/download?path = NNU.genomic.fa and http://nelumbo.biocloud.net/downloadData/download?path = NNU.gff3). Additionally, this Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession DUZY00000000, and the version described in this paper is version DUZY01000000 48

technical Validation
Quality control of genome annotation, expression, and genome resequencing was performed during data processing for the NGD. Genome annotation. We used a set of conserved single-copy plant genes from the BUSCO database to assess the completeness of gene annotations 50 . Compared with previous annotations of the lotus variety China Antique (BUSCOs = 74.6%), our new annotation version provides much improved, complete BUSCOs (97.5%), and 41,140 out of 46,713 annotated genes with either complete or partial ORFs (88%) were validated by 69 transcriptome datasets ( Figure S1a), which suggests relatively high quality and completeness of the genome assembly and gene annotations ( Table 2).

Gene expression.
To ensure that the FPKMs of genes accurately reflect the gene expression in different tissues and at different developmental stages, hierarchical clustering of gene FPKMs across different RNA-seq samples of the variety China Antique was performed via Expander 6.0 51 . All sample repeats clustered together, while all developmental stages from the same tissue clustered together, except for one petiole sample, and the www.nature.com/scientificdata www.nature.com/scientificdata/ relative expression of randomly selected genes in different tissues was validated through qRT-PCR (Fig. 2 and Figure S1b). Therefore, we confirmed the accuracy of FPKM as an indicator of lotus gene expression.
Whole-genome resequencing. Before genome mapping, adapters and low-quality Illumina reads were filtered and removed (see the Methods). Base content, error rate, insert size distribution and log-transformed read coverage across the lotus chromosomes were checked, all of which met the criteria for downstream analyses (Fig. 3). The quality of genome mapping was also checked. The average mapping rate for cultivars from this study was 99.18%, while the numbers in the other cultivars collected from two previous studies were 98.98% and 99.13% (Online-only Table 2). The average depth for the cultivars from the current study was 16.1×, while the depth was 12.4× and 11.8× for cultivars from the other two reports (Online-only Table 2). To ensure the final quality of SNPs and InDels called by the GATK pipeline, stringent hard-filtering parameters were set (see the Methods). Because the majority of alleles in the SNP data set are expected to be shared by at least two individuals, we plotted the frequency of SNPs according to the minor allele count (MAC) across the 88 cultivars. Indeed, most of the SNPs had MACs ≥ 2, while the SNP density peaked around MACs of four or five (Fig. 4). SNP variants were further validated and visualized using IGVtools (http://software.broadinstitute.org/software/igv/igvtools) ( Figure S1c).  Table 2. BUSCO assessment of the completeness of gene annotation. a The current genome assembly and annotation of var. China Antique 8 . b Early genome assembly and annotation of var. China Antique 9 . www.nature.com/scientificdata www.nature.com/scientificdata/