Background & Summary

In recent years, the planting area for rice (Oryza sativa L.) in Heilongjiang (HLJ) province of China has increased to around 4 million ha1. This is the world's largest planting region for early Geng/japonica rice, about 2.6 times larger than the rice planting area of Japan2. Determining how to transfer this agricultural advantage to other branches of the economy remains a significant challenge for agricultural researchers.

Early-maturing Geng/japonica varieties underpin food security3 and supply critical agro-industrial materials, especially the glutinous varieties. Glutinous rice, also called sticky rice, is becoming increasingly popular because of growing public awareness of health issues4. Glutinous rice has health benefits in managing diabetes, inhibiting chronic diseases, enhancing digestion, and reducing inflammation5. In addition to being a premium cooking ingredient for low-gluten diets and ‘good food’6, glutinous rice also provides raw materials for environment-friendly industry7,8,9. Longgeng 57 (LG57), an early-maturing glutinous variety, has favorable grain quality and stable yield in the early Geng/japonica planting region; it is therefore now planted on more than 120,000 ha per year on average.

Grain quality traits of rice are largely controlled by major genes, such as Waxy for amylose content and OsNramp5 for mineral nutritional quality10,11,12. Thus, further improvement of the grain quality of glutinous rice such as LG57 requires more genomic information.

Joint analysis has become a trend in biotechnology-based rice breeding in HLJ. For example, the Rice Molecular Breeding (RMB) laboratory of the Institute of Crop Science (ICS), Chinese Academy of Agricultural Sciences (CAAS), has set up a genome-based breeding scheme with the aid of both the core germplasms of 3K-RG13 and the Rice Functional Genomics Breeding (RFGB) information platform14. The laboratory also cooperates widely with local research institutes in HLJ, including Jiamusi Rice Research Institute (JMS-RRI) and Suihua RRI (SH-RRI)3. Herein, we present a dataset from a collaboration between the RMB laboratory and JMS-RRI for early-maturing Geng/japonica varieties, including LG57. Information derived from this dataset for certain target genes, such as Waxy and OsNramp5, is also included as an example of data validation. The dataset comprises more than 770 Gb of pedigree genome data that will be useful for research in general.


Plant material and library construction

The early-maturing Geng/japonica variety Longgeng 57 (LG57) was developed by our group and approved for release in 2017. It is now a mega variety with multiple elite traits and is widely planted (more than 120,000 ha per year) in Heilongjiang province in Northeast China. High-molecular-weight genomic DNA was extracted from 10-day-old leaves of LG57 pedigree members (multiple seeds) using a modified CTAB method, followed by two rounds of 0.5× bead purification. DNA samples that passed quality control by both 0.75% agarose gel assay and Nanodrop were quantified with Qubit. The LG57 sample that met the standard was used to construct a PacBio HiFi library for long-read sequencing (LRS). Samples of the three parents (Longnuo 2 (LN2), Punian 8 (PN8), and Longgeng 29 (LG29)) were used to construct Illumina libraries for short-read sequencing (SRS) (Fig. 1).

Fig. 1 Outline of the workflow used to generate and analyze the pedigree genome data for Longgeng 57 (LG57).

Genomic data were generated for all pedigree members, as listed in Table 1. PacBio (Menlo Park, CA, USA) protocols were adopted for long-read sequencing of LG57, and Illumina (San Diego, CA, USA) protocols were used for short-read sequencing of its three parents. The details are as follows.

Table 1 Genomic data generated for pedigree of Longgeng 57.

DNA sample testing

DNA was extracted from the samples using a routine method that met the quality standard required for sequencing, as described in a previous study3. Sample purity and quantity were assessed using a NanoPhotometer® (IMPLEN, Westlake Village, CA, USA) and a Qubit® 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA), respectively, in combination with agarose gel electrophoresis (1% gel, 120 V for 45 min).

Library construction and quality inspection

Covaris® g-TUBE15 was used to shear the genomic DNA into suitably large fragments. Magnetic beads were then used for enrichment and purification. SageELF (Sage, Newcastle upon Tyne, UK) was used to size-select and purify the DNA fragments. An Annoroad® Universal DNA Library Prep Kit V2.0 (Annoroad Gene Technology, Beijing, China) was used for sample preparation, including end repair and adapter ligation.

To ensure library quality, a three-step quality check was adopted. After library construction, the Qubit 3.0 was used for preliminary quantification. The library was then diluted to 1 ng/μL and the insert size was checked using an Agilent 2100 instrument (Agilent, Santa Clara, CA, USA). Finally, the effective concentration of the library was accurately quantified by quantitative real-time PCR (qPCR) on a Bio-Rad CFX96 PCR instrument with a Bio-Rad IQ SYBR GRN Kit (both Bio-Rad, Hercules, CA, USA).


The single-molecule real-time (SMRT) method was used for long-read sequencing (LRS) according to the standard PacBio protocol. Short-read sequencing (SRS) was carried out on the NovaSeq 6000 S4 platform (Illumina) to obtain 250 bp paired-end reads.

Genome assembly, validation and annotation

For the LRS data obtained by HiFi library sequencing, the raw PacBio subreads were filtered using SMRT Link v9.0.0.92188 with default parameters to obtain high-quality circular consensus sequence (CCS) reads. hifiasm16 with default parameters was used to assemble the CCS reads. Merqury17 was used for the quality check of the LG57 assembly, and BUSCO (Benchmarking Universal Single-Copy Orthologs)18 was used to assess assembly completeness. BUSCO analysis with default parameters was carried out using the single-copy gene sets defined for several large evolutionary lineages in OrthoDB. The assembled genome was compared against the embryophyta_odb10 gene set, and accuracy and completeness were assessed from the proportion and completeness of the aligned genes.
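hifiasm writes its contigs as GFA graph files rather than FASTA, so a conversion step usually precedes tools such as Merqury and BUSCO. The following is a minimal sketch of such a conversion, not the authors' script; the function names and the example records are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def gfa_segments(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for every segment ("S") record in a GFA stream."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # GFA segment records look like: S <name> <sequence> [optional tags]
        # A sequence of "*" means "not stored", so such records are skipped.
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            yield fields[1], fields[2]


def gfa_to_fasta(lines: Iterable[str], width: int = 60) -> str:
    """Render GFA segments as FASTA text, wrapping sequences at `width` columns."""
    out = []
    for name, seq in gfa_segments(lines):
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + ("\n" if out else "")


# Hypothetical two-contig example; real input would be a hifiasm *.gfa file.
example_gfa = [
    "S\tptg000001l\tACGTACGTAC\tLN:i:10\n",
    "A\tptg000001l\t0\t+\tread1\t0\t10\tid:i:1\n",
    "S\tptg000002l\tGGGCCC\tLN:i:6\n",
]
fasta_text = gfa_to_fasta(example_gfa, width=8)
```

Only segment records are kept; the read-placement ("A") lines that hifiasm also writes are ignored because they carry no contig sequence.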

Based on the LG57 assembly, two strategies were adopted for genome annotation. The first was a homology-based strategy. RepeatMasker with default parameters19, based on RepBase20, was used to annotate repeats. For gene structures, BLAST21 with evalue = 1e-5 and GeMoMa22 with default parameters were used. rRNAs, snRNAs, and miRNAs were predicted by aligning the assembly against known non-coding RNA libraries, e.g., Rfam23. The second was a de novo strategy. For repeat analysis, RepeatModeler with -engine ncbi was adopted. For protein-coding gene prediction, Augustus24 with --genemodel=partial, and SNAP and GeneMark25 with default parameters, were adopted. EVidence Modeler (EVM)26 with default parameters was then used to integrate the gene sets predicted by the various strategies into a non-redundant gene set. The resulting gene set was compared against functional databases, including UniProt27, NCBI, PFAM28, eggNOG29, GO (gene ontology)30, and KEGG (Kyoto Encyclopedia of Genes and Genomes)31. For tRNA prediction, we used tRNAscan-SE32 with the parameters -X 20 and -z 8.

The SRS data were aligned to the Nipponbare IRGSP 1.036 reference genome, and variants were called using a pipeline comprising BWA33, SAMtools34, and GATK35 with default parameters.
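The text names the tools but not the individual steps, so the following is only a sketch of how a BWA, SAMtools, and GATK pipeline is commonly chained for one sample. The choice of `bwa mem` and `HaplotypeCaller`, the read-group string, the thread count, and all file names are assumptions made for illustration; duplicate marking and joint genotyping are omitted.

```python
from typing import List


def srs_commands(sample: str, ref: str, r1: str, r2: str, threads: int = 8) -> List[str]:
    """Return shell command strings for index -> align -> sort -> index -> call.

    Nothing is executed here; the strings can be inspected or written to a script.
    """
    bam = f"{sample}.sorted.bam"
    # GATK needs a read group; bwa expects literal "\t" separators in the -R string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        f"bwa index {ref}",
        f"samtools faidx {ref}",
        f"gatk CreateSequenceDictionary -R {ref}",
        f"bwa mem -t {threads} -R '{read_group}' {ref} {r1} {r2} "
        f"| samtools sort -@ {threads} -o {bam} -",
        f"samtools index {bam}",
        f"gatk HaplotypeCaller -R {ref} -I {bam} -O {sample}.vcf.gz",
    ]


# Hypothetical file names for one of the three parents.
commands = srs_commands("LN2", "IRGSP-1.0.fa", "LN2_R1.fq.gz", "LN2_R2.fq.gz")
```

Building the commands as data keeps the per-sample steps identical for LN2, PN8, and LG29 and makes them easy to log alongside the resulting files.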

Data Records

The assembly of LG57 is accessible at NCBI GenBank37 under accession ID JAXQPT00000000037. The raw read data for LG57 in BAM format are also available under accession number SRR2537649638. Sequencing data for the parents of LG57 are available for PN8 (SRR24688636)39, LN2 (SRR24688637)40, and LG29 (SRR24688635)41. Annotation data for LG57 are accessible through figshare42. All of the above data except the BAM files are also accessible on the RFGB website.

Technical Validation

A total of 1,671,418 reads were obtained. The average read length is 16,831.42 bp and the read N50 is more than 17 kb. The distribution of read lengths is shown in Fig. 2. A draft assembly of LG57 was then generated, and its quality was checked using Merqury and BUSCO. Based on the output of Merqury, the completeness of the assembly was 99.5% and the QV was 62.0 (Table 2). As shown in Table 3, the contig N50 exceeds 27 Mb, which is over 10 times that of our previous work with SJ183. As shown in Table 4, a total of 1614 groups were searched by BUSCO, and complete groups accounted for about 98.8%. The functional annotation of genes predicted in LG57 against the databases is shown in Table 5. As identified by RepeatMasker, the total length of the repeat sequences is approximately 170 Mb, accounting for 43.13% of the whole LG57 genome (Table 6). Prediction results for the different types of non-coding RNA, including miRNA, tRNA, rRNA, and snRNA, are listed in Table 7. These RNAs together account for 81.3% of the LG57 genome. We also compared the parameters of LG57 with those of other assemblies. The average gene length of LG57 is longer than that of the others (Table 8).
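As a rough cross-check of these figures, the sketch below shows how N50 is computed, converts the Merqury QV to a per-base error rate using the Phred relation QV = -10 log10(error), and derives an approximate read yield and depth from the numbers reported above. The genome size is inferred here from the repeat length and repeat fraction, which is an assumption of this sketch and only approximate; the toy contig lengths are hypothetical.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled consensus quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


# Toy contig lengths (hypothetical): total 20, and 6 + 5 = 11 covers half.
toy_n50 = n50([2, 3, 4, 5, 6])

# Values reported in the text.
n_reads = 1_671_418
mean_read_len_bp = 16_831.42
repeat_len_mb = 170.0        # "approximately 170 Mb"
repeat_fraction = 0.4313     # 43.13% of the genome
merqury_qv = 62.0

yield_gb = n_reads * mean_read_len_bp / 1e9           # about 28 Gb of HiFi data
genome_size_mb = repeat_len_mb / repeat_fraction       # about 394 Mb (inferred)
depth = yield_gb * 1000 / genome_size_mb               # about 71x
error_rate = qv_to_error_rate(merqury_qv)              # about 6.3e-7 per base
```

A QV of 62 therefore corresponds to less than one expected consensus error per million bases.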

Fig. 2 Distribution of lengths of circular consensus sequence (CCS) reads for Longgeng 57 (LG57).

Table 2 Assembly quality assessment by Merqury for Longgeng 57.
Table 3 Comparison of Longgeng 57 dataset with representative assemblies including mega varieties (MV) or standard references (SR).
Table 4 Assembly quality for Longgeng 57 presented by BUSCO.
Table 5 Functional genes predicted in Longgeng 57 compared with those from databases.
Table 6 Repeats predicted by different methods in Longgeng 57 assembly.
Table 7 Non-coding RNAs annotation results in Longgeng 57 assembly.
Table 8 Annotation results of coding regions in the Longgeng 57 assembly compared with commonly used assemblies.

For the SRS data of the three parents (LN2, PN8, and LG29), we first aligned the reads against the reference genome IRGSPv1.0 to identify genome variations. We then used the sequences of three representative types of major genes from IRGSPv1.0 as queries in BLAST searches against the LG57 assembly to obtain the target sequences.
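Locating a reference gene in the assembly usually comes down to picking the best BLAST hit per query. The sketch below parses standard 12-column tabular BLAST output (`-outfmt 6`) and keeps the highest-scoring hit for each query. The e-value cutoff, contig names, and coordinates are hypothetical and are not taken from the dataset.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float
    sstart: int
    send: int
    evalue: float
    bitscore: float


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Return the highest-bitscore hit per query from BLAST tabular output.

    Expected columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    best: Dict[str, Hit] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        hit = Hit(f[0], f[1], float(f[2]), int(f[8]), int(f[9]),
                  float(f[10]), float(f[11]))
        if hit.evalue > max_evalue:
            continue
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


# Hypothetical hits for two query genes against assembly contigs.
example_rows = [
    "Waxy\tptg000006l\t99.80\t5000\t8\t2\t1\t5000\t1765000\t1769999\t0.0\t9100\n",
    "Waxy\tptg000011l\t85.00\t300\t45\t0\t10\t309\t500\t799\t1e-60\t250\n",
    "Hd1\tptg000006l\t99.50\t2400\t10\t2\t1\t2400\t9300000\t9297601\t0.0\t4300\n",
    "Hd1\tptg000002l\t90.00\t40\t4\t0\t1\t40\t100\t139\t0.5\t30\n",
]
hits = best_hits(example_rows)
```

A hit whose subject start is larger than its subject end, as in the hypothetical Hd1 row, lies on the reverse strand of the contig and should be reverse-complemented before comparison with the reference sequence.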

Table 9 lists data validation cases for three key genes relevant to LG57 breeding, based on the pedigree genome data, in particular the LG57 assembly and the alignment data of its three parents. The maturing time of Geng/japonica rice is largely affected by the Hd1 gene43, which commonly harbors highly diverse variation panels in the rice genome44. In this region, LG57 and its three early Geng/japonica parents show extremely high consistency. The grain quality of glutinous rice is mainly controlled by the Waxy gene45. LG57 possesses better grain quality than other glutinous early Geng/japonica varieties, such as PN2 and LN2. Three differences in the Waxy gene were found between PN8 and LN2. Although a common variation in the 5th exon of Waxy was found in PN8, LN2, and their progeny LG57, a unique 23 bp deletion in the 1st exon is shared by LG57 and its non-glutinous parent, LG29. Variations in the major gene OsNramp5 affect mineral concentrations in rice10. Notably, LG57 carries variations that differ from those of all three parents, which are presumed to have arisen from spontaneous mutations during the breeding process46,47. These three types of variation in three representative genes validate the genome data and indicate possible applications of this dataset. In summary, the quality of the LG57 pedigree genome data is sufficient for public reuse.

Table 9 Genome variations in three representative types of genes (Hd1 for maturing time, Waxy for amylose content, and OsNramp5 for mineral concentration), where 0 represents the genotype of the reference genome36 and 1 represents the first alternative genotype (ALT).
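With genotypes coded as 0 (reference) and 1 (first ALT), the three patterns discussed above (shared by the whole pedigree, inherited from particular parents, or found only in LG57) can be separated with a simple comparison. The sketch below illustrates the idea; the site names and genotype values are invented for illustration and are not the values in Table 9.

```python
from typing import Dict

PARENTS = ("LN2", "PN8", "LG29")
CHILD = "LG57"


def classify_site(genotypes: Dict[str, int]) -> str:
    """Classify one variant site by comparing the child's allele with its parents'."""
    child = genotypes[CHILD]
    if all(genotypes[p] == child for p in PARENTS):
        return "shared by all"
    donors = [p for p in PARENTS if genotypes[p] == child]
    if donors:
        return "shared with " + "+".join(donors)
    # No parent carries the child's allele: a candidate new mutation,
    # or a genotyping error that needs manual review.
    return "unique to child"


# Hypothetical sites: 0 = reference genotype, 1 = first ALT genotype.
example_sites = {
    "site_a": {"LN2": 1, "PN8": 1, "LG29": 1, "LG57": 1},
    "site_b": {"LN2": 1, "PN8": 1, "LG29": 0, "LG57": 1},
    "site_c": {"LN2": 0, "PN8": 0, "LG29": 1, "LG57": 1},
    "site_d": {"LN2": 0, "PN8": 0, "LG29": 0, "LG57": 1},
}
classified = {name: classify_site(gt) for name, gt in example_sites.items()}
```

Sites classified as unique to the child are the ones worth re-examining in the read alignments before they are attributed to spontaneous mutation.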