Background & Summary

Sorghum is a widely cultivated cereal crop, particularly in Africa, and ranks 5th in global cereal production1. It is remarkably adaptable and strongly stress resistant, tolerating drought, waterlogging, saline-alkali conditions, barren soils, and high temperatures. It serves as a staple food for approximately 500 million people in Africa and Asia, and provides a source of energy, forage, and industrial raw material for the brewing industry. The sorghum genome has been extensively studied: the first reference genome (BTx623) was published in 20092, followed by the genome of the sorghum inbred transformation line Tx430 in 20183 and the sweet sorghum genome in 20194. In 2021, the first pan-genome of sorghum was completed5. Furthermore, in 2023, seven high-quality sorghum organelle genomes were published6. These milestones mark significant advances in molecular research on sorghum. However, the currently widely used sorghum reference genome (BTx623, RefSeq assembly accession: GCF_000003195.3)2 still has relatively low continuity and quality, with a contig N50 of 1.3 Mb and 3,913 total gaps. There has long been a need for high-quality sorghum genomes.

Recent advances in genome sequencing and assembly methodology have made telomere-to-telomere (T2T) gap-free assembly of chromosome sequences possible. T2T assemblies enable the exploration of previously inaccessible regions such as telomeres and centromeres, opening more in-depth research directions in animals and plants7,8,9. T2T genome assemblies have been reported for several important crops, including banana10, barley11, rice12, and maize13. “Cuohu Bazi” (CHBZ) is an ancient local sorghum landrace collected in China between 1982 and 1986 (Fig. 1a). Field evaluation has shown that CHBZ possesses excellent agronomic traits, such as dwarf stature and early maturation: the plant height is approximately 1.30 m, and the growth period is 100 days. A T2T genome assembly of CHBZ should therefore provide valuable guidance for sorghum breeding efforts.

Fig. 1

Overview of CHBZ and its genome. (a) Photographs of CHBZ. (b) Circos plot illustrating the genome of CHBZ. The plot includes the following components, arranged from inside to outside: (I) Collinear regions within the CHBZ assembly; (II) Gene density in 1-Mb sliding windows; (III) GC content in non-overlapping 1-Mb windows; (IV) Percentage of interspersed repeats in 1-Mb sliding windows; (V) Percentage of tandem repeats in 1-Mb sliding windows; (VI) Length of each pseudochromosome in megabases (Mb).

In this study, we aimed to generate the first T2T gap-free genome for CHBZ using a combination of the latest sequencing technologies, including PacBio high-fidelity (HiFi) sequencing, ultra-long Oxford Nanopore Technology (ONT) sequencing, and high-throughput chromosome conformation capture (Hi-C) sequencing. The genomic resources and gene structures produced by this study will lay the groundwork for future research in CHBZ genetic breeding.

Methods

Sample collection and sequencing

CHBZ, grown in the germplasm resource nursery at the Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Taiyuan, 030031, China, was chosen for DNA and RNA sequencing. Fresh, healthy, young seedlings were harvested, immediately frozen in liquid nitrogen, and then stored at −80 °C. Genomic DNA was extracted with the cetyltrimethylammonium bromide method and evaluated using a NanoDrop One spectrophotometer (NanoDrop Technologies, Wilmington, DE) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). For the PacBio HiFi libraries, the “Using SMRTbell Express Template Prep Kit 2.0 With Low DNA Input” protocol from PacBio was followed, with an insert size of approximately 20 kb (Pacific Biosciences, USA). The libraries were then sequenced on the PacBio Sequel II platform in circular consensus sequencing (CCS) mode. For ONT ultra-long sequencing, the library was prepared using the Oxford Nanopore SQK-LSK109 kit and sequenced on a PromethION flow cell (Oxford Nanopore Technologies, Oxford, UK). For Hi-C sequencing, libraries based on the DpnII restriction enzyme were generated as previously described14 and sequenced on the MGISEQ-2000 platform. Total RNA from roots, stems, leaves and spikes was isolated using the NEBNext Poly(A) mRNA Magnetic Isolation Module. DNase I (Thermo Fisher Scientific, Wilmington, DE, USA) was used to remove genomic DNA. RNA integrity was checked using a BioAnalyzer 2100 (Agilent Technologies, Santa Clara, USA). RNA libraries were prepared using the NEBNext Ultra RNA Library Prep Kit for Illumina with an insert size of 300 bp. The RNA libraries were sequenced on an MGISEQ-2000 instrument, generating 150 bp paired-end reads.

In total, we generated 304.06 Gb (~419X coverage) of ONT reads with an N50 of 52.44 kb, 28.65 Gb (~40X coverage) of PacBio HiFi CCS reads with an N50 of 16.64 kb, 304.93 Gb of Hi-C data (Illumina paired-end reads, ~421X coverage), and 123.30 Gb of RNA data (Tables 1 and 2).
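The coverage values quoted above follow from dividing the data yield by the genome size. The short sketch below reproduces that arithmetic; it is illustrative only and was not part of the original analysis. It assumes the 724.85 Mb final assembly size reported in the Genome assembly section as the denominator, and the function name `fold_coverage` is hypothetical.

```python
# Illustrative only: sequencing depth = total sequenced bases / genome size.
# ASSEMBLY_SIZE_GB is an assumption: the 724.85 Mb assembly size reported in
# the text, used here as the denominator.
ASSEMBLY_SIZE_GB = 0.72485


def fold_coverage(yield_gb: float, genome_size_gb: float = ASSEMBLY_SIZE_GB) -> float:
    """Return the approximate fold coverage for a data yield given in Gb."""
    return yield_gb / genome_size_gb


if __name__ == "__main__":
    # Yields taken from the paragraph above.
    for name, gb in [("ONT", 304.06), ("HiFi", 28.65), ("Hi-C", 304.93)]:
        print(f"{name}: ~{fold_coverage(gb):.0f}X")
```

Using the estimated genome size from the k-mer survey (724.95 Mb) instead changes the rounded values only marginally.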

Table 1 Summary of DNA sequencing data of the CHBZ genome.
Table 2 Summary of RNAseq sequencing data of the CHBZ genome.

Genome assembly

The subreads generated on the PacBio Sequel II platform were processed using the CCS algorithm of SMRTLink (v11.1.0)15 with the following parameters: “--minPasses 3 --minPredictedAccuracy 0.99 --minLength 500”. Subsequently, we performed a genome survey with GCE (Genomic Charactor Estimator) (v1.0.2)16 using the parameter ‘-k 17’ and the HiFi reads to assess the size and heterozygosity of the CHBZ genome. The estimated size of the CHBZ genome was approximately 724.95 Mb, with a heterozygosity rate of 0.10% and a repeat content of 67.74% (Table 3). After quality control using SOAPnuke (v2.0)17 with the parameters “-n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5”, a total of 303.56 Gb of clean Hi-C data was obtained. Using the PacBio HiFi reads, ONT reads, and clean Hi-C data, the primary contigs were generated by Hifiasm (v 0.19.5)18 with default parameters. To anchor contigs onto chromosomes, we used BWA (v 0.7.12)19 to align the clean Hi-C data to the assembled contigs and then filtered low-quality reads using the HiC-Pro pipeline20 with default parameters. The valid reads were used to anchor chromosomes with Juicer21 and the 3d-dna pipeline22. LR_Gapcloser23 was used to close gaps in the assembled genome, using error-corrected ONT long reads generated by NECAT24. To further refine the genome, the T2T assembly was polished using a method similar to that described in McCartney, Shafin et al.25. Briefly, the HiFi reads were aligned to the T2T assembly using Winnowmap2 (v 2.03)26. The output alignments were filtered to remove all secondary alignments and alignments with excessive clipping using the ‘falconc bam-filter-clipped’ tool. Finally, racon (v 1.5.0)27 was run with the filtered alignments.

Table 3 Summary statistics of the CHBZ genome assembly.

Overall, the final genome assembly of CHBZ is about 724.85 Mb with an N50 length of 71.06 Mb (Table 3). Genome sequences were clustered and oriented into 10 pseudochromosomes (Fig. 1b, Table 4).
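For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is conventionally computed from a list of sequence lengths. It is a generic illustration rather than the tool used in this study; the function name `n50` and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    toy_lengths_mb = [80, 70, 50, 30, 20]  # hypothetical values, not CHBZ data
    print(n50(toy_lengths_mb))
```

In a gap-free assembly with one contig per chromosome, the contig N50 is simply the length of the chromosome at which the cumulative sum crosses half of the genome size.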

Table 4 The quality of each chromosome in the CHBZ genome.

Repeat annotation

Two strategies, de novo and homology-based, were used to predict repeat elements. De novo repeats were identified with RepeatModeler (v1.0.4)28, and long terminal repeats were annotated with LTR-FINDER (v1.0.7)29. Transposable elements were detected at the DNA and protein levels with RepeatMasker (v4.0.7)30 and RepeatProteinMasker (v4.0.7), respectively, based on the Repbase database31. Finally, tandem repeats were identified with Tandem Repeats Finder (v4.10.0)32.

In the CHBZ genome, repeat sequences accounted for 510.36 Mb, representing 70.41% of the assembly (Table 3). Long terminal repeat (LTR) retrotransposons (55.75%) were the most abundant component among repetitive elements, which was consistent with the previous study1 (Table 5).

Table 5 Transposable element (TE, interspersed repeat) contents in the CHBZ assembly.

Protein-coding gene prediction and functional annotation

Gene prediction was conducted through a combination of transcriptome-based, homology-based, and ab initio prediction methods. For transcriptome-based prediction, 123.3 Gb of clean reads sequenced on the DNBSEQ-2000 from root, stem, leaf and ear tissues from four stages (seedling, jointing, heading and flowering, filling and maturity stage) were assembled with Trinity (v 2.15.1)33 using the parameters ‘--max_memory 200G --min_contig_length 200 --genome_guided_bam merged_sorted.bam --full_cleanup --min_kmer_cov 3 --min_glue 3 --bfly_opts ‘-V 5 --edge-thr=0.1 --stderr’ --genome_guided_max_intron 10000’, which generated 130,301 transcripts with an N50 of 2,702 bp (Table 6). These assembled transcripts were aligned to the T2T assembly with the Program to Assemble Spliced Alignments (PASA) (v 2.4.1)34. Valid transcript alignments were clustered based on genome mapping location and assembled into gene structures. The coding regions were obtained with TransDecoder (v 5.7.1) (https://github.com/TransDecoder/TransDecoder) (PASA-set). In addition, the RNA-seq clean reads were mapped to the T2T assembly using Hisat2 (v 2.0.1)35. Stringtie (v 1.2.2)36 and TransDecoder (v 5.7.1) were then used to assemble the transcripts and identify candidate coding regions, yielding a second set of gene models (Stringtie-set). Genomes of five plants, namely rice (T2T-NIP)12, foxtail millet (RefSeq assembly accession: GCF_000263155.2), maize (T2T Mo17)13, A. thaliana (Col-PEK)37, and BTx623 sorghum (RefSeq assembly accession: GCF_000003195.3), were downloaded for homology-based prediction. Subsequently, these sequences were used as queries to search against the T2T assembly using GeMoMa (v 1.9)38 with bam files from the RNA-seq data. Homology predictions were denoted as “Homology-set”. For ab initio prediction, AUGUSTUS (v 3.2.3)39 was used to predict coding regions in the repeat-masked genome. All predicted gene models were combined by EvidenceModeler (v 2.1.0)40 into a non-redundant set of gene structures.
Evidence from different sources was assigned different weights: 10 for PASA-set, 5 for Stringtie-set, 5 for Homology-set, and 1 for AUGUSTUS gene predictions. Finally, the resulting gene models were further refined with PASA (v 2.4.1)34 to add untranslated regions and alternative splicing information. The integrated gene set was translated into amino-acid sequences. Using Diamond (v 0.9.30.131)41 with an E-value cutoff of 1E-05, the amino-acid sequences were aligned to five public protein databases: KOG, SwissProt42, Kyoto Encyclopedia of Genes and Genomes (KEGG)43, NCBI nonredundant database (NR), and Translation of European Molecular Biology Laboratory (TrEMBL). Finally, protein domains were searched with InterProScan (v 5.30)44, and the Gene Ontology (GO) terms for each gene were extracted from the InterProScan results. Gene annotation identified 32,855 protein-coding genes, of which 32,746 (99.67%) were annotated in at least one functional database (Table 7).

Table 6 Statistics of RNAseq de novo assembly of CHBZ.
Table 7 Number of functional annotations for predicted genes in the CHBZ assembly.

Gene expression analysis

The raw RNA-seq reads were quality controlled with fastp (v0.19.5)45. Subsequently, the clean reads were aligned to the CHBZ genome using Hisat2 (v2.1.0)46 with the following parameters: ‘--phred33 -p 5 --sensitive --no-discordant --no-mixed -I 1 -X 1000’. A mapped read count matrix was generated using htseq-count (v0.12.4)47. Gene expression levels were calculated using the fragments per kilobase of exon per million mapped reads (FPKM) method48.
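As a reminder of what the FPKM normalization does, the sketch below applies the standard definition, FPKM = fragments × 10^9 / (exon length in bp × total mapped fragments). It is a minimal illustration under that textbook definition, not the script used in this study; the function name `fpkm` and the example numbers are hypothetical.

```python
def fpkm(gene_fragments: int, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of exon per million mapped fragments.

    Standard definition: count * 1e9 / (exon length in bp * library size).
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and library size must be positive")
    return gene_fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Hypothetical gene: 500 fragments, 2 kb of exon, 25 million mapped fragments.
    print(fpkm(500, 2000, 25_000_000))
```

The normalization divides out both gene length and library size, which is what allows a single threshold (for example FPKM ≥ 1) to be applied across genes and samples.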

Comparative genomic analysis

To identify syntenic relationships between the CHBZ and BTx623 (RefSeq assembly accession: GCF_000003195.3, publicly released by the Sorghum Consortium in 2017) genomes, we extracted the longest coding sequence (CDS) of each gene. These CDSs were then input into JCVI (v1.1.18)49 with a minimum of 30 genes per block and the parameter “--cscore=0.99”. The analysis yielded 24,685 orthologous pairs: 24,639 (74.6%) in CHBZ and 24,637 (72.2%) in BTx623 (Fig. 2a).

Fig. 2

The syntenic relationships and PAVs between CHBZ and BTx623. (a) JCVI was used to detect syntenic blocks between the CHBZ and BTx623 gene pairs. The x-axis is the CHBZ genome, and the y-axis is the BTx623 genome. (b) Heatmap of the 129 expressed CHBZ-specific PAV genes. Rows represent the 129 CHBZ-specific PAV genes, and columns represent the 15 RNAseq samples. Blue and red boxes represent genes showing lower and higher expression levels, respectively.

Presence/absence variants (PAVs) between the CHBZ and BTx623 genomes were identified using a method similar to that described in Li, Xu et al.50. First, the CHBZ genome was divided using a sliding window (window size, 500 bp; step, 100 bp). All resulting sequences were then aligned against the BTx623 genome using BWA (v 0.7.17-r1188)51 with the MEM algorithm (-w 500 -M -t 16). A sequence that failed to align to the BTx623 genome, or aligned with <25% coverage, was defined as a CHBZ-specific sequence. To identify CHBZ PAV genes, the longest CDS per gene was extracted, and genes with >75% of their CDS covered by specific sequences were defined as putative PAV genes. To exclude potential false positives, the longest CHBZ CDSs were mapped to the BTx623 genome using minimap2 (v2.24-r1122)52 with the parameters ‘-x splice -t 10 -k 12 -a -p 0.4 -N 20’. If a gene had a mapping quality >10, and coverage >25%, or identity >90%, it was defined as a false-positive PAV gene. In summary, our analysis yielded 161 CHBZ-specific PAV genes and 178 BTx623-specific PAV genes. Notably, 129 of the 161 CHBZ-specific PAV genes were expressed in at least one RNAseq sample (Fig. 2b). Among the 178 BTx623-specific PAV genes, 163 overlapped with those identified in the pan-genomic study5.
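The two thresholded steps of this procedure, windowing the genome and flagging genes whose CDS is mostly covered by CHBZ-specific sequence, can be sketched as follows. This is a simplified illustration and not the pipeline used in this study: alignment with BWA and minimap2 is omitted, coordinates are assumed to be 0-based and half-open, CDS exons are assumed not to overlap each other, dropping the final partial window is an assumption, and all function names are hypothetical.

```python
def sliding_windows(seq_len, window=500, step=100):
    """Yield 0-based half-open (start, end) windows of fixed size.

    Assumption: a trailing window shorter than `window` is dropped.
    """
    for start in range(0, max(seq_len - window, 0) + 1, step):
        if start + window <= seq_len:
            yield start, start + window


def merge_intervals(intervals):
    """Merge overlapping or touching intervals so bases are counted once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(cds_exons, specific_windows):
    """Fraction of CDS bases lying inside genome-specific windows."""
    specific = merge_intervals(specific_windows)
    cds_len = sum(end - start for start, end in cds_exons)
    overlap = sum(
        max(0, min(end, s_end) - max(start, s_start))
        for start, end in cds_exons
        for s_start, s_end in specific
    )
    return overlap / cds_len


def is_putative_pav(cds_exons, specific_windows, threshold=0.75):
    """Apply the >75% CDS coverage rule described in the text."""
    return covered_fraction(cds_exons, specific_windows) > threshold


if __name__ == "__main__":
    print(len(list(sliding_windows(1000))))  # number of 500-bp windows in 1 kb
    exons = [(100, 400), (600, 700)]         # hypothetical CDS, 400 bp in total
    print(is_putative_pav(exons, [(0, 500), (100, 600), (200, 700)]))
```

Merging the overlapping windows before measuring overlap matters here, because with a 100-bp step each base falls into up to five windows and would otherwise be counted several times.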

Identification of centromeres and telomeres

Using a method similar to that described for the wild blueberry T2T assembly53, centromeres and telomeres were identified using QuarTeT (v 1.1.1)54 with the “-c plant” option. QuarTeT provides a comprehensive suite of tools for automating T2T genome assembly and analysis, including the TeloExplorer module for telomere identification and CentroMiner for predicting centromere candidates. Briefly, CentroMiner identifies tandem repeat monomers, selects potential centromeric repeats based on period and copy number, clusters them to minimize redundancy, and aligns representative monomers to the corresponding chromosomes. The centromere regions ranged in length from 4.31 Mb to 13.00 Mb. Notably, 98.58% of the centromere regions consisted of repetitive sequences, with LTR-Gypsy elements and tandem repeats being the predominant categories (Table 8). We observed large blank regions in the Hi-C interaction heatmap at the centromere regions, a phenomenon also reported in the Scutellaria baicalensis gap-free genome55 (Fig. 3).

Table 8 The characteristic of predicted centromeres.
Fig. 3

Chromatin interactions at 100 kb resolution reveal the characteristics of the centromere regions in the CHBZ genome.

Data Records

The sequencing data have been deposited at the Sequence Read Archive with the accession number SRP47291256, under the project identifier PRJNA1037263. Additionally, the assembled genome sequence has been made available in GenBank with the Whole Genome Shotgun project accession GCA_040267525.157. Furthermore, files pertaining to the genome assembly, gene structure annotation, and repeat annotation have been archived in the Figshare repository58.

Technical Validation

Evaluation of the genome assembly

Multiple approaches were employed to validate the accuracy and completeness of the CHBZ genome assembly. We predicted centromeric sequences from the CHBZ genome assembly and found that all 10 centromeres were captured (Fig. 4a). Additionally, we identified all 20 telomeres using the seven-base telomere repeat sequence (‘AAACCCT’) as a query (Fig. 4a). The number of telomere repeats in the CHBZ genome assembly was significantly higher than that in the BTx623 genome (Fig. 4b). The Hi-C heatmap displayed a high level of consistency across all chromosomes, providing evidence for the accurate sequencing, ordering, and orientation of contigs in the CHBZ genome assembly (Fig. 4c). The LTR assembly index (LAI), a measure of LTR completeness, was 23.63 for the assembly (Table 3), higher than the LAI values of BTx623 (LAI: 19.52) and even the latest T2T assembly of wild blueberry (LAI: 20.22)53. Completeness was also supported by the high mapping rates of the two long-read types, with 100% of ONT reads and 99.9% of HiFi reads aligning to the CHBZ assembly. In addition, the assembly achieved a quality value score of 61.60 (Table 3). Finally, the Benchmarking Universal Single-Copy Orthologs (BUSCO)59 test identified 99.01% of the 1,614 genes in the embryophyta gene set in the CHBZ assembly (Fig. 4d). Overall, these results demonstrate the high quality and reliability of the CHBZ genome assembly.
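The telomere check above relies on finding copies of the ‘AAACCCT’ repeat (or its reverse complement, ‘AGGGTTT’) at both ends of each chromosome. The sketch below shows a deliberately simple version of that idea, counting motif copies in a terminal window. It is not the TeloExplorer algorithm from QuarTeT; the function names and the 10-kb default window are hypothetical choices made for illustration.

```python
# Translation table for complementing uppercase DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_repeat_counts(chrom_seq: str, motif: str = "AAACCCT", flank: int = 10_000):
    """Count non-overlapping motif copies near each chromosome end.

    Returns (left_count, right_count). For each end, the larger of the
    forward-motif and reverse-complement counts is reported, because the
    strand of the repeat depends on which end is examined.
    Assumption: `flank` (terminal window size in bp) is an arbitrary choice.
    """
    seq = chrom_seq.upper()
    rc_motif = reverse_complement(motif)
    left, right = seq[:flank], seq[-flank:]
    left_count = max(left.count(motif), left.count(rc_motif))
    right_count = max(right.count(motif), right.count(rc_motif))
    return left_count, right_count


if __name__ == "__main__":
    # Hypothetical toy chromosome: telomeric repeats flanking unique sequence.
    toy = "AAACCCT" * 20 + "ACGT" * 50 + "AGGGTTT" * 15
    print(terminal_repeat_counts(toy, flank=70))
```

A chromosome would be considered to carry both telomeres when the two counts exceed some minimum copy number; the appropriate cutoff is left open here because the text does not state one.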

Fig. 4

The high quality of the CHBZ genome. (a) Telomere detection map. Triangles and circles represent telomeres and centromeres, respectively, within the CHBZ assembled chromosomes; red indicates high gene density and blue indicates low gene density. (b) Number of telomere repeats in the CHBZ and BTx623 genomes. (c) Heatmap displaying Hi-C interactions of CHBZ pseudomolecules. (d) BUSCO assessments of the CHBZ genome.

Evaluation of the gene annotation

First, the exon and intron length distributions are congruent with those of three related species, supporting the reliability of our annotation (Fig. 5a,b). Second, a total of 32,284 (98.26%) genes received evidence-based support (Table 9). Furthermore, 32,746 (99.67%) protein-coding genes were successfully annotated in at least one database, and 19,252 (58.60%) were supported by all six databases (Table 7, Fig. 5c). Notably, 25,873 (78.75%) genes exhibited detectable transcriptional activity (FPKM ≥ 1) across the 15 RNA-seq datasets (Fig. 5d). Moreover, the predicted proteins achieved a complete BUSCO score of 99.38%, indicating high-quality gene annotation (Table 3). In summary, the gene annotation exhibits a high degree of accuracy and completeness.

Fig. 5

Quality assessment of the protein-coding genes in the CHBZ assembly. (a) Comparison of exon length among four related plant gene sets. Window refers to the length of every point. (b) Comparison of intron length among four related plant gene sets. No obvious unexpected differences exist among these gene sets, indicating the high quality of the gene structure annotation. (c) Petal diagram of annotations from six public databases. (d) Proportions of genes that could be transcriptionally detected in CHBZ.

Table 9 Summary of evidence for the CHBZ gene models.