Abstract
“Cuohu Bazi” (CHBZ) is an ancient sorghum variety collected from fields in China and known for agronomic traits such as dwarf stature and early maturation. In this study, we present the first telomere-to-telomere (T2T), gap-free genome assembly of CHBZ, generated from PacBio HiFi reads, Oxford Nanopore Technologies (ONT) reads, and Hi-C data. The assembled genome comprises 724.85 Mb and resolves all 3,913 gaps present in the previous sorghum BTx623 reference genome. Notably, the T2T assembly captures all 10 centromeres and all 20 telomeres, providing strong support for the integrity of the assembly. The assembly is of high quality in terms of contiguity (contig N50: 71.1 Mb), completeness (BUSCO score: 99.01%; k-mer completeness: 98.88%), and correctness (QV: 61.60). Repetitive sequences account for 70.41% of the genome, and a total of 32,855 protein-coding genes were annotated. Furthermore, 161 CHBZ-specific presence/absence variant (PAV) genes were identified in comparison with the BTx623 genome. This study provides a valuable resource for future research on sorghum genetics, genomics, and evolutionary history.
Background & Summary
Sorghum is a widely cultivated cereal crop, particularly in Africa, and ranks fifth in global cereal production1. It is highly adaptable and resistant to multiple stresses, tolerating drought, waterlogging, saline-alkali soils, poor soils, and high temperatures. It serves as a staple food for approximately 500 million people in Africa and Asia and provides energy, forage, and industrial raw material for the brewing industry. The sorghum genome has been studied extensively: the first reference genome (BTx623)2 was published in 2009, followed by the genome3 of the inbred transformation line Tx430 in 2018 and the sweet sorghum genome4 in 2019. In 2021, the first sorghum pan-genome was completed5. Furthermore, in 2023, seven high-quality sorghum organelle genomes were published6. These milestones mark significant advances in molecular research on sorghum. However, the widely used sorghum reference genome (BTx623, RefSeq assembly accession: GCF_000003195.3)2 still has relatively low contiguity and quality, with a contig N50 of 1.3 Mb and 3,913 gaps in total. High-quality sorghum genomes have therefore long been needed.
Recent advances in genome sequencing and assembly methodology have made telomere-to-telomere (T2T), gap-free assembly of chromosome sequences possible. T2T assemblies allow previously unresolved regions such as telomeres and centromeres to be explored, opening more in-depth research directions in both animals and plants7,8,9. T2T genome assemblies have been reported for several important crops, including banana10, barley11, rice12, and maize13. “Cuohu Bazi” (CHBZ) is an ancient local sorghum landrace collected in China between 1982 and 1986 (Fig. 1a). Field evaluation has shown that CHBZ has excellent agronomic traits, such as dwarf stature and early maturation: the plant height is approximately 1.30 m and the growth period is 100 days. A T2T genome assembly of CHBZ is therefore expected to provide valuable guidance for sorghum breeding.
In this study, we aimed to generate the first T2T gap-free genome for CHBZ using a combination of the latest sequencing technologies, including PacBio high-fidelity (HiFi) sequencing, Oxford Nanopore Technologies (ONT) ultra-long sequencing, and high-throughput chromosome conformation capture (Hi-C) sequencing. The genomic resources and gene structures produced by this study lay the groundwork for future research on CHBZ genetics and breeding.
Methods
Sample collection and sequencing
CHBZ plants grown in the germplasm resource nursery at the Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Taiyuan, 030031, China, were used for DNA and RNA sequencing. Fresh, healthy young seedlings were harvested, immediately frozen in liquid nitrogen, and stored at −80 °C. Genomic DNA was extracted using the cetyltrimethylammonium bromide method and evaluated with a NanoDrop One spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). PacBio HiFi libraries with an insert size of approximately 20 kb were prepared following the PacBio protocol “Using SMRTbell Express Template Prep Kit 2.0 With Low DNA Input” (Pacific Biosciences, USA) and sequenced on the PacBio Sequel II platform in circular consensus sequencing (CCS) mode. For ONT ultra-long sequencing, the library was prepared using the Oxford Nanopore SQK-LSK109 kit and sequenced on a PromethION flow cell (Oxford Nanopore Technologies, Oxford, UK). For Hi-C sequencing, libraries based on the DpnII restriction enzyme were generated as previously described14 and sequenced on the MGISEQ-2000 platform. Total RNA from roots, stems, leaves and spikes was isolated using the NEBNext Poly(A) mRNA Magnetic Isolation Module. DNase I (Thermo Fisher Scientific, Wilmington, DE, USA) was used to remove genomic DNA, and RNA integrity was checked using a BioAnalyzer 2100 (Agilent Technologies, Santa Clara, USA). RNA libraries with an insert size of 300 bp were prepared using the NEBNext Ultra RNA Library Prep Kit for Illumina and sequenced on an MGISEQ-2000 instrument to generate 150 bp paired-end reads.
In total, we generated 304.06 Gb of ONT reads (~419X coverage) with an N50 of 52.44 kb, 28.65 Gb of PacBio HiFi CCS reads (~40X coverage) with an N50 of 16.64 kb, 304.93 Gb of Hi-C data (Illumina paired-end reads, ~421X coverage), and 123.30 Gb of RNA-seq data (Tables 1 and 2).
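The coverage figures above can be checked with simple arithmetic. The following is a minimal sketch, assuming that depth was computed as total sequenced bases divided by the final assembly size of 724.85 Mb; the denominator is our assumption, since the text does not state which genome size was used, and the function name `fold_coverage` is illustrative only.

```python
# Sequencing depth (X) = total bases sequenced / genome size.
# Assumption: the final assembly size is used as the genome size.
ASSEMBLY_SIZE_MB = 724.85

# Data volumes in Gb, as reported in the text.
DATASETS_GB = {"ONT": 304.06, "HiFi": 28.65, "Hi-C": 304.93}


def fold_coverage(data_gb: float, genome_mb: float = ASSEMBLY_SIZE_MB) -> float:
    """Return sequencing depth (X) for a data volume in Gb and a genome size in Mb."""
    if genome_mb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb * 1000.0 / genome_mb


for name, gb in DATASETS_GB.items():
    print(f"{name}: ~{fold_coverage(gb):.0f}X")
```

Under this assumption the rounded depths agree with the reported ~419X, ~40X and ~421X.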
Genome assembly
The subreads generated on the PacBio Sequel II platform were processed using the CCS algorithm of SMRT Link (v11.1.0)15 with the following parameters: “--minPasses 3 --minPredictedAccuracy 0.99 --minLength 500”. Subsequently, we performed a genome survey on the HiFi reads using GCE (Genomic Character Estimator) (v1.0.2)16 with the parameter ‘-k 17’ to assess the size and heterozygosity of the CHBZ genome. The estimated genome size was approximately 724.95 Mb, with a heterozygosity rate of 0.10% and a repeat content of 67.74% (Table 3). After quality control with SOAPnuke (v2.0)17 using the parameters “-n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5”, a total of 303.56 Gb of clean Hi-C data was obtained. Primary contigs were generated from the PacBio HiFi reads, ONT reads, and clean Hi-C data using Hifiasm (v0.19.5)18 with default parameters. To anchor contigs onto chromosomes, we aligned the clean Hi-C data to the assembled contigs using BWA (v0.7.12)19 and filtered low-quality reads with the HiC-Pro pipeline20 using default parameters. The valid reads were then used to anchor contigs onto chromosomes with Juicer21 and the 3D-DNA pipeline22. Gaps in the assembled genome were closed with LR_Gapcloser23 using ONT long reads that had been error-corrected by NECAT24. To further refine the genome, the T2T assembly was polished using a method similar to that described by Mc Cartney, Shafin et al.25. Briefly, the HiFi reads were aligned to the T2T assembly using Winnowmap2 (v2.03)26, and the alignments were filtered with the ‘falconc bam-filter-clipped’ tool to remove all secondary alignments and alignments with excessive clipping. Finally, Racon (v1.5.0)27 was run on the filtered alignments.
Overall, the final genome assembly of CHBZ is about 724.85 Mb with an N50 length of 71.06 Mb (Table 3). The genome sequences were clustered and oriented into 10 pseudochromosomes (Fig. 1b, Table 4).
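For readers unfamiliar with the N50 statistic reported above, the following minimal sketch shows one common way to compute it from a list of sequence lengths. The function name `n50` and the toy lengths are hypothetical and for illustration only; they are not the CHBZ contig lengths, and this is not the code used in this study.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical toy lengths (Mb), not taken from the CHBZ assembly.
toy_lengths = [80, 70, 60, 50, 40]
print(n50(toy_lengths))  # 80 + 70 = 150 reaches half of 300, so this prints 70
```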
Repeat annotation
Two strategies, de novo prediction and homology-based prediction, were used to identify repeat elements. De novo repeats were identified with RepeatModeler (v1.0.4)28, and long terminal repeats were annotated with LTR-FINDER (v1.0.7)29. Transposable elements were detected at the DNA and protein levels with RepeatMasker (v4.0.7)30 and RepeatProteinMasker (v4.0.7), respectively, based on the Repbase database31. Finally, tandem repeats were identified with Tandem Repeats Finder (v4.10.0)32.
In the CHBZ genome, repeat sequences accounted for 510.36 Mb, representing 70.41% of the assembly (Table 3). Long terminal repeat (LTR) retrotransposons (55.75%) were the most abundant component among repetitive elements, which was consistent with the previous study1 (Table 5).
Protein-coding genes prediction and functional annotation
Gene prediction combined transcriptome-based, homology-based, and ab initio methods. For transcriptome-based prediction, 123.3 Gb of clean reads sequenced on the DNBSEQ-2000 platform from root, stem, leaf and ear tissues at four stages (seedling; jointing; heading and flowering; filling and maturity) were assembled with Trinity (v2.15.1)33 using the parameters ‘--max_memory 200G --min_contig_length 200 --genome_guided_bam merged_sorted.bam --full_cleanup --min_kmer_cov 3 --min_glue 3 --bfly_opts "-V 5 --edge-thr=0.1 --stderr" --genome_guided_max_intron 10000’, which generated 130,301 transcripts with an N50 of 2,702 bp (Table 6). The assembled transcripts were aligned to the T2T assembly with the Program to Assemble Spliced Alignments (PASA) (v2.4.1)34. Valid transcript alignments were clustered by genome mapping location and assembled into gene structures, and coding regions were obtained with TransDecoder (v5.7.1) (https://github.com/TransDecoder/TransDecoder) (PASA-set). In addition, the clean RNA-seq reads were mapped to the T2T assembly using Hisat2 (v2.0.1)35, and Stringtie (v1.2.2)36 and TransDecoder (v5.7.1) were used to assemble the transcripts and identify candidate coding regions as gene models (Stringtie-set). For homology-based prediction, the genomes of five plants, namely rice (T2T-NIP)12, foxtail millet (RefSeq assembly accession: GCF_000263155.2), maize (T2T Mo17)13, A. thaliana (Col-PEK)37, and BTx623 sorghum (RefSeq assembly accession: GCF_000003195.3), were downloaded. These sequences were used as queries to search against the T2T assembly using GeMoMa (v1.9)38 together with BAM files from the RNA-seq data, and the resulting predictions were denoted the “Homology-set”. For ab initio prediction, AUGUSTUS (v3.2.3)39 was used to predict coding regions in the repeat-masked genome. All predicted gene models were combined with EvidenceModeler (v2.1.0)40 into a non-redundant set of gene structures.
Evidence from different sources was assigned different weights: 10 for the PASA-set, 5 for the Stringtie-set, 5 for the Homology-set, and 1 for the AUGUSTUS predictions. Finally, the resulting gene models were further refined with PASA (v2.4.1)34 to add untranslated regions and alternative splicing information. The integrated gene set was translated into amino-acid sequences, which were aligned with Diamond (v0.9.30.131)41 at an E-value cutoff of 1E-05 against five public protein databases: KOG, SwissProt42, Kyoto Encyclopedia of Genes and Genomes (KEGG)43, the NCBI non-redundant database (NR), and Translation of European Molecular Biology Laboratory (TrEMBL). Protein domains were then identified with InterProScan (v5.30)44, and the Gene Ontology (GO) terms for each gene were extracted from the InterProScan results. Gene annotation identified 32,855 protein-coding genes, of which 32,746 (99.67%) were annotated in at least one functional database (Table 7).
Gene expression analysis
The raw RNA-seq reads were quality controlled with fastp (v0.19.5)45. The clean reads were then aligned to the CHBZ genome using Hisat2 (v2.1.0)46 with the following parameters: ‘--phred33 -p 5 --sensitive --no-discordant --no-mixed -I 1 -X 1000’. A matrix of mapped read counts was generated using htseq-count (v0.12.4)47, and gene expression levels were calculated using the fragments per kilobase of exon per million mapped reads (FPKM) method48.
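As an illustration of the FPKM normalization, the following minimal sketch applies the standard definition, in which the fragment count is divided by the exon length in kilobases and by the total number of mapped fragments in millions. The function name `fpkm` and the example numbers are hypothetical; how exon length was derived for each gene in this study is not specified in the text.

```python
def fpkm(fragment_count: float, exon_length_bp: int, total_mapped_fragments: float) -> float:
    """FPKM = fragments * 1e9 / (exon length in bp * total mapped fragments).

    The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and total mapped fragments must be positive")
    return fragment_count * 1e9 / (exon_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a gene with 2,000 bp of exons,
# in a library with 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```

A gene would pass the FPKM ≥ 1 expression threshold used in the Technical Validation if this value is at least 1 in one or more samples.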
Comparative genomic analysis
To identify syntenic relationships between the CHBZ and BTx623 (RefSeq assembly accession: GCF_000003195.3, publicly released by the Sorghum Consortium in 2017) genomes, we extracted the longest coding sequence (CDS) of each gene. These CDSs were input into JCVI (v1.1.18)49 with a minimum of 30 genes per block and the parameter “--cscore=0.99”. The analysis yielded 24,685 orthologous pairs: 24,639 (74.6%) in CHBZ and 24,637 (72.2%) in BTx623 (Fig. 2a).
Presence/absence variants (PAVs) between the CHBZ and BTx623 genomes were identified using a method similar to that described by Li, Xu et al.50. First, the CHBZ genome was divided using a sliding window (window size, 500 bp; step, 100 bp). All window sequences were then aligned against the BTx623 genome using BWA (v0.7.17-r1188)51 with the MEM algorithm (-w 500 -M -t 16). A sequence that failed to align to the BTx623 genome, or that aligned with <25% coverage, was defined as a CHBZ-specific sequence. To identify CHBZ PAV genes, the longest CDS of each gene was extracted, and genes with >75% of their CDS covered by CHBZ-specific sequences were defined as putative PAV genes. To exclude potential false positives, the longest CHBZ CDS sequences were mapped to the BTx623 genome using minimap2 (v2.24-r1122)52 with the parameters ‘-x splice -t 10 -k 12 -a -p 0.4 -N 20’. Genes with mapping quality >10 and coverage >25% or identity >90% were classified as false-positive PAV genes. In summary, our analysis yielded 161 CHBZ-specific PAV genes and 178 BTx623-specific PAV genes. Notably, 129 of the 161 CHBZ-specific PAV genes were expressed in at least one RNA-seq sample (Fig. 2b). Of the 178 BTx623-specific PAV genes, 163 overlapped with those identified in the pan-genomic study5.
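The windowing and CDS-coverage steps of the PAV procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: intervals are 0-based and half-open, the final window is clipped at the sequence end, and all function names (`sliding_windows`, `covered_fraction`, `is_putative_pav_gene`) are hypothetical. The alignment step with BWA and the minimap2 false-positive filter are not reproduced.

```python
from typing import Iterator, List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def sliding_windows(seq_len: int, window: int = 500, step: int = 100) -> Iterator[Interval]:
    """Yield windows of `window` bp every `step` bp (assumption: last window is clipped)."""
    start = 0
    while start < seq_len:
        end = min(start + window, seq_len)
        yield (start, end)
        if end == seq_len:
            break
        start += step


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so that no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(cds: List[Interval], specific: List[Interval]) -> float:
    """Fraction of CDS bases that overlap accession-specific sequence."""
    cds, specific = merge_intervals(cds), merge_intervals(specific)
    total = sum(end - start for start, end in cds)
    covered = sum(
        max(0, min(c_end, s_end) - max(c_start, s_start))
        for c_start, c_end in cds
        for s_start, s_end in specific
    )
    return covered / total if total else 0.0


def is_putative_pav_gene(cds: List[Interval], specific: List[Interval],
                         min_fraction: float = 0.75) -> bool:
    """Apply the >75% CDS-coverage rule described in the text."""
    return covered_fraction(cds, specific) > min_fraction


# Hypothetical gene with two CDS segments and one specific region.
print(list(sliding_windows(700)))
print(is_putative_pav_gene([(0, 100), (200, 300)], [(50, 260)]))
```

In the toy example, 110 of 200 CDS bases (55%) fall in the specific region, so the gene is not called a putative PAV gene.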
Identification of centromeres and telomeres
Following a method similar to that described for the wild blueberry T2T assembly53, centromeres and telomeres were identified using quarTeT (v1.1.1)54 with the “-c plant” option. quarTeT provides a comprehensive suite of tools for automating T2T genome assembly and analysis, including the TeloExplorer module for telomere identification and CentroMiner for predicting centromere candidates. Briefly, CentroMiner identifies tandem repeat monomers, selects potential centromeric repeats based on period and copy number, clusters them to minimize redundancy, and aligns representative monomers to the corresponding chromosomes. The centromere regions ranged in length from 4.31 Mb to 13.00 Mb. Notably, 98.58% of the centromere regions consisted of repetitive sequences, with LTR-Gypsy and tandem repeat sequences being the predominant categories (Table 8). We observed large blank regions in the Hi-C interaction heatmap at the centromeres, a phenomenon also reported for the Scutellaria baicalensis gap-free genome55 (Fig. 3).
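The idea behind motif-based telomere detection can be illustrated with a short sketch. This is not TeloExplorer's implementation; it simply counts the longest uninterrupted tandem run of the seven-base plant telomere repeat (‘AAACCCT’, the query used in the Technical Validation) or its reverse complement within a window at each chromosome end. The window size and all function names are hypothetical choices for illustration.

```python
import re

TELOMERE_MOTIF = "AAACCCT"  # seven-base plant telomere repeat unit
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_tandem_copies(seq: str, motif: str) -> int:
    """Copy number of the longest uninterrupted tandem run of `motif` in `seq`."""
    runs = re.findall(f"(?:{motif})+", seq.upper())
    return max((len(run) // len(motif) for run in runs), default=0)


def telomere_copies(chrom_seq: str, motif: str = TELOMERE_MOTIF, window: int = 10_000):
    """Return (left, right) tandem copy numbers at the two chromosome ends.

    Both strands are checked because the repeat appears as the motif at one end
    and as its reverse complement at the other. `window` is a hypothetical default.
    """
    rc = reverse_complement(motif)
    left, right = chrom_seq[:window], chrom_seq[-window:]
    return (
        max(count_tandem_copies(left, motif), count_tandem_copies(left, rc)),
        max(count_tandem_copies(right, motif), count_tandem_copies(right, rc)),
    )


# Toy sequence: 5 copies at the left end, 3 reverse-complement copies at the right end.
toy = "AAACCCT" * 5 + "ACGTACGTGG" * 10 + "AGGGTTT" * 3
print(telomere_copies(toy, window=50))  # (5, 3)
```

A chromosome end would then be called telomeric when its copy number exceeds a chosen minimum, which is a threshold this sketch leaves to the user.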
Data Records
The sequencing data have been deposited at the Sequence Read Archive56 under the accession number SRP472912 and the project identifier PRJNA1037263. The assembled genome sequence has been made available in GenBank57 under the Whole Genome Shotgun project accession GCA_040267525.1. Furthermore, files pertaining to the genome assembly, gene structure annotation, and repeat annotation have been archived in the Figshare repository58.
Technical Validation
Evaluation of the genome assembly
Multiple approaches were employed to validate the accuracy and completeness of the CHBZ genome assembly. We predicted centromeric sequences from the assembly and found that all 10 centromeres were captured (Fig. 4a). Additionally, we identified all 20 telomeres using the seven-base telomere repeat sequence (‘AAACCCT’) as a query (Fig. 4a). The number of telomere repeats in the CHBZ genome assembly was substantially higher than that in the BTx623 genome (Fig. 4b). The Hi-C heatmap was highly consistent across all chromosomes, providing evidence for the accurate sequencing, ordering, and orientation of contigs in the CHBZ genome assembly (Fig. 4c). The LTR assembly index (LAI) of the assembly was 23.63 (Table 3), higher than the LAI values of BTx623 (LAI: 19.52) and even the latest T2T assembly of wild blueberry (LAI: 20.22)53. Completeness was also supported by the high mapping rates of the two types of long reads, with 100% of ONT reads and 99.9% of HiFi reads aligning to the CHBZ assembly. In addition, the assembly achieved a quality value (QV) score of 61.60 (Table 3). Finally, the Benchmarking Universal Single-Copy Orthologs (BUSCO)59 test identified 99.01% of the 1,614 genes in the embryophyta gene set in the CHBZ assembly (Fig. 4d). Overall, these results demonstrate the high quality and reliability of the CHBZ genome assembly.
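To put the reported QV in perspective, the following minimal sketch converts between a quality value and a per-base error rate. It assumes that the QV is Phred-scaled, i.e. QV = −10·log10(error rate), which is the usual convention for consensus quality but is not stated explicitly in the text; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


rate = qv_to_error_rate(61.60)
print(f"QV 61.60 corresponds to about {rate:.2e} errors per base")
```

Under this assumption, a QV above 60 implies fewer than one expected base error per million bases.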
Evaluation of the gene annotation
First, the distributions of exon and intron lengths were consistent with those of three related species, supporting the reliability of the annotation (Fig. 5a,b). Second, a total of 32,284 genes (98.26%) received evidence-based support (Table 9). Furthermore, 32,746 protein-coding genes (99.67%) were annotated in at least one database and 19,252 (58.60%) were supported by all six databases (Table 7, Fig. 5c). Notably, 25,873 genes (78.75%) exhibited detectable transcriptional activity (FPKM ≥ 1) across the 15 RNA-seq datasets (Fig. 5d). Moreover, the predicted proteins achieved a complete BUSCO score of approximately 99.38%, indicating high-quality gene annotation (Table 3). Together, these results indicate that the gene annotation is accurate and complete.
Code availability
No specific code was developed for this study. The data analyses were conducted following the manuals and protocols provided by the developers of the relevant bioinformatics tools, which are described in the Methods section along with the versions used.
References
McCormick, R. F. et al. The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. The Plant Journal (2017).
Paterson, A. H. et al. The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556 (2009).
Deschamps, S. et al. A chromosome-scale assembly of the sorghum genome using nanopore sequencing and optical mapping. Nature Communications 9 (2018).
Cooper, E. A. et al. A new reference genome for Sorghum bicolor reveals high levels of sequence similarity between sweet and grain genotypes: implications for the genetics of sugar metabolism. BMC Genomics 20 (2019).
Tao, Y. et al. Extensive variation within the pan-genome of cultivated and wild sorghum. Nature Plants 7, 766–773 (2021).
Zhang, S. et al. Variation in mitogenome structural conformation in wild and cultivated lineages of sorghum corresponds with domestication history and plastome evolution. BMC Plant Biology 23 (2023).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science (2022).
Shi, X. et al. The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding. Horticulture Research 10 (2023).
Huang, H. et al. Telomere-to-telomere haplotype-resolved reference genome reveals subgenome divergence and disease resistance in triploid Cavendish banana. Horticulture Research 10 (2023).
Navrátilová, P. et al. Prospects of telomere-to-telomere assembly in barley: Analysis of sequence gaps in the MorexV3 reference genome. Plant Biotechnology Journal 20, 1373–1386 (2021).
Shang, L. et al. A complete assembly of the rice Nipponbare reference genome. Molecular plant 16, 1232–1236 (2023).
Chen, J. et al. A complete telomere-to-telomere assembly of the maize genome. Nature Genetics 55, 1221–1231 (2023).
Belton, J. M. et al. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58, 268–276 (2012).
Chin, C. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10, 563–569 (2013).
Wang, H. et al. Estimation of genome size using k-mer frequencies from corrected long reads. arXiv:2003.11817 (2020).
Chen, Y. et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience 1 (2018).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 1–6 (2021).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology 16 (2015).
Durand, N. et al. Juicer provides a one-Click system for analyzing loop-resolution Hi-C experiments. Cell Systems 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Xu, G. et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. GigaScience 1 (2019).
Chen, Y. et al. Efficient assembly of nanopore reads via highly accurate and intact error correction. Nature Communications 12 (2021).
Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nature methods 19, 687–695 (2022).
Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature methods 19, 705–710 (2022).
Vaser, R., Sovic, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome research 27, 737–746 (2017).
Chen, N. Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. Current Protocols in Bioinformatics 4 (2004).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research 35, W265–W268 (2007).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21(Suppl 1), i351–i358 (2005).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6 (2015).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 27, 573–580 (1999).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29, 644–652 (2011).
Haas, B. J. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31, 5654–5666 (2003).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 12, 357–360 (2015).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology 20, 278 (2019).
Hou, X., Wang, D., Cheng, Z., Wang, Y. & Jiao, Y. A near-complete assembly of an Arabidopsis thaliana genome. Molecular plant 15, 1247–1250 (2022).
Keilwagen, J. et al. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods in Molecular Biology (2019).
Stanke, M. & Morgenstern, B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic acids research 33, W465–W467 (2005).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 9, R7 (2008).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59–60 (2015).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Research 27, 49–54 (1999).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28, 27–30 (2000).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37, 907–915 (2019).
Anders, S., Pyl, P. T. & Huber, W. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
Roberts, A., Trapnell, C., Donaghey, J., Rinn, J. L. & Pachter, L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome biology 12, R22 (2011).
Tang, H. et al. Synteny and Collinearity in Plant Genomes. Science 320, 486–488 (2008).
Li, T. et al. Genome assembly of KA105, a new resource for maize molecular breeding and genomic research. The Crop Journal (2023).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Genomics (2013).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2017).
Zeng, T. et al. The Telomere-to-telomere gap-free reference genome of wild blueberry (Vaccinium duclouxii) provides its high soluble sugar and anthocyanin accumulation. Horticulture Research (2023).
Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Horticulture Research (2023).
Pei, T. et al. Gap-free genome assembly and CYP450 gene family analysis reveal the biosynthesis of anthocyanins in Scutellaria baicalensis. Horticulture Research 10 (2023).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP472912 (2024).
NCBI GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_040267525.1 (2024).
Wang, H. Genome assembly and annotation of Sorghum bicolor CHBZ. figshare. https://doi.org/10.6084/m9.figshare.24532924.v1 (2024).
Waterhouse, R. M. et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Molecular Biology and Evolution 35, 543–548 (2018).
Acknowledgements
This work was funded by the Project of Conservation and Utilization of Agricultural Germplasm Resources in Shanxi Province (sxzyk202201) and the Basic Research Program of Shanxi Province (20210302124238).
Author information
Contributions
P.G., M.X.L. and Z.X.M. conceived the study. M.L., S.H., H.G.W. and H.B.Q. collected and prepared the samples. C.H.C., X.K.Y. and J.B.J. performed the bioinformatics analysis. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, M., Chen, C., Wang, H. et al. Telomere-to-telomere genome assembly of sorghum. Sci Data 11, 835 (2024). https://doi.org/10.1038/s41597-024-03664-8