Sorghum (Sorghum bicolor L. (Moench), 2n = 2x = 20) is the world’s fifth economically most important cereal1; it is a staple particularly in the semi-arid tropics of Africa and Asia, representing 6.5 million square kilometers in over 55 countries, and home to more than 2 billion people of which 600 million are considered to be poor2. Sorghum is now becoming popular in the food industry worldwide, due to the rise in demand for gluten-free specialty grains rich in health-promoting and food oxidative stabilizing compounds3,4. Indeed, sorghum grains, particularly red varieties, exhibit the highest values of total antioxidant capacity (400–500 μmol of Trolox equiv/g) among several crops (e.g., wheat, rice, oats, barley, maize, potato)3,5 and plant food sources of natural antioxidants6,7. In addition to human food, sorghum is used also for several other purposes including energy and animal nutrition8; it is also resilient to biotic and abiotic stresses, adapted to diverse environments, requires low agricultural inputs, all of which makes it an important crop to enhance food and commodity security across the globe9,10,11.

The S. bicolor genome sequence was first released in 200912 and the current version 3.1.1 is sized 732.2 Megabases (Mb) and reports more than 34,000 annotated genes several of which can be used in genetic introgressions and genomic-aided improvement of yields and the quality of the products in this crop13. The sorghum reference genome is expected to facilitate resequencing experiments and genetic investigations in cultivated sorghum and its wild gene pool. In this work, we whole-genome resequenced and present the comparative information of S. bicolor and S. bicolor × S. halepense recombinant inbred lines; to our knowledge, no resequencing studies have been reported that accounted for such populations contemporarily.

Sorghum like any other crop can be genetically improved by introgressing genetic factors from wild relatives14. Sorghum breeders have shown interest in interspecific crosses between Sorghum bicolor and Johnsongrass [Sorghum halepense (L.) Pers.] which is a wild species natural allotetraploid (2n = 4x = 40) believed to have originated by the spontaneous hybridization between S. bicolor and S. propinquum (Kunth) Hitchc., followed by chromosome doubling15. Available evidence shows that Johnsongrass can confer a strong perenniality and overwintering in S. bicolor genetic background14,16,17,18,19. The ploidy of Sorghum halepense implies that its hybridization with S. bicolor requires the latter be either induced tetraploid or cytoplasmic genetic male sterile diploid; in both cases mainly tetraploid progeny is generated19,20, but cases of diploid descendants were observed21,22. The importance of introgressing perenniality in crops can be explained by the search for agroecological functions of a perennial cultivar as a long-lasting cover crop. The cover crops are environment friendly, help avoid bare soils, improve soil health, reduce agricultural inputs, foster biodiversity, which can make agricultural production more resilient to climate change adversities23,24. Bare soils represent one of the major failures of the conventional agricultural intensification as they cause soil and plant nutrients loss mainly through erosion and lixiviation. Perennial crops create permanent soil cover, recycle and stop loss of nutrients from fertilizers, allowing drastic reduction of rates of fertilizer application, and improving soil health. In addition, by permanently covering the soil, perennial crops limit soil moisture loss through evaporation, guarantee a high level of soil organic matter and an active soil biology, which in turn improves soil chemical and physical properties, and helps neutralize greenhouse gas emissions particularly through carbon sequestration, and hence, mitigates climate change25.

Like in other such studies26, the resequencing of sorghum populations described in this work allows capturing the natural variation across the gene pool through the identification of millions of variants among cultivated and wild S. halepense relative accessions. Such high-confidence polymorphisms will be used in forward genetics and linkage disequilibrium studies to unravel the genetic base of complex plant characteristics of agronomic importance, and the development of climate change resilient cultivars.

Sorghum breeders working on the introgression of perenniality in otherwise annual S. bicolor background select for overall sorghum bicolor plant aspect in addition to the overwintering trait. We also have observed from our experience in recent years14 that backcrosses are the most attractive as they show traits closely comparable to domesticated sorghum27 (panicle shape and compactness, bold and big seed size, absence of seed shattering, etc..) than single, double, or three-way crosses. We, therefore, used the data produced in this and previous studies from our laboratory to investigate the contribution of sorghum halepense in S. bicolor × S. halepense controlled hybridizations with particular interest in backcrosses involving two doses of S. bicolor as the recurrent parent. In this work, we describe the first whole-genome resequencing study, contemporarily evaluating S. bicolor and the progeny from S. bicolor × S. halepense crosses; we conducted a comprehensive structural and functional characterization of 172 lines 19 of which were recombinant inbred lines that inherited different proportions of S. bicolor and S. halepense. We interrogated the entire genome of these populations and produced a large set of robust and high-confidence variants that will sustain breeding and other genetic and genomic investigations in sorghum, including genomics-assisted breeding.


Genetic diversity and sequencing assessment

One hundred and seventy-two sorghum lines evaluated in this study clustered in two different populations of 153 S. bicolor genotypes and 19 S. bicolor × S. halepense recombinant inbred lines (Fig. 1). The Fstatistic (Fst) measuring the genetic structure and the level of genetic differentiation28,29 was significantly and moderately high (Fst = 0.31, p = 0.01); Fst values range from 0 in case of panmixis to 1 in case the populations do not share any genetic diversity. The first two dimensions of the principal coordinates explained 70.9% of the total genetic diversity existing in the studied populations. The one hundred and seventy-two sorghum lines were whole-genome resequenced and a corresponding number (172) of paired-end sequencing libraries constructed each with the insert size around 300 base pairs. Resequencing yielded 22.88 billion paired-end reads resulting in 3.43 trillion bases (nucleotides) and 2.6 TB of high-quality raw data. Ultimately, a total of 21.70 billion and 3.25 trillion of clean paired-end reads and bases were produced, respectively. Overall, 94.54% of total clean reads showed quality value Q20 ≥ 94.54%, which indicates a high data quality. Sorghum bicolor (Sb) and S. bicolor × S. halepense (SbxSh) showed comparable clean reads quality (Q20 = 96.17–96.38% and Q30 = 87.91–88.22%), number of clean reads (125.7–126.19 × 106) and bases (18.82–18.89 × 109), and the ratio clean bases: raw bases (94.64–94.89%). However, the guanine-cytosine (GC) rate was higher in SbxSh relative to Sb (43.42% a vs. 43.26% b) in the clean data.

Figure 1
figure 1

Genetic differentiation analysis in Sb and SbxSh populations. The ellipses are drawn accounting for the 95% confidence interval and the Euclidean distance from the center “o”. Genotypes outside the ellipses are outliers74.

Sequence alignment

Alignment assessment

Sequence reads were aligned to the Sorghum bicolor (BTx623) reference genome whose size was 732,200,000 bp, while the effective size was 675,973,270 bp (N base excluded), and GC content of 41.82%. The mapping rate i.e., the percentage coverage of the reference by reads of the sorghum samples varied from 89.15 to 95.18% with a mean of 92%. The percentage of mapped reads and that of mapped bases were identical and ranged from 82.65 to 99.92% with an average of 99.40%. The effective mapping depth i.e., LN/G, where L is the read length, N is the number of reads and G is the haploid genome length, was between 23.17X to 34.38X. In this work, SbxSh and Sb showed comparable mapped reads (99.04–99.45%), mapped bases (99.04–99.45%), sequencing depth after mapping (26.54X–26.64X), while SbxSh showed statistically significant higher percentage coverage rate (94.87a vs. 91.62b), yet lower percentage unique hit bases (81.96a vs. 77.58b) and unique hit reads (82.39a vs. 78.31b).

SNP calling and annotation

The sequence alignment of the target sorghum lines to the reference genome BTx623, the gene models, and the information derived from the reference genome allowed to identify large numbers of SNPs, InDels, CNVs, and SVs (Table 1; Fig. 2). A total of 567,046,841 SNPs was uncovered from these sorghum genomes. On average, 10,515,367.4 SNPs were observed per individual, of which 1,855,062 were located in the genic regions. The statistical analysis showed that SbxSh had more total and heterozygous SNPs, synonymous_CDS, nonsyn_CDS, exonic, genic, intronic, mRNA, pseudogenic, transcript, and tRNA SNPs, whereas Sb showed more homozygous SNPs. Among the synonymous and nonsynonymous SNPs mapped in the coding regions in either population, the synonymous SNPs represented 49% and 54%, respectively, in Sb and SbxSh populations. The SbxSh population contained 81% and 78%, respectively, of all synonymous and nonsynonymous SNPs. Contrary to synonymous mutations, nonsynonymous mutations cause variation in coding amino acids and are considered to play a significant role in changing the phenotype of organisms. Besides, nonsynonymous mutations are also strong candidates to explain the phenotypic diversity between different individuals in a population.

Table 1 Tukey HSD for SNPs, InDels, SVs, and CNVs in S. bicolor vs. S. bicolor × S. halepense results.
Figure 2
figure 2

Chromosomal distribution of WGRS information in Sb and BC1 SbxSh lines, and GWAS data from the evaluated populations74. The x-axis corresponds to the genomic coordinate. Tracks 1–4: visualize the genomic density of regions (defined as the fraction of a genomic window that is covered by genomic regions). Tracks 5, 8: significant SNPs and candidate genes are displayed according to their genomic coordinates (x-axis), while y-values were set for the sole purpose improving the resolution (legibility) of the corresponding SNPs and candidate genes.

We further analyzed the distribution of the large-effect SNPs i.e., those with a potential to disable gene functions26,30. In this work, large-effect SNPS included premature stop codon, stop codon to non-stop codon, start codon to non-start codon, and splice sites. It was found that within the 10,140 SNPs participating in codon premature termination, 2,970 SNPs disrupt splicing donor or acceptor sites of genome, 13,976 SNPs are related to alteration of initiation methionine residues, and 1,144 SNPs replace terminators with certain amino acid residues that leads to longer ORFs. The statistics are depicted in Fig. 3 where SbxSh showed higher numbers of large-effect SNPs than Sb. In SbxSh population, it was found an average of 1967, 500, 370, and 700 SNPs expected to induce premature stop codon, stop codon to non-stop codon, start codon to non-start codon, and splice sites, respectively, while in Sb it was found a maximum of 1183, 340, 100, and 340, respectively. In both populations, SNPs inducing premature stop codon were most represented with respect to the other large-effect SNPs.

Figure 3
figure 3

Statistics of different types of large-effect SNPs. Re-sequenced entries are represented on x-axis, numbers on top of each bar represent the number of SNPs74.

InDel calling and annotation

A total of 91,825,474 indels was identified of which 24% and 76% resided in Sb and SbxSh, respectively; mean individual insertions (211,283.62 vs. 649,218.63) and deletions (221,602.78 vs. 697,826.37) were statistically higher in SbxSh relative to Sb population. The genome-wide distribution of short InDels (1–10 bp) showed a lower number of these variants in genes and coding regions compared with Pseudogenes and mRNA, for instance (Table 1). Our result show that indels that are not multiples of 3 bp and produce frameshift mutations are particularly uncommon in coding regions. Frame shift mutation in CDS region, 3X-shift mutation in CDS region, 3X-shift mutation in CDS region phase 0, and 3X-shift mutation in CDS region phase No 0 were statistically higher in SbxSh than in Sb i.e., 16,544, 19,780.74, 6301.53, 13,479.21 vs. 5954.58, 6684.15, 2114.42, 4569.73, respectively. A frameshift mutation results from an insertion or deletion of a number of nucleotides that is not a multiple of three. The change in reading frame alters every amino acid after the point of the mutation and results in a nonfunctional protein. The comparative effects of frame-shifting (e.g., 1-, 2-, 4-, 5-, 7-, 8-, 10-bp.) non-frame-shift (e.g., 3-, 6-, 9-bp) shows that the former short InDels provide a much powerful explanation of the difference of traits between individuals30.

Structure and copy number variations

In this study, a total of 1,532,171 SVs was identified and found statistically comparably distributed between the two populations. Of the observed SVs, Sb and SbxSh showed statistically comparable individual mean numbers of deletions (4202 vs 4119), other SVs (4699 vs 4594), but SbxSh displayed more insertions than Sb, i.e., 72.47 vs 21.01 individual mean SV number. A total of 4,973,961 CNVs was generated from the entire population, with SbxSh producing statistically higher number of CNVs than Sb (41,296.21 vs. 27,381.26), higher CNV up-regulation (16,650.26 vs. 10,567.82) and down-regulation (24,214.95 vs 16,345.06).

Genetic variation in S. bicolor and S. bicolor × S. halepense hybrids

One of our working hypotheses was that some of the identified genetic variation might contribute to the phenotypic differentiation between S. bicolor and S. bicolor × S. halepense which pushed us to focus our analysis on SNPs in genic regions. Sorghum breeders working on the introgression of perenniality in otherwise annual S. bicolor background select for overall sorghum bicolor plant aspect in addition to the overwintering trait. We also observed from our experience in recent years14 that backcrosses are the most attractive as they show traits closely comparable to domesticated sorghum (panicle shape, seed size, etc..) than single, double, or three-way crosses. We therefore used the data produced in this and previous studies from our laboratory to investigate the contribution of Sorghum halepense in SbxSh controlled hybridizations. As Fig. 2 shows, SbxSh9 backcross line has more genes and more SNPs, CNVs, and indels than S. bicolor Sb1; nonetheless, the two lines displayed comparable number of SVs. The same pattern was observed across populations. The density of genes, SNPs, indels, and SVs increases from the pericentromeric region towards the telomeres, with SNPs/genes and short indels showing similar distribution pattern. In both populations, the distribution of CNVs was homogeneous from the centromeres to telomeres in all chromosomes. In addition, sorghum biomass related significant SNPs and candidate genes recently uncovered8 in these populations i.e., Dw genes (Dw1, Dw2, Dw3, Dw4), Ma genes (Ma1, Ma2, Ma3, Ma5, Ma6), gibberellin (GA) associated genes (SbGA2ox1, SbGA3ox1, SbGA2ox7), genes involved in controlling heading date (SbZCN8)1,31 and GA signaling and plant height regulation (SbSLR11) were mostly localized towards the distal and proximal ends of the chromosomes of interest (Fig. 2).

The analysis of genes harboring SNPs showed more genes (18,785) were shared between Sb and BC1 crosses, with relatively fewer genes i.e., 109, 230, and 291 being private to the respective three lines Sb1, SbxSh50, and SbxSh9 (Fig. 4). The analysis of the biological processes (BP) GO enrichment associated to SbxSh genes showed 33,693, and 5548 genes in the Sorghum bicolor reference genome dataset and in SbxSh9/SbxSh5 hybrids gene set that mapped to the GO terms (either directly or through inheritance), respectively (Table 2). The most granular terms included enrichment in genes associated to the plant-type cell wall organization (GO:0009664), root development (GO:004836), cell wall polysaccharide metabolic process (GO:0010383), glutathione metabolic process (GO:0006749), hydrogen peroxide catabolic process (GO:0042744), anatomical structure morphogenesis (GO:0009653), response to oxidative stress (GO:0006979), proteasomal protein catabolic process (GO:0010498), response to oxygen-containing compound (GO:1901700), generation of precursor metabolites and energy (GO:0006091), translation (GO:0006412), hormone-mediated signaling pathway (GO:0009755), response to abiotic stimulus (GO:0009628), vesicle-mediated transport (GO:0,016,192), ribonucleoprotein complex biogenesis (GO:0022613), protein-containing complex subunit organization (GO:0043933), intracellular transport (GO:0046907), cellular component assembly (GO:0022607), protein localization (GO:0008104), organelle organization (GO:0006996), nitrogen compound transport (GO:0071705), organic substance transport (GO:0071702), small molecule metabolic process (GO:0044281), regulation of transcription, DNA-templated (GO:0006355), nucleobase-containing compound metabolic process (GO:0006139), positive regulation of translation (GO:0045727).

Figure 4
figure 4

Number of shared and private genes among Sb1 (IESV 99,091 DL) and two sister perennial RILs (SbxSh9 and SbxSh50) derived from SbxSh cross backcrossed (2 recurrent parent doses: Tx623*2/Gypsum 9; BC1) to S. bicolor74.

Table 2 Hierarchical relations between over-represented (enriched) functional classes using GO biological process annotation in private genes of S. bicolor*2/S. halepense relative to S. bicolor lines.

To root development GO term (GO:004836) mapped 34 genes: GeneID:8080001, GeneID:8064680, GeneID:8054879, GeneID:8075742, GeneID:8059541, GeneID:8078975, GeneID:8081609, GeneID:8055737, GeneID:8055874, GeneID:8063307, GeneID:8058361, GeneID:8084663, GeneID:8064471, GeneID:8079326, GeneID:8079141, GeneID:8060669, GeneID:8063006, GeneID:8060622, GeneID:8080905, GeneID:8077286, GeneID:8060000, GeneID:8058075, GeneID:8074440, GeneID:8082391, GeneID:8065800, GeneID:8080849, GeneID:8085583, GeneID:8071472, GeneID:8084890, GeneID:8072111, GeneID:8063311, GeneID:8054193, GeneID:8059201, GeneID:8082281.


The development of perennial sorghum initiated in 2015 in our breeding program; earlier morpho-agronomic performances of evaluated materials was reported in our previous works14. According to our experience and available literature19, the production of rhizomes is the sine qua non condition for SbxSh lines to remain perennial under temperate climates like the prevalent conditions in our Italian experimental stations14. One of the most interesting findings from recent studies is the absence of negative trade-off between rhizome development and seed yield and aboveground biomass yield. This is expected to allow the development of high-yielding biomass, grain, and dual purpose sorghum ideotypes expressing perennating belowground structures14,19. Sorghum bicolor × S. halepense perennial lines showed competitiveness relative to commercial hybrids in terms of above-ground biomass production and grain yields; however mostly backcross-derived lines showed overall plant aspect27 and domestication traits such as high seed yield, big caryopses, seed shattering resistance, compact inflorescence and stalk strength closer to S. bicolor than other crosses14,20.

Grain yield was significantly correlated with maturity, dry mass yield, dry mass fraction of fresh material, number of culms, rhizome development, hemicellulose, and rhizome survival but important correlation coefficients were observed for maturity, number of culms and rhizome development. The traits that showed significant medium to high32 correlation with aboveground dry mass yield included plant height, dry mass fraction of the fresh material, number of culms, neutral detergent fiber.

In this work, we completed the first whole-genome resequencing analysis of a unique panel made up of Sorghum bicolor and S. bicolor × S. halepense populations. Our resequencing focused on S. bicolor × S. halepense recombinant inbred lines (RILs) instead S. halepense per se and this can be explained by our effort to align the resequencing experiment with our breeding program to develop perennial sorghum cultivars14. The SbxSh RILs were developed through crossing and selection to minimize the S. halepense associated linkage drag e.g., seed shattering, small-sized kernels, excessive tillers and rhizomes, and undesirable inflorescence compactness and shape in our breeding populations. The resequencing was therefore expected to explain the parental lines contributions to the genomic composition of the perennial hybrid combinations. The use of wild relatives in genetic introgressions is generally accompanied by linkage drag associated with the introduction of unfavorable traits along with the favorable ones33, and this necessitates a significant and time-consuming breeding effort to recover the domesticated phenotype, particularly when the primary produce is the grain18,34. This resequencing experiment is important in plant breeding, particularly in sorghum. Novel variants will be used for gene discovery, while a great number of uncovered high-quality polymorphisms will be harnessed in the process of genomic selection, genome-wide association studies, and marker-assisted selection. As these are founder populations for our entire breeding program, such opportunity the resequencing offered cannot be overemphasized35, particularly in terms of increased genomic predictions and mapping precision of quantitative trait loci. Furthermore, no genotyping platforms e.g., arrays, chips as those in use in other crop species e.g., tomato, potato or pepper, have been developed so far in sorghum for high-throughput genotyping of sorghum traits, particularly those associated with perenniality26. The major aim of this work was therefore to develop a large repertoire of genomic information and polymorphism data sets that can be used for gene discovery and validation, and as a source of markers to build genotyping platforms for applied breeding purposes.

In this work, a total of 21.70 billion and 3,25 trillion of clean paired-end reads and bases were produced, respectively. Overall, 94.54% of total clean reads showed quality value Q20 ≥ 94.54%; such quality value exceeded 96% when calculated in S. bicolor and S. bicolor × S. halepense separately, which indicates a high data quality. The mapping rate i.e., the percentage coverage of the reference by reads of the sorghum samples varied from 89.15% to 95.18% with a mean of 92%, while the percentage of mapped reads ranged from 82.65% to 99.92% with an average of 99.40%, indicating a high sequencing accuracy and the absence of contaminating DNA. The effective mapping depth was between 23.17X to 34.38X, which was largely sufficient (Zheng et al. 2011) for aligning most of the sequences of the target samples, and testified to the high quality of the reference genome. The mapping depth obtained in this study was higher than in most previous mapping experiments that showed values around 10X36,37, and encompassed the entire reference genome length in homogeneous pattern in all accessions. The mapping rate and the percentage of mapped reads realized in this study were better than reported in previous works, and confirm the high quality of the reference sequence used. For instance, Gramazio et al.26 reported a mean mapping rate of 85.4% with a range from 76.9 to 88.7%. On the other hand, in some model species, the average rates of unmapped reads were higher e.g., 3–5% in tomato38,39 and 10–15% in rice40,41. Differences in mapping experiments can be attributed to a variety of factors including differences in: (1) the progress of the sequence assembly, (2) the levels of repetitive elements, (3) genetic divergence between the sequenced samples and the reference genome, and (4) the levels of variants polymorphisms38,41.

Sorghum bicolor × S. halepense and S. bicolor populations showed comparable mapped reads, mapped bases, sequencing depth after mapping, while Sorghum bicolor × S. halepense showed statistically significant higher percentage coverage rate (94.87a vs. 91.62b), yet lower unique percentage hit bases (81.96a vs. 77.58b) and unique percentage hit reads (82.39a vs. 78.31b). Since Sorghum bicolor (Sb) and S. bicolor × S. halepense showed comparable clean reads quality (Q20 and Q30), and the number of clean reads and bases, the higher coverage rate observed in S. bicolor × S. halepense can be attributed to the lower rates of uniquely mapped reads and hence the existence of reads mapping to multiple reference genomic loci with low level of sequence similarity with the target sequence. The existence of multi-mapped reads in S. bicolor × S. halepense can be explained by this population producing statistically higher number of short indels, insertions of long fragments (at least 50 bp), CNVs, CNV up-regulations and down-regulations than S. bicolor42. A relatively lower level of sequence similarity was expected between S. bicolor reference sequence and S. bicolor × S. halepense in virtue of the genetic distance that existed between the two genomes (Fig. 1) deriving mainly from S. halepense. Under such circumstances, several authors39 pointed out the need for sequencing and assembly several reference genomes from crop wild relatives to avoid biased resequencing analyses and to improve the rate of uniquely mapped reads. However, since the dynamics of gene gains and losses during plants evolution and particularly during the interploid hybridization between S. bicolor and S. halepense is not yet fully understood, other reasons may explain the higher genome coverage of S. bicolor × S. halepense.

The SbxSh population showed a greater degree of heterozygosity (Table 1), which is consistent with previous genetic analyses34, and can be explained by the genetic history of S. bicolor that underwent the bottleneck of domestication, resulting in a narrowing of the genetic base with respect to wild species, while the tetraploid nature of S. halepense and its progeny may have played a major role in the heterozygosity observed in the SbxSh population; the fixation of alleles requires a higher number of generations in polyploids, and heterozygosity decreases slowly even in the presence of repeated cycles of self-fertilization43. Furthermore, the whole-genome resequencing reads were aligned to the S. bicolor reference genome13; alignment of sequences from an allotetraploid SbxSh to a diploid genome can result in an overestimation of heterozygous loci due to alignment of homeologs. In S. halepense homeologs derived from orthologs in the genomes of its ancestors (S. bicolor and S. propinquum) are conserved, but following S. bicolor × S. halepense hybridization it becomes difficult to predict the fate of such homeologs across generations because of different possibilities of chromosome pairing and independent assortment at meiosis44. It is nonetheless expected that at least some of the homeolog chromosome pairs can be maintained and contribute to increasing the heterozygosity of the recombinant inbred lines.

Our study has identified a large set of polymorphisms, consisting of 665,378,447 of high-quality variants including SNPs, indels, SVs, and CNVs; SNPs represented 85.22% of all variants, which is in agreement with previous works26. The identification of more SNPs in the present sorghum panel represents a good breeding opportunity as these markers are cheaper and easy to automate for high-throughput genotyping with respect to other markers45,46. The whole-genome resequencing completed in this work is therefore the starting point to develop a large number of markers not only in S. bicolor but also in Sorghum wild relatives such as Sorghum halepense, for which the lack of such information slowed down their use in breeding programs39,47. There are success stories on the opportunity to harness variants polymorphisms from crop wild relatives e.g., in soybean37,48, rice41, and tomato39,49, and eggplant26. In our study, SbxSh population produced more variants than Sb population, which confirmed the findings in previous works showing that crop wild relatives yield more variations relative to landraces or cultivated accessions. Our study highlights therefore the possibility for a controlled introgression of the variation from S. halepense to broaden the genetic basis of S. bicolor; similar introgressions were achieved in other crops e.g., rice, tomato, and wheat50. To our best knowledge, our study represents the first effort to harness the valuable large pool of genetic diversity from S. halepense using whole genome resequencing. Similar panels were evaluated in previous studies but relied on genotyping-by-sequencing platforms that showed technical limitations particularly associated with very low sequencing depth (~ 1.5X) and poor coverage8,34. Examples of such previous genotyping-by-sequencing-based investigations include linkage disequilibrium studies on biomass and biomass-related trait in sorghum8, antioxidant traits in sorghum34, and Sorghum plant architecture51. In addition, Habyarimana and Lopez used genotyping-by-sequencing SNPs to carry out genomic prediction and selection in sorghum52,53. The information produced in this work and the variants identified from S. halepense are expected to accelerate the introgression of perenniality and other useful genomic regions of this rhizome-producing species18 to develop superior climate/resilient and agroecological practice compliant sorghum cultivars. A large number of high-confidence polymorphisms was also identified in S. bicolor population and will be harnessed for high-throughput genotyping in cultivated or wild sorghum species using high-throughput genotyping platforms e.g., arrays or chips. In this work, genotypes were called individually for each sample for all variants but, for SNPs, we also performed joint genotyping across samples to produce a multi-sample VCF call-set for further investigations. The multi-sample VCF call-set produced vcf files of 33 and 6 Mb of SNPs and indels, respectively, with a good coverage and sequencing depth. These matrices will be used to provide more insights and improving previous studies particularly in domestication, genomic predictions, genome-wide association studies, and phylogenetics26.

The Fst statistics which is a metric of population structure, confirmed previous studies8 showing that Sb and SbxSh form two distinct populations. In addition, the differentiation between the two populations is supported by the higher number of variants observed in the SbxSh population, particularly SNPs, large-effect SNPs, CNVs, SVs, indels and frameshift mutations. The Fst achieved in this work was higher than or comparable to previous reports8,54; the observed Fst discrepancies can be attributed to differences in the number of markers used, population genetic diversity, and in the sampling approaches implemented in these works. An outlier (SbxSh102) was identified that is genetically closer to S. bicolor population. The SbxSh102 consists of two doses of S. bicolor recurrent parent (Tx623) in the S. bicolor × S. halepense (Gypsum 9) controlled cross and this perennial RIL is of future genetics and breeding interests in the development of perennial Sorghum bicolor ideotypes.

Our study produced genes and variants with a higher density that better covered entire lengths of individual chromosomes than in previous works52,55,56. The density of genes, SNPs and indels showed similar chromosomal distribution pattern, increasing from the pericentromeric region towards the telomeres. Such better variants distribution pattern is expected to boost novel gene and major marker discovery. In previous works8,31,57, uncovered significant SNPs and candidate genes were mostly localized towards the distal and proximal ends of the chromosomes of interest. Our whole-genome resequencing work produced high density of marker variants covering entire genome and offers therefore the opportunity to uncover novel major marker variants and genes in the pericentromeric regions that currently show scarce such information. The analysis of the biological processes GO enrichment associated to SbxSh genes showed 5,548 private genes that mapped to the important GO terms; 34 of these genes mapped to root system development (GO:0022622) two (GeneID:8054879 and GeneID:8072111) of which were reported to govern root properties58,59. Studies conducted in Arabidopsis thaliana showed that ROOT PRIMORDIUM DEFECTIVE 1 (RPD1; GeneID:8054879) is required for the maintenance of active cell proliferation and plays a critical role in the development of roots58, while the RETARDED ROOT GROWTH (RRG, GeneID:8072111) gene is predominantly expressed in the root meristem and encodes a mitochondria-localized protein that is required for cell division in the root meristem (Xiaojing Zhou et al.). The variants and functional analyses conducted in this work showed that mining the SbxSh private variants and genes can provide insights on genetic factors controlling plant characteristics capable of boosting sorghum genetic improvement, particularly the perenniality trait that is compliant with agroecological practices, sustainable agriculture, and climate change resilience.


This work generated the first whole genome map of SNPs, indels, SVs, and CNVs in a sorghum panel that includes S. halepense genome, which can be used as a framework for future investigations in functional genomics and genome-assisted breeding. Sorghum is the world’s fifth economically most important cereal and is a staple particularly in the semi-arid tropics of Africa and Asia, and is globally used for varied purposes including food, feed, fodder, commercial grade alcohol, as well as first, second, and third generation biofuels. The variants (SNPs, indels, SVs and CNVs) uncovered herein will boost genomic studies e.g., genomic prediction and selection, linkage and linkage disequilibrium analyses, molecular basis of several sorghum plant characteristics, all of which can build breakthroughs to achieve significant genetic gains in sorghum crop.

Materials and methods

Plant materials and sequence data sets

One hundred seventy-two sorghum genotypes including 19 S. bicolor × S. halepense (SbxSh) advanced inbred lines and 153 S. bicolor (Sb) lines were whole-genome resequenced. Sorghum bicolor genotypes were comprised of tropical landraces, improved lines therefrom, and temperate breeding lines. The SbxSh lines were at different levels (F4–F7) of filial progeny generated from crosses involving annual/perennial (A/P) genotypes and A/P backcrosses to annual recurrent parents (A*2/P; BC1), perennial/perennial (P/P) and annual/perennial//perennial (A/P//P) hybrid combinations followed by cycles of selection. The annual parental lines were standard diploid (2n = 20), induced tetraploids (2n = 40), cytoplasmic-genetic male-sterile, and genetic male-sterile inbred sorghum lines. Perennial parental lines consisted of a S. halepense plant and tetraploid lines derived from controlled hybridization of induced tetraploid sorghum plants with S. halepense. These populations were described in Habyarimana et al.8,14. Interspecific hybridization techniques between Sb and Sh were recently described by Hodnett et al.60.

Total genomic DNA was extracted from 10-day old etiolated sorghum seedlings grown under standard glasshouse conditions, according to the cetyl trimethylammonium bromide method61, with minor modifications. DNA integrity was evaluated with agarose gel electrophoresis, DNA quality assessed through the 260/280 and 260/230 nm ratios from NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, Delaware, USA) and concentration measured with a Qubit® 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). High-quality DNA samples were shipped to BGI Tech Solutions (Hongkong) Co., Limited for libraries construction and whole-genome resequencing. Paired-end libraries were prepared with an insert size of approximately 300 bp and sequenced on Illumina platform DNB-SEQ PE150 according to the supplier’s protocol, producing 20X sequencing depth resulting in 15 G bases per sample. The BTx623 sorghum reference genome sequences were downloaded from the Joint Genome Institute Phytozome website62,63.

Bioinformatics analyses

Sequence processing, mapping, and polymorphisms calling

The raw sequences were processed with the supplier’s in-house SOAPnuke filter to obtain clean reads by discarding reads with more than 50% adaptor sequence, low quality reads for which more than 50% bases display Phred score less than 20, and reads with 2% or more “N” bases i.e., any base. The processed reads were then mapped to the reference genome of S. bicolor (BTx623) version 3.1.113 using Burrows-Wheeler Aligner (BWA)64. BWA showed good performance aligning relatively short nucleotide sequences against a long reference and producing accurate and fast results with low error rates. It provides flexible parameter setup and the output of the alignment is presented in SAM format65. Picard-tools (v1.118)66 is used to sort the SAM files by coordinate, converted them to BAM files, and mark duplicated reads to be discarded by the Genome Analysis Toolkit (GATK) during downstream analyses. BAM files were further processed for mate-pair information repairing, read group information adding, and duplicate reads labeling; such postprocessed BAM files are readily used for variation detection. Single Nucleotide Polymorphisms (SNPs) and small Insertion/Deletions (InDels) are detected by GATK67, BreakDancer68 is used for Structure Variants (SVs) calling and SOAPcnv69 for Copy Number Variants (CNVs) calling. Genotypes were called individually for each sample for all variants but, in addition, for SNPs, we also performed joint genotyping across samples to produce a multi-sample VCF (Variant Call Format) call-set for future genomic predictions and linkage disequilibrium analyses.

SNP calling and annotation

To detect high-quality SNPs we first calculated the likelihood of each sample’s genotype using SOAPsnp70 and the genotype with the highest probability was selected as the genotype of the sequenced individual at the specific locus. Next, we selected a polymorphic locus against the reference sequence using the target consensus sequence, and based on the resequencing data of 172 samples, we determined SNPs located in effective sites with sufficient quality i.e., responding to the following criteria: 3 ≤ depth ≤ 50, with depth calculated using data from each individual, average mappable sites < 1.5, and an average quality for the novel allele > 20. The SNPs were localized in splice sites, start codons, stop codons, coding and noncoding regions, and other nucleic acid molecules based on annotated gene models in S. bicolor genome reference database13.

Short InDel detection

To identify short indel we mapped the paired-end reads to the reference sequence allowing up to 10-bp gaps, merged these redundant pairs, and gaps that were supported by at least three non-redundant paired-end reads were extracted. A potential indel was identified when the number of the un-gapped reads that crossed a potential indel was no more than twice that of the gapped reads. The final high-quality indels included only those identified on both strands by paired-end reads.

Structure variation detection

Structure variation (SV) includes deletion, insertion, duplication, inversion and transposition of long fragment (at least 50 bp) in genome. In this study, we used SOAPsv64 to detect SVs based on the principle of paired-end64 i.e., that one of the two reads of paired-end should align onto the forward chain, while the other should be aligned onto the negative (reverse) chain. In addition, the distance between the two reads after the alignment should amount the size of the insert, and pairs of two reads should have a normal orientation and a suitable span when aligned to the genome. Should the orientation or span of pairs of two reads be not consistent with alignment expectations, structural variations may be involved in that region. The abnormal paired-end alignments are analyzed by clustering and the result compared with predefined SV types. A threshold of 3 abnormal paired end reads is required to support the SV existence, while SVs that were supported by at least six paired end reads were considered of high quality and identified as the final SVs in this work.

Copy number variation detection

We detected CNVs by the following steps: (i) DNA sequences were separated into fragments according to the depth of each base from the alignment results; (ii) we calculated the P-value for each fragment to estimate its probability to be a CNV; and (iii) fragments that passed the criteria (fragment length longer than 2 kb, Pvalue ≤ 0.35, mean depth less than 0.5 or more than 2.0) were kept as CNVs. The P-value was calculated as the probability of each observed depth (d) under the distribution of a simulated Poisson distributed data set whose expected value (E(d)) equals the observed mean depth. If d < E(d), the P-value = P(x ≤ d) × 2, else P-value = P(x ≥ d) × 2. The credibility of a CNV is inversely proportional to the P-value.

Sorghum halepense-associated genes and gene ontology (GO) enrichment analysis

Gene ontology and in-depth molecular genetic investigation were carried out on one dual purpose Sb line derived from an improved tropical variety (IESV 99091 DL) and two sister perennial RILs (SbxSh9 and SbxSh50) derived from rhizome-growing SbxSh cross backcrossed (2 recurrent parent doses: Tx623*2/Gypsum 9; BC1) to cytoplasmic male sterile recurrent Sb parent, at more than six generations of selfing. A set of 12,484 single nucleotide polymorphism-containing genes identified in the two sister SbxSh RILs but not in IESV 99091 DL were selected as the candidate gene set associated with S. halepense (Fig. 4). These genes were mapped to gene ontology (GO)26 to evaluate their characteristics, using PANTHER Gene List Analysis tools71. PANTHER takes a set of genes and compares the frequency of GO terms in the sample set with the frequency of the same set of GO terms in the reference set to identify terms that are over- or underrepresented in the sample set. In this work, we conducted PANTHER overrepresentation test using the GO Ontology database released May 01, 2021. The reference list consisted of Sorghum bicolor (all genes in database), while the annotation data sets were “GO molecular function complete”, “GO biological process complete”, and “GO cellular component complete”, which are the datasets with the complete, up to date GO annotations. The binomial test72 was used and the Bonferroni correction applied to account for multiple testing (one for each pathway, or each ontology term) at the same time. Only Bonferroni-corrected results with a probability level P < 0.05, were considered significant i.e., the lower the P value, the less likely the obtained result can be explained by random distribution.

Genomic mapping of biomass yield and biomass relevant traits loci in the evaluated population

The information from Habyarimana et al.8 was used in this work for genomic physical mapping of biomass related SNPs, candidate genes, and genes known to underpin sorghum plant height and maturity. The transcripts of known genes were identified on phytozome63. In their work Habyarimana et al.8, genome-wide association study was performed using the statistical genetics package Genome Association and Prediction Integrated Tool (GAPIT)73 within the R environment74. In addition, two multi-locus GWAS algorithms were used to identify significant quantitative trait loci (QTLs) for the biomass-related traits: BLINK (Bayesian-information and Linkage disequilibrium Iteratively Nested Keyway)75,76 and SUPER (Settlement of MLM Under Progressively Exclusive Relationship)77.


Statistical inferences to separate means e.g., sequencing statistics, were carried out using analysis of variance and Tukey HSD test at the 5% significance level. Genetic diversity was evaluated using the Fstatistic and the principal coordinates analysis78. Statistical inferences and data visualization were carried out using R software74.