Sugar beet (Beta vulgaris ssp. vulgaris) is an important crop of temperate climates which provides nearly 30% of the world’s annual sugar production and is a source for bioethanol and animal feed. The species belongs to the order of Caryophyllales, is diploid with 2n = 18 chromosomes, has an estimated genome size of 714–758 megabases1 and shares an ancient genome triplication with other eudicot plants2. Leafy beets have been cultivated since Roman times, but sugar beet is one of the most recently domesticated crops. It arose in the late eighteenth century when lines accumulating sugar in the storage root were selected from crosses made with chard and fodder beet3. Here we present a reference genome sequence for sugar beet as the first non-rosid, non-asterid eudicot genome, advancing comparative genomics and phylogenetic reconstructions. The genome sequence comprises 567 megabases, of which 85% could be assigned to chromosomes. The assembly covers a large proportion of the repetitive sequence content that was estimated4 to be 63%. We predicted 27,421 protein-coding genes supported by transcript data and annotated them on the basis of sequence homology. Phylogenetic analyses provided evidence for the separation of Caryophyllales before the split of asterids and rosids, and revealed lineage-specific gene family expansions and losses. We sequenced spinach (Spinacia oleracea), another Caryophyllales species, and validated features that separate this clade from rosids and asterids. Intraspecific genomic variation was analysed based on the genome sequences of sea beet (Beta vulgaris ssp. maritima; progenitor of all beet crops) and four additional sugar beet accessions. We identified seven million variant positions in the reference genome, and also large regions of low variability, indicating artificial selection.
The sugar beet genome sequence enables the identification of genes affecting agronomically relevant traits, supports molecular breeding and maximizes the plant’s potential in energy biotechnology.
- Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208–218 (1991)
- Palaeohexaploid ancestry for Caryophyllales inferred from extensive gene-based physical and genetic mapping of the sugar beet genome (Beta vulgaris). Plant J. 70, 528–540 (2012)
- Origin of the ‘Weisse Schlesische Rübe’ (white Silesian beet) and resynthesis of sugar beet. Euphytica 41, 75–80 (1989)
- Genome size and the proportion of repeated nucleotide sequence DNA in plants. Biochem. Genet. 12, 257–269 (1974)
- In Root Tuber Crops Vol. 7, 173–219 (Springer, 2010)
- Angiosperm Phylogeny Website. http://www.mobot.org/MOBOT/research/APweb/ (2012)
- A sugar beet (Beta vulgaris L.) reference FISH karyotype for chromosome and chromosome-arm identification, integration of genetic linkage groups and analysis of major repeat family distribution. Plant J. 72, 600–611 (2012)
- Haplotype divergence in Beta vulgaris and microsynteny with sequenced plant genomes. Plant J. 57, 14–26 (2009)
- The human phylome. Genome Biol. 8, R109 (2007)
- An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG II. Bot. J. Linn. Soc. 141, 399–436 (2003)
- Phylogenetic analysis of 83 plastid genes further resolves the early diversification of eudicots. Proc. Natl Acad. Sci. USA (2010)
- Isolation and linkage analysis of expressed disease-resistance gene analogues of sugar beet (Beta vulgaris L.). Genome 46, 70–82 (2003)
- The absence of TIR-type resistance gene analogues in the sugar beet (Beta vulgaris L.) genome. J. Mol. Evol. 58, 40–53 (2004)
- Analysis of DNA polymorphisms in sugar beet (Beta vulgaris L.) and development of an SNP-based map of expressed genes. Theor. Appl. Genet. 115, 601–615 (2007)
- Beta maritima: The Origin of Beets (Springer, 2012)
- The role of a pseudo-response regulator gene in life cycle adaptation and domestication of beet. Curr. Biol. 22, 1095–1101 (2012)
- Progress toward understanding heterosis in crop plants. Annu. Rev. Plant Biol. 64, 71–88 (2013)
- A bacterial artificial chromosome (BAC) library of sugar beet and a physical map of the region encompassing the bolting gene B. Mol. Genet. Genomics 269, 126–136 (2003)
- Construction and characterization of a sugar beet (Beta vulgaris) fosmid library. Genome 51, 948–951 (2008)
- SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012)
- Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578–579 (2011)
- Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009)
- The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009)
- Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12, 1269–1276 (2002)
- De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005)
- Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999)
- Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008)
- tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997)
- RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007)
- Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)
- Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41, D226–D232 (2013)
- Plant snoRNA database. Nucleic Acids Res. 31, 432–435 (2003)
- GenBank: update. Nucleic Acids Res. 32, D23–D26 (2004)
- The ASRG database: identification and survey of Arabidopsis thaliana genes involved in pre-mRNA splicing. Genome Biol. 5, R102 (2004)
- DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony. Bioinformatics 24, 1540–1541 (2008)
- New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010)
- MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012)
- Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3, 418–426 (1986)
- InterProScan–an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847–848 (2001)
- UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 40, D71–D75 (2012)
- KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–D114 (2012)
- The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41 (2003)
- A genome-wide comparison of NB-LRR type of resistance gene analogs (RGA) in the plant kingdom. Mol. Cells 33, 385–392 (2012)
- InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 40, D306–D312 (2012)
- The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature 480, 520–524 (2011)
- Sucrose transporters of higher plants. Curr. Opin. Plant Biol. 13, 287–297 (2010)
Extended data figures and tables
Extended Data Figures
- Extended Data Figure 1: Genomic features of RefBeet chromosomes 2–5. (658 KB)
Section 1 shows the positions of genetic markers in the genetic map2 and RefBeet. Section 2 shows the distribution (stacked area graphs) of predicted coding sequence (CDS) and repetitive sequence of the Gypsy type (LTR retrotransposon), the SINE type (non-LTR retrotransposon), the En/Spm type (DNA transposon), and three classes of satellite DNA (intercalary, centromeric, subtelomeric). The number of bases per feature is displayed in windows of 500 kb (shifted by 300 kb). Section 3 shows the distribution (stacked area graphs) of mapped small RNAs of length 21 and 24 nt in adjacent bins of 500 kb. For reads mapping at multiple locations, one random location was selected. Reads matching within predicted rRNA loci were ignored. Positions with more than 10,000 mapped 21 nt sequences were labelled with the corresponding non-coding RNA prediction, if available, including the number of matching reads. Section 4 shows the chromosome-wide distribution of genomic variants in four sugar beet accessions and sea beet compared to RefBeet. Substitutions and deletions were detected by read-mapping with up to three variants per 100 nt read in 50 kb windows shifted by 25 kb. Shared and individual low-variation regions per accession are visible.
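The overlapping-window counting described for Section 4 (50 kb windows shifted by 25 kb) can be sketched as follows. This is only an illustration under stated assumptions: the function name `windowed_counts` and the half-open window convention are hypothetical and not taken from the paper, and the source does not say which software produced the actual profiles.

```python
from bisect import bisect_left


def windowed_counts(positions, chrom_length, window=50_000, step=25_000):
    """Count variant positions in overlapping windows [start, start + window).

    `positions` are 0-based coordinates on one chromosome (hypothetical input).
    Only windows that fit entirely within the chromosome are reported, so a
    trailing partial window is dropped.
    """
    positions = sorted(positions)
    counts = []
    last_start = max(chrom_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        # Binary search gives the number of sorted positions in [start, end).
        n = bisect_left(positions, end) - bisect_left(positions, start)
        counts.append((start, end, n))
    return counts


example = windowed_counts([10, 30_000, 60_000, 74_999], chrom_length=100_000)
```

Because the step is half the window size, every position away from the chromosome ends is counted in two windows, which smooths the profile compared with adjacent bins.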
- Extended Data Figure 3: K-mer distribution and read coverage. (93 KB)
a, Number of 17-mers at different coverages. b, c, Correlation of read coverage and GC content of reads generated from a PCR-amplified library (b) and a PCR-free library (c). Read data sets in b and c were aligned against RefBeet. The GC content and the amount of aligned bases were computed in sliding windows of 500 bases shifted by 100 bases. To reduce the number of data points, only chromosome 1 scaffold 1 (Bvchr1.sca001, 8 Mb) was plotted.
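A k-mer distribution of the kind shown in panel a can be computed, in principle, by counting every 17-mer in the reads and then tallying how many distinct 17-mers occur at each multiplicity. The sketch below is a minimal in-memory version for illustration; the name `kmer_spectrum` is hypothetical, reverse complements are not merged, and the source does not state which k-mer counter was actually used.

```python
from collections import Counter


def kmer_spectrum(reads, k=17):
    """Return {multiplicity: number of distinct k-mers seen that many times}.

    Assumptions for this sketch: reads are plain strings, k-mers containing
    'N' are skipped, and a k-mer and its reverse complement are counted
    separately.
    """
    kmer_counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                kmer_counts[kmer] += 1
    # Histogram over the per-k-mer counts.
    return Counter(kmer_counts.values())


# Toy example with k=4 so the result is easy to check by hand.
toy = kmer_spectrum(["ACGTA", "ACGTA", "CGTAC"], k=4)
```

For real sequencing data the counts would not fit comfortably in a Python dictionary, so a dedicated k-mer counting tool would normally be used instead.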
- Extended Data Figure 4: Genetic vs physical distances. (223 KB)
a, Genetic and physical positions of 983 genetic markers in the genetic map of sugar beet2 and the RefBeet assembly, respectively. The expected physical distance in sugar beet had been reported as 855 kb per 1 cM, with deviations of up to 50-fold2. In RefBeet only 5% of marker pairs showed the expected physical distance (855 kb ± 20%), suggesting strict partitioning of the genome into regions favouring or disfavouring recombination events.
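The comparison against the expected 855 kb per cM (± 20%) can be illustrated with a short calculation. The sketch assumes that "marker pairs" means neighbouring markers along one chromosome ordered by genetic position, which the caption does not specify; the function name and input layout are hypothetical.

```python
def fraction_near_expected(markers, expected_kb_per_cm=855.0, tolerance=0.20):
    """Fraction of adjacent marker pairs whose kb/cM ratio is near the expectation.

    `markers` is a list of (genetic_position_cM, physical_position_bp) tuples
    for one chromosome. Pairs with zero genetic distance are skipped because
    their ratio is undefined.
    """
    ordered = sorted(markers)
    ratios = []
    for (cm1, bp1), (cm2, bp2) in zip(ordered, ordered[1:]):
        d_cm = cm2 - cm1
        if d_cm <= 0:
            continue
        ratios.append(abs(bp2 - bp1) / 1000.0 / d_cm)
    if not ratios:
        return 0.0
    low = expected_kb_per_cm * (1 - tolerance)
    high = expected_kb_per_cm * (1 + tolerance)
    return sum(low <= r <= high for r in ratios) / len(ratios)


# Three pairs: 855 kb/cM (inside), 45 kb/cM (outside), 900 kb/cM (inside).
share = fraction_near_expected(
    [(0.0, 0), (1.0, 855_000), (2.0, 900_000), (3.0, 1_800_000)]
)
```

With a 20% tolerance the accepted range is 684 to 1,026 kb per cM, so two of the three toy pairs qualify.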
- Extended Data Figure 5: Annotation of repeats and non-coding RNA genes. (279 KB)
a, Repeat content of the sugar beet genome assembly. A total of 252 Mb (42.3%) of the genome assembly consists of repetitive DNA with retrotransposons as the most abundant repeat fraction. All major superfamilies of DNA transposons were represented, showing a dispersed or slightly centromere-enriched distribution along the chromosomes. Microsatellites and minisatellites were well represented owing to flanking heterogeneous sequences, which allowed their assembly. The remaining repetitive sequences (‘Unknown’) in 459 families represent potentially new repeats, most likely rearranged or truncated retrotransposons. b, Summary of non-coding RNA gene annotations. For different classes (miRNA, microRNA; rRNA, ribosomal RNA; snRNA, spliceosomal RNA; snoRNA, small nucleolar RNA; tRNA, transfer RNA) the number of predictions, the number and percentage of predictions with overlapping small RNA reads, and the number of families/subtypes are listed. c, Proportion of annotated tRNAs by amino acid for the five species studied (At, Arabidopsis thaliana; Pt, Populus trichocarpa; Bv, Beta vulgaris; Vv, Vitis vinifera; Zm, Zea mays). d, Absolute numbers of annotated tRNAs by amino acid. Except for pseudogenes the proportion of tRNAs is relatively constant among all species (species names as in c). e, Number of annotated tRNAs by amino acid and species (as predicted by tRNAscan-SE). The total is computed without the last two rows containing pseudogenes and presumably defunct tRNAs with undetermined anticodon. f, Number and size of Beta vulgaris gene clusters of at least 10 members representing expanded gene families. A total of 1,274 genes are contained in 97 clusters.
- Extended Data Figure 6: Analysis of paralogous and orthologous genes. (215 KB)
a, Number and percentage of detected orthologues between Beta vulgaris and nine other plant species. Orthology relationships with 10 or more proteins for any of the species were discarded in order to avoid biases introduced by species-specific gene family expansions. In total, 18,927 sugar beet genes had orthologues in at least one of nine plants, and 16,062 paralogous sugar beet genes appeared in 14,852 trees. b, Number of duplication events detected in gene trees grouped into three age classes. The duplication ratio was calculated as the number of age class-specific duplication events divided by the total number of trees containing duplication events. c, d, Collinear blocks of protein coding genes in Beta vulgaris as dotplot (c) or circular plot (d). Each dot or connecting line represents one gene pair, respectively. Shown are 34 collinear blocks containing 7–35 gene pairs. A triplicated region is visible on chromosomes 1, 3 and 5 (arrows). e, Histogram of Ks values for Beta vulgaris protein coding gene pairs in collinear blocks. Ks values mainly scatter between 1.2 and 1.8 and show peaks at 1.2 and 1.7.
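The duplication ratio defined in panel b is a simple quotient, and a small worked example may make the definition concrete. The sketch below follows the wording of the caption; the age-class labels and tree identifiers are hypothetical placeholders, since the caption does not name the three classes.

```python
from collections import Counter


def duplication_ratios(events_per_tree):
    """Duplication ratio per age class.

    `events_per_tree` maps a tree identifier to a list of age-class labels,
    one label per duplication event detected in that tree. The ratio for a
    class is (number of events of that class) divided by (number of trees
    containing at least one duplication event).
    """
    trees_with_dups = sum(1 for events in events_per_tree.values() if events)
    if trees_with_dups == 0:
        return {}
    per_class = Counter(
        label for events in events_per_tree.values() for label in events
    )
    return {label: n / trees_with_dups for label, n in per_class.items()}


# Hypothetical labels: two of the three trees contain duplication events.
ratios = duplication_ratios({
    "tree_1": ["class_A", "class_B"],
    "tree_2": ["class_A"],
    "tree_3": [],
})
```

Note that under this definition a ratio can exceed 1 when trees carry several duplication events of the same age class.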
- Extended Data Figure 7: Phylogenetic trees. (289 KB)
a, Phylogenetic tree of 44 sucrose transporter protein sequences in higher plants including Beta vulgaris. The reliability for internal branches is indicated in red ranging from 0 = unreliable to 1 = highly reliable (aLRT statistics). At, Arabidopsis thaliana; Bv, Beta vulgaris; Dc, Daucus carota; Hb, Hevea brasiliensis; Hv, Hordeum vulgare; Le, Lycopersicon esculentum (renamed Solanum lycopersicum); Lj, Lotus japonicus; Lp, Lolium perenne; Nt, Nicotiana tabacum; Os, Oryza sativa; Ps, Pisum sativum; Sh, Saccharum Hybrid Cultivar Q117; St, Solanum tuberosum; Sb, Sorghum bicolor; Ta, Triticum aestivum; Zm, Zea mays. b, Intraspecific relationship of five Beta vulgaris accessions based on the alignment of 2,112 shared genes.
- Extended Data Figure 8: Intraspecific variation. (359 KB)
a, Number of variants inferred from read mapping (black) and sequence identity of matching scaffolds (blue) along RefBeet chromosome 7. The variation profiles of five different accessions including the reference accession (RefBv) are shown. Regions with a high number of read-mapping variants showed a higher density of scaffolds of low sequence identity. However, low-identity scaffolds were also present in low-variation regions of mapped reads. b, Detailed view of the distribution of read mapping variants and read coverage (green) in Beta vulgaris accession KDHBv compared to RefBeet. The secondary y axis on the right side indicates the percentage of positions per window covered in the alignment. Low-variation regions were generally well covered. c, Fraction of variation deserts of different lengths along RefBeet based on read-mapping of genomic data sets and alignment of assembled scaffolds. The six different genotypes include the reference, four other Beta vulgaris accessions, and one Beta maritima accession. Variation deserts were found in all chromosomes. The variation deserts of non-reference sugar beet accessions contained 49% (179 Mb in KDH) to 58% (217 Mb in YMo) of all covered RefBeet positions. d, Intersection of variation deserts. Starting from RefBv, the size of shared variation deserts decreased by including additional Beta vulgaris accessions. e, Sequence conservation comparison of three groups of genes. Genes with GO term enrichment localized within variation deserts are shown in red; genes without GO term enrichment localized within variation deserts are shown in blue; genes localized outside of variation deserts are shown in green. For each of the three groups 17 randomly selected genes with confirmed exon–intron structure were aligned to 24 additional sugar beet accessions. The sequence conservation was determined from the identity of the sequence alignment. Genes with GO term enrichment localized within variation deserts had the highest fraction of high identity gene alignments, followed by genes without GO term enrichment localized within variation deserts. f, Length distribution of insertions and deletions in coding sequences. Apart from one-base indels, indels of length three or multiples of three (3n) were overrepresented. Of all genes affected by indels, 49.1% had a single 3n indel and 5.0% had more than one indel (any length) with bases summing up to 3n.
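The two indel categories reported in panel f (a single 3n indel, or several indels whose lengths sum to 3n) both leave the reading frame intact. A minimal sketch of such a classification is given below. It assumes signed lengths (insertions positive, deletions negative) and uses the net length change to decide frame preservation; the caption does not say whether lengths were summed with or without sign, and the category names are hypothetical.

```python
def classify_gene_indels(indel_lengths):
    """Classify one gene by the indels found in its coding sequence.

    `indel_lengths` holds signed lengths in bases: insertions > 0 and
    deletions < 0 (an assumption of this sketch). A net change divisible
    by three preserves the downstream reading frame.
    """
    if not indel_lengths:
        return "no_indel"
    net = sum(indel_lengths)
    frame_preserved = net % 3 == 0  # Python's % is non-negative for negative net
    if len(indel_lengths) == 1:
        return "single_3n" if frame_preserved else "single_frameshift"
    return "multiple_sum_3n" if frame_preserved else "multiple_frameshift"


labels = [classify_gene_indels(x) for x in ([3], [-1], [1, 2], [-1, -2], [1, 1])]
```

Tallying these labels over all genes with indels would give percentages comparable in kind to the 49.1% and 5.0% quoted in the caption, although the exact counting rules used there are not described.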
Extended Data Tables
- Supplementary Information (1.9 MB)
This file contains Supplementary Methods, Supplementary Tables 1-16, Supplementary Notes and additional references.
- Supplementary Data 1 (110 KB)
Repeat families as detected by RepeatModeler along with the combined automatic and manual classification.
- Supplementary Data 2 (8.4 MB)
Predicted RefBeet genes and their functional annotation based on database searches and transfer from orthologs.
- Supplementary Data 3 (69 KB)
List of 715 putative resistance gene analogs (RGA). Beta vulgaris (Bv) genes were classified based on the presence of RGA domains (columns A+B). In 30 additional Bv genes (column D) these domains were missing in exon parts, but the genes showed sequence homology with known RGAs from other plants.