Sugar beet (Beta vulgaris ssp. vulgaris) is an important crop of temperate climates which provides nearly 30% of the world’s annual sugar production and is a source for bioethanol and animal feed. The species belongs to the order Caryophyllales, is diploid with 2n = 18 chromosomes, has an estimated genome size of 714–758 megabases1 and shares an ancient genome triplication with other eudicot plants2. Leafy beets have been cultivated since Roman times, but sugar beet is one of the most recently domesticated crops. It arose in the late eighteenth century when lines accumulating sugar in the storage root were selected from crosses made with chard and fodder beet3. Here we present a reference genome sequence for sugar beet as the first non-rosid, non-asterid eudicot genome, advancing comparative genomics and phylogenetic reconstructions. The genome sequence comprises 567 megabases, of which 85% could be assigned to chromosomes. The assembly covers a large proportion of the repetitive sequence content that was estimated4 to be 63%. We predicted 27,421 protein-coding genes supported by transcript data and annotated them on the basis of sequence homology. Phylogenetic analyses provided evidence for the separation of Caryophyllales before the split of asterids and rosids, and revealed lineage-specific gene family expansions and losses. We sequenced spinach (Spinacia oleracea), another Caryophyllales species, and validated features that separate this clade from rosids and asterids. Intraspecific genomic variation was analysed based on the genome sequences of sea beet (Beta vulgaris ssp. maritima; progenitor of all beet crops) and four additional sugar beet accessions. We identified seven million variant positions in the reference genome, and also large regions of low variability, indicating artificial selection.
The sugar beet genome sequence enables the identification of genes affecting agronomically relevant traits, supports molecular breeding and maximizes the plant’s potential in energy biotechnology.
During the last 200 years of sugar beet breeding, the sugar content has increased from 8% to 18% in today’s cultivars. Breeding has also actively selected for traits like resistance to viral and fungal diseases, improved taproot yield, monogermy of the seed and bolting resistance. After discovering a male sterile cytoplasm, breeders started to develop hybrid varieties and successfully increased yield5. Taxonomy assigns Beta to the Amaranthaceae family within Caryophyllales, an order comprising 11,510 species6 including cacti, ice plants (Aizoaceae), other drought-tolerant species, and carnivorous plants such as pitcher plants (Nepenthes) and sundew (Drosera). Until now, no Caryophyllales species have been sequenced.
To provide an extended basis for comparative plant genomics and to support molecular breeding, we sequenced the double haploid sugar beet line KWS2320 as reference genotype, using the Roche/454, Illumina and Sanger sequencing platforms (Extended Data Table 1a, Supplementary Table 1). The initial assembly was integrated with genome-wide genetic and physical map information2, resulting in 225 genetically anchored scaffolds (394.6 Mb), assigned to nine chromosomes (Table 1, Fig. 1, Extended Data Figs 1 and 2). The chromosomal nomenclature follows a previous study7 describing a Beta karyotype at chromosome arm resolution. The genetically integrated assembly, ‘RefBeet’, comprised in total 569.0 Mb in 43,721 sequences (2,333 scaffolds and 41,388 unscaffolded contigs) and had an N50 size of 1.7 Mb with 77 scaffolds being of this size or larger. We incorporated Illumina sequencing reads generated from PCR-free libraries and analysed genotyping-by-sequencing data, leading to an optimized assembly of 566.6 Mb in 2,171 scaffolds and 38,337 unscaffolded contigs. The N50 size was 2.01 Mb (the 72nd scaffold) and the chromosomally assigned fraction 84.7% (Table 1). The assembled part of the genome is assumed to represent the unique regions as well as repetitive regions, which are either short enough to be placed in a unique sequence context or divergent enough to behave as unique entities. A k-mer analysis of Illumina data indicated a genome size of 731 Mb (Extended Data Fig. 3a). We located 94% of publicly available isogenic expressed sequence tags (ESTs) in RefBeet, suggesting that gene-containing regions are comprehensively covered. A sequenced bacterial artificial chromosome (BAC) clone8 was compared to the corresponding region in RefBeet and found to be correctly assembled within one scaffold. On average, one mismatch and one insertion or deletion (indel) error occurred in 10 kb. 
RefBeet resolved regions of recombination suppression in centromeric and pericentromeric regions of chromosomes, flanked by regions showing enhanced recombination rates (Extended Data Fig. 4).
We identified 252 Mb (42.3%) of RefBeet as repetitive sequence (Supplementary Data 1). The largest group was long terminal repeat (LTR) retrotransposons (Extended Data Fig. 5a). Gypsy-like elements were enriched in centromeric and pericentromeric regions (Fig. 1, Extended Data Figs 1 and 2). Non-LTR retrotransposons of the long interspersed nuclear element (LINE) type were dispersed, whereas short interspersed nuclear elements (SINEs) were enriched towards chromosome ends. Three major satellite classes were organized in large arrays (Fig. 2). By analysing unassembled genomic data we estimated total amounts of 15.4 Mb centromeric, 6.0 Mb intercalary, and 0.6 Mb subtelomeric satellite DNA, as well as 10.0 Mb of 18S-5.8S-25S and 5S ribosomal genes.
A total of 27,421 protein-coding genes supported by mRNA evidence (Supplementary Table 2) were predicted in RefBeet; 91% included start and stop codons (Supplementary Table 3). The majority of the genes (73.6%) were found within chromosomally assigned scaffolds with on average 5.2 genes per 100 kb, a gene length of 5,252 bp including introns, a coding sequence length of 1,159 bp and 4.9 coding exons per gene. The codon usage was similar to other dicot species (Supplementary Table 4). Homology-based annotation of non-coding RNA genes resulted in 3,005 predictions of tRNAs, microRNAs, small nucleolar RNAs, spliceosomal RNAs and ribosomal RNAs, mainly supported by evidence from isogenic small RNA data (Extended Data Fig. 5b–e, Extended Data Table 1b, Supplementary Table 5).
Based on the translated Beta vulgaris gene set and the protein sets of nine other plants (Extended Data Table 1d) we determined 19,747 phylogenetic trees, collectively called ‘phylome’9, and inferred orthologous and paralogous gene relationships (Extended Data Fig. 6a). Previous studies left the phylogeny of rosids, asterids and Caryophyllales unresolved10 or classified Caryophyllales as a subclade of asterids11. A species tree inferred from the collection of gene trees strongly suggested that Beta vulgaris branched off before the separation of asterids and rosids (Fig. 3). Thus, according to our data, Caryophyllales represent the most basal eudicot clade among the studied species. The fraction of species-specific genes within eudicots (Fig. 3) was the largest for sugar beet, reflecting its phylogenetic position. The analysis of paralogous genes provided evidence for the absence of a lineage-specific whole genome duplication in Beta vulgaris supporting previous studies2 (Extended Data Fig. 6b–e, Supplementary Table 6).
We functionally annotated 17,151 RefBeet genes (63%) based on sequence homology (Supplementary Data 2). The number of disease resistance genes detected was comparatively small, particularly for the STK-domain containing classes (Supplementary Table 7, Supplementary Data 3). In contrast to previous studies12,13, we found a TNL class resistance gene in the genomes of sugar beet (Bv_22240_ksro) and spinach, both belonging to Amaranthaceae. The phylome tree of Bv_22240_ksro indicated that the presence of a single TNL class gene is a feature of Amaranthaceae, whereas expansion of this gene family is typical for rosids and asterids. The functional categories of expanded and potentially lost gene families (Extended Data Fig. 5f, Supplementary Tables 8, 9) indicate that genes involved in defence and stress compensation represent vital evolutionary targets. The number of transcription factors identified in RefBeet was the lowest of all species studied (Supplementary Table 10). The reference genome sequence enables future experimental approaches to determine if lower gene numbers may alter transcriptional network topologies; Caryophyllales may harbour unknown genes involved in transcriptional control. We identified four sucrose transporter (SUT) orthologues in RefBeet. Phylogenetic analysis including known sucrose transporters suggested a duplication of the SUT1 gene in Amaranthaceae followed by extensive mutation of one paralogue (Extended Data Fig. 7a). The genome sequences of sugar beet and spinach, both containing the four SUT genes, are an excellent basis for studying the implications of this duplication event.
Previous studies addressing the variation within the genus Beta indicated high divergence between genotypes2,14. We generated genome sequences of four non-reference sugar beet double haploid accessions (KDHBv, UMSBv, YMoBv, YTiBv) and characterized the genome-wide variation (Extended Data Table 1a, c, Supplementary Tables 3, 11–13). Within RefBeet we identified 7.0 million positions which were variant (77% substituted, 23% deleted) in at least one of the other accessions and 274.9 million positions which were unchanged in all five accessions. We found 2.9 million variants on average per non-reference accession. Coding regions had a higher prevalence of indels of length three or multiples of three (44%) than non-coding regions (16%). The distribution of variants revealed large regions of low variation (Fig. 1, Extended Data Figs 1, 2, 8a–c). Such variation ‘deserts’ were found in all chromosomes and in all accessions, which might reflect extensive cross-breeding with a limited number of haplotypes in the breeding material, a founder effect, or a bottleneck at the establishment of the crop. However, most of the variation deserts were accession-specific (Extended Data Fig. 8d), probably owing to recombination events that have occurred since the introduction of founder haplotypes into breeding lines. The four accessions shared 50.6 Mb of variation deserts along RefBeet containing 1,824 predicted RefBeet genes (for Gene Ontology (GO) term enrichment, see Supplementary Table 14). Genes in these regions, analysed in 24 additional sugar beet accessions, showed higher sequence conservation (Extended Data Fig. 8e). These findings suggest that regions of low variation are not maintained by chance, but are rather the result of breeders’ selection towards certain genes contained in those regions. The sea beet Beta maritima is fully interbreedable with sugar beet and commonly used as a valuable source of resistances against biotic or abiotic stress15.
We sequenced its genome and identified a total of 75 Mb as variation desert, of which 67 Mb were shared with at least one of the four non-reference Beta vulgaris accessions. These regions may represent traces of breeding activities which aimed at introducing sea beet traits into sugar beet. The gene BvBTC1, encoded by the B-locus16 and located in a 1.1 Mb RefBeet scaffold on chromosome 2, plays an important role during vernalization. Cultivated lines are homozygotes for the B allele resulting in a biennial life cycle. The B-locus is located in variation deserts of all five sugar beet lines, whereas the genome of the annual wild form Beta maritima shows high variation at this locus, demonstrating that breeding has shaped the genome of sugar beet.
Sugar beet is a hybrid crop based on seed pool lines (male steriles, monogerms) and pollen pool lines (pollinators, multigerms). We identified regions of potentially fixed differences between the two groups: the intersection of shared low-variation regions in seed pool lines and shared high-variation regions of identical variation patterns in pollen pool lines comprised 311 genomic regions (1.6 Mb in total) containing 119 genes.
We performed evidence-based gene predictions in the assemblies of KDHBv, UMSBv, YMoBv and YTiBv. Based on the comparison of 2,112 single copy genes, UMSBv had the largest genetic distance to RefBeet (Extended Data Fig. 7b). The number of accession-specific genes ranged from 79 (RefBeet) to 271 (UMSBv). Genes were analysed for the ratio of non-synonymous to synonymous substitutions, altered start and stop sites, new stop codons, modified splice donor or splice acceptor sites and indels, revealing extensive variation in coding regions (Supplementary Tables 15, 16 and Extended Data Fig. 8f). In addition to allelic variation, the variation in gene content may contribute to heterosis, as has been suggested for maize17.
The availability of the sugar beet genome sequence greatly simplifies fine-mapping of quantitative trait loci and the discovery of causal genes, as single-nucleotide polymorphism (SNP)-based markers can be designed for any region of the genome. Association mapping to identify regions of shared ancestry in sugar beet lines requires at least 100,000 variant positions for genotyping. Such positions can now be selected from a catalogue of seven million variants. The genome sequence facilitates further experimentation to characterize gene functions, which accelerates the identification of rewarding targets for transgenic manipulation, and represents an important foundation for molecular and comparative studies in sugar beet, Caryophyllales and flowering plants. The data presented are key to improvements of the sugar beet crop with respect to yield and quality and towards its application as a sustainable energy crop.
Genome sequencing and assembly
Genomic DNA isolated from root and leaf material was sequenced on the Roche/454 FLX, Illumina HiSeq2000 and ABI3730 XL sequencing platforms. The Newbler software was applied on 454, Illumina and Sanger sequencing data to assemble the reference genotype (RefBeet). Contigs of putative bacterial origin and those smaller than 500 bp were removed. Additional lines were sequenced on the HiSeq2000 platform and were assembled using SOAPdenovo. We performed gap-closing and homopolymer error correction using Illumina reads from PCR-based and PCR-free libraries (Extended Data Fig. 3b, c). Chromosome-wise scaffolding using genetic and physical mapping data was assisted by SSPACE (Methods and Supplementary Methods).
Prediction of protein coding genes was performed using the AUGUSTUS pipeline, with Illumina mRNA-seq reads and other cDNA read data as supporting evidence. Gene models were filtered for transposable element homology. Small and other non-coding RNAs were identified based on homology searches and based on Illumina sequencing data. Repeats were predicted using RepeatModeler, followed by manual curation of the predictions (Methods and Supplementary Methods).
Variant positions (substitutions, indels) were identified by read mapping and scaffold alignment (Methods and Supplementary Methods).
Phylogenetic analysis and species tree reconstruction
The longest protein sequence of each RefBeet gene was used for a Smith–Waterman search against the protein sets of nine other plant species. Alignments were generated and quality-filtered, and phylogenetic trees were calculated for each Beta vulgaris sequence. A species tree was generated from a super-tree of all trees and by multi-gene phylogenetic analysis of high-confidence 1:1 orthologues.
Protein coding gene predictions were functionally annotated based on protein signatures and orthology relationships.
Sequencing and assembly
Genomic DNA isolated from root and leaf material was sequenced on the Roche/454 FLX, Illumina HiSeq2000 and ABI3730 XL sequencing platforms. The plant material included five double haploid and two inbred sugar beet breeding lines (Beta vulgaris ssp. vulgaris; referred to as Beta vulgaris), one wild beet accession (Beta vulgaris ssp. maritima; referred to as Beta maritima), and one spinach accession (Spinacia oleracea). Additionally, 15 genotypes from an F2 panel of a Beta vulgaris cross used to generate beet genetic maps2,14 were sequenced at low coverage (Supplementary Methods).
Illumina genomic sequencing was performed on a HiSeq2000 sequencing instrument with 2 × 100 nt for paired-end reads and 2 × 50 nt for mate-pair reads (Supplementary Tables 1 and 11). Roche/454 genomic single-read and mate-pair sequencing was performed on a Roche/FLX sequencing instrument using Titanium XLR70 sequencing kits (Roche/454 Life Sciences). End-sequencing of genomic BAC and fosmid libraries introduced previously2,18,19 was performed on an ABI3730 XL DNA Analyzer. Genomic Roche/454 and Illumina data, BAC ends and fosmid ends were filtered for low-quality sequence, contamination and redundancy, and all data sets were assembled together on a computer with 512 GB of random-access memory (RAM) using the Newbler software (v2.6 20110630_1301, parameters -fe exclude_list -siod -nrm -scaffold -large -ace -ar -a 40 -l 500 -cpu 48).
We removed potential bacterial contigs and scaffolds based on three criteria: a GC content of 60% or higher, the presence of predicted genes without sugar beet cDNA support, and the absence of both sugar beet repeats and genes supported by cDNA data. The assembly size was determined by adding up the lengths of the scaffolds and the lengths of additional unscaffolded contigs larger than 500 bp (smaller contigs were removed). The N50 size refers to this assembly size as 100% and reports the length of the scaffold or contig that spans the 50% mark after sorting the sequences by length.
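The assembly size and N50 definitions above can be illustrated with a minimal Python sketch. The function names and the toy lengths are hypothetical, and treating contigs of exactly 500 bp as retained is an assumption about the boundary case; this is not the code used in the study.

```python
def assembly_lengths(scaffold_lengths, contig_lengths, min_contig=500):
    """Combine scaffold lengths with unscaffolded contigs that pass the size filter.

    Assumption: contigs of exactly `min_contig` bp are kept.
    """
    kept_contigs = [n for n in contig_lengths if n >= min_contig]
    return list(scaffold_lengths) + kept_contigs


def n50(lengths):
    """Length of the sequence that spans the 50% mark of the total size
    after sorting all sequences from longest to shortest."""
    if not lengths:
        raise ValueError("no sequences given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer form of running >= total / 2
            return length


# Toy example with invented lengths (bp)
lengths = assembly_lengths([10_000, 8_000, 6_000], [4_000, 2_000, 300])
print(sum(lengths), n50(lengths))
```

In the toy example the 300 bp contig is discarded, the assembly size is 30,000 bp, and the second-longest sequence is the one that crosses the 50% mark.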
Illumina-only assemblies were performed on 100 GB and 256 GB RAM computers using the SOAPdenovo20 software v1.05 (SOAPdenovo-63mer, -K 49, pair_num_cutoff = 3, map_len = 32) followed by gap filling using GapCloser v1.12.
mRNA-seq sequencing was carried out on the Illumina Genome Analyzer (GA) IIx and Illumina HiSeq2000 sequencing instruments. From each library, one lane of data was generated with read lengths of 2 × 54 nt on the GA and 2 × 50 nt on the HiSeq2000. Small RNA libraries were sequenced with 36 nt reads on the GA and 50 nt reads on the HiSeq2000. Additional cDNA sequences from sugar beet were generated by Roche/454 Life Sciences on the GS20 platform with an average length of 106 bp. The transcript data generated are summarized in Supplementary Table 2.
Integration with genetic and physical maps as well as genotyping-by-sequencing (GBS) data
Sequence information of anchored markers in genetic and physical maps2 was used to assign scaffolds to chromosomes and to build connections between scaffolds. Confirmation and further genetic integration was derived from generating and analysing GBS data (see Supplementary Methods). To establish new connections between scaffolds, a group of scaffolds placed as neighbours based on genetic integration was used as input for SSPACE21 (v1.1, parameters -x 0 -k 1) together with the six largest paired data sets (Illumina 6 kb, Roche/454 7 kb, 10 kb and 20 kb, fosmid ends, BAC ends). The output was manually corrected if necessary. Analyses and control steps were coded in Perl v5.8.9 or used UNIX shell commands.
Correction of small indels in the assembly and gap closing
For consensus sequence correction we mapped quality filtered Illumina reads (2 × 100 nt, 93-fold genome coverage) against the assembly using BWA22 v0.5.9 allowing for 3 edits. Indels were identified using SAMtools mpileup23 v0.1.18. A total of 9,101 bp insertions and 60,685 bp deletions were corrected in the consensus sequence. Indel errors were corrected if error positions were covered by at least 10 reads, if at least 60% of them showed the same indel and if the indel was confirmed on both strands. The criteria were validated using the Sanger sequence of the sugar beet BAC insert ZR47B15 (ref. 8).
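The three acceptance criteria for an indel correction can be written as a single predicate. The sketch below is illustrative only: the `IndelPileup` record and its field names are hypothetical, and parsing of SAMtools mpileup output is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class IndelPileup:
    """Hypothetical per-position summary derived from a read pileup."""
    depth: int        # number of reads covering the position
    fwd_support: int  # forward-strand reads showing the same indel
    rev_support: int  # reverse-strand reads showing the same indel


def accept_indel(p, min_depth=10, min_percent=60):
    """Apply the stated criteria: at least 10 covering reads, at least 60%
    of them showing the same indel, and confirmation on both strands."""
    support = p.fwd_support + p.rev_support
    enough_depth = p.depth >= min_depth
    enough_support = support * 100 >= min_percent * p.depth  # integer arithmetic
    both_strands = p.fwd_support > 0 and p.rev_support > 0
    return enough_depth and enough_support and both_strands
```

Integer arithmetic is used for the 60% threshold so that borderline cases such as 6 of 10 reads are not affected by floating-point rounding.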
Gap closing was performed using 670 million Illumina paired-end reads generated from two PCR-amplified libraries with insert sizes 600 nt and 250 nt, and from five PCR-free libraries with insert sizes 200–700 nt as input for GapCloser20 (v1.12-r6, default parameters and -p set to 31).
Annotation of repetitive elements
The de novo identification and classification of repeats within the RefBeet assembly was performed using RepeatModeler (http://www.repeatmasker.org). RepeatModeler v1.0.5 was installed along with RECON24 v1.07, RepeatMasker ‘open 3-3-0’, and RMBlast v1.2 with BLAST v2.2.23 (all loaded from http://www.repeatmasker.org), RepeatScout25 v1.0.5 (http://bix.ucsd.edu/repeatscout/), Tandem Repeat Finder26 (trf404 loaded from http://tandem.bu.edu/), and the RepeatMasker libraries (http://www.girinst.org/server/RepBase/) as of September 2011. RepeatModeler was applied on the database file created by the BuildDatabase subprogram which was run on the RefBeet assembly. The output was used as library for RepeatMasker (parameters -e crossmatch -pa 20 -gff) to generate a masked version of the assembly and to get the genomic positions of the repeat annotation. The repetitive fraction of the assembly was determined based on the RepeatMasker output. The automated repeat classification provided by RepeatModeler was refined by manual curation of the data (see Supplementary Methods). The repeat families along with the combined automated and manual classification are listed in Supplementary Data 1. The fractions of different repeat classes annotated in RefBeet are shown in Extended Data Fig. 5a.
The distribution of small RNAs (Fig. 1 and Extended Data Figs 1 and 2) was analysed by mapping 677.8 million adaptor-trimmed small RNA sequences of three libraries (see Supplementary Table 5) against RefBeet using BWA v0.6.1. Reads shorter than 15 bases after trimming were removed. The mapping seed length was set to 15. If a read mapped to multiple locations one random location was kept. Custom Perl scripts were used to locate mapped reads within the annotation of non-coding RNAs. The read length distribution and the chromosome-wide distribution of mapped reads were computed with Perl and plotted with R v2.15.
Coding gene prediction
The evidence-based de novo annotation of coding genes was performed applying the program AUGUSTUS27 v2.5.5 on the RefBeet assembly. The evidence was provided by 616.3 million filtered Illumina mRNA-seq reads (mainly generated as paired-ends) from five Beta vulgaris accessions and different tissues, 282,169 cDNA single-end reads generated on a Roche/454 GS20 platform, and 35,523 EST sequences from public databases. The Roche/454 reads and most of the ESTs were derived from genotype KWS2320. AUGUSTUS settings were: using Arabidopsis training data, reporting untranslated regions in addition to the coding sequences, reporting alternative transcripts if suggested by hints, and accepting introns that start with AT and end with AC in addition to introns with starts flanked by GT-AG and GC-AG.
For each Beta vulgaris accession we initially predicted 30,339 to 36,589 genes. After removal of (retro)transposon gene candidates, the gene sets used for downstream analyses consisted of 25,368 to 31,355 genes (Supplementary Table 3). Of those genes, 77–94% had both start and stop codon, and the fraction of predictions completely supported by cDNA was 48–61%.
We identified transposable element-related genes in the automated gene prediction of RefBeet by screening the phylomes for GO terms specific to transposable elements, by running the program ‘TransposonPSI’ (http://transposonpsi.sourceforge.net/), and by analysing the genomic positions of the repeat annotation (Supplementary Methods). In total, 4,643 transposable element candidates were removed from the initial set of 32,064 evidence-based genes predicted in RefBeet by AUGUSTUS.
The predicted genes of assemblies generated with SOAPdenovo (Supplementary Tables 3, 11) were screened for overlap with sequences of transposable elements contained in the repeat annotation. The assemblies were masked using RepeatMasker, and gene predictions were omitted from further analyses if at least one base of their coding parts overlapped with an annotated transposable element.
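The overlap rule for the SOAPdenovo assemblies (discard a gene if any coding base overlaps an annotated transposable element) can be sketched as follows. The data layout (dictionaries with `id`, `scaffold` and `cds` keys, half-open 0-based intervals) is an assumption made for illustration and does not reflect the file formats used in the study.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def filter_te_genes(genes, te_by_scaffold):
    """Return identifiers of genes whose coding parts do not touch any
    annotated transposable element on the same scaffold."""
    kept = []
    for gene in genes:
        tes = te_by_scaffold.get(gene["scaffold"], [])
        hit = any(
            intervals_overlap(cds_start, cds_end, te_start, te_end)
            for cds_start, cds_end in gene["cds"]
            for te_start, te_end in tes
        )
        if not hit:
            kept.append(gene["id"])
    return kept


# Toy example with invented coordinates
genes = [
    {"id": "g1", "scaffold": "s1", "cds": [(100, 200), (300, 400)]},
    {"id": "g2", "scaffold": "s1", "cds": [(500, 600)]},
    {"id": "g3", "scaffold": "s2", "cds": [(0, 50)]},
]
print(filter_te_genes(genes, {"s1": [(199, 250)]}))
```

The nested comparison is quadratic per scaffold; for genome-scale annotations a sorted interval structure would be the more practical choice.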
Plant species analysed in comparative studies
Comparative analyses were carried out based on data from seven dicot (five rosids and two asterids) and two monocot species: Arabidopsis thaliana, Glycine max, Populus trichocarpa, Theobroma cacao, Vitis vinifera, Solanum lycopersicum, Solanum tuberosum, Oryza sativa ssp. indica and Zea mays (Extended Data Table 1d, Supplementary Methods).
Annotation of non-coding RNA genes
Non-coding RNA genes were predicted using the programs tRNAscan-SE28, RNammer29, and BLAST30, and based on database searches in Rfam31, the plant snoRNA database32, GenBank33, and the ASRG database34. To support the predictions, we mapped 677.8 million Illumina small RNA reads generated from root, inflorescence and leaf material of the reference genotype against RefBeet (Supplementary Table 5). Reads were adaptor trimmed (custom Perl script) and mapped using BWA v0.6.1 with a seed length of 15 bases and one edit allowed in the alignment. The cDNA coverage was determined with SAMtools mpileup and custom Perl scripts.
Phylome reconstruction and orthology/paralogy predictions
The longest protein sequence for each gene annotated in RefBeet was used for a Smith–Waterman search (E-value cutoff 1 × 10^−5, matching length >50% of the query sequence) against the protein sets of nine other species. Alignments were generated and quality-filtered, and phylogenetic trees were calculated for each Beta vulgaris sequence (see Supplementary Methods). The collection of phylogenetic trees is referred to as the sugar beet ‘phylome’, based on which we inferred the orthologous and paralogous relationships of the genes by considering each node as either a speciation or duplication event.
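The two homology filters named above (E-value cutoff and matching length relative to the query) amount to a simple per-hit test. The sketch below uses a hypothetical `Hit` record; whether a hit with an E-value exactly equal to the cutoff is kept is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one pairwise alignment hit."""
    evalue: float
    query_len: int          # length of the query protein
    aligned_query_len: int  # number of query residues in the alignment


def keep_hit(hit, max_evalue=1e-5):
    """Keep hits with E-value at or below the cutoff (assumed inclusive)
    whose alignment spans more than 50% of the query sequence."""
    long_enough = 2 * hit.aligned_query_len > hit.query_len
    return hit.evalue <= max_evalue and long_enough


# Toy example with invented values
hits = [Hit(1e-20, 400, 250), Hit(1e-3, 400, 400), Hit(1e-20, 400, 200)]
print([keep_hit(h) for h in hits])
```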
Species tree reconstruction
A phylogeny describing the evolutionary relationships of the species included in the phylome was inferred using two complementary approaches resulting in identical tree topologies (Fig. 3). First, a super-tree was inferred from all the trees in the phylome (19,747 trees) by using a gene tree parsimony approach as implemented in the DupTree algorithm35. This approach differs from other super-tree approaches (such as finding the majority-rule consensus) in that it finds the species topology with the minimum total number of duplications implied when reconciling a collection of gene family trees (that is, the phylome) with that species topology. Second, 110 gene families with high-confidence one-to-one orthology in at least 9 of the 10 species were used to perform a multi-gene phylogenetic analysis. Protein sequence alignments were performed as described (see phylome reconstruction in Supplementary Methods) and concatenated into a single alignment. Species relationships were inferred from this alignment using a maximum likelihood (ML) approach as implemented in PhyML36 using the Jones–Taylor–Thornton (JTT) evolutionary model; for 97 of 110 gene families this model was best-fitting. Branch supports were computed using an aLRT (approximate likelihood ratio test) parametric test based on a chi-square distribution. The congruence of the two approaches suggests that the inferred phylogeny is correct.
To track specific or shared genes in the species tree, an all-against-all BLAST search of the protein sets of the ten species was performed (E-value cutoff 1 × 10^−5). The patterns of homology across species and clades were computed. The results were categorized as widespread genes, eudicot-specific genes and species-specific genes (Fig. 3).
Whole genome duplication analysis using collinear blocks of coding genes
We performed a Ks analysis of 370 paralogous gene pairs forming 34 collinear blocks (Extended Data Fig. 6c–e). Collinear blocks of coding genes were determined using MCScanX37 applied on the RefBeet protein set (longest protein isoform per gene). Protein sequences were aligned against themselves using BLASTp, and the top 5 alignments per gene were kept. High-confidence collinear blocks with an E-value lower than 1 × 10^−10 and a score larger than 300 were selected (parameters suggested by MCScanX). A total of 34 blocks of 7–35 gene pairs (on average 11 pairs, in total 370 pairs) were found. Ks values were calculated using MCScanX, which implements the Nei–Gojobori algorithm38.
Functional annotation of protein coding genes
Protein coding gene predictions were functionally annotated based on protein signatures and orthology relationships. In the protein signature approach, each sugar beet protein was inspected for different signatures such as families, regions, domains, repeats, and binding sites using InterProScan39 v4.8 and a set of different databases (PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D, PANTHER, HAMAP). Additionally, BLAST searches against SwissProt40, KEGG41, and KOG42 databases were performed, and annotations were extracted. In the phylogeny-based approach 15,263 one-to-one orthology relationships between Beta vulgaris genes and GO-annotated genes of other plant species were inferred from trees in the sugar beet phylome. The annotations from all sources were combined, and a merged annotation table was generated (Supplementary Data 2).
We predicted and classified resistance gene analogue (RGA) genes by applying a modified version of an HMM-based pipeline43 (Supplementary Methods). The numbers of RGAs detected in Beta vulgaris and nine other plant species are listed in Supplementary Table 7. The gene identifiers of all putative RGAs in the Beta vulgaris genome are listed in Supplementary Data 3. Proteins were classified into 56 transcription factor families and subfamilies based on protein domains defined by InterPro motifs44 indicating DNA binding or other domains characteristic of transcription factors45 (see Supplementary Methods). The numbers of transcription factors identified per species and per transcription factor class are listed in Supplementary Table 10. Sucrose transporter (SUT) proteins were identified by comparison of 40 known SUT protein sequences of 15 higher plants46 against RefBeet (see Supplementary Methods).
Expanded and potentially lost gene families in Beta vulgaris
Expanded gene families were detected by searching the Beta vulgaris phylome for genes specifically duplicated in Beta vulgaris. To determine potentially lost gene families in Beta vulgaris, the set of protein sequences of Arabidopsis was used for a BLAST search (E-value cutoff 1 × 10^−5, minimum 50% length of the query protein) against Beta vulgaris proteins.
Four non-reference sugar beet accessions were sequenced (KDHBv, UMSBv, YMoBv, YTiBv). Additional data for the reference accession, processed in the same way, was generated as a control and quality measure (referred to as ‘RefBv’). We generated a merged variant collection from RefBeet positions covered by read mapping or scaffold alignment and distinguished positions that were identical, variant (substituted or deleted), or uncalled (‘N’ in either RefBeet or in the other accession’s assembly). The number of insertions, deletions, substitutions and mixed events (indel plus substitution) was counted.
Substitutions, insertions and deletions contained in coding regions were extracted, counted and categorized using a custom Perl script. Categories were indels, synonymous and non-synonymous codon alterations, changed start and stop codons, new stop codons, and splice donor or acceptor site alterations (Supplementary Table 15). Splice sites were considered as altered if variants affected the first two or the last two bases of an intron. The standard genetic code was used to translate the coding sequence into amino acids and stop codons. The number of transitions (A↔G, T↔C), transversions (A↔T, C↔G, A↔C, G↔T), and indels of length three or multiples of three (Extended Data Fig. 8f) were determined using a custom Perl script.
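The categories described above follow simple rules that can be sketched in a few lines. The study used a custom Perl script; the Python functions below are a hypothetical illustration, with 1-based inclusive intron coordinates assumed for the splice-site check.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref, alt):
    """Transition: purine to purine or pyrimidine to pyrimidine.
    Transversion: any purine/pyrimidine exchange."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def is_frame_preserving(indel_length):
    """Indels of length three or a multiple of three keep the reading frame."""
    return indel_length > 0 and indel_length % 3 == 0


def alters_splice_site(variant_pos, intron_start, intron_end):
    """True if the variant hits the first two or last two intron bases
    (coordinates assumed 1-based and inclusive)."""
    donor = intron_start <= variant_pos <= intron_start + 1
    acceptor = intron_end - 1 <= variant_pos <= intron_end
    return donor or acceptor


print(classify_substitution("A", "G"), classify_substitution("A", "T"))
```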
We discovered regions in the genome with low variant rates (≤2 variants per 2 kb window), which we refer to as variation deserts (Extended Data Fig. 8a–d). Variant positions, identical positions and excluded positions (mainly due to low coverage) were counted in 2 kb intervals (shifted by 1 kb). An interval was considered a desert interval if it contained at most two variants and at most 500 excluded bases. A variation desert was defined as a stretch of adjacent desert intervals. Genes with at least 90% of their genomic length (CDS and UTRs) located within a variation desert were considered variation desert genes. We determined the frequency of each GO term assigned to the group of 1,824 desert genes and the remaining 25,597 genes. GO terms were kept if they were at least ten times more frequent within the group of desert genes (that is, GO terms with odds ratio <10 were removed). The probability that the enrichment occurred by chance was calculated using the two-tailed Fisher exact test (P value cutoff 0.05, no correction for multiple testing). Enriched GO terms are listed in Supplementary Table 14. The conservation of 51 RefBeet genes inside and outside of shared variation deserts was measured by screening for polymorphisms within an extended panel of 24 sugar beet genotypes representing different breeding programs (Supplementary Methods).
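The interval scan can be illustrated with a minimal sketch, assuming 0-based positions on a single sequence and half-open intervals. The function names are hypothetical, and 'adjacent' is interpreted here as consecutive sliding windows; the implementation used in this study may differ in these details.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple


def desert_intervals(seq_len: int, variant_pos: Iterable[int],
                     excluded_pos: Iterable[int], window: int = 2000,
                     step: int = 1000, max_variants: int = 2,
                     max_excluded: int = 500) -> List[Tuple[int, int]]:
    """Return variation deserts as (start, end) pairs, end exclusive."""
    variants = sorted(variant_pos)
    excluded = sorted(excluded_pos)
    deserts: List[Tuple[int, int]] = []
    previous_was_desert = False
    for start in range(0, seq_len - window + 1, step):
        end = start + window
        # Count positions p with start <= p < end by binary search.
        n_var = bisect_left(variants, end) - bisect_left(variants, start)
        n_exc = bisect_left(excluded, end) - bisect_left(excluded, start)
        is_desert = n_var <= max_variants and n_exc <= max_excluded
        if is_desert and previous_was_desert:
            deserts[-1] = (deserts[-1][0], end)  # extend the current stretch
        elif is_desert:
            deserts.append((start, end))         # open a new stretch
        previous_was_desert = is_desert
    return deserts


def is_desert_gene(gene_start: int, gene_end: int,
                   deserts: List[Tuple[int, int]],
                   min_frac: float = 0.9) -> bool:
    """True if at least min_frac of the gene lies within a single desert."""
    gene_len = gene_end - gene_start
    for d_start, d_end in deserts:
        overlap = min(gene_end, d_end) - max(gene_start, d_start)
        if overlap >= min_frac * gene_len:
            return True
    return False
```

Because the windows overlap by half their length, a single window with more than two variants splits a desert at both of its neighbours, so the reported stretches are conservative.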
To analyse the presence or absence of RefBeet genes within the genomes of other double haploid accessions, Illumina paired-end reads of KDHBv, UMSBv, YTiBv, and YMoBv were mapped against the RefBeet assembly using BWA v0.5.9 (3 edits allowed). Only uniquely mapping reads were considered. Before inferring the absence or presence of a gene in the non-reference accessions, the coverage of RefBeet genes was confirmed by mapping Illumina data of the reference genotype: a CDS part of a RefBeet gene was ignored if less than 90% of its length was covered by RefBv reads (10.6% of the 27,421 RefBeet genes were ignored entirely). Genes were considered absent in one of the other accessions if less than 1% of the total retained CDS length was matched by reads from the non-reference accession. To detect accession-specific genes, the procedure was performed for each accession separately.
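The two-step coverage filter can be illustrated with a minimal sketch. It assumes that the number of bases covered by uniquely mapping reads has already been computed for each CDS part; the `CdsPart` record and the function `call_gene` are hypothetical and do not reproduce the scripts used in this study.

```python
from typing import List, NamedTuple


class CdsPart(NamedTuple):
    """Coverage summary of one CDS part of a RefBeet gene (hypothetical record)."""
    length: int       # length of the CDS part in bases
    ref_covered: int  # bases covered by RefBv reads (reference control)
    acc_covered: int  # bases covered by reads of the non-reference accession


def call_gene(parts: List[CdsPart], min_ref_frac: float = 0.9,
              max_absent_frac: float = 0.01) -> str:
    """Return 'ignored', 'absent' or 'present' for one gene in one accession."""
    # Step 1: keep only CDS parts that the reference reads themselves cover well.
    retained = [p for p in parts if p.ref_covered >= min_ref_frac * p.length]
    if not retained:
        return "ignored"
    # Step 2: compare accession coverage with the total retained CDS length.
    total_len = sum(p.length for p in retained)
    matched = sum(p.acc_covered for p in retained)
    return "absent" if matched < max_absent_frac * total_len else "present"
```

Filtering on the reference control first avoids calling a gene absent merely because its sequence cannot be covered by uniquely mapping reads in any accession.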
The phylogenetic tree of the sugar beet accessions was constructed based on a set of 2,112 single copy genes shared between the five sugar beet accessions. The protein sequences were used to generate multiple alignments based on which a phylogenetic tree was constructed (Extended Data Fig. 7b).
Sequence Read Archive
Sequencing raw data (genomic and transcript sequences) have been submitted to the Sequence Read Archive (SRA) under the study accession number SRP023136. The NCBI BioProject accession is PRJNA41497. The whole-genome shotgun assemblies have been deposited at DDBJ/EMBL/GenBank under the accessions AYZS00000000–AYZY00000000. The GenBank accession numbers KG026656–KG039419 were assigned to BAC end sequences and JY274675–JY473858 to fosmid end sequences generated in this study. Plant material for Beta vulgaris genotype KWS2320 and Beta maritima 9W_2101 (DeKBm) is available as seeds upon signing a material transfer agreement (MTA). A sugar beet website including a genome browser has been set up at http://bvseq.molgen.mpg.de, providing access to assemblies, annotations, gene models and variation data. The sugar beet phylome can be accessed at http://phylomeDB.org.
This work was supported by the BMBF grant “Verbundprojekt GABI BeetSeq: Erstellung einer Referenzsequenz für das Genom der Zuckerrübe (Beta vulgaris)”, FKZ 0315069A and 0315069B (to H.H. and B.W.) and by the BMBF grant “AnnoBeet: Annotation des Genoms der Zuckerrübe unter Berücksichtigung von Genfunktionen und struktureller Variabilität für Nutzung von Genomdaten in der Pflanzenbiotechnologie.”, FKZ 0315962 A, 0315962 B and 0315962 C (to B.W., H.H., and T.S.). We are grateful to M. Zehnsdorf, H. Kang, P. Viehoever, E. Castillo, A. Menoyo and C. Lange for library preparation and sequencing; to D. Datta for sequencing data base calling; to D. Kedra for discussions; and to D. Boyd and M. Isalan for language editing. We thank P. Pin, B. Briggs, and Strube Research for providing plant material and for discussions. We thank Roche for data generation on the 454 sequencing platform (cDNA and genomic 20 kb mate-pairs) and for early access to the Newbler genome assembly software.
Repeat families as detected by RepeatModeler along with the combined automatic and manual classification.
Predicted RefBeet genes and their functional annotation based on database searches and transfer from orthologs.
List of 715 putative resistance gene analogs (RGAs). Beta vulgaris (Bv) genes were classified based on the presence of RGA domains (columns A+B). In 30 additional Bv genes (column D) these domains were missing in exon parts, but the genes showed sequence homology with known RGAs from other plants.