Chenopodium quinoa (quinoa) is a highly nutritious grain identified as an important crop to improve world food security. Unfortunately, few resources are available to facilitate its genetic improvement. Here we report the assembly of a high-quality, chromosome-scale reference genome sequence for quinoa, which was produced using single-molecule real-time sequencing in combination with optical, chromosome-contact and genetic maps. We also report the sequencing of two diploids from the ancestral gene pools of quinoa, which enables the identification of sub-genomes in quinoa, and reduced-coverage genome sequences for 22 other samples of the allotetraploid goosefoot complex. The genome sequence facilitated the identification of the transcription factor likely to control the production of anti-nutritional triterpenoid saponins found in quinoa seeds, including a mutation that appears to cause alternative splicing and a premature stop codon in sweet quinoa strains. These genomic resources are an important first step towards the genetic improvement of quinoa.
Quinoa (Chenopodium quinoa Willd., 2n = 4x = 36) is a highly nutritious crop that is adapted to thrive in a wide range of agroecosystems. It was presumably first domesticated more than 7,000 years ago by pre-Columbian cultures and was known as the ‘mother grain’ of the Incan Empire1. Quinoa has adapted to the high plains of the Andean Altiplano (>3,500 m above sea level), where it has developed tolerance to several abiotic stresses2,3,4. Quinoa has gained international attention because of the nutritional value of its seeds, which are gluten-free, have a low glycaemic index5, and contain an excellent balance of essential amino acids, fibre, lipids, carbohydrates, vitamins, and minerals6. Quinoa has the potential to provide a highly nutritious food source that can be grown on marginal lands not currently suitable for other major crops. This potential was recognized when the United Nations declared 2013 as the International Year of Quinoa, this being one of only three times a plant has received such a designation.
Despite its agronomic potential, quinoa is still an underutilized crop7, with relatively few active breeding programs8. Breeding efforts to improve the crop for important agronomic traits are needed to expand quinoa production worldwide. To accelerate the improvement of quinoa, we present here the allotetraploid quinoa genome. We demonstrate the utility of the genome sequence by identifying a gene that probably regulates the presence of seed triterpenoid saponin content. Moreover, we sequenced the genomes of additional diploid and tetraploid Chenopodium species to characterize genetic diversity within the primary germplasm pool for quinoa and to understand sub-genome evolution in quinoa. Together, these resources provide the foundation for accelerating the genetic improvement of the crop, with the objective of enhancing global food security for a growing world population.
Sequencing, assembly and annotation
We sequenced and assembled the genome of the coastal Chilean quinoa accession PI 614886 (BioSample accession code SAMN04338310) using single-molecule real-time (SMRT) sequencing technology from Pacific Biosciences (PacBio) and optical and chromosome-contact maps from BioNano Genomics9 and Dovetail Genomics10. The assembly contains 3,486 scaffolds, with a scaffold N50 of 3.84 Mb and 90% of the assembled genome contained in 439 scaffolds (Table 1). The total assembly size of 1.39 gigabases (Gb) is similar to the reported size estimates of the quinoa genome (1.45–1.50 Gb (refs 11,12)). To combine scaffolds into pseudomolecules, an existing linkage map from quinoa13 was integrated with two new linkage maps. The resulting map (Extended Data Fig. 1) of 6,403 unique markers spans a total length of 2,034 centimorgans (cM) and consists of 18 linkage groups (Supplementary Table 7), corresponding to the haploid chromosome number of quinoa. Pseudomolecules (hereafter referred to as chromosomes, which are numbered according to a previously published single-nucleotide polymorphism (SNP) linkage map13) were created by anchoring 565 scaffolds to the linkage map, representing 1.18 Gb (85%) of the total assembly length (Table 1, Supplementary Data 1, Supplementary Data 2). This assembly represents a substantial improvement over the previously published quinoa draft genome sequence, which contained more than 24,000 scaffolds with 25% missing data14.
Predicted protein-coding and microRNA genes (Supplementary Table 4) were annotated using a combination of ab initio prediction and transcript evidence gathered from RNA sequenced from multiple tissues using both RNA-seq and PacBio isoform sequencing (Iso-Seq) approaches (Extended Data Fig. 2a). The annotation contains 44,776 gene models (Supplementary Table 2, Extended Data Fig. 2b), which is in line with sequenced tetraploid species15, and includes 33,365 genes with annotation edit distance (AED)16,17 values ≤ 0.3 (Extended Data Fig. 2c). Of the genome, 64% was found to be repetitive, including a large proportion of long terminal repeat (LTR) transposable elements (Supplementary Table 1). A majority (97.3%) of the 956 genes in the Plantae BUSCO dataset18 were identified in the annotation (Supplementary Table 3), which is suggestive of a complete assembly and annotation. The utility of the assembly, linkage maps, and annotation was demonstrated by mapping the betalain locus and identifying candidate genes underlying stem pigmentation (Supplementary Information 7.1.6), which is often used as a morphological marker in breeding programs.
Evolutionary history of quinoa
Quinoa resulted from the hybridization of ancestral A- and B-genome diploid species19. Single-gene sequencing studies previously identified pools of North American and Eurasian diploids as candidate sources of the A and B sub-genomes, respectively20,21,22, with hybridization occurring somewhere in North America. To understand genome structure and evolution in quinoa further, we sequenced, assembled, and annotated the A-genome diploid C. pallidicaule (commonly called cañahua or kañiwa) and the B-genome diploid C. suecicum21 (Fig. 1a, Table 1). A high proportion of orthologous gene pairs in quinoa showed similar rates of synonymous substitutions per synonymous site (Ks), indicative of a whole-genome duplication event (Fig. 1b). This probably represents the hybridization of ancestral diploid species, because a similar peak was not observed in C. pallidicaule or C. suecicum (Fig. 1b). Using mutation rates calculated for Arabidopsis thaliana23 and for core eukaryotes24, we estimate the tetraploidization to have occurred 3.3–6.3 million years ago.
Multiple interfertile tetraploid species have arisen from the ancestral tetraploid following hybridization, including C. berlandieri and C. hircinum, although the evolutionary relationships among quinoa and its diploid and tetraploid relatives remain unclear25. To begin to resolve these issues, we re-sequenced 15 additional quinoa samples representing the two major recognized groups of quinoa: highland and coastal (Supplementary Data 5). We also sequenced five accessions of C. berlandieri and one accession each of C. hircinum from the Pacific and Atlantic Andean watersheds (Supplementary Data 5). Phylogenetic analysis of these taxa indicates that North American C. berlandieri is the basal member of the species complex (Fig. 1c). Quinoa was thought to have been domesticated from C. hircinum in a single event, from which coastal quinoa was later derived (Fig. 1d, arrow 1); however, our sequencing data place a C. hircinum sample basal to coastal ecotypes (Fig. 1c), suggesting the possibility that quinoa was domesticated independently in highland and coastal environments (Fig. 1d, arrows 2a and 2b, respectively). Future analyses with deeper sampling of quinoa and C. hircinum will help clarify the relationship between C. hircinum and quinoa, as well as provide germplasm for breeding broadly adapted coastal quinoa cultivars for warm-season production. The SNPs identified between these accessions and the reference quinoa genome—a total of 7,809,381 (Extended Data Fig. 3, Supplementary Table 5), including 2,668,694 that are specific to quinoa—will be useful in assessing genetic diversity and identifying genomic regions associated with desirable traits.
Analysis of sub-genome structure
By mapping sequencing reads from C. pallidicaule and C. suecicum onto the quinoa scaffold assembly, and by performing BLASTN searches of each diploid against the quinoa assembly, 156 and 410 quinoa scaffolds (totalling 202.6 and 646.3 Mb) were assigned to the A and B sub-genomes, respectively (Fig. 2a, Supplementary Data 6). A mini-satellite repeat (18-24J) previously shown to be more abundant in the B sub-genome26 is over-represented in scaffolds assigned to the B sub-genome (Supplementary Data 6). Nine chromosomes were assigned to each sub-genome (chromosomes hereafter designated as CqA or CqB, followed by the chromosome number), with the B sub-genome accounting for a larger percentage of both the genetic (1,087 cM) and physical (660 Mb) sizes than the A sub-genome (946 cM, 524 Mb). This result was not unexpected, given the differences in the estimated genome sizes of C. suecicum (815 Mb) and C. pallidicaule (452 Mb) based on k-mer analyses.
Visualization of the chromosomal locations of 5,807 homoeologous gene pairs revealed a high degree of synteny between the A and B sub-genomes (Fig. 2b). A small number of homoeologous pairs (3.1%) mapped within the same sub-genome, suggesting that recombination and chromosomal rearrangements have occurred between the A and B sub-genomes. For example, we identified homoeologous A and B sub-genome regions located in the B sub-genome chromosomes CqB05 and CqB03. The genes in the region of ~70–72 Mb of CqB03 are phylogenetically more similar to the A-genome diploid C. suecicum and therefore probably originated from the A sub-genome chromosome CqA12 (Fig. 2c). We also found evidence of large chromosomal rearrangements within sub-genomes, complicating the assignment of homoeologous chromosome pairs. For example, orthologue analysis clearly identifies CqB05 and CqA12 as homoeologous, although the same analysis is much more complicated with other chromosomes (Fig. 2b). To clarify these relationships, we identified syntenic regions between chromosomes of the diploid Beta vulgaris27 (n = 9) and the A and B sub-genome chromosomes of quinoa (Fig. 2d). These results indicate that CqA02 and CqA04 are orthologous to B. vulgaris chromosomes 8 and 2 (Bvchr8 and Bvchr2), respectively, whereas CqB01 appears to be the result of a chromosome fusion. Likewise, CqA07 appears to be the result of a fusion between ancestral chromosomes orthologous to Bvchr3 and Bvchr7.
Analysis of sub-genome content
We used OrthoMCL28 to identify clusters of orthologous genes in related species in Amaranthaceae (Extended Data Fig. 4), and specifically investigated the retention and loss of orthologous genes in quinoa and the three diploid species C. pallidicaule, C. suecicum and B. vulgaris (Fig. 3a, Supplementary Table 6). Using sets of genes for which only one copy could be found in quinoa and in each of the diploid species, we found a similar number (1,031 and 849) of genes lost from the A and B sub-genomes, respectively (Fig. 3b). The previously discussed set of 5,807 homoeologous gene pairs in quinoa are present as single copies in each of the diploids (Figs 2b, 3b) and therefore represent a core set of single-copy genes retained in each genome and sub-genome. We also investigated genes retained in multiple copies. We found that quinoa, like C. rubrum29 and B. vulgaris30, contains two genes that are orthologous to the A. thaliana gene FT, which regulates flowering time. Quinoa contains two homoeologous copies of FT2 and three of FT1, owing to a local tandem duplication (Fig. 3c, Extended Data Fig. 5). FT is known to promote flowering in A. thaliana, and functional orthologues have been found in other species31; however, B. vulgaris was found to contain a second FT gene that acts antagonistically by repressing flowering before vernalization. Future functional studies will help determine the function of these duplicated genes in quinoa, which, unlike B. vulgaris, does not require vernalization.
Mechanisms underlying saponin production
Quinoa seeds contain a mixture of triterpene glycosides called saponins32. Although saponins may be beneficial for plant growth (for example, by deterring herbivory33,34), they must be removed before human consumption as they are haemolytic and produce a bitter flavour. Because this process is costly, is often water-intensive, and can reduce the nutritional value of the seeds35, the development of saponin-free lines is a major breeding objective in quinoa8. We found that saponins accumulate in the seed pericarp (Fig. 4a, Extended Data Fig. 6) between 20 and 24 days after anthesis (Fig. 4b), eventually accounting for 4% (w/w) of the mature seed mass (Supplementary Information 8.1). We identified and annotated 43 different saponins in the seeds of the reference sample (Supplementary Table 9), adding to the almost 100 different saponins that have been previously identified in different samples of quinoa32,36.
Naturally occurring sweet quinoa strains that contain very low levels of saponins are present within the quinoa germplasm37, although the underlying genes regulating the absence of saponins in these lines are unknown. To identify these genes, we performed linkage mapping and bulk segregant analysis (BSA) using two populations segregating for the presence of saponins in the seeds: Kurmi (sweet) × 0654 (bitter), and Atlas (sweet) × Carina Red (bitter). Consistent with reports from other populations38, segregation ratios in these populations indicate that the presence of seed saponins is controlled by a single gene, with the presence of saponins being dominant (71 bitter and 21 sweet in Kurmi × 0654; 567 bitter and 175 sweet in Atlas × Carina Red). We note that quantitative and qualitative differences exist in the saponins identified in bitter lines (Extended Data Fig. 7, Supplementary Table 8) and that the presence and absence of saponins was correlated with differences in seed coat thickness, with bitter lines having significantly thicker seed coats than sweet lines (Extended Data Fig. 8).
Linkage mapping and BSA in each population identified the same region on CqB16 that distinguishes the bitter and sweet lines (Fig. 4c). Frequencies of the sweet allele in both populations reached 100% for markers located in CqB16 on scaffold 3489. We investigated the genes in a 700-kb window surrounding this region of 100% sweet allele frequency. Of the 54 annotated genes in this region (Supplementary Data 7), two are similar to genes previously shown to play a role in saponin biosynthesis. Specifically, AUR62017204 and AUR62017206 are neighbouring genes annotated as basic helix–loop–helix (bHLH) transcription factors sharing homology (Extended Data Fig. 9a) with the class IVa bHLH genes that are known to regulate triterpenoid biosynthesis in Medicago trancatula39. In M. truncatula, overexpression of the triterpene saponin biosynthesis activating regulator 1 (TSAR1) and TSAR2 bHLH transcription factors was recently shown to increase the expression of genes in the triterpenoid biosynthetic pathway, resulting in increased accumulation of triterpene saponins39. TSAR1 and TSAR2 were also found to bind to the DNA motif 5′-CACGHG-3′ (where H can be A, C, or T)39. In quinoa, we found that AUR62017206 (hereafter TSAR-like 2, TSARL2) was expressed in root tissue but not in flowers or immature seeds, whereas AUR62017204 (hereafter TSARL1) was almost exclusively expressed in seeds, with significantly lower expression levels in sweet lines (Supplementary Data 8). We identified the DNA motif bound by M. truncatula TSAR1 and TSAR2 within 2 kb upstream of the start codon in several saponin biosynthetic pathway genes in quinoa (Fig. 4d). Expression levels of these genes and several other genes in the saponin biosynthetic pathway were significantly downregulated in sweet lines (Fig. 4d, Supplementary Data 8). Together, these results suggest that TSARL1 might be a functional TSAR orthologue, although whether this is due to shared ancestry or convergent evolution is unclear.
The TSARL1 transcript was alternatively spliced in the sweet progeny of Kurmi and 0654. A SNP in the last position of exon 3 (G2078C) co-segregates with the presence of saponins in the Kurmi × 0654 progeny. The G2078C SNP alters the canonical intron/exon splice boundary (Fig. 4e), probably leading to the alternative splicing at an upstream cryptic splice site in the sweet lines (Fig. 4e). This alternative splicing of TSARL1 results in a premature stop codon (Extended Data Fig. 9b) and a truncated protein that modelling predicts to be compromised in its ability to form homodimers and/or to bind DNA (Extended Data Fig. 9b–d, Supplementary Information 8.8), which are both necessary for regulation of transcription. All bitter strains in our re-sequencing pool share the same allele (G) found in the bitter progeny of Kurmi and 0654, whereas all sweet strains (Chucapaca, G205-95DK, Salcedo INIA, as well as the mapping population parents Kurmi and Atlas) but one (Pasankalla) contain the same allele (G2078C) as the sweet progeny. Notably, however, although the G2078C allele is present in the sequenced Atlas line, none of the sweet progeny in the Atlas × Carina Red population were found to have the G2078C allele. Additional sequencing of individual plants of the Atlas variety revealed a low level of heterogeneity within the variety for the TSARL1 gene, with some plants containing the G2078C allele and others containing sequence insertions (Supplementary Information 7.2.4). Thus, it is likely that the Atlas plant used in the cross with Carina Red—which, importantly, was not the same plant used for sequencing—possessed a mutation other than the G2078C allele. Indeed, we found strong evidence of insertions in and around the TSARL1 gene in all the sweet progeny of the Atlas × Carina Red population (Fig. 4e, Extended Data Fig. 10). In particular, two exonic insertions in TSARL1 in the sweet progeny probably inactivate the gene and result in a sweet phenotype. The identification of multiple, independent mutations in TSARL1 that co-segregate with the sweet phenotype strongly suggests that this gene regulates the presence and absence of saponins in quinoa seeds. The early steps in the saponin biosynthetic pathway that are presumably regulated by TSARL1 are shared by other pathways, including sterol biosynthesis. Inactivation of these other pathways could be detrimental to normal plant growth, although we see no noticeable phenotypic differences between bitter and sweet siblings. It is possible that precursors needed for sterol biosynthesis are provided by the methylerythritol 4-phosphate (MEP) pathway in the plastid40,41. Future functional studies will help clarify these issues.
As an emerging international crop, quinoa has great potential to enhance global food security. The high-quality reference genome assembly presented here will accelerate improvements of quinoa. Major breeding objectives for quinoa improvement include the development of shorter plants with fewer branches and more compact seed heads, increased heat and biotic stress tolerance, and the introgression of the sweet phenotype into commercial varieties. The identification of the likely causative mutation underlying the sweet phenotype not only provides insights into triterpenoid saponin biosynthesis, but also enables accelerated breeding of sweet commercial varieties using marker-assisted selection. The diversity present in the primary gene pool of quinoa, which we have begun to characterize, will also help direct future breeding strategies. The resources presented here also help to make quinoa a useful model for studying polyploid genome evolution and mechanisms of abiotic stress tolerance, in particular salinity tolerance.
No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.
Quinoa sequencing and assembly
We sequenced Chenopodium quinoa Willd. (quinoa) accession PI 614886 (BioSample accession code SAMN04338310; also known as NSL 106399 and QQ74). DNA was extracted from leaf and flower tissue of a single plant, as described in the “Preparing Arabidopsis Genomic DNA for Size-Selected ~20 kb SMRTbell Libraries” protocol (http://www.pacb.com/wp-content/uploads/2015/09/Shared-Protocol-Preparing-Arabidopsis-DNA-for-20-kb-SMRTbell-Libraries.pdf). DNA was purified twice with Beckman Coulter Genomics AMPure XP magnetic beads and assessed by standard agarose gel electrophoresis and Thermo Fisher Scientific Qubit Fluorometry. 100 Single-Molecule Real-Time (SMRT) cells were run on the PacBio RS II system with the P6-C4 chemistry by DNALink (Seoul). De novo assembly was conducted using the smrtmake assembly pipeline (https://github.com/PacificBiosciences/smrtmake) and the Celera Assembler, and the draft assembly was polished using the quiver algorithm.
DNA was also sequenced using an Illumina HiSeq 2000 machine. For this, DNA was extracted from leaf tissue of a single soil-grown plant using the Qiagen DNeasy Plant Mini Kit. 500-bp paired-end (PE) libraries were prepared using the NEBNext Ultra DNA Library Prep Kit for Illumina. Sequencing reads were processed with Trimmomatic (v0.33)42, and reads <75 nucleotides in length after trimming were removed from further analysis. The remaining high-quality reads were assembled with Velvet (v1.2.10)43 using a k-mer of 75.
Integrating BioNano optical maps with the PacBio assembly
High-molecular-weight DNA was isolated and labelled from leaf tissue of three-week old quinoa plants according to standard BioNano protocols, using the single-stranded nicking endonuclease Nt.BspQI. Labelled DNA was imaged automatically using the BioNano Irys system and de novo assembled into consensus physical maps using the BioNano IrysView analysis software. The final de novo assembly used only single molecules with a minimum length of 150 kb and eight labels per molecule. PacBio-BioNano hybrid scaffolds were identified using IrysView’s hybrid scaffold alignment subprogram.
Chicago library preparation and sequencing
Using the same DNA prepared for PacBio sequencing, a Chicago library was prepared as described previously10. The library was sequenced on an Illumina HiSeq 2500.
Scaffolding the PacBio and BioNano assemblies with HiRise
Chicago sequence data (in FASTQ format) was used to scaffold the PacBio-BioNano hybrid assembly using HiRise, a software pipeline designed specifically for using Chicago data to assemble genomes10. Chicago library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu). The separations of Chicago read pairs mapped within draft scaffolds were analysed by HiRise to produce a likelihood model, and the resulting likelihood model was used to identify putative mis-joins and score prospective joins.
Kurmi × 0654 population mapping and genetic marker analysis
A population was developed by crossing Kurmi (green, sweet) and 0654 (red, bitter). Homozygous high- and low-saponin F2 lines were identified by planting 12 F3 seeds derived from each F2 line, harvesting F4 seed from these F3 plants, and then performing foam tests on the F4 seed. Phenotyping was validated using gas chromatography/mass spectrometry (GC/MS). RNA was extracted from inflorescences containing a mixture of flowers and seeds at various stages of development from the parents and 45 individual F3 progeny. RNA extraction and Illumina sequencing were performed as described above. Sequencing reads from all lines were trimmed using Trimmomatic and mapped to the reference assembly using TopHat44, and SNPs were called using SAMtools mpileup (v1.1)45.
For linkage mapping, markers were assigned to linkage groups on the basis of the grouping by JoinMap v4.1. Using the maximum likelihood algorithm of JoinMap, the order of the markers was determined; using this as start order and fixed order, regression mapping in JoinMap was used to determine the cM distances.
Genes differentially expressed between bitter and sweet lines and between green and red lines were identified using default parameters of the Cuffdiff function of the Cufflinks program46.
Atlas × Carina Red population mapping and genetic marker analysis
A second mapping population was developed by crossing Atlas (sweet) and Carina Red (bitter). Bitter and sweet F2 lines were identified by performing foam and taste tests on the F3 seed. DNA sequencing was performed with DNA from the parents and 94 sweet F2 lines, as described above, and sequencing reads were mapped to the reference assembly using BWA. SNPs were called in the parents and in a merged file containing all combined F2 lines.
Genotype calls were generated for the 94 F2 genotypes by summing up read counts over a sliding window of 500 variants, at all variant positions for which the parents were homozygous and polymorphic. Over each 500-variant stretch, all reads with Atlas alleles were summed, and all reads with the Carina Red allele were summed. Markers were assigned to linkage groups using JoinMap, with regression mapping used to obtain the genetic maps per linkage group.
Integrated linkage map
The Kurmi × 0654 and Atlas × Carina Red maps were integrated with the previously published quinoa linkage map13, with the Kurmi × 0654 map being used as the reference for the positions of anchor markers and scaling. We selected markers from the same scaffold that were in the same 10,000-bp bin in the assembly. The anchor markers on the alternative map received the position of the Kurmi × 0654 map anchor marker in the integrated map. This process was repeated with anchor markers at the 100,000-bp bin level. The assumption is that at the 100,000-bp bin level recombination should essentially be zero. On this level, a regression of cM position on both maps yielded R2 values >0.85 and often >0.9, so the regression line can easily be used for interpolating the positions of the alternative map towards the corresponding position on the Kurmi × 0654 map. All Kurmi × 0654 markers went into the integrated map on their original position.
Pseudomolecules were assembled by concatenating scaffolds based on their order and orientation as determined from the integrated linkage map. An AGP (‘A Golden Path’) file was made that describes the positions of the scaffold-based assembly in coordinates of the pseudomolecule assembly, with 100 ‘N’s inserted between consecutive scaffolds. Based on these coordinates, custom scripts were used to generate the pseudomolecule assembly and to recoordinate the annotation file.
Sequencing and assembly of C. pallidicaule and C. suecicum
DNA was extracted from C. pallidicaule (PI 478407) and C. suecicum (BYU 1480) and was sent to the Beijing Genomic Institute (BGI, Hong Kong) where one 180-bp PE library and two mate-pair libraries with insert sizes of 3 and 6 kb were prepared and sequenced on the Illumina HiSeq platform to obtain 2 × 100-bp reads for each library. The generated reads were trimmed using the quality-based trimming tool Sickle (https://github.com/najoshi/sickle). The trimmed reads were then assembled using the ALLPATHS-LG assembler47, and GapCloser v1.1248 was used to resolve N spacers and gap lengths produced by the ALLPATHS-LG assembler.
Repeat families found in the genome assemblies of quinoa, C. pallidicaule and C. suecicum (see Supplementary Information 3) were first independently identified de novo and classified using the software package RepeatModeler49. RepeatMasker50 was used to discover and identify repeats within the respective genomes.
AUGUSTUS51 was used for ab initio gene prediction, using model training based on coding sequences from Amaranthus hypochondriacus, Beta vulgaris, Spinacia oleracea and Arabidopsis thaliana. RNA-seq and isoform sequencing reads generated from RNA of different tissues were mapped onto the reference genome using Bowtie 2 (ref. 52) and GMAP53, respectively. Hints with locations of potential intron–exon boundaries were generated from the alignment files with the software package BAM2hints in the MAKER package54. MAKER with AUGUSTUS (intron–exon boundary hints provided from RNA-seq and isoform sequencing) was then used to predict genes in the repeat-masked reference genome. To help guide the prediction process, peptide sequences from B. vulgaris and the original quinoa full-length transcript (provided as EST evidence) were used by MAKER during the prediction. Genes were characterized for their putative function by performing a BLAST search of the peptide sequences against the UniProt database. PFAM domains and InterProScan ID were added to the gene models using the scripts provided in the MAKER package.
The following quinoa accessions were chosen for DNA re-sequencing: 0654, Ollague, Real, Pasankalla (BYU 1202), Kurmi, CICA-17, Regalona (BYU 947), Salcedo INIA, G-205-95DK, Cherry Vanilla (BYU 1439), Chucapaca, Ku-2, PI 634921 (Ames 22157), Atlas and Carina Red. The following accessions of C. berlandieri were sequenced: var. boscianum (BYU 937), var. macrocalycium (BYU 803), var. zschackei (BYU 1314), var. sinuatum (BYU 14108), and subsp. nuttaliae (‘Huauzontle’). Two accessions of C. hircinum (BYU 566 and BYU 1101) were also sequenced. All sequencing was performed with an Illumina HiSeq 2000 machine, using either 125-bp (Atlas and Carina Red) or 100-bp (all other accessions) paired-end libraries. Reads were trimmed using Trimmomatic and mapped to the reference assembly using BWA (v0.7.10)55. Read alignments were manipulated with SAMtools, and the mpileup function of SAMtools was used to call SNPs.
Identification of orthologous genes
Orthologous and paralogous gene clusters were identified using OrthoMCL28. Recommended settings were used for all-against-all BLASTP comparisons (Blast+ v2.3.056) and OrthoMCL analyses. Custom Perl scripts were used to process OrthoMCL outputs for visualization with InteractiVenn57.
Using OrthoMCL, orthologous gene sets containing two copies in quinoa and one copy each in C. pallidicaule, C. suecicum, and B. vulgaris were identified. In total, 7,433 gene sets were chosen, and their amino acid sequences were aligned individually for each set using MAFFT58. The 7,433 alignments were converted into PHYLIP format files by the seqret command in the EMBOSS package59. Individual gene trees were then constructed using the maximum likelihood method using proml in PHYLIP60.
In addition, the genomic variants of all 25 sequenced taxa (Supplementary Data 5) relative to the reference sequence were called based on the mapped Illumina reads in 25 BAM files using SAMtools. To call variants in the reference genome (PI 614886), Illumina sequencing reads were mapped to the reference assembly. Variants were then filtered using VCFtools61 and SAMtools, and the qualified SNPs were combined into a single VCF file which was used as an input into SNPhylo62 to construct the phylogenetic relationship using maximum likelihood and 1,000 bootstrap iterations.
To identify FT homologues, the protein sequence from the A. thaliana flowering time gene FT was used as a BLAST query. Filtering for hits with an E value <1 × e−3 and with RNA-seq evidence resulted in the identification of four quinoa proteins. One quinoa protein (AUR62013052) appeared to be comprised of two tandem repeats which were separated for the purposes of phylogenetic analysis. For the construction of the phylogenetic tree, protein sequences from these five quinoa FT homologues were aligned using Clustal Omega63 along with two B. vulgaris (gene models: BvFT1-miuf.t1, BvFT2-eewx.t1) and one A. thaliana (AT1G65480.1) homologue. Phylogenetic analysis was performed with MEGA64 (v6.06). The JTT model was selected as the best fitting model. The initial phylogenetic tree was estimated using the neighbour joining method (bootstrap value = 50, Gaps/ Missing Data Treatment = Partial Deletion, Cutoff 95%), and the final tree was estimated using the maximum likelihood method with a bootstrap value of 1,000 replicates. The syntenic relationships between the coding sequences of the chromosomal regions surrounding these FT genes were visualized using the CoGE65 GEvo tool and the Multi-Genome Synteny Viewer66.
The alignment of bHLH domains was performed with Clustal Omega63, using sequences from Mertens et al.39. The phylogeny was inferred using the maximum likelihood method based on the JTT matrix-based model67. Initial trees for the heuristic search were obtained automatically by applying Neighbour-Join and BioNJ algorithms to a matrix of pairwise distances estimated using a JTT model, and then selecting the topology with superior log likelihood value. All positions containing gaps and missing data were eliminated.
Distinguishing and analysing the quinoa sub-genomes
Trimmed PE Illumina sequencing reads that were used for the de novo assembly of C. suecicum and C. pallidicaule were mapped onto the reference quinoa genome using the default settings of BWA. For every base in the quinoa genome, the depth coverage of properly paired reads from the C. suecicum and C. pallidicaule mapping was calculated using the program GenomeCoverage in the BEDtools package68. A custom Perl script was used to calculate the percentage of each scaffold with more than 5× coverage from both diploids. Scaffolds were assigned to the A or B sub-genome if >65% of the bases were covered by reads from one diploid and <25% of the bases were covered by reads from the other diploid. The relationship between the quinoa sub-genomes and the diploid species C. pallidicaule and C. suecicum was presented in a circle proportional to their sizes using Circos69. Orthologous regions in the three species were identified using BLASTN searches of the quinoa genome against each diploid genome individually. Single top BLASTN hits longer than 8 kb were selected and presented as links between the quinoa genome assembly (arranged in chromosomes, see Supplementary Information 7.3) and the two diploid genome assemblies on the Circos plot (Fig. 2a).
Sub-genome synteny was analysed by plotting the positions of homoeologous pairs of A- and B-sub-genome pairs within the context of the 18 chromosomes using Circos. Synteny between the sub-genomes and B. vulgaris was assessed by first creating pseudomolecules by concatenating scaffolds which were known to be ordered and oriented within each of the nine chromosomes. Syntenic regions between these B. vulgaris chromosomes and those of quinoa were then identified using the recommended settings of the CoGe SynMap tool70 and visualized using MCScanX71 and VGSC72. For the purposes of visualization, quinoa chromosomes CqB05, CqA08, CqB11, CqA15 and CqB16 were inverted.
Quinoa seeds were embedded in a 2% carboxymethylcellulose solution and frozen above liquid nitrogen. Sections of 50 μm thickness were obtained using a Reichert-Jung Frigocut 2800N, modified to use a Feather C35 blade holder and blades at −20 °C using a modified Kawamoto method73. A 2,5-dihydroxybenzoic acid (Sigma-Aldrich) matrix (40 mg ml−1 in 70% methanol) was applied using a HTX TM-Sprayer (HTX Technologies LLC) with attached LC20-AD HPLC pump (Shimadzu Scientific Instruments). Sections were vacuum dried in a desiccator before analysis. The optical image was generated using an Epson 4400 Flatbed Scanner at 4,800 d.p.i. For mass spectrometric analyses, a Bruker SolariX XR with 7T magnet was used. Images were generated using Bruker Compass FlexImaging 4.1. Data were normalized to the TIC, and brightness optimization was employed to enhance visualization of the distribution of selected compounds. Individual spectra were recalibrated using Bruker Compass DataAnalysis 4.4 to internally lock masses of known DHB clusters: C14H9O6 = 273.039364 and C21H13O9 = 409.055408 m/z. Accurate mass measurements for individual saponins and identified compounds were run using continuous accumulation of selected ions (CASI) using mass windows of 50–100 m/z and a transient of 4 megaword generating a transient of 2.93 s providing a mass resolving power of approximately 390,000 at 400 m/z. Lipids were putatively assigned by searching the LipidMaps database74 (http://www.lipidmaps.org) and lipid class confirmed by collision-induced dissociation using a 10 m/z window centred around the monoisotopic peak with collision energy of between 15–20 V.
Quinoa flowers were marked at anthesis, and seeds were sampled at 12, 16, 20 and 24 days after anthesis. A pool of five seeds from each time point was analysed using GC/MS.
Quantification of saponins was performed indirectly by quantifying oleanolic acid (OA) derived from the hydrolysis of saponins extracted from quinoa seeds. Derivatized solution was analysed using single quadrupole GC/MS system (Agilent 7890 GC/5975C MSD) equipped with EI source at ionisation energy of 70 eV. Chromatography separation was performed using DB-5MS fused silica capillary column (30m × 0.25 mm I.D., 0.25 μm film thickness; Agilent J&W Scientific), chemically bonded with 5% phenyl 95% methylpolysiloxane cross-linked stationary phase. Helium was used as the carrier gas with constant flow rate of 1.0 ml min−1. The quantification of OA in each sample was performed using a standard curve based on standards of OA.
Specific, individual saponins were identified in quinoa using a preparation of 20 mg of seeds performed according a modified protocol from Giavalisco et al.75. Samples were measured with a Waters ACQUITY Reversed Phase Ultra Performance Liquid Chromatography (RP-UPLC) coupled to a Thermo-Fisher Exactive mass spectrometer, which consists of an electrospray ionisation source and an Orbitrap mass analyser. A C18 column was used for the hydrophilic measurements. Chromatograms were recorded in full-scan MS mode (mass range, 100 −1,500). Extraction of the LC/MS data was accomplished with the software REFINER MS 7.5 (GeneData).
SwissModel76 was used to produce homology models for the bHLH region of AUR62017204, AUR62017206 and AUR62010677. RaptorX77 was used for prediction of secondary structure and disorder. QUARK78 was used for ab initio modelling of the C-terminal domain, and the DALI server79 was used for 3D homology searches of this region. Models were manually inspected and evaluated using the PyMOL program (http://pymol.org).
Data availability statement
The genome assemblies and sequence data for C. quinoa, C. pallidicaule and C. suecicum were deposited at NCBI under BioProject codes PRJNA306026, PRJNA326220 and PRJNA326219, respectively. Additional accessions numbers for deposited data can be found in Supplementary Data 9. The quinoa genome can also be accessed at http://www.cbrc.kaust.edu.sa/chenopodiumdb/ and on the Phytozome database (http://www.phytozome.net/).
Research reported in this publication was supported by the King Abdullah University of Science and Technology (KAUST), by USDA/NIFA-REEIS grant #2012-51300-20100 (WSU/BYU), by NSF grant #1339412, and by a grant from the German Research Foundation, DFG (grant no. JU205/24-1). We thank T. Ramaraj from the National Center for Genomic Resources (Santa Fe) for his bioinformatics assistance, O. Mohammed Eid Alharbi at the KAUST Imaging and Characterization Laboratory for generating SEM images, K. Zemmouri at the KAUST Greenhouse for plant care, and H. Ho (Heno) Hwang and I. D. Gromicho at the KAUST Academic Writing Department for assistance with quinoa illustrations and photographs. We also thank H. Štorchová, F. Morton, D. Bertero, F. Fuentes and D. Brenner of USDA-ARS-NPGS for their assistance in providing seeds. Metabolite imaging was conducted at Metabolomics Australia (School of BioSciences, The University of Melbourne, Australia), a NCRIS initiative under Bioplatforms Australia Pty Ltd.
Extended data figures
This file contains information for the integrated linkage map, including Excel spreadsheet tabs for markers in the map and their positions, scaffolds assigned to each linkage group, scaffolds mapping to more than one linkage group, unmapped scaffolds, and average nucleotides per centimorgan for each linkage group.
A table of genes that are differentially expressed between green- and red-stemmed plants, as output from the CuffDiff program.
A portion of the quinoa genome annotation, containing genes in a window surrounding the mapped betalain (stem colour) locus.
Information about the lines used for sequencing.
The sub-genome assignment of scaffolds, based on the number of bases onto which reads from the A (C. pallidicuale) and B (C. suecicum) diploids were mapped, as well as the number of BLAST hits from the 18-24J repeat which is overrepresented in B genomes.
A portion of the quinoa genome annotation, containing genes in a window surrounding the mapped saponin production locus.
This table shows genes that are differentially expressed between bitter and sweet plants, as output from the CuffDiff program.
Accession numbers for all data deposited on public databases.
About this article
Molecular Cytogenetics (2018)