The diversity of quinoa morphological traits and seed metabolic composition

Quinoa (Chenopodium quinoa Willd.) is an herbaceous annual crop of the amaranth family (Amaranthaceae). It is increasingly cultivated for its nutritious grains, which are rich in protein and essential amino acids, lipids, and minerals. Quinoa exhibits a high tolerance to various abiotic stresses, including drought and salinity, which supports its cultivation under climate change conditions. The use of quinoa grains is compromised by anti-nutritional saponins, a terpenoid class of secondary metabolites deposited in the seed coat. Their removal before consumption requires extensive washing, an economically and environmentally unfavorable process; alternatively, their accumulation can be reduced through breeding. In this study, we analyzed the seed metabolomes, including amino acids, fatty acids, and saponins, from 471 quinoa cultivars, including two related species, by liquid chromatography-mass spectrometry. Additionally, we determined a large number of agronomic traits including biomass, flowering time, and seed yield. The results revealed considerable diversity between genotypes and provide a knowledge base for future breeding or genome editing of quinoa.


Background & Summary
Quinoa (Chenopodium quinoa Willd.) is increasingly attracting global attention because of its unusually high grain nutritional value including high protein content, the composition and quantity of lipids, a good balance of essential amino acids, as well as isoflavones and interesting antioxidant functional properties [1][2][3] . Quinoa was first domesticated in the Lake Titicaca basin about 7,000 years ago, from where it spread to other regions in South America and the world 4 .
An agriculturally important asset of quinoa is its remarkable ability to adapt to diverse agroecological zones, which allows growth in hot dry deserts and in tropical areas with up to 88% relative humidity, from −8 °C to 40 °C 5, and from sea level to mountainous regions 4,000 m high. Its adaptability to sodic and alkaline soils is also remarkable, allowing cultivation from pH 4.5 to 9.0 6. Quinoa is a highly drought-tolerant crop that fares well in regions with less than 200 mm of yearly rainfall (7 and references therein). It tolerates high salinity and is considered a facultative halophyte [8][9][10]. In 2013, the Food and Agriculture Organization (FAO) declared the 'International Year of Quinoa' in recognition of the capacity of the crop to help mitigate hunger and malnutrition in food-insecure countries, and in recognition of the ancestral efforts of the Andean people to preserve quinoa as a crop (http://www.fao.org/quinoa-2013/en/).
Although quinoa grains have an exceptional nutritional value, the seed coat typically contains bitter-tasting and potentially anti-nutritional saponins 11 . Therefore, quinoa seeds require substantial processing (water-extensive washing) to remove saponins before consumption. Reduction of saponins has been a breeding target and in the future may also be achieved with biotechnological methods, such as genome editing. The quinoa saponins occur predominantly in the form of triterpenoid glycosides [12][13][14] . Their large structural diversity renders analyses non-trivial 15 .
The biological functions of saponins in quinoa remain to be investigated. Saponins may play a role in seed germination, and in deterring birds or fungal infections (reviewed in 16). Evidence indicates that not only the total amount of saponins is regulated (e.g., by the bHLH transcription factor CqTSARL1) 17, but also the saponin profile 17. However, to date, seed saponin profiles have been determined for only a few quinoa genotypes 17. As some saponins may even be beneficial to human health 18, the diversity in saponin composition offers a great resource for breeding new and healthier quinoa cultivars.
Our study reports the variability of the metabolome of mature quinoa seeds across a large number of genotypes (471 in total; Supplementary Dataset 1, available at figshare 19). Additionally, we determined agronomic traits such as plant height, total biomass, panicle density, days to flowering, and seed weight (Fig. 1 and Supplementary Table 1 19). The experimental pipeline employed for liquid chromatography-mass spectrometry (LC-MS)-based metabolome analysis of seeds is represented in Fig. 2. Metabolites were annotated using a library of authentic reference compounds and in-source fragmentation patterns, and the data are reported in compliance with established standards 20 (Supplementary Table 2 19 and MetaboLights database, MTBLS2382). We detected and quantified 400 seed metabolites representing diverse chemical classes: 37 triterpenoid saponins, 14 flavonoids, 15 amino acids, 117 dipeptides, 126 lipids, and 91 other metabolites. To explore the variation between genotypes, principal component analysis (PCA) and hierarchical cluster analysis (HCA) with a heatmap were performed on metabolic and phenotypic traits (Fig. 3A,B). The heatmap revealed considerable differences in metabolite abundances across genotypes (Fig. 3A), which was confirmed by PCA (Fig. 3B), in which the first and second components explained 41.1% and 19.9%, respectively, of saponin variance. PCA identified 21 genotypes, mostly originating from Peru/Latin America, whose position in the PCA plot largely correlates with their geographical origin (Supplementary Table 3). Finally, we investigated correlations between and within different metabolite classes and phenotypic traits (Fig. 3C). This showed that many metabolites are highly associated within the network. However, no strong, significant correlations were found between saponin content and morphological traits, indicating that genotypes with low saponin content can be selected in the future by breeding or genome editing without an impact on yield-related traits.

Methods
Quinoa germplasm. Four hundred and sixty-eight quinoa genotypes, plus one accession of djulis (Chenopodium formosanum Koidz.) and one of goosefoot (Chenopodium album L.), were selected for the field experiment (Supplementary Table 1 19). Seeds from quinoa accession QQ74, for which a reference genome sequence is available 17, were included as well. The source of the seeds is given in Supplementary Table 1 19. Seeds of the different genotypes were propagated at the International Center for Biosaline Agriculture (ICBA) fields in 2016 and 2017, and stored in a cold chamber at 2 °C and a relative humidity of 30%. The seeds were sown by hand by dibbling 2-3 seeds per location into the ground, to a depth of 1-2 cm near the dripper. Plants were thinned after about two weeks by removing unusually weak or strong individuals to leave one plant per location.
Experimental site and design. Experiments were carried out at the field research facilities of the International Center for Biosaline Agriculture, ICBA (N 25° 05.847; E 055° 23.464), Dubai, United Arab Emirates, from November 2016 to April 2017. The soils at the ICBA experimental fields are sandy in texture, that is, fine sand (sand 98%, silt 1%, and clay 1%), calcareous (50-60% CaCO3 equivalents), porous (45% porosity), and moderately alkaline (pH 8.2). The saturation percentage of the soil is 26, with a very high drainage capacity, while the electrical conductivity of its saturation extract (ECe) is 1.2 dS m−1. According to the American Soil Taxonomy 21, the soil is classified as Typic Torripsamments, carbonatic and hyperthermic 22. Prior to sowing, poultry manure (Al Yahar Organic Fertilizers, UAE) was added at 40 tons per hectare (t ha−1) in the field chosen for the experiments. Four weeks after sowing, urea (nitrogen-phosphorus-potassium (NPK) content: 46-0-0) was applied at 40 kg ha−1, while NPK (20-20-20) was applied at 30 kg ha−1 eight weeks after sowing. Chemical fertilizers were applied by fertigation. The experimental plots were randomized following an augmented design 23, with each accession assigned a plot of 1 m × 1 m. The distance between rows and between plants within a row was 25 cm.
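As a worked illustration of the mineral fertilizer rates given above, the following minimal sketch converts an application rate and a fertilizer grade into nitrogen supplied per hectare, and into fertilizer mass per 1 m × 1 m plot. The function names are illustrative and not part of the original protocol, and the nitrogen contributed by the poultry manure is not included because its nutrient content is not reported.

```python
# Nutrient supplied = application rate x nutrient fraction of the fertilizer grade.
# Function names are illustrative; only the rates and grades come from the text.

HA_TO_M2 = 10_000  # 1 hectare = 10,000 m^2


def nitrogen_kg_per_ha(rate_kg_per_ha: float, n_percent: float) -> float:
    """Nitrogen (kg per hectare) delivered by a fertilizer with n_percent N."""
    return rate_kg_per_ha * n_percent / 100.0


def fertilizer_g_per_plot(rate_kg_per_ha: float, plot_area_m2: float = 1.0) -> float:
    """Fertilizer mass (g) that a per-hectare rate implies for one plot."""
    return rate_kg_per_ha * 1000.0 * plot_area_m2 / HA_TO_M2


urea_n = nitrogen_kg_per_ha(40, 46)  # urea (46-0-0) at 40 kg per hectare
npk_n = nitrogen_kg_per_ha(30, 20)   # NPK (20-20-20) at 30 kg per hectare

print(f"N from urea: {urea_n:.1f} kg/ha")
print(f"N from NPK:  {npk_n:.1f} kg/ha")
print(f"Urea per 1 m x 1 m plot: {fertilizer_g_per_plot(40):.1f} g")
print(f"NPK per 1 m x 1 m plot:  {fertilizer_g_per_plot(30):.1f} g")
```

Only nitrogen is computed here because fertilizer grades conventionally state phosphorus and potassium as oxide equivalents (P2O5 and K2O), so elemental P and K would need an additional conversion.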
Irrigation system. A drip irrigation system with drippers spaced 25 cm apart was used for the experiment; it was part of a SCADA (Supervisory Control and Data Acquisition) system. Irrigation was provided twice a day for 10 min each time, using about 13.3 L of water per plot per day. Data on relative humidity, temperature, and rainfall at the experimental site were recorded by the meteorological station at ICBA (Supplementary Table 4).
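For readers who wish to compare this irrigation regime with rainfall figures, the daily volume can be expressed as an equivalent water depth, since 1 L applied to 1 m² corresponds to a 1 mm layer. The sketch below is an illustrative calculation only; the function names are hypothetical, and it assumes the 13.3 L is distributed evenly over the 1 m × 1 m plot and over the two 10-min irrigation events.

```python
# Illustrative conversion of the stated irrigation volume into depth and flow.
# Assumes even distribution over the plot and over both daily irrigation events.


def irrigation_depth_mm(litres: float, area_m2: float) -> float:
    """Equivalent water depth in mm (1 L on 1 m^2 forms a 1 mm layer)."""
    return litres / area_m2


def mean_flow_l_per_min(litres_per_day: float, events_per_day: int,
                        minutes_per_event: float) -> float:
    """Average flow to one plot while the irrigation system is running."""
    return litres_per_day / (events_per_day * minutes_per_event)


depth = irrigation_depth_mm(13.3, 1.0 * 1.0)  # 1 m x 1 m plot
flow = mean_flow_l_per_min(13.3, 2, 10)       # twice a day, 10 min each

print(f"Daily irrigation depth: {depth:.1f} mm")
print(f"Mean flow per plot:     {flow:.3f} L/min")
```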

Data collection. Eleven different morphological traits were recorded to assess the variation among the quinoa genotypes (including two species related to quinoa). Days to flowering were recorded when about 50% of the plants were flowering (Supplementary Table 1 19). Data on plant height, number of primary branches, number of panicles, main panicle length, plant dry weight, and seed weight were collected after plant maturity. For dry weight measurements, plants were kept in an electric drying oven (Model-PF 30, Carbolite, United Kingdom) at 40 °C for 48 hours.

Extraction of lipids and polar metabolites. The extraction protocol was adapted and modified from Giavalisco et al. 24. Metabolites were extracted from the quinoa seeds using a methyl-tert-butyl ether (MTBE)/methanol/water solvent system. Equal volumes of the lipid and polar fractions were dried in a centrifugal evaporator and stored at −20 °C until processed further.
LC-MS metabolomics. Polar and semipolar metabolites: After extraction, the dried aqueous phase was measured using ultra-performance liquid chromatography coupled to a Q-Exactive mass spectrometer (Thermo Fisher Scientific) in positive and negative ionization modes, as described 24. Samples were run in ten consecutive sets of 50 samples and one set of 10 samples. Lipids: After extraction, the dried organic phase was measured using ultra-performance liquid chromatography coupled to a Q-Exactive mass spectrometer (Thermo Fisher Scientific) in positive mode, as described 24. Samples were run in ten consecutive sets of 50 samples and one set of 10 samples.
Data pre-processing: LC-MS metabolite data. Expressionist Refiner MS 12.0 (Genedata, Basel, Switzerland) was used for processing the LC-MS data (https://www.genedata.com/products/expressionist). Repetition was used to reduce the volume of data and to speed up processing. All types of data except Primary MS Centroid Data were removed using Data Sweep. Chemical Noise Subtraction activity was used to remove artefacts caused by chemical contamination. Snapshots of chromatograms were saved for further processing. Further processing of chromatogram snapshots was performed as follows: chromatogram alignment (retention time (RT) search interval 0.5 min), peak detection (minimum peak size 0.03 min, gap/peak ratio 50%, smoothing window 5 points, centre computation by intensity-weighted method with intensity threshold at 70%, boundary determination using inflection points), isotope clustering (RT tolerance at 0.02 min, m/z tolerance 5 ppm, allowed charges 1-4), filtering for a single peak not assigned to an isotope cluster, charge and adduct grouping (RT tolerance 0.02 min, m/z tolerance 5 ppm). A detailed description of the software usage and possible settings was published before 25.
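To make the isotope clustering tolerances concrete, the following simplified sketch tests whether a candidate peak lies at the expected 13C isotope spacing from a monoisotopic peak, within 5 ppm in m/z and 0.02 min in RT, for charges 1-4. It is an illustration of the tolerance logic only and not a reproduction of the algorithm implemented in Expressionist Refiner MS; all function names are hypothetical.

```python
# Simplified illustration of the isotope clustering tolerances (not the
# Genedata implementation). A peak is represented as a tuple (m/z, RT in min).

C13_DELTA = 1.0033548  # mass difference between 13C and 12C in Da


def ppm_error(observed_mz: float, expected_mz: float) -> float:
    """Relative m/z deviation in parts per million."""
    return (observed_mz - expected_mz) / expected_mz * 1e6


def is_isotope_pair(mono, candidate, charge,
                    mz_tol_ppm=5.0, rt_tol_min=0.02) -> bool:
    """True if candidate sits one 13C spacing above mono for this charge."""
    mono_mz, mono_rt = mono
    cand_mz, cand_rt = candidate
    expected_mz = mono_mz + C13_DELTA / charge
    return (abs(ppm_error(cand_mz, expected_mz)) <= mz_tol_ppm
            and abs(cand_rt - mono_rt) <= rt_tol_min)


def assign_charge(mono, candidate, charges=range(1, 5)):
    """Return the first charge in 1-4 consistent with the spacing, else None."""
    for z in charges:
        if is_isotope_pair(mono, candidate, z):
            return z
    return None


# Hypothetical peaks for demonstration only.
print(assign_charge((500.0000, 10.00), (501.0034, 10.01)))
print(assign_charge((500.0000, 10.00), (500.5017, 10.00)))
```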
An MPI-MP in-house reference library was used to identify molecular features allowing 0.005 Da mass deviation and dynamic retention time deviation (maximum 0.2 min). Processing of fractionated samples resulted in annotation of 400 compounds (Supplementary Table 2) 19 .
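The library matching step can be sketched as follows, assuming a fixed maximum RT deviation of 0.2 min (the dynamic adjustment of the RT window mentioned above is not modelled) and an absolute mass tolerance of 0.005 Da. The `LibraryEntry` type, the `annotate` function, and the example entries are hypothetical and do not reflect the structure or content of the in-house library.

```python
# Minimal sketch of library-based annotation by m/z and retention time.
# Library entries below are made-up placeholders, not real reference data.
from typing import List, NamedTuple, Optional


class LibraryEntry(NamedTuple):
    name: str
    mz: float
    rt_min: float


def annotate(feature_mz: float, feature_rt: float, library: List[LibraryEntry],
             mz_tol_da: float = 0.005, rt_tol_min: float = 0.2) -> Optional[str]:
    """Return the closest library entry within both tolerances, else None."""
    candidates = [
        entry for entry in library
        if abs(entry.mz - feature_mz) <= mz_tol_da
        and abs(entry.rt_min - feature_rt) <= rt_tol_min
    ]
    if not candidates:
        return None
    # Prefer the smallest mass deviation, then the smallest RT deviation.
    best = min(candidates,
               key=lambda e: (abs(e.mz - feature_mz), abs(e.rt_min - feature_rt)))
    return best.name


library = [
    LibraryEntry("compound_A", 300.1000, 5.00),
    LibraryEntry("compound_B", 300.1040, 5.10),
]
print(annotate(300.1030, 5.05, library))
print(annotate(300.1100, 5.00, library))
```

Ranking by mass deviation first is a design choice made for this sketch; the source does not state how ties between candidate annotations were resolved.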
Saponin and ecdysteroid annotation was based on the fragmentation behavior of the parent ion characteristic for the positive mode, and the mass of the main adduct measured in the negative mode.
Data processing. Data represent normalised intensities of the main adduct measured in either the positive or negative mode. Normalisation was done to the median of a given metabolite calculated across a set.
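A minimal sketch of this normalisation is given below. It assumes that "a set" refers to one of the consecutive measurement sets described under LC-MS metabolomics and that normalisation means dividing each intensity by the median of the same metabolite within that set; the function name and the example values are hypothetical.

```python
# Per-set median normalisation for a single metabolite (illustrative sketch).
# Assumes normalisation = intensity / median intensity within the same set.
from collections import defaultdict
from statistics import median
from typing import Hashable, List, Sequence


def normalise_to_set_median(intensities: Sequence[float],
                            set_ids: Sequence[Hashable]) -> List[float]:
    """Divide each intensity by the median of its measurement set."""
    if len(intensities) != len(set_ids):
        raise ValueError("intensities and set_ids must have the same length")

    by_set = defaultdict(list)
    for value, set_id in zip(intensities, set_ids):
        by_set[set_id].append(value)

    medians = {set_id: median(values) for set_id, values in by_set.items()}
    for set_id, med in medians.items():
        if med == 0:
            raise ValueError(f"median of set {set_id!r} is zero")

    return [value / medians[set_id] for value, set_id in zip(intensities, set_ids)]


# Made-up intensities of one metabolite measured in two sets.
raw = [100.0, 200.0, 300.0, 10.0, 20.0, 30.0]
sets = [1, 1, 1, 2, 2, 2]
print(normalise_to_set_median(raw, sets))
```

After this step, a value of 1 corresponds to the median intensity of that metabolite in its set, which makes samples measured in different sets comparable on a relative scale.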

Data Records
In this study, we generated for the first time a large repertoire of seed metabolome data for 471 quinoa genotypes. Morphological data of plants are presented in Supplementary Table 1. Using high-resolution mass spectrometry, we annotated 400 compounds and provide their normalized metabolite data across the genotypes. Data are presented in Excel files (Supplementary Dataset 1). For each compound, we present m/z, retention time, ion detection mode, and annotation confidence (Supplementary Table 2) 24. The data are hosted and available at figshare 19. The primary access site for raw metabolic data of the 471 samples is MetaboLights 26.

Technical Validation
To validate data reproducibility, we chose 14 genotypes representing low and high saponin contents and analyzed the saponin content of their seeds a second time (Fig. 4; Supplementary Table 5). The sums of the saponin peaks identified in the two experiments were highly correlated (Pearson correlation coefficient, 0.98), validating the metabolomics analysis.
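The validation can be illustrated with a short sketch that sums the feature intensities within the saponin elution window given in the Fig. 4 legend (8.14 to 14.00 min) and computes the Pearson correlation coefficient between two measurement rounds. The function names and the example numbers are hypothetical and are not taken from Supplementary Table 5.

```python
# Illustrative reproducibility check: saponin proxy per sample, then Pearson r
# between two measurement rounds. Example values are made up.
import math
from typing import Iterable, Sequence, Tuple


def saponin_proxy(features: Iterable[Tuple[float, float]],
                  rt_min: float = 8.14, rt_max: float = 14.00) -> float:
    """Sum intensities of (RT, intensity) features inside the elution window."""
    return sum(intensity for rt, intensity in features if rt_min <= rt <= rt_max)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# One hypothetical sample: only features eluting in the window are summed.
sample = [(5.0, 10.0), (9.0, 20.0), (13.5, 30.0), (15.0, 40.0)]
print(saponin_proxy(sample))

# Hypothetical saponin proxies for the same genotypes in two experiments.
round_1 = [1.0, 2.0, 8.0, 9.5]
round_2 = [1.2, 1.9, 7.6, 10.1]
print(round(pearson_r(round_1, round_2), 3))
```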

Usage Notes
As mentioned above, the profiling data revealed considerable differences in metabolite abundances across genotypes. The data may be used to calculate the fold-change of certain metabolites between selected genotypes. In some cases, the composition of metabolites may influence product quality, as is known, for example, for the Maillard reaction in bread making 27. Hence, this dataset may be used in breeding programs to select specific genotypes with desirable metabolite profiles that may benefit product quality. Furthermore, in combination with the availability of genome sequences, the data can be used for functional genomics and metabolite-based genome-wide association studies (mGWAS) to dissect the genetic basis of quinoa seed metabolism. The information on metabolite presence and quantity may also be used as a basis to design molecular markers to characterize responses to abiotic stresses. The data set is useful in genetic and correlation studies to investigate the relationship between metabolic diversity and geographical distribution, and its integration with physiological and phenotypic diversity.

Fig. 4 Seeds of 14 quinoa genotypes characterized by the lowest and highest saponin content. Metabolites were extracted and measured twice to assess the reproducibility of the metabolic profiling. Data represent the sum of all metabolic features detected in the positive mode in the retention time window between 8.14 and 14.00 min, which corresponds to the saponin elution and is used here as a proxy for total saponin content.

The