RNA sequencing provides insights into the evolution of lettuce and the regulation of flavonoid biosynthesis

Different horticultural types of lettuce exhibit tremendous morphological variation. However, the molecular basis for domestication and divergence among the different horticultural types of lettuce remains unknown. Here, we report the RNA sequencing of 240 lettuce accessions sampled from the major horticultural types and wild relatives, generating 1.1 million single-nucleotide polymorphisms (SNPs). Demographic modeling indicates that there was a single domestication event for lettuce. We identify a list of regions as putative selective sweeps that occurred during domestication and divergence, respectively. Genome-wide association studies (GWAS) identify 5311 expression quantitative trait loci (eQTL) regulating the expression of 4105 genes, including nine eQTLs regulating genes associated with flavonoid biosynthesis. GWAS for leaf color detects six candidate loci responsible for the variation of anthocyanins in lettuce leaves. Our study provides a comprehensive understanding of the domestication and the accumulation of anthocyanins in lettuce, and will facilitate the breeding of cultivars with improved nutritional value.


Supplementary Figure 3. Characteristics of identified SNPs
The imputation accuracy was measured as the ratio of the correctly inferred genotypes to all of the masked genotypes.
The filling rate was calculated as the proportion of inferred genotypes in all of the masked genotypes. The number of SNPs was counted under the certain missing genotype rate. For the missing rate that ranged from 10% to 90%, the best imputation accuracy and filling rate were chosen separately after testing 320 combinations of parameters.

Supplementary Figure 9. Schematic diagram of two different demographic models for two-population
The two models were designed as follows: model 1, the cultivated group was originated directly from L. serriola; model 2, the cultivated group was originated from an ancestral cultivated group (un-sampled population in the model), and the ancestral cultivated group was originated directly from L. serriola (backwards in time). All of the populations were assumed to undergo demographic bottlenecks, based on the comparisons of different demographic models for one-population. These models only indicate relative differences. LG1 LG2 LG3 LG4 LG5 LG6 LG7 LG8 LG9 The regulatory network for flavonoid biosynthesis was inferred using the iGA and co-expression data. Each node is a gene. Each edge is a predicted regulatory connection between genes. Yellow nodes represent the identified four candidate regulatory genes. Blue nodes represent the target genes or other possible genes identified by co-expression analysis. Red arrows and blue bars have evidence of eQTL support, which represent as positive and negative regulation, respectively. LG2_207968 LG9_794051 LG4_413424 LG5_441158 LG3_295970 LG1_112575   LG2_196372 0.93 Dihydroflavonol reductase (DFR)

M6_left
LG5 GO and protein domain enrichment analysis were used to further assess functional features of these affected genes. GO analysis detected enrichment in biological processes such as defense response and protein phosphorylation (Supplementary Table 2), and Pfam analysis revealed that genes encoding disease-related proteins and kinase were significantly overrepresented (Supplementary Table 3), which were consistent with previous findings 5,6 .

Supplementary Note 2. Imputation of missing genotypes
The fillGenotype software 7 was used to impute the missing data for further analysis. The accessions of L. saligna and L. virosa were excluded from imputation because they exhibited many species' specific SNPs that are absent in either L. sativa or L. serriola. Consequently, 217 accessions were used to perform SNP imputation. By randomly masking 1% of the SNP sites, a simulation was performed to determine the imputation accuracy and the filling rate (Supplementary Fig. 5). The optimal imputation accuracy (97.95%) and filling rate (97.90%) were achieved when the missing data rate cutoff value was set at 0.8 and the following values were used for the fillGenotype parameters: w = 20, p = -11, k = 9 and r = 0.5. A total of 344,222 SNPs with missing rates ≤ 0.8 were filled using this imputation method.

Supplementary Note 3. Demographic model selection for one-, twoand four-population models
We tested one-, two-and four-population models to help guide the development of demographic models for lettuce. For the one-population model, six different demographic models were tested (Supplementary Fig. 8). A three-epoch model with a population bottleneck of some duration followed by a recovery (model 5 in Supplementary Fig. 8) was the best-fitting model for all five populations (Supplementary Table 5). Based on these results, we suggest that all populations have experienced bottlenecks in their history.
In the two-population models, each horticultural type was combined with L. serriola as the source population (Supplementary Fig. 9). Model 2 had the best support for all horticultural types (w i ≈ 1, Supplementary Table 6). This model is consistent with the idea that there were two evolutionary stages for each horticultural type, the domestication stage (leading to primitive cultivated lettuce) and the improvement stage (leading to modern cultivated lettuce).
Then, four-population models were analyzed for L. serriola and three leafy horticultural types (butterhead, crisphead and romaine). Stem lettuce was excluded from the models, based on the finding that it is genetically the most distinct group within cultivated lettuce. Twelve models (Supplementary Fig. 10) were tested for different topologies within three horticultural types, and model 5 provided the best support for the data (w i ≈ 1, Supplementary Table 7). This model suggested that these three horticultural types developed independently from an ancestral cultivated population.
Metabolic processes influencing nutrition and food functionality were also shown to be under selection during domestication. Lettuce is a source of vitamins (vitamin A, B1, B6, C, E, etc.) and minerals (potassium, etc.) for human 28 . Genes related to these metabolites were found under selection, such as TPK1 (Thiamin pyrophosphokinase 1, LG9_786996) 29 , PDX1.2 (Pyridoxine biosynthesis 1.2, LG5_427483) 30 , VTC2 we identified several genes that may be associated with leaf-heading in crisphead type.
In a recent study, it is revealed that genes involved in establishing adaxial-abaxial polarities of leaf primordia are responsible for leaf-heading in Brassica rapa and Brassica oleracea 37,38 . We speculated that the same molecular mechanism may also control leaf-heading phenotype in crisphead type. The homologue of ATHB15 (Arabidopsis thaliana homeobox 15, LG1_164856) which is a member of the class III HD-ZIP protein family to specify abaxial cell fate 37 was found to be under selection. It is known that the genetic control system for leaf primordium-SAM boundary formation has an important function in genetic network for leaf polarities 39 . Genes associated with the polarity regulation network were selected in crisphead type, such as BOP2 (Blade on petiole 2, LG5_523032), which positively regulates the expression of the AS2 gene on the adaxial sides of leaf primordium bases 39 . Genes that are important to leaf development were also identified in our analysis, such as CLV1 (Clavata 1, LG2_207770), specifying and maintaining shoot meristem identity 40  LG8_722826) 42 . Secondary cell wall is vital for plant growth, typically in stem tissue for protection, structural support, as well as water and nutrients transport 43 . Genes involved in secondary wall biosynthesis were also found to be under selection, including three cellulose synthase genes: CESA1 (Cellulose synthase 1, LG5_453514) 44 , CSLA2 (Cellulose synthase-like A2, LG6_568914) 43 and IRX6 (Irregular xylem 6, LG4_374430) 43 ; two xylan biosynthetic genes: IRX14 (Irregular xylem 14, LG2_209670) 43 and GLZ1 (Gaolaozhuangren 1, LG4_406605) 43 ; and one gene required for lignin biosynthesis: C4H (Cinnamate 4-hydroxylase, LG8_701431) 43 .

Supplementary Note 6. Linkage analysis of four loci associated with leaf color in lettuce
Six loci were identified to control color variation of lettuce using GWAS in our study. Sometimes it is challenging to construct a population that only one of the color controlling genes is segregating. Using different crosses and sub-populations, we succeeded in analyzing four loci (locus 3 to 6 in our study) that control leaf color in lettuce. These four loci are described as follows:

Linkage analysis of locus 3
A sub-population comprising 173 individuals was found showing a segregating ratio of green (130) : red (43) = 3 : 1 (χ 2 test, P = 0.9650, χ 2 = 0.002). Further analysis showed that the trait is controlled by locus 3. Two CAPs markers from the candidate region were designed (Supplementary Table 14). Screening the segregating population showed that they co-segregated with leaf color in this population, confirming our GWAS results.

Linkage analysis of locus 4
One segregating population with 163 individuals derived from Multi-parent Advanced Generation Intercrosses (MAGIC) population was found. Among these individuals, 122 had red leaves and 41 had green leaves. χ 2 test showed that the trait is controlled by a single gene (P = 0.9639, χ 2 = 0.002). Further analysis showed that the trait is controlled by locus 4. Two CAPs markers flanking the candidate gene were designed (Supplementary Table 14). These two markers co-segregated with leaf color in this population, confirming our GWAS results.

Linkage analysis of locus 5
A F2:3 sub-population (n = 167) which was derived from a cross between PI344074 and PI536839 was found with segregation of leaf color. A ratio of red (128) : green (39) ≈ 3 : 1 (χ 2 test, P = 0.6231, χ 2 = 0.242) individuals were found in this population, suggesting a single gene controlling this trait. Further analysis showed that the trait is controlled by locus 5. Two CAPs markers flanking the candidate gene were developed (Supplementary Table 14). These two markers co-segregated with leaf color in this population, confirming our GWAS results.

Linkage analysis of locus 6
Another segregating population comprising 157 F2:3 individuals derived from a cross between PI491070 and PI536760 was found. There were 126 and 31 individuals with red and green leaves, respectively. χ 2 test showed that the trait is controlled by a single gene (P = 0.1284, χ 2 = 2.312). Further analysis showed that the trait is controlled by locus 6. Two CAPs markers flanking the candidate gene were designed (Supplementary Table 14). These two markers co-segregated with leaf color in this population, confirming our GWAS results.