The water lily genome and the early evolution of flowering plants

Water lilies belong to the angiosperm order Nymphaeales. Amborellales, Nymphaeales and Austrobaileyales together form the so-called ANA-grade of angiosperms, which are extant representatives of lineages that diverged the earliest from the lineage leading to the extant mesangiosperms 1–3 . Here we report the 409-megabase genome sequence of the blue-petal water lily (Nymphaea colorata). Our phylogenomic analyses support Amborellales and Nymphaeales as successive sister lineages to all other extant angiosperms. The N. colorata genome and 19 other water lily transcriptomes reveal a Nymphaealean whole-genome duplication event, which is shared by Nymphaeaceae and possibly Cabombaceae. Among the genes retained from this whole-genome duplication are homologues of genes that regulate flowering transition and flower development. The broad expression of homologues of floral ABCE genes in N. colorata might support a similarly broadly active ancestral ABCE model of floral organ determination in early angiosperms. Water lilies have evolved attractive floral scents and colours, which are features shared with mesangiosperms, and we identified their putative biosynthetic genes in N. colorata. The chemical compounds and biosynthetic genes behind floral scents suggest that they have evolved in parallel to those in mesangiosperms. Because of its unique phylogenetic position, the N. colorata genome sheds light on the early evolution of angiosperms.

Many water lily species, particularly from Nymphaea (Nymphaeaceae), have large and showy flowers and belong to the angiosperms (also called flowering plants). Their aesthetic beauty has captivated notable artists such as the French impressionist Claude Monet. Water lily flowers have limited differentiation in perianths (outer floral organs), but they possess both male and female organs and have diverse scents and colours, similar to many mesangiosperms (core angiosperms, including eudicots, monocots, and magnoliids) (Supplementary Note 1). In addition, some water lilies have short life cycles and enormous numbers of seeds 4 , which increase their potential as a model plant to represent the ANA-grade of angiosperms and to study early evolutionary events within the angiosperms. In particular, N. colorata Peter has a relatively small genome size (2n = 28 and approximately 400 Mb) and blue petals that make it popular in breeding programs (Supplementary Note 1).
We report here the genome sequence of N. colorata, obtained using PacBio RSII single-molecule real-time (SMRT) sequencing technology. The genome was assembled into 1,429 contigs (with a contig N50 of 2.1 Mb) and total length of 409 Mb with 804 scaffolds, 770 of which were anchored onto 14 pseudo-chromosomes (Extended Data Fig. 1 and Extended Data Table 1). Genome completeness was estimated to be 94.4% (Supplementary Note 2). We annotated 31,580 protein-coding genes and predicted repetitive elements with a collective length of 160.4 Mb, accounting for 39.2% of the genome (Supplementary Note 3).
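For readers unfamiliar with the statistic, contig N50 is the length L such that contigs of length at least L together cover at least half of the assembly. The following is a minimal sketch, not the assembly pipeline used here: the function name `n50` and the toy contig lengths are hypothetical, and only the repeat figures (160.4 Mb out of 409 Mb) come from the text.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together account for at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical, for illustration only).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# Repeat fraction recomputed from the figures reported in the text:
# 160.4 Mb of repetitive elements in a 409-Mb assembly.
repeat_percent = round(100 * 160.4 / 409, 1)
```

Applied to the real contig lengths, the same calculation would yield the reported contig N50; the repeat percentage recomputed this way agrees with the 39.2% stated above.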
The N. colorata genome provides an opportunity to resolve the relationships between Amborellales, Nymphaeales and all other extant angiosperms (Fig. 1a). Using six eudicots, six monocots, N. colorata and Amborella 5 , and each of three gymnosperm species (Ginkgo biloba, Picea abies and Pinus taeda) as an outgroup in turn, we identified 2,169, 1,535 and 1,515 orthologous low-copy nuclear (LCN) genes, respectively (Fig. 1b). Among the LCN gene trees inferred from nucleotide sequences using G. biloba as an outgroup, 62% (294 out of 475 trees) place Amborella as the sister lineage to all other extant angiosperms with bootstrap support greater than 80% (type II, Fig. 1c). Using P. abies or P. taeda as the outgroup, Amborella is placed as the sister lineage to the remaining angiosperms in 57% and 54% of the LCN gene trees, respectively. LCN gene trees inferred using amino acid sequences show similar phylogenetic patterns (Supplementary Note 4.1).
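The proportions quoted above are simple tallies over gene-tree topologies. As a hedged sketch of that bookkeeping (the function `topology_shares` and the pooled label "other" are our own placeholders; only the counts 294 and 475 come from the text):

```python
from collections import Counter


def topology_shares(tree_labels):
    """Return the percentage of gene trees showing each topology label."""
    counts = Counter(tree_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no gene trees supplied")
    return {label: 100 * n / total for label, n in counts.items()}


# Counts from the text (G. biloba outgroup): 294 of 475 trees are type II.
# The remaining 181 trees are pooled under a placeholder label because
# their split among the other topologies is not given in this passage.
labels = ["type II"] * 294 + ["other"] * 181
shares = topology_shares(labels)
type_ii_percent = round(shares["type II"])
```

Rounded to the nearest integer, the type II share reproduces the 62% reported for the G. biloba outgroup.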

To minimize the potential shortcomings of sparse taxon sampling 6 , we also inferred an angiosperm species tree using sequences from 44 genomes and 71 transcriptomes, including representatives of the ANA-grade, eudicots, magnoliids, monocots and a gymnosperm outgroup (Gnetum montanum, G. biloba, P. abies and P. taeda) (Methods). For further phylogenetic inference of these 115 species, we selected, . Molecular dating of angiosperm lineages, using a stringent set of 101 LCN genes and with age calibrations based on 21 fossils 7 , inferred the crown age of angiosperms at 234-263 million years ago (Ma) (Fig. 1d). The split between monocots and eudicots was estimated at 171-203 Ma and that between Nymphaeaceae and Cabombaceae at 147-185 Ma.
Genomic collinearity unveiled evidence of a whole-genome duplication (WGD) event in N. colorata (Extended Data Figs. 1f, 2a and Supplementary Note 5.1). The number of synonymous substitutions per synonymous site (Ks) distributions for N. colorata paralogues further showed a signature peak at Ks of approximately 0.9 (Fig. 2a) and peaks at similar Ks values were identified in other Nymphaeaceae species (Supplementary Note 5.2), which suggests an ancient single WGD event that is probably shared among Nymphaeaceae members. Comparison of the N. colorata paralogue Ks distribution with Ks distributions of orthologues (representing speciation events) between N. colorata and other Nymphaeales lineages, Illicium henryi, and Amborella suggests that the WGD occurred just after the divergence between Nymphaeaceae and Cabombaceae (Fig. 2a). By contrast, phylogenomic analyses of gene families that contained at least one paralogue pair from collinear regions of N. colorata suggest that the WGD is shared between Nymphaeaceae and Cabombaceae (Fig. 2b, Supplementary Note 5.4). If true, Cabomba caroliniana seems to have retained few duplicates (Fig. 2b, c), which would explain the absence of a clear peak in the C. caroliniana paralogue Ks distribution (Supplementary Note 5.2). Absolute dating of the paralogues of N. colorata does suggest that the WGD could have occurred before or close to the divergence between Nymphaeaceae and Cabombaceae (Extended Data Fig. 2d, Supplementary Note 5.3), considering the variable substitution rates among Nymphaealean lineages (Fig. 2a, b, Extended Data Fig. 2c). An alternative interpretation of the above results could be that the WGD signatures were from an allopolyploidy event that occurred between ancestral Nymphaeaceae and Cabombaceae lineages shortly after their divergence and that gave rise to the Nymphaeaceae (but not Cabombaceae) stem lineage (Fig. 2d, Supplementary Note 5.4).
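To illustrate the logic of comparing paralogue and orthologue peaks, the sketch below locates the modal bin of a distribution of synonymous-substitution values and places a duplication relative to a speciation event. It is a simplification under stated assumptions, not the analysis performed here: the histogram mode stands in for whatever density or mixture modelling a real study would use, the bin width and cut-offs are arbitrary, the comparison assumes equal substitution rates across lineages (which the text notes do vary), and all values and function names are hypothetical.

```python
from collections import Counter


def ks_peak(ks_values, bin_width=0.2, ks_max=3.0):
    """Return the midpoint of the most populated bin of the distribution.

    Values outside (0, ks_max] are dropped to exclude near-identical
    copies and saturated pairs (both cut-offs are assumptions).
    Ties are resolved in favour of the lower (younger) bin.
    """
    bins = Counter(
        int(ks / bin_width) for ks in ks_values if 0 < ks <= ks_max
    )
    if not bins:
        raise ValueError("no values inside the accepted range")
    best_bin = min(bins, key=lambda b: (-bins[b], b))
    return round((best_bin + 0.5) * bin_width, 3)


def wgd_relative_to_split(paralogue_peak, orthologue_peak):
    """Place a WGD relative to a speciation event, assuming equal rates."""
    if paralogue_peak < orthologue_peak:
        return "after split"
    if paralogue_peak > orthologue_peak:
        return "before split"
    return "unresolved"


# Toy values (hypothetical, not taken from the N. colorata data).
toy_paralogue_ks = [0.12, 0.35, 0.82, 0.84, 0.86, 0.87, 0.88, 0.93, 1.45, 2.13]
toy_orthologue_ks = [0.95, 1.05, 1.08, 1.11, 1.14, 1.17, 1.32]

paralogue_peak = ks_peak(toy_paralogue_ks)
orthologue_peak = ks_peak(toy_orthologue_ks)
placement = wgd_relative_to_split(paralogue_peak, orthologue_peak)
```

Because the placement depends on the equal-rate assumption, a result of this kind is best read alongside gene-tree and absolute-dating evidence, as is done in the paragraph above.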
The water lily lineage descended from one of the early divergences among angiosperms, before the radiation of mesangiosperms. Thus, this group offers a unique window into the early evolution of angiosperms, particularly that of the flower. We identified 70 MADS-box genes, including homologues of the genes for the ABCE model of floral organ identities: AP1 (and also FUL) and AGL6 (A function for sepals and petals), AP3 and PI (B function for petals and stamens), AG (C function for stamens and carpels), and SEP1 (E function for interacting with ABC function proteins). Phylogenetic and collinearity analyses of the MADS-box genes and their genomic neighbourhood indicate that an ancient tandem duplication before the divergence of seed plants gave birth to the ancestors of A function (FUL) and E function genes (SEP) (Extended Data Fig. 3, Supplementary Note 6.1). Also, owing to the Nymphaealean WGD, N. colorata has two paralogues, AGa and AGb, of the C-function gene AG (Extended Data Fig. 4). Similarly, the Nymphaealean WGD-derived duplicates are homologous to other genes associated with development of carpels and stamens 8 , and to genes that regulate flowering time 9 and auxin-controlled circadian opening and closure of the flower 10 . The expression profiles of N. colorata ABCE homologues largely agree with their putative ascribed roles in floral organ patterning (Fig. 3a).

Notably, the N. colorata AGL6 homologue is mainly expressed in sepals and petals, whereas the FUL homologue is mainly expressed in carpels, suggesting that AGL6 acts as an A-function gene in N. colorata. The two C-function homologues AGa and AGb are highly expressed in stamens and carpels, respectively, whereas AGb is also expressed in sepals and petals, suggesting that they might have undergone subfunctionalization and possibly neofunctionalization for flower development after the Nymphaealean WGD. Furthermore, the ABCE homologues in N. colorata generally exhibit wider ranges of expression in floral organs than their counterparts in eudicot model systems (Fig. 3b). This wider expression pattern, in combination with broader expression of at least some ABCE genes in some eudicots representing an early-diverging lineage 11 , some monocots 12 and magnoliids 13 , suggests an ancient ABCE model for flower development, with subsequent canalization of gene expression and function regulated by the more specialized ABCE genes during the evolution of mesangiosperms, especially core eudicots 8 . This could also account for the limited differentiation between sepals and petals in Nymphaeales species, and is consistent with a single type of perianth organ proposed in an ancestral angiosperm flower 14 .
Floral scent serves as olfactory cues for insect pollinators 15 . Whereas Amborella flowers are scentless 16 , N. colorata flowers release 11 different volatile compounds, including terpenoids (sesquiterpenes), fatty-acid derivatives (methyl decanoate) and benzenoids (Fig. 4a). The N. colorata genome contains 92 putative terpene synthase (TPS) genes, which are ascribed to four previously recognized TPS subfamilies in angiosperms: TPS-b, TPS-c, TPS-e/f and TPS-g (Fig. 4b), but none was found for TPS-a, which is responsible for sesquiterpene biosynthesis in mesangiosperms 17 . Notably, TPS-b contains more than 80 genes in N. colorata; NC11G0123420 is highly expressed in flowers (Extended Data Fig. 7); this result suggests that it may be a candidate gene for sesquiterpene biosynthase in N. colorata. Also, methyl decanoate has not been detected as a volatile compound in monocots and eudicots 18 and is thought to be synthesized in N. colorata by the SABATH family of methyltransferases 19 . The N. colorata genome contains 13 SABATH homologues and 12 of them form a Nymphaeales-specific group (Supplementary Fig. 41). Among these 12 members, NC11G0120830 showed the highest expression in petals (Fig. 4c) and its corresponding recombinant protein was demonstrated to be a fatty acid methyltransferase that had the highest activity with decanoic acid as the substrate (Fig. 4d, Supplementary Note 7.1). These results suggest that the floral scent biosynthesis in N. colorata has been accomplished through enzymatic functions that have evolved independently from those in mesangiosperms (Fig. 4e).
Nymphaea colorata is valued for the aesthetically attractive blue colour of petals, which is a rare trait in ornamentals. To understand the molecular basis of the blue colour, we identified delphinidin 3′-O-(2″-O-galloyl-6″-O-acetyl-β-galactopyranoside) as the main blue anthocyanidin pigment (Extended Data Fig. 8a-c). By comparing the expression profiles between two N. colorata cultivars with white and blue petals for genes in a reconstructed anthocyanidin biosynthesis pathway, we found genes for an anthocyanidin synthase and a delphinidin-modification enzyme, the expression of which was significantly higher in blue petals than in white petals (Extended Data Fig. 8d, e). These two enzymes catalyse the last two steps of anthocyanidin biosynthesis and are therefore key enzymes specialized in blue pigment biosynthesis 20,21 (Supplementary Note 7.2).
Water lilies have a global distribution that includes cold regions (northern China and northern Canada), unlike the other ANA-grade angiosperms Amborella (Pacific Islands) and Austrobaileyales (temperate and tropical regions). We detected marked expansions of genes related to immunity and stress responses in N. colorata, including genes encoding nucleotide-binding leucine-rich repeat (NLR) proteins, protein kinases and WRKY transcription factors, compared with those in Amborella and some mesangiosperms (Extended Data Fig. 9, Supplementary Note 8). It is possible that increased numbers of these genes enabled water lilies to adapt to various ecological habitats globally.
In conclusion, the N. colorata genome offers a reference for comparative genomics and for resolving the deep phylogenetic relationships among the ANA-grade and mesangiosperms. It has also revealed a WGD specific to Nymphaeales, and provides insights into the early evolution of angiosperms on key innovations such as flower development and floral scent and colour.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-019-1852-5.

Fig. 4 | d, Relative activity of Escherichia coli-expressed NC11G0120830 with six fatty acids as substrates, with the activity on decanoic acid set at 1.0. Data are mean ± s.d. of three independent measurements. e, The presence (+) and absence (−) of sesquiterpenes and methyl decanoate as floral scent compounds and their respective biosynthetic genes in four major lineages of angiosperms when known. DAMT, decanoic acid methyltransferase.
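Relative activities of the kind reported for NC11G0120830 (activity on decanoic acid set at 1.0, mean ± s.d. of three measurements) can be computed as in the sketch below. This is an illustration only: the replicate values and the second substrate label are hypothetical placeholders, and the use of the sample standard deviation is an assumption, as the estimator is not stated here.

```python
from statistics import mean, stdev


def relative_activity(replicates, reference):
    """Scale per-substrate mean and s.d. so the reference mean equals 1.0.

    `replicates` maps a substrate name to its raw replicate measurements.
    Returns {substrate: (relative_mean, relative_sd)}.
    """
    ref_mean = mean(replicates[reference])
    if ref_mean == 0:
        raise ValueError("reference activity must be non-zero")
    return {
        substrate: (mean(values) / ref_mean, stdev(values) / ref_mean)
        for substrate, values in replicates.items()
    }


# Hypothetical raw activities in arbitrary units (three replicates each).
toy_measurements = {
    "decanoic acid": [9.0, 10.0, 11.0],
    "other fatty acid (placeholder)": [4.0, 5.0, 6.0],
}
scaled = relative_activity(toy_measurements, reference="decanoic acid")
```

Dividing both the mean and the standard deviation by the reference mean keeps the error bars on the same scale as the normalized activities.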