Gnetophytes are an enigmatic gymnosperm lineage comprising three genera, Gnetum, Welwitschia and Ephedra, which are morphologically distinct from all other seed plants. Their distinctiveness has triggered much debate as to their origin, evolution and phylogenetic placement among seed plants. To increase our understanding of the evolution of gnetophytes, and their relation to other seed plants, we report here a high-quality draft genome sequence for Gnetum montanum, the first for any gnetophyte. By using a novel genome assembly strategy to deal with high levels of heterozygosity, we assembled >4 Gb of sequence encoding 27,491 protein-coding genes. Comparative analysis of the G. montanum genome with other gymnosperm genomes unveiled some remarkable and distinctive genomic features, such as a diverse assemblage of retrotransposons with evidence for elevated frequencies of elimination rather than accumulation, considerable differences in intron architecture, including both length distribution and proportions of (retro)transposon elements, and distinctive patterns of proliferation of functional protein domains. Furthermore, a few gene families showed Gnetum-specific copy number expansions (for example, cellulose synthase) or contractions (for example, Late Embryogenesis Abundant protein), which could be connected with Gnetum’s distinctive morphological innovations associated with its adaptation to warm, mesic environments. Overall, the G. montanum genome enables a better resolution of ancestral genomic features within seed plants, and the identification of genomic characters that distinguish Gnetum from other gymnosperms.
The seed plants today are represented by five distinct lineages: the species-rich angiosperms (flowering plants, approximately 352,000 species) and four gymnosperm lineages (which together comprise approximately 1,000 species and encompass cycads, Ginkgo biloba, conifers and gnetophytes). It is apparent from their long fossil record (dating back to the Late Devonian approximately 360 million years ago (Ma)) that considerably greater seed plant diversity existed in the past1. Nevertheless, widespread extinctions among many gymnosperm lineages mean that today’s gymnosperms are only a relic of their former diversity, and this has presented a major challenge for reconstructing evolutionary relationships between the extant lineages2. Probably the most controversial outstanding question in plant evolution is the phylogenetic position of gnetophytes3 (comprising the genera Gnetum, Welwitschia and Ephedra, Fig. 1) in relation to the other seed plant lineages. Apparent morphological similarities with angiosperms, such as vessel-like water-conducting cells, double fertilization and leaf morphologies with reticulate venation, have historically led to the proposition that gnetophytes form a group that is sister to angiosperms (termed the ‘Anthophyte hypothesis’)4,5. That hypothesis has, however, largely been rejected by molecular phylogenetic data and a deeper understanding of the developmental pathways that lead to similar morphological features. Nevertheless, the use of molecular data has also been problematic in inferring the exact phylogenetic position of gnetophytes, with topologies differing depending on the type of sequence data (for example, plastid versus nuclear genes, nucleotide versus amino acid data) and analytical approach used (for example, maximum parsimony, maximum likelihood, Bayesian, multispecies coalescent based methods)6,7,8. 
Consequently, several possible hypotheses have been put forward that place gnetophytes as sister to (1) Pinaceae (‘Gnepine’ hypothesis); (2) cupressophytes (‘Gnecup’ hypothesis); (3) all conifers (‘Gnetifer’ hypothesis); (4) all other gymnosperms; or (5) all seed plants9. Currently, the emerging consensus, based on both older and more recent studies, and recently released data from the 1KP initiative (see https://sites.google.com/a/ualberta.ca/onekp/, and Wickett et al.8), indicates that gnetophytes are sister to, or within, the conifers.
So far, the availability of whole genome sequences for gymnosperms has been limited to conifers (specifically to Pinaceae)10,11,12,13 and G. biloba14, with no whole genome assemblies available for the two remaining major seed plant lineages—cycads and gnetophytes. This deficiency, together with the conflicting phylogenetic evidence for relationships among these groups, is impeding our understanding of genome evolution across all seed plants. Here, we present a high-quality draft genome of Gnetum montanum, the first for gnetophytes. The availability of this genome, as well as survey sequence data and transcriptome data from other vascular plants (including novel data from gnetophytes Ephedra and Welwitschia), enables us to compare genomic characters with G. biloba, conifers, angiosperms and non-seed plants. Comparisons within gymnosperms, and between gymnosperms and angiosperms, highlight the unique nature of the Gnetum genome, providing new insights into patterns of genome divergence across seed plants.
Genome assembly and annotation
The genome of G. montanum (2n = 44) is small compared with other gymnosperms (flow cytometry, 4.2 Gb/1C; k-mer analysis, 4.11 Gb), and is highly heterozygous and rich in repeats (Supplementary Fig. 1a–c and Supplementary Information). To overcome problems caused by repeats and heterozygosity, we generated deep coverage (~302×, Supplementary Table 1) Illumina sequence data and applied a novel genome assembly strategy (Supplementary Information and Supplementary Fig. 2) to assemble 4.07 Gb of sequence (contig N50 size = 25.02 kb, scaffold N50 size = 475.17 kb, Supplementary Table 2), to which >99% of genome reads, >90% expressed sequence tags (ESTs) and >99% of bacterial artificial chromosomes (BACs) were mapped (Supplementary Fig. 1d,e, Supplementary Table 3 and Note 3).
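For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is conventionally computed from a list of contig or scaffold lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the assembly pipeline used for G. montanum.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (bp); NOT values from the G. montanum assembly.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
```

Applied separately to contig lengths and to scaffold lengths, the same calculation gives the two kinds of N50 value reported for an assembly.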
A total of 27,491 protein-coding genes were predicted from this assembly (Supplementary Table 4 and Supplementary Information), 97% of which were supported by orthology (>50% coverage of a high-scoring segment pair, Supplementary Fig. 3a) with existing protein sequences and/or RNA-seq data from multiple tissues (Supplementary Table 5). A BUSCO (Benchmarking Universal Single-Copy Orthologs) analysis to assess the quality of the genome and annotation completeness suggested that 81% of the genes have been recovered (Supplementary Table 6). Unlike conifer genomes, which contain numerous pseudogenes15 (for example, 8,328 in Picea abies, 13,550 in Pinus taeda), many fewer were found in the G. montanum genome (3,122, Supplementary Information). The read depth distribution across genic regions (Supplementary Fig. 3b) suggested little sequence redundancy caused by heterozygosity (see Supplementary Fig. 3c for further confirmation of gene assembly quality).
Repetitive sequence dynamics
Repetitive sequences have been shown to account for the major component of all gymnosperm genomes that have been sequenced to date11,12,13,14, with diverse and ancient transposable elements (TEs), especially LTR retrotransposons (LTR-RTs), being particularly prevalent. Overall, the repetitive element content of G. montanum was also high (85.9%) and dominated by LTR-RTs (especially gypsy-like elements), which constituted 77.4% of the genome (Supplementary Table 8 and Supplementary Information). The genome assembly of G. montanum is likely to be sufficient to represent most of the LTR-RTs, since their length is typically around 25 kb16, and 90% of the scaffolds are larger than 34 kb. Phylogenetic reconstructions of the reverse transcriptase domains of LTR-RTs in G. montanum and P. taeda revealed that most of the gypsy- and copia-like elements in G. montanum were restricted to just a few clades, representing only a small minority of the diversity encountered in P. taeda (Supplementary Fig. 4 and Supplementary Information).
Comparative analyses of repeats identified by RepeatExplorer using survey sequence data from multiple gnetophytes (G. montanum, Gnetum gnemon, Welwitschia mirabilis and Ephedra altissima) and P. taeda revealed substantial differences in the abundance of the major repeat classes (Supplementary Fig. 5a, Supplementary Table 9 and Supplementary Information). Further, the majority of individual repeat types (repeat clusters in RepeatExplorer) were shown to be species specific (containing Illumina reads from just one species, data not shown). The species-specific nature of the repeat profiles probably reflects the long estimated divergence times between species (for example, the two Gnetum species are likely to have diverged between approximately 25 Ma and 75 Ma)17,18.
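The notion of a species-specific repeat cluster used above (a cluster containing reads from just one species) can be illustrated with a short sketch. The data layout, cluster identifiers and read counts below are hypothetical, and this is not the RepeatExplorer interface, only a way of expressing the criterion.

```python
def species_specific_clusters(cluster_reads):
    """Map each cluster that contains reads from exactly one species to that species.

    cluster_reads: dict of cluster id -> dict of species name -> read count.
    """
    specific = {}
    for cluster_id, counts in cluster_reads.items():
        present = [species for species, n in counts.items() if n > 0]
        if len(present) == 1:
            specific[cluster_id] = present[0]
    return specific


# Hypothetical read counts per cluster; NOT data from the study.
toy_clusters = {
    "CL1": {"G. montanum": 1200, "G. gnemon": 0, "P. taeda": 0},
    "CL2": {"G. montanum": 300, "G. gnemon": 250, "P. taeda": 0},
    "CL3": {"G. montanum": 0, "G. gnemon": 0, "P. taeda": 900},
}
toy_specific = species_specific_clusters(toy_clusters)
```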
Previously, it was reported from conifers and G. biloba that LTR-RTs have accumulated steadily over the last approximately 25 Ma, especially between 16 and 24 Ma, a process contributing to their large genome sizes11,12,14. This interpretation is consistent with the data here, which show that most LTR-RTs in conifers are intact (solo LTR/intact LTR ratio ranged from 0.16:1 to 0.72:1, Supplementary Table 10). It is notable that the solo LTR/intact LTR ratio was substantially higher in G. montanum (~1.94:1), which, together with its small genome and similar profile of accumulation (Supplementary Fig. 5b), suggests higher frequencies of LTR-RT elimination than amplification compared with G. biloba and conifers.
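The reasoning behind the solo LTR/intact LTR ratio can be made concrete with a small sketch. Solo LTRs are the remnants left when recombination removes the internal part of an element, so a ratio above 1 means that more elements have been removed in this way than remain intact. The function names, the cut-off of 1.0 and the example counts are assumptions made for illustration; only the ratios themselves are quoted from the text.

```python
def solo_to_intact_ratio(n_solo, n_intact):
    """Ratio of solo LTRs to intact LTR retrotransposons."""
    if n_intact <= 0:
        raise ValueError("need at least one intact element")
    return n_solo / n_intact


def removal_exceeds_retention(ratio, cutoff=1.0):
    """True when solo LTRs outnumber intact elements.

    The cut-off of 1.0 is an illustrative assumption, not a threshold
    defined in the study.
    """
    return ratio > cutoff


# Ratios as quoted in the text (conifer values are the reported range).
reported_ratios = {
    "conifers (low)": 0.16,
    "conifers (high)": 0.72,
    "G. montanum": 1.94,
}
flags = {name: removal_exceeds_retention(r) for name, r in reported_ratios.items()}

# Hypothetical counts that would reproduce the G. montanum ratio.
example_ratio = solo_to_intact_ratio(194, 100)
```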
Most angiosperm genomes analysed to date have far fewer ancient repeats and less divergent LTR-RT subsets than conifers and G. biloba, presumably because of more efficient elimination and replacement processes operating within these angiosperm genomes19 (for example, in Oryza sativa the half-life of LTR-RTs is estimated to be less than five million years20, leading to ‘genome turnover’21). However, an exception to this pattern has been observed in Amborella trichopoda. The genome of this species is considered to have retained many features that were likely to have been present in the ancestral angiosperm genome22. It is notable that its repeat content13 and lower abundance of intact LTR-RTs (solo LTR/intact LTR ratio = 2.43:1; Supplementary Table 10) are similar to those observed in G. montanum. These observations suggest that neither the A. trichopoda nor the G. montanum genome has experienced recent, extensive (retro)transposon activity, although both continue to eliminate repetitive sequences. Both these species seem to differ from conifers and G. biloba with respect to the dynamics of repeat accumulation11,12,14, and from other angiosperms in terms of the levels of repeat amplification/removal.
Although intron size has been positively correlated with genome size across eukaryotes as a whole23, this trend does not translate well across broad and some narrow taxonomic distances in seed plants (Fig. 2a). Previous studies of G. biloba14 and conifers11,12 have reported larger introns than angiosperms, probably arising from the long-term, steady amplification of LTR-RTs (Fig. 2b), as also observed here, where LTR-RTs account for 51% and 59% of the large intron sequences in P. taeda and G. biloba, respectively (Fig. 2a and Supplementary Table 12). The evolution of these large introns may have arisen from similar repeat accumulation processes that are operating across the genome as a whole.
When comparing these observations with introns of G. montanum, it is apparent that its introns are substantially smaller (minimum, mean and maximum intron lengths) than those of P. taeda and G. biloba (Fig. 2a, see also the statistical test in Supplementary Table 11). In addition, the repeat composition of G. montanum’s introns is dominated by both long interspersed nuclear elements (LINEs) and LTR-RTs, rather than predominantly LTR-RTs, as in conifers and G. biloba (Fig. 2b and Supplementary Table 12). The correlation between smaller intron sizes and smaller genome size in G. montanum compared with conifers and G. biloba may reflect the repeat dynamic processes operating across its genome as a whole. In contrast, the variable length distributions of introns in angiosperms suggest that the evolution of repeats in their introns does not necessarily reflect the repeat dynamics observed across the rest of their genomes24. In the highly dynamic repetitive genome of Zea mays, the profile of repeats across the genome25 and within the whole intron set (Supplementary Fig. 6a) both suggest many recent insertions. However, in A. trichopoda, the intron sizes are larger overall, and the genome size smaller than in Z. mays (Fig. 2a,b). In addition, an analysis of introns in A. trichopoda and G. montanum highlighted a closer similarity to each other (in terms of length distributions, repeat composition and divergence) than either species has to conifers and G. biloba, despite a 4.8-fold difference in their genome sizes (Fig. 2a,b and Supplementary Table 12).
Previous comparisons of orthologous introns have led to the suggestion that the expansion of introns occurred early in the evolutionary history of conifers12. Comparisons of orthologous introns (with identical adjacent exons) between P. taeda and G. biloba showed that introns identified as being long (>6 kb) in P. taeda were also typically long in their orthologues in G. biloba, containing, in both cases, abundant LTR-RTs (both gypsy- and copia-like elements, Fig. 2c). These features were likely to have been present in their most recent common ancestor (MRCA). A similar analysis of the length and repeat content of 4,348 orthologous introns of G. montanum shared with P. taeda (Supplementary Information) highlighted notable differences. The length of exons remained similar, but a substantial fraction of orthologous genes had longer introns in P. taeda (Supplementary Fig. 6b). The introns identified as ‘short’ in P. taeda comprised approximately 4% repeats, rising to approximately 56% in ‘long’ introns, largely through the accumulation of LTR-RTs (especially copia elements) (Fig. 2d and Supplementary Table 13). In contrast, introns in G. montanum that are orthologous to the ‘long’ introns of P. taeda (36% of introns analysed) showed high proportions of LINEs. As with comparisons of all introns, pairwise comparisons of orthologous introns in G. montanum and A. trichopoda again showed some similarities in their introns, with both species having abundant LINEs (Fig. 2e). Collectively, these data reveal a different repeat dynamic within introns of G. montanum compared with the other gymnosperms.
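The two quantities used in this comparison, the ‘long’ versus ‘short’ classification of an intron and the proportion of its sequence occupied by each repeat class, can be sketched as follows. The >6 kb cut-off is the one quoted above for P. taeda. The function names, the coordinate convention (0-based, half-open, non-overlapping annotations) and the toy intron are assumptions for illustration and do not reproduce the authors' scripts.

```python
LONG_INTRON_BP = 6_000  # ">6 kb" cut-off quoted in the text for P. taeda


def is_long(intron_length):
    """Classify an intron as 'long' using the >6 kb cut-off."""
    return intron_length > LONG_INTRON_BP


def repeat_fractions(intron_length, annotations):
    """Fraction of an intron covered by each repeat class.

    annotations: iterable of (start, end, repeat_class) tuples in intron
    coordinates, assumed 0-based, half-open and non-overlapping.
    """
    if intron_length <= 0:
        raise ValueError("intron length must be positive")
    covered = {}
    for start, end, repeat_class in annotations:
        # Clip each annotation to the intron boundaries.
        start = max(start, 0)
        end = min(end, intron_length)
        if end > start:
            covered[repeat_class] = covered.get(repeat_class, 0) + (end - start)
    return {cls: bp / intron_length for cls, bp in covered.items()}


# Hypothetical 10 kb intron with two annotated repeats; NOT data from the study.
toy_length = 10_000
toy_annotations = [(0, 3_000, "LTR-RT"), (5_000, 7_000, "LINE")]
toy_fractions = repeat_fractions(toy_length, toy_annotations)
```

Averaging such per-intron fractions over the ‘short’ and ‘long’ sets is one simple way to arrive at summary proportions of the kind reported above, although the exact aggregation used in the study is described only in the Supplementary Information.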
(‘Lack of’) Whole genome duplication
All angiosperms are reported to have undergone at least one round of ancient whole genome duplication (WGD), and in many lineages WGDs are recurrent and ongoing26. In addition, a WGD event has been proposed at the base of all seed plants approximately 341 Ma (zeta WGD27), although the underlying evidence for these two ancient WGD events has been recently questioned28. In gymnosperms, WGDs have been reported for conifers, G. biloba and cycads (a likely shared WGD)14,29,30. Although recent polyploidy seems common in extant Ephedra31, evidence for ancient WGDs in gnetophytes is missing (Supplementary Information and Supplementary Fig. 7), except for a WGD in Welwitschia which is likely to have occurred after the divergence of its lineage from that leading to Ephedra (Supplementary Fig. 7)29. If indeed the ancient zeta WGD is shared by all seed plants, the absence of evidence for this event in gnetophytes is best explained by their faster rates of gene evolution than other gymnosperms32,33, erasing all evidence of this more than 300 million year old event (Supplementary Information and Supplementary Fig. 7).
Organization of functional protein domains
To characterize the patterns of functional diversification in gene domains across land plants, we used principal component analysis (PCA) to analyse the number of pfam domains (conserved protein domains) in multiple species (Supplementary Information and Supplementary Table 13). Our approach showed that angiosperms formed a discrete cluster that was separate from the gymnosperms (Fig. 3a), with G. montanum being an outlier. Indeed, heatmaps compiled from the pfam data that contributed most (top 10%) to PCA1 and PCA2 showed that G. montanum formed a clade with the lycophyte Selaginella moellendorffii and the moss Physcomitrella patens (Fig. 3b), but the non-gnetophyte gymnosperms formed a separate clade (Fig. 3b).
Given the distinct distributions of G. montanum, non-gnetophyte gymnosperms and angiosperms in the PCA, the data suggest that significant functional diversification of the conserved protein domains has occurred since these major lineages split. It may be surprising, given the long divergence times (approximately 300 Ma)2, that G. biloba and conifers retain similar conserved domain organizations (with similar eigenvector values). This could reflect their relatively low substitution rates (on average seven times lower) compared with angiosperms33.
The pfam domain expansions that contributed most to the PCA1 and PCA2 distributions among angiosperms (except A. trichopoda) included genes associated with flower and organ development (Supplementary Table 15). In contrast, non-gnetophyte gymnosperms showed large-scale specific expansions of pfam domains in genes associated with defence and secondary metabolism, as previously suggested (Supplementary Table 16)10,11. The clustering of G. montanum with non-seed plants in the heatmap (Fig. 3b) was a surprise, and may indicate the approach has identified proteins that have diverged very little since the MRCA of seed plants. Nevertheless, such an explanation is at odds with the hypothesis that the genes of gnetophytes have diverged rapidly, given their comparatively high substitution rate compared with other gymnosperms33.
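The PCA of pfam domain counts discussed in this section can be sketched in a few lines. The sketch below centres a species-by-domain count matrix and extracts the first principal component by power iteration on the covariance matrix. This is a textbook simplification, not the procedure used in the study (any scaling or normalization applied by the authors is described only in the Supplementary Information), and the species labels and counts are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first principal component.

    rows: list of equal-length lists, one per species, holding counts
    per domain. Uses power iteration on the sample covariance matrix;
    this fails if the start vector happens to be orthogonal to the
    leading eigenvector, which is not the case for the toy data below.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Project each centred row onto the component to obtain its score.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centred]
    return vec, scores


# Hypothetical counts for three domains in four species; NOT data from the study.
toy_species = ["sp_A", "sp_B", "sp_C", "sp_D"]
toy_counts = [[10, 2, 1], [12, 3, 1], [2, 9, 1], [3, 11, 1]]
loadings, scores = first_principal_component(toy_counts)
```

In this toy matrix the first two species have many copies of the first domain and the last two have many copies of the second, so they fall on opposite sides of the first component, while the third domain, whose count never varies, receives a loading of zero. Domains with the largest absolute loadings are those that contribute most to separating the groups, which is the sense in which particular domains are said to contribute most to PCA1 and PCA2.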
Growth form (shrubs and lianas) and leaf morphology
Gnetophytes differ from other extant gymnosperms in growth form, with the unusual and distinct form of Welwitschia, the shrub habit of Ephedra and the shrub and liana habit and specialized leaf morphologies of Gnetum34. Cellulose synthase (CesA) and cellulose synthase-like (Csl) genes are considered to play a role in influencing the biomechanical properties of the cell35, hence potentially the distinctive growth forms of gnetophytes are associated with the divergence of these genes. To explore this hypothesis, CesA and Csl family members were examined in G. montanum and compared with those in other seed plants. The total number of CesA and Csl family members ranged about threefold among the seed plants analysed (P. abies, P. taeda, A. trichopoda, A. thaliana and O. sativa). However, only G. montanum showed a large expansion of the CslB/H gene subfamily (to 20 genes, Supplementary Table 17), involving tandem duplications (Supplementary Fig. 9), and accounting for two-thirds of its total Csl gene repertoire. Furthermore, transcriptome analysis showed that these CslB/H genes were differentially expressed in leaves, stems and roots of G. montanum, supporting an association with distinct growth forms and leaf morphologies (Supplementary Fig. 9). In contrast, all other species analysed, including Welwitschia and Ephedra, were seen to have only one to six CslB/H genes (at least based on transcriptome analysis) (Supplementary Information, Supplementary Table 16 and Supplementary Fig. 8).
Another gene family associated with leaf morphology and development is the WOX (WUSCHEL-related homeobox) family36. Recent studies have shown that the conserved family members WOX3 and WOX4, which play a role in leaf development, show diffuse WOX3 expression at the leaf bases of Arabidopsis and Gnetum, with such patterns being associated with the distinctive reticulate venation observed in their leaves37. Two unusual paralogues, GgWOXX and GgWOXY, were previously reported to occur only in gnetophytes37, and this is confirmed here in phylogenetic reconstructions of gene family members (Supplementary Information and Supplementary Fig. 10). These paralogues are unlikely to have arisen by Gnetum-specific gene amplifications, as this would group them with other Gnetum paralogues. Alternatively, these genes may correspond to ancestral seed plant sequences that have been lost in other plant lineages. Potentially the different patterns of gene loss, retention and amplification compared with other gymnosperms may be associated with their distinctive growth forms.
The presence of vessel-like water-conducting cells, morphologically distinct from tracheids, is another feature that sets gnetophytes apart from other gymnosperms. However, there has been longstanding debate whether gnetophyte ‘vessels’ are homologous to the ‘vessels’ of angiosperms. In angiosperms, VASCULAR-RELATED NAC-DOMAIN (VND) proteins VND1-7 are members of the NAC domain class of transcription factors, VND7 being a master regulator of vessel formation in A. thaliana38, and VND1-6 being upstream regulators of VND739. Although five NAC domain genes were identified in the genome of G. montanum, no orthologues of VND7 or VND1-3 in the sister clade were identified, consistent with previous analyses of other gymnosperms12, and suggesting that these proteins are restricted to angiosperms (Supplementary Fig. 11). Nevertheless, Gnetum does share the VND4-6 clade with angiosperms and other gymnosperms. Furthermore, A. trichopoda, which lacks angiosperm vessels, also lacks orthologues of VND1-3, but it does have VND7 (Supplementary Fig. 11), indicating that the ability to form vessels may have occurred after angiosperms diverged. Taken together, these data suggest a greater dependency of vessel development on VND1-3 than is apparent from experiments on A. thaliana. The most parsimonious explanation of our data is that angiosperm vessel formation requires genes from the VND7 clade (and potentially its sister clade VND1-3), and that gymnosperms, including gnetophytes, which lack sequences from both these clades cannot form structures that are homologous to angiosperm vessels. Such an interpretation supports Carlquist’s40 morphological interpretations of vessels. It is therefore most likely that different molecular mechanisms underpin the origin and development of vessels in Gnetum and angiosperms. 
Indeed, these new molecular data support the hypothesis based on morphological studies that Gnetum vessels are actually more closely related to conifer tracheids than angiosperm vessels and that vessels in the two groups are convergent characters40.
Extant species of Gnetum are unusual among gymnosperms in being restricted to warm, mesic habitats41; this contrasts with conifers, which are adapted to cold and water-stressed environments. An analysis of genes involved in water and cold stress revealed some substantial differences between conifers and Gnetum. The late embryogenesis abundant protein (LEA) gene family encodes crucial proteins that are involved in protecting plants from desiccation or osmotic stresses associated with low temperature42,43. An analysis of LEA family members suggests that some members have been reduced in number in Gnetum or expanded in conifers (for example, LEA-3), or lost completely in Gnetum (LEA-4, 5, 6). In addition, dehydrins, which play a role in the response to cold/drought44, had only two members in G. montanum, compared with 38 in P. abies, 28 in P. taeda and 3–15 in angiosperms (Supplementary Table 19). Further analysis of the G. montanum genome also revealed relatively few gene family members of the AP2 domain containing protein families, which are involved in the cold stress response45,46, and of the glutathione peroxidase and glutathione S-transferase families, which are involved in the oxidant stress response47,48. Taken together, these data appear consistent with the hypothesis that the ecological shift to a warm, wet forest habitat is associated with a relaxation of selection pressure on genes associated with water stress and low temperature.
Here, we have described the assembly, annotation and comparative analysis of the first gnetophyte genome, namely that of G. montanum. Its genome is particularly enigmatic given a phylogenetic position within or sister to conifers. It also carries genomic peculiarities that may reflect its morphological and ecological uniqueness amongst gymnosperms. Comparisons of these genome features with the genomes of conifers and G. biloba provide opportunities to predict the nature and direction of genomic change accompanying the evolution of the lineage leading to Gnetum (Fig. 4). Assuming that gnetophytes do indeed form a clade that is sister to, or within, the conifers, the following genomic features can be predicted to have been present in the MRCA of the gymnosperms, as observed in G. biloba14 and conifers11,12: (1) a large genome size (1C > 10 Gb) composed predominantly of a heterogeneous set of large numbers of LTR-RTs associated with low levels of repeat deletion14; (2) long introns predominantly shaped by insertions of LTR-RTs (gypsy and copia elements); (3) pfam domains that show a profile distinct from angiosperms. If this is so, and assuming a common ancestry of gnetophytes and conifers, these genomic characters, or their signatures, have subsequently been lost or diverged considerably in the lineage leading to Gnetum. This most likely involved the following genomic processes: (1) genome downsizing, leading to the relatively (for a gymnosperm) small genomes of Gnetum species (1C = 2.25–4.11 Gb). This is supported by the high ratio of solo LTR/intact LTR-RTs observed in the genome of Gnetum compared with conifers, and is indicative of the activity of recombination-based processes, which can eliminate DNA from the genome. Similar processes leading to genome downsizing have also been reported in many angiosperms, resulting in small genomes despite the occurrence of multiple rounds of polyploidy detected in many lineages49; (2) reduction in the size of introns in G. montanum and a replacement of many of the LTR-RT repeats with LINEs to give rise to introns that are more similar to those of, for instance, A. trichopoda than to other gymnosperms; (3) elevated rates of sequence divergence causing the erosion of a hypothesised shared seed-plant WGD event and leading to a pattern of pfam domains that is distinct from the remaining gymnosperms; (4) expansion and contraction of specific gene families associated with adaptation to new ecologies.
The sequenced G. montanum is a single mature female individual growing naturally in Fairy Lake Botanical Garden, Shenzhen, China. Genome sequences were generated using an Illumina platform and assembled with a novel hierarchical assembly strategy. Gene annotations were determined by integrating results from both de novo prediction approaches and alignment-based methods based on orthology and transcriptomic data. RNA-seq was performed using an Illumina platform. All methods and bioinformatic analyses are detailed in the Supplementary Information.
Life Sciences Reporting Summary
Further information on experimental design is available in the Life Sciences Reporting Summary.
The G. montanum genome project has been deposited at the NCBI under the BioProject number PRJNA339497. The whole genome sequencing data were deposited in the Sequence Read Archive (SRA) database under the accession number SRX2052734, SRX2098865, SRX2099144, SRX2114825, SRX2114827, SRX2134147, SRX2134160, SRX2134177, SRX2134180, SRX2134596 and SRX2134624. The G. montanum assemblies, gene sequences and annotation data are also available at the DRYAD website. The data or related program scripts that support the findings of this study are available from the corresponding author upon request.
Genome sequencing, assembly and annotation were conducted by the Novogene Bioinformatics Institute, Beijing, China; mutual contracts were No. NHT140016 and NVT140016004. This work was supported by funding from the Scientific Project of Shenzhen Urban Administration (201519) and a Major Technical Research Project of the Innovation of Science and Technology Commission of Shenzhen (JSGG20140515164852417). Additional funding was provided in particular by the Scientific Research Program of Sino-Africa Joint Research Center (SAJL201607). We thank X.Q. Wang, G.W. Hu, Z.D. Chen and Y.H. Guo for comments on gnetophyte phylogenetic relationships and ecological issues; H. Wu and X.P. Ning for discussion of related organ development; K.K. Wan and S. Sun for additional help on the analysis of repeats. We also thank X.Y. for support of funding coordination. Y.V.d.P. acknowledges the Multidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to networks’ Project (no. 01MR0310W) of Ghent University, and funding from the European Union Seventh Framework Programme (FP7/2007-2013) under European Research Council Advanced Grant Agreement 322739-DOUBLEUP.