Constituting approximately 10% of flowering plant species, orchids (Orchidaceae) display unique flower morphologies, possess an extraordinary diversity in lifestyle, and have successfully colonized almost every habitat on Earth1,2,3. Here we report the draft genome sequence of Apostasia shenzhenica4, a representative of one of two genera that form a sister lineage to the rest of the Orchidaceae, providing a reference for inferring the genome content and structure of the most recent common ancestor of all extant orchids and improving our understanding of their origins and evolution. In addition, we present transcriptome data for representatives of Vanilloideae, Cypripedioideae and Orchidoideae, and novel third-generation genome data for two species of Epidendroideae, covering all five orchid subfamilies. A. shenzhenica shows clear evidence of a whole-genome duplication, which is shared by all orchids and occurred shortly before their divergence. Comparisons between A. shenzhenica and other orchids and angiosperms also permitted the reconstruction of an ancestral orchid gene toolkit. We identify new gene families, gene family expansions and contractions, and changes within MADS-box gene classes, which control a diverse suite of developmental processes, during orchid evolution. This study sheds new light on the genetic mechanisms underpinning key orchid innovations, including the development of the labellum and gynostemium, pollinia, and seeds without endosperm, as well as the evolution of epiphytism; reveals relationships between the Orchidaceae subfamilies; and helps clarify the evolutionary history of orchids within the angiosperms.
The Apostasioideae are a small subfamily of orchids that includes only two genera (Apostasia and Neuwiedia2,5), consisting of terrestrial species confined to the humid areas of southeast Asia, Japan, and northern Australia6. Although Apostasioideae share some synapomorphies with other orchids (for example, small seeds with a reduced embryo and a myco-heterotrophic protocorm stage), they possess several unique traits, the most conspicuous of which is their floral morphology7. Apostasia has a non-resupinate, solanum-type flower with anthers closely encircling the stigma (including postgenital fusion), a long ovary, and an actinomorphic perianth with an undifferentiated labellum. Three stamens (two of which are fertile) are basally fused to the style, forming a relatively simple gynostemium, and the anthers contain powdery pollen (grains not unified into pollinia). These characteristics (Extended Data Fig. 1a) differ from those of other Orchidaceae subfamilies, which have three sepals, three petals (of which one has specialized to form the labellum), and stamens and pistil fused into a more complex gynostemium (Extended Data Fig. 1b), but are similar to those of some species of Hypoxidaceae (a sister family to Orchidaceae, in the order Asparagales).
We sequenced the A. shenzhenica genome using a combination of different approaches; the total length of the final assembly was 349 Mb (see Methods and Supplementary Tables 1–4). We confidently annotated 21,841 protein-coding genes, of which 20,202 (92.50%) were supported by transcriptome data (Supplementary Fig. 1 and Supplementary Table 5). Using single-copy orthologues, we performed a BUSCO8 assessment that indicated that the completeness of the genome was 93.62%, suggesting that the A. shenzhenica genome assembly is of high quality (Supplementary Table 6). For comparative analyses, we also improved the quality of the previously published genome assemblies of the orchids Phalaenopsis equestris9 and Dendrobium catenatum10 (see Methods and Supplementary Tables 6 and 7).
We constructed a high-confidence phylogenetic tree and estimated the divergence times of 15 plant species using genes extracted from a total of 439 single-copy families (Fig. 1 and Extended Data Fig. 2). We undertook a computational analysis of gene family sizes (CAFÉ 2.211) to study gene family expansion and contraction during the evolution of orchids and related species (Fig. 1 and Supplementary Note 1.1). By comparing 12 plant species, we found 474 gene families (Extended Data Fig. 3) that appeared unique to orchids (Supplementary Note 1.2). Gene Ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis found these gene families to be specifically enriched in the terms ‘O-methyltransferase activity’, ‘cysteine-type peptidase activity’, ‘flavone and flavonol biosynthesis’ and ‘stilbenoid, diarylheptanoid and gingerol biosynthesis’ (Supplementary Note 1.2).
Distributions of synonymous substitutions per synonymous site (KS) (see Supplementary Note 2.1) for paralogous A. shenzhenica genes showed a clear peak at KS ≈ 1 (Extended Data Fig. 4). Similar peaks at KS values of 0.7 to 1.1 were identified in 11 other orchids, covering all 5 orchid subfamilies (Supplementary Fig. 2). These peaks might reflect multiple independent whole-genome duplication (WGD) events across orchid sublineages or, more parsimoniously, a single WGD event shared by all (extant) orchids. Comparisons of orchid paralogue KS distributions with KS distributions of orthologues between orchid species, and between orchids and Asparagus officinalis (asparagus, Asparagaceae, a sister family to Orchidaceae in the order Asparagales) (Fig. 2a and Supplementary Note 2.1) indicated that the WGD signatures are not shared with non-orchid Asparagales. Absolute phylogenomic dating12 (Extended Data Fig. 5 and Supplementary Note 2.1) revealed that the WGDs and the earliest diversification of extant orchid lineages occurred relatively close together in time, supporting the possibility of a single WGD event in the most recent common ancestor of extant orchids.
Intragenomic co-linearity and synteny analysis identified two WGD events in A. shenzhenica (Fig. 2b, Supplementary Fig. 3 and Supplementary Note 2.2). Co-linearity and synteny analyses between A. shenzhenica and Amborella trichopoda, and between A. shenzhenica and Vitis vinifera, also support at least two WGDs in A. shenzhenica (Supplementary Figs 4 and 5); for example, four paralogous segments in A. shenzhenica corresponded to one orthologous region in A. trichopoda (Fig. 2c). Detailed genome comparisons of A. shenzhenica with Ananas comosus (pineapple) and A. officinalis revealed a specific 4:4 co-linearity pattern (Extended Data Fig. 6 and Supplementary Figs 6–8) that is consistent with the two monocot WGDs proposed for A. comosus, indicating that all three species possess a similar evolutionary history with regard to WGDs (Supplementary Note 2.2). Together, these patterns of co-linearity suggest that the older of the two WGDs evident in A. shenzhenica is likely to be shared with A. comosus and A. officinalis (representing the τ WGD13,14 shared by most monocots), and corroborate the idea that the younger WGD represents an independent event, specific to the Orchidaceae lineage. Analyses of gene trees that contained at least one paralogue pair from co-linear regions from one of the three orchid genomes placed the younger A. shenzhenica WGD and the P. equestris and D. catenatum WGDs on the orchid stem branch, and also provided additional evidence for the monocot τ WGD13,14 (Fig. 3 and Supplementary Note 2.3). We therefore find strong support for a WGD event shared by all extant orchids, which is likely to be only slightly older than their earliest divergence and might be correlated with orchid diversification. In addition, as observed for many other plant lineages, this orchid-specific WGD might be associated with the Cretaceous/Palaeogene boundary15.
Apostasia presents a number of characters that are plesiomorphic in orchids, such as an actinomorphic perianth with an undifferentiated labellum, a gynostemium with partially fused androecium and gynoecium, pollen that is not aggregated into pollinia, and underground roots for terrestrial growth1,5,6,7. The A. shenzhenica genome contains 36 putative functional MADS-box genes (Table 1, Supplementary Table 8 and Supplementary Fig. 9), 27 of which are type II MADS-box genes (Table 1). Two type II MADS-box classes appear to be reduced: A. shenzhenica seems to have fewer genes in the B-AP3 (two members) and E classes (three members) than P. equestris (four B-AP3 and six E-class members) and D. catenatum (four B-AP3 and five E-class members) (Fig. 4a). Previous studies have shown that expanded B-AP3 and E classes with members that have different expression patterns in floral organs are associated with the innovation of the unique labellum and gynostemium in orchids9,16,17, and that duplicated B-AP3 genes are responsible for the modularization of the perianth of orchid flowers18. We identified B-AP3 genes from the transcriptomes of species of each of the orchid subfamilies and the B-class MADS-box genes from the floral transcriptome data of Molineria capitulata, a member of Hypoxidaceae that possesses a flower with petaloid tepals and powdery pollen (similar to that found in Apostasia). We found one member in each of the two B-AP3 subclades for both A. shenzhenica and M. capitulata, but one or two members in each B-AP3 subclade for the other orchids (Extended Data Fig. 7). All these B-AP3 genes are highly expressed in flower buds (Extended Data Fig. 7). These similarities suggest that the lower gene numbers in MADS-box B-AP3 and E classes in Apostasia represent an ancestral state, responsible for producing the plesiomorphic flower with an undifferentiated labellum and partially fused gynostemium. The B-AP3 and E classes may have expanded independently only in the non-apostasioid orchids or, alternatively, in the common ancestor of all extant orchids, possibly as a result of the shared orchid WGD, with subsequent loss of paralogous genes in Apostasia causing reversion to the ancestral state. The B-AP3 gene tree topology and some evidence from co-linearity analysis of orchid B-AP3 genes (Supplementary Fig. 10) suggest the latter. We hypothesize that differential paralogue retention and subsequent sub- and neo-functionalization of B-AP3 and E-class members resulted in the derived labellum found in other orchids (Fig. 4b).
The packaging of pollen grains into a compact unit known as the pollinium, specialized for transfer as a unit by pollinating vectors, was a key innovation in the evolutionary history of Orchidaceae and may have played a role in promoting the tremendous radiation of the group19. In seed plants, the P- and S-subclades of MIKC*-type genes are major regulators of male gametophytic development20,21. The P-subclade, however, is absent in all orchids except A. shenzhenica (Extended Data Fig. 8). Gene expression analysis showed that, in orchids and M. capitulata, MIKC*-type genes are expressed in the pollinia or pollen, suggesting they play roles in its development (Extended Data Fig. 9). Although most orchids have a pollinium, Apostasia has scattered pollen, similar to M. capitulata, Oryza sativa (rice), and Arabidopsis thaliana. Therefore, we propose that the loss of the P-subclade members of MIKC*-type genes is related to the evolution of the pollinium (Fig. 4a, c and Supplementary Note 3).
The reduction of seed volume and content to an absolute minimum is a pivotal aspect of Orchidaceae evolution: in all orchid species, endosperm is absent from the seed. Type I MADS-box genes are important for the initiation of endosperm development22, and transcripts of type I Mα and Mγ MADS-box genes were found in developing seeds of A. shenzhenica, P. equestris, and M. capitulata (Extended Data Fig. 10 and Supplementary Fig. 11). Notably, the three orchid genomes do not contain any type I Mβ MADS-box genes (Fig. 4a and Supplementary Fig. 12), which are found in Arabidopsis, Populus trichocarpa (poplar), O. sativa (Table 1), and in M. capitulata (Supplementary Fig. 13). The lack of endosperm in orchids might therefore be related to the missing type I Mβ MADS-box genes (Fig. 4d).
Orchids are one of very few flowering plant lineages that have been able to successfully colonize epiphytic or lithophytic niches, clinging to trees or rocks and growing in dry conditions using crassulacean acid metabolism2,9,10. The roots of epiphytic orchids, such as Phalaenopsis and Dendrobium, are extremely specialized and differ from the roots of terrestrial orchids such as Apostasia. These aerial roots develop the velamen radicum, a spongy epidermis that traps the nutrient-rich flush during rainfall, representing an important adaptation of epiphytic orchids23,24,25. The Arabidopsis AGL12 gene is involved in root cell differentiation26. A. shenzhenica contains one AGL12 clade gene, as do Arabidopsis and rice. In addition, we found transcripts similar to AGL12 in M. capitulata. In both A. shenzhenica and M. capitulata, these genes are highly expressed in root tissue (Supplementary Fig. 14). Notably, we did not find similar genes in epiphytic orchids, suggesting that the loss of these gene(s) may be involved in losing the ability to develop true roots for terrestrial growth (Fig. 4e). Utricularia gibba, an asterid in the order Lamiales (only distantly related to the orchids) that lacks true roots, also lacks these AGL12 clade or similar genes27. The Arabidopsis ANR1 gene is a key gene involved in regulating lateral root development in response to external nitrate supply28. We found that the MADS-box gene subfamily ANR1 is probably reduced in P. equestris (two members) and D. catenatum (three members), compared with four members in A. shenzhenica (Fig. 4a): this is consistent with no development of lateral (aerial) roots in epiphytic orchids.
In conclusion, the genome sequence of A. shenzhenica, an orchid belonging to a small clade that is sister to the rest of Orchidaceae, provides a reference for studying orchid evolution, revealing clear evidence of an ancient WGD shared by all orchids, facilitating reconstruction of the ancestral orchid gene toolkit, and providing insights into many orchid-specific features such as the development of the labellum and gynostemium, pollinia, and seeds without endosperm, as well as the evolution of epiphytism.
No statistical methods were used to predetermine sample size.
Sample preparation and sequencing
For genome sequencing, we collected leaves, stems, and flowers from wild A. shenzhenica, a self-pollinating species found in southeast China4 that has a karyotype of 2N = 2X = 68 with uniform small chromosomes (Supplementary Fig. 16). We extracted genomic DNA using a modified cetyltrimethylammonium bromide (CTAB) protocol. Sequencing libraries with insert sizes ranging from 180 bp to 20 kb (Supplementary Table 1) were constructed using a library construction kit (Illumina). These libraries were then sequenced using an Illumina HiSeq 2000 platform. The 80.02-Gb raw reads generated were filtered according to sequencing quality, the presence of adaptor contamination, and duplication. Only high-quality reads were used for genome assembly.
Total RNA was extracted from this study’s samples using the RNAprep Pure Plant Kit and genomic DNA contamination was removed using RNase-Free DNase I (both from Tiangen). The integrity of RNA was evaluated on a 1.0% agarose gel stained with ethidium bromide (EB), and its quality and quantity were assessed using a NanoPhotometer spectrophotometer (IMPLEN) and an Agilent 2100 Bioanalyzer (Agilent Technologies). As the RNA integrity number (RIN) was greater than 7.0 for all samples, they were used in cDNA library construction and Illumina sequencing, which was completed by Beijing Novogene Bioinformatics Technology Co., Ltd. The cDNA library was constructed using the NEBNext Ultra RNA Library Prep Kit for Illumina (NEB) and 3 μg RNA per sample, following the manufacturer’s recommendations. The PCR products obtained were purified (AMPure XP system) and library quality was assessed on the Agilent Bioanalyzer 2100 system. Library preparations were sequenced on an Illumina Hiseq 2000 platform, generating 100-bp paired-end reads.
Genome size estimation and preliminary assembly
The genome size of species in Apostasioideae is between 0.38 pg and 5.96 pg31, which is relatively small compared to that of other subfamilies (ranging from 0.38 pg to 55.4 pg)32. To estimate the genome size of A. shenzhenica, we used reads from paired-end libraries to determine the distribution of K-mer values. According to the Lander–Waterman theory33, genome size can be determined by the total number of K-mers divided by the peak value of the K-mer distribution. Given only one peak in the K-mer distribution, we found that A. shenzhenica has no heterozygosity (Supplementary Fig. 17). With the peak at the expected K-mer depth and the formula genome size = total K-mer/expected K-mer depth, the size of the haploid genome was estimated to be 471.0 Mb (haploid). We used ALLPATHS-LG software34 and obtained a preliminary assembly of A. shenzhenica with a scaffold N50 size of 1.196 Mb and corresponding contig N50 size of 30.1 Kb.
PacBio library construction and sequencing and filling gaps
Genomic DNA was isolated from the leaves of A. shenzhenica, P. equestris and D. catenatum. For a 20-kb insert size library, at least 20 μg of sheared DNA was required. SMRTbell template preparation involved DNA concentration, damage repair, end repair, ligation of hairpin adapters, and template purification, and used AMPure PB Magnetic Beads. Finally, the sequencing primer was annealed and sequencing polymerase was bound to SMRTbell template. The instructions specified as calculated by the RS Remote software were followed. We carried out 20-kb single-molecule real-time DNA sequencing by PacBio and sequenced the DNA library on the PacBio RS II platform, yielding about 5.44 Gb (A. shenzhenica), 10.54 Gb (P. equestris) and 11.06 Gb (D. catenatum) PacBio data (read quality ≥ 0.80, mean read length of A. shenzhenica ≥ 7 Kb, of P. equestris and D. catenatum ≥ 10 Kb) (Supplementary Table 2).
We used PBjelly software35 to fill gaps with PacBio data. The options were “<blasr>-minMatch 8 -sdpTupleSize 8 -minPctIdentity 75 -bestn 1 -nCandidates 10 -maxScore -500 -nproc 10 -noSplitSubreads</blasr>” for the protocol.xml file. Then, we used Pilon36 with default settings to correct assembled errors. For the input BAM file, we used BWA to align all the Illumina short reads to the assembly and SAMTOOLS to sort and index the BAM file.
10X Genomics library construction, sequencing, and extending scaffolds
DNA sample preparation, indexing, and barcoding were done using the GemCode Instrument from 10X Genomics. About 0.7 ng input DNA with 50 kb length was used for GEM reaction procedure during PCR, and 16-bp barcodes were introduced into droplets. Then, the droplets were fractured following the purifying of the intermediate DNA library. Next, we sheared DNA into 500 bp for constructing libraries, which were finally sequenced on the Illumina HiseqXTen37 (Supplementary Table 3).
We used BWA mem to align the 10X Genomics data to the filled gaps assembly using default settings. Then, we used fragScaff38 for scaffolding. The options were as follows: A. shenzhenica (stages1 “-m 3000 -q 30”; stages2 “-C 2”; stages3 “-j 1.25 -u 2”), D. catenatum (stages1 “-m 3000 -q 30”; stages2 “-C 1”; stages3 “-j 2 -u 2”) and P. equestris (stages1 “-m 3000 -q 30”; stages2 “-C 1”; stages3 “-j 2 -u 2”)39.
The total length of the final assembly for A. shenzhenica was 349 Mb with a scaffold N50 size of 3.029 Mb and corresponding contig N50 size of 80.1 Kb. (Supplementary Table 4). For the two previously published orchid genomes of P. equestris and D. catenatum, the scaffold N50 size as well as the completeness (see below) improved considerably: for P. equestris, the scaffold N50 size increased from 359.12 Kb9 to 1.217 Mb and the corresponding contig N50 size from 20.56 Kb9 to 45.79 Kb, while for D. catenatum the scaffold N50 size increased from 391.46 Kb10 to 1.055 Mb, and the corresponding contig N50 size from 33.1 Kb10 up to 51.7 Kb (Supplementary Table 7).
A total of 146.65 Mb of repetitive elements occupying more than 42.05% of the A. shenzhenica genome were annotated using a combination of structural information and homology prediction10. Retrotransposable elements, known to be the dominant form of repeats in angiosperm genomes, constituted a large part of the A. shenzhenica genome and included the most abundant subtypes, such as LTR/Copia (4.97%), LTR/Gypsy (11.84%), LINE/L1 (2.78%) and LINE/RTE-BovB (9.32%), among others. In addition, the percentage of de novo predicted repeats was notably larger than that obtained for homologous repeats based on Repbase40, indicating that A. shenzhenica has multiple unique repeats compared with other sequenced plant species (Supplementary Table 9).
Gene and non-coding RNA prediction
MAKER41 was used to generate a consensus gene set based on de novo predictions from AUGUSTUS42 and GlimmerHMM43, homology annotation with the universal single-copy genes from CEGMA44 and the genes from Arabidopsis (TAIR10) and another four sequenced monocots (O. sativa, P. equestris, S. bicolor and Zea mays) using exonerate45, and RNA-seq prediction by Cufflinks46 and Tophat47. These results were integrated into a final set of protein-coding genes for annotation (Supplementary Table 5). Using the same annotation pipeline as for A. shenzhenica, 29,545 and 29,257 protein-coding genes were predicted for P. equestris and D. catenatum, respectively (Supplementary Table 7). A. shenzhenica was found to have a greater average gene length (here we considered the start and stop codons as the two boundaries for a gene) than most other sequenced plants, but this length was similar to that of P. equestris and D. catenatum (Supplementary Fig. 18 and Supplementary Table 10), in both of which this is due to a long average intron length9,10.
We then generated functional assignments of the A. shenzhenica genes with BLAST (version 2.2.28+) by aligning their protein-coding regions to sequences in public protein databases, including KEGG (59.3)48, SwissProt (release 2013_06)49, TrEMBL (release 2013_06)50 and NCBI non-redundant protein database (20150617), and InterProScan (v5.11-51.0)51 was also used to provide functional annotation (Supplementary Table 11). We were able to generate functional assignments for 84.2% of the A. shenzhenica genes from at least one of the public protein databases (Supplementary Table 11).
The tRNA genes were searched by tRNAscan-SE52. For rRNA identification, we downloaded the Arabidopsis rRNA sequences from NCBI and aligned them with the A. shenzhenica genome to identify possible rRNAs. Additionally, other types of non-coding RNAs, including miRNA and snRNA, were identified by using INFERNAL53 to search from the Rfam database. In the end, we identified 43 microRNAs, 203 transfer RNAs, 452 ribosomal RNAs and 93 small nuclear RNAs in the A. shenzhenica genome (Supplementary Table 12).
Before assembly, we got high-quality reads by removing adaptor sequences and filtered low-quality reads by using TRIMMOMATIC54 from raw reads with parameters: ILLUMINACLIP:path/adaptor:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:36. The resulting high-quality reads were de novo assembled and annotated with the TRINITY program55. The commands and parameters used for running TRINITY were as follows: Trinity –seqType fq –JM 200G –left sample_1.fq –right sample_2.fq –normalize_by_read_set –CPU 32 –output sample –min_kmer_cov 2. Protein sequences and coding sequences of transcripts were predicted using TransDecoder (http://transdecoder.github.io), a software tool that identifies likely coding sequences from transcript sequences and compares the translated coding sequences with the PFAM domain database55. For genes with more than one transcript, the longest one was used to calculate transcript abundance and coverage. Transcript abundance level was normalized using the fragments per kilobase per million mapped reads (FPKM) method, and FPKM values were computed as proposed by Mortazavi et al.56.
Transcriptomes of Agave deserti57 and Allium cepa58 were downloaded from Dryad (h5t68) and NCBI (PRJNA175446), respectively. We removed the redundant unigenes in A. cepa by CD-HIT-EST with 99% identity and used TransDecoder to predict proteins with default parameters.
We carried out BLASTP (E value <1 × 10−3) to search the best hits for the proteins predicted in the transcriptomes against a customized database, built with proteins from the genomes of A. shenzhenica, P. equestris9, D. catenatum10, and A. officinalis (GenBank accession number GCF_001876935.1) as well as public databases, such as NCBI Plant RefSeq (release 80), Ensembl (release 77), Ensembl Metazoa (release 24), Ensembl Fungi (release 24), and Ensembl Protists (release 24). Only plant-homologous proteins were retained in the transcriptomes to eliminate the effects of genes derived from commensal organisms, laboratory contaminants, and artefacts resulting from incorrect assembly (Supplementary Table 13).
Gene family identification
We downloaded genome and annotation data of A. trichopoda (http://amborella.huck.psu.edu, version1.0), A. comosus (GenBank accession number GCF_001540865.1), A. thaliana (TAIR 10), A. officinalis (GenBank accession number GCF_001876935.1), B. distachyon (purple false brome; Phytozome v9.0), M. acuminata (http://ensemblgenomes.org, release-21), O. sativa (Nipponbare, IRGSP-1.0), P. dactylifera (http://qatar-weill.cornell.edu/research/datepalmGenome), P. trichocarpa (http://ensemblgenomes.org, release-21), S. bicolor (sorghum; Phytozome v9.0), S. polyrhiza (common duckweed; http://www.spirodelagenome.org), and V. vinifera (Phytozome v9.0). We chose the longest transcript to represent each gene and removed gene models with open reading frames shorter than 150 bp. Gene family clustering was performed using OrthoMCL59 based on the set of 21,841 predicted genes of A. shenzhenica and the protein sets of the above ten other monocots, three dicots and the outgroup A. trichopoda. This analysis yielded 11,995 gene families in A. shenzhenica containing 18,268 predicted genes (83.6% of the total genes identified; orthologous genes in the 15 sequenced plant species are shown in Supplementary Fig. 19 and Supplementary Table 14) (see also Supplementary Note 1).
Phylogenetic tree construction and phylogenomic dating
We constructed a phylogenetic tree based on a concatenated sequence alignment of 439 single-copy gene families from A. shenzhenica and the 14 other plant species using MrBayes60 software with GTR+Γ model (Fig. 1). For the phylogenetic analysis incorporating ten additional transcriptome species (Extended Data Fig. 2), we first picked up the genes of A. shenzhenica, D. catenatum, and P. equestris in the single-copy gene families as seed genes, and then made a BLASTP alignment between the transcriptome unigenes and the seed sequences. For one single-copy family, if the three seed genes all had the identical best-hit to a unigene, this gene was identified as the orthologous gene to the gene family. With this method we found 132 single-copy gene families of the total 25 species, then constructed the phylogenetic tree based on a concatenated sequence alignment of them using PhyML61 with GTR+Γ model. Divergence times were estimated by PAML MCMCTREE62. The Markov chain Monte Carlo (MCMC) process was run for 1,500,000 iterations with a sample frequency of 150 after a burn-in of 500,000 iterations. Other parameters used the default settings of MCMCTREE. Two independent runs were performed to check convergence. The following constraints were used for time calibrations: (i) the O. sativa and B. distachyon divergence time (40–54 million years ago (Ma))63; (ii) the P. trichocarpa and A. thaliana divergence time (100–120 Ma)64; (iii) the monocot and eudicot divergence time with a lower boundary of 130 Ma65; and (iv) 200 Ma as the upper boundary for the earliest-diverging angiosperms66.
Identification of WGD events in A. shenzhenica and phylogenomic analyses
KS-based age distributions were constructed as previously described67. In brief, the paranome was constructed by performing an all-against-all protein sequence similarity search using BLASTP with an E value cutoff of 1 × 10−10, after which gene families were built with the mclblastline pipeline (v10-201) (http://micans.org/mcl)68. Each gene family was aligned using MUSCLE (v3.8.31)69, and KS estimates for all pairwise comparisons within a gene family were obtained through maximum likelihood estimation using the CODEML program70 of the PAML package (v4.4c)62. Gene families were then subdivided into subfamilies for which KS estimates between members did not exceed a value of 5. To correct for the redundancy of KS values (a gene family of n members produces n(n−1)/2 pairwise KS estimates for n−1 retained duplication events), a phylogenetic tree was constructed for each subfamily using PhyML61 under default settings. For each duplication node in the resulting phylogenetic tree, all m KS estimates between the two child clades were added to the KS distribution with a weight of 1/m (where m is the number of KS estimates for a duplication event), so that the weights of all KS estimates for a single duplication event summed to one. The resulting age distribution of the A. shenzhenica paranome is shown in Extended Data Fig. 4a.
Absolute dating of the identified WGD event in A. shenzhenica was performed as previously described9,12. In brief, paralogous gene pairs located in duplicated segments (anchors) and duplicated pairs lying under the WGD peak (peak-based duplicates) were collected for phylogenetic dating. Anchors, assumed to correspond to the most recent WGD event, were detected using i-ADHoRe (v3.0)71,72. Their KS distribution is shown in Extended Data Fig. 4b. The identified anchors confirmed the presence of a WGD peak near a KS value of 1 (the long tail and additional peaks in the anchor pair distribution are most likely due to small saturation effects67 and the remnants of older WGD events in the monocot lineage, such as the τ WGD13,14). We selected anchor pairs and peak-based duplicates present under the WGD peak and with KS values between 0.6 and 1.4 (dashed lines in Extended Data Fig. 4a, b) for absolute dating. For each WGD paralogous pair, an orthogroup was created that included the two paralogues plus several orthologues from other plant species as identified by InParanoid (v4.1)73 using a broad taxonomic sampling: one representative orthologue from the order Cucurbitales, one from the Rosales, two from the Fabales, one from the Malpighiales, two from the Brassicales, one from the Malvales, one from the Solanales, two from the Poaceae (Poales), one from A. comosus14 (Bromeliaceae, Poales), one from either M. acuminata74 (Zingiberales) or P. dactylifera75 (Arecales), and one orthologue from the Alismatales, either from S. polyrhiza76 or Zostera marina77. In total, 85 orthogroups based on anchors and 230 orthogroups based on peak-based duplicates were collected. The node joining the two A. shenzhenica WGD paralogues was then dated using the BEAST v1.7 package78 under an uncorrelated relaxed clock model and an LG+G (four rate categories) evolutionary model. A starting tree with branch lengths satisfying all fossil prior constraints was created according to the consensus APGIV phylogeny79. Fossil calibrations were implemented using log-normal calibration priors on the following nodes: the node uniting the Malvidae based on the fossil Dressiantha bicarpellata80 with prior offset = 82.8, mean = 3.8528, and s.d. = 0.581; the node uniting the Fabidae based on the fossil Paleoclusia chevalieri82 with prior offset = 82.8, mean = 3.9314, and s.d. = 0.583; the node uniting the A. shenzhenica WGD paralogues with the other non-Alismatalean monocots based on fossil Liliacidites84 with prior offset = 93.0, mean = 3.5458, and s.d. = 0.585; and the root with prior offset = 124, mean = 4.0786, and s.d. = 0.586. The offsets of these calibrations represent hard minimum boundaries, and their means represent locations for their respective peak mass probabilities in accordance with some recent and most taxonomically complete dating studies available for these specific clades87. A run without data was performed to ensure proper placement of the marginal calibration prior distributions88. The MCMC for each orthogroup was run for 10 million generations with sampling every 1,000 generations, resulting in a sample size of 10,000. The resulting trace files of all orthogroups were evaluated manually using Tracer v1.578 with a burn-in of 1,000 samples to ensure proper convergence (minimum ESS for all statistics was at least 200). In total, 303 orthogroups were accepted, and all age estimates for the node uniting the WGD paralogous pairs were then grouped into one absolute age distribution (Extended Data Fig. 5; too few anchor pairs were available to evaluate them separately from the peak-based duplicates), for which KDE and a bootstrapping procedure were used to find the peak consensus WGD age estimate and its 90% confidence interval boundaries, respectively. More detailed methods are available in Vanneste et al.12.
To compare the relative timing of speciations and WGD event(s) in orchids based on KS distributions, we first identified 839 anchors from D. catenatum and 355 anchors from P. equestris using i-ADHoRe 3.0 and calculated their KS as described above. Identification of orthologues between A. shenzhenica and A. officinalis, A. shenzhenica and P. equestris, A. shenzhenica and D. catenatum, and P. equestris and D. catenatum was performed first by reciprocal BLASTP with E value <1 × 10−5 for proteins from the three orchids and asparagus, followed by sorting BLAST hits by bit-scores and E values. Reciprocal best hits in the four comparisons were selected as orthologues. In this way, we identified 9,142 orthologues between A. shenzhenica and A. officinalis, 10,699 orthologues between A.shenzhenica and P. equestris, 11,386 orthologues between A. shenzhenica and D. catenatum, and 13,139 orthologues between P. equestris and D. catenatum. For each pair of orthologues, ClustalW89 alignment was carried out to perform sequence alignment using the parameter for amino acids recommended by Hall90. PAL2NAL91 was then used to back-translate aligned protein sequences into codon sequences and to remove any gaps in the alignment. Estimates of KS values were obtained from CODEML in PAML using the Goldman-Yang model with codon frequencies estimated by the F3 × 4 model.
We performed pairwise co-linearity analysis within A. shenzhenica and between A. shenzhenica and A. officinalis, A. comosus, V. vinifera, and A. trichopoda. Homologous pairs of A. shenzhenica and the above species were identified by all-against-all BLASTP (E value <1 × 10−5), followed by the removal of weak matches by applying a c-score of 0.5 (indicating their BLASTP bit-scores were below 50% of the bit-scores of the best matches)92. Then, i-ADHoRe 3.0 was used to identify co-linear segments with parameters as described above except using ‘level_2_only = FALSE’, enabling the functionality to detect highly degenerated co-linear segments resulting from more ancient large-scale duplications (this is achieved by recursively building genomic profiles based on relatively recent co-linear segments). All co-linear dot plots were drawn by selecting co-linear segments according to a specified required number of anchor pairs (given in the figure legend of each of the dot plots). For the comparisons between A. shenzhenica and the chromosome-level assembled genomes (A. officinalis, A. comosus, and V. vinifera) we retained co-linear segments with at least ten anchor pairs (Extended Data Fig. 6 and Supplementary Figs 5, 7, 8). For the comparisons with fragmented genomes, like A. trichopoda, and the self-comparison of A. shenzhenica, we kept co-linear segments with five anchor pairs (Fig. 2b and Supplementary Figs 3, 4). The start and end boundaries of selected co-linear segments were used to define broader regions containing such segments on the chromosomes or scaffolds by further connecting co-linear segments if they overlapped with each other. Then, duplication depths, that is, the number of connected co-linear segments overlapping at each position of a broader region, were illustrated in the margins of the plots by mapping the connected co-linear segments over each other. The number of anchors required in the co-linear segments could affect the duplication depth in such a way that increasing the number of anchors required tends to remove co-linear segments originating from more ancient WGD(s) due to increased gene loss.
To identify the duplication events that resulted in the 1,488 anchor pairs in A. shenzhenica, the 839 anchor pairs in D. catenatum, and the 355 anchor pairs in P. equestris, we performed phylogenomic analyses employing protein-coding genes from 20 species, including 12 orchids across all five subfamilies of Orchidaceae (the three orchids with genomes (A. shenzhenica, D. catenatum and P. equestris) plus nine orchid transcriptomes (Supplementary Table 13)), four non-orchid Asparagales (A. officinalis (genome), M. capitulata (Supplementary Table 13), A. deserti57 and A. cepa58), three commelinid monocots (Elaeis guineensis, P. dactylifera, and A. comosus), and A. trichopoda. OrthoMCL (v2.0.9)59 was used with default parameters to identify gene families based on sequence similarities resulting from an all-against-all BLASTP with E value <1 × 10−5. Then, 1,101 of the 2,582 anchor pairs with KS values greater than five were removed. If the remaining anchors fell into different gene families, indicating incorrect assignment of gene families by OrthoMCL, we merged the corresponding gene families. In this way, we obtained 32,217 multi-gene gene families. Next, phylogenetic trees were constructed for the subset of 777 gene families with no more than 300 genes that had at least one pair of anchors and one gene from A. trichopoda. Multiple sequence alignments were produced by MUSCLE (v3.8.31) using default parameters. These were trimmed by trimAl (v1.4)93 to remove low-quality regions based on a heuristic approach (-automated1) that depends on a distribution of residue similarities inferred from the alignments for each gene family. RAxML (v8.2.0)94 was then used with the GTR+Γ model to estimate a maximum likelihood tree starting with 200 rapid bootstraps followed by maximum likelihood optimizations on every fifth bootstrap tree. Gene trees were rooted based on genes from A. trichopoda if these formed a monophyletic group in the tree; otherwise, mid-point rooting was applied. The timing of the duplication event for each anchor pair relative to the lineage divergence events was then inferred using the following approach (Supplementary Fig. 20): we first mapped internodes from a gene tree to the species phylogeny according to the common ancestor of the genes in the gene tree. Each internode of the gene tree was then defined as either a duplication node, a speciation node, or a ‘dubious’ node. A duplication node is a node that shares at least one pair of paralogues, a speciation node is a node that has no paralogues and is consistent with divergence in the species phylogeny, and a ‘dubious’ node is a node that has no paralogues and is inconsistent with divergence in the species phylogeny. Then, if a pair of anchors coalesced to a duplication node, we traced back its parental node(s) until we reached a speciation node in the gene tree. In this way, we circumscribed the duplication event as between these two nodes with the duplication node as the lower bound and the speciation node as the upper bound on the species tree. If the two nodes were directly connected by a single branch on the species tree, the duplication was thus considered to have occurred on the branch. To reduce biased estimations, we used the bootstrap value on the branch leading to the common ancestral node of an anchor pair as support for a duplication event. In total, 628 anchor pairs in 493 gene families coalesced as duplication events on the species phylogeny, and duplication events from 318 anchor pairs in 262 gene families (or from 448 anchor pairs in 367 gene families) had bootstrap values greater than or equal to 80% (or 50%).
Evolution and expression analysis of orchid MADS box genes
We identified candidates of MADS-box genes by searching the InterProScan51 result of all the predicted A. shenzhenica proteins. The candidates of MADS-box genes were further determined by SMART95, which identified MADS-box domains comprised by 60 amino acids. The protein-sequence set of the MADS-box gene candidates was BLAST against the assembled A. shenzhenica transcriptomes with the TBLASTN program. The matched transcript sequences were then assembled with the candidates of MADS-box genes using Sequencher v5.1 (Gene Codes Corp.) and the exon structure of the final MADS-box genes was manually edited (Supplementary Data 1). In the end, we aligned all the identified MADS-box genes using the ClustalW program89. An unrooted neighbour-joining phylogenetic tree was constructed in MEGA596 with default parameters.
Transcriptomic analysis of other orchids
In addition, 53 more transcriptomes derived from 9 more taxa and 8 tissues (flower bud, anther, pollinium, shoot, stem, leaf, aerial root and root) (Supplementary Table 13) were sampled to investigate the roles of the genes that may be important for the evolution of orchid traits. The gene expression levels were indicated by FPKM on the longest assembled transcript.
Genome sequences and whole-genome assembly of A. shenzhenica and whole transcriptomes have been submitted to the National Center for Biotechnology Information (NCBI) database under BioProject PRJNA310678; the remaining transcriptomes used in this study can be found in the previously available BioProjects PRJNA288388, PRJNA304321, and PRJNA348403; the raw data and the updated whole-genome assembly of P. equestris have been submitted to NCBI under BioProject PRJNA389183; and the raw data and the updated whole-genome assembly of D. catenatum have been renewed under the already existing BioProject PRJNA262478. All other data are available from the corresponding authors upon reasonable request.
We acknowledge support from The Funds for Environmental Project of Shenzhen, China (no. 2013-02); The 948 programme of the State Forestry Administration P. R. China (no. 2011–4–53); The Funds for Forestry Science and Technology Innovation Project of Guangdong, China (no. 2016KJCX025; no. 2013KJCX014-05); Fundamental Research Project of Shenzhen, China (no. JCYJ20170307170746099; no. JCYJ20150403150235943); and the teamwork projects funded by Guangdong Natural Science Foundation (no. 2017A030312004) awarded to Z.-J.L. Y.V.d.P. acknowledges the Multidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to networks’ Project (no. 01MR0310W) of Ghent University and the European Union Seventh Framework Programme (FP7/2007-2013) under European Research Council Advanced Grant Agreement 322739–DOUBLEUP.
Extended data figures
This file contains Supplementary Notes 1-3, Supplementary Figures 1-27, Supplementary Tables 1-23, Supplementary Data and additional references.