Orchidaceae, renowned for its spectacular flowers and other reproductive and ecological adaptations, is one of the most diverse plant families. Here we present the genome sequence of the tropical epiphytic orchid Phalaenopsis equestris, a frequently used parent species for orchid breeding. P. equestris is the first plant with crassulacean acid metabolism (CAM) for which the genome has been sequenced. Our assembled genome contains 29,431 predicted protein-coding genes. We find that contigs likely to be underassembled, owing to heterozygosity, are enriched for genes that might be involved in self-incompatibility pathways. We find evidence for an orchid-specific paleopolyploidy event that preceded the radiation of most orchid clades, and our results suggest that gene duplication might have contributed to the evolution of CAM photosynthesis in P. equestris. Finally, we find expanded and diversified families of MADS-box C/D-class, B-class AP3 and AGL6-class genes, which might contribute to the highly specialized morphology of orchid flowers.
Ever since Darwin published Fertilization of Orchids in 1862 (ref. 1), orchids have attracted great interest from evolutionary biologists and botanists. Orchidaceae is one of the largest plant families, with between 22,075 and 26,567 species in 880 genera (see URLs), and is known for its diversity in specialized reproductive and ecological strategies. The specific development of the labellum (the 'lip') and gynostemium (a fused structure of the stamens and pistils) to trick pollinators and facilitate pollination is well documented, and the coevolution of orchid flowers and pollinators is well known2. In addition to the highly sophisticated floral structure contributing to the diversification of orchids3, CAM and epiphytism4 might also be linked to the adaptive radiation of orchids.
Phalaenopsis species are popular ornamental plants worldwide because of their elegant appearance and extended longevity, and they are of great economic importance for the floral industry. P. equestris is an important breeding parent because of its many colorful flowers in a single inflorescence. It has a karyotype of 2N = 2X = 38 with uniformly small chromosomes of 1–2.5 μm in length5. Its genome size is estimated to be 1.6 × 10⁹ bp per haploid genome, which is relatively small in comparison to the genomes of other species in the same genus5 or even other genera6. Some transcriptome sequence data have previously been generated for this species, and a transcriptome database, OrchidBase7,8, has been established. Here we present the first whole-genome sequence and analysis of P. equestris, providing fundamental knowledge for further research in orchid biology.
Genome sequencing and assembly
We adopted a whole-genome shotgun strategy to sequence and assemble the genome of P. equestris (Supplementary Table 1) and estimated the genome size to be 1.16 Gb (Supplementary Fig. 1 and Supplementary Note), which is smaller than previous estimates by flow cytometry (∼1.6 Gb)5. We assembled 236,185 scaffolds (Supplementary Table 2), with about 90% of the total assembled genome (∼980 Mb) contained in the 6,359 longest scaffolds.
The total genome assembly amounted to 1.086 Gb (1.0 Gb without unknown (N) bases), which is ∼93% (∼86% without N bases) of the estimated total genome size. We assessed the coverage of the genome assembly using BACs (Supplementary Tables 3 and 4, and Supplementary Note). For contigs longer than 1 kb in length (979 in total) generated by sequencing and assembling BAC pools, 97% could be mapped to the assembled genome (Supplementary Fig. 2). Comparison of ten randomly sequenced BAC scaffolds indicated a low error rate (Supplementary Table 4). Of the 248 conserved core eukaryotic genes that were used to assess genome completeness9, 234 (94%) were uncovered in our genome assembly (Supplementary Table 5 and Supplementary Note), and a majority of transcription fragments could be mapped to the genome assembly (Supplementary Tables 6, 7, 8). Further detailed information is provided in the Online Methods.
Genome annotation and gene expression analysis
About 62% of the genome assembly was found to be composed of repetitive DNA, a higher proportion than the 29% in rice10 and 41% in grape11 but similar to the proportion in sorghum (61%)12. Interspersed repeats and transposable elements (TEs) occupied about 59% of the genome, and tandem repeats accounted for 3% (Supplementary Table 9). Among the TEs, LTRs (long terminal repeats) were the most abundant and occupied ∼46% of the genome, followed in frequency by LINEs (long interspersed nuclear elements), which accounted for ∼8% of the genome (Supplementary Tables 10 and 11).
We analyzed the distribution of the divergence times for the complete LTRs in the genome assembly and estimated that most LTRs (71%) in the orchid genome arose during a relatively recent insertion between 11.7 and 43 million years ago (Supplementary Fig. 3), long after the origin of the last common ancestor of orchids (74–86 million years ago)13. Separate divergence analysis of the Copia and Gypsy TEs suggested that these two types of LTRs experienced a recent burst (Supplementary Fig. 3).
Next, protein-coding gene models were constructed using a pipeline combining de novo prediction, homology-based prediction and RNA sequence–aided prediction. In total, we predicted 29,431 protein-coding gene models, excluding genes with similarity to known TEs (Supplementary Table 12). We also predicted alternatively spliced forms of these protein-coding genes (Supplementary Table 13) and found 9,021 splice variants for 6,389 genes (21.7% of the total protein-coding genes). Of the 29,431 predicted genes, 20,398 (69.3%) were supported by transcriptome data from at least 1 of 4 tissues examined (leaf, flower, stem and root), and 15,530 gene models (52.8%) were supported by all 3 prediction methods, namely, homology-based, de novo and transcriptome-based prediction, and are therefore considered to represent the high-confidence gene set (Supplementary Fig. 4). For gene model validation, we manually examined 500 randomly selected genes from 2,038 of the gene families shared among monocots that only had 1 copy in both P. equestris and rice (Supplementary Table 14). Most of these 500 genes were correctly predicted (Supplementary Note), although 20 contained potential annotation errors, overall reflecting an annotation accuracy of 96%.
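The support figures above can be reproduced with simple set arithmetic. The following Python sketch is illustrative only: the gene identifiers, the three toy evidence sets and the function names are hypothetical and are not taken from the annotation pipeline. It treats the high-confidence set as the intersection of the genes supported by each prediction method and recomputes the reported percentages from the counts given in the text.

```python
TOTAL_GENES = 29431  # predicted protein-coding gene models (from the text)


def high_confidence(de_novo, homology, transcriptome):
    """Genes supported by all three evidence types (set intersection)."""
    return de_novo & homology & transcriptome


def percent(part, whole=TOTAL_GENES):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


# Hypothetical toy evidence sets, for illustration only
de_novo = {"g1", "g2", "g3", "g4"}
homology = {"g2", "g3", "g4", "g5"}
transcriptome = {"g3", "g4", "g6"}
toy_high_conf = high_confidence(de_novo, homology, transcriptome)

# Percentages recomputed from the counts reported in the text
supported_by_all = percent(15530)         # all three prediction methods
supported_by_rnaseq = percent(20398)      # transcriptome support in >= 1 tissue
alternatively_spliced = percent(6389)     # genes with splice variants
annotation_accuracy = percent(500 - 20, 500)  # manual check of 500 genes
```

The recomputed values agree with the percentages quoted in the paragraph (52.8%, 69.3%, 21.7% and 96%).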
Gene similarity clustering for the set of 29,431 predicted genes yielded 3,694 gene families containing at least 2 orchid genes. Furthermore, 4,171 orchid genes could not be grouped with any of the genes from the following species—Populus trichocarpa, Arabidopsis thaliana, Vitis vinifera, Oryza sativa, Brachypodium distachyon, Sorghum bicolor, Zea mays, Physcomitrella patens, Chlamydomonas reinhardtii and Ostreococcus lucimarinus—and are referred to as orphan genes. A 4-way comparison of orchid, rice, grape and A. thaliana (Fig. 1a) showed that there were 5,696 gene families shared by all 4 species, and 4,775 gene families were unique to P. equestris, more than in A. thaliana (2,647) and grape (3,634) but fewer than in rice (10,905). For all species, expanded and contracted gene families (in comparison to ancestors) were compared with those of P. equestris to identify gene families that were only expanded or contracted in P. equestris. In total, 2,497 gene families were expanded in P. equestris, whereas only 3 gene families were contracted (Supplementary Tables 15 and 16). For the gene families specifically expanded in P. equestris, we conducted Gene Ontology (GO) enrichment analysis and found enrichment for 'transition metal ion binding' and 'zinc ion binding', probably reflecting the importance of genes for the binding of metal ions (including iron and zinc) in orchid.
To obtain functional annotations for the coding genes, we first used InterProScan14 to identify known protein domains. This approach uncovered 94,693 domains encoded by 17,931 genes (60.93% of all genes). On the basis of their encoding specific protein domains, 12,739 genes were linked to 129,064 GO annotations. Second, we used high-quality functional annotations from rice, excluding GO labels inferred through electronic annotation alone, in combination with a tree-based approach to transfer these labels to orchid15. In this way, 8,518 new or more specific GO labels were assigned to 3,090 genes. The combination of both approaches resulted in a total of 13,954 genes (47.41%) with GO terms assigned to them (Supplementary Table 17).
To quantify general gene expression levels, we mapped all RNA sequencing (RNA-seq) data to the annotated genes and calculated RPKM values (reads per kilobase per million mapped reads) for every gene in the four different tissues analyzed (Supplementary Table 18). Differentially expressed genes were detected using the method described by Chen et al.16. Using false discovery rate (FDR) P < 0.05 as the threshold for significance, we found 2,283, 1,499, 947 and 1,288 genes that were preferentially expressed in flower, leaf, stem and root, respectively. Among the genes preferentially expressed in flowers, there was high enrichment for GO terms related to 'cell wall', 'cell wall modification' and 'pectinesterase activity' (P < 0.05; Supplementary Table 19), confirming the correlation of modification and organization of cell walls with flower development and wilting17. In leaves, GO terms related to 'photosynthesis', 'electron transport chain' and 'photosystem' were significantly enriched (P < 0.05), consistent with photosynthesis being the major function of this tissue. The 'heme binding' and 'iron ion binding' functions were enriched in both root and stem, suggesting that both organs have an important role in metal ion metabolism in orchid. GO terms related to 'transition metal ion metabolism' were also uncovered in GO term enrichment analysis for gene families with P. equestris–specific expansion and might suggest that the metabolism of transition metals (including iron, zinc and copper) in both stems and roots has an important role in adaptation to P. equestris–specific epiphytic growth niches. Mineral ions in epiphytic growth niches are unevenly distributed, usually diluted and only sporadically available in comparison to their distribution in growth niches in soils18. Gene duplication in these metabolic pathways might have contributed to different regulation of mineral ion metabolism adapted to epiphytic niches. 
It would be very interesting to see whether the same gene families have also been expanded in other non-orchid epiphytes (such as bromeliads).
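RPKM normalizes a raw read count by both gene length and library size. The following minimal Python sketch shows the standard definition; the function name and the example numbers are hypothetical, and it is not the code used for the analysis.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000-bp gene
# in a library of 10 million mapped reads
example = rpkm(500, 2000, 10_000_000)
```

Because the value is scaled by library size, RPKM values for the same gene can be compared across the four tissue libraries even when sequencing depth differs.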
To estimate the level of heterozygosity in the genome, we carried out k-mer distribution approximation with simulated heterozygous genome sequences and found that the real k-mer distribution was fitted best by a simulated k-mer (k represents the chosen length of substrings) distribution with 1.2% (between 1.1% and 1.3%) heterozygosity (Supplementary Fig. 1). We also investigated the level of heterozygosity in the assembled part of the genome by mapping all reads back to the assembly, finding that the heterozygosity was about 0.4%. Because the part of the genome showing the lowest heterozygosity was the best assembled part, 0.4% is probably an underestimation due to sampling bias.
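The k-mer analysis rests on a k-mer frequency spectrum: the number of distinct k-mers observed at each multiplicity. The Python sketch below builds such a spectrum for toy reads and applies the common estimate of genome size as total k-mer count divided by the depth of the main peak. It is a simplified illustration under stated assumptions (error-free reads, a single strand, hypothetical function names) and does not reproduce the simulation-based fitting of heterozygosity described above.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-length substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(counts, peak_depth):
    """Total number of k-mers divided by the depth of the main peak."""
    return sum(counts.values()) / peak_depth


# Hypothetical toy reads
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, k=3)
spectrum = kmer_spectrum(counts)
size = estimate_genome_size(counts, peak_depth=2)
```

In a heterozygous genome, k-mers that span a heterozygous site occur on only one haplotype, so they add a secondary peak at about half the depth of the main peak.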
With a heterozygous genome, the assembler might assemble the two alleles for a site separately, which would result in an excess of assembled sequence with half of the average sequencing depth. To plot the sequencing depth distribution, we mapped all the sequencing reads to the assembly; we found that, although the major peak of sequencing depth was at ∼100×, there was indeed a minor peak at ∼50× (Supplementary Fig. 5), indicating that there were genomic regions with half of the average sequencing depth due to underassembly of allelic regions with high heterozygosity. Using the depth distribution, we estimated the length of the genome assembly from these heterozygous regions to be 135 Mb (Supplementary Fig. 6). We then identified 58,241 contigs with half of the average sequencing depth (total length of 131.2 Mb, consistent with the estimation of heterozygous region length) (Supplementary Fig. 7); these contigs explained the 50× peak for sequencing depth, as the depth distribution was normal after excluding them (Supplementary Fig. 5 and Supplementary Note). The 2,454 genes from these contigs (Supplementary Table 20) were significantly over-represented in the GO terms 'apoptosis' (P = 4.19 × 10⁻⁶), 'programmed cell death' (P = 4.19 × 10⁻⁶) and 'defense response' (P = 1.77 × 10⁻⁴) and are possibly related to self-incompatibility19,20 (Supplementary Fig. 8 and Supplementary Table 21). The heterozygous contigs suggest that there is a block-wise distribution of heterozygosity, and we further identified heterozygous SNPs in the genome to characterize their distribution (Supplementary Fig. 9). We indeed found that the 1.7 million high-quality heterozygous SNPs identified were not distributed randomly in the genome.
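The idea of selecting contigs near half of the main sequencing depth can be sketched as follows. This Python example is an illustration only: the contig names, the depths and the ±25% tolerance window are hypothetical assumptions, and the criteria actually used to identify the 58,241 contigs are described in the Supplementary Note.

```python
def flag_half_depth(contig_depths, main_peak, tolerance=0.25):
    """Return names of contigs whose mean depth lies near half the main peak.

    `tolerance` is the relative half-width of the accepted window around
    main_peak / 2 (an assumed value, not taken from the paper).
    """
    half = main_peak / 2
    low, high = half * (1 - tolerance), half * (1 + tolerance)
    return sorted(
        name for name, depth in contig_depths.items() if low <= depth <= high
    )


# Hypothetical mean depths per contig; the main peak is at ~100x
depths = {"ctg1": 102.0, "ctg2": 48.5, "ctg3": 55.0, "ctg4": 97.3, "ctg5": 20.0}
flagged = flag_half_depth(depths, main_peak=100)
```

Here ctg2 and ctg3 fall inside the 37.5–62.5× window and would be treated as candidate separately assembled alleles, whereas ctg5 is too shallow to be explained by a twofold split.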
TE insertions in introns
We compared the gene models in 13 plant species and found that the average intron length for P. equestris was substantially longer than for the other species (Table 1). In comparing the distributions of TE proportions in introns, we identified a distinctive major peak near 45% in orchid (Supplementary Fig. 10 and Supplementary Table 22). In addition, the proportion of genes with long introns seemed substantially higher in P. equestris than in most other species, even after excluding TEs (Supplementary Fig. 11). Further comparison of intron length distributions showed that P. equestris had the highest proportion of introns with a length of ≥2,000 bp (27.7%) (Supplementary Table 23). To explore the functional consequences of intronic TE insertions, we compared the expression levels of genes with TE insertions to those of corresponding paralogous genes without intronic TE insertions. Overall, the expression levels of genes with TE insertions were lower than those of their paralogs (P < 1 × 10⁻¹¹, Wilcoxon rank-sum paired test) in all four tissues examined (Supplementary Fig. 12). A decrease in the expression levels of these genes can probably be explained by negative selection due to an increase in the transcription cost for the longer transcript, which is consistent with previous findings suggesting that natural selection favors short introns in highly expressed genes to minimize the cost of transcription and other molecular processes such as splicing21. Previous studies on paralogs formed by whole-genome duplication (WGD) showed significant differences in expression levels, with genes with weaker expression having a tendency to accumulate more transposon insertions22.
We constructed a phylogenetic tree on the basis of a concatenated sequence alignment of the 72 single-copy genes shared by orchid and 10 other green plant species (Fig. 1b). In this phylogenetic tree, orchid, as expected, clustered with other monocots, although the evolutionary distance from orchid to cereals such as rice, Brachypodium species, sorghum and maize was relatively large. Although the 72 genes already provided enough phylogenetic signals for phylogeny construction in the 11 green plant species, accurate dating of the divergence times between orchid and other monocots requires a larger gene set while compromising the phylogenetic coverage. Applying the PAML MCMCTree program to the 342 single-copy genes shared among orchid and 7 other species (monocots and dicots), the divergence time between P. equestris and the other monocots was estimated at 135.1 ± 17 million years ago, which is in line with estimates from both angiosperm-wide23 and Orchidaceae-specific13 molecular sequence divergence estimation studies. We also studied gene family expansion and contraction in different evolutionary lineages (Fig. 1b).
Like many other plant genomes sequenced thus far24,25,26,27, the orchid genome harbored the remnants of one or more large-scale duplication events. Although only a small fraction of the genome (3.51%) showed collinearity (conservation of gene order and content) with other regions in the genome, this proportion most likely constitutes a substantial underestimate. Indeed, in total, 12,000 orchid genes resided on scaffolds with fewer than 20 genes, which are of limited use for the intragenome detection of collinearity. Furthermore, about 6,500 genes were located on scaffolds with fewer than 5 genes. However, a considerable number of genes (5,492) were contained in syntenic regions that showed conservation of gene content, regardless of gene order. The notable difference between retained homeologs in collinear versus syntenic regions can probably be explained by a high degree of reshuffling of genes after duplication, fractionation (loss of either homeolog) and the low gene density of P. equestris (which has about the same number of genes as A. thaliana with a genome size that is about ten times larger). In the absence of a close outgroup and because other sequenced monocot species diverged more than 100 million years ago, the above factors render multi-level collinearity with other sequenced angiosperm species hard to detect. However, analysis of the number of synonymous substitutions per synonymous site (KS) for the whole paranome (the set of all duplicated genes in the genome), identified a peak between 0.6 and 1.1 that corresponds to contemporarily created gene duplicates (Supplementary Fig. 13a) and most likely represents an ancient WGD event28. Furthermore, when only duplicates retained in collinear regions were considered, duplicates from small-scale duplications were excluded from the distribution and the WGD signature peak in the KS distribution became even more pronounced (Supplementary Fig. 13b). 
Putative peaks at higher KS values might represent more ancient WGD events in the monocot lineage that might have been shared by orchids or their ancestors29,30 (Supplementary Fig. 13b).
We performed absolute dating through phylogenomic analysis to establish the age of this paleopolyploidy in relation to the monocot phylogeny27. The absolute dating of genes present under the signature WGD peak in the KS distribution rendered an absolute age distribution with a clear peak at ∼76 million years ago (Fig. 2a). More specifically, the WGD event was dated at 75.57 million years ago, with lower and upper 90% confidence interval limits of 71.50 and 80.73 million years ago, respectively (Supplementary Note). As the common ancestor of the crown group of orchids is supposed to have lived during the Late Cretaceous period sometime between 76 and 84 million years ago13, this finding would suggest that the orchid-specific WGD event occurred in association with the origin of this clade, and polyploidy is indeed proposed as a frequent mechanism of speciation in angiosperms31,32. In contrast, many members of the Orchidaceae family underwent drastic rate shifts (transition and transversion) during their evolutionary history due to periods of accelerated molecular evolution caused by their short life cycles and altered life history strategies33,34,35. These rate shifts complicate absolute dating27, especially considering the distant relationship between orchid and the other monocot species for which the complete genome sequence is currently available, such that the current WGD age estimate could be an overestimate, with the actual age most likely closer to the lower confidence interval boundary. This WGD age estimate would suggest that paleopolyploidy enabled survival across the Cretaceous-Paleogene boundary, as witnessed in many other angiosperms36 (Fig. 2b). Determining whether orchid-specific paleopolyploidy occurred in association with either its origin or the Cretaceous-Paleogene boundary will necessitate information on other non-cereal monocot genomes. 
Nevertheless, the orchid-specific paleopolyploidy identified in this study followed by the documented vast radiation of orchid not long after the Cretaceous-Paleogene boundary13, which enabled Orchidaceae to become the second largest angiosperm plant family with a remarkable diversity in flower morphology37, might suggest that the WGD event contributed to the success of orchids38.
Evolutionary analysis of CAM genes
In contrast to all vascular plants, of which about 10% are estimated to be epiphytes39, most orchid species (72%) are epiphytes, with the majority being restricted to tropical regions40,41. Many orchids use the CAM pathway for photosynthesis rather than the C3 pathway, which is considered to be an adaptation to their epiphytic lifestyle that limits water supply. CAM is an important metabolic pathway that evolved convergently in many different plant lineages, and it has been estimated that CAM pathway components are encoded by at least 343 genera in 35 plant families, comprising ∼6% of flowering plant species42,43,44. The CAM pathway bears resemblance to the C4 pathway in that both act to concentrate CO2 around RuBisCO, thereby increasing its efficiency. However, where C4 plants concentrate CO2 spatially, CAM plants concentrate CO2 temporally, providing CO2 during the day but not at night when respiration is the dominant reaction. P. equestris is the only CAM plant thus far for which the genome has been sequenced.
To gain further insight into the evolution of the CAM pathway from the ancestral C3 pathway, we identified genes encoding the key enzymes of the CAM pathway (Fig. 3 and Supplementary Table 24) and compared the P. equestris CAM genes with their homologs in Poaceae, including O. sativa, S. bicolor and Z. mays, using the dicot A. thaliana as an outgroup. We analyzed six key enzymes in CAM, namely, carbonic anhydrase (CA), malic enzyme (ME), malate dehydrogenase (MDH), pyruvate phosphate dikinase (PPDK), phosphoenolpyruvate carboxykinase (PPCK) and phosphoenolpyruvate carboxylase (PEPC). Gene trees were constructed from alignments of the coding sequences for each gene family (Supplementary Fig. 14 and Supplementary Data Set). We identified gene duplication and loss events along the lineage leading to P. equestris by manually inspecting each gene tree individually. In particular, we uncovered one gene loss and one gene duplication event in the PEPC gene family, two gene duplications in the ME gene family (in the NADP-ME subfamily), one gene loss in the PPCK gene family and at least six gene duplications in the CA gene family (Fig. 3). CA catalyzes the reaction converting carbon dioxide into bicarbonate, which is the first step in CO2 fixation. There are two CA subfamilies (α and β) in P. equestris. The most obvious expansion for a gene family was found in the α CA subfamily. However, the functional differentiation between the two subfamilies is still not clear. Gene family expansion might substantially increase the efficiency of the reaction through dosage effects and might also provide the possibility of adaptive evolution of the duplicated copies. We did not find any CAM genes among the retained duplicates in the WGD-derived homeologous segments and therefore did not find any obvious evidence that the paleopolyploidy event has been of crucial importance for CAM evolution in P. equestris.
MADS-box gene family analysis
MADS-box genes are known to be involved in many important processes during plant development but are especially known for their roles in flower development. Because orchids are famous for their flower morphology, we focused on identifying and characterizing the MADS-box genes in more detail. In total, 51 putative functional MADS-box genes and 9 pseudogenes were identified (Table 2). Perhaps surprisingly, these numbers are smaller than what is documented for most other sequenced angiosperms. P. equestris has 29 type II MADS-box genes, far fewer than the number found in rice (48) or other cereals. Phylogenetic analysis (Supplementary Fig. 15) showed that most of the genes in the type II MADS-box clades had been duplicated, except those in the B-PI, Bs, SVP and MIKC* clades. Among the duplicated type II clades, the E-class (six members), C/D-class (five members, three in C class and two in D class), B-class AP3 (four members) and AGL6 clades (three members) contained more genes than the corresponding clades in A. thaliana and rice (Supplementary Table 25). However, genes from the FLC, AGL12 and AGL15 clades could not be found in the P. equestris genome. Recently, FLC genes have been found in cereals, but they are difficult to identify because they are highly divergent and relatively short45. However, genes in both the AGL12 and AGL15 clades are present in the genomes of rice and A. thaliana; therefore, orthologs of FLC, AGL12 and AGL15 might have been specifically lost in orchids. Although AGL12-like genes (XAL1 in A. thaliana) are necessary for root development and flowering46, it seems that a different mechanism has evolved in P. equestris for the same function. Genes in the B-class AP3, C/D-class and E-class clades are well known for their roles in the specification of floral organ identity and have been well studied in P. equestris. 
These expanded clades including members with differential expression patterns in orchid floral organs, as well as divergent encoded protein domains, support the unique evolutionary routes of these floral organ identity genes associated with the unique labellum and gynostemium innovation in orchids47,48,49,50. Notably, one of the three gene copies in the expanded AGL6 clade had an expression pattern similar to that of the AP3-like PeMADS4 gene, which is specifically expressed in the labellum51. The AGL6-like gene OsMADS6 could specify floral organ identities and meristem fate by interacting with the floral homeotic genes SUPERWOMAN1, MADS3, MADS58, MADS13 and DROOPING LEAF in rice52. The OsMADS6 gene product has also been shown to act as an integrator by forming complexes with B-class and C/D-class MADS-box proteins53,54. Combinatorial protein interaction networks among the members of the expanded B-class, C/D-class, E-class and AGL6 clades during orchid floral development might have led to the evolutionary novelties of orchid flowers directly involved in speciation50.
Only 22 putative functional type I genes and 8 pseudogenes were found (Table 2), suggesting that the P. equestris type I MADS-box genes have experienced a lower birth rate or, alternatively, a higher death rate than type II MADS-box genes. Tandem gene duplications seem to have contributed to the increase in the number of type I genes in the α group (type I Mα) and suggest that the type I MADS-box genes have mainly been duplicated by smaller-scale and more recent duplications55. Interestingly, the P. equestris genome does not contain the β group of type I MADS-box genes (type I Mβ), although these do exist in A. thaliana, poplar and rice. Interactions among type I MADS-box genes are important for the initiation of endosperm development56.
We also determined the expression levels of all orchid MADS-box genes (Supplementary Table 26) and found that 20 of these genes were preferentially expressed in flower tissue. In particular, five genes (PEQU_41930 (D), PEQU_16438 (D), PEQU_12328 (Bs), PEQU_17261 (Mα) and PEQU_09539 (MIKC*)) were exclusively expressed in flower, suggestive of a distinct role in orchid floral morphogenesis.
Here we have presented a high-quality draft sequence of P. equestris, the first orchid for which the genome sequence has been determined. All around the world, orchids are highly endangered species because of illegal collection and loss of habitat. The complete genome sequence of P. equestris will provide an important resource to start exploring orchid diversity and evolution at the genome level, which will be important for ecological and conservation purposes. The genome sequence will also be a key resource for the development of new concepts and techniques in genetic engineering, such as molecular marker–assisted breeding and the production of transgenic plants, which are necessary to increase the efficiency of orchid breeding and aid orchid horticulture research.
Sample preparation and sequencing.
For genome sequencing, we collected leaves and flowers from several individuals of an inbred line of P. equestris and extracted genomic DNA using the modified CTAB protocol58. Sequencing libraries with insert sizes ranging from 160 bp to 20 kb were then constructed using a library construction kit (Illumina). Libraries were sequenced on the Illumina HiSeq 2000 platform. A library with an insert size of 40 kb was constructed using a modified fosmid library construction pipeline as described previously59. The raw reads generated were filtered according to sequencing quality and with regard to adaptor contamination and duplicated reads. Thus, only high-quality reads were retained and used in the genome assembly.
Genome assembly and assessment.
We adopted a whole-genome shotgun strategy to sequence and assemble the genome of P. equestris and obtained 119.4 Gb of data from seven DNA libraries (Supplementary Table 1). Using k-mer frequency distribution analysis, we estimated the genome size to be 1.16 Gb (Supplementary Fig. 1 and Supplementary Note), which is smaller than previous estimates by flow cytometry (∼1.6 Gb; ref. 5). In the k-mer distribution, we also observed a secondary peak indicating considerable heterozygosity, which posed serious challenges for the genome assembly algorithm. In general, genome assembly was carried out using SOAPdenovo60. Contig construction, scaffolding and gap filling processes were performed using the corresponding methods provided by SOAPdenovo. The parameters used in genome assembly were as follows: pregraph -s Pha.lib -a 200 -p 12 -K 35 -d 2 -o Pha; contig -g Pha_1213 -M 3; map -s Pha.lib -g Pha; scaff -g Pha. Using these data, we assembled 236,185 scaffolds, with a total length of ∼1.1 Gb and an N50 length of 359.1 kb (Supplementary Table 2). About 90% of the total assembled genome (∼980 Mb) was contained in the 6,359 longest scaffolds.
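The N50 statistic and the number of scaffolds needed to cover a given share of the assembly follow directly from the sorted scaffold lengths. The Python sketch below uses hypothetical toy lengths and hypothetical function names; it illustrates the standard definitions and is not the script used to produce Supplementary Table 2.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L hold at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no scaffold lengths given")


def scaffolds_to_cover(lengths, fraction):
    """Number of longest scaffolds needed to reach `fraction` of the total length."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count
    raise ValueError("no scaffold lengths given")


# Hypothetical scaffold lengths (arbitrary units)
toy = [100, 80, 60, 40, 20]
toy_n50 = n50(toy)                          # 100 + 80 = 180 >= 150
toy_count = scaffolds_to_cover(toy, 0.9)    # 100 + 80 + 60 + 40 = 280 >= 270
```

Applied to the real scaffold lengths, the second function corresponds to the statement that the 6,359 longest scaffolds contain about 90% of the assembled genome.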
The total genome assembly amounted to 1.086 Gb (1.0 Gb without N bases), which is ∼93% (∼86% without N bases) of the estimated total genome size. We assessed the coverage of the genome assembly using BACs (Supplementary Note). First, 18,486 BAC-end sequences (>100 bp in length) from a previous study61 and our data (W.-C. Tsai and H.-H. Chen, unpublished data) were downloaded and mapped back to the scaffolds using BLAT62. Of the BAC-end sequences, 92.9% could be mapped to the assembled genome. We then used a pooling strategy to sequence the BAC clones from a BAC library of P. equestris. We made mixtures of 10, 20 and 30 randomly selected BAC clones, each with an independent replicate. Each of the six BAC pools was amplified in liquid medium and then used to extract BAC DNA. The pooled BAC DNA was sheared to generate sequencing libraries of short insert size, and these libraries were sequenced. For each BAC pool, we obtained more than 5 Gb of data on average. We used SOAPdenovo to assemble the data from each BAC pool. Although the overall assembly results of these BACs had some gaps (Supplementary Table 27), we were able to generate some long contigs and use these to assess the quality of the whole-genome assembly. We mapped these long contigs back to the genome assembly with BLASTZ. The mapped length was then calculated, and the mapping details were displayed. For contigs longer than 1 kb in length (979 in total), 97% could be mapped to the assembled genome (Supplementary Fig. 2). Finally, we sequenced and assembled ten randomly chosen BAC clones using 454 sequencing technology. Comparison of these assembled BAC scaffolds with our assembly also indicated a low error rate (Supplementary Note).
We also assessed the completeness and accuracy of the assembly using conserved genes and RNA-seq data (Supplementary Note). Of the 248 conserved core eukaryotic genes that were used to assess genome completeness9, 234 (94%) were uncovered in our genome assembly (Supplementary Table 5). Using the ∼9 Gb of RNA-seq data from four different P. equestris tissues (leaf, flower, stem and root), we assembled transcription fragments, and between 93% (root) and ∼97% (leaf) of the assembled sequences could be mapped to the genome assembly with 90% identity and 90% mapping coverage (Supplementary Table 6). Thus, the coverage of the assembly in gene-rich regions was estimated to be >93%, which was higher than the estimated genome coverage overall.
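A filter on identity and mapping coverage of the kind used for the transcript fragments can be sketched in Python as follows. The exact definitions are assumptions made for illustration: identity is taken as matching bases over aligned bases, coverage as aligned bases over query length, and both thresholds are treated as minimum values. The record layout and all names are hypothetical.

```python
def passes_filter(query_len, aligned_len, matches,
                  min_identity=0.90, min_coverage=0.90):
    """True if an alignment meets both the identity and coverage thresholds.

    identity = matches / aligned_len      (assumed definition)
    coverage = aligned_len / query_len    (assumed definition)
    """
    identity = matches / aligned_len
    coverage = aligned_len / query_len
    return identity >= min_identity and coverage >= min_coverage


def mapped_fraction(best_hits):
    """Fraction of transcripts whose best hit passes the filter."""
    passed = sum(1 for hit in best_hits.values() if passes_filter(**hit))
    return passed / len(best_hits)


# Hypothetical best alignments, one per assembled transcript fragment
best_hits = {
    "t1": {"query_len": 1000, "aligned_len": 950, "matches": 900},    # passes
    "t2": {"query_len": 1000, "aligned_len": 800, "matches": 790},    # low coverage
    "t3": {"query_len": 500, "aligned_len": 480, "matches": 420},     # low identity
    "t4": {"query_len": 2000, "aligned_len": 1900, "matches": 1850},  # passes
}
fraction = mapped_fraction(best_hits)
```

Requiring both thresholds at once avoids counting short, near-perfect partial hits or long, poorly matching hits as evidence that a transcript is represented in the assembly.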
Tandem repeats and TEs in the genome were identified separately. The repeat annotation process was similar to that applied and described in a previous study59. Tandem repeats were identified using TRF63 and RepeatMasker64 (version 3.2.7). To identify TEs, we first used RepeatMasker with the Repbase65 database of known repeat sequences to search the TEs in the orchid genome. We then used LTR_FINDER66 (version 1.0.3), PILER67 and RepeatScout68 (version 1.05) to construct a repeat sequence database for orchid. Further, applying this de novo repeat sequence database, we used RepeatMasker to search for repeats in the genome. We also used RepeatProteinMask (version 3.2.2) implemented in RepeatMasker to identify repeat proteins. All the repeat sequences identified by the different methods were combined into the final repeat annotation. The repeat elements were categorized in a hierarchical way, as described previously59.
To study the divergence of LTRs, we identified LTRs with complete structure using LTR_STRUC69 with default parameters. Divergence was then estimated as described previously59. The LTRs with complete structure were aligned using MUSCLE70, and divergence was estimated using the Kimura two-parameter model in distmat implemented in the EMBOSS package.
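As an illustration of the Kimura two-parameter model, the following sketch computes the distance d = −(1/2) ln(1 − 2P − Q) − (1/4) ln(1 − 2Q) for a pair of aligned sequences, where P and Q are the proportions of transitions and transversions. It is not the EMBOSS distmat implementation; the function name and the choice to skip columns containing gaps or ambiguous bases are assumptions made for the example.

```python
import math

# Purine-purine and pyrimidine-pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumption of this sketch)
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)
```

For LTR retrotransposons, such a distance would typically be computed between the two LTRs of the same element, which were identical at the time of insertion.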
Gene annotation was performed as described previously59, using the following methods: (i) de novo gene prediction, (ii) homolog prediction, (iii) RNA-seq annotation and (iv) integration of the gene set. First, two de novo gene prediction programs, AUGUSTUS71 (version 2.03) and SNAP72, were applied to predict genes on the masked genome sequences. In homolog prediction, we utilized the protein sequences from four angiosperm genomes (A. thaliana, O. sativa, S. bicolor and Z. mays) to align against the unmasked genome using TBLASTN, with an E-value cutoff of 1 × 10^−5, and then used Genewise73 (version 2.2.0) to predict gene structures. In RNA-seq annotation, the RNA-seq reads from four tissues were aligned against the reference genome using TopHat74 (version 1.0.14). After alignment, the transcripts were assembled using Cufflinks75 (version 0.8.2). BESTORF was then used to predict ORFs with parameters trained on monocot genes without filtering out UTRs. Finally, we generated an integrated gene set using GLEAN76. Details on software parameters are provided in the Supplementary Note.
To detect alternative splicing, we first used an in-house script to train fifth-order Markov parameters with our annotated gene set. Using these trained parameters, we predicted an ORF for each transcript generated by Cufflinks. Transcripts without predicted ORFs were discarded. The predicted transcripts were compared to our gene models. Redundant transcripts, transcripts encoding short proteins (<50 amino acids in length) and transcripts for which the protein product was shorter than 30% of the proteins encoded in the gene set were filtered out.
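The three filters can be sketched as follows. This is an illustration under stated assumptions, not the in-house script: redundancy is taken to mean an identical protein already seen for the same gene, and the 30% cutoff is interpreted as relative to the length of the protein encoded by the matching gene model, which the text does not state explicitly. All names are hypothetical.

```python
def filter_transcripts(transcripts, reference_lengths,
                       min_length=50, min_fraction=0.30):
    """Return IDs of transcripts that survive the three filters.

    transcripts: list of (transcript_id, gene_id, protein_sequence)
    reference_lengths: dict gene_id -> protein length of the gene model
    """
    kept = []
    seen = set()
    for transcript_id, gene_id, protein in transcripts:
        key = (gene_id, protein)
        if key in seen:
            continue  # redundant: same protein already kept for this gene
        if len(protein) < min_length:
            continue  # encodes a short protein (<50 amino acids)
        if len(protein) < min_fraction * reference_lengths[gene_id]:
            continue  # shorter than 30% of the reference protein (assumed meaning)
        seen.add(key)
        kept.append(transcript_id)
    return kept
```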
For gene model validation, we manually examined 500 randomly selected genes from 2,038 of the gene families shared among monocots that only had one copy in both P. equestris and rice. Most of these 500 genes were correctly predicted, although 20 contained apparent annotation errors, corresponding to an annotation accuracy of 96%. We further manually checked the well-studied MADS-box genes in P. equestris and found that the annotation of those genes was consistent with previous gene models determined by the sequencing of full-length cDNAs (W.-C.T., Y.-Y.H., K.-W.L., Z.-J.L., G.-Q.Z. et al., unpublished data). Of 40 genes, only 3 had incomplete annotations, again reflecting the high quality of the gene annotation.
Building gene families.
To build gene families, the PLAZA pipeline15 was used. Along with the orchid genome, we included the genomes of four monocots (O. sativa, B. distachyon, S. bicolor and Z. mays), three eudicots (A. thaliana, P. trichocarpa and V. vinifera) and three outgroup species (P. patens, C. reinhardtii and O. lucimarinus). First, all pairwise similarities between the 364,344 coding genes in the data set were calculated using all-against-all BLASTP77, retaining the top 1,250 hits for each gene with an E-value cutoff of 1 × 10^−5. Using tribeMCL78, these similarities were clustered into homologous gene families (in mclblastline, using I = 2 and scheme = 4; other parameters were left at default values).
Genes included in a family with a similarity score (BLASTP) of less than 25% of the median similarity score for gene pairs within that family were flagged as outliers and were not included in the alignment and phylogenetic tree. For each family, the amino acid sequences encoded by all genes were obtained and aligned using MUSCLE. Ambiguously aligned positions (sites with gaps in the majority of the sequences and misalignments) were automatically removed from the alignments. Note that each singleton, that is, a gene with no homologs, was considered to represent a separate gene family (as shown in Fig. 1a).
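The outlier rule can be sketched as below. The text does not specify which per-gene score is compared against the family median; this example assumes the best BLASTP score of a gene to any other family member, and the function name and input layout are hypothetical.

```python
from statistics import median

def flag_outliers(pair_scores, fraction=0.25):
    """Flag genes whose best within-family score is below fraction * median.

    pair_scores: dict mapping (gene_a, gene_b) -> BLASTP similarity score
    for gene pairs within one family.
    """
    if not pair_scores:
        return set()
    threshold = fraction * median(pair_scores.values())
    best = {}
    for (gene_a, gene_b), score in pair_scores.items():
        for gene in (gene_a, gene_b):
            best[gene] = max(best.get(gene, 0.0), score)
    # Assumption: a gene is an outlier if even its best hit falls below the cutoff.
    return {gene for gene, score in best.items() if score < threshold}
```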
Two methods were used for functional annotation. First, InterProScan14 (using default settings) was run to map known protein domains to all genes. Using InterPro to GO mapping, GO labels were obtained for the InterPro domains. Second, on the basis of phylogenetic trees, reliable rice orthologs were identified for orchid genes. Functional annotations from the orthologous rice genes, excluding those with an evidence tag of 'inferred from electronic annotation', were transferred to the orchid genes.
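The ortholog-based transfer step can be illustrated with a short sketch. The gene identifiers and data layout are hypothetical; `IEA` is the standard GO evidence code for 'inferred from electronic annotation'.

```python
def transfer_go_annotations(orthologs, rice_annotations,
                            excluded_evidence=("IEA",)):
    """Transfer GO terms from rice orthologs to orchid genes.

    orthologs: dict orchid_gene -> list of rice gene IDs
    rice_annotations: dict rice_gene -> list of (go_term, evidence_code)
    Annotations whose evidence code is in excluded_evidence are skipped.
    """
    transferred = {}
    for orchid_gene, rice_genes in orthologs.items():
        terms = set()
        for rice_gene in rice_genes:
            for go_term, evidence in rice_annotations.get(rice_gene, []):
                if evidence not in excluded_evidence:
                    terms.add(go_term)
        if terms:
            transferred[orchid_gene] = terms
    return transferred
```

Excluding electronically inferred annotations avoids propagating labels that were themselves assigned by sequence similarity alone.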
Detection of genomic homology.
Genomic homology was detected using i-ADHoRe 3.0 (ref. 79), included in the PLAZA pipeline, using the following settings: alignment method gg2, gap size 30, tandem gap 30, cluster gap 35, q value 0.85, prob cutoff 0.01, anchor points 5 and multiple hypothesis correction FDR. The output was processed by the pipeline and included in a relational database to which visualization programs can connect and on which additional statistical analysis can be performed. For synteny detection, the cloud mode was enabled (cluster_type = cloud) and appropriate settings were selected: cloud_gap_size 20, cloud_cluster_gap 20, cloud_filter_method binomial, prob cutoff 0.01, anchor points 5, multiple hypothesis correction FDR and level_2_only true.
Relative dating using synonymous substitutions.
KS values for homologous gene pairs were calculated by first aligning the coding sequences with ClustalW80 using the protein sequences as a guide. Positions aligned with low confidence (regions near gaps in the alignment) were stripped. Codeml (PAML package81) was used to determine the actual KS value of each pair. To build the orchid paranome (all duplicated genes) KS age distribution, a correction was performed as described in Maere et al.10.
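Aligning coding sequences with the protein alignment as a guide amounts to threading each coding sequence onto its aligned protein, codon by codon. The sketch below shows this step and a simplified stripping step; the function names are hypothetical, and only codon columns that contain a gap are removed here, whereas the procedure described above also strips regions near gaps.

```python
def protein_guided_codon_alignment(aligned_protein, cds):
    """Thread an unaligned coding sequence onto its gapped protein alignment."""
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence too short for the protein")
    codons = []
    position = 0
    for aa in aligned_protein:
        if aa == "-":
            codons.append("---")  # a protein gap becomes a three-base gap
        else:
            codons.append(cds[position:position + 3])
            position += 3
    return "".join(codons)

def strip_gapped_columns(codon_alignments):
    """Drop every codon column in which any sequence has a gap (simplified)."""
    n_codons = len(codon_alignments[0]) // 3
    keep = [
        i for i in range(n_codons)
        if all(seq[3 * i:3 * i + 3] != "---" for seq in codon_alignments)
    ]
    return ["".join(seq[3 * i:3 * i + 3] for i in keep)
            for seq in codon_alignments]
```

The stripped codon alignment would then be passed to a program such as codeml to estimate KS for the pair.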
Gene family evolution and phylogenomic dating.
We used 72 single-copy families shared by all 11 species to construct a phylogenetic species tree. We applied the Café program82 to identify gene families that had undergone expansions or contractions.
To date the divergence times of orchid and other monocots, we used 342 single-copy gene families shared by orchid, the other 4 monocots (O. sativa, B. distachyon, S. bicolor and Z. mays) and 3 eudicots (A. thaliana, P. trichocarpa and V. vinifera). The PAML MCMCTree program was used to estimate species divergence times, with the options 'correlated molecular clock' and 'JC69' model. The 'alpha' parameter was estimated by PhyML using the same set of sequence data. MCMC analysis was run for 20,000 generations, using a burn-in of 1,000 iterations. Other parameters were left at default settings. Phase 1 (non-degenerate) sequences for all single-copy gene families that were identified by the PLAZA pipeline were used as the input file for PhyML and MCMCTree. We used the O. sativa and B. distachyon divergence time (40–54 million years ago83), the P. trichocarpa and A. thaliana divergence time (100–120 million years ago84) and the monocot and eudicot divergence time (130–240 million years ago11) as calibrators to predict the divergence time of other nodes and obtained a predicted divergence time of 135.1 million years ago (95% credibility interval of 118.1–152.7 million years ago) for P. equestris and cereals.
Evolutionary analysis of CAM genes.
We identified putative CAM genes by searching the InterProScan result of all predicted P. equestris proteins. We identified orthologs of the P. equestris CAM genes in the genomes of other species, including A. thaliana, Z. mays, O. sativa and S. bicolor, using a reciprocal best hit (RBH) strategy implemented with NCBI BLAST and custom scripts. We then aligned the coding sequences of each gene family using MUSCLE implemented in MEGA5 (ref. 85). Before constructing the phylogeny, we removed dubious short reading frames and obviously unrelated genes resulting from the relaxed annotation of InterProScan. A gene tree was then constructed with MEGA5 using maximum likelihood for each gene family.
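The reciprocal best hit criterion can be sketched as follows. This is not the custom scripts used in the study; it assumes that BLAST results have already been parsed into (query, subject, bit score) tuples, and the function names are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit.

    forward_hits: species A proteins searched against species B
    reverse_hits: species B proteins searched against species A
    """
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

A pair is retained only when the best hit in each search direction points back to the other gene, which makes the criterion conservative when recent duplicates are present.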
MADS-box gene analysis.
MADS-box genes were identified by searching the InterProScan result of all predicted P. equestris proteins. MADS-box domains comprising 60 amino acids, identified by SMART86 for all the MADS-box genes, were then aligned using ClustalW. An unrooted neighbor-joining phylogenetic tree was constructed in MEGA5 with default parameters. Bootstrap analysis was performed using 1,000 iterations.
OrchidBase, http://orchidbase.itps.ncku.edu.tw/; Angiosperm Phylogeny Website, http://www.mobot.org/MOBOT/research/APweb/; World Checklist of Orchidaceae, http://apps.kew.org/wcsp/; RepeatProteinMask, http://www.repeatmasker.org/cgi-bin/RepeatProteinMaskRequest; EMBOSS, http://emboss.sourceforge.net/; BESTORF, http://linux1.softberry.com/berry.phtml?topic=bestorf&group=help&subgroup=gfind.
Sequencing data, annotations and analysis results have all been uploaded to the FTP site ftp://ftp.genomics.org.cn/from_BGISZ/20130120/ for evaluation. The data have also been submitted to the NCBI database under BioProject PRJNA192198.
We thank J.C. Pires for helpful discussions. The authors acknowledge support from the 948 Program of the State Forestry Administration (China) (2011-4-53) to Z.-J.L. and X.-M.Z., the Shenzhen Municipal Science & Technology Programs for Building State and Shenzhen Key Laboratories (no. 2006464, 200712 and CXB201005260070A) to L.-Q.H., the Youth Talent Fund of the Shenzhen municipal government (JC201005310690A) to J.C. and the Ghent University Multidisciplinary Research Partnership 'Bioinformatics: From Nucleotides to Networks' to Y.V.d.P. Y.V.d.P. also acknowledges support from the European Union Seventh Framework Programme (FP7/2007-2013) under European Research Council Advanced Grant Agreement 322739–DOUBLE-UP. Part of this work was carried out using the Stevin Supercomputer Infrastructure at Ghent University, funded by Ghent University, the Hercules Foundation and the Flemish Government Department of Economy, Science and Innovation.
CAM gene alignments.