Eucalypts are the world’s most widely planted hardwood trees. Their outstanding diversity, adaptability and growth have made them a global renewable resource of fibre and energy. We sequenced and assembled >94% of the 640-megabase genome of Eucalyptus grandis. Of 36,376 predicted protein-coding genes, 34% occur in tandem duplications, the largest proportion thus far in plant genomes. Eucalyptus also shows the highest diversity of genes for specialized metabolites such as terpenes that act as chemical defence and provide unique pharmaceutical oils. Genome sequencing of the E. grandis sister species E. globulus and a set of inbred E. grandis tree genomes reveals dynamic genome evolution and hotspots of inbreeding depression. The E. grandis genome is the first reference for the eudicot order Myrtales and is placed here sister to the eurosids. This resource expands our understanding of the unique biology of large woody perennials and provides a powerful tool to accelerate comparative biology, breeding and biotechnology.
References
- Phylogeny, diversity and evolution of eucalypts. in Plant Genome: Biodiversity and Evolution, Part E: Phanerogams-Angiosperm Vol. 1, 303–346 (Science Publishers, 2008)
- Eucalyptologics Information Resources on Eucalypt Cultivation Worldwide. http://www.git-forestry.com (GIT Forestry Consulting, retrieved 29 March 2009)
- Ecosystem Goods and Services from Plantation Forests 254 (Earthscan, 2010)
- The effects of age and environment on the expression of inbreeding depression in Eucalyptus globulus. Heredity 107, 50–60 (2011)
- The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006)
- High-density genetic linkage maps with over 2,400 sequence-anchored DArT markers for genetic dissection in an F2 pseudo-backcross of Eucalyptus grandis × E. urophylla. Tree Genet. Genomes 8, 163–175 (2012)
- Genomic characterization of DArT markers based on high-density linkage analysis and physical mapping to the Eucalyptus genome. PLoS ONE 7, e44684 (2012)
- Rosid radiation and the rapid rise of angiosperm-dominated forests. Proc. Natl Acad. Sci. USA 106, 3853–3858 (2009)
- The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488, 213–217 (2012)
- The genome of woodland strawberry (Fragaria vesca). Nature Genet. 43, 109–116 (2011)
- Chloroplast genome phylogenetics: why we need independent approaches to plant molecular evolution. Trends Plant Sci. 10, 203–209 (2005)
- Phylogenomics: the beginning of incongruence? Trends Genet. 22, 225–231 (2006)
- Plants with double genomes might have had a better chance to survive the Cretaceous-Tertiary extinction event. Proc. Natl Acad. Sci. USA 106, 5737–5742 (2009)
- The age and diversification of the angiosperms re-revisited. Am. J. Bot. 97, 1296–1303 (2010)
- The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007)
- Importance of lineage-specific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol. 148, 993–1003 (2008)
- Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annu. Rev. Plant Biol. 60, 433–453 (2009)
- High synteny and colinearity among Eucalyptus genomes revealed by high-density comparative genetic mapping. Tree Genet. Genomes 8, 339–352 (2012)
- Nuclear DNA content of commercially important Eucalyptus species and hybrids. Can. J. For. Res. 24, 1074–1078 (1994)
- A new classification of the genus Eucalyptus L'Her. (Myrtaceae). Aust. Syst. Bot. 13, 79–148 (2000)
- Flammable biomes dominated by eucalypts originated at the Cretaceous-Palaeogene boundary. Nature Commun. 2, 193 (2011)
- Co-evolution between transposable elements and their hosts: a major factor in genome size evolution? Chromosome Res. 19, 777–786 (2011)
- Comparative SNP diversity among four Eucalyptus species for genes from secondary metabolite biosynthetic pathways. BMC Genomics 10, 452 (2009)
- High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics 9, 312 (2008)
- Genomic selection for growth and wood quality in Eucalyptus: capturing the missing heritability and accelerating breeding for complex traits in forest trees. New Phytol. 194, 116–128 (2012)
- Progress in Myrtaceae genetics and genomics: Eucalyptus as the pivotal genus. Tree Genet. Genomes 8, 463–508 (2012)
- What genes make a tree a tree? Trends Plant Sci. 10, 210–214 (2005)
- Lignin biosynthesis. Annu. Rev. Plant Biol. 54, 519–546 (2003)
- Cellulose factories: advancing bioenergy production from forest trees. New Phytol. 194, 54–62 (2012)
- Hemicelluloses. Annu. Rev. Plant Biol. 61, 263–289 (2010)
- Distribution of foliar formylated phloroglucinol derivatives amongst Eucalyptus species. Biochem. Syst. Ecol. 28, 813–824 (2000)
- α,β-Unsaturated monoterpene acid glucose esters: structural diversity, bioactivities and functional roles. Phytochemistry 72, 2259–2266 (2011)
- Some evolutionary consequences of being a tree. Annu. Rev. Ecol. Evol. Syst. 37, 187–214 (2006)
- Regulation and function of SOC1, a flowering pathway integrator. J. Exp. Bot. 61, 2247–2254 (2010)
- Reproductive biology of eucalypts. in Eucalypt Ecology: Individuals to Ecosystems 30–56 (Cambridge Univ. Press, 1997)
- Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003)
- Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003)
- Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041–1052 (2001)
- InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 38, D196–D203 (2010)
- InterPro: the integrative protein signature database. Nucleic Acids Res. 37, D211–D215 (2009)
- Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340, 783–795 (2004)
- Predotar: A tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics 4, 1581–1590 (2004)
- A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6, 175–182 (1998)
- RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006)
- Improved criteria and comparative genomics tool provide new insights into grass paleogenomics. Brief. Bioinform. 10, 619–630 (2009)
- Reconstruction of monocotelydoneous proto-chromosomes reveals faster evolution in plants than in animals. Proc. Natl Acad. Sci. USA 106, 14908–14913 (2009)
- VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32, W273–W279 (2004)
- Multiple whole-genome alignments without a reference organism. Genome Res. 19, 682–689 (2009)
- In silico archeogenomics unveils modern plant genome organisation, regulation and evolution. Curr. Opin. Plant Biol. 15, 122–130 (2012)
- Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol. 28, 511–515 (2010)
- BLAT–the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002)
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997)
- Gramene database in 2010: updates and extensions. Nucleic Acids Res. 39, D1085–D1094 (2011)
- Gramene database: a hub for comparative plant genomics. Methods Mol. Biol. 678, 247–275 (2011)
- Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 25, 25–29 (2000)
- Hal: an automated pipeline for phylogenetic analyses of genomic data. PLoS Curr. 3, RRN1213 (2011)
- HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29–W37 (2011)
- Quantifying the mechanisms of domain gain in animal proteins. Genome Biol. 11, R74 (2010)
- Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J. Mol. Biol. 348, 231–243 (2005)
- Domain tree-based analysis of protein architecture evolution. Mol. Biol. Evol. 25, 254–264 (2008)
- TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009)
- Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
- Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009)
- Rfam: an RNA family database. Nucleic Acids Res. 31, 439–441 (2003)
- Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res. 39, D141–D145 (2011)
- LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13, 721–731 (2003)
- Isolation of milligram quantities of nuclear DNA from tomato (Lycopersicon esculentum), a plant containing high levels of polyphenolic compounds. Plant Mol. Biol. Rep. 15, 148–153 (1997)
- A rapid method for tissue collection and high-throughput isolation of genomic DNA from mature trees. Plant Mol. Biol. Rep. 24, 81–91 (2006)
- MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002)
- MEGA4: Molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24, 1596–1599 (2007)
Extended data figures and tables
Extended Data Figures
- Extended Data Figure 1: RNA-seq-based expression evidence for predicted Eucalyptus grandis gene models. (287 KB)
Gene expression was assessed with Illumina RNA-seq analysis (240 million RNA sequences from six tissues, mapped to 36,376 E. grandis genes, V1.1 annotation). Genes were counted as expressed in a tissue if a minimum of FPKM = 1.0 was observed in the tissue. A total of 23,485 gene models (64.6%) were detected in all six tissues compared here and 32,697 (89.9%) in at least one of the six tissues. Expression profiles for individual genes are accessible in the Eucalyptus Genome Integrative Explorer (EucGenIE, http://www.eucgenie.org/).
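The counting rule in this caption (a gene is expressed in a tissue when its FPKM is at least 1.0) can be illustrated with a minimal Python sketch. The gene identifiers and FPKM values below are hypothetical toy data, and `summarize_expression` is an illustrative helper, not part of the authors' pipeline.

```python
from typing import Dict, List

FPKM_THRESHOLD = 1.0  # minimum FPKM stated in the caption


def summarize_expression(
    fpkm: Dict[str, List[float]], threshold: float = FPKM_THRESHOLD
) -> Dict[str, int]:
    """Count genes expressed in every tissue and in at least one tissue."""
    in_all = sum(1 for vals in fpkm.values() if all(v >= threshold for v in vals))
    in_any = sum(1 for vals in fpkm.values() if any(v >= threshold for v in vals))
    return {"total": len(fpkm), "all_tissues": in_all, "any_tissue": in_any}


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as reported in the caption."""
    return round(100.0 * part / whole, 1)


# Hypothetical toy table: gene identifier -> FPKM in six tissues.
toy_fpkm = {
    "gene_a": [5.2, 3.1, 1.0, 8.4, 2.2, 9.9],  # meets the threshold everywhere
    "gene_b": [0.0, 0.4, 2.5, 0.1, 0.0, 0.3],  # meets it in one tissue only
    "gene_c": [0.2, 0.0, 0.9, 0.0, 0.1, 0.0],  # never meets it
}
summary = summarize_expression(toy_fpkm)
```

Applying `percent` to the counts reported in the caption reproduces the stated proportions: 23,485 and 32,697 of 36,376 genes give 64.6% and 89.9%.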
- Extended Data Figure 2: Sharing of protein-coding gene families, protein domains and domain arrangements in Eucalyptus, Arabidopsis, Populus and Vitis. (524 KB)
a, The four rosid lineages have a total of 16,048 protein-coding gene clusters (from a total of 35,118 identified in 29 sequenced genomes; see Methods and Supplementary Information section 3) of which a core set of 6,926 clusters are shared among all four lineages. Of the 36,376 high-confidence annotated gene models in E. grandis, 30,341 (84%) are included in 10,049 clusters. E. grandis has 851 unique gene clusters (that is, not shared with any of the three other rosid genomes, but shared with at least one other of the 29 genomes). b, A total of 3,160 Pfam-A domains are shared among the four rosid lineages, the majority of which are single-domain arrangements (3,138 shared among the four lineages). Thirteen Pfam-A domains were only detected in Eucalyptus and 392 domain arrangements are specific to Eucalyptus in this four-way comparison.
- Extended Data Figure 3: Green plant phylogeny based on shared gene clusters from 17 sequenced plant genomes. (262 KB)
The phylogenetic tree was generated by RAxML analysis including at least one protein from at least half of the species per protein cluster in a concatenated MUSCLE alignment adjusted by Gblocks with liberal settings (Supplementary Data 7). The corresponding bootstrap partitions are provided at each node. The tree was rooted with Physcomitrella (a moss) as outgroup. The Myrtales lineage represented by Eucalyptus grandis is supported as sister to fabids and malvids (core rosid) clades together with the basal rosid lineage Vitales, whereas Populus trichocarpa (Malpighiales) is grouped with malvids.
- Extended Data Figure 4: Dating of the Eucalyptus lineage-specific whole-genome duplication event. (191 KB)
a, Eucalyptus Ks whole-paranome (the set of all duplicate genes in the genome) age distribution. On the x axis the Ks is plotted (bin size of 0.1); on the y axis the number of retained duplicate paralogous gene pairs is plotted. b, Eucalyptus Ks anchor age distribution. On the x axis the Ks is plotted (bin size of 0.1); on the y axis the number of retained duplicate anchors is plotted. Anchors falling within the Ks range of 0.8–1.5 were used for absolute dating. c, Eucalyptus absolute dated anchors from the most recent WGD. The smooth green curve represents the maximum likelihood normal fit of dated anchors derived from the most recent WGD in Eucalyptus, whereas the blue dots represent a histogram of the raw data. The dashed line indicates the ML estimate of the distribution mode, whereas the dotted lines delimit the corresponding 95% confidence intervals. The mode of dated anchors is estimated at 109.93 Myr ago with its lower and upper 95% boundaries at 105.96 and 113.91 Myr ago, respectively. d, Genome duplication pattern in the core eudicot (rosid and asterid) ancestor and lineages leading to Solanum (asterid), Vitis and Eucalyptus (basal rosids) and the core rosids. The three Eucalyptus (E1–E3), Vitis (V1–V3) and Solanum (S1–S3) orthologues were generated by the shared hexaploidy event (purple box, ~130 to 150 Myr ago) and an additional set of Eucalyptus orthologues (E1′–E3′) were created in the lineage-specific WGD (orange boxes, ~110 Myr ago).
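The binning and anchor selection described in panels a to c can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' dating procedure: the Ks values and ages are hypothetical, the 0.8–1.5 window is treated as inclusive, and the confidence interval is the usual normal approximation for the mean of a fitted normal distribution (whose mode equals its mean).

```python
import math
from collections import Counter
from statistics import fmean, pstdev

BIN_SIZE = 0.1          # Ks bin size given in the caption
KS_WINDOW = (0.8, 1.5)  # Ks range of anchors used for absolute dating


def bin_ks(ks_values, bin_size=BIN_SIZE):
    """Histogram of Ks values: bin index -> number of duplicate pairs."""
    # The small epsilon guards against floating-point edge effects at bin borders.
    counts = Counter(int(math.floor(ks / bin_size + 1e-9)) for ks in ks_values)
    return dict(sorted(counts.items()))


def select_anchors(ks_values, window=KS_WINDOW):
    """Keep anchors whose Ks lies inside the window (inclusive by assumption)."""
    low, high = window
    return [ks for ks in ks_values if low <= ks <= high]


def normal_fit(ages_myr):
    """ML normal fit: return (mean, lower, upper) with an approximate 95% CI."""
    mean = fmean(ages_myr)
    sd = pstdev(ages_myr)  # ML estimate divides by n, not n - 1
    half_width = 1.96 * sd / math.sqrt(len(ages_myr))
    return mean, mean - half_width, mean + half_width


# Hypothetical toy data.
toy_ks = [0.05, 0.12, 0.85, 0.9, 1.1, 1.49, 1.6, 2.3]
histogram = bin_ks(toy_ks)
anchors = select_anchors(toy_ks)
mode, lower, upper = normal_fit([100.0, 110.0, 120.0])
```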
- Extended Data Figure 5: Genome-wide analysis of tandem gene assemblies. (209 KB)
The number and distribution of contig breaks was evaluated for pairs of tandem genes (located within 50 kb of each other). a, Distribution of the number of contig breaks between gene pairs (blue bars) and cumulative proportion of gene pairs separated by contig breaks (black line). b, Distribution of the number of contig breaks per separation distance showing that the number of breaks is positively correlated with separation distance. The red line shows the distribution of distance between gene pairs with three or more contig breaks. c, Distribution of Ks divergence of tandem gene pairs in clusters with exactly two tandem genes showing a gradient of similarity (that is, age of duplication) expected for authentic tandem gene pairs. d, Rate of tandem gene duplication (TD) and gene loss in Eucalyptus grandis (Eg), Populus trichocarpa (Pt), Vitis vinifera (Vv) and Arabidopsis thaliana (At). All of the rosid genomes (except Arabidopsis) exhibit constant rates of tandem duplication and loss. The rate of tandem gene duplication in Eucalyptus has been stable and consistently higher than in Populus and Vitis. 1 Myr ~ 0.0026 transversions at fourfold degenerate sites, consistent with Populus and Eucalyptus having diverged ~100 Myr ago.
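Two quantities in this caption lend themselves to a short worked example: the 50 kb distance criterion for tandem gene pairs and the calibration of 0.0026 transversions at fourfold degenerate sites per Myr. The Python sketch below is illustrative only. The `Gene` record, the gene names and coordinates are hypothetical, and measuring distance as the gap between the nearest gene ends is an assumption, since the caption does not say how the 50 kb is measured.

```python
from dataclasses import dataclass

MAX_TANDEM_DISTANCE_BP = 50_000  # "within 50 kb of each other"
TRANSVERSIONS_PER_MYR = 0.0026   # calibration quoted in the caption


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record."""
    name: str
    chromosome: str
    start: int
    end: int


def within_tandem_distance(a: Gene, b: Gene, max_bp: int = MAX_TANDEM_DISTANCE_BP) -> bool:
    """True if both genes lie on one chromosome and the gap between them is <= max_bp."""
    if a.chromosome != b.chromosome:
        return False
    # Assumption: distance is the gap between the nearest ends (0 if overlapping).
    gap = max(a.start, b.start) - min(a.end, b.end)
    return max(gap, 0) <= max_bp


def distance_to_myr(transversion_distance: float, rate: float = TRANSVERSIONS_PER_MYR) -> float:
    """Convert a fourfold-degenerate transversion distance to Myr using the quoted rate."""
    return transversion_distance / rate


# Hypothetical coordinates.
g1 = Gene("gene_1", "Chr01", 1_000, 3_000)
g2 = Gene("gene_2", "Chr01", 40_000, 42_000)
g3 = Gene("gene_3", "Chr01", 120_000, 123_000)
```

With this calibration, a distance of 0.26 corresponds to about 100 Myr, matching the Populus and Eucalyptus divergence time given in the caption.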
- Extended Data Figure 6: Illumina PE100 read coverage of the ~760-kb region containing an R2R3–MYB tandem gene array. (217 KB)
Illumina PE100 reads generated from BRASUZ1 (E. grandis) and X46 (E. globulus) were aligned to the E. grandis (BRASUZ1, V1.0) genome assembly, and insert (green bars) and sequence (blue line) coverage investigated for the ~760-kb region including an R2R3–MYB tandem array (details in Supplementary Data 3) in the E. grandis genome assembly. a, Read coverage profile of the BRASUZ1 reads mapped to the region showing 1× relative coverage across all nine of the tandem duplicates (red blocks) in the region, and b, X46 (E. globulus) reads mapped to the region showing 1× relative coverage on approximately half of the region with some tandem duplicates apparently absent from the E. globulus genome. Note that insert coverage (green bars) is relatively higher for E. globulus (X46, panel b) due to the larger insert size of the genomic library sequenced for X46 (~300 bp) than for BRASUZ1 (~150 bp).
- Extended Data Figure 7: Alternative homozygous classes observed in the 28 M35D2 siblings as a function of position on chromosomes 1–11. (223 KB)
Several peaks of conserved heterozygosity (peaks >80%) are seen on all chromosomes except 5 and 11. A region of 25 Mb on chromosome 4 from 11 to 36 Mb is completely devoid of homozygous versions of one of the alleles (red line), but has roughly 25–32% of the siblings homozygous for the other allele (green line) and the rest heterozygous in a roughly 1:2 ratio. The blue line is the total proportion of siblings out of 28 that are heterozygous in the region. One would expect 50% under the null model, but almost the entire chromosome is biased towards heterozygosity. In several other regions (for example, chromosomes 6, 7, 9 and 10) both homozygous classes are depleted, suggesting the presence of genetic load at different loci along the two parental homologues and explaining the strong selection for heterozygosity in such regions.
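The three curves in this figure are genotype-class proportions among the 28 siblings at each position. A minimal Python sketch of that tabulation follows, assuming the siblings derive from self-fertilization of a heterozygous parent, which is what the 50% heterozygosity null expectation implies. The genotype labels (`AA`, `AB`, `BB`) and the toy calls are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Null expectation for selfed progeny of a heterozygous parent (1:2:1).
EXPECTED = {"AA": 0.25, "AB": 0.50, "BB": 0.25}


def genotype_proportions(calls: List[str]) -> Dict[str, float]:
    """Proportion of siblings in each genotype class at one position or window."""
    counts = Counter(calls)
    n = len(calls)
    return {genotype: counts.get(genotype, 0) / n for genotype in EXPECTED}


def excess_heterozygosity(calls: List[str]) -> float:
    """Observed minus expected heterozygous proportion (positive means excess)."""
    return genotype_proportions(calls)["AB"] - EXPECTED["AB"]


# Hypothetical window resembling the chromosome 4 region: one homozygous
# class is absent, 8 of 28 siblings carry the other, and 20 are heterozygous.
toy_calls = ["BB"] * 8 + ["AB"] * 20
proportions = genotype_proportions(toy_calls)
```

In this toy window about 29% of siblings are homozygous for one allele and about 71% are heterozygous, well above the 50% expected under the null model.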
- Extended Data Figure 8: Genes involved in lignin biosynthesis in woody tissues of Eucalyptus. (378 KB)
Relative (yellow–blue scale) and absolute (white–red scale) expression profiles of secondary cell-wall-related genes implicated in lignin biosynthesis. Detailed gene annotation and mRNA-seq expression data are provided in Supplementary Data 9. Five novel Eucalyptus candidates that have not previously been associated with lignification are indicated by asterisks (Carocha et al., unpublished data). ST, shoot tips; YL, young leaves; ML, mature leaves; FL, floral buds; RT, roots; PH, phloem; IX, immature xylem. Absolute expression level (FPKM; ref. 50) is only shown for immature xylem.
- Extended Data Figure 9: Phylogenetic tree of R2R3 MYB sequences from subgroups expanded and/or preferentially found in woody species. (540 KB)
A total of 133 amino acid sequences from Eucalyptus grandis (50), Vitis vinifera (34), Populus trichocarpa (40), Arabidopsis thaliana (6) and Oryza sativa (3) were analysed, corresponding to three woody-expanded subgroups (subgroups 5, 6 and AtMYB5 based on Arabidopsis classification) and five woody-preferential subgroups (I through V). The latter do not contain any Arabidopsis or Oryza sequences. Sequences were aligned using MAFFT with the FFT-NS-i algorithm (ref. 69; Supplementary Data 10). Evolutionary history was inferred by constructing a neighbour-joining tree with 1,000 bootstrap replicates (bootstrap support is shown next to branches) using MEGA5 (ref. 70). The evolutionary distances were computed using the Jones-Taylor-Thornton substitution model and the rate variation among sites was modelled with a gamma distribution (shape parameter = 1). Positions containing gaps and missing data were not considered in the analysis. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. RNA-seq-based relative transcript abundance data for six different tissues, expressed in FPKM values (fragments per kilobase of exon per million fragments mapped), are shown for each Eucalyptus gene next to each subgroup. ST, shoot tips; YL, young leaves; ML, mature leaves; FL, flowers; PH, phloem; and IX, immature xylem.
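The FPKM unit spelled out in this caption (fragments per kilobase of exon per million fragments mapped) corresponds to a simple normalization, sketched below in Python. This is the textbook definition only; tools that report FPKM estimate fragment assignments and effective lengths in more elaborate ways, and the input numbers here are hypothetical.

```python
def fpkm(fragments: int, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of exon per million fragments mapped."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and total mapped fragments must be positive")
    kilobases = exon_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)


# Hypothetical gene: 500 fragments on 2,000 bp of exon in a library
# with 10 million mapped fragments.
example_fpkm = fpkm(500, 2_000, 10_000_000)
```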
- Extended Data Figure 10: Phylogenetic tree of type II MIKC MADS box proteins. (456 KB)
Neighbour-joining consensus tree of the type II MIKC sub-clade using protein sequences from Eucalyptus grandis, Arabidopsis thaliana, Populus trichocarpa and Vitis vinifera (Supplementary Data 11). Bootstrap values from 1,000 replicates were used to assess the robustness of the tree. Bootstrap values lower than 40% were removed from the tree. Eucalyptus genes are denoted with green dots, Arabidopsis genes with red dots, Populus genes with yellow dots and Vitis genes with blue dots. The gene model numbers from Populus and Vitis were abbreviated to better fit in the figure (P. trichocarpa, Pt; V. vinifera, Vv).
- Supplementary Information (1.9 MB)
This file contains Supplementary Notes, Supplementary Tables 1–18, a list of the Supplementary Data files and Supplementary References.
- Supplementary Data (5.5 MB)
This file contains Supplementary Data files 1–2 (see Supplementary Information p.52 for details).
- Supplementary Data (23.7 MB)
This file contains Supplementary Data files 3–7 (see Supplementary Information p.52 for details).
- Supplementary Data (16.9 MB)
This file contains Supplementary Data files 8–22 (see Supplementary Information p.52 for details).