The genome of Eucalyptus grandis

Journal name: Nature
Volume: 510
Pages: 356–362
DOI: 10.1038/nature13308

Abstract

Eucalypts are the world’s most widely planted hardwood trees. Their outstanding diversity, adaptability and growth have made them a global renewable resource of fibre and energy. We sequenced and assembled >94% of the 640-megabase genome of Eucalyptus grandis. Of 36,376 predicted protein-coding genes, 34% occur in tandem duplications, the largest proportion thus far in plant genomes. Eucalyptus also shows the highest diversity of genes for specialized metabolites such as terpenes that act as chemical defence and provide unique pharmaceutical oils. Genome sequencing of the E. grandis sister species E. globulus and a set of inbred E. grandis tree genomes reveals dynamic genome evolution and hotspots of inbreeding depression. The E. grandis genome is the first reference for the eudicot order Myrtales and is placed here sister to the eurosids. This resource expands our understanding of the unique biology of large woody perennials and provides a powerful tool to accelerate comparative biology, breeding and biotechnology.


Figures

    Figure 1: Eucalyptus grandis genome overview.

    Genome features in 1-Mb intervals across the 11 chromosomes. Units on the circumference show megabase values and chromosomes. a, Gene density (number per Mb, range 6–131). b, Repeat coverage (22–88% per Mb). c, Average expression state (fragments per kilobase of exon per million sequences mapped, FPKM, per gene per Mb, 6–41 per Mb). d, Heterozygosity in inbred siblings (proportion of 28 S1 offspring heterozygous at position, 0.39–0.93). e, Telomeric repeats. f, Tandem duplication density (2–50). g, h, Single nucleotide polymorphisms (SNPs) identified by resequencing BRASUZ1 in 1-Mb bins (g) and per gene (h, 11,656 genes); homozygous regions (~24%) and genes in green and heterozygous regions and genes in purple. Central blue lines connect gene pairs from the most recent whole-genome duplication event (Supplementary Data 1).

    Figure 2: Eucalyptus grandis genome synteny, duplication pattern and evolutionary history.

    a, Paralogous gene pairs in Eucalyptus for the identified palaeohexaploidization (bottom) and palaeotetraploidization (top) events. Each line represents a duplicated gene, and colours reflect origin from the seven ancestral chromosomes (A1, A4, A7, A10, A13, A16, A19). b, Number of synonymous substitutions per synonymous site (Ks) distributions of Eucalyptus paralogues (top) and Eucalyptus–Vitis orthologues (bottom). Blue bars (top) indicate Ks values for 378 gene pairs from the palaeotetraploidization WGD event (red dot), and red bars show Ks values for 274 gene pairs of the palaeohexaploidization event (red star). c, Evolutionary scenario of genome rearrangements from the Eudicot ancestor to Eucalyptus and other sequenced plant genomes; palaeohistory modified from ref. 49.

    Figure 3: Genes involved in cellulose and xylan biosynthesis in wood-forming tissues of Eucalyptus.

    Relative (yellow–blue scale) and absolute (white–red scale) expression profiles of secondary cell-wall-related genes implicated in cellulose and xylan biosynthesis (ref. 29). Sugar and polymer intermediates are shown in green, while the proteins (enzymes) involved in each step are shown in blue. Detailed protein names, annotation and mRNA-seq expression data are provided in Supplementary Data 5. ST, shoot tips; YL, young leaves; ML, mature leaves; FL, floral buds; RT, roots; PH, phloem; IX, immature xylem. Absolute expression level (FPKM; ref. 50) is only shown for immature xylem, the target secondary cell-wall-producing tissue. DUF, domain of unknown function; GATL, galacturonosyl transferase-like; GUX, glucuronic acid substitution of xylan; HEX, hexokinase; INV, invertase; IRX, irregular xylem; PGM, phosphoglucomutase; SUSY, sucrose synthase; RWA, reduced wall acetylation; UGD, UDP-glucose dehydrogenase; UGP, UDP-glucose pyrophosphorylase; UXS, UDP-xylose synthase.

    Figure 4: Interspecific phylogenetic analysis and classification of terpene synthase (TPS) genes from Eucalyptus grandis and other sequenced plant genomes.

    The phylogenetic tree shows all TPS genes found in eight plant genomes (Supplementary Data 6). TPS subfamilies are indicated on the circumference of the circle. The tree has been rooted between the two major groups of type I and type III TPS. The table shows the number of TPS genes from several species obtained from a Pfam search for the two Pfam motifs (PF01397 and PF03936) found in the TPS genes. Colour coding in the table corresponds to that in the tree. The scale bar (0.3) shows the number of amino acid substitutions per site.

    Extended Data Fig. 1: RNA-seq-based expression evidence for predicted Eucalyptus grandis gene models.

    Gene expression was assessed with Illumina RNA-seq analysis (240 million RNA sequences from six tissues, mapped to 36,376 E. grandis genes, V1.1 annotation). Genes were counted as expressed in a tissue if a minimum of FPKM = 1.0 was observed in the tissue. A total of 23,485 gene models (64.6%) were detected in all six tissues compared here and 32,697 (89.9%) in at least one of the six tissues. Expression profiles for individual genes are accessible in the Eucalyptus Genome Integrative Explorer (EucGenIE, http://www.eucgenie.org/).
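The counting rule in this caption (a gene is counted as expressed in a tissue when its FPKM is at least 1.0) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the gene names and FPKM values are invented toy data, and `count_expressed` is a hypothetical helper. The last lines recompute the two percentages from the counts stated in the caption.

```python
from typing import Dict, List, Tuple

FPKM_THRESHOLD = 1.0  # caption: expressed if FPKM >= 1.0 in a tissue


def count_expressed(fpkm: Dict[str, List[float]],
                    threshold: float = FPKM_THRESHOLD) -> Tuple[int, int]:
    """Return (genes expressed in all tissues, genes expressed in at least one)."""
    in_all = sum(all(v >= threshold for v in values) for values in fpkm.values())
    in_any = sum(any(v >= threshold for v in values) for values in fpkm.values())
    return in_all, in_any


# Toy data: hypothetical gene names, one FPKM value per tissue (six tissues).
toy = {
    "geneA": [5.2, 3.1, 8.0, 1.0, 2.2, 40.5],  # at or above threshold everywhere
    "geneB": [0.0, 0.4, 0.9, 0.0, 0.1, 12.3],  # above threshold in one tissue
    "geneC": [0.0, 0.0, 0.2, 0.0, 0.0, 0.5],   # below threshold everywhere
}
in_all, in_any = count_expressed(toy)
print(in_all, in_any)

# Percentages recomputed from the counts given in the caption.
total_genes = 36376
pct_all = round(100 * 23485 / total_genes, 1)
pct_any = round(100 * 32697 / total_genes, 1)
print(pct_all, pct_any)
```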

    Extended Data Fig. 2: Sharing of protein-coding gene families, protein domains and domain arrangements in Eucalyptus, Arabidopsis, Populus and Vitis.

    a, The four rosid lineages have a total of 16,048 protein-coding gene clusters (from a total of 35,118 identified in 29 sequenced genomes; see Methods and Supplementary Information section 3) of which a core set of 6,926 clusters are shared among all four lineages. Of the 36,376 high-confidence annotated gene models in E. grandis, 30,341 (84%) are included in 10,049 clusters. E. grandis has 851 unique gene clusters (that is, not shared with any of the three other rosid genomes, but shared with at least one other of the 29 genomes). b, A total of 3,160 Pfam-A domains are shared among the four rosid lineages, the majority of which are single-domain arrangements (3,138 shared among the four lineages). Thirteen Pfam-A domains were only detected in Eucalyptus and 392 domain arrangements are specific to Eucalyptus in this four-way comparison.

    Extended Data Fig. 3: Green plant phylogeny based on shared gene clusters from 17 sequenced plant genomes.

    The phylogenetic tree was generated by RAxML analysis including at least one protein from at least half of the species per protein cluster in a concatenated MUSCLE alignment adjusted by Gblocks with liberal settings (Supplementary Data 7). The corresponding bootstrap partitions are provided at each node. The tree was rooted with Physcomitrella (a moss) as outgroup. The Myrtales lineage represented by Eucalyptus grandis is supported as sister to fabids and malvids (core rosid) clades together with the basal rosid lineage Vitales, whereas Populus trichocarpa (Malpighiales) is grouped with malvids.

    Extended Data Fig. 4: Dating of the Eucalyptus lineage-specific whole-genome duplication event.

    a, Eucalyptus Ks whole-paranome (the set of all duplicate genes in the genome) age distribution. On the x axis the Ks is plotted (bin size of 0.1); on the y axis the number of retained duplicate paralogous gene pairs is plotted. b, Eucalyptus Ks anchor age distribution. On the x axis the Ks is plotted (bin size of 0.1); on the y axis the number of retained duplicate anchors is plotted. Anchors falling within the Ks range of 0.8–1.5 were used for absolute dating. c, Eucalyptus absolute dated anchors from the most recent WGD. The smooth green curve represents the maximum likelihood normal fit of dated anchors derived from the most recent WGD in Eucalyptus, whereas the blue dots represent a histogram of the raw data. The dashed line indicates the ML estimate of the distribution mode, whereas the dotted lines delimit the corresponding 95% confidence intervals. The mode of dated anchors is estimated at 109.93 Myr ago with its lower and upper 95% boundaries at 105.96 and 113.91 Myr ago, respectively. d, Genome duplication pattern in the core eudicot (rosid and asterid) ancestor and lineages leading to Solanum (asterid), Vitis and Eucalyptus (basal rosids) and the core rosids. The three Eucalyptus (E1–E3), Vitis (V1–V3) and Solanum (S1–S3) orthologues were generated by the shared hexaploidy event (purple box, ~130 to 150 Myr ago) and an additional set of Eucalyptus orthologues (E1′–E3′) were created in the lineage-specific WGD (orange boxes, ~110 Myr ago).
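Panels a and b bin Ks values in steps of 0.1, and anchors with Ks between 0.8 and 1.5 are kept for dating. A minimal sketch of that bookkeeping is given below. The Ks values are made up, the function names are hypothetical, and treating both ends of the 0.8–1.5 range as inclusive is an assumption; the absolute dating step itself is not reproduced here.

```python
from collections import Counter
from math import floor
from typing import Dict, Iterable, List

BIN_SIZE = 0.1             # caption: Ks bin size of 0.1
KS_MIN, KS_MAX = 0.8, 1.5  # caption: anchors in this Ks range were dated


def ks_histogram(ks_values: Iterable[float],
                 bin_size: float = BIN_SIZE) -> Dict[int, int]:
    """Count Ks values per bin; bin i covers [i * bin_size, (i + 1) * bin_size)."""
    # Rounding before floor guards against float noise (0.8 / 0.1 is 7.999...).
    return dict(Counter(floor(round(ks / bin_size, 9)) for ks in ks_values))


def select_anchors(ks_values: Iterable[float],
                   lo: float = KS_MIN, hi: float = KS_MAX) -> List[float]:
    """Keep anchors whose Ks lies in [lo, hi] (inclusive bounds assumed)."""
    return [ks for ks in ks_values if lo <= ks <= hi]


# Toy Ks values for illustration only.
toy_ks = [0.05, 0.12, 0.80, 0.85, 0.91, 1.10, 1.49, 1.62, 2.30]
hist = ks_histogram(toy_ks)
dated = select_anchors(toy_ks)

for i in sorted(hist):
    print(f"{i * BIN_SIZE:.1f}-{(i + 1) * BIN_SIZE:.1f}: {hist[i]}")
print(dated)
```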

    Extended Data Fig. 5: Genome-wide analysis of tandem gene assemblies.

    The number and distribution of contig breaks were evaluated for pairs of tandem genes (located within 50 kb of each other). a, Distribution of the number of contig breaks between gene pairs (blue bars) and cumulative proportion of gene pairs separated by contig breaks (black line). b, Distribution of the number of contig breaks per separation distance showing that the number of breaks is positively correlated with separation distance. The red line shows the distribution of distance between gene pairs with three or more contig breaks. c, Distribution of Ks divergence of tandem gene pairs in clusters with exactly two tandem genes showing a gradient of similarity (that is, age of duplication) expected for authentic tandem gene pairs. d, Rate of tandem gene duplication (TD) and gene loss in Eucalyptus grandis (Eg), Populus trichocarpa (Pt), Vitis vinifera (Vv) and Arabidopsis thaliana (At). All of the rosid genomes (except Arabidopsis) exhibit constant rates of tandem duplication and loss. The rate of tandem gene duplication in Eucalyptus has been stable and consistently higher than in Populus and Vitis. 1 Myr ≈ 0.0026 transversions at fourfold degenerate sites, consistent with Populus and Eucalyptus having diverged ~100 Myr ago.
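The calibration quoted in panel d (1 Myr corresponds to about 0.0026 transversions at fourfold degenerate sites) implies a linear conversion between divergence and time. The sketch below assumes a constant rate and uses illustrative function names; the caption does not say whether the rate is per lineage or per pair, so the conversion is only indicative.

```python
RATE_PER_MYR = 0.0026  # caption: ~0.0026 fourfold-site transversions per Myr


def myr_to_distance(age_myr: float, rate: float = RATE_PER_MYR) -> float:
    """Expected transversion distance after age_myr million years."""
    return age_myr * rate


def distance_to_myr(distance: float, rate: float = RATE_PER_MYR) -> float:
    """Age in Myr implied by a transversion distance under a constant rate."""
    return distance / rate


# A ~100 Myr divergence corresponds to a distance of about 0.26 under this rate.
expected_distance = myr_to_distance(100.0)
print(round(expected_distance, 2))
```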

    Extended Data Fig. 6: Illumina PE100 read coverage of the ~760-kb region containing an R2R3–MYB tandem gene array.

    Illumina PE100 reads generated from BRASUZ1 (E. grandis) and X46 (E. globulus) were aligned to the E. grandis (BRASUZ1, V1.0) genome assembly, and insert (green bars) and sequence (blue line) coverage were investigated for the ~760-kb region including an R2R3–MYB tandem array (details in Supplementary Data 3) in the E. grandis genome assembly. a, Read coverage profile of the BRASUZ1 reads mapped to the region showing 1× relative coverage across all nine of the tandem duplicates (red blocks) in the region, and b, X46 (E. globulus) reads mapped to the region showing 1× relative coverage on approximately half of the region with some tandem duplicates apparently absent from the E. globulus genome. Note that insert coverage (green bars) is relatively higher for E. globulus (X46, panel b) due to the larger insert size of the genomic library sequenced for X46 (~300 bp) than for BRASUZ1 (~150 bp).

    Extended Data Fig. 7: Alternative homozygous classes observed in the 28 M35D2 siblings as a function of position on chromosomes 1–11.

    Several peaks of conserved heterozygosity (peaks >80%) are seen on all chromosomes except 5 and 11. A region of 25 Mb on chromosome 4 from 11 to 36 Mb is completely devoid of homozygous versions of one of the alleles (red line), but has roughly 25–32% of the siblings homozygous for the other allele (green line) and the rest heterozygous in a roughly 1:4 ratio. The blue line is the total proportion of siblings out of 28 that are heterozygous in the region. One would expect 50% under the null model, but almost the entire chromosome is biased towards heterozygosity. In several other regions (for example, chromosomes 6, 7, 9 and 10) both homozygous classes are depleted, suggesting the presence of genetic load at different loci along the two parental homologues and explaining the strong selection for heterozygosity in such regions.
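The 50% expectation follows from Mendelian segregation in selfed offspring of a heterozygous parent: genotypes are expected in a 1:2:1 ratio, so half of the siblings should be heterozygous at a given locus. The sketch below asks how unlikely a given heterozygote count would be among 28 siblings under that null model, using an exact binomial tail. The count of 24 is an illustrative value rather than a figure from the paper, and treating the siblings as independent draws is an assumption.

```python
from math import comb

N_SIBLINGS = 28   # caption: 28 siblings
P_HET_NULL = 0.5  # 1:2:1 segregation gives 50% heterozygotes


def binom_tail(k: int, n: int = N_SIBLINGS, p: float = P_HET_NULL) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


expected_het = N_SIBLINGS * P_HET_NULL  # siblings expected to be heterozygous

# Illustrative case: 24 of 28 siblings heterozygous (about 86%).
p_value = binom_tail(24)
print(expected_het, p_value)
```

A small tail probability at a single locus would not by itself demonstrate selection, since many positions are scanned along each chromosome; it only indicates how far a region departs from the 50% expectation.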

    Extended Data Fig. 8: Genes involved in lignin biosynthesis in woody tissues of Eucalyptus.

    Relative (yellow–blue scale) and absolute (white–red scale) expression profiles of secondary cell-wall-related genes implicated in lignin biosynthesis. Detailed gene annotation and mRNA-seq expression data are provided in Supplementary Data 9. Five novel Eucalyptus candidates that have not previously been associated with lignification are indicated by asterisks (Carocha et al., unpublished data). ST, shoot tips; YL, young leaves; ML, mature leaves; FL, floral buds; RT, roots; PH, phloem; IX, immature xylem. Absolute expression level (FPKM; ref. 50) is only shown for immature xylem.

    Extended Data Fig. 9: Phylogenetic tree of R2R3 MYB sequences from subgroups expanded and/or preferentially found in woody species.

    A total of 133 amino acid sequences from Eucalyptus grandis (50), Vitis vinifera (34), Populus trichocarpa (40), Arabidopsis thaliana (6) and Oryza sativa (3) corresponding to three woody-expanded (subgroups 5, 6 and AtMYB5 based on Arabidopsis classification) and five woody-preferential subgroups (I through V). The latter contain no Arabidopsis or Oryza sequences. Sequences were aligned using MAFFT with the FFT-NS-i algorithm (ref. 69; Supplementary Data 10). Evolutionary history was inferred by constructing a neighbour-joining tree with 1,000 bootstrap replicates (bootstrap support is shown next to branches) using MEGA5 (ref. 70). The evolutionary distances were computed using the Jones-Taylor-Thornton substitution model and the rate variation among sites was modelled with a gamma distribution of 1. Positions containing gaps and missing data were not considered in the analysis. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. RNA-seq-based relative transcript abundance data for six different tissues, expressed in FPKM values (fragments per kilobase of exon per million fragments mapped), are shown for each Eucalyptus gene next to each subgroup. ST, shoot tips; YL, young leaves; ML, mature leaves; FL, flowers; PH, phloem; and IX, immature xylem.

    Extended Data Fig. 10: Phylogenetic tree of type II MIKC MADS box proteins.

    Neighbour-joining consensus tree of the type II MIKC sub-clade using protein sequences from Eucalyptus grandis, Arabidopsis thaliana, Populus trichocarpa and Vitis vinifera (Supplementary Data 11). Bootstrap values from 1,000 replicates were used to assess the robustness of the tree. Bootstrap values lower than 40% were removed from the tree. Eucalyptus genes are denoted with green dots, Arabidopsis genes with red dots, Populus genes with yellow dots and Vitis genes with blue dots. The gene model numbers from Populus and Vitis were abbreviated to better fit in the figure (P. trichocarpa, Pt; V. vinifera, Vv).

Accession codes

Primary accessions

GenBank/EMBL/DDBJ

References

  1. Byrne, M. Phylogeny, diversity and evolution of eucalypts. in Plant Genome: Biodiversity and Evolution, Part E: Phanerogams-Angiosperm Vol. 1 (eds Sharma, A. K. & Sharma, A.) 303–346 (Science Publishers, 2008)
  2. Iglesias, I. & Wiltermann, D. in Eucalyptologics Information Resources on Eucalypt Cultivation Worldwide http://www.git-forestry.com (GIT Forestry Consulting, retrieved, 29 March 2009)
  3. Bauhus, J., van der Meer, P. J. & Kanninen, M. Ecosystem Goods and Services from Plantation Forests 254 (Earthscan, 2010)
  4. Costa e Silva, J., Hardner, C., Tilyard, P. & Potts, B. M. The effects of age and environment on the expression of inbreeding depression in Eucalyptus globulus. Heredity 107, 50–60 (2011)
  5. Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006)
  6. Kullan, A. R. K. et al. High-density genetic linkage maps with over 2,400 sequence-anchored DArT markers for genetic dissection in an F2 pseudo-backcross of Eucalyptus grandis × E. urophylla. Tree Genet. Genomes 8, 163–175 (2012)
  7. Petroli, C. D. et al. Genomic characterization of DArT markers based on high-density linkage analysis and physical mapping to the Eucalyptus genome. PLoS ONE 7, e44684 (2012)
  8. Wang, H. et al. Rosid radiation and the rapid rise of angiosperm-dominated forests. Proc. Natl Acad. Sci. USA 106, 3853–3858 (2009)
  9. D'Hont, A. et al. The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488, 213–217 (2012)
  10. Shulaev, V. et al. The genome of woodland strawberry (Fragaria vesca). Nature Genet. 43, 109–116 (2011)
  11. Martin, W., Deusch, O., Stawski, N., Grunheit, N. & Goremykin, V. Chloroplast genome phylogenetics: why we need independent approaches to plant molecular evolution. Trends Plant Sci. 10, 203–209 (2005)
  12. Jeffroy, O., Brinkmann, H., Delsuc, F. & Philippe, H. Phylogenomics: the beginning of incongruence? Trends Genet. 22, 225–231 (2006)
  13. Fawcett, J. A., Maere, S. & Van de Peer, Y. Plants with double genomes might have had a better chance to survive the Cretaceous-Tertiary extinction event. Proc. Natl Acad. Sci. USA 106, 5737–5742 (2009)
  14. Bell, C. D., Soltis, D. E. & Soltis, P. S. The age and diversification of the angiosperms re-revisited. Am. J. Bot. 97, 1296–1303 (2010)
  15. Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007)
  16. Hanada, K., Zou, C., Lehti-Shiu, M. D., Shinozaki, K. & Shiu, S. H. Importance of lineage-specific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol. 148, 993–1003 (2008)
  17. Freeling, M. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annu. Rev. Plant Biol. 60, 433–453 (2009)
  18. Hudson, C. J. et al. High synteny and colinearity among Eucalyptus genomes revealed by high-density comparative genetic mapping. Tree Genet. Genomes 8, 339–352 (2012)
  19. Grattapaglia, D. & Bradshaw, H. D. Nuclear DNA content of commercially important Eucalyptus species and hybrids. Can. J. For. Res. 24, 1074–1078 (1994)
  20. Brooker, M. I. H. A new classification of the genus Eucalyptus L'Her. (Myrtaceae). Aust. Syst. Bot. 13, 79–148 (2000)
  21. Crisp, M. D., Burrows, G. E., Cook, L. G., Thornhill, A. H. & Bowman, D. M. Flammable biomes dominated by eucalypts originated at the Cretaceous-Palaeogene boundary. Nature Commun. 2, 193 (2011)
  22. Ågren, J. A. & Wright, S. I. Co-evolution between transposable elements and their hosts: a major factor in genome size evolution? Chromosome Res. 19, 777–786 (2011)
  23. Külheim, C., Hui Yeoh, S., Maintz, J., Foley, W. & Moran, G. Comparative SNP diversity among four Eucalyptus species for genes from secondary metabolite biosynthetic pathways. BMC Genomics 10, 452 (2009)
  24. Novaes, E. et al. High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics 9, 312 (2008)
  25. Resende, M. D. et al. Genomic selection for growth and wood quality in Eucalyptus: capturing the missing heritability and accelerating breeding for complex traits in forest trees. New Phytol. 194, 116–128 (2012)
  26. Grattapaglia, D. et al. Progress in Myrtaceae genetics and genomics: Eucalyptus as the pivotal genus. Tree Genet. Genomes 8, 463–508 (2012)
  27. Groover, A. T. What genes make a tree a tree? Trends Plant Sci. 10, 210–214 (2005)
  28. Boerjan, W., Ralph, J. & Baucher, M. Lignin biosynthesis. Annu. Rev. Plant Biol. 54, 519–546 (2003)
  29. Mizrachi, E., Mansfield, S. D. & Myburg, A. A. Cellulose factories: advancing bioenergy production from forest trees. New Phytol. 194, 54–62 (2012)
  30. Scheller, H. V. & Ulvskov, P. Hemicelluloses. Annu. Rev. Plant Biol. 61, 263–289 (2010)
  31. Eschler, B. M., Pass, D. M., Willis, R. & Foley, W. J. Distribution of foliar formylated phloroglucinol derivatives amongst Eucalyptus species. Biochem. Syst. Ecol. 28, 813–824 (2000)
  32. Goodger, J. Q. & Woodrow, I. E. α,β-Unsaturated monoterpene acid glucose esters: structural diversity, bioactivities and functional roles. Phytochemistry 72, 2259–2266 (2011)
  33. Petit, R. J. & Hampe, A. Some evolutionary consequences of being a tree. Annu. Rev. Ecol. Evol. Syst. 37, 187–214 (2006)
  34. Lee, J. & Lee, I. Regulation and function of SOC1, a flowering pathway integrator. J. Exp. Bot. 61, 2247–2254 (2010)
  35. House, S. M. Reproductive biology of eucalypts. in Eucalypt Ecology: Individuals to Ecosystems (ed. Woinarski, J.) 30–56 (Cambridge Univ. Press, 1997)
  36. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003)
  37. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003)
  38. Remm, M., Storm, C. E. & Sonnhammer, E. L. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041–1052 (2001)
  39. Ostlund, G. et al. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 38, D196–D203 (2010)
  40. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 37, D211–D215 (2009)
  41. Bendtsen, J. D., Nielsen, H., von Heijne, G. & Brunak, S. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340, 783–795 (2004)
  42. Small, I., Peeters, N., Legeai, F. & Lurin, C. Predotar: A tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics 4, 1581–1590 (2004)
  43. Sonnhammer, E. L., von Heijne, G. & Krogh, A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6, 175–182 (1998)
  44. Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006)
  45. Salse, J., Abrouk, M., Murat, F., Quraishi, U. M. & Feuillet, C. Improved criteria and comparative genomics tool provide new insights into grass paleogenomics. Brief. Bioinform. 10, 619–630 (2009)
  46. Salse, J. et al. Reconstruction of monocotelydoneous proto-chromosomes reveals faster evolution in plants than in animals. Proc. Natl Acad. Sci. USA 106, 14908–14913 (2009)
  47. Frazer, K. A., Pachter, L., Poliakov, A., Rubin, E. M. & Dubchak, I. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32, W273–W279 (2004)
  48. Dubchak, I., Poliakov, A., Kislyuk, A. & Brudno, M. Multiple whole-genome alignments without a reference organism. Genome Res. 19, 682–689 (2009)
  49. Salse, J. In silico archeogenomics unveils modern plant genome organisation, regulation and evolution. Curr. Opin. Plant Biol. 15, 122–130 (2012)
  50. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol. 28, 511–515 (2010)
  51. Kent, W. J. BLAT–the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002)
  52. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997)
  53. Youens-Clark, K. et al. Gramene database in 2010: updates and extensions. Nucleic Acids Res. 39, D1085–D1094 (2011)
  54. Jaiswal, P. Gramene database: a hub for comparative plant genomics. Methods Mol. Biol. 678, 247–275 (2011)
  55. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 25, 25–29 (2000)
  56. Robbertse, B., Yoder, R. J., Boyd, A., Reeves, J. & Spatafora, J. W. Hal: an automated pipeline for phylogenetic analyses of genomic data. PLoS Curr. 3, RRN1213 (2011)
  57. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29–W37 (2011)
  58. Buljan, M., Frankish, A. & Bateman, A. Quantifying the mechanisms of domain gain in animal proteins. Genome Biol. 11, R74 (2010)
  59. Ekman, D., Bjorklund, A. K., Frey-Skott, J. & Elofsson, A. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J. Mol. Biol. 348, 231–243 (2005)
  60. Forslund, K., Henricson, A., Hollich, V. & Sonnhammer, E. L. Domain tree-based analysis of protein architecture evolution. Mol. Biol. Evol. 25, 254–264 (2008)
  61. Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009)
  62. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
  63. Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009)
  64. Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A. & Eddy, S. R. Rfam: an RNA family database. Nucleic Acids Res. 31, 439–441 (2003)
  65. Gardner, P. P. et al. Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res. 39, D141–D145 (2011)
  66. Brudno, M. et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13, 721–731 (2003)
  67. Peterson, D. G., Kevin, S. & Stephen, M. Isolation of milligram quantities of nuclear DNA from tomato (Lycopersicon esculentum), a plant containing high levels of polyphenolic compounds. Plant Mol. Biol. Rep. 15, 148–153 (1997)
  68. Tibbits, J. F. G., McManus, L. J., Spokevicius, A. V. & Bossinger, G. A rapid method for tissue collection and high-throughput isolation of genomic DNA from mature trees. Plant Mol. Biol. Rep. 24, 81–91 (2006)
  69. Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002)
  70. Tamura, K., Dudley, J., Nei, M. & Kumar, S. MEGA4: Molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24, 1596–1599 (2007)


Author information

Affiliations

  1. Department of Genetics, Forestry and Agricultural Biotechnology Institute (FABI), University of Pretoria, Private bag X20, Pretoria 0028, South Africa

    • Alexander A. Myburg,
    • Eshchar Mizrachi,
    • Anand R. K. Kullan,
    • Steven G. Hussey,
    • Desre Pinard,
    • Karen van der Merwe &
    • Pooja Singh
  2. Genomics Research Institute (GRI), University of Pretoria, Private bag X20, Pretoria 0028, South Africa

    • Alexander A. Myburg,
    • Eshchar Mizrachi,
    • Anand R. K. Kullan,
    • Steven G. Hussey,
    • Desre Pinard,
    • Karen van der Merwe,
    • Pooja Singh,
    • Fourie Joubert &
    • Yves Van de Peer
  3. Laboratório de Genética Vegetal, EMBRAPA Recursos Genéticos e Biotecnologia, EPQB Final W5 Norte, 70770-917 Brasília, Brazil

    • Dario Grattapaglia,
    • Marilia R. Pappas,
    • Danielle A. Faria,
    • Carolina P. Sansaloni &
    • Cesar D. Petroli
  4. Programa de Ciências Genômicas e Biotecnologia - Universidade Católica de Brasília SGAN 916, 70790-160 Brasília, Brazil

    • Dario Grattapaglia
  5. US Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, California 94598, USA

    • Gerald A. Tuskan,
    • Uffe Hellsten,
    • Richard D. Hayes,
    • Erika Lindquist,
    • Hope Tice,
    • Diane Bauer,
    • David M. Goodstein,
    • Inna Dubchak,
    • Alexandre Poliakov,
    • Kerrie Barry,
    • Daniel S. Rokhsar &
    • Jeremy Schmutz
  6. Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA

    • Gerald A. Tuskan,
    • Xiaohan Yang,
    • Priya Ranjan,
    • Timothy J. Tschaplinski,
    • Chu-Yu Ye &
    • Ting Li
  7. HudsonAlpha Institute for Biotechnology, 601 Genome Way, Huntsville, Alabama 35801, USA

    • Jane Grimwood,
    • Jerry Jenkins &
    • Jeremy Schmutz
  8. Bioinformatics and Computational Biology Unit, Department of Biochemistry, University of Pretoria, Pretoria, Private bag X20, Pretoria 0028, South Africa

    • Ida van Jaarsveld,
    • Charles A. Hefer &
    • Fourie Joubert
  9. Laboratório de Bioinformática, EMBRAPA Recursos Genéticos e Biotecnologia, EPQB Final W5 Norte, 70770-917 Brasília, Brazil

    • Orzenil B. Silva-Junior &
    • Roberto C. Togawa
  10. Department of Plant Biotechnology and Bioinformatics (VIB), Ghent University, Technologiepark 927, B-9000 Ghent, Belgium

    • Lieven Sterck,
    • Kevin Vanneste &
    • Yves Van de Peer
  11. INRA/UBP UMR 1095, 5 Avenue de Beaulieu, 63100 Clermont Ferrand, France

    • Florent Murat &
    • Jérôme Salse
  12. Laboratoire de Recherche en Sciences Végétales, UMR 5546, Université Toulouse III, UPS, CNRS, BP 42617, 31326 Castanet Tolosan, France

    • Marçal Soler,
    • Hélène San Clemente,
    • Naijib Saidi,
    • Hua Cassan-Wang,
    • Christophe Dunand,
    • Victor Carocha &
    • Jacqueline Grima-Pettenati
  13. Department of Botany, University of British Columbia, 3529-6270 University Blvd, Vancouver V6T 1Z4, Canada

    • Charles A. Hefer
  14. Evolutionary Bioinformatics, Institute for Evolution and Biodiversity, University of Muenster, Huefferstrasse 1, D-48149, Muenster, Germany

    • Erich Bornberg-Bauer &
    • Anna R. Kersting
  15. Department of Bioinformatics, Institute for Computer Science, University of Duesseldorf, Universitätsstrasse 1, 40225 Düsseldorf, Germany

    • Anna R. Kersting
  16. Department of Forest Ecosystems and Society, Oregon State University, Corvallis, Oregon 97331, USA

    • Kelly Vining,
    • Vindhya Amarasinghe,
    • Martin Ranik &
    • Steven H. Strauss
  17. Department of Botany and Plant Pathology, Oregon State University, 2082-Cordley Hall, Corvallis, Oregon 97331, USA

    • Sushma Naithani,
    • Justin Elser,
    • Aaron Liston,
    • Joseph W. Spatafora,
    • Palitha Dharmwardhana,
    • Rajani Raja &
    • Pankaj Jaiswal
  18. Center for Genome Research and Biocomputing, Oregon State University, Corvallis, Oregon 97331, USA

    • Sushma Naithani,
    • Alexander E. Boyd,
    • Aaron Liston,
    • Joseph W. Spatafora,
    • Christopher Sullivan &
    • Pankaj Jaiswal
  19. Laboratório de Biologia Evolutiva Teórica e Aplicada, Departamento de Genética, Universidade Federal do Rio de Janeiro (UFRJ), Av. Prof. Rodolpho Paulo Rocco, 21949900 Rio de Janeiro, Brazil

    • Elisson Romanel
  20. Departamento de Biotecnologia, Escola de Engenharia de Lorena-Universidade de São Paulo (EEL-USP), CP116, 12602-810, Lorena-SP, Brazil

    • Elisson Romanel
  21. Laboratório de Genética Molecular Vegetal (LGMV), Departamento de Genética, Universidade Federal do Rio de Janeiro (UFRJ), Av. Prof. Rodolpho Paulo Rocco, 21949900 Rio de Janeiro, Brazil

    • Elisson Romanel &
    • Marcio Alves-Ferreira
  22. Research School of Biology, Australian National University, Canberra 0200, Australia

    • Carsten Külheim &
    • William Foley
  23. IICT/MNE; Palácio Burnay - Rua da Junqueira, 30, 1349-007 Lisboa, Portugal

    • Victor Carocha &
    • Jorge Paiva
  24. IBET/ITQB, Av. República, Quinta do Marquês, 2781-901 Oeiras, Portugal

    • Victor Carocha &
    • Jorge Paiva
  25. Arizona Genomics Institute, University of Arizona, Tucson, Arizona 85721, USA

    • David Kudrna
  26. Dep. de Fitopatologia, Universidade Federal de Viçosa, Viçosa 36570-000, Brazil

    • Sergio H. Brommonschenkel
  27. Centro de Biotecnologia, Universidade Federal do Rio Grande do Sul, 91501-970 Porto Alegre, Brazil

    • Giancarlo Pasquali
  28. Science and Conservation Division, Department of Parks and Wildlife, Locked Bag 104, Bentley Delivery Centre, Western Australia 6983, Australia

    • Margaret Byrne
  29. GYDLE, 1363 av. Maguire, suite 301, Québec, Quebec G1T 1Z2, Canada

    • Philippe Rigault
  30. Department of Environment and Primary Industries, Victorian Government, Melbourne, Victoria 3085, Australia

    • Josquin Tibbits
  31. Melbourne School of Land and Environment, University of Melbourne, Melbourne, Victoria 3010, Australia

    • Antanas Spokevicius
  32. School of Biological Sciences and National Centre for Future Forest Industries, University of Tasmania, Private Bag 55, Hobart, Tasmania 7001, Australia

    • Rebecca C. Jones,
    • Dorothy A. Steane,
    • René E. Vaillancourt &
    • Brad M. Potts
  33. Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Queensland 4558, Australia

    • Dorothy A. Steane
  34. Departamento de Biologia Celular, Universidade de Brasília, Brasília 70910-900, Brazil

    • Georgios J. Pappas

Contributions

A.A.M., D.G. and G.A.T. are the lead investigators and contributed equally to the work. J.Sc., J.J., J.G., R.D.H., D.M.G., I.D., A.P., U.H., D.S.R., E.L., H.T., D.B. and K.B. contributed to the assembly, annotation and sequence analysis, S.H.B., D.K. and D.G. to BAC library construction, G.P., S.H.B., M.R.P., D.A.F. and D.G. to various parts of biological sample collection, preparation and quality control, U.H., K.V., L.S., Y.V.d.P., F.M. and J.Sa. to genome duplication analyses, D.M.G., P.J. and J.E. to gene family clustering, U.H., P.J., J.E., A.L., A.E.B. and J.W.S. to green plant phylogeny, R.D.H., D.M.G., P.J., S.N. and R.R. to InterPro and Gene Ontology-based functional annotation, X.Y., C.-Y.Y., T.L., T.J.T., M.R.P. and G.J.P. to non-coding RNA analyses, I.v.J., E.M., F.J. and A.A.M. to 5′ UTR analysis, M.R., E.M., C.A.H., K.V.d.M., F.J. and A.A.M. to RNA sequencing and expression profiling, A.R.K., E.B.-B., E.M. and A.A.M. to protein domain and arrangement analysis, D.A.S., J.T., P.R. and A.S. to E. globulus genome resequencing and analysis, U.H., M.S., V.C., H.S.C., J.P., J.G.-P. and G.J.P. to tandem duplicate analysis, K.V., V.A., P.D., C.S., C.A.H., E.R., M.R., A.A.M., S.H.S., R.C.J. and M.A.-F. to MADS box analyses, E.M., D.P., P.J., F.J. and A.A.M. to cellulose, xylan and CAZyme analysis, V.C., J.P., C.D., E.M., A.A.M. and J.G.-P. to lignin biosynthesis gene analysis, J.G.-P., S.G.H., C.A.H., M.S., N.S., H.C.-W., H.S.C., J.T. and P.R. to NAC and MYB analysis, C.K., P.J. and W.F. to terpene synthase gene family analysis, G.J.P. and R.C.T. to transposable elements analysis, U.H., A.R.K.K., C.P.S., C.D.P., D.A.F., O.B.S.-J., D.G. and A.A.M. to genetic mapping, U.H., D.A.F., M.R.P., P.S. and D.G. to genetic load and heterozygosity analysis, and S.N. and P.J. to SDRLK gene family analysis. B.M.P., D.A.S., R.E.V. and M.B. contributed to taxonomic and biological background text. G.A.T. headed and K.B. managed the sequencing project, D.S.R. coordinated the bioinformatics activities, and A.A.M., D.G. and G.A.T. wrote and edited most of the manuscript. All authors read and commented on the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to:

The E. grandis whole-genome sequences are deposited in GenBank under accession number AUSX00000000. A genome browser and further information on the project are available at http://www.phytozome.net/eucalyptus.php.

Extended data figures and tables

Extended Data Figures

  1. Extended Data Figure 1: RNA-seq-based expression evidence for predicted Eucalyptus grandis gene models. (287 KB)

    Gene expression was assessed with Illumina RNA-seq analysis (240 million RNA sequences from six tissues, mapped to 36,376 E. grandis genes, V1.1 annotation). Genes were counted as expressed in a tissue if a minimum of FPKM = 1.0 was observed in the tissue. A total of 23,485 gene models (64.6%) were detected in all six tissues compared here and 32,697 (89.9%) in at least one of the six tissues. Expression profiles for individual genes are accessible in the Eucalyptus Genome Integrative Explorer (EucGenIE, http://www.eucgenie.org/).

  2. Extended Data Figure 2: Sharing of protein-coding gene families, protein domains and domain arrangements in Eucalyptus, Arabidopsis, Populus and Vitis. (524 KB)

    a, The four rosid lineages have a total of 16,048 protein-coding gene clusters (from a total of 35,118 identified in 29 sequenced genomes; see Methods and Supplementary Information section 3) of which a core set of 6,926 clusters are shared among all four lineages. Of the 36,376 high-confidence annotated gene models in E. grandis, 30,341 (84%) are included in 10,049 clusters. E. grandis has 851 unique gene clusters (that is, not shared with any of the three other rosid genomes, but shared with at least one other of the 29 genomes). b, A total of 3,160 Pfam-A domains are shared among the four rosid lineages, the majority of which are single-domain arrangements (3,138 shared among the four lineages). Thirteen Pfam-A domains were only detected in Eucalyptus and 392 domain arrangements are specific to Eucalyptus in this four-way comparison.

  3. Extended Data Figure 3: Green plant phylogeny based on shared gene clusters from 17 sequenced plant genomes. (262 KB)

    The phylogenetic tree was generated by RAxML analysis including at least one protein from at least half of the species per protein cluster in a concatenated MUSCLE alignment adjusted by Gblocks with liberal settings (Supplementary Data 7). The corresponding bootstrap partitions are provided at each node. The tree was rooted with Physcomitrella (a moss) as outgroup. The Myrtales lineage represented by Eucalyptus grandis is supported as sister to fabids and malvids (core rosid) clades together with the basal rosid lineage Vitales, whereas Populus trichocarpa (Malpighiales) is grouped with malvids.

  4. Extended Data Figure 4: Dating of the Eucalyptus lineage-specific whole-genome duplication event. (191 KB)

    a, Eucalyptus Ks whole-paranome (the set of all duplicate genes in the genome) age distribution. On the x axis the Ks is plotted (bin size of 0.1); on the y axis the number of retained duplicate paralogous gene pairs is plotted. b, Eucalyptus Ks anchor age distribution. On the x axis the Ks is plotted (bin size of 0.1); on the y axis the number of retained duplicate anchors is plotted. Anchors falling within the Ks range of 0.8–1.5 were used for absolute dating. c, Eucalyptus absolute dated anchors from the most recent WGD. The smooth green curve represents the maximum likelihood normal fit of dated anchors derived from the most recent WGD in Eucalyptus, whereas the blue dots represent a histogram of the raw data. The dashed line indicates the ML estimate of the distribution mode, whereas the dotted lines delimit the corresponding 95% confidence intervals. The mode of dated anchors is estimated at 109.93 Myr ago with its lower and upper 95% boundaries at 105.96 and 113.91 Myr ago, respectively. d, Genome duplication pattern in the core eudicot (rosid and asterid) ancestor and lineages leading to Solanum (asterid), Vitis and Eucalyptus (basal rosids) and the core rosids. The three Eucalyptus (E1–E3), Vitis (V1–V3) and Solanum (S1–S3) orthologues were generated by the shared hexaploidy event (purple box, ~130 to 150 Myr ago) and an additional set of Eucalyptus orthologues (E1′–E3′) were created in the lineage-specific WGD (orange boxes, ~110 Myr ago).

  5. Extended Data Figure 5: Genome-wide analysis of tandem gene assemblies. (209 KB)

    The number and distribution of contig breaks was evaluated for pairs of tandem genes (located within 50 kb of each other). a, Distribution of the number of contig breaks between gene pairs (blue bars) and cumulative proportion of gene pairs separated by contig breaks (black line). b, Distribution of the number of contig breaks per separation distance showing that the number of breaks is positively correlated with separation distance. The red line shows the distribution of distance between gene pairs with three or more contig breaks. c, Distribution of Ks divergence of tandem gene pairs in clusters with exactly two tandem genes showing a gradient of similarity (that is, age of duplication) expected for authentic tandem gene pairs. d, Rate of tandem gene duplication (TD) and gene loss in Eucalyptus grandis (Eg), Populus trichocarpa (Pt), Vitis vinifera (Vv) and Arabidopsis thaliana (At). All of the rosid genomes (except Arabidopsis) exhibit constant rates of tandem duplication and loss. The rate of tandem gene duplication in Eucalyptus has been stable and consistently higher than in Populus and Vitis. 1 Myr ~ 0.0026 transversions at fourfold degenerate sites, consistent with Populus and Eucalyptus having diverged ~100 Myr ago.

  6. Extended Data Figure 6: Illumina PE100 read coverage of the ~760-kb region containing an R2R3–MYB tandem gene array. (217 KB)

    Illumina PE100 reads generated from BRASUZ1 (E. grandis) and X46 (E. globulus) were aligned to the E. grandis (BRASUZ1, V1.0) genome assembly, and insert (green bars) and sequence (blue line) coverage investigated for the ~760-kb region including an R2R3–MYB tandem array (details in Supplementary Data 3) in the E. grandis genome assembly. a, Read coverage profile of the BRASUZ1 reads mapped to the region showing 1× relative coverage across all nine of the tandem duplicates (red blocks) in the region, and b, X46 (E. globulus) reads mapped to the region showing 1× relative coverage on approximately half of the region with some tandem duplicates apparently absent from the E. globulus genome. Note that insert coverage (green bars) is relatively higher for E. globulus (X46, panel b) due to the larger insert size of the genomic library sequenced for X46 (~300 bp) than for BRASUZ1 (~150 bp).

  7. Extended Data Figure 7: Alternative homozygous classes observed in the 28 M35D2 siblings as a function of position on chromosomes 1–11. (223 KB)

    Several peaks of conserved heterozygosity (peaks >80%) are seen on all chromosomes except 5 and 11. A region of 25 Mb on chromosome 4 from 11 to 36 Mb is completely devoid of homozygous versions of one of the alleles (red line), but has roughly 25–32% of the siblings homozygous for the other allele (green line) and the rest heterozygous in a roughly 1:4 ratio. The blue line is the total proportion of siblings out of 28 that are heterozygous in the region. One would expect 50% under the null model, but almost the entire chromosome is biased towards heterozygosity. In several other regions (for example, chromosomes 6, 7, 9 and 10) both homozygous classes are depleted, suggesting the presence of genetic load at different loci along the two parental homologues and explaining the strong selection for heterozygosity in such regions.

  8. Extended Data Figure 8: Genes involved in lignin biosynthesis in woody tissues of Eucalyptus. (378 KB)

    Relative (yellow–blue scale) and absolute (white–red scale) expression profiles of secondary cell-wall-related genes implicated in lignin biosynthesis. Detailed gene annotation and mRNA-seq expression data are provided in Supplementary Data 9. Five novel Eucalyptus candidates that have not previously been associated with lignification are indicated by asterisks (Carocha et al., unpublished data). ST, shoot tips; YL, young leaves; ML, mature leaves; FL, floral buds; RT, roots; PH, phloem; IX, immature xylem. Absolute expression level (FPKM; ref. 50) is only shown for immature xylem.

  9. Extended Data Figure 9: Phylogenetic tree of R2R3 MYB sequences from subgroups expanded and/or preferentially found in woody species. (540 KB)

    A total of 133 amino acid sequences from Eucalyptus grandis (50), Vitis vinifera (34), Populus trichocarpa (40), Arabidopsis thaliana (6) and Oryza sativa (3) corresponding to three woody-expanded (subgroups 5, 6 and AtMYB5 based on Arabidopsis classification) and five woody-preferential subgroups (I through V). The latter do not contain any Arabidopsis or Oryza sequences. Sequences were aligned using MAFFT with the FFT-NS-i algorithm (ref. 69) (Supplementary Data 10). Evolutionary history was inferred by constructing a Neighbour-joining tree with 1,000 bootstrap replicates (bootstrap support is shown next to branches) using MEGA5 (ref. 70). The evolutionary distances were computed using the Jones-Taylor-Thornton substitution model and the rate variation among sites was modelled with a gamma distribution (shape parameter 1). Positions containing gaps and missing data were not considered in the analysis. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. RNA-seq-based relative transcript abundance data for six different tissues, expressed in FPKM values (fragments per kilobase of exon per million fragments mapped), are shown for each Eucalyptus gene next to each subgroup. ST, shoot tips; YL, young leaves; ML, mature leaves; FL, flowers; PH, phloem; and IX, immature xylem.

  10. Extended Data Figure 10: Phylogenetic tree of type II MIKC MADS box proteins. (456 KB)

    Neighbour-joining consensus tree of the type II MIKC sub-clade using protein sequences from Eucalyptus grandis, Arabidopsis thaliana, Populus trichocarpa and Vitis vinifera (Supplementary Data 11). Bootstrap values from 1,000 replicates were used to assess the robustness of the tree. Bootstrap values lower than 40% were removed from the tree. Eucalyptus genes are denoted with green dots, Arabidopsis genes with red dots, Populus genes with yellow dots and Vitis genes with blue dots. The gene model numbers from Populus and Vitis were abbreviated to better fit in the figure (P. trichocarpa, Pt; V. vinifera, Vv).
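
    The 50% heterozygosity expectation quoted in the legend of Extended Data Figure 7 can be illustrated with a short calculation. The sketch below is not the authors' analysis: it assumes a single locus segregating 1:2:1 in selfed offspring of a heterozygous parent and asks how likely a heterozygosity peak of more than 80% (at least 23 of 28 siblings) would be by chance. The function names are hypothetical, and the calculation ignores linkage between loci and multiple testing across the genome.

```python
from math import comb


def expected_genotype_counts(n_offspring):
    """Expected (homozygous A, heterozygous, homozygous B) counts
    under 1:2:1 segregation in selfed offspring (assumed null model)."""
    return (n_offspring / 4, n_offspring / 2, n_offspring / 4)


def het_excess_tail(n_offspring, n_het, p_het=0.5):
    """Probability of observing at least n_het heterozygous offspring
    when each offspring is heterozygous independently with p_het."""
    return sum(
        comb(n_offspring, k) * p_het**k * (1 - p_het) ** (n_offspring - k)
        for k in range(n_het, n_offspring + 1)
    )


if __name__ == "__main__":
    n_siblings = 28  # number of siblings given in the legend
    print(expected_genotype_counts(n_siblings))
    # 23 of 28 is the smallest count exceeding 80% heterozygosity
    print(het_excess_tail(n_siblings, 23))
```

    Under these assumptions the expected number of heterozygous siblings is 14 of 28, and a single-locus peak above 80% has a chance probability below 0.001, which is in line with the legend's interpretation that such regions are under selection for heterozygosity.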

Supplementary information

PDF files

  1. Supplementary Information (1.9 MB)

    This file contains Supplementary Notes, Supplementary Tables 1-18, a list of the Supplementary Data files and Supplementary References.

Zip files

  1. Supplementary Data (5.5 MB)

    This file contains Supplementary Data files 1-2 - see Supplementary Information p.52 for details.

  2. Supplementary Data (23.7 MB)

    This file contains Supplementary Data files 3-7 - see Supplementary Information p.52 for details.

  3. Supplementary Data (16.9 MB)

    This file contains Supplementary Data files 8-22 - see Supplementary Information p.52 for details.

Additional data