The genome of Theobroma cacao

Journal name:
Nature Genetics
Year published:
Published online


We sequenced and assembled the draft genome of Theobroma cacao, an economically important tropical-fruit tree crop that is the source of chocolate. This assembly corresponds to 76% of the estimated genome size and contains almost all previously described genes, with 82% of these genes anchored on the ten T. cacao chromosomes. Analysis of this sequence information highlighted specific expansion of some gene families during evolution, for example, flavonoid-related genes. It also provides a major source of candidate genes for T. cacao improvement. Based on the inferred paleohistory of the T. cacao genome, we propose an evolutionary scenario whereby the ten T. cacao chromosomes were shaped from an ancestor through eleven chromosome fusions.

At a glance


  1. Figure 1: FISH analysis of T. cacao chromosomes.

    (a) In situ hybridization of T. cacao chromosomes stained with DAPI (blue) using a ThCen repeat probe (red). (b) In situ hybridization using Gaucho LTR retrotransposon (green) and ThCen repeat (red) probes.

  2. Figure 2: T. cacao genome heat map.

    The ten T. cacao chromosomes harboring 11 chromosome fusions (in black dotted boxes) identified in these genomes are illustrated according to their ancestral chromosomal origin (see paleo-chromosome color code in Fig. 4). Centromeres are marked 'Cent'. For the ten chromosomes, heat maps are provided for the CDS (blue <60%, yellow 60%–90% and red >90%), class I and II transposable elements (blue <80%, yellow >80% and red ~100%), ThCen and Gaucho elements (blue <50% of maximum, yellow ≥50% of maximum and red = maximum) and telomeric repeats (blue = 0, yellow <40% and red >40%). Only the elements present in the assembled part of the genome are represented. Therefore, the genome distribution of the repeated sequences represented in this figure could be biased due to the major limitations of de novo sequencing of complex genomes using next-generation sequencing (NGS), which is limited in its ability to assemble highly repeated sequences.

  3. Figure 3: Venn diagram showing the distribution of shared gene families among Theobroma cacao, Arabidopsis thaliana, Populus trichocarpa, Glycine max and Vitis vinifera.

    Numbers in parentheses indicate the number of genes in each cluster. The Venn diagram was created with web tools provided by the Bioinformatics and Systems Biology of Gent (see URLs).

  4. Figure 4: T. cacao genome paleohistory.

    (a) T. cacao genome synteny. A schematic representation of the orthologs identified between cacao chromosomes (c1 to c10) at the center and the grape (g1 to g19), Arabidopsis (a1 to a5), poplar (p1 to p19), soybean (s1 to s20) and papaya (p1 to p9) chromosomes. Each line represents an orthologous gene. The seven different colors used to represent the blocks reflect the origin from the seven ancestral eudicot linkage groups. (b) T. cacao genome duplication. The seven major triplicated chromosome groups in T. cacao (c1 to c10) are illustrated (colored blocks) and related with paralogous gene pairs identified between the T. cacao chromosomes (colored lines). The seven different colors reflect the seven ancestral eudicot linkage groups. (c) T. cacao genome evolutionary model updated from Abrouk et al.46. The eudicot chromosomes are represented with a seven-color code to illustrate the evolution of segments from a common ancestor with seven protochromosomes (top). The different lineage-specific shuffling events that have shaped the structure of the six genomes during their evolution from the common paleo-hexaploid ancestor are indicated as R (for rounds of whole-genome duplication (WGD)) and F (for fusions of chromosomes). The current structure of the eudicot genomes is represented at the bottom of the figure.
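
The color thresholds quoted in the Figure 2 caption can be written as a small binning sketch. This is only an illustration, not the authors' plotting code: the function names are hypothetical, the inputs are assumed to be per-window percentages (or, for ThCen and Gaucho, a value together with the track maximum), and the handling of exactly 40% for telomeric repeats is an assumption because the caption leaves that boundary unspecified. The transposable-element track is omitted because its stated bins (>80% and ~100%) do not define a sharp cutoff.

```python
def cds_color(pct: float) -> str:
    """CDS track: blue <60%, yellow 60%-90%, red >90% (per the caption)."""
    if pct < 60:
        return "blue"
    if pct <= 90:
        return "yellow"
    return "red"


def repeat_color(value: float, maximum: float) -> str:
    """ThCen/Gaucho tracks: blue <50% of max, yellow >=50% of max, red = max."""
    if value >= maximum:
        return "red"
    if value >= 0.5 * maximum:
        return "yellow"
    return "blue"


def telomeric_color(pct: float) -> str:
    """Telomeric repeats: blue = 0, yellow <40%, red >40%."""
    if pct == 0:
        return "blue"
    if pct < 40:
        return "yellow"
    # Assumption: exactly 40% is not covered by the caption; treated as red here.
    return "red"


if __name__ == "__main__":
    # Hypothetical window values, for illustration only.
    for pct in (35.0, 75.0, 95.0):
        print(pct, cds_color(pct))
```

Keeping one function per track mirrors the caption, where each track has its own scale and the bins are not interchangeable.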

Accession codes

Referenced accessions



  1. Davie, J.H. Chromosome studies in the Malvaceae and certain related families. II. Genetica 17, 487–498 (1935).
  2. Henderson, J.S., Joyce, R.A., Hall, G.R., Hurst, W.J. & McGovern, P.E. Chemical and archaeological evidence for the earliest cacao beverages. Proc. Natl. Acad. Sci. USA 104, 18937–18940 (2007).
  3. Coe, S.D. & Coe, M.D. The True History of Chocolate. (Thames and Hudson Ltd., London, England, 1996).
  4. Motamayor, J.C. et al. Cacao domestication I: the origin of the cacao cultivated by the Mayas. Heredity 89, 380–386 (2002).
  5. Motamayor, J.C., Risterucci, A.M., Heath, M. & Lanaud, C. Cacao domestication II: progenitor germplasm of the Trinitario cacao cultivar. Heredity 91, 322–330 (2003).
  6. Mooleedhar, V., Maharaj, W. & O'Brien, H. The collection of Criollo cocoa germplasm in Belize. Cocoa Grower's Bull. 49, 26–40 (1995).
  7. Cocoa Resources in consuming Countries–ICCO Market Committee, 10th meeting. EBRD Offices London, MC 10, 16 (2007).
  8. Paterson, A.H. et al. The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556 (2009).
  9. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
  10. Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
  11. Wolfgruber, T.K. et al. Maize centromere structure and evolution: sequence analysis of centromeres 2 and 5 reveals dynamic loci shaped primarily by retrotransposons. PLoS Genet. 5, e1000743 (2009).
  12. Foissac, S. et al. Genome annotation in plants and fungi: EuGène as a model platform. Curr. Bioinform. 3, 87–97 (2008).
  13. Schnable, P.S. et al. The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112–1115 (2009).
  14. Voinnet, O. Origin, biogenesis, and activity of plant microRNAs. Cell 136, 669–687 (2009).
  15. Griffiths-Jones, S., Saini, H.K., van Dongen, S. & Enright, A.J. miRBase: tools for microRNA genomics. Nucleic Acids Res. 36, D154–D158 (2008).
  16. Afzal, A.J., Wood, A.J. & Lightfoot, D.A. Plant receptor-like serine threonine kinases: roles in signaling and plant defense. Mol. Plant Microbe Interact. 21, 507–517 (2008).
  17. Diévart, A. & Clark, S.E. LRR-containing receptors regulating plant development and defense. Development 131, 251–261 (2004).
  18. Lehti-Shiu, M.D., Zou, C., Hanada, K. & Shiu, S.H. Evolutionary history and stress regulation of plant receptor-like kinase/pelle genes. Plant Physiol. 150, 12–26 (2009).
  19. DeYoung, B.J. & Innes, R.W. Plant NBS-LRR proteins in pathogen sensing and host defense. Nat. Immunol. 7, 1243–1249 (2006).
  20. Tarr, D.E.K. & Alexander, H.M. TIR-NBS-LRR genes are rare in monocots: evidence from diverse monocot orders. BMC Res. Notes 2, 197 (2009).
  21. Pan, Q., Wendel, J. & Fluhr, R. Divergent evolution of plant NBS-LRR resistance gene homologues in dicot and cereal genomes. J. Mol. Evol. 50, 203–213 (2000).
  22. Mukhtar, M.S., Nishimura, M.T. & Dangl, J. NPR1 in plant defense: it's not over 'til it's turned over. Cell 137, 804–806 (2009).
  23. Shi, Z., Maximova, S., Lui, Y., Verica, J. & Guiltinan, M.J. Functional analysis of the Theobroma cacao NPR1 gene in Arabidopsis. BMC Plant Biol. 10, 248 (2010).
  24. Lehmann, P. Structure and evolution of plant disease resistance genes. J. Appl. Genet. 43, 403–414 (2002).
  25. Lanaud, C. et al. A meta–QTL analysis of disease resistance traits of Theobroma cacao L. Mol. Breed. 24, 361–374 (2009).
  26. Griffiths, G. & Harwood, J.L. The regulation of triacylglycerol biosynthesis in cocoa (Theobroma cacao) L. Planta 184, 279–284 (1991).
  27. Beisson, F. et al. Arabidopsis genes involved in acyl lipid metabolism. A 2003 census of the candidates, a study of the distribution of expressed sequence tags in organs, and a web-based database. Plant Physiol. 132, 681 (2003).
  28. Pourcel, L., Routaboul, J., Cheynier, V., Lepiniec, L. & Debeaujon, I. Flavonoid oxidation in plants: from biochemical properties to physiological functions. Trends Plant Sci. 12, 29–36 (2007).
  29. Spencer, J.P. Flavonoids and brain health: multiple effects underpinned by common mechanisms. Genes Nutr. 4, 243–250 (2009).
  30. Rimbach, G., Melchin, M., Moehring, J. & Wagner, A.E. Polyphenols from cocoa and vascular health-a critical review. Int. J. Mol. Sci. 10, 4290–4309 (2009).
  31. Liu, Y. Molecular analysis of genes involved in the synthesis of proanthocyanidins in Theobroma cacao. Thesis 1146 (2010).
  32. Tomas-Barberan, F.A. et al. A new process to develop a cocoa powder with higher flavonoid monomer content and enhanced bioavailability in healthy humans. J. Agric. Food Chem. 55, 3926–3935 (2007).
  33. Liu, Y., Wang, H., Ye, H. & Li, G. Advances in the plant isoprenoid biosynthesis pathway and its metabolic engineering. J. Integr. Plant Biol. 47, 769–782 (2005).
  34. Ziegleder, G. Linalol contents as characteristics of some flavour grade cocoas. Z. Lebensm. Unters. Forsch. 191, 306–309 (1990).
  35. Chanliau, S. & Cros, E. Influence du traitement post-récolte et de la torréfaction sur le développement de l'arôme cacao. 12th Int. Cocoa Res. Conf., Salvador de Bahia (Brazil) 959–964 (1996).
  36. Tuskan, G.A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
  37. Argout, X. et al. Towards the understanding of the cocoa transcriptome: production and analysis of an exhaustive dataset of ESTs of Theobroma cacao generated from various tissues and under various conditions. BMC Genomics 9, 512 (2008).
  38. Lanaud, C. et al. Identification of QTLs related to fat content, seed size and sensorial traits in Theobroma cacao L. Proc. 14th Int. Cocoa Res. Conf. 13–18 (2003).
  39. Araújo, I.S. et al. Mapping of quantitative trait loci for butter content and hardness in cocoa beans (Theobroma cacao L.). Plant Mol. Biol. Rep. 27, 177–183 (2009).
  40. Tang, H. et al. Synteny and collinearity in plant genomes. Science 320, 486–488 (2008).
  41. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
  42. Schmutz, J. et al. Genome sequence of the palaeopolyploid soybean. Nature 463, 178–183 (2010).
  43. Ming, R. et al. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452, 991–996 (2008).
  44. Salse, J., Abrouk, M., Murat, F., Quraishi, U.M. & Feuillet, C. Improved criteria and comparative genomics tool provide new insights into grass paleogenomics. Briefings Bioinf. 10, 619–630 (2009).
  45. Salse, J. et al. Reconstruction of monocotelydoneous proto-chromosomes reveals faster evolution in plants than in animals. Proc. Natl. Acad. Sci. USA 106, 14908–14913 (2009).
  46. Abrouk, M. et al. Palaeogenomics of plants: synteny-based modelling of extinct ancestors. Trends Plant Sci. 15, 479–487 (2010).
  47. Murat, F. et al. Ancestral grass karyotype reconstruction unravels new mechanisms of genome shuffling as a source of plant evolution. Genome Res. 11, 1545–1547 (2010).
  48. Maximova, S.N. et al. Over-expression of a cacao class I chitinase gene in Theobroma cacao L. enhances resistance against the pathogen, Colletotrichum gloeosporioides. Planta 224, 740–749 (2006).
  49. Ammiraju, J.S.S. et al. The Oryza bacterial artificial chromosome library resource: construction and analysis of 12 deep-coverage large-insert BAC libraries that represent the 10 genome types of the genus Oryza. Genome Res. 16, 140–147 (2006).
  50. Aury, J.M. et al. High quality draft sequences for prokaryotic genomes using a mix of new sequencing technologies. BMC Genomics 9, 603 (2008).
  51. Marie, D. & Brown, S.C. A cytometric exercise in plant DNA histograms, with 2C values for 70 species. Biology of the Cell/Under the Auspices of the European Cell Biology Organization 78, 41–51 (1993).
  52. Pugh, T. et al. A new cacao linkage map based on codominant markers: development and integration of 201 new microsatellite markers. Theor. Appl. Genet. 108, 1151–1161 (2004).
  53. Fouet, O. et al. Structural characterization and mapping of functional EST-SSR markers in Theobroma cacao, in the press.
  54. Allegre, M. et al. A high-density consensus genetic map for Theobroma cacao L., in the press.
  55. D'hont, A. et al. Characterisation of the double genome structure of modern sugarcane cultivars (Saccharum spp.) by molecular cytogenetics. Mol. Gen. Genet. 250, 405–413 (1996).
  56. Degroeve, S., Saeys, Y., De Baets, B., Rouzé, P. & Van de Peer, Y. SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics 21, 1332–1338 (2005).
  57. Gremme, G., Brendel, V., Sparks, M.E. & Kurtz, S. Engineering a software tool for gene structure prediction in higher organisms. Inf. Softw. Technol. 47, 965–978 (2005).
  58. Schmutz, J. et al. Genome sequence of the palaeopolyploid soybean. Nature 463, 178–183 (2010).
  59. Li, L., Stoeckert, C.J. & Roos, D.S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).
  60. Goldman, N. & Yang, Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11, 725–736 (1994).

Author information

  1. These authors contributed equally to this work.

    • Xavier Argout,
    • Jerome Salse,
    • Jean-Marc Aury &
    • Mark J Guiltinan


  1. Centre de coopération Internationale en Recherche Agronomique pour le Développement (CIRAD)-Biological Systems Department-Unité Mixte de Recherche Développement et Amélioration des Plantes (UMR DAP) TA A 96/03-34398, Montpellier, France.

    • Xavier Argout,
    • Gaetan Droc,
    • Mathilde Allegre,
    • Thierry Legavre,
    • Olivier Fouet,
    • Manuel Ruiz,
    • Yolande Roguet,
    • Maguy Rodier-Goud,
    • Anne Dievart,
    • Christopher Viot,
    • Michel Boccara,
    • Ange Marie Risterucci,
    • Valentin Guignon,
    • Xavier Sabau,
    • Didier Clement,
    • Ronan Rivallan,
    • Bertrand Pitollat,
    • Angélique D'Hont,
    • Emmanuel Guiderdoni,
    • Stephanie Bocs &
    • Claire Lanaud
  2. Institut National de la Recherche Agronomique UMR 1095, Clermont-Ferrand, France.

    • Jerome Salse,
    • Michael Abrouk &
    • Florent Murat
  3. Commissariat à l'Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, Evry, France.

    • Jean-Marc Aury,
    • Julie Poulain &
    • Patrick Wincker
  4. Centre National de Recherche Scientifique (CNRS), UMR 8030, CP5706, Evry, France.

    • Jean-Marc Aury,
    • Julie Poulain &
    • Patrick Wincker
  5. Université d'Evry, Evry, France.

    • Jean-Marc Aury,
    • Julie Poulain &
    • Patrick Wincker
  6. Penn State University, Department of Horticulture and the Huck Institutes of the Life Sciences, University Park, Pennsylvania, USA.

    • Mark J Guiltinan &
    • Siela N Maximova
  7. Penn State University, Plant Biology Graduate Program and the Huck Institutes of the Life Sciences, University Park, Pennsylvania, USA.

    • Mark J Guiltinan,
    • Zi Shi &
    • Yufan Zhang
  8. Institut National de la Recherche Agronomique (INRA)-CNRS Laboratoire des Interactions Plantes Micro-organismes (LIPM), Castanet Tolosan Cedex, France.

    • Jerome Gouzy &
    • Erika Sallet
  9. UMR 5096 CNRS-Institut de Recherche pour le Développement (IRD)-Université de Perpignan Via Domitia (UPVD), Laboratoire Génome et Développement des Plantes, Perpignan Cedex, France.

    • Cristian Chaparro,
    • Jose Fernandes Barbosa-Neto,
    • Francois Sabot &
    • Olivier Panaud
  10. Arizona Genomics Institute and School of Plant Sciences, University of Arizona, Tucson, Arizona, USA.

    • Dave Kudrna,
    • Jetty Siva S Ammiraju,
    • Wolfgang Golser,
    • Xiang Song &
    • Rod Wing
  11. Penn State University, Department of Biochemistry and Molecular Biology, University Park, Pennsylvania, USA.

    • Stephan C Schuster
  12. Penn State University, the School of Forest Resources and the Huck Institutes of the Life Sciences, University Park, Pennsylvania, USA.

    • John E Carlson
  13. The Department of Bioenergy Science and Technology (WCU), Chonnam National University, Buk-Gu, Gwangju, Korea.

    • John E Carlson
  14. Unité de Biométrie et d'Intelligence Artificielle (UBIA), UR875 INRA, Castanet Tolosan, France.

    • Thomas Schiex
  15. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.

    • Melissa Kramer,
    • Laura Gelley,
    • Yufan Zhang &
    • W Richard McCombie
  16. INRA, UR 1279 Etude du Polymorphisme des Génomes Végétaux, CEA Institut de Génomique, Centre National de Génotypage, CP5724, Evry, France.

    • Aurélie Bérard &
    • Dominique Brunel
  17. Penn State University, Bioinformatics and Genomics PhD Program and Department of Biology, University Park, Pennsylvania, USA.

    • Michael J Axtell &
    • Zhaorong Ma
  18. Institut des Sciences du Végétal, UPR 2355, CNRS, Gif-sur-Yvette, France.

    • Spencer Brown,
    • Mickael Bourge &
    • Ismael Kebe
  19. Centre National de la Recherche Agronomique (CNRA), Divo, Côte d'Ivoire.

    • Mathias Tahi &
    • Joseph Moroh Akaza
  20. Comissão Executiva de Planejamento da Lavoura Cacaueira (CEPLAC), Itabuna Bahia, Brazil.

    • Karina Gramacho
  21. Centro Nacional de Biotecnología Agrícola, Instituto de Estudios Avanzados (IDEA), Caracas, Venezuela.

    • Diogenes Infante
  22. Chocolaterie VALRHONA, Tain l'Hermitage, France.

    • Pierre Costet
  23. Département de Biologie, Université d'Evry Val d'Essonne, Evry, France.

    • Francis Quetier


X.A., J.S., J.-M.A., M.J.G., J.G., D.K., M.J.A., S. Brown, K.G., A. D'Hont, A. Dievart, D.B., D.I., P.C., R.W., W.R.M., E.G., F.Q., O.P., P.W., S. Bocs and C.L. designed the analyses.

X.A., J.S., J.-M.A., M.J.G., J.G., M.R., D.K., M.J.A., S. Brown, A. D'Hont, D.B., W.R.M., O.P., P.W., S. Bocs and C.L. managed the various components of the project.

X.A., M.A., O.F., Y.R., A.B., M. Boccara, D.C., R.R., M.T., J.M.A., K.G., I.K., J.-M.A. and C.L. performed material preparation and multiplication, DNA and RNA extractions, genotyping, genetic mapping and anchoring of the assembly.

D.K., J.S.S.A., W.G. and X.S. constructed the BAC libraries.

J.-M.A., J.P., S.C.S., J.E.C., M.K., L.G. and W.R.M. performed sequencing and assembly.

X.A., G.D., J.G., M. Allegre, T.L., S.N.M., E.S., T.S., Z.S., C.V., V.G., Y.Z., B.P. and S. Bocs performed automatic and manual gene annotations and database management.

C.C., J.F.B.-N., F.S., A.M.R., M.J.A., Z.M., O.P. and S. Brown performed repeated elements and miRNA analyses.

M.R.-G., M. Bourge, S. Brown and A. D'Hont performed in situ hybridizations and genome-size evaluations.

M.J.G., G.D., T.L., S.N.M., M.R., A. Dievart, Z.S., X.S. and Y.Z. performed gene family analyses.

J.S., M. Abrouk and F.M. performed evolution analyses.

X.A., J.S., J.-M.A., M.J.G., G.D., J.G., C.C., T.L., S.N.M., M.R., M.R.-G., D.K., S.C.S., A. D'Hont, A. Dievart, X.S., M.J.A., S. Brown, P.C., F.Q., O.P., S. Bocs and C.L. wrote and/or revised the paper.

C.L. initiated and coordinated the whole project.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (6M)

    Supplementary Note, Supplementary Tables 1–19 and Supplementary Figures 1–18

Additional data