Chromosome-level genome assembly of Ophiorrhiza pumila reveals the evolution of camptothecin biosynthesis

Plant genomes remain highly fragmented and are often characterized by hundreds to thousands of assembly gaps. Here, we report chromosome-level reference and phased genome assembly of Ophiorrhiza pumila, a camptothecin-producing medicinal plant, through an ordered multi-scaffolding and experimental validation approach. With 21 assembly gaps and a contig N50 of 18.49 Mb, Ophiorrhiza genome is one of the most complete plant genomes assembled to date. We also report 273 nitrogen-containing metabolites, including diverse monoterpene indole alkaloids (MIAs). A comparative genomics approach identifies strictosidine biogenesis as the origin of MIA evolution. The emergence of strictosidine biosynthesis-catalyzing enzymes precede downstream enzymes’ evolution post γ whole-genome triplication, which occurred approximately 110 Mya in O. pumila, and before the whole-genome duplication in Camptotheca acuminata identified here. Combining comparative genome analysis, multi-omics analysis, and metabolic gene-cluster analysis, we propose a working model for MIA evolution, and a pangenome for MIA biosynthesis, which will help in establishing a sustainable supply of camptothecin.


Supplementary
# Three pachytenes were used for measurements. # The relative length of chromosome (%) is in the percentage of the karyotype length. # Total length (Mb) is the length of assembled sequences for each chromosome. # The relative length of the sequence (%) is calculated using the assembled genome size. Gaps+unplaced contigs (X) 3,000 Supplementary Fig. 2. Assembly statistics for chromosome-scale plant genome assemblies deposited at the NCBI genome database. Chromosome-scale plant genome assemblies being published and deposited in NCBI genome database are included here, while Ophiorrhiza pumila genome was assembled in this study. All publicly available genome stats of monoterpene indole alkaloids producing plants were highlighted through red circle.  Supplementary Fig. 8. The Ophiorrhiza pumila reference genome assembly validation using Bionano de novo genome assembly as orthogonal evidence. All 11 chromosomes were supported by Bionano de novo assembly (blue color bars), with green and red color shades between these two assembly alignments representing regions of insertions or deletions, respectively. a b Supplementary Fig. 9. Hi-C contact map representing chromosome-scale phased genomes of Ophiorrhiza pumila. Hi-C contact matrix visualization for the O. pumila phased genomes, haplotig1 (a) and haplotig2 (b). The pixel intensity represents the count of Hi-C links at 150 Kb size-windows on the chromosomes on a logarithmic scale. Darker red color indicates higher contact probability, while white space represents no or fewer contacts. The blue lines separate chromosomes for each phased genome.      Supplementary Fig. 18. Estimated divergence and whole genome duplication for monoterpene indole alkaloids producing species. The lineage divergence time as branch length is indicated at each of the branch points in Million years ago (Mya). Divergence time for Ophiorrhiza pumila with coffee and Camptotheca acuminata was estimated at 47 Mya and 120 Mya, respectively. The whole-genome duplication for C. acuminata was identified and the duplication time was estimated. "#" and "$" depicts species from Rubiaceae and Apocynaceae families, respectively.     Supplementary Fig. 23. Accumulation of specialized metabolites in six tissues of Ophiorrhiza pumila. * and ** represent chemically assigned metabolites based on MS/MS analysis using the public database and pure standards, respectively; m/z-mass by charge ratio. The color intensity refers to the scaled median intensity of metabolites across tissues of O. pumila using five biological replicates for each tissue.

Supplementary Fig. 24. Coexpression based hierarchical clustering for genes assigned to enzymes from the secoiridoid biosynthesis pathway in
Ophiorrhiza pumila. Spearman's correlation coefficients were calculated using transcript expression data for seven tissues of O. pumila. Hierarchical clustering identified complete secoiridoid biosynthesis pathway as a gene cluster at the second order of the dendrogram (selected within a yellow-colored box). Source data are provided as a Source Data file. Supplementary Fig. 25. Coexpression based hierarchical clustering for genes assigned to enzymes from monoterpene indole alkaloid (MIA) biosynthesis in Ophiorrhiza pumila. Spearman's correlation coefficients were calculated using transcripts expression data for seven tissues of O. pumila. Genes associated with MIA biosynthesis and membered within a coexpressed gene-cluster (selected within a yellow-colored box) including genes assigned to secoiridoid biosynthesis pathway were selected. Genes colored as green are genes assigned to secoiridoid biosynthesis pathways. Source data are provided as a source Data file.   Supplementary Fig. 27. Gene-to-metabolite relationship network associated with MIA biosynthesis. The Pearson's correlation matrix was calculated using the psych package in R and drawn using Cytoscape (v3.6.0). The circle and square nodes represent genes assigned to secoiridoid and MIA biosynthesis pathways (shown in the Supplementary Fig. 26), and the metabolites, respectively. Edges represent gene-metabolite pairs with paired Pearson Correlation coefficient scores over 0.7 and corrected p-value < 0.05. Source data are provided as a Source Data file.  Orthogene family OG0015245 consisting of functionally characterized STRs and specifically gained in the strictosidine derived monoterpene indole alkaloids producing plant species. Evolutionary analysis of gene content was performed using COUNT software. Phylogenetic tree and orthogene classifications were performed as described in the material and method section. Empty rectangles at the nodes denote the absence of gene from an ancestral node, shaded node denotes presence. The green bar represents gene gain or expansion, while the red bar represents gene loss or contraction at the node.

Secologanin synthase (SLS)
Supplementary Fig. 31. Phylogenetic analysis of secologanin synthase (SLS) assigned to the orthogene families, OG0002438 and OG0013616. Gene family classification was performed based on 13 plant genomes, and the genes are colored based on the associated plant species. Functionally characterized genes are represented by "#" with the GenBank accession IDs. Strongly supported nodes with bootstrap value above 60% are shown.     Supplementary Fig. 34. Phylogenetic analysis of strictosidine beta-D-glucosidase (SGD) assigned to the orthogene families, OG0000057 and OG0016082. Gene family classification was performed based on 13 plant genomes, and the genes are colored based on the associated plant species. Functionally characterized genes are represented by "#" with the GenBank accession IDs. Strongly supported nodes with bootstrap value above 60% are shown.

Tetrahydroalstonine synthase (THAS)
Supplementary Fig. 35. Phylogenetic analysis of tetrahydroalstonine synthase (THAS) assigned to the orthogene family, OG0013119. Gene family classification was performed based on 13 plant genomes, and the genes are colored based on the associated plant species. Functionally characterized genes are represented by "#" with the GenBank accession IDs. Strongly supported nodes with bootstrap value above 60% are shown.

Coffea canephora Ophiorrhiza pumila Gelsemium sempervirens Catharanthus roseus
Sarpagan bridge enzyme (SBE) Supplementary Fig. 36. Phylogenetic analysis of sarpagan bridge enzyme (SBE) assigned to the orthogene family, OG0011713. Gene family classification was performed based on 13 plant genomes, and the genes are colored based on the associated plant species. Functionally characterized genes are represented by "#" with the GenBank accession IDs. Strongly supported nodes with bootstrap value above 60% are shown.
augustus masked-scaffold 61-processed-gene-3.1-mRNA-1 snap masked-scaffold 61-processed-gene-3.21-mRNA-1          Supplementary Fig. 39. Phylogenetic analysis of O-acetylstemmadenine oxidase (ASO/PAS) assigned to the orthogene families, OG0000009 and OG0003863. Gene family classification was performed based on 13 plant genomes, and the genes are colored based on the associated plant species. Functionally characterized genes are represented by "#" with the GenBank accession IDs. Strongly supported nodes with bootstrap value above 60% are shown.  Supplementary Fig. 40. Phylogenetic analysis of polyneuridine-aldehyde esterase (PR) assigned to the orthogene families, OG0000154, and OG0001044. Gene family classification was performed based on 13 plant genomes, and the genes are colored based on the associated plant species. Functionally characterized genes are represented by "#" with the GenBank accession IDs. Strongly supported nodes with bootstrap value above 60% are shown.  C. acuminata WGD OG0015245 (STR) Supplementary Fig. 41. Evolution of secoiridoid and MIA biosynthesis associated orthogene families in camptothecin producing plants and coffee genome. Median for synonymous substitutions per synonymous sites (Ks) were calculated using genes assigned to a given orthogene family for a plant species as described in method section. Small Ks-median values for several key enzymes associated with MIA biosynthesis in O. pumila and C. acuminata, while high Ks-median values for genes from coffee genome suggests faster and active evolution of the functional genes in the camptothecin producing plants post established STR enzymes. Results suggested evolution of STR as key event that preceded with evolution of genes associated with MIA biosynthesis, a factor that was missing for coffee genome. Source data are provided as a Source Data file. Deacetoxyvindoline 4-hydroxylase Opuchr01_g0009950-1.1