Calophyllaceae plastomes, their structure and insights in relationships within the clusioids

A complete chloroplast genome is not yet available for numerous species of plants. Among the groups that lack plastome information is the clusioid clade (Malpighiales), which includes five families: Bonnetiaceae, Calophyllaceae, Clusiaceae, Hypericaceae, and Podostemaceae. With around 2200 species, it has few published plastomes and most of them are from Podostemaceae. Here we assembled and compared six plastomes from members of the clusioids: five from Calophyllaceae (newly sequenced) and one from Clusiaceae. Putative regions for evolutionary studies were identified and the newly assembled chloroplasts were analyzed with other available chloroplasts for the group, focusing on Calophyllaceae. Our results mostly agree with recent studies which found a general conserved structure, except for the two Podostemaceae species that have a large inversion (trnK-UUU–rbcL) and lack one intron from ycf3. Within Calophyllaceae we observed a longer LSC and reduced IRs in Mahurea exstipulata, resulting in some genic rearrangement, and a short inversion (psbJ–psbE) in Kielmeyera coriacea. Phylogenetic analyses recovered the clusioids and the five families as monophyletic and revealed that conflicts in relationships reported in the literature for the group agree with nodes concentrating uninformative or conflicting gene trees. Our study brings new insights about clusioid plastome architecture and its evolution.


Results
Chloroplast structure. The newly assembled plastomes presented a quadripartite structure with one LSC, one SSC and two IRs (Fig. 1). The total length ranged from 149,535 bp in Mahurea exstipulata to 160,253 bp in Calophyllum brasiliense, with mean depth coverage ranging from 52.0 to 474.7 reads. LSC, SSC and IR sizes and their respective GC content are presented in Table 1. Among the newly assembled plastomes M. exstipulata showed the most distinct plastome with the largest LSC (98,042 bp) and the smallest IRs (16,553 bp), respectively 9,923 to 12,143 bp longer and 8,966 to 10,781 bp shorter respectively than those of the other species. All species presented an SSC similar in size ranging from 17,464 bp in C. brasiliense to 19,102 bp in Clusia panapanari (Aubl.) Choisy. The GC content was similar among all species (Table 1).
Five plastomes have 87 protein-coding genes, 37 transfer RNA (tRNA) and eight ribosomal (rRNA), totaling 132 genes. However, M. exstipulata had lost two tRNA genes and one of the copies of the rpl2, rpl23 and ycf2 genes and both copies of the ycf15 gene, thus having a total of 125 genes in its plastome. A list of all genes is found in Table 2. Only clpP gene has two introns. The clusioids have lost one of the two ycf3 introns, thus this gene has two exons and one intron as in Jin et al. 40 . Among the duplicated genes in the IRa, there are seven tRNAs, eight CDS (seven in C. heterocarpa in which the ycf15 is pseudogenized, and four in M. exstipulata due to its shorter IR), and four rRNAs (Fig. 1). Overall, the gene content is similar within the clusioids, the two Podostemaceae (Marathrum foeniculaceum Bonpl. and Tristicha trifaria (Bory ex Willd.) Spreng.) being the most distinct. www.nature.com/scientificreports/ Few differences in the limits of the four regions of the plastomes were observed (Fig. 2). The LSC-IRb limit is flanked by rpl22 on the LSC side and by rpl2 on the IRb side, with rps19 spanning the junction in all species but M. exstipulata. This species lacks rps19 in this region and has ycf2 flanking the junction on the LSC side and trnL on the IRb side. The IRb-SSC junction always has the ycf1-fragment on IRb and the ndhF on SSC, while Table 1. GenBank accession numbers and comparison of chloroplast genome size and GC content across three different regions (LSC, SSC, and IR) for 12 clusioid species. LSC large single copy, SSC small single copy, IR inverted repeat.
Long and simple sequence repeats (SSRs) and sequence polymorphism. REPuter 44 identified between 20 long repeats in Tristicha trifaria and 50 in Cratoxylum cochinchinense (Lour.) Blume most of them were forward and palindromic repeats; complementary repeats were found only in Garcinia gummi-gutta (L.) N.Robson, which has two of them (Table 3). We found a repeat flanking rbcL in Marathrum foeniculaceum and a repeat flanking trnK in T. trifaria, each gene being at one end of the 50 kb inversion reported for the family. No repeats were found flanking the psbJ-psbE inversion in K. coriacea, although they are often associated with genomic rearrangements 17 . In almost all clusioid representatives, most long repeats were distributed throughout the LSC, the only exceptions were B. paniculata with 15/32 repeats in the LSC and C. cochinchinense with 21/50. The number of long repeats in the SSC ranged from one in T. trifaria to eight in M. ferrea, and in the IR it ranged from three in M. foeniculaceum to 31 in C. cochinchinense (Supplementary Figure S1). The location of most of those long repeats regarding coding and noncoding regions were between genes, in the intergenic spacers. In our study, repeats were also found within genes (accD, ccsA, ndhG, psaA, psaB, rps18, ycf1, and ycf2, and in the trnS-GCU, -UGA, and -GGA), and in some introns (clpP, ndhA, ndhB, petB, petD, rps16, and ycf3) (Supplementary Table S1). The presence of repeats in these genes was already noted in other studies [45][46][47] . MISA 48,49 identified between 297 SSRs in Caraipa heterocarpa and 403 in M. foeniculaceum, most were mononucleotide A or T repeats. The two Podostemaceae species had fewer dinucleotide SSRs than all other species with 25 in T. trifaria and 28 in M. foeniculaceum. Tri-and tetranucleotide SSRs were found in all species. Pentanucleotide SSRs were found in almost all species (9/12), and hexanucleotide SSRs were found only in five species (Table 3).
Sequence polymorphism analyses indicated that the ten longest regions are equally distributed (four in IR, three in LSC and three in SSC) and that eight of them are CDSs. Nine of the ten regions with more segregating sites and more estimated mutations were the same; at least eight of them are CDSs, and six of them are in the LSC. All the ten regions with highest nucleotide diversity were intergenic spacers and seven out of ten are in the LSC. Overall, LSC and SSC had higher variability (nucleotide diversity between 2.9 and 3.5-fold, respectively) when compared to the IRs (Supplementary Table S2).
Phylogenetic analyses. Our 59 sequences complete alignment prior to the removal of poorly aligned regions on Gblocks 50 (nogb) has 77,015 bp, including 34,595 distinct patterns, 21,558 parsimony-informative sites, and 21.16% of gaps/missing data. The alignment processed on Gblocks 50 (gb) has 66,671 bp, including 30,582 distinct patterns, 19,548 parsimony-informative, and 12.94% gaps/missing data. We used three different methods (maximum-likelihood-ML, Bayesian inference-BI, and multispecies coalescent-MC) and eight different datasets (CU: 82 protein-coding genes concatenated unpartitioned-"supergene approach", CP: 82 protein-coding genes partitioned with individual evolutionary models, 82: a single consensus gene tree per locus used as input, 2050: 25 consensus gene trees from independent runs per locus used as input; each of these four datasets have two versions: one without removal of poorly aligned regions-nogb, and one after removal using Gblocks 50 -gb) in our analyses.
The trees we obtained exhibited the same topology for most relationships. In all analyses the clusioids were recovered in a strongly supported monophyletic clade with all the five families also being monophyletic. Only ML and BI unpartitioned analysis without removal of poorly aligned regions (CUnogb), and MC species tree using a single gene tree per gene as input without removal of poorly aligned regions (82nogb) recovered Bonnetiaceae as sister to the remaining clusioid families (Fig. 5-IIB). Support values for this relationship were generally low to moderate. For conflicting relationship 2, all ML and BI consensus trees regardless of partitioning scheme or removal of poorly aligned regions (CUgb, CUnogb, CPgb, CPnogb) recovered Mammea as sister to the remaining Calophyllaceae ( Fig. 5-IIC). All MC species trees (82gb, 82nogb, 2050gb, 2050nogb) recovered Mammea as sister to the clade Mesua + Calophyllum, and this clade as sister to a clade including the remaining Calophyllaceae ( Fig. 5-IID). In all ML and BI analyses, support values for this relationship were moderate to high whereas they were low in MC analyses. Relationship 3 will not be discussed since few samples from the genus Hypericum L. were included here (3/370 species). PhyParts 51 gene tree discordance analyses using MC trees after removal of poorly aligned regions (82gb, 2050gb) as input showed that the two of three branches with lower support in our analyses are the ones where there is more gene conflict (Fig. 5-I). Particularly for taxa in Calophyllaceae, around 50% of the genes were not informative or resulted in conflicting topologies. The discussion regarding gene tree discordance will be based on MC species trees generated using the datasets with poorly aligned regions removed (82gb and 2050gb).

Discussion
Five out of the six newly assembled plastomes (including the almost complete plastome of Calophyllum brasiliense) presented here showed a conserved structure with total size, general organization in a quadripartite structure, IR, LSC and SSC sizes, number of genes and GC content in agreement with the values for an ordinary angiosperm plastome 7,17 . However, as is the case in several groups, our study focusing on Calophyllaceae revealed that there is variation of plastome structure for the family. Mahurea exstipulata chloroplast had only a single copy of the usually duplicated genes rpl2, rpl23 and ycf2, which moved from the IRs to the LSC, resulting in the contraction of former and the expansion of the latter and in distinct IR/single-copy regions junctions. The expansion or contraction of IRs is a common type of rearrangement in plastomes that was already documented for the clusioids in Podostemaceae 39,40 and for some other groups like Geraniaceae 52 , Acanthaceae 53 and the genus Passiflora 46 . However, this phenomenon was not observed in Calophyllaceae until the present study. Although Table 3. Comparison of the number of simple sequence repeats (SSRs) and of long repeats present in 12 clusioid species. www.nature.com/scientificreports/ Podostemaceae species and M. exstipulata share IR contractions, their plastomes are quite distinct and these contractions represent independent evolutionary events. In M. exstipulata the IR contraction resulted from the loss of one copy of the genes rpl2, rpl23 and ycf2 and it was associated with a large expansion of the LSC; it is between 9,258 and 19,040 bp longer when compared to the other clusioids. The loss of a copy of the same three genes is a rare event and has been reported only in Strobilanthes cusia (Nees) Kuntze (Acanthaceae, Lamiales) 53 , and the loss of rpl2 and rpl23 is known from Cuscuta reflexa Roxb. (Convolvulaceae, Solanales) 54 and Lonicera japonica Thunb. (Caprifoliaceae, Dipsacales) 55 On the other hand, in Podostemaceae the IR contraction resulted from both gene (ycf1 and ycf2) and intron (rps16) loss and pseudogenization of accD, rpl22, and clpP, all associated with a reduction also in the single copy regions, which resulted in a smaller chloroplast as a whole. Among the newly sequenced plastomes, all species but M. exstipulata had the junctions between IR and LSC and SSC regions similar to those of a canonical angiosperm plastome 56 . The differences observed in the limits of the four main parts of the plastome correspond to one of the sources of variation in the plastome structure and have been described from different taxonomic levels, e.g., within Calophyllaceae and between the clusioids or within Cercidoideae legumes 56 . The usually slower evolutionary rates of genes IRs when compared to single copy regions is widely reported in the literature 14,57 and was also confirmed in our study. Regarding genes and intergenic spacers, the latter were the most variable and should be considered in future evolutionary studies. In agreement with Walker and collaborators (2019) 25 40 also reported this inversion for Cratoxylum cochinchinense (Hypericaceae) and suggested it could be a synapomorphy of the clade composed by these two families, it was observed only in Podostemaceae in our study. We also registered the first inversion within Calophyllaceae (psbJ-psbE) in Kielmeyera coriacea. Repetitive regions are known to flank inversion breakpoints. In our data, it was observed only in Podostemaceae (Supplementary Table 1). The ycf3 intron loss has not been reported for any other angiosperm except the clusioids to our knowledge and could be a synapomorphy of the clade but further investigation of the chloroplast structure of the closely related families is necessary to confirm this hypothesis. We speculated that the intron 2 was lost from ycf3, since in   62 , but also reveal much gene tree disagreement, particularly for two relationships: (1) Clusiaceae + Bonnetiaceae that is supported by only 340 of the total 2050 gene trees and 18 of the total 82 gene trees, and (2) Calophyllaceae + (Hypericaceae + Podostemaceae) is supported by 500 of the total 2050 gene trees and 21 of the total 82 gene trees (Fig. 5-I). Interestingly, a position of Bonnetiaceae as sister to the rest of the clusioid clade was proposed by Engler 64 and mainly supported by the lack of the schizogenic latex or resin ducts in Bonnetiaceae. However, this position was not recovered in any recent molecular phylogeny until the current study and in Baker et al. 65 , but in the latter Podostemaceae was not with the other clusioids. Relationships within Bonnetiaceae, Clusiaceae, and Podostemaceae are largely in agreement with other studies 30 , Clusia and Garcinia L. being the only genera with more than one representative and recovered as monophyletic in all our analyses. Within Hypericaceae, our results recovered the same relationship as found by Ruhfel et al. (2016) 41  Mammea was recovered as sister to the clade Calophyllum + Mesua, as in Ruhfel et al. (2011Ruhfel et al. ( , 2016 35,41 , in all four coalescent schema, i.e., 82gb, 82nogb, 2050gb and 2050nogb, but all with low support (PP 0.37, 0.44, 0.76 and 0.22, respectively) ( Fig. 5-IID). The high level of uninformative gene trees (uninformative gene trees/total number of gene trees: 1195 /2050, 13/82) associated with a considerable number of discordant trees (discordant trees/ total number of trees: 677/2050, 60/82) may help to explain the low support observed (Fig. 5-I). In all four ML and BI concatenated analyses (CUgb, CUnogb, CPgb, CPnogb), Mammea was recovered as sister to a large clade including the other genera, as in Cabral et al. (2021) 42 , with high support (100% UB/1.0 PP), and the relationship between the clade (Mesua + Calophyllum) and the clade with the remaining genera was moderately to strongly supported (UB/PP: CUnogb: 100%/1.0, CUgb: 77%/0.7, CPnogb: 100%/1.0, CPgb: 87%/0.92) (Fig. 5-IIC).
The conflicts reported here show that partitioning the dataset, filtering for poorly aligned regions and the method of inference can impact the topology of the tree. The use of MSC, considering each plastid gene individually, was recently recommended to infer phylogenies with plastid data 22 ; this approach breaks some assumptions of the model (see 24 for a discussion). However, despite scarce, there is evidence of recombination in chloroplasts [see 25 , 66 ], raising a question that deserves further exploration. Thus, it would be interesting to see studies testing for non-recombinant units in the plastid prior to phylogenetic inference. There is a lot to be learned about organellar genomes, including a deeper investigation of its multibranched structure.
Plastid data can be a good starting point to investigate phylogenetic relationships, particularly in large and neglected genera such as Clusia and Calophyllum, with around 300-400 and 200 species, respectively. Whenever possible, it should be combined with data from other genomic compartments. Obtaining genomic DNA from herbarium specimens to target nuclear regions is still challenging, although it may become more common with the development of probe sets that can be used in different groups and of standardized protocols 16 . In this scenario, chloroplast DNA data availability will also increase since this information is being recovered with nuclear DNA in target enrichment capture due to the presence of plastids in high copy number in a cell. Furthermore, an initial exploration may help to point out groups where more sampling is needed or where relationships will need a different source of data to be clarified.
The disagreements found in this study have been reported since the first molecular phylogenies for the clusioids appeared. We now need to add data from other genomic compartments (i.e., nucleus and mitochondria). However, what Cai et al. (2021) 34 have shown with nuclear data is that reconstructing phylogenies in Malpighiales, to which the clusioids belong, represents a huge challenge due to incomplete lineage sorting, tree error estimation and gene flow. Therefore, for a better comprehension of phylogenetic relationships, more analytical refinement is required, such as testing different partitioned schema.   68 . The DNA was quantified on NanoDrop Spectrophotometer (Thermo Scientific, Waltham, MA, USA) or on Qubit 2.0 (Invitrogen, Carlsbad, CA, USA). Genomic libraries for whole-genome were prepared and sequenced 2 × 150 bp on Illumina NextSeq 500 Mid-Output by Genohub (Austin, TX, USA). All adapters were removed by the company.

Methods
Genome assembly and annotation. Reads were assembled in two ways. (1) Using GetOrganelle v. 1.6.4 11 with the following parameters: -R 15 -k 21,35,45,55,65,75 -max-reads 4E7 -F embplant_pt. All assembly graphs from GetOrganelle 11 were visually checked in Bandage v 0.8.1 69 , so parameters could be adjusted. For all species two configurations of the plastome were assembled. For Calophyllum brasiliense and for Kielmeyera coriacea, plastomes were assembled in a different way. (2) Reads were mapped to the closest chloroplast available at the time this study started, i.e., Garcinia mangostana L. (GenBank accession NC_036341.1) with Bowtie2 v 2.2.1 70 plugin on Geneious 9 12 by adding the command -no-discordant to the default parameters; mapped reads were assembled with Platanus v.1.2.4 71 . Since the two haplotypes found are present in the same proportion in a cell 28,72 , we chose the same configuration of the plastome published for G. mangostana to use in subsequent analysis. To validate our assemblies and to verify the coverage, reads were mapped to the assembled genomes using Bowtie2 v 2.2.1 70 as mentioned above but using the assembled plastome as reference. Between 49,148 and 540,562 reads mapped the respective assembled chloroplast (Table 1).
Assembled chloroplasts were automatically annotated using GeSeq 73 implemented on Chlorobox website (https:// chlor obox. mpimp-golm. mpg. de/). The annotation was checked and corrected on Geneious 9 12 . Start and stop codons were checked for all the genes and tRNAs limits were checked using ARAGORN 74 output from Chlorobox (https:// chlor obox. mpimp-golm. mpg. de/). Potential pseudogenes were defined by Blast following Silva et al. (2018) 75 . Finally, circular gene maps were generated with OGDRAW 76 . All subsequent analyses, except IRscope 77 , were performed on plastomes with only one IR to avoid redundant results.
Plastome structure and identity. The gene content was compared between the five Calophyllaceae and the Clusiaceae newly assembled plastomes ( Table 2). For general plastome structure and all phylogenetic analyses, all the annotated and checked plastomes available for the clusioid clade on GenBank were included: Bonnetia paniculata Spruce ex. Benth. (Bonnetiaceae), Mesua ferrea L. (Calophyllaceae) Garcinia gummi-gutta (Clusiaceae), Cratoxylum cochinchinense (Hypericaceae), Marathrum foeniculaceum (Podostemaceae) and Tristicha trifaria (Podostemaceae). All the GenBank accession numbers are found in Table 1. We manually adjusted the beginning of each plastome on Geneious 9 12 . Beginning in the same region is important to avoid Mauve infer false rearrangements. A total of 12 plastomes were aligned using progressive Mauve algorithm in Mauve Plugin v. 2.3.2 in Geneious 9 12 to check for structural differences such as inversions or rearrangements. IR boundaries were evaluated on IRScope online 77 .
Repetitive regions and polymorphism. Since reorganizations can be associated with small dispersed repeats 17 , REPuter 44 was used to identify direct, complement, palindromic, and reverse repeats with the following parameters: minimal size of 30 pb and Hamming distance of 3. For Calophyllum brasiliense and for Kielmeyera coriacea the Ns and IUPAC ambiguous bases had to be manually removed from the respective fasta file in order to run REPuter 44 . And MISA 48,49 was used to identify simple sequence repeats (SSR) with a minimum number of 7, 4, 4, 3, 3, and 3, for mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats, respectively. The parameters followed Silva et al. (2019) 78 .
To evaluate sequence polymorphism, protein-coding and intergenic regions were extracted with parseGenbank.pl script from Mitofy 79 or on Geneious 9 12 and aligned on MAFFT v. 7.308 80 with the 'adjustdirection' option added to the default parameters. Only intergenic regions longer than 50 bp were included and Azara serrata Ruiz & Pav. (Salicaceae) was included in the alignments as an outgroup. Alignments were used as input on DnaSP 6 81 to calculate variation between regions. We compared total number of sites, of analyzed positions (NetSites), of segregating sites, of conserved sites, the estimated number of mutations, parsimony informative sites (PIS), proportion of PIS, nucleotide diversity and average number of substitutions per site. The proportion of PIS was calculated as the PIS/NetSites × 100 and gives an estimate about the absolute informativeness of the region. This information, when combined with the length of the region, may be helpful for markers design.
Phylogenetic analysis. For the phylogenetic analyses we assembled a dataset with the 82-protein-coding genes that includes two Bonnetiaceae species (two genera), seven Calophyllaceae species (six genera), four Clusiaceae species (two genera), five Hypericaceae species (three genera) and three Podostemaceae species (three genera). Additionally, data for other 36 species within Malpighiales, Averrhoa carambola L. (Oxalidales) and Elaeodendron orientale Jacq. (Celastrales) were also included as outgroups. A list with all the GenBank accession numbers for additional species is included in the Supplementary Material S1. To assure poorly aligned regions were not interfering in tree topology, these regions were removed with Gblocks 50 with the following parameters: -t = d -b5 = a -n = y -e = gb1 -d = y. Therefore, the following analyses were conducted both for the original dataset (nogb) and for the dataset after gblocks (gb). Maximum likelihood inference was conducted based on the concatenated approach for the unpartitioned (CUnogb and CUgb) and partitioned datasets (CPnogb and CPgb) in IQ-tree 2 82 . Models were selected based on Bayesian information criterion (BIC) and support was assessed Scientific Reports | (2021) 11:20712 | https://doi.org/10.1038/s41598-021-99178-z www.nature.com/scientificreports/ through 1000 ultrafast bootstraps. For both the nogb and the gb alignments, a total of 25 independent runs were performed for each gene, totaling 2050 consensus gene trees which were included in a single file to generate the 2050 datasets, i.e., 2050nogb and 2050gb. Bayesian Inference for the unpartitioned (CUnogb and CUgb) and partitioned datasets (CPnogb and CPgb) was conducted in Mr.Bayes v.3.2.7a under the best-of-fit model in accordance with BIC assessed in IQ-tree2 82,83 . It was run for 30 million generations sampled every 1000 generations, using two runs and four chains, until the average standard deviation of split frequencies became less than 0.01, beginning with random trees. The initial trees were discarded after reaching stationarity (~ 25%). Also, a coalescent-based species tree estimation was done in Astral-III 84 . We conducted the analysis with one tree per gene as input (82gb and 82nogb) and with 25 trees per gene as input (2050gb and 2050nogb). All gene trees used as input in Astral-III 84 were rooted and had branches with support lower than 10% collapsed in Newick Utilities 85 through the functions nw_reroot and nw_ed. Rooting the trees and collapsing branches with low support are known to improve the performance of summary methods 86 .
To explore tree conflict within the clusioid families the splits.nex file generated in IQ-tree2 82 for both partitioned and unpartitioned analysis was visualized as a network in SplitsTree4 v. 4.16.2 87 . Additionally, tree conflict was explored through PhyParts 51 and pie charts were plotted on the species phylogeny using the PieCharts python script developed by M. Johnson 88 . PhyParts calculates the number of concordant gene trees, of the top alternative bipartition, of other conflicting topologies, and of uninformative genes for all branches in the tree. To root the trees and remove ultrafast bootstrap and branch length values the files for both PhyParts 51 and PieCharts were prepared using Newick Utilities 85 through the functions nw_reroot and nw_topology.

Data availability
The complete plastome sequences of Calophyllum brasiliense, Caraipa heterocarpa, Kielmeyera appariciana, K. coriacea, and Mahurea exstipulata have been submitted to GenBank under the accession numbers MW85378 6MW853790. License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.