Comparative chloroplast genomics and insights into the molecular evolution of Tanaecium (Bignonieae, Bignoniaceae)

Species of Tanaecium (Bignonieae, Bignoniaceae) are lianas distributed in the Neotropics and centered in the Amazon. Members of the genus exhibit exceptionally diverse flower morphology and pollination systems. Here, we sequenced, assembled, and annotated 12 complete and four partial chloroplast genomes representing 15 Tanaecium species and more than 70% of the known diversity in the genus. Gene content and order were similar in all species of Tanaecium studied, with genome sizes ranging between 158,470 and 160,935 bp. Tanaecium chloroplast genomes have 137 genes, including 80–81 protein-coding genes, 37 tRNA genes, and four rRNA genes. No rearrangements were found in Tanaecium plastomes, but two different patterns of boundaries between regions were recovered. Tanaecium plastomes show nucleotide variability, although only rpoA was hypervariable. Multiple SSRs and repeat regions were detected, and eight genes were found to have signatures of positive selection. Phylogeny reconstruction using 15 Tanaecium plastomes resulted in a strongly supported topology, elucidating several relationships not recovered previously and bringing new insights into the evolution of the genus.


Scientific Reports | (2023) 13:12469 | https://doi.org/10.1038/s41598-023-39403-z | www.nature.com/scientificreports/

Tanaecium Sw. emend. L.G. Lohmann (Bignonieae, Bignoniaceae) is a genus of Neotropical lianas that includes 21 species distributed from Mexico and the Antilles to Argentina, and centered in the Amazon 21 . The genus exhibits exceptionally diverse flower morphology and pollination systems 21 , seeds that can be winged or wingless and corky, and bromeliad-like prophylls of the axillary buds, a putative vegetative synapomorphy 21 . The genus was first sampled in a molecular phylogeny reconstructed using the chloroplast gene ndhF and the nuclear pepC 12 . Subsequent molecular phylogenetic studies with this group used the same molecular markers 22–24 . While representatives of this genus have been sampled in multiple studies, sampling remains limited, even lacking sampling of the type species of the genus. Moreover, the Tanaecium plastome structure has never been explored. Even though a study reported data on the plastome of T. tetragonolobum (Jacq.) L.G.Lohmann 25 , this plastome turned out to be Callichlamys latifolia (Rich.) K.Schum. 26 .
This study aims to increase our knowledge of Bignoniaceae plastome structure and evolution and bring new insights into the evolutionary history of Tanaecium by reporting on the plastome structure of the genus for the first time. To achieve this goal, we (1) sequenced and assembled complete or nearly complete plastomes of 15 species of Tanaecium, representing more than 70% of the known diversity in the genus 21; (2) characterized the overall plastome structure; (3) performed comparative genomic analyses; (4) identified putative repeats; (5) investigated patterns of selection in the chloroplast genes; and (6) reconstructed a phylogeny for Tanaecium using the newly assembled plastomes.

Results
Plastome assembly and characteristics. The paired-end raw reads of the 16 Tanaecium plastomes sequenced (Table 1) varied between 3,858,109 and 14,350,498 bp for T. parviflorum and T. tetragonolobum, respectively (Table 2). Of the 16 plastomes, 12 were complete and four were partial. Mapped reads varied from 101,125 to 660,086 bp for T. duckei and T. revillae, respectively (Table 2). The average read depth varied between 85× for T. tetragonolobum and 679× for T. dichotomum 2 (Table 2). All plastomes showed the typical quadripartite structure of angiosperms (Fig. 1), with a pair of IR regions that range from 30,284 bp (T. duckei) to 31,089 bp (T. bilabiatum), intercalated by one LSC region that ranges from 83,490 bp (T. crucigerum; nearly complete, but without missing data in the LSC) to 86,213 bp (T. xanthophyllum), and one SSC region that ranges from 12,504 bp (T. tetragonolobum) to 12,920 bp (T. dichotomum 1) (Table 2). The Tanaecium plastomes have an average length of 159,359 bp, with Tanaecium xanthophyllum representing the largest plastome assembled here, with a total length of 160,935 bp (Table 2). The large size of the T. xanthophyllum plastome is due to an expansion in the LSC region (Table 2). The interquartile range (IQR) and median size ratio for Tanaecium was 0.5%; in turn, the IQR reported for Adenocalymma was 0.7%, for Anemopaegma was 0.4%, and for Amphilophium was 4%, as expected based on an earlier study 9 (Supplementary Table S7). The average GC content is 38% for all Tanaecium plastomes (Table 2). All plastomes encode 137 genes, including 80–81 unique coding genes (CDS) (9 duplicated), 37 tRNA genes, and four rRNA genes (Tables 2 and 3). The Mauve analysis retrieved a single synteny block, indicating no rearrangements in Tanaecium plastomes (Supplementary Fig. S1).

Table 1. Taxa, voucher, reference, and GenBank accession numbers of the samples analyzed in this study.
The boundaries between the chloroplast main regions are similar within Tanaecium, except for the LSC/IRb border, which can be located between the genes rps19 and rpl2 or within the rps19 gene (Fig. 2).
Nucleotide diversity analyses. The sliding-window analysis performed in DnaSP, which calculated nucleotide variability (π) in 800 bp windows across the plastomes, showed that there is intrageneric variability in Tanaecium (Fig. 3A). The π values range from 0 to 0.06, with a mean value of 0.009. The most variable region, the only one containing π > 0.05, was the rpoA gene. Seven regions showed π values between 0.03 and 0.049 (i.e., clpP, psaI-ycf4, petD-rpoA, rps11, rps12-clpP, ycf4, and rpoA), while twelve regions showed π > 0.02 (i.e., ycf2, ycf1, rpl33, clpP-psbB, rpl33-rps18, rpl32-trnL, rpl32, clpP, ycf4, rpl20-rps12, rps11, and rps18) (Fig. 3A). The noncoding regions are more variable (7.65% of the intergenic regions (IGS) and 6.05% of the introns) than the coding regions (5.75%; Supplementary … Table S3).

Most of these tandem repeats are found in the LSC regions, followed by the IR, with only a few tandem repeats found in the SSC (Supplementary Table S3). Most of the tandem repeats are located in the IGS, followed by the CDS, while few repeats were found in introns (Fig. 5D; Supplementary Table S3). The plastomes of Tanaecium contain 20–67 forward repeats, up to two reverse repeats, and single palindromic repeats, leading to a total of 22–67 repeats (Supplementary Table S3). The longest repeats vary between 79 bp in T. parviflorum and 418 bp in T. pyramidatum (Supplementary Table S3). The longest repeats are located in eight regions: the accD, rpoA, ycf1, and rps18 genes, and the rpl23/trnI-CAU, rpl33/rps18, psaA/ycf3, and trnN-GUU/ycf1 intergenic regions (Supplementary Table S3). A shared 41 bp repeat had its first copy in the rps2/trnV-GAC intergenic region and its second copy in the ndhA intron in all Tanaecium species and in four additional Bignonieae plastomes included in this study (Fig. 5A; Supplementary Table S4). … Table S5).
The most abundant codons encoded leucine (10.5%), followed by isoleucine (8.3%), whereas the least abundant codons encoded cysteine (1.07%), followed by the stop codons (0.35%) (Fig. 6). Thirty-two codons were used less frequently than expected at equilibrium (RSCU < 1), of which only three are not G- or C-ending. Thirty codons were used more frequently than expected at equilibrium (RSCU > 1), only one of which is not A- or U-ending. Codon bias was not detected (RSCU = 1) in the frequency of use for the start codon AUG (methionine) and UGG (tryptophan) (Supplementary Table S5). None of the 81 genes were found to be under positive selection in Tanaecium using HyPhy 30 in MEGA 7 31 . However, signals of positive selection were detected using the codon models BUSTED 32 (Fig. 7).

Discussion
In this study, we sequenced and assembled for the first time 16 plastomes representing 15 of the 21 Tanaecium species currently recognized 21 . These plastomes were compared with previously published Bignoniaceae plastomes, providing novel insights into chloroplast evolution in the family. The newly assembled plastomes were used as a basis to reconstruct the most comprehensive phylogeny of Tanaecium to date. The phylogenetic placement of Tanaecium jaroba, the type species of the genus, was inferred for the first time, corroborating the current generic classification 21 .
The quadripartite plastome structure found in Tanaecium is the most common among angiosperms 3,7,8,34 . Some exceptions to this structure have been reported in the papilionoid legumes 35 , saguaro cactus 36 , and Geraniaceae 37 . Although plastome structural changes have been reported for angiosperms 6,8 , including tribe … (Fig. 2). Contractions and expansions of IRs were detected multiple times during land plant evolution 38 , including other Bignoniaceae 16,19,20,28 . Within this plant family, the plastomes of Bignonia magnifica bear exceptionally large IR regions, representing the largest plastome among all Lamiids known to date 28 . The obtained Tanaecium plastomes show a pattern of size range variation that matches that of the LSC expansions/contractions (Table 2). This is a typical pattern among seed plants, although the number of genes and intergenic region length is more commonly used to explain plastome size variation 10 . In other Bignonieae, the LSC size variation is relatively common 16,19,20 , and the variation in gene number seems less frequent for the group 16,19,20 .

Table 3. Genes encoded by the Tanaecium plastomes and their type and function (columns: gene function, gene type, gene). Asterisks (*) after gene names indicate genes with one intron, and double asterisks (**) indicate genes with two introns. The number one after a gene name indicates a duplicated gene.
When the Bignonieae IQR and median size variation ratio are compared with those expected for other angiosperms 9 , Tanaecium, Adenocalymma, and Anemopaegma show less than 1% variation at the genus level (Supplementary Table S7). Even though the high variation found in Amphilophium was previously attributed to polyphyly 9 , this interpretation was based on an outdated classification system. Amphilophium monophyly has been shown repeatedly 39,40 . In this context, we attribute the high IQR and median size variation ratio found in Amphilophium to the gene number and LSC length variation 10,16 . The total number of genes found in Tanaecium plastomes is similar to those found in other Bignoniaceae 16,20 . While ycf15 and ycf68 genes are lacking in some Bignoniaceae genera 16,19,20 , those genes were found in Tanaecium, Callichlamys latifolia (Rich.) K.Schum. 25 , and Crescentia cujete L. 29 . Partial ycf15 genes were also recorded in the Convolvulaceae 41 . The complete or partial loss of genes is common in land plants 6,9,10 , including the Bignoniaceae 20 .
The most variable locus in Tanaecium is rpoA, which contains hypervariable sites with π > 0.05. This gene is frequently listed among the most variable regions in other plant clades 42 and has been shown to represent one of the most hypervariable genes for Amphilophium (Bignoniaceae) 16 . In turn, the accD gene is the most variable in terms of absolute numbers in Tanaecium (Fig. 3), and the second most variable in Amphilophium, followed by the ycf1 gene 16 . The accD gene is highly variable in other Bignoniaceae species and angiosperm clades such as Artemisia (Asteraceae) 43 and Lamprocapnos (Papaveraceae) 8 . The rps18 gene is among the most variable in absolute numbers in Tanaecium, Stemonaceae 44 , Bromeliaceae 45 , and Campanulaceae 17 . Interestingly, the rps18 gene shows low evolutionary rates in Anemopaegma (Bignoniaceae) 19 , indicating that chloroplast genes can hold different levels of variation in distinct lineages and at different taxonomic levels. This aspect complicates the selection of candidate barcode genes for the angiosperms as a whole, emphasizing the importance of studies aiming to characterize plastomes of entire clades.
Simple Sequence Repeats (SSRs) are commonly detected in plastomes. They often show interspecific polymorphism and high variation at lower taxonomic levels, making them useful tools for population-level studies 46 . The SSRs identified in Tanaecium vary in location, type, and number. Most SSRs are located in the LSC region, with the mononucleotide A/T repeats representing the most abundant type (Fig. 4). The higher frequency of mononucleotides is a common trend among land plants 47 . Most of the long repeats of Tanaecium are located in the LSC, followed by IR regions, with only a few located in the SSC. This pattern differs from that found in other Bignoniaceae species, where most of the repeats larger than 30 bp are located in the IR, with only a few cases showing a pattern that is similar to that found here 16,48 . The chloroplast SSRs detected in Tanaecium will likely be helpful for future population genetics and microevolutionary studies, as well as for community-level studies of potential barcode designs, given the presence of shared repeats.

Plastomes have a synonymous codon usage bias in the protein-coding genes, which affects gene expression and plays an essential role in the evolution of these genomes 49 . Our results showed that A- and U-ending codons are more common in Tanaecium, consistent with codon usage bias in most of the angiosperm plastomes, including Bignoniaceae representatives 48,50 . In plants, the main evolutionary driving forces acting on codon use are natural selection and mutation pressure 51–53 . Thus, the patterns observed in Tanaecium bring important information not only about the nature of plastome mutations, but also about putative environmental impact. More highly expressed genes might display higher codon bias 54 , which can be seen in plastomes due to the photosynthetic machinery associated with the chloroplast function.
Our results also showed a preference for using the amino acid leucine, which has a high RSCU ( Fig. 6; Supplementary Table S5), suggesting a potential impact of selection pressure on codon usage 51,54 .
Adaptive evolution or positive selection is generally estimated using the synonymous/non-synonymous substitutions ratio 55 . Even though our analyses using a maximum likelihood approach in HyPhy failed to detect any signal of positive selection, evidence for positive selection was recovered through the analyses conducted with BUSTED and FUBAR. This result likely reflects the fact that a relatively high fraction of sites (5–10%) needs to be under positive selection for accurate detection in BUSTED 32 , while FUBAR assumes that the selection pressure for each site is constant throughout the phylogeny 33 . Thus, the signal of positive selection detected in these genes is likely genuine. Of the eight genes under positive selection in Tanaecium, seven were also shown to be under positive selection in Amphilophium (all except ycf4) 16 , while three were shown to be under positive selection in Handroanthus impetiginosus (Mart. ex DC.) Mattos (i.e., rps7, ycf1, and ycf4) 48 . The genes found under selection are associated with different plant cell functions. They are associated explicitly with ribosome biogenesis and …

The ML phylogeny reconstructed here sampled 15 out of the 21 currently accepted species of Tanaecium, representing the most comprehensive phylogeny of the genus to date in terms of the number of characters and taxa. A previous topology was inferred to investigate the relationship of a recently described Tanaecium species, sampling 11 species of the genus and using only the nuclear marker pepC and the chloroplast gene ndhF 21 . The sampling used here is different, making comparisons among the resulting topologies difficult. In addition, some relationships were not clearly resolved in the previously published tree reconstructed with two markers, with several nodes showing low/moderate support 21 . Yet, the placement of the newly described species in that study was similar to the one inferred here (i.e., T. decorticans + T. pyramidatum).
Moreover, the phylogeny inferred here is the first to include the type species of the genus (i.e., T. jaroba), confirming the monophyly of the genus hypothesized earlier 12 . Our results indicate that the variation found among plastomes is sufficient to reconstruct robust phylogenetic relationships of the 16 Tanaecium taxa sampled here with good support. Additional studies will be released soon, further investigating the phylogenetic relationships among Tanaecium species, their morphological evolution, and biogeographical history.

Materials and methods
Taxon sampling, DNA extraction, genomic sequencing, plastome assembly, and annotation. We Table 1.
Leaf tissue was pulverized with a TissueLyser® (Qiagen, Duesseldorf, Germany) for 5 min at 50 Hz, and DNA was subsequently extracted following the CTAB protocol 62 . The protocol was adapted by adding 2-mercaptoethanol and polyvinylpyrrolidone (PVP). DNA was quantified using the Qubit® Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). A total of 5 μg of DNA was fragmented using a Covaris S-series sonicator, generating DNA fragments of approximately 300 bp. Libraries for Illumina platform sequencing were prepared following Nazareno et al. 25 … For species for which it was harder to obtain comprehensive contigs, we tested values between 10 and 20 for the minimum contig overlap parameter (-p). The final assembly from Fast-Plast or afin was checked and edited in Geneious 9.0.2 66 . The plastome assembly was verified through a coverage analysis conducted in Jellyfish 2.1.3 67 using a 25-bp sliding window of coverage across the plastome of each species. Only sites with a depth higher than two were kept.
Plastome annotation was initially conducted in Geneious 9.0.2 66 using the Adenocalymma peregrinum plastome as a reference 20 . The annotated loci were verified using BLAST 68,69 , with correct start and stop codons of the Open Reading Frames (ORFs) checked manually in Geneious 9.0.2 66 . The boundaries between the LSC, IRs, and SSC regions were verified using the online IRscope 70 and confirmed manually in Geneious 9.0.2 66 . The graphical representation of the annotated Tanaecium plastomes was created using OGDRAW 71 .
Plastome comparative analyses. We performed comparative analyses using the 16 Tanaecium plastomes sequenced (Table 1). We removed one of the IR regions from all plastomes to avoid data duplication, except for the analyses to determine synteny and identify possible rearrangements, which were conducted for the complete plastomes using Mauve 2.4.0 72 . These analyses utilized mauveAligner as the alignment algorithm, MUSCLE … 1). To compare the length variation of Tanaecium plastomes and other Bignonieae genera with previously published plastomes, we used the box-plot approach proposed by Turudić et al. 9 .
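The size-variation summary used in this comparison can be illustrated with a minimal Python sketch. It assumes that the reported ratio is the interquartile range of plastome lengths divided by their median, expressed as a percentage; the exact definition in Turudić et al. 9 may differ. The function name and the example lengths are hypothetical and are not data from this study.

```python
from statistics import median, quantiles

def iqr_median_ratio(sizes):
    """Return IQR / median of plastome lengths, as a percentage.

    Assumption: quartiles are computed with the 'inclusive' method,
    which may not match the quartile rule used in the cited study.
    """
    q1, _, q3 = quantiles(sizes, n=4, method="inclusive")
    return 100.0 * (q3 - q1) / median(sizes)

# Hypothetical plastome lengths in bp, for illustration only.
example_sizes = [158_000, 159_000, 160_000, 161_000, 162_000]
print(f"IQR/median = {iqr_median_ratio(example_sizes):.2f}%")
```

A ratio below 1%, as reported for Tanaecium, indicates that the middle half of the plastome lengths spans less than 1% of the typical genome size.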
For the analyses of intrageneric variability, Tanaecium plastomes were aligned using the online version of MAFFT 7 74 . The poorly aligned regions were removed using Gblocks 0.91b 75 , assuming the least stringent settings. We calculated nucleotide variability values (π) within the assembled Tanaecium plastomes using DnaSP 6.10 76 through a sliding window analysis with a 200 bp step size and 800 bp window length. We used R 77 to plot the DnaSP results. We extracted annotated coding and non-coding regions using Geneious 9.0.2 66 to evaluate the number of variable sites (V) using the software MEGA 7 31 . The protein-coding regions were first re-aligned individually with the translation alignment tool in Geneious 9.0.2 66 using the ClustalW plugin 78 .
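The sliding-window calculation of π can be sketched in Python as follows. This is an illustrative reimplementation, not DnaSP itself: it assumes π is the mean number of pairwise differences per site, that alignment columns containing gaps or ambiguous bases are excluded, and that sequences are uppercase and of equal length. The function names are hypothetical; the defaults mirror the 800 bp window and 200 bp step described above.

```python
from itertools import combinations

def pi_window(seqs, start, end):
    """Nucleotide diversity (pi) for alignment columns start..end-1.

    Columns with any character outside A/C/G/T are skipped
    (an assumption; DnaSP's gap handling may differ in detail).
    """
    cols = [c for c in zip(*(s[start:end] for s in seqs))
            if all(b in "ACGT" for b in c)]
    if not cols:
        return 0.0
    n_pairs = len(seqs) * (len(seqs) - 1) // 2
    diffs = sum(sum(a != b for a, b in combinations(col, 2)) for col in cols)
    return diffs / (n_pairs * len(cols))

def sliding_pi(seqs, window=800, step=200):
    """Return (window_start, pi) pairs along an aligned set of sequences."""
    length = len(seqs[0])
    last_start = max(length - window, 0)
    return [(start, pi_window(seqs, start, start + window))
            for start in range(0, last_start + 1, step)]

# Toy alignment, for illustration only.
toy = ["AAAA", "AAAT", "AAAT"]
print(sliding_pi(toy, window=2, step=1))
```

Windows whose π exceeds a chosen threshold (0.05 in the Results) would then be flagged as hypervariable.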
Analyses of the repeated regions. To identify and locate microsatellites or Simple Sequence Repeats (SSRs) in Tanaecium plastomes, we used MISA 79 with the following parameters: motif length of SSR between one and six nucleotides, a minimum repetition number set as 10 units for mono-, five for di-, and four for trinucleotide SSRs, and three units for each tetra-, penta-, and hexanucleotide SSRs. We used REPuter 80 to identify tandem repetitions, allowing forward, palindrome, and reverse repeated elements with a minimum repeat size ≥ 30 bp and Hamming distance of 0.
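The SSR thresholds listed above can be illustrated with a short regular-expression search in Python. This is a simplified sketch and not the MISA implementation: it uses non-overlapping matches, discards motifs that are themselves repetitions of a shorter unit (e.g., AA), and ignores compound SSRs, so edge cases may be reported differently from MISA. The names `MIN_REPEATS`, `is_primitive`, and `find_ssrs` are hypothetical.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}

def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    return not any(n % k == 0 and motif[:k] * (n // k) == motif
                   for k in range(1, n))

def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-mer followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

# Toy sequence, for illustration only.
toy = "GC" + "A" * 12 + "CCGG" + "AT" * 6 + "CCG"
print(find_ssrs(toy))
```

Start positions are 0-based; adding the region coordinates (LSC, IR, SSC) would allow the hits to be tallied by location as in the Results.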
Plastome codon usage and signature of molecular selection. To investigate the codon usage and the role of selection on Tanaecium plastomes, we extracted 81 protein-coding genes from the 16 genomes aligned and annotated. Each coding region was re-aligned separately in Geneious 66 , using the translation alignment tool ClustalW plugin. Codon usage bias occurs when some codons are used more often than other synonymous codons during gene translation between different taxa 81 . We assessed the relative synonymous codon usage (RSCU) from the 81 protein-coding genes using MEGA 7 31 , with default parameters.
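To make the RSCU measure concrete, the following Python sketch computes it from a list of coding sequences. It assumes the usual definition, in which the observed count of a codon is divided by the mean count of all synonymous codons for the same amino acid, and it uses the standard codon assignments (shared by the plastid code). It is an illustration and not the MEGA 7 implementation; the names `CODON_TABLE` and `rscu` are hypothetical.

```python
from collections import Counter, defaultdict

# Standard codon assignments in TCAG order; '*' marks stop codons.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO))

def rscu(cds_list):
    """Return {codon: RSCU} for amino acids observed in the input CDSs."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with gaps or ambiguities
                counts[codon] += 1
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    result = {}
    for fam in families.values():
        total = sum(counts[c] for c in fam)
        if total:
            for c in fam:
                # Observed count divided by the mean count within the family.
                result[c] = counts[c] * len(fam) / total
    return result

# Toy CDS (Met-Leu-Leu-Leu-Trp), for illustration only.
print(rscu(["ATGTTATTATTGTGG"])["TTA"])
```

By construction, amino acids encoded by a single codon (methionine and tryptophan) always give RSCU = 1, which is why no bias can be detected for AUG and UGG.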
In addition, we investigated synonymous (Ks) and non-synonymous (Ka) substitutions and their ratio (Ka/Ks) in the 81 coding regions using the package HyPhy 30 in MEGA 7 31 . We also used other codon models to further analyze the selective pressure on the protein-coding genes using HyPhy 30 in the Datamonkey server 82 : BUSTED (branch-site unrestricted statistical test for episodic diversification; Murrell et al. 32 ) was used to investigate diversifying selection on the selected genes, while FUBAR (fast unconstrained Bayesian AppRoximation; Murrell et al. 33 ) was used to identify episodic/diversifying selection on codon sites with a posterior probability > 0.9.
Phylogeny reconstruction. The 16 plastomes of the 15 Tanaecium species assembled here were aligned with the online version of MAFFT 7 74 , using the Adenocalymma peregrinum (MG008314) plastome as an outgroup. The IRa regions were excluded from the alignment to avoid data duplication. We used Gblocks to remove poorly aligned regions with the least stringent settings 75 . The number of variable and parsimony-informative sites for the resulting alignment was calculated in MEGA 7 31 . The final alignment was used to perform maximum likelihood (ML) analyses in IQ-TREE 1.5.5 83 , including model selection and 1000 bootstrap (BS) replicates in a single run 84 .
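The two alignment statistics mentioned above can be illustrated with a small Python sketch. It assumes the common definitions: a site is variable if it shows at least two different nucleotides, and parsimony-informative if at least two different nucleotides each occur in at least two sequences. Gaps and ambiguous bases are treated as missing data here, which may differ from the settings used in MEGA 7; the function name `count_sites` is hypothetical.

```python
from collections import Counter

def count_sites(seqs):
    """Return (variable, parsimony_informative) site counts for an alignment.

    Assumes equal-length, uppercase sequences; characters other than
    A/C/G/T are ignored as missing data.
    """
    variable = informative = 0
    for col in zip(*seqs):
        counts = Counter(b for b in col if b in "ACGT")
        if len(counts) >= 2:
            variable += 1
            # Informative: two or more states each seen at least twice.
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                informative += 1
    return variable, informative

# Toy alignment, for illustration only.
toy = ["ACGTA", "ACGTA", "ACCTT", "ATCTT"]
print(count_sites(toy))
```

In the toy alignment, the second column is variable but not informative because the alternative base occurs in a single sequence.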

Data availability
The assembled plastomes of Tanaecium are available in GenBank (NCBI) with the accession numbers OL782596, OP169019-OP169021, and OP218850-OP218861.