A comparative analysis of complete chloroplast genomes of seven Ocotea species (Lauraceae) confirms low sequence divergence within the Ocotea complex

The genus Ocotea (Lauraceae) includes about 450 species, of which about 90% are Neotropical, while the rest is from Macaronesia, Africa and Madagascar. In this study we present the first complete chloroplast genome sequences of seven Ocotea species, six Neotropical and one from Macaronesia. Genome sizes range from 152,630 (O. porosa) to 152,685 bp (O. aciphylla). All seven plastomes contain a total of 131 (114 unique) genes, among which 87 (80 unique) encode proteins. The order of genes (if present) is the same in all Lauraceae examined so far. Two hypervariable loci were found in the LSC region (psbA-trnH, ycf2), three in the SSC region (ycf1, ndhH, trnL(UAG)-ndhF). The pairwise cp genomic alignment between the taxa showed that the LSC and SSC regions are more variable compared to the IR regions. The protein coding regions comprise 25,503–25,520 codons in the Ocotea plastomes examined. The most frequent amino acids encoded in the plastomes were leucine, isoleucine, and serine. SSRs were found to be more frequent in the two dioecious Neotropical Ocotea species than in the four bisexual species and the gynodioecious species examined (87 vs. 75–84 SSRs). A preliminary phylogenetic analysis based on 69 complete plastomes of Lauraceae species shows the seven Ocotea species as sister group to Cinnamomum sensu lato. Sequence divergence among the Ocotea species appears to be much lower than among species of the most closely related, likewise species-rich genera Cinnamomum, Lindera and Litsea.

Determination of the most variable regions. The nucleotide diversity (Pi) values within 600 bp windows across the seven Ocotea plastomes vary from 0 to 0.015, with a mean value of 0.001 (Fig. 2a). Five variable loci with Pi ≥ 0.006 were found: two in the LSC region (psbA-trnH, Pi = 0.007; ycf2, Pi = 0.006) and three in the SSC region (ycf1, Pi = 0.008; ndhH, Pi = 0.008; trnL(UAG)-ndhF, Pi = 0.015). At the family level, sequence divergence was calculated using published chloroplast genomes of Alseodaphne, Cinnamomum, Laurus, Lindera, Litsea, Machilus, Neolitsea, Parasassafras, Persea, Phoebe, and Sassafras (see "Materials and methods" section). The sequence of Nectandra angustifolia (marked as "unverified" in GenBank) had to be excluded because it differs so strongly from those of all other Core Lauraceae that large parts of it could not be readily aligned.
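The sliding-window Pi values described above can be illustrated with a minimal Python sketch. It is not the DnaSP procedure itself: it assumes an already aligned set of equal-length sequences, skips every column that contains a gap or ambiguous base in any sequence (an assumption), and reports Pi as the mean number of pairwise differences per retained site. The function names `window_pi` and `sliding_pi` are hypothetical; the window (600 bp) and step (200 bp) defaults follow the values given in the methods.

```python
from itertools import combinations


def window_pi(seqs, start, end):
    """Mean pairwise differences per site in alignment columns [start, end).

    Assumption: columns with any character other than A, C, G, T are skipped.
    """
    pairs = list(combinations(range(len(seqs)), 2))
    valid_sites = 0
    differences = 0
    for col in range(start, end):
        column = [s[col].upper() for s in seqs]
        if any(base not in "ACGT" for base in column):
            continue
        valid_sites += 1
        differences += sum(column[i] != column[j] for i, j in pairs)
    if valid_sites == 0 or not pairs:
        return 0.0
    return differences / (len(pairs) * valid_sites)


def sliding_pi(seqs, window=600, step=200):
    """Return (window midpoint, Pi) tuples along the alignment."""
    length = len(seqs[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = min(start + window, length)
        results.append(((start + end) // 2, window_pi(seqs, start, end)))
    return results
```

Windows whose Pi exceeds a chosen cutoff (0.006 in the text above) would then be flagged as candidate hypervariable loci.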

Comparative analysis of plastomes. A comparison of the LSC, IR and SSC junction positions in the Ocotea plastomes is shown in Fig. 3. The ycf1 gene crosses the boundary between the IRb (1408 bp) and the SSC (4163 bp) regions. The ycf2 gene spans the boundary between the LSC (3852 bp) and the IRb (3162 bp) regions. Fragments (pseudogenes) of ycf1 (1408 bp) and ycf2 (3162 bp) are located in the IRa region. The distances between the ndhF gene and the ycf1 fragment and between the ycf2 fragment and the trnH gene are 21 bp and 27 bp, respectively. The pairwise cp genomic alignment between six Ocotea species and O. aciphylla as reference showed very high similarity in all sequences (Fig. 4). The LSC and SSC regions were more variable than the IR regions. The noncoding regions showed a relatively higher mutation rate than the protein-coding regions in the Ocotea plastomes examined.
[...] (Table S4). The GC content at coding positions is about 39.1% in the examined Ocotea plastomes. The GC contents at second and at third codon positions were also very similar (35.5–35.6% and 39.2%, respectively). All possible codon types are used for each amino acid. The most frequent amino acids encoded in the Ocotea plastomes are leucine (Leu; 11.76–11.83%), isoleucine (Ile; 8.05–8.11%), and serine (Ser; 7.93–8.01%) (Fig. 5). The amino acids arginine (Arg), glycine (Gly), lysine (Lys), phenylalanine (Phe), and valine (Val) account for 5.02–5.95% each. Least represented in the chloroplast genomes examined were cysteine (Cys; 1.81–1.87%) and tryptophan (Trp; 1.94–1.95%). The relative synonymous codon usage (RSCU) was greater than 1.0 in 31 codons (Supplementary Table S3). Of these preferred codons, 25 end in A/U and six in G/C. The frequency of different codons coding for the same amino acid was almost the same in all Ocotea species examined.
The Macaronesian Ocotea foetens presented slightly higher frequencies for arginine, cysteine, serine, histidine (His), and tyrosine (Tyr) in comparison with the Neotropical Ocotea species, whereas the contents of alanine (Ala), isoleucine and leucine were slightly lower.
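As an illustration of the RSCU statistic used above, the following Python sketch computes RSCU as the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used equally. It is a simplified stand-in, not the DnaSP implementation: it assumes in-frame coding sequences, uses the standard sense-codon assignments, ignores stop codons and codons containing ambiguous bases, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Standard codon table built in NCBI "TCAG" order; "*" marks stop codons.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO)
)


def count_codons(cds):
    """Count in-frame codons of one coding sequence (RNA 'U' is read as 'T')."""
    cds = cds.upper().replace("U", "T")
    counts = Counter()
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in CODON_TABLE:  # skips codons with ambiguous bases
            counts[codon] += 1
    return counts


def rscu(counts):
    """Observed / expected usage within each synonymous codon family."""
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid != "*":
            families[amino_acid].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent from the data
        expected = total / len(codons)
        for c in codons:
            values[c] = counts.get(c, 0) / expected
    return values


def preferred_codon_endings(rscu_values):
    """Number of codons with RSCU > 1 ending in A/U versus G/C."""
    preferred = [c for c, v in rscu_values.items() if v > 1.0]
    at_ending = sum(c[-1] in "AT" for c in preferred)
    return at_ending, len(preferred) - at_ending
```

Summing the `Counter` objects returned by `count_codons` over all protein-coding genes of a plastome before calling `rscu` would give genome-wide values comparable in kind to those reported in the text.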
A total of 87 protein-coding genes were identified in the Ocotea species examined. The counts of total protein-coding genes in Lauraceae in previous studies ranged from 73 genes in Cassytha species via 79 in Cinnamomum camphora to 86 genes in Caryodaphnopsis henryi Airy Shaw 16,25,30. Consistently, 85 protein-coding genes have been reported for the genera of the early divergent Cryptocaryeae (Beilschmiedia, Cryptocarya and Eusideroxylon), as far as they have been examined. Among the remaining Lauraceae, the most frequent count is 84 (ref. 30). Lower numbers have been reported for Cinnamomum micranthum and C. kanehirae (83) 29, Litsea glutinosa (83) 17, Cinnamomum camphora (79) 16 and two Cassytha species (73) 25. The differences among the counts of genes in the Lauraceae species, except the hemiparasitic Cassytha, may be due to different annotation of genes. In particular, the rpl22 gene has not been annotated in most of the earlier studies 17,19,23,24,26,27,29,30,33,39–41.
In our study and in Chen et al. 16, the codons coding for leucine and for cysteine were the most and the least frequent, respectively. In Ocotea, 11.76–11.83% of the codons code for leucine and only 1.81–1.87% for cysteine, compared with 10.87% and 1.25%, respectively, in Cinnamomum camphora. As in C. camphora, preferred codons in Ocotea end more frequently in A/U than in G/C (27 vs [...]
As expected, addition of the seven Ocotea species does not change the result of the phylogenetic analysis significantly compared to previous cp genome studies 28,33. The topology among the major clades, Cinnamomeae, Laureae and Perseeae, is the same in all studies. As expected, the seven Ocotea species form a monophyletic [...]
www.nature.com/scientificreports/
[...] porosa in our study. Ocotea daphnifolia, which is nested among the Praelicaria taxa in our result, is a member of the O. minarum group and as such a member of the Pluriocotea clade in the study by Penagos et al. 12. In their result, the Pluriocotea clade is the sister group to a clade consisting of the dioecious taxa (Diocotea, represented by O. guianensis and O. tabacifolia in our study), the O. helicterifolia group and the genera Nectandra, Pleurothyrium and Damburneya, which are not represented in our study. It remains to be checked whether these differences persist when further cp genomes become available. Entire plastomes have the potential to increase resolution and support values among the clades of the Ocotea complex. Our phylogeny is fully resolved, and not only the Ocotea complex receives 100% bootstrap support, but also four of the five nodes within it. One node is still only weakly supported, but that may change with denser taxon sampling. Sequence divergence among the seven Ocotea species is rather low compared to the most closely related, likewise species-rich genera Cinnamomum, Lindera and Litsea.
Even though we selected Ocotea species from widely divergent clades, there were only 168 parsimony-informative characters among them in the entire chloroplast genomes. If we arbitrarily select the first seven species of Cinnamomum, Lindera or Litsea from our data matrix, these numbers are 414, 423 or 410, respectively. This confirms the results of the tests of individual established chloroplast markers mentioned in the introduction, and may point to a rather recent diversification of the Ocotea complex, as was first suggested by Chanderbali et al. 10. However, a much larger number of sequences will be required for a molecular clock analysis of this group.

Plastome assembly and annotation. Analyses of genome sequence and genomic organization were performed using Geneious Prime 2021.0.3 (ref. 44). The generated contigs of Ocotea foetens were assembled de novo and annotated using the plastomes of Cinnamomum camphora (GenBank accession number MH050970) and Persea americana (NC_031189) for comparison. The contigs of the remaining taxa were assembled and annotated using the chloroplast genome of O. foetens as a reference. The contigs were inspected visually for any signs of erroneous assembly. In a few cases, doubtful regions were verified by Sanger sequencing (methods described earlier 2,3,9,11).

[...] Also, a few additional minor adjustments of the alignment were made manually during inspection of the sequences, mostly in regions of SSRs. DnaSP v6 (ref. 49) was used for calculating the nucleotide variability values (Pi) within the plastomes. The sliding window length was set to 600 bp, and the step size was set to 200 bp. Microsoft Excel (ref. 50) was used to plot the Pi values. These data were used to identify hypervariable regions among the seven Ocotea plastomes examined as well as among the sequences retrieved from the NCBI GenBank (Supplementary Tables S7, S8).

Codon usage and SSR analyses. The protein-coding genes of the Ocotea plastomes were extracted using the program Geneious Prime 2021.0.3 (ref. 44). The sequences were aligned using MAFFT v7 (ref. 48). Codon usage frequency, codon bias, and G + C content were calculated using the program DnaSP v6. The SSR motifs were scanned using MISA v2.1 (ref. 51). The minimum thresholds were set to 10 repetitions for mononucleotide SSRs, five repetitions for dinucleotide SSRs, four repetitions for trinucleotide SSRs and three repetitions for tetra-, penta- and hexanucleotide SSRs. The maximum length of interruption between two SSRs was chosen as 100 bp.
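The repeat thresholds listed above can be expressed as a small regular-expression scan. The Python sketch below is an illustration of the thresholds only, not a reimplementation of MISA: it finds perfect SSRs on one strand of an unambiguous sequence, rejects motifs that are themselves repeats of a shorter unit (so that a run of A is not also reported as an AA repeat), and does not merge neighbouring SSRs into compound SSRs across the 100 bp interruption limit. The names `MIN_REPEATS`, `is_primitive` and `find_ssrs` are hypothetical.

```python
import re

# Unit length -> minimum number of repeats, as given in the methods above.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeat count) tuples, 0-based, end-exclusive."""
    seq = seq.upper()
    hits = []
    for unit, min_rep in min_repeats.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        pos = 0
        while True:
            match = pattern.search(seq, pos)
            if match is None:
                break
            motif = match.group(1)
            if is_primitive(motif):
                repeats = (match.end() - match.start()) // unit
                hits.append((match.start(), match.end(), motif, repeats))
                pos = match.end()
            else:
                # Degenerate motif: retry one base later so that a genuine
                # SSR overlapping this stretch is not missed.
                pos = match.start() + 1
    return sorted(hits)
```

Counting the returned tuples per plastome would give SSR totals of the kind compared between species in the text, under the simplifying assumptions stated above.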

Phylogenetic analysis of Lauraceae plastomes. The data matrix that had been prepared for the determination of the most variable regions was analyzed by maximum likelihood (ML) in MEGA 10.2.5 (ref. 52), with the following parameters: nrep = 500, Tamura-Nei model, uniform rates among sites and Nearest-Neighbor-Interchange (NNI). The chloroplast genomes of the Perseeae (Alseodaphne spp., Machilus spp., Persea americana, and Phoebe spp.) were used as outgroup.
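The same kind of aligned data matrix also underlies the counts of parsimony-informative characters quoted in the discussion. As a minimal sketch, a column can be counted as parsimony-informative when at least two different nucleotides each occur in at least two sequences. How gaps and ambiguous bases were treated in the original counts is not stated; here they are simply ignored, which is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def is_parsimony_informative(column):
    """At least two nucleotide states, each present in two or more sequences.

    Assumption: characters other than A, C, G, T are ignored.
    """
    counts = Counter(base for base in column.upper() if base in "ACGT")
    return sum(1 for n in counts.values() if n >= 2) >= 2


def count_parsimony_informative(seqs):
    """Count informative columns in a list of equal-length aligned sequences."""
    return sum(is_parsimony_informative("".join(col)) for col in zip(*seqs))
```

Applying `count_parsimony_informative` to the rows of one genus at a time would allow a comparison of the kind made between Ocotea and the first seven species of Cinnamomum, Lindera or Litsea.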

Data availability
The complete cp genome sequences of the seven Ocotea species have been submitted to the NCBI GenBank.