Mitogenome selection in the evolution of key ecological strategies in the ancient hexapod class Collembola

A longstanding question in evolutionary biology is how natural selection and environmental pressures shape the mitochondrial genomic architectures of organisms. Mitochondria play a pivotal role in cellular respiration and aerobic metabolism, making their genomes functionally highly constrained. Evaluating selective pressures on mitochondrial genes can provide functional and ecological insights into the evolution of organisms. Collembola (springtails) are an ancient hexapod group that includes the oldest terrestrial arthropods in the fossil record, and that are closely associated with soil environments. Of interest is the diversity of habitat stratification preferences (life forms) exhibited by different species within the group. To understand whether signals of positive selection are linked to the evolution of life forms, we analysed 32 published Collembola mitogenomes in a phylomitogenomic framework. We found no evidence that signatures of selection are correlated with the evolution of novel life forms, but rather that mutations have accumulated as a function of time. Our results highlight the importance of nuclear-mitochondrial interactions in the evolution of collembolan life forms and that mitochondrial genomic data should be interpreted with caution, as complex selection signals may complicate evolutionary inferences.


Results
Dated phylogeny and ancestral state reconstruction. Our dated mitogenomic phylogeny demonstrates that Collembola first diverged sometime in the early Devonian (ca. 410 Ma), with most diversification taking place in the Triassic to Cretaceous periods (ca. 250-60 Ma) (Fig. S1).
To identify whether a link exists between signals of mitochondrial DNA selection and collembolan life form (see Dataset S1), we assessed phylogenetic relationships between 32 collembolan and three outgroup taxa (see Table S1). We assigned a life form trait (i.e. aquatic, epiedaphic, euedaphic, hemiedaphic, and myrmecophilous) to the ecology (i.e. habitat preference) of each springtail species included in this study (see Dataset S1 for more detail). We reconstructed the ancestral ecological life form state for the collembolan clade using three different methods (parsimony, maximum likelihood, and Bayesian inference), all of which identified hemiedaphic as the ancestral character (see Fig. 1). Across the collembolan phylogeny, five independent life form shifts occurred from the hemiedaphic to alternative life forms (see Fig. 1). For the terminal nodes, the most common ecological life form was hemiedaphic (22 taxa), followed by epiedaphic (five taxa), euedaphic (three taxa), myrmecophilous www.nature.com/scientificreports/ (one taxon), and aquatic (one taxon). To assess phylogenetic conservatism and homoplasy across the phylogeny, we calculated the retention index (RI) 60 for each life form trait, which was 0.75 for epiedaphic, 0.36 for hemiedaphic, 0.33 for euedaphic, and 0 for both myrmecophilous and aquatic.

Selection analysis.
To assess signals and direction of selection at different loci (i.e. positive vs purifying selection), we used the codon-based maximum likelihood (CodeML) algorithm to determine the ratio of nonsynonymous to synonymous mutations (ω = dN/dS), and the number of nonsynonymous mutations as a proportion of the total number of mutations (dN/(dN + dS)) [61][62][63][64] . Site model selection analyses indicated overall positive selection across the phylogeny, shown by significant likelihood ratio tests (LRT) for the comparative site models, however, with no nucleotide positions under positive selection (Bayes Empirical Bayes score (BEB) < 0.95 62 ) (Table S2). The comparison between the numbers of synonymous to nonsynonymous mutations across the terminal nodes of the phylogeny was significantly different (paired t test: t 415 = − 9.48, p < 0.05). The number of nonsynonymous mutations as a proportion of the total number of mutations revealed no strong evidence for a link between life form transitions and the number of nonsynonymous mutations. Departures from the expectation that synonymous mutations are more common than nonsynonymous mutations 65 2). The 95% confidence intervals of these regressions broadly overlap, and the two models were not significantly different (p > 0.05). This indicates that no significant difference was found in the proportion of nonsynonymous mutations in relation to branch length for taxa that retained the ancestral state and for those whose ecological life form shifted. The relationships (represented as coloured pie charts in Fig. 1) between the taxa that represent the life form transitions to alternative life forms and the respective sister taxon/taxa (ancestral state, i.e. hemiedaphic), again revealed that the longer branches have the greatest number of nonsynonymous mutations in comparison to the sister clades (see coloured pie charts A and D). No significant difference was found within pie charts B, C, and E, which show that the proportion of mutations for the branches that represent the alternative life form vs. ancestral state (51% for pie charts B and C, and 50% for pie chart E; paired t test: t 4 = 0.23, p > 0.05) (Fig. 1).
Evidence for purifying (ω < 1) or diversifying selection (ω > 1) was determined for each of the terminal branches, and a total of 19 taxa had mitogenomes with signatures of purifying selection, while the mitogenomes of 11 taxa showed signs of diversifying selection, and two taxa had mitogenomes that satisfied the assumptions of neutrality (see Table S3). The proportion of changes per gene and per taxon reveal that atp8 exhibits the highest number of changes for the majority of the taxa (Figs. S2, S3; Table S4). Following atp8, all genes from complex I (i.e. nad1-nad6) had a large proportion of nonsynonymous mutations. Of the 13 PCGs, eight genes were identified as experiencing purifying selection, while five genes were identified as experiencing diversifying selection, with atp8 being under particularly strong diversifying selection (ω > 3.0) (Table S4). At a complex level, complex V (ATPase) revealed the highest percentage of nonsynonymous mutations, followed by complex I (NDs), complex IV (COXs), and complex III (cytb) (Fig. S3), which follows the estimated mutation rate of these complexes (i.e. ATPase > ND > COX > CYTB) 66 . The average value of ω for all genes across all terminal nodes was 0.96, which indicates that, overall, Collembola mitogenomes are experiencing slight purifying selection or no selection (i.e. selective neutrality).
Radical changes in physicochemical properties of amino acids. The impact of positive selection on the physical and chemical properties of 31 amino acids was evaluated for each of the 13 PCGs by detecting observed changes inferred by the phylogenetic tree under neutral conditions. Both positive and negative significant z-scores on the physicochemical properties of amino acids were found across the tree. Most physicochemical properties indicated that the radical changes were under purifying selection, with occasional radical changes under positive selection (Fig. S4). We detected two physicochemical properties-equilibrium constant (ionisation of COOH) and solvent accessible reduction ratio-with significant radical changes (z > 3.09; p < 0.001), indicating positive/diversifying selection for all 13 PCGs, while a third physicochemical property, buriedness, is under positive/diversifying selection for 10 PCGs (except cytb, nad3, and nad4l), with atp8 having the highest z-score. www.nature.com/scientificreports/

Discussion
Using a comprehensive mitogenome dataset for springtails (32 species from all four orders), we determined the ancestral ecological life form state for Collembola. We then used this dataset to assess whether a correlation exists between signatures of mitogenome selection and the evolution of novel life forms. We show that the ancestral ecological life form state of Collembola is hemiedaphic, with epiedaphic life forms first emerging in the early Permian, roughly 290 Ma. Even though several previous studies have identified links between Collembolan mitogenomes and ecology, morphology, and physiology 50,53,67-71 , we found no strong evidence for a link between signatures of mitochondrial genome selection and life form evolution in Collembola. Our mitogenomic phylogeny is broadly consistent with those presented in earlier studies [37][38][39][72][73][74][75] . The first phylogenetic split within Collembola occurred during the early Devonian (ca. 410 Ma), with most diversification taking place in the Triassic to Cretaceous periods (ca. 250-60 Ma). It is not surprising, therefore, that Collembola fossils preserved in amber and dating back to the Eocene (50-40 Ma) can be assigned to extant taxa with high certainty 54 , demonstrating the ultraconservative nature of life forms and basic body plans of the group.
Ancestral state reconstructions showed the ancestral ecological life form of collembolans to be hemiedaphic. Our finding contrasts with the assumption of a euedaphic origin of springtails proposed by D'Haese 36,76 . Additionally, we found that the epiedaphic life form evolved at least twice, first during the origin of Symphypleona (Fig. S1), approximately 290 Ma, and then again in the mid-Cretaceous with the emergence of the evolutionary lineage represented by Salina celebensis, roughly 115 Ma. During the Permian period, roughly 300 Ma, seedproducing gymnosperms such as gnetophytes, Pinaceae, and cupressophytes diverged from cycads and conifers 77 . Symphypleona show a habitat preference for Mediterranean-type shrublands, pine plantations, mixed forests, bushes, grasslands, and leaf litter (see Supplementary Information Dataset S1). This suggests that the emergence of major plant lineages, such as gymnosperms, created new ecological niches for the epiedaphic Symphypleona to exploit and diversify. The extant Bourletiella arvalis and Sminthurus viridis, however, show a habitat preference for grasslands, although these habitats would have only become available much later in the Cenozoic [78][79][80] . It is, therefore, reasonable to assume that the evolutionary lineage represented by these species would have occupied gymnosperm ecological analogs until their current niches became available. Unsurprisingly, the morphological features of these taxa are characteristic of the epiedaphic life form 47,49,81 36,76 , who found that the evolution of terrestrial to aquatic habitats, in a physiological context, is indeed more plausible and that an aquatic life form can be viewed as a "rare secondary acquisition" 36,43,76 .
We deem the five life forms we have identified to be somewhat homoplastic (see Yu et al. 53 ) best illustrated by the epiedaphic and euedaphic forms that independently originated twice within the group. Additionally, the myrmecophilous and aquatic forms are homoplastic or autapomorphic, as these traits are unique to single taxa. There is also a great deal of ecological conservatism for the hemiedaphic trait where most extant taxa have retained the ancestral state.
While there appeared to be a general trend for taxa on longer phylogenetic branches to have a greater overall proportion of nonsynonymous mutations, we found no consistent relationship between life form shifts and the proportion of nonsynonymous mutations on these branches. Furthermore, even when comparing individual genes, no correlations were found between the proportion of nonsynonymous changes and the evolution of new life forms (Fig. S2). This may seem surprising given the pivotal role that mitochondria play in aerobic metabolism, and given that different life forms are constrained in different ways metabolically. It should be noted, however, that for a group as ancient as Collembola, the many adaptive nuclear changes that produced the conspicuous ecological changes upon which life forms are classified may have obscured mitochondrial signatures of selection related to the evolution of life forms. While our expectations pertaining to selection were premised on nonoverlapping habitat stratification preferences, it should also be noted that interactions between collembolans and their environments are complex, and apparently divergent life forms may impose similar functional constraints on the organism, thereby confounding signatures of selection 84 .
Even though mitochondrial protein coding genes are typically believed to have faster mutation rates than nuclear genes 16 , most of these mutations are probably synonymous 85,86 , and thus functionally inconsequential (but see Lawrie et al. 87 ). Our results support this idea, with more taxa and more genes under purifying selection than under diversifying selection (Tables S3 and S4). However, selection on the mitochondrial genome appears to be largely neutral due to the random fixation of selectively neutral mutations 21 . Given the crucial role of mitochondria in aerobic metabolism, functional requirements likely constrain the molecular evolution of mitogenomes, and this is perhaps why many studies have identified purifying selection [88][89][90] . It makes sense that strong functional constraints would mean that any deleterious mutation that arises would be rapidly removed from the population via purifying selection. Additionally, genetic drift plays a major role in structuring genetic diversity, and its effects on mitochondrial genomes are stronger than on nuclear genomes due to differences in effective population sizes between them and the uniparental inheritance and lack of recombination within the mitochondrial genome 16,17,21 .
When radical amino acid changes are favoured by selection, local directional shifts in the structure and function of proteins evolve. The physicochemical properties under positive selection for all PCGs were equilibrium constant (ionisation of COOH) and solvent accessible reduction ratio, while buriedness was under www.nature.com/scientificreports/ positive selection for 10 of the 13 PCGs. Positive selection for equilibrium constant is suggestive of adaptations to increased oxygen levels and metabolic requirements 66 , while positive selection for solvent accessible reduction ratio may lead to bulkier proteins, which increases the space for active site formation 66 . Buriedness increases protein stability, and thus optimal protein function, in the context of hydrophobicity, whereby some residues are buried on the interior of the globular protein 91,92 . These results illustrate the importance of these properties in the context of environmental conditions (oxygen levels and metabolic requirements). The mitochondrial genome has long been considered to be selectively neutral [22][23][24][25] , but a plethora of studies have illustrated that it may be under strong positive selection as a result of thermal and aerobic respiratory adaptations 11,13,14,[27][28][29][30][31][32] . Here, we investigated whether a link exists between mitochondrial genome selection and the evolution of ecological life forms in Collembola. The expected outcome of such a link would be evidence for strong signals of selection at branches representing the evolution of new life forms, even if mitochondrial genes themselves are not the primary drivers of ecophysiological adaptation. Our selection results reveal that, overall, positive selection is present across the collembolan phylogeny, best illustrated by the presence of positive selection on three amino acid physicochemical properties which likely play instrumental roles in adaptive responses to environmental conditions (oxygen levels and metabolic requirements). However, the only major trend identified here was that the proportion of nonsynonymous mutations increased with time, with longer branches (i.e. older taxa) having a greater proportion of nonsynonymous mutations than shorter branches. In addition to investigating selection associated with life form evolution, we assessed whether there was a link between signatures of selection across a latitudinal gradient from where each species can be found. This was done in two ways; first by selecting species across all latitudes but where their ranges overlapped, and secondly by selecting species that had no overlapping distributions. As was the case for life form shifts, we found no clear correlation between the number of nonsynonymous mutations and latitudinal gradients for either analysis.
The lack of a correlation between life form shifts and mitogenome selection may hint at nuclear selection, and suggests that nuclear-mitochondrial interactions are of greater importance in the evolution of life form shifts than selection pressures on the mitochondrial genome alone. Our findings raise some precautionary flags in that selection, where present, is more complex than generally assumed, given the complex interactions of the mitochondrial and nuclear genomes. Importantly, evidence for strong selection on the mitochondrial genome cannot simply be extrapolated to reflect adaptation to changing environmental conditions.

Phylomitogenomic analysis. Phylomitogenomic relationships between Collembola taxa based on avail-
able data on GenBank (May 2020) were reconstructed using the protein-coding genes (PCGs) of 32 partial or complete Collembola mitogenomes, as well as three outgroup taxa (see Supplementary Information Table S1). Each PCG was aligned using the MUSCLE algorithm in MEGA X 93 , followed by concatenating the alignments into a single dataset. Substitution saturation was assessed on each codon position using DAMBE v7.3.5 94 , which revealed high substitution saturation, likely due to highly divergent taxa [95][96][97] . As a result, we took a conservative approach and used the amino acid sequences of each PCG for our phylogenetic reconstructions. Each PCG was independently translated into amino acid sequences using the invertebrate mitochondrial genetic code, and were then aligned in MEGA X using the MUSCLE algorithm, followed by manual inspection. Bayesian phylogenetic inferences were executed in BEAST v2.6.3 98 by linking site and clock models as well as trees. This was done because of the largely uniparental inheritance of mitochondrial genomes and a lack of recombination [15][16][17]27 . A discrete trait was added to define the ecological life form of each species (see Supplementary Information Dataset S1 for more detail). Given the role of mitochondria in different environments, and rather than linking life forms to morphological characteristics, we linked the life forms (i.e. aquatic, myrmecophilous, euedaphic, hemiedaphic, and epiedaphic) to the ecology (i.e. habitat preferences) of each springtail species included in this study however, with the exception of three taxa; Onychiurus orientalis, Orthonychiurus folsomi, and Tullbergia bisetosa, which were assigned based on morphology. These species, although ecologically hemiedaphic 47,49,50 (see Dataset S1 for ecological life forms), differ from other hemiedaphic taxa by having all three major euedaphic morphological characteristics (no eyes, no pigment, and no furca), and are thus categorised as "euedaphomorphic" sensu lato 99 . As such, these taxa were assigned to the euedaphic life form which aligns to a peculiar ecology within litter (i.e. lower dispersal ability and associated traits) that differs from the other hemiedaphic taxa. The mitochondrial REV + Г model was selected, with an uncorrelated relaxed lognormal clock and a birth-death tree prior. Monophyly was constrained for several clades based on a consensus tree constructed using the NCBI taxonomy database in PhyloT v2 (https:// phylot. bioby te. de/), and visualised in iTOL 100 . These were (a) all taxa (to calibrate the root of the tree, as described below), (b) the Hexapoda clade (all taxa excluding the distantly related Daphnia pulex) to calibrate the split between Collembola and other basal hexapod taxa (Japyx solifugus and Trigoniophthalmus alternatus) and Collembola as described below, and (c) a clade that excluded all non-Collembola taxa. Monophyly was not constrained for each collembolan order (Poduromorpha, Entomobryomorpha, Neelipleona, and Symphypleona) since monophyly is not always supported for these groups [101][102][103] .
Considering that some collembolan fossils have been assigned to contemporary species despite being more than 20 Ma old, dating recent nodes using fossil data can be challenging for these taxa. We therefore used secondary calibration points from Leo et al. 104 and included calibration points that matched the two deepest fossil-calibrated nodes. To ensure that our calibration points matched the 95% highest probability density (HPD) confidence intervals of Leo et al. 104 for the same nodes, we followed their protocol in setting the priors for these nodes. Accordingly, we set the prior at the root of the tree to have a normal distribution and a mean of 510 Ma with a standard deviation of 7 Ma. The second prior was placed at the split between the higher insects (Japyx solifugus and Trigoniophthalmus alternatus) and Collembola, with a normal distribution and a mean of 485 Ma with a standard deviation of 6 Ma (for the rationale behind this prior assignment, see Leo et al. 104 ). www.nature.com/scientificreports/ A Bayesian consensus phylogenetic tree was calculated using BEAST submitted through the CIPRES Science Gateway 105 from three independent chains at different starting seeds for a chain length of 280 million generations, a thinning interval of 10,000, and 25% burn-in to ensure that stationarity was reached. Convergence was assessed using Tracer v1.7 106 , trees were summarised using TreeAnnotator v2.6.2 98 (20% burn-in), and the resulting topology was visualised in FigTree v1.4.3 (http:// tree. bio. ed. ac. uk/ softw are/ figtr ee/).

Ancestral state reconstruction.
A quantitative phyletic approach was used to assess the evolutionary relationships between taxa based on this biological information. Ancestral state reconstructions were performed using both parsimony and maximum likelihood methods in Mesquite v3.61 107 ; the discrete trait of each taxon (from Supplementary Information Dataset S1) was provided as an input for the analyses. The likelihood ratio test of the maximum likelihood method was used for the Markov k-state one-parameter (Mk1) and the asymmetrical Markov k-state 2 parameter (AsymmMk) models 107,108 , while the rate of character evolution was estimated under the Mk1 model. The Bayesian consensus phylogenetic tree obtained from the 28,000 trees (after the burn-in) sampled from BEAST was used for the ancestral state reconstruction. The retention index (RI) 60 was evaluated for each life form trait to assess phylogenetic conservatism and homoplasy across the phylogeny.
High RI values represent little to no homoplasy (typically above 0.85, see Yu et al. 53 ), while low RI values indicate homoplasy and low phylogenetic signal in a given dataset.
To account for uncertainties with phylogenetic mapping and the occurrence of topographical incongruence across sampled trees, a Bayesian MCMC method implemented in BayesTraits v3.0.2 109 , was used for ancestral state reconstructions. The AddTag and AddMRCA options were implemented to find the ancestral state associated with the node of collembolan clade. These options were used to overcome uncertainty when estimating ancestral states 109 . A reversible-jump (RJ) hyper-prior was implemented to reduce uncertainty and arbitrariness when selecting priors 109 . The hyper-prior draws up values from a uniform distribution between 0 and 30 which is used to seed the exponential distribution for the MCMC analyses. Three independent MCMC chains were run for 30 million iterations with a 25% burn-in. These analyses were repeated with a gamma reversible-jump hyper-prior (mean: 0, 30; variance 0, 30), but no appreciable difference was identified between the posterior distributions between these analyses when visualised in Tracer. MCMC convergence was assessed in Tracer by ensuring that the effective sample size (ESS) ≥ 200 for all estimated parameters. The proportion of each state was calculated from the mean values from approximately 20,000 trees randomly chosen by BayesTraits from the 28,000 trees sampled from BEAST. The states that were reconstructed to be equivocal (probability of 0.2 for each state) were discarded and the probability that the collembolan clade was in a given state was determined. The average of the estimated probability for the three independent runs was calculated, and any character state with a proportion of > 0.7 was considered to be supported 110 .
Selection analysis. The ratio of nonsynonymous, (dN; i.e. amino acid-altering) to synonymous (dS; i.e. silent) substitution rates, ω = dN/dS, is used to understand the evolutionary dynamics of genes [61][62][63][64] . We computed ω using the codon-based maximum likelihood (CodeML) algorithm in EasyCodeML v1.21 62,111 . To evaluate the influence of selection on the PCG sequence alignments, we implemented the preset (nested models) running mode with the site model within the framework of the topology of the phylogeny retrieved by the phylogenetic analyses above 112 . The codon substitution models described by Geo et al. 62 , and the significance of their respective dN/dS ratios, were evaluated using the likelihood ratio test (LRT). To identify whether a link between signatures of mitochondrial selection and a shift in ecological life form exists, we determined the number of nonsynonymous mutations per gene as a proportion of the total mutations (i.e. dN/(dN + dS)) for all terminal branches, and then averaged these proportions per branch. Moreover, we compared the proportion of nonsynonymous mutations for each life form shift (aquatic, myrmecophilous, euedaphic, and epiedaphic) vs. the respective sister branches/taxa that represented the ancestral state (hemiedaphic) to ascertain whether a link between the number nonsynonymous mutations to the ecological life form transition exists. Where multiple branches were present (clades), we accounted for the number of taxa by averaging the proportion of mutations for each terminal branch, and then followed the path from the terminal branch back until reaching the node where life form divergence (i.e. nearest common ancestor) took place. Statistical analyses were performed using a paired t test to compare the proportion of synonymous to nonsynonymous mutations across the phylogeny as well as where each branch is paired with its sister branch that represents a life form shift. Plots of regression analyses were performed using the R package ggplot2 113 to assess the correlation between the proportions of nonsynonymous changes vs. branch length for the taxa that retained the ancestral state and for those who represent a life form shift (i.e. the alternative state). Univariate general linear models of the regressions were performed in SPSS 27 (Inc., Chicago, IL, USA). Furthermore, and following the same method as described above, we assessed whether there was a link between selection and species that originated from different latitudinal gradients. This was done twice, first by identifying species across latitudinal gradients where their ranges overlapped, and secondly by selecting species that had no overlapping distributions. A phylogenetic tree was reconstructed for both analyses, and selection analyses were conducted, and the number of nonsynonymous mutations were determined for each terminal branch/species. Additionally, we sought to determine the evolutionary dynamics in the context of purifying or diversifying selection for each terminal branch based on the count of nonsynonymous and synonymous mutations (i.e. ω) for each gene and complex. To represent these results to a gene or complex level (in the form of a percentage), the number of changes per gene/complex were standardised based on gene size (i.e. total number of nonsynonymous changes/(gene size × total number of branches) × 100). www.nature.com/scientificreports/ Radical changes in physicochemical properties of amino acids. The impact of positive selection on physical and chemical properties of 31 amino acids was evaluated for each of the 13 PCGs using TreeSAAP v3.2 (Selection on Amino Acid Properties) 114 by detecting observed changes inferred by the phylogenetic tree under neutral conditions. The impact of 20 physicochemical properties (excluding 11 amino acids with < 85% accuracy following McClellan and Ellison 115 ) were evaluated. Under the assumption of neutrality, a significant positive z-score as calculated by TreeSAAP (z-score > 3.09, p < 0.001) for a physicochemical property is indicative of the prevalence of nonsynonymous substitutions, while a significant negative z-score (z-score < − 3.09, p < 0.001) indicates the prevalence of synonymous substitutions. A sliding window of 15 codons was utilized, based on a categorical scale of 1 (most conservative) to 8 (most radical). Only the most radical changes (category scores of 7 and 8, p ≤ 0.001) were considered, and for the physicochemical properties that were significant for both category 7 and 8, only the z-scores that were significant for category 8 were included.

Data availability
The authors confirm that all relevant data used for this manuscript are available online from the GenBank NCBI online repository (https:// www. ncbi. nlm. nih. gov/ genba nk/). All sequence accession numbers are list in the Supplementary Information of this manuscript (Table S1).