Global determinants of insect mitochondrial genetic diversity

Understanding global patterns of genetic diversity is essential for describing, monitoring, and preserving life on Earth. To date, efforts to map macrogenetic patterns have been restricted to vertebrates, which comprise only a small fraction of Earth’s biodiversity. Here, we construct a global map of predicted insect mitochondrial genetic diversity from cytochrome c oxidase subunit 1 sequences, derived from open data. We calculate the mitochondrial genetic diversity mean and genetic diversity evenness of insect assemblages across the globe, identify their environmental correlates, and make predictions of mitochondrial genetic diversity levels in unsampled areas based on environmental data. Using a large single-locus genetic dataset of over 2 million globally distributed and georeferenced mtDNA sequences, we find that mitochondrial genetic diversity evenness follows a quadratic latitudinal gradient peaking in the subtropics. Both mitochondrial genetic diversity mean and evenness positively correlate with seasonally hot temperatures, as well as climate stability since the last glacial maximum. Our models explain 27.9% and 24.0% of the observed variation in mitochondrial genetic diversity mean and evenness in insects, respectively, making an important step towards understanding global biodiversity patterns in the most diverse animal taxon.

Global maps of the top environmental predictors of GDM and GDE.MTWM is the maximum temperature of the warmest month bioclimatic variable (°C * 10) (a), Temp.Trend describes centennial trends in temperature variation since the last glacial maximum (LGM) (b), PWM is the precipitation of the wettest month bioclimatic variable (mm) (c), Precip.Trend describes centennial trends in precipitation variation since the LGM (d), Temp.Range is the mean diurnal air temperature range bioclimatic variable (ºC * 10) (e), Precip.Seasonality is the precipitation seasonality bioclimatic variable (kg m -2 ) (f), Precip.Variation is the standard deviation around centennial trends in precipitation variation since the LGM (g), and PDM is the precipitation of the driest month bioclimatic variation (kg m -2 ) (h).Precip.Seasonality, PWM, PDM, Temp.
Trend, Precip.Variation, MTWM, Temp.Range, and Precip.Trend were the eight predictors in a Bayesian GLMM model of GDM, while Temp.Trend, PWM, and MTWM were the three predictors in a model of GDM.
The residuals of genetic diversity mean (GDM) (a) and genetic diversity evenness (GDE) (b) for the reported dataset with a minimum operational taxonomic unit (OTU) threshold of 100.The residuals had no spatial autocorrelation (see Table 2).
Percentages indicate the percentage of the prior distribution that the posterior distribution overlaps with.
Maps showing the global distribution of model uncertainty in genetic diversity mean (GDM) (a, b, c) and genetic diversity evenness (GDE) (d, e, f).The upper 95% highest density interval (HDI) (a, d) and lower 95% HDI (b, e) are plotted.The upper 95% HDI and lower 95% HDI broadly reflect similar patterns to the reported median value (see Fig. 2).We masked in gray areas with environments non-analogous to the environments used for modeling.Multivariate environmental similarity surfaces, which indicate environmental space that is analogous and non-analogous to that used to train the spatial Bayesian generalized linear mixed models of genetic diversity mean (GDM) (a) and genetic diversity evenness (GDE) (b).We masked areas with non-analogous climate in our final projections to limit model extrapolation and uncertainty.
The proportion of operational taxonomic units (OTUs) belonging to each insect order from the observed data set.Three orders contain over 10% of sampled OTUs each, while 20 orders contain fewer than 5% of sampled OTUs in total ("Other") (see Supplementary Table 3 for full sampling information).Supplementary Fig. 11 The distribution of the number of grid cells each operational taxonomic unit (OTU) occupies.
The tick marks along the x-axis indicate the presence of OTUs that occupy 25 or more cells, but are not visible on the histogram.Over 74 percent of OTUs occupy a single grid cell and over 99 percent occupy fewer than 12 grid cells.
The per-cell distributions of genetic diversity mean (GDM) (a) and genetic diversity evenness (GDE) (b) with the most-sampled orders removed compared to the full data set.The boxplot center represents the median of the data, while the lower and upper hinges correspond to the first and third quartiles.The whiskers extend to the largest value no further than 1.5 times the inter-quartile range from the hinge.Three orders included over 10% of sampled operational taxonomic units (OTUs) -Diptera (34.0%),Lepidoptera (32.4%), and Hymenoptera (17.3%) (see Supplementary Table 3 for full sampling information).The two datasets with the top three  The ELPD is the expected log (pointwise) predictive density, where a higher ELPD indicates more support for the model.The predictor column is additive, where the model with the predictor listed contains all of the predictors listed before.In addition to ELPD, we report the difference in ELPD (ELPD-DIF), which is the difference in ELPD from the reference model.The SEELPD-DIF is the standard error of the difference in ELPD from the reference model.SEELPD is a coarse summary of model support variation that is sensitive to sampling quality, and we take the advice of Vehtari et al. 7 to use ELPD-DIF and SEELPD-DIF when comparing models.
We chose the simplest model whose ELPD overlapped with the reference model's ELPD (ELPD-DIF + SEELPD-DIF overlaps zero) and corroborated our selection with the heuristic method of selection employed by the projpred R package function suggest_size().The chosen model is indicated with gray shading.The ELPD is the expected log (pointwise) predictive density, where a higher ELPD indicates more support for the model.The predictor column is additive, where the model with the predictor listed contains all of the predictors listed before.In addition to ELPD, we report the difference in ELPD (ELPD-DIF), which is the difference in ELPD from the reference model.The SEELPD-DIF is the standard error of the difference in ELPD from the reference model.SEELPD is a coarse summary of model support variation that is sensitive to sampling quality, and we take the advice of Vehtari et al. 7 to use ELPD-DIF and SEELPD-DIF when comparing models.
We chose the simplest model whose ELPD overlapped with the reference model's ELPD (ELPD-DIF + SEELPD-DIF overlaps zero) and corroborated our selection with the heuristic method of selection employed by the projpred R package function suggest_size().The chosen model is indicated with gray shading.

Effect of duplicate alleles on genetic diversity
We used coalescent simulations in msprime 8 to explore the potential bias in calculation of nucleotide diversity (hereafter pi; 9 ) where duplicate alleles may or may not be present in the data.The goal of this simulation study was to replicate data which may be obtained from online databases such as BOLD, where duplicate haplotypes within a focal species may not be represented by duplicate sequence records within the database, potentially biasing estimates of pi.For all simulations we set the sequence length to 500bp and the mutation rate to 1e-8/bp/generation, to approximate the features of a typical arthropod COI barcoding dataset.We explored combinations of three different sample sizes (ss = 5, 10, 50) to replicate shallow, moderate, and deep sampling of individuals, and a range of effective population sizes (Ne) for 20 values of Ne equally spaced between 1e4 and 1e6, thus capturing a wide range of pi values on the same order as those reported in the literature for arthropods 10 .We performed 1000 simulation replicates for each combination of sample size and Ne, calculated nucleotide diversity for all samples (pi) and for only those samples bearing unique haplotypes (pi*), and recorded the value of the difference between (pi* -pi) and the ratio between (pi*/pi) these quantities averaged across all simulations.
Our coalescent simulation experiments showed a small but consistent upward bias in pi* with respect to pi (Supplementary pi* Fig. ).The average absolute difference between pi* and pi increased with increasing Ne values uniformly for all sample sizes, while larger sample sizes showed consistently greater difference in pi*-pi across the spectrum of Ne values examined.In contrast, the relative difference between pi* and pi decreased with increasing Ne, with pi*/pi approaching 1 as Ne approached 1e6 for all sample sizes.We draw three main conclusions from these results: 1) The direction of the bias is consistent (pi* > pi) and predictable across a range of Ne values; 2) The bias is small for large Ne (> 1e6) which is typical for arthropods; 3) For small Ne (< 1e6), small sample sizes reduce both the absolute and relative difference between pi* and pi.Taken together, and given the sample sizes and presumed Ne values of species from our empirical datasets, we conclude that the bias introduced by sampling only unique haplotypes should in most cases be small and predictable, and that, therefore, it should not unduly influence either the results or our main conclusions.

Modeling methods
We used projective prediction feature selection (see Methods) for variable selection 11 .
This method projects the posterior information in the full model to smaller submodels by replacing the posterior distribution of the full model with a simpler distribution that is restricted by constraining relevant model parameters.In this case, the regression coefficients decline to zero.The predictive accuracy of these models was evaluated with an approximate leave-one-out (LOO) cross-validation technique, which avoids repeatedly fitting the full model by using an importance weighting scheme outlined in 7,11 .Under this framework, we used the expected log predictive density (ELPD), specifically the difference in ELPD from the reference model, as the utility function instead of other utility functions (e.g., mean squared error) because it measures how well the predictive uncertainties are calibrated, in addition to measuring point predictions.
While incorporating spatial autocorrelation into this variable selection step would be ideal, the full 11 variable set contains variables that do not necessarily vary linearly with the response and necessitates an exceedingly complex spatial covariance matrix, which results in the model failing to converge and displaying poor model diagnostics.Therefore, we opted for the simpler nonspatial approach to reduce the parameter space included in the final model.
Rather than estimating random effects at each location, which can be computationally intensive, a spatial field of correlated random effects at a subset of locations or "knots" was modeled.The choice of the number of knots is somewhat subjective, so we fit models using 5, 10, 15, 20, 25, and 30 knots.Models with 20 knots had the lowest residual SAC for GDE, while 30 knots had the lowest residual SAC for GDM, so we chose 20 knots for the GDE models and 30 knots for the GDM models.We used regularizing priors on all slope parameters (N(0, 0.1)) and sigma (N(0, 1)), as well as on the Gaussian process θ parameter (N(0, 5)), which controls how steeply the correlation between knots declines, and the Gaussian process sigma parameter (N(0, 1)), which controls the amplitude of spatial deviations.N() implies a normal distribution, where the first number in the parentheses is the mean and the second is the standard deviation.

Modeling output
Model output objects for genetic diversity mean (GDM) and genetic diversity evenness (GDE) at all minimum OTU thresholds (10,25,50,100,150,200) orders removed contained N = 82 cells.The taxonomic distribution of operational taxonomic unit (OTU) sampling varied geographically.Of the three most sampled orders, Diptera (a) dominated sampling towards the poles and Lepidoptera (b) dominated sampling in the tropics and in some temperate localities at middle latitudes.Hymenoptera (c) typically consisted of fewer than 50% of OTUs sampled per grid cell, although it dominated sampling in Madagascar.Proportions were derived from the number of OTUs sampled per cell belonging to each order.The change in the number of grid cells occupied per operational taxonomic unit (OTU) with latitude.The OTUs are binned, showing the vast majority of OTUs in the data set occupy few grid cells.Sampling variability in estimates of genetic diversity mean (GDM) (a) and genetic diversity evenness (GDE) (b).We estimated potential sampling variability by resampling the ten assemblages with the largest number of operational taxonomic units (OTUs) (2,748 to 13,300 OTUs per grid cell) 1000 times without replacement, with 100 OTUs per sample.The vertical black lines indicate the genetic diversity value calculated with the full sample size.